This chapter describes how speaker duration and intonation models may be created from labeled databases. These techniques are still very experimental and only appropriate to some databases--discretion is advised in their use.
Hopefully the following is not just a step-by-step instruction for training models, but also gives some insight into the sort of investigations that are possible with the CHATR system.
There are many types of information pertinent to the generation of prosody in speech synthesis. As anyone who has tried to build models from data will be deeply aware, getting the appropriate information from a database in the right format is a time-consuming and error-prone task. To combat that, the following models extract their information from a common structure which is built from information in a speech database. Building that structure is still non-trivial, but once built, all systems may access it in a well-defined, uniform manner, thus reducing effort and errors.
Currently we have built PhonoForm utterances for the databases f2b and MHT in `chatr/pf/'.
The object is to create a CHATR utterance which contains all the information that you wish to use in generating training and test data for building models. The PhonoForm utterance type is designed for that: it allows explicit specification of this information.
A PhonoForm structure may be created for each utterance in the database from a set of XWAVES-labeled files of the form described below. Some of these files already exist (made during the speech synthesis database creation process), and others can be trivially produced from that information. Some, however, require significant effort to obtain, e.g. hand labeling. Those required are
phoneme labels

pitch
These may be generated with db_utils/make_pf_pitch.

power
These may be generated with db_utils/make_pf_power.

syllables
The stress marking in this file is placed in the PhonoForm utterance in the same field as lexical stress, hence the two values should at least try to match. Although there are algorithms that can syllabify phonemes from any source, it is much more useful if this syllabification matches that produced by the lexicon that is to be used in text-to-speech. It is up to the user to provide this file. In Japanese this is easier: a phoneme-label to syllable (actually mora) script is provided in make_pf_jsyl. `Stressed' and `unstressed' in Japanese should be lexical accents (i.e. what we mark with a single quote). Getting that information into the syllable file is the responsibility of the user.

tones
The tone labels are copied into the PhonoForm structure. Any fine marking of position within a syllable (i.e. the `<' in JToBI) or fine positioning (i.e. the `HiF0' in ToBI) is currently ignored. Another problem is the accuracy of the labeling, especially of ending or starting tones. Ending tone positions sometimes fall within a syllable boundary (i.e. on the last phoneme in the syllable) and sometimes just past it; a similar, but reversed, situation occurs with starting tones. An attempt to compensate for this error is made in the final construction, but it cannot be guaranteed to work in all cases.

breaks

words
Assuming these files exist in the `others/' subdirectory of the database directory, they may be collected into a single file per fileid. This file is like an XWAVES-labeled file, but has no header, and its `color' field is set to a type (word, phone, tones etc.). The file is called `FILEID.labs'. At this time small adjustments are made to some labels to try to ensure they appear on the correct side of boundaries. It may be necessary to change these `fiddle factors' to get the right results.
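The collation step can be pictured with the following sketch (the real work is done by the CHATR db_utils scripts; the file naming, the `TIME COLOR LABEL' line format, and the header terminated by `#' are assumptions here, not the distribution's actual conventions):

```python
def collate_labels(fileid, label_types, others_dir="others"):
    """Merge per-type XWAVES label files into one FILEID.labs file,
    setting the color field to the label type (word, phone, tones etc.)."""
    entries = []
    for ltype in label_types:
        with open(f"{others_dir}/{fileid}.{ltype}") as f:
            in_header = True
            for line in f:
                if in_header:                   # skip up to and including "#"
                    if line.strip() == "#":
                        in_header = False
                    continue
                fields = line.split()
                if len(fields) >= 3:            # TIME COLOR LABEL
                    entries.append((float(fields[0]), ltype, fields[2]))
    entries.sort()                              # time order across all types
    with open(f"{fileid}.labs", "w") as out:    # no header in the .labs file
        for time, ltype, label in entries:
            out.write(f"{time:.4f} {ltype} {label}\n")
```

Note the boundary `fiddle factors' mentioned above are deliberately omitted; this only shows the merge itself.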
Once created, the `.labs' file may be converted into a CHATR utterance input form (bracketed structure). The following script attempts to do this automatically
db_utils/make_pfs
The result should be a set of files in `chatr/pf/' that describe an utterance in the database, one for each fileid.
The simplest way to use PhonoForm utterances is as follows
(test_pf FILEID)
Like the other test functions (e.g. test_seg), this excludes the named file from the database and then synthesizes it from the remaining database units. In this form the result should be the same as a simple test_seg, as the segments, durations, power and pitch are the same.
By default a PhonoForm utterance is simply loaded and then the waveform synthesizer is called. A more interesting way of using this utterance is to load it, then call your own module, then call the waveform synthesis routine. This way you find out how well (or badly) your module affects the synthesis. The easiest way to do this is to define your own synthesis routine. Suppose we wish to check a new duration module, and hence to call just duration in the context of natural phrasing, accents, pitch, and segments. We could define
(define dur_synth (utt)
   (Input utt)       ;; do all the load of the utterance
   (Duration utt)    ;; just run my new duration module
   (Synthesis utt))  ;; generate a new waveform
Now we can test our duration module with everything else natural.
(dur_synth (lpf FILEID))
The function lpf simply loads the PhonoForm utterance.
NOTE the above example is a little simplistic, as the pitch target points may be adversely affected by the fact that the durations have changed. You may want to call Int_target to regenerate the intonation pitch targets. Also, lpf does not exclude FILEID from the search; see `lib/data/udb.ch' for the definition of test_pf if you wish to do so. Note that it may be better to modify the original waveform if you are testing duration prediction. If you use PSOLA on just the original waveform, then you hear only the distortion introduced by the duration module, rather than the distortion introduced by the duration module plus that introduced by the unit selection.
Another use of the PhonoForm utterance structure is to extract data for building prediction models. The function Feats_Out will dump a set of features for a given utterance. For example, suppose for each syllable we wish to know its mid-pitch, plus features we believe can be used to predict that value: namely, lexical stress, ToBI accent, the boundary after the syllable, and the number of syllables in from the start of the phrase and to the end. For a PhonoForm utterance we can simply do
(test_pf FILEID)
(Feats_Out utt_pf 'Syl
   '(syl_f0 stress tobi_accent bi syl_in syl_out)
   (strcat "feats/" FILEID ".sylinfo"))
This will dump the information to a file called `feats/FILEID.sylinfo', one line per vector. Of course, we really want to do this for the whole database, so let us assume that the Lisp variable files is set to a list of all fileids in the database; then the following would achieve that
(Parameter Synth_Method NONE)
(set required_feats '(syl_f0 stress tobi_accent ...))
(define get_feats (name)
   (print name)   ;; so we can see the progress
   (test_pf name)
   (Feats_Out utt_pf 'Syl required_feats
      (strcat "feats/" name ".sylinfo")))
(mapc get_feats files)
The first line means that no waveform synthesis occurs, making this dump substantially faster. The following section gives a larger example of using this technique to collect information from a database for building models.
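Once dumped, the feature files are plain text, one whitespace-separated vector per line, so they are easy to post-process outside CHATR. A minimal sketch (the `feats/*.sylinfo' naming follows the example above; the helper itself is illustrative):

```python
import glob

def load_feats(pattern="feats/*.sylinfo"):
    """Collect all dumped feature vectors into one list of rows
    (each row is a list of string-valued fields)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:            # skip blank lines
                    rows.append(fields)
    return rows
```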
Available features are defined in `src/chatr/feats.c', or may be listed in Lisp by calling Feats_Out with no arguments.
This section describes how to extract information from a database (using PhonoForm utterances) and build a linear regression model for predicting duration. A duration model like this, built from speaker f2b of the BU FM Radio corpus, is already included in the CHATR distribution as `lib/data/f2b_lrdur.ch'. The following shows how to train one for a new speaker.
The example Lisp code discussed here is given in `lib/examples/train_lrdur.ch'.
The first stage is to extract data from the database in the form of features. We are going to build two models, one for vowels and one for consonants. As these two models use different features, we have to create two sets of feature files. The two lists of features are defined in the variables C_durfields for consonants and V_durfields for vowels. (Available features are defined in `src/chatr/feats.c'.)
After the loading and setup of the database, the list of all fileids in the database is set in the variable files. We will deal with all files here and split them into training and testing data later. Next comes the definition of the dumping functions. Because we cannot (or at least not very easily) dump the features separately for vowels and consonants, we dump them all and split them later. At the same time we dump the individual phonemes to a separate file. The dump function get_feats_utt works on a single fileid; calling the function dumpfeats will map it over the whole database.
(dumpfeats)
This function will dump three files for each fileid in the database, each in the `feats/' subdirectory. These files contain the vowel features (defined by V_durfields), the consonant features (defined by C_durfields), and the phonemes.
The next task is to collate the fields into training and test data files for the linear regression software. An example shell script to do this is given in /lib/data/examples/make_lrdat.
The script first collates all the consonant data together and removes all vectors that are actually for pauses, breaths and vowels. It then splits the feature sets into train and test sets, simply adding parentheses around the vectors. It then does the same for the vowel data.
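The consonant half of that collation can be sketched as follows. This is not the distributed make_lrdat script: for simplicity we assume the phoneme name has been merged in as the first field of each vector (the real setup keeps phonemes in a separate file), and the phoneme symbols, `.condur' suffix and 9-to-1 split are invented for illustration.

```python
import glob

# Symbols to exclude from the consonant data -- an invented set; the real
# script knows the database's actual phoneme inventory.
NON_CONSONANTS = {"pau", "brth", "a", "e", "i", "o", "u"}

def collate_consonants(stem="datC01", featdir="feats"):
    """Collect consonant vectors, drop pauses/breaths/vowels, and write
    parenthesized train and test files for the LR software."""
    vectors = []
    for path in sorted(glob.glob(f"{featdir}/*.condur")):   # assumed suffix
        for line in open(path):
            fields = line.split()
            if fields and fields[0] not in NON_CONSONANTS:
                vectors.append(fields)
    with open(stem + ".train", "w") as tr, open(stem + ".test", "w") as te:
        for i, v in enumerate(vectors):
            out = te if i % 10 == 9 else tr         # every 10th vector to test
            out.write("( " + " ".join(v) + " )\n")  # parentheses round vectors
    return len(vectors)
```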
The third stage is to build the linear regression models from the feature vectors. This can be done (from the `dur/' directory) in CHATR with `train_lrdur.ch' loaded. The function dolrall takes a file name and a list of features, builds a linear regression model, then tests that model on the training and test data. Two calls are necessary, one for the consonants and a second for the vowels.
(dolrall "datC01" C_durfields)
(dolrall "datV01" V_durfields)
Note that the mean errors are in z-scores, so they are not immediately recognizable as durations.
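To read a z-score error as an absolute duration you need the per-phone mean and standard deviation used to normalize the data. A sketch (the statistics below are invented for illustration):

```python
# phone -> (mean duration in seconds, standard deviation); invented values
phone_stats = {"t": (0.065, 0.020), "aa": (0.110, 0.035)}

def z_to_seconds(phone, z):
    """Map a z-score back to an absolute duration for the given phone."""
    mean, sd = phone_stats[phone]
    return mean + z * sd
```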
You should look at the files `datC01.info' and `datV01.info', which contain the detailed results of the linear regression. Interesting points are of course the correlation, the stepwise model showing the relative contribution of each feature, and the dropped section, which shows features that make no contribution; often this is because they are completely predictable from some other feature (or set of features).
Once you are happy with the prediction capability, you can build an actual duration model from this data and test it in the system. Call the function save_lrmodel with arguments of consonant model name, vowel model name and output file name:
(save_lrmodel "datC01" "datV01" "lrmodel01.ch")
Normally you will want to edit the created file `lrmodel01.ch' to add at least some comments about where this model came from. Also, by default this sets the variable dur_lr_model. It is more likely that some other variable should be set, and that dur_lr_model should only be set when this model is actually selected. See `db_utils/DBNAME_synth.ch' for the typical use of dur_lr_model.
Now that we have built a duration model we can actually use it to predict durations. Again we wish to run through the whole database, load in the `pf' utterances, save the actual durations, then run our new duration module on each utterance and save the predicted durations. We can do this with the function test_new_dur_model defined in `train_lrdur.ch'. From the main database directory call
(test_new_dur_model "dur/lrmodel01.ch")
Now in `feats/' for each fileid there are files `.accdur' and `.preddur'. It is left as an exercise to the reader to use these files to find the mean error and correlation for the overall model (vowels and consonants).
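One possible solution to that exercise, sketched outside CHATR (assuming each `.accdur'/`.preddur' pair holds aligned, whitespace-separated durations):

```python
import glob, math

def dur_model_score(featdir="feats"):
    """Mean absolute error and Pearson correlation between the actual
    (.accdur) and predicted (.preddur) durations over the database."""
    actual, predicted = [], []
    for path in sorted(glob.glob(f"{featdir}/*.accdur")):
        actual += [float(x) for x in open(path).read().split()]
        predicted += [float(x) for x in
                      open(path.replace(".accdur", ".preddur")).read().split()]
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)
    return mae, corr
```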
Here we give an example of building an F0 prediction model using linear regression with ToBI labels as input parameters. First, it is assumed that the database from which the model is to be built is labeled with ToBI (or similar) labels, and that a set of PhonoForm utterances has been created. See section PhonoForm Utterance Types for more information.
We can build a new F0 prediction model in a similar way to how a new duration model is built. See section Training a New Duration Model. Again, you need to decide on the features. In this case three models are required; however, as they are for syllables rather than phonemes, the amount of data is much smaller than in the duration case.
The reduction model used for speaker f2b and some other English speakers was trained from the f2b database. Again the PhonoForm utterances were used. The feature pf_reduced was used as the value to be predicted. This value is calculated for each syllable. The word the syllable is in is looked up in the lexicon, and then an attempt is made to align the syllables of the lexicon version with the actual version. If they line up, a check is made to see if the syllable's vowel is the same as the vowel in the corresponding syllable of the lexical entry. If different, a list of schwa pairs is checked, and if the actual vowel is listed as a schwa version of the lexical vowel, the syllable is marked as reduced.
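The per-syllable decision described above can be sketched as follows (the schwa pair table and phone names are invented for illustration; the real check also depends on the lexicon/actual syllable alignment succeeding):

```python
# (reduced vowel, full lexical vowel) pairs -- an invented table
SCHWA_PAIRS = {("ax", "ae"), ("ax", "ih"), ("ax", "eh")}

def is_reduced(actual_vowel, lexical_vowel):
    """True if the spoken vowel is a schwa version of the lexicon's vowel."""
    if actual_vowel == lexical_vowel:
        return False                       # vowels match: not reduced
    return (actual_vowel, lexical_vowel) in SCHWA_PAIRS
```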
A set of vectors was collected including this reduced value plus other pertinent features. From that information a CART decision tree was built to try to predict reduction. The result is the small tree now in `lib/data/reduce.ch'. It seems relevant and sounds reasonable.
This technique effectively trains to a particular lexicon, as different lexicons often make different decisions about the amount of vowel reduction in their lexical entries. This particular example was trained using the CMU lexicon; when the same reduction tree is used with a speaker using the BEEP lexicon, the results are not as good.