This chapter describes how speaker duration and intonation models may be created from labeled databases. These techniques are still very experimental and only appropriate to some databases--discretion is advised in their use.
Hopefully the following is not just a step-by-step instruction for training models, but also gives some insight into the sort of investigations that are possible with the CHATR system.
There are many types of information pertinent to the generation of prosody in speech synthesis. As anyone who has tried to build models from data will be deeply aware, getting the appropriate information from a database in the right format is a time-consuming and error-prone task. To combat that, the following models extract their information from a common structure which is built from information in a speech database. Building that structure is still non-trivial, but once built, all systems may access it in a well-defined, uniform manner, thus reducing effort and errors.
Currently we have built PhonoForm utterances for the databases f2b and MHT in `chatr/pf/'.
The object is to create a CHATR utterance which contains all the information that you wish to use in generating training and test data for building models. The PhonoForm utterance type is designed for that: it allows explicit specification of this information.
A PhonoForm structure may be created for each utterance in the database from a set of XWAVES-labeled files of the form described below. Some of these files already exist (made during the speech synthesis database creation process), and others can be trivially produced from that information. Some, however, require significant effort to obtain, e.g. hand labeling. Those required are
phoneme labels

pitch
These may be generated with db_utils/make_pf_pitch.

power
These may be generated with db_utils/make_pf_power.

syllables
The stress marking in this file is placed in the PhonoForm utterance in the same field as lexical stress, hence the two values should at least try to match. Although there are algorithms that can syllabify phonemes from any source, it is much more useful if this syllabification matches that produced by the lexicon that is to be used in text-to-speech. It is up to the user to provide this file. In Japanese this is easier: a phoneme-label to syllable (actually mora) script is provided in make_pf_jsyl. `Stressed' and `unstressed' in Japanese should be lexical accents (i.e. what we mark with a single quote). Getting that information into the syllable file is the responsibility of the user.

tones
The tone labels are copied into the PhonoForm structure. Any fine marking of position within a syllable (i.e. the `<' in JToBI) or fine positioning (i.e. the `HiF0' in ToBI) is currently ignored. Another problem is the accuracy of the labeling, especially of ending or starting tones. Ending tone positions sometimes fall within a syllable boundary (i.e. on the last phoneme in the syllable) and sometimes just past it; a similar, but reversed, situation occurs with starting tones. An attempt to compensate for this error is made in the final construction, but it cannot be guaranteed to work in all cases.

breaks

words
Assuming these files exist in the `others/' subdirectory of the database directory, they may be collected into a single file per fileid. This file is like an XWAVES-labeled file, but has no header, and its `color' field is set to a type (word, phone, tones etc.). The file is called `FILEID.labs'. At this time small adjustments are made to some labels to try to ensure they appear on the correct side of boundaries. It may be necessary to change these `fiddle factors' to get the right results.
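The collation step can be pictured with the following sketch (the real work is done by the CHATR db_utils scripts; the file naming, the `TIME COLOR LABEL' line format, and the header terminated by `#' are assumptions here, not the distribution's actual conventions):

```python
def collate_labels(fileid, label_types, others_dir="others"):
    """Merge per-type XWAVES label files into one FILEID.labs file,
    setting the color field to the label type (word, phone, tones etc.)."""
    entries = []
    for ltype in label_types:
        with open(f"{others_dir}/{fileid}.{ltype}") as f:
            in_header = True
            for line in f:
                if in_header:                   # skip up to and including "#"
                    if line.strip() == "#":
                        in_header = False
                    continue
                fields = line.split()
                if len(fields) >= 3:            # TIME COLOR LABEL
                    entries.append((float(fields[0]), ltype, fields[2]))
    entries.sort()                              # time order across all types
    with open(f"{fileid}.labs", "w") as out:    # no header in the .labs file
        for time, ltype, label in entries:
            out.write(f"{time:.4f} {ltype} {label}\n")
```

Note the boundary `fiddle factors' mentioned above are deliberately omitted; this only shows the merge itself.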
Once created, the `.labs' file may be converted into a CHATR utterance input form (bracketed structure). The following script attempts to do this automatically
db_utils/make_pfs
The result should be a set of files in `chatr/pf/' that describe an utterance in the database, one for each fileid.
The simplest way to use PhonoForm utterances is as follows
(test_pf FILEID)
Like the other test functions (e.g. test_seg), this excludes the named file from the database and then synthesizes it from the remaining database units. In this form the result should be the same as a simple test_seg, as the segments, durations, power and pitch are the same.
By default a PhonoForm utterance is simply loaded and then the waveform synthesizer is called. A more interesting way of using this utterance is to load it, then call your own module, then call the waveform synthesis routine. This way you find out how well (or badly) your module affects the synthesis. The easiest way to do this is to define your own synthesis routine. Suppose we wish to check a new duration module, and hence to call just duration in the context of natural phrasing, accents, pitch, and segments. We could define
(define dur_synth (utt)
   (Input utt)       ;; do all the load of the utterance
   (Duration utt)    ;; just run my new duration module
   (Synthesis utt))  ;; generate a new waveform
Now we can test our duration module with everything else natural.
(dur_synth (lpf FILEID))
The function lpf simply loads the PhonoForm utterance.
NOTE the above example is a little simplistic, as the pitch target points may be adversely affected by the fact that the durations have changed. You may want to call Int_target to regenerate the intonation pitch targets. Also, lpf does not exclude FILEID from the search; see `lib/data/udb.ch' for the definition of test_pf if you wish to do so. Note that it may be better to modify the original waveform if you are testing duration prediction. If you use PSOLA on just the original waveform, then you hear only the distortion introduced by the duration module, rather than the distortion introduced by the duration module plus that introduced by the unit selection.
Another use of the PhonoForm utterance structure is to extract data for building prediction models. The function Feats_Out will dump a set of features for a given utterance. For example, suppose for each syllable we wish to know its mid-pitch, plus features we believe can be used to predict that value: namely, lexical stress, ToBI accent, the boundary after the syllable, and the number of syllables in from the start of the phrase and to the end. For a PhonoForm utterance we can simply do
(test_pf FILEID)
(Feats_Out utt_pf 'Syl
   '(syl_f0 stress tobi_accent bi syl_in syl_out)
   (strcat "feats/" FILEID ".sylinfo"))
This will dump the information to a file called `feats/FILEID.sylinfo', one line per vector. Of course, we really want to do this for the whole database, so let us assume that the Lisp variable files is set to a list of all fileids in the database; then the following would achieve that
(Parameter Synth_Method NONE)
(set required_feats '(syl_f0 stress tobi_accent ...))
(define get_feats (name)
   (print name)   ;; so we can see the progress
   (test_pf name)
   (Feats_Out utt_pf 'Syl required_feats
      (strcat "feats/" name ".sylinfo")))
(mapc get_feats files)
The first line means that no waveform synthesis occurs, making this dump substantially faster. The following section gives a larger example of using this technique to collect information from a database for building models.
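Once dumped, the feature files are plain text, one whitespace-separated vector per line, so they are easy to post-process outside CHATR. A minimal sketch (the `feats/*.sylinfo' naming follows the example above; the helper itself is illustrative):

```python
import glob

def load_feats(pattern="feats/*.sylinfo"):
    """Collect all dumped feature vectors into one list of rows
    (each row is a list of string-valued fields)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:            # skip blank lines
                    rows.append(fields)
    return rows
```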
Available features are defined in `src/chatr/feats.c', or may be listed in Lisp by calling Feats_Out with no arguments.
This section describes how to extract information from a database (using PhonoForm utterances) and build a linear regression model for predicting duration. A duration model like this, built from speaker f2b of the BU FM Radio corpus, is already included in the CHATR distribution as `lib/data/f2b_lrdur.ch'. The following shows how to train one for a new speaker.
The example Lisp code discussed here is given in `lib/examples/train_lrdur.ch'.
The first stage is to extract data from the database in the form of features. We are going to build two models, one for vowels and one for consonants. As these two models use different features, we have to create two sets of feature files. The two lists of features are defined in the variables C_durfields for consonants and V_durfields for vowels. (Available features are defined in `src/chatr/feats.c'.)
After the loading and setup of the database, the list of all fileids in the database is set in the variable files. We will deal with all files here and split them into training and testing data later. Next comes the definition of the dumping functions. Because we cannot (or at least not very easily) dump the features separately for vowels and consonants, we dump them all and split them later. At the same time we dump the individual phonemes to a separate file. The dump function get_feats_utt works on a single fileid; calling the function dumpfeats will map it over the whole database.
(dumpfeats)
This function will dump three files for each fileid in the database, each in the `feats/' subdirectory. These files contain the vowel features (defined by V_durfields), the consonant features (defined by C_durfields), and the phonemes.
The next task is to collate the fields into training and test data files for the linear regression software. An example shell script to do this is given in /lib/data/examples/make_lrdat.
The script first collates all the consonant data together and removes all vectors that are actually for pauses, breaths and vowels. It then splits the feature sets into train and test sets, simply adding parentheses around the vectors. It then does the same for the vowel data.
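The consonant half of that collation can be sketched as follows. This is not the distributed make_lrdat script: for simplicity we assume the phoneme name has been merged in as the first field of each vector (the real setup keeps phonemes in a separate file), and the phoneme symbols, `.condur' suffix and 9-to-1 split are invented for illustration.

```python
import glob

# Symbols to exclude from the consonant data -- an invented set; the real
# script knows the database's actual phoneme inventory.
NON_CONSONANTS = {"pau", "brth", "a", "e", "i", "o", "u"}

def collate_consonants(stem="datC01", featdir="feats"):
    """Collect consonant vectors, drop pauses/breaths/vowels, and write
    parenthesized train and test files for the LR software."""
    vectors = []
    for path in sorted(glob.glob(f"{featdir}/*.condur")):   # assumed suffix
        for line in open(path):
            fields = line.split()
            if fields and fields[0] not in NON_CONSONANTS:
                vectors.append(fields)
    with open(stem + ".train", "w") as tr, open(stem + ".test", "w") as te:
        for i, v in enumerate(vectors):
            out = te if i % 10 == 9 else tr         # every 10th vector to test
            out.write("( " + " ".join(v) + " )\n")  # parentheses round vectors
    return len(vectors)
```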
The third stage is to build the linear regression models from the feature vectors. This can be done (from the `dur/' directory) in CHATR with `train_lrdur.ch' loaded. The function dolrall takes a file name and a list of features, builds a linear regression model, then tests that model on the training and test data. Two calls are necessary, one for the consonants and a second for the vowels.
(dolrall "datC01" C_durfields)
(dolrall "datV01" V_durfields)
Note that the mean errors are in z-scores, so they are not immediately recognizable as durations.
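To read a z-score error as an absolute duration you need the per-phone mean and standard deviation used to normalize the data. A sketch (the statistics below are invented for illustration):

```python
# phone -> (mean duration in seconds, standard deviation); invented values
phone_stats = {"t": (0.065, 0.020), "aa": (0.110, 0.035)}

def z_to_seconds(phone, z):
    """Map a z-score back to an absolute duration for the given phone."""
    mean, sd = phone_stats[phone]
    return mean + z * sd
```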
You should look at the files `datC01.info' and `datV01.info', which contain the detailed results of the linear regression. Interesting points are of course the correlation, the stepwise model showing the relative contribution of each feature, and the dropped section, which shows features that make no contribution; often this is because they are completely predictable from some other feature (or set of features).
Once you are happy with the prediction capability, you can build an actual duration model from this data and test it in the system. Call the function save_lrmodel with arguments of consonant model name, vowel model name and output file name:
(save_lrmodel "datC01" "datV01" "lrmodel01.ch")
Normally you will want to edit the created file `lrmodel01.ch' to add at least some comments about where this model came from. Also, by default this sets the variable dur_lr_model. It is more likely that some other variable should be set, and that dur_lr_model should only be set when this model is actually selected. See `db_utils/DBNAME_synth.ch' for the typical use of dur_lr_model.
Now that we have built a duration model we can actually use it to predict durations. Again we wish to run through the whole database, load in the `pf' utterances, save the actual durations, then run our new duration module on each utterance and save the predicted durations. We can do this with the function test_new_dur_model defined in `train_lrdur.ch'. From the main database directory call
(test_new_dur_model "dur/lrmodel01.ch")
Now in `feats/' for each fileid there are files `.accdur' and `.preddur'. It is left as an exercise to the reader to use these files to find the mean error and correlation for the overall model (vowels and consonants).
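One possible solution to that exercise, sketched outside CHATR (assuming each `.accdur'/`.preddur' pair holds aligned, whitespace-separated durations):

```python
import glob, math

def dur_model_score(featdir="feats"):
    """Mean absolute error and Pearson correlation between the actual
    (.accdur) and predicted (.preddur) durations over the database."""
    actual, predicted = [], []
    for path in sorted(glob.glob(f"{featdir}/*.accdur")):
        actual += [float(x) for x in open(path).read().split()]
        predicted += [float(x) for x in
                      open(path.replace(".accdur", ".preddur")).read().split()]
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)
    return mae, corr
```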
Here we give an example of building an F0 prediction model using linear regression with ToBI labels as input parameters. First, it is assumed that the database from which the model is to be built is labeled with ToBI (or similar) labels, and that a set of PhonoForm utterances has been created. See section PhonoForm Utterance Types for more information.
We can build a new F0 prediction model in a similar way to how a new duration model is built. See section Training a New Duration Model. Again, you need to decide on the features. In this case three models are required; however, as they are for syllables rather than phonemes, the amount of data is much smaller than in the duration case.
The reduction model used for speaker f2b and some other English speakers was trained from the f2b database. Again the PhonoForm utterances were used. The feature pf_reduced was used as the value to be predicted. This value is calculated for each syllable. The word the syllable is in is looked up in the lexicon, and then an attempt is made to align the syllables of the lexicon version with the actual version. If they line up, a check is made to see if the syllable's vowel is the same as the vowel in the corresponding syllable of the lexical entry. If different, a list of schwa pairs is checked, and if the actual vowel is listed as a schwa version of the lexical vowel, the syllable is marked as reduced.
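The per-syllable decision described above can be sketched as follows (the schwa pair table and phone names are invented for illustration; the real check also depends on the lexicon/actual syllable alignment succeeding):

```python
# (reduced vowel, full lexical vowel) pairs -- an invented table
SCHWA_PAIRS = {("ax", "ae"), ("ax", "ih"), ("ax", "eh")}

def is_reduced(actual_vowel, lexical_vowel):
    """True if the spoken vowel is a schwa version of the lexicon's vowel."""
    if actual_vowel == lexical_vowel:
        return False                       # vowels match: not reduced
    return (actual_vowel, lexical_vowel) in SCHWA_PAIRS
```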
A set of vectors was collected including this reduced value plus other pertinent features. From that information a CART decision tree was built to try to predict reduction. The result is the small tree now in `lib/data/reduce.ch'. It seems relevant and sounds reasonable.
This technique effectively trains to a particular lexicon, as different lexicons often make different decisions about the amount of vowel reduction in their lexical entries. This particular example was trained using the CMU lexicon; when the same reduction tree is used with a speaker using the BEEP lexicon, the results are not as good.