A number of mechanisms exist within CHATR to predict the duration of segments in synthesis. This chapter discusses each in turn.
Different duration methods are selected via the Parameter command, e.g.
(Parameter Duration_Method KLATT_DUR)
A global stretch parameter is available to modify the overall speed of the predicted durations. Note that it is simply a factor by which the duration of each segment is multiplied -- no segment reduction takes place. It is set by the following command
(Parameter Duration_Stretch 1.2)
The default value is 1.0. A value of 0.0 or less is not allowed. The value is automatically reset to 1.0 whenever a new duration method is selected.
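The effect of Duration_Stretch can be sketched in a few lines (a Python illustration, not CHATR code; the function name and example durations are invented):

```python
def apply_duration_stretch(durations_ms, stretch):
    """Multiply each predicted segment duration by the global stretch
    factor; no segments are dropped. Values of 0.0 or less are
    rejected, mirroring CHATR's restriction on Duration_Stretch."""
    if stretch <= 0.0:
        raise ValueError("Duration_Stretch must be greater than 0.0")
    return [d * stretch for d in durations_ms]

# A stretch of 1.2 slows speech by lengthening every segment by 20%.
print([round(d) for d in apply_duration_stretch([100, 80, 150], 1.2)])
# [120, 96, 180]
```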
Note that in all cases pause/silence durations are predicted with a
different mechanism than non-pause phonemes. A better pause duration
prediction system is probably required, but it is already separated
from the various existing duration modules. Pause durations are
based on the prosodic boundary level of the word ending with the
segment immediately preceding the pause. A table of pause lengths
based on boundary level may be given through the Stats
command.
A typical example (as defined in the library file
`lib/data/rp_pause.ch') is
(Stats Pause ( (discourse 400) (sentence 250) (clause 100) (phrase 50) ))
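The lookup this table configures can be sketched as follows (a hypothetical Python illustration, not the CHATR implementation; the function name is invented, and the table values are those from rp_pause.ch above):

```python
# Table values are those from lib/data/rp_pause.ch; the function name
# is invented for this sketch.
PAUSE_TABLE_MS = {"discourse": 400, "sentence": 250,
                  "clause": 100, "phrase": 50}

def pause_duration_ms(boundary_level):
    """Return the pause length for the prosodic boundary level of the
    word ending with the segment immediately preceding the pause."""
    return PAUSE_TABLE_MS[boundary_level]

print(pause_duration_ms("clause"))  # 100
```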
The prediction of the duration of a pause at the beginning of an utterance is a problem not worth much consideration: in general we do not know what has gone before (except in the text-to-speech case), so we cannot predict how much pause is required. It seems fair to assume that an utterance consists of a complete prosodic phrase, so a small pause is not unreasonable. Currently a pause of 50ms is always generated when a duration module is called.
When using a DATLINK as an audio output device, there is a significant delay before the playing of the waveform starts. Hence the Stats Pause values may need to be reduced. For text-to-speech (synthesizing in sentence-sized chunks), a smaller value for the `sentence' level pause is recommended, as the pause generated within the DATLINK between playing waves may be as much as 750 milliseconds. Of course there may be ways to stop the DATLINK from doing this.
This is an implementation of the Klatt duration rule system as
described in Allen 87 [Ch. 9]. It follows the 10 rules as
closely as possible. This module requires initialization via the
Stats
function. This takes the form
(Stats Klatt_dur <phone_set> ( <phone0_stats> <phone1_stats> ... ))
The phone_set
is optional. If specified, it must be the name
of a currently defined phoneme set. If omitted, the current input
phoneme set is assumed. Individual phoneme statistics consist of a
triplet: a phoneme name, an inherent duration (in milliseconds), and a
minimum duration (in milliseconds).
As an example, a partial description for the `mrpa' phoneme set is
(Stats Klatt_dur mrpa
   ( (   120  60)   ; AX
     (@  180  80)   ; ER
     (a  230  80)   ; AE
     (aa 240 100)   ; AA
     (ai 250 150)   ; AY
     (au 240 100)   ; AO
     (b   85  60)   ; BB
     ... ))
A full `mrpa' definition is listed in the library file `lib/data/rp_dur.ch'.
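The core combination step of the Klatt scheme multiplies the stretchable part of the inherent duration by the percentage accumulated over the rules, then adds back the incompressible minimum: DUR = ((INHDUR - MINDUR) * PRCNT) / 100 + MINDUR, as given in Allen 87. A minimal Python sketch of that step (the rule percentages themselves are not shown; the values are illustrative):

```python
def klatt_duration_ms(inherent_ms, minimum_ms, percent):
    """Klatt's combination rule from Allen 87 [Ch. 9]:
    DUR = ((INHDUR - MINDUR) * PRCNT) / 100 + MINDUR
    where PRCNT is the percentage accumulated over the duration rules."""
    return ((inherent_ms - minimum_ms) * percent) / 100.0 + minimum_ms

# With no rule firing (100%), the inherent duration is returned.
print(klatt_duration_ms(230, 80, 100))  # 230.0
# Halving the stretchable portion of a phone with inherent 230ms, minimum 80ms.
print(klatt_duration_ms(230, 80, 50))   # 155.0
```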
This has been extracted from the NUUTALK code as a standalone CHATR duration module. It is specific to the Japanese phoneme set `nuuph' (and not very robust with alternatives). It is viewed as a stop-gap, allowing Japanese to be synthesized in a more general way than the original method, which depends on internal NUUTALK code. No parameters are available for modification.
The Kaiki duration method is selected by the following command
(Parameter Duration_Method KAIKI_DUR)
Using some of the ideas from Campbell 92, this duration method breaks the task into two levels: first syllable durations are predicted, and then, based on those values, the durations of the phonemes within each syllable are predicted.
In this implementation both the syllable durations and the phoneme durations are predicted by neural nets (the following section describes how to use the neural net system within CHATR).
The nets and a description of their inputs are given to this module
through the Lisp variable nnd_nets. Its value may be of
length 2 or 4. A net is described by two items: a list of atomic
input features, and the net itself (as generated by the function
NN_Train). If two nets are given (length 4), the first net is
used to predict syllable durations while the second is used to
generate phoneme durations. If only one net is given (length 2), it is
used to predict phoneme durations directly.
Silence durations are not predicted using these nets; a separate pause duration mechanism based on phrase break level is used.
Two examples are included in the distribution: the file `lib/data/f2b_dur_nnet.ch' offers a syllable net plus a phoneme net, while `lib/data/f2b_phnet.ch' offers direct phoneme prediction.
Features used as input to the neural net are obtained via the feature function mechanism. See section Feature Functions, for a full description.
One example of NNet duration data is included in the library file `lib/data/f2b_dur_nnet.ch'. This has been trained from the BU FM Radio database from the female radio announcer f2b. The syllable net inputs are
(ppblvl pblvl blvl nblvl nnblvl pcoda coda paccented accented naccented ppbprom pbprom bprom nbprom nnbprom ppstress pstress stress nstress nnstress ppvtypeN pvtypeN vtypeN nvtypeN nnvtypeN onset nonset foot remssyl remssylsent psyl_type syl_type nsyl_type )
Note that all these features return character strings of digits; thus when they are concatenated together they form the input to the net. The best way to find the definitions of these features is by looking at the code in `src/chatr/feats.c'.
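As a rough illustration of how such digit strings could become net inputs, assuming one input unit per digit (this sketch is not the CHATR code, and the helper name and feature values are invented):

```python
def net_input(feature_strings):
    """Concatenate the digit strings returned by the feature functions
    and present one unit per digit to the net (an assumption made for
    this sketch). The feature values here are invented."""
    return [int(ch) for ch in "".join(feature_strings)]

print(net_input(["01", "1", "20"]))  # [0, 1, 1, 2, 0]
```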
Parameters to this module are set in the Lisp variable nnd_params.
A method for using linear regression for duration prediction is also
included. The CHATR Lisp function Linear_Regression may
be used to build linear regression models. Once created, they may be
used in duration prediction as follows. The method is selected by
the command
(Parameter Duration_Method LR_DUR)
Once set, the module takes its input from the Lisp variable
dur_lr_model. Its value should be a pair of linear regression
models, each consisting of a list of pairs of feature name and
weight. The first value in the model should be the intercept. The
model should predict z-scores of absolute durations in milliseconds
(this may change).
A second variable dur_lr_targ_stats
should contain a list of
phonemes plus means and standard deviations (in milliseconds) for the
target speaker.
Thus this method allows a degree of speaker independence (though no formal tests have been made of how well cross prediction works).
Examples are given in `lib/data/f2b_lrdur.ch' (English) and
`lib/data/mht_lrdur.ch' (Japanese). The target statistics may
be created for a database at database creation time using the script
db_utils/make_lrdurstats.
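The two steps of the LR_DUR method -- an intercept plus weighted feature sum giving a z-score, then conversion to milliseconds with the target speaker's per-phoneme statistics -- can be sketched in Python (the model weights and feature names here are invented, not taken from the distributed files):

```python
def lr_predict_z(model, feats):
    """model is a list of (feature_name, weight) pairs whose first
    entry is the intercept; the prediction is the intercept plus the
    weighted sum of the feature values (a z-score of absolute duration)."""
    (_, intercept) = model[0]
    return intercept + sum(w * feats[name] for name, w in model[1:])

def z_to_ms(z, mean_ms, stddev_ms):
    """Map a predicted z-score back to milliseconds using the target
    speaker's mean and standard deviation for the phoneme in question."""
    return mean_ms + z * stddev_ms

# Invented example: a stressed phone whose target-speaker statistics
# are mean 100ms, standard deviation 20ms.
model = [("Intercept", 0.25), ("stress", 0.5), ("onset", -0.25)]
z = lr_predict_z(model, {"stress": 1.0, "onset": 0.0})
print(z_to_ms(z, 100.0, 20.0))  # 115.0
```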
A second linear regression module exists following the Campbell method more closely but has yet to be fully tested. It is selected using
(Parameter Duration_Method LR_DUR_SYL)
In this case dur_lr_model should contain a single linear
regression model for predicting syllable durations (from syllable
cells). The second variable dur_lr_targ_stats should contain
each phone with its mean and standard deviation of log
durations.
The syllable duration is predicted first; then the predicted durations of the phonemes within it are summed, and a factor is found that modifies each phoneme by a fraction of its standard deviation so that the syllable total is matched. Again this should offer a degree of speaker independence, following Campbell's original work (Campbell 92), but full tests have not yet been made. However, since in this case we are not using neural nets, the success of the method may differ.
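The scaling step can be sketched as follows: find a single factor k such that moving every phone by k of its own standard deviation (in the log domain) makes the phone durations sum to the predicted syllable duration. This Python illustration uses bisection and invented statistics; it is a sketch of the idea, not the CHATR implementation:

```python
import math

def fit_syllable_ms(phone_log_stats, syllable_target_ms):
    """Find a single factor k such that the phone durations
    exp(mean + k * stddev) sum to the predicted syllable duration:
    every phone moves by the same fraction k of its own standard
    deviation of log duration. Solved by bisection; illustration only."""
    def total(k):
        return sum(math.exp(m + k * s) for m, s in phone_log_stats)
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if total(mid) < syllable_target_ms:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2.0
    return [math.exp(m + k * s) for m, s in phone_log_stats]

# Invented statistics: two phones averaging 80ms and 120ms must
# stretch to fill a predicted syllable duration of 250ms.
durs = fit_syllable_ms([(math.log(80), 0.3), (math.log(120), 0.2)], 250.0)
print(round(sum(durs), 1))  # 250.0
```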