A number of different intonation theories have already been implemented within CHATR. By `intonation theory' we mean some symbolic representation (possibly with continuous parameters) that can be used to generate an underlying F0 in synthesis. There are other ways to specify F0 within CHATR apart from an intonation system, such as specifying values for frames throughout the utterance.
In an intonation system, information is contained within the `Intone' stream. This is primarily related to syllables in English, and mora in Japanese. Intonation parameters must be of the same type for a whole utterance. The intonation parameter may be specified directly in some input methods, or predicted by some higher-level part of CHATR---typically the HLP rules.
This was the first intonation system to be implemented within CHATR and hence is both the simplest and probably the most stable. It is, however, rather limited. The work is based on that in Taylor 92. Syllables may be marked with one of four elements: H (high), L (low), C (continuation), or B (boundary). In addition, these elements may be followed by features. The features may be individually defined (as can the elements), but in our examples the defined features are
H   early late downstep
L   early late downstep
C   rise
B   initial
Elements and features define values and modifications of values for a fixed number of continuous parameters. They are used in the prediction of the RFC, a lower level, more explicit representation of the F0 contour. These definitions may be tuned for a particular speaker's pitch range.
Definitions are made through the Stats Intonation command.
HLCB is selected by either of the following commands
(Parameter Int_Method CSTR)
(Parameter Int_Method RFC)
An implementation of the English ToBI system is included in the system. (See Silverman 92.) As with the other intonation systems included in CHATR, it consists of three sub-parts:
Stage 1 is done before any duration information is available: duration prediction methods need to know accent information, so accents and tones must be predicted before the duration module is run. The second stage is called after durations have been predicted and hence can deal with absolute positioning.
Parameters for ToBI may be set through the variable ToBI_params. Its value should be a Lisp a-list (a list of pairs) consisting of name and value. The names currently supported are
pitch_accents
(pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
These are the actual accents that appear in speaker f2b of the Boston University Radio News corpus. (See Ostendorf 95.)
phrase_accents
(phrase_accents H- L-)
boundary_tones
(boundary_tones H-H% L-H% L-L% H-L%)
target_method
target_f0mean
target_f0std
If target_method is LR, a list of three linear regression models should be set to the variable tobi_lrf0_model. These predict the start, mid-vowel and end values for a syllable. The feature name and weight `pairs' may optionally have a third argument specifying a feature map. Feature maps allow category-valued features to be mapped to binary ones: if the value returned by a feature is in the named feature map then the value is 1, otherwise it is 0. Example linear regression models are in `lib/data/f2b_lrf0.ch' for English and `lib/data/mht_lrf0.ch' for Japanese (JToBI).
The following parameters are only used if the target_method is APL. Currently no mechanism is provided to tune these parameters automatically.
topval
    refval for maximum sized accents (speaker-dependent).
baseval
    refval for minimum sized accents (speaker-dependent).
refval
h1
    topval to position step before H accents (who knows if this is speaker-dependent or not).
l1
    baseval to position step before L accents (who knows if this is speaker-dependent or not).
prom1
    topval to position top of H accents.
prom2
    topval (or baseval) to position top of !H accents, H accents in compound accents, H and L in phrase accents.
prom3
    topval (or baseval) to position end of phrase accents.
HiF0_factor
decline_range
hamwin_size
The actual method used in the implementation was strongly influenced by (incomplete) example code from AT&T Bell Labs, with significant input from Mary Beckman. Hence it follows their model (and parameter names) very closely. The APL technique is also described in Anderson 84.
The ToBI intonation method is selected by the following command
(Parameter Int_Method ToBI)
JToBI is an implementation of the work described in Pierrehumbert 88b. The implementation was done in conjunction with Mary Beckman. The following parameters may be set through the variable mb_params.
Although many parameters are available for controlling the prediction of F0 target points, the same linear regression method used by the English ToBI system produces better results and, more importantly, can be trained. A linear regression model consists of three separate models for predicting the start, mid-vowel and end target points for a syllable. A fourth item in the variable tobi_lrf0_model is the source mean F0 and standard deviation, to allow F0 pitch mapping between speakers. The format is exactly the same as that used for the English ToBI. A Japanese (JToBI) LR model example is given in `lib/data/mht_lrf0.ch'.
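The following Python sketch illustrates how one such linear model might be evaluated, and how a predicted value could be moved into another speaker's pitch range. It is not CHATR code: the tuple layout, the feature names, the weights, and the z-score mapping in map_f0 are all assumptions made for illustration, and the real model format is the one found in the example `.ch' files.

```python
def feature_term(value, weight, feature_map=None):
    """Contribution of one feature to the regression sum.

    With a feature map, a category-valued feature becomes binary:
    1 if its value is in the map, otherwise 0. Without a map the
    value is assumed to be numeric already.
    """
    if feature_map is not None:
        value = 1.0 if value in feature_map else 0.0
    return weight * float(value)


def predict_target(model, features):
    """Evaluate one linear model: intercept plus weighted feature terms."""
    intercept, terms = model
    total = intercept
    for name, weight, *rest in terms:
        feature_map = rest[0] if rest else None
        total += feature_term(features[name], weight, feature_map)
    return total


def map_f0(f0, source_mean, source_std, target_mean, target_std):
    """Map an F0 value between speakers by matching z-scores.

    This mapping is an assumption, not taken from the CHATR source.
    """
    return target_mean + (f0 - source_mean) / source_std * target_std


# Hypothetical model for the start target point; names and weights are made up.
start_model = (160.0, [
    ("stressed", 12.0),                  # numeric feature, used as is
    ("accent", 20.0, {"H*", "L+H*"}),    # third item is a feature map
])

syllable = {"stressed": 1, "accent": "H*"}
f0 = predict_target(start_model, syllable)
mapped = map_f0(f0, 160.0, 32.0, 120.0, 20.0)
print(f0, mapped)
```

A full model in this sketch would simply be three such tuples (start, mid-vowel, end) plus the source mean and standard deviation passed to map_f0.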
The JToBI intonation method is selected by the following command
(Parameter Int_Method JToBI)
An implementation of the Fujisaki model (see Fujisaki 83) is available (for Japanese). It is still experimental but does produce F0 contours. Parameters are set through the variable fujisaki_model. Details of the parameters and their values may be found by looking at the actual code in `src/intonation/fujisaki.c'.
The Fujisaki intonation method is selected by the following command
(Parameter Int_Method Fujisaki)
Using the work described in Taylor 93b, this model offers a labeling system which may be automatically derived from waveforms or phoneme labels.
The Tilt intonation method is selected using the command
(Parameter Int_Method Tilt)
The most difficult part about adding a new speaker is labeling the data. Once the data is in the form that CHATR requires, everything else is simple.
CHATR requires a syllable utterance type description for each utterance. This comprises a list of phrases, each with a start F0. Within each phrase is a list of syllables, each of which may have one or more events marked. An example is given below
(Utterance
 (Syllable (space rfc)(format feature)(dimen num))
 (
  (:C ()
   ((hh 60) (eh 65) ((E)))
   ((l 33) (ow 207) ())
  )
  (:C ()
   ((dh 27) (ih 56) ((E)))
   ((s 75) (ih 56) (z 44) ())
   ((dh 42) (ax 36) ())
   ((k 95) (aa 129) (n 44) ((E)))
   ((f 77) (r 36) (en 57) ())
   ((s 77) (ao 156) ())
   ((f 83) (eh 105) (s 203) ())
  )
 ))
In this type of description, only the presence of an event need be marked.
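Since labeling is the laborious part, it may help to generate these descriptions from whatever label format is at hand. The Python sketch below writes a syllable description in which only the presence of an event is marked. The input layout (phrases as lists of syllables, each a list of phone and duration pairs plus an event flag) and the function names are hypothetical. The header line and the `:C' phrase label are copied from the example above, and the sketch assumes the phrases sit in a single enclosing list.

```python
def format_syllable(segments, has_event):
    """Render one syllable: its (phone duration) pairs, then the event list."""
    segs = " ".join(f"({phone} {dur})" for phone, dur in segments)
    events = "((E))" if has_event else "()"
    return f"({segs} {events})"


def format_utterance(phrases):
    """Render a list of phrases as a syllable-type utterance description."""
    lines = [
        "(Utterance",
        " (Syllable (space rfc)(format feature)(dimen num))",
        " (",
    ]
    for phrase in phrases:
        lines.append("  (:C ()")          # phrase header, no start F0 given
        for segments, has_event in phrase:
            lines.append("   " + format_syllable(segments, has_event))
        lines.append("  )")
    lines.append(" ))")
    return "\n".join(lines)


# First phrase of the example, "hello": two syllables, event on the first.
phrases = [
    [
        ([("hh", 60), ("eh", 65)], True),
        ([("l", 33), ("ow", 207)], False),
    ],
]
text = format_utterance(phrases)
print(text)
```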
In addition, an RFC input description is required. An example is given below
(Utterance RFC(
 (sil 303 ( ( sil 0 166 ) ))
 (hh 60 ())
 (eh 65 ( ( fall 21 166 )))
 (l 33 ())
 (ow 207 ( ( conn 67 125 ) ( sil 197 120 )))
 (sil 155 ())
 (dh 27 ( ( rise 0 149 )))
 (ih 56 ())
 (s 75 ( ( fall 60 173 )))
 (ih 56 ())
 (z 44 ())
 (dh 42 ( ( conn 4 151 )))
 (ax 36 ())
 (k 95 ())
 (aa 129 ())
 (n 44 ( ( fall 5 142 )))
 (f 77 ())
 (r 36 ())
 (en 57 ())
 (s 77 ( ( conn 74 95 )))
 (ao 156 ())
 (f 83 ())
 (eh 105 ())
 (s 203 ( ( sil 91 91 )))
 (sil 524 ())
))
The CHATR user function train_input takes these two utterance descriptions and produces a syllable description in the RFC event space
(Utterance
 (Syllable (space rfc)(format num)(dimen linear))
 (
  (:C ((Start 166))
   ((hh 60) (eh 65) ((C 0.00) (E 0.00 0.00 -41.00 144.00 21.00)))
   ((l 33) (ow 207) ((C 0.00) ))
  )
  (:C ((Start 149))
   ((dh 27) (ih 56) ((C 0.00) (E 24.00 143.00 -22.00 119.00 116.00)))
   ((s 75) (ih 56) (z 44) ((C -9.00) ))
   ((dh 42) (ax 36) ())
   ((k 95) (aa 129) (n 44) ((E 0.00 0.00 -47.00 283.00 134.00)))
   ((f 77) (r 36) (en 57) ((C -4.00) ))
   ((s 77) (ao 156) ())
   ((f 83) (eh 105) (s 203) ())
  )
 ))
Next, the function Rfc_to_Tilt is called, which transforms this into tilt space. With a sufficient number of utterances in tilt space, statistics can be collected on each of the four tilt parameters and the phrase start F0 parameter. The means and standard deviations need to be calculated, which can be done using S or any other utility. The tilt descriptions can be derived from the utterance file. Alternatively, the Int_Stats function returns a list of all the events or connections (or both) for an utterance. By calling mapc, one can obtain all the statistics for a database.
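As one example of `any other utility', the means and standard deviations can be computed with a few lines of Python once the tilt events have been written out. The sketch below assumes each event has already been read into a dictionary keyed by the four tilt parameter names used in the speaker table. That input layout and the sample values are made up, and the sample standard deviation is used because the manual does not say which form is intended.

```python
from statistics import mean, stdev

TILT_KEYS = ("amp", "dur", "tilt", "peak_pos")


def collect_stats(events):
    """Return {parameter: (mean, sample standard deviation)} over all events."""
    stats = {}
    for key in TILT_KEYS:
        values = [event[key] for event in events]
        stats[key] = (mean(values), stdev(values))
    return stats


# Made-up events standing in for those gathered from a whole database.
events = [
    {"amp": 40.0, "dur": 250.0, "tilt": -0.5, "peak_pos": 50.0},
    {"amp": 50.0, "dur": 300.0, "tilt": 0.0, "peak_pos": 60.0},
    {"amp": 60.0, "dur": 350.0, "tilt": 0.5, "peak_pos": 70.0},
]

stats = collect_stats(events)
for key in TILT_KEYS:
    print(key, stats[key])
```

The same function can be applied to a list of phrase start F0 values by wrapping each value in a one-key dictionary, or by calling mean and stdev on the list directly.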
Once the statistics have been collected, a speaker table can be constructed by entering the mean and standard deviations in the appropriate places. A typical speaker file is given below
(Stats Intonation (
 (Element E (def tilt E)(
   (amp = 47 Hz)
   (dur = 291 ms)
   (tilt = 0.0 rel)
   (peak_pos = 59 ms)
 ))
 (Element E (var tilt E)(
   (amp = 31 Hz)
   (dur = 141 ms)
   (tilt = 0.75 rel)
   (peak_pos = 136 ms)
 ))
 (Element C (def any C)(
   (amp = 0.0 Hz)
 ))
 (Element C (var any C)(
   (amp = 10 Hz)
 ))
 (Element P (def any P)(
   (amp = 151.0 Hz)
 ))
 (Element P (var any P)(
   (amp = 20 Hz)
 ))
))
A little care needs to be taken here as the system will accept inappropriate feature sets but become confused by them.
An example feature set is
(Stats Intonation (
 (Feature rise (binary tilt C)(
   (amp += 10 rel)
 ))
 (Feature fall (binary tilt C)(
   (amp -= 10 rel)
 ))
 (Feature amp (scalar tilt E)(
   (amp += 1 rel)
   (dur += 1 rel)
 ))
 (Feature early (binary tilt E)(
   (peak_pos -= 1.1 rel)
 ))
 (Feature late (binary tilt E)(
   (peak_pos += 1.1 rel)
 ))
 (Feature rise (binary tilt E)(
   (tilt += +1 rel)
 ))
 (Feature fall (binary tilt E)(
   (tilt += -1 rel)
 ))
) )
Feature headers are defined in the form
<name> (<type> <space> <element>)
The name need only be unique to the space and element, so connection and event features can have the same name without confusion. type refers to scalar or binary. space refers to rfc or tilt, though presently only tilt is fully implemented. element refers to whether the feature should operate on an event or a connection.
Feature bodies are defined in the form
(variable operator value dimension)
variable specifies which tilt variable is to be affected. operator should always be += or -=. Note that value is a standard deviation, so very large values are inadvisable. dimension is not presently used.
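One way to read the note about value is that a feature shifts a variable away from its default by value times that variable's standard deviation. The Python sketch below works through that reading for a binary feature, using the event defaults and deviations from the example speaker table. It is an interpretation for illustration only, not a description of the CHATR source, and the function name is hypothetical.

```python
import operator

OPERATORS = {"+=": operator.add, "-=": operator.sub}


def apply_feature(defaults, deviations, body):
    """Return a copy of `defaults` with each (variable, op, value) applied.

    Assumption: `value` counts standard deviations, so the shift applied
    to a variable is value * deviations[variable].
    """
    result = dict(defaults)
    for variable, op, value in body:
        shift = value * deviations[variable]
        result[variable] = OPERATORS[op](result[variable], shift)
    return result


# Event defaults (def) and deviations (var) from the example speaker table.
event_def = {"amp": 47.0, "dur": 291.0, "tilt": 0.0, "peak_pos": 59.0}
event_var = {"amp": 31.0, "dur": 141.0, "tilt": 0.75, "peak_pos": 136.0}

# Body of the binary event feature `rise': (tilt += +1 rel)
risen = apply_feature(event_def, event_var, [("tilt", "+=", 1.0)])
print(risen["tilt"])
```

Under this reading a value of 1 already moves a variable a full standard deviation from its default, which is consistent with the advice to avoid very large values.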