Intonation

A number of different intonation theories have already been implemented within CHATR. By `intonation theory' we mean some symbolic representation (possibly with continuous parameters) that can be used to generate an underlying F0 in synthesis. There are other ways to specify F0 within CHATR, apart from an intonation system, such as specifying values for frames throughout the utterance.

In an intonation system, information is contained within the `Intone' stream. This is primarily related to syllables in English and morae in Japanese. Intonation parameters must be of the same type for a whole utterance. The intonation parameters may be specified directly in some input methods, or predicted by some higher-level part of CHATR---typically the HLP rules.

HLCB and RFC

This was the first intonation system to be implemented within CHATR and hence is both the simplest and probably the most stable. It is, however, rather limited. The work is based on that in Taylor 92. Basically, syllables may be marked with one of four elements: H (high), L (low), C (continuation), or B (boundary). In addition, these elements may be followed by features. The features may be individually defined (as can the elements), but in our examples the defined features are

     Element   Features
     H         early      late      downstep
     L         early      late      downstep
     C         rise
     B         initial

Elements and features define values, and modifications of values, for a fixed number of continuous parameters. They are used in the prediction of the RFC, a lower-level, more explicit representation of the F0 contour. These definitions may be tuned for a particular speaker's pitch range.

Definitions are made through the Stats Intonation command.
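
The exact syntax for these definitions is the same Stats Intonation syntax shown in the speaker table and feature set examples later in this section. Purely as a hypothetical illustration (the rfc space name, the parameter names and the numbers here are placeholders, not the definitions actually shipped with CHATR), an H element and its downstep feature might be declared along the following lines

     (Stats Intonation (
             (Element H (def rfc H)(
                      (amp      = 60  Hz)
                      (dur      = 200 ms)
                     ))
             (Feature downstep (binary rfc H)(
                      (amp      -= 10 rel)
                     ))
     ))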

HLCB is selected by either of the following commands

     (Parameter Int_Method CSTR) 
     (Parameter Int_Method RFC) 

ToBI

An implementation of the English ToBI system (see Silverman 92) is included. As with the other intonation systems in CHATR, it consists of three sub-parts:

  1. Prediction of accents and tones.
  2. Realization of these tones into F0 target points, taking into account speaker range and parameters.
  3. Conversion of the set of target points to a sampled F0 at the desired rate by smoothing the target points.

Stage 1 is done before any duration information is available: because duration prediction methods need to know accent information, accents and tones must be predicted before the duration module is run. The second stage is called after durations have been predicted and hence can deal with absolute positioning.

Parameters for ToBI may be set through the variable ToBI_params. Its value should be a lisp a-list (a list of pairs), each pair consisting of a name and a value. The names currently supported are

pitch_accents
A list of supported pitch accents. Although this is a parameter, it can only really take one fixed value, since specific C code must exist for each accent to realise it as a set of target points. Its value should be (or be a subset of)
     (pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
These are the accents that actually appear for speaker f2b of the Boston University Radio News corpus. (See Ostendorf 95.)
phrase_accents
A list of supported phrase accents. Although this is a parameter, there is only really one set of values it can take. Its value should be
     (phrase_accents H- L-)
boundary_tones
A list of supported boundary tones. Although this is a parameter, there is only really one set of values it can take. Its value should be
     (boundary_tones H-H% L-H% L-L% H-L%)
target_method
The method used to generate F0 target points. Two values are possible. The first is APL (see Anderson 84), which predicts target values for syllables that are accented (or toned); it uses the large number of parameters below to tune the predicted values. The second is LR, which uses linear regression to predict start, mid-vowel and end target points for all syllables. APL is the default. The results of LR are closer to the natural F0, but at the cost of not being as general. The database building mechanism uses LR. An LR model may be mapped to a different speaker's pitch range through the following two parameters
target_f0mean
The mean F0 of vowels for the current speaker. This is used to map from the pitch range of the speaker used to create the LR F0 model to the current speaker's pitch range.
target_f0std
The standard deviation of F0 of vowels for the current speaker. This is used to map from the pitch range of the speaker used to create the LR F0 model to the current speaker's pitch range.
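
For illustration, a ToBI_params value selecting the LR target method and mapping its output to a particular speaker might look like the following (the accent, phrase accent and boundary tone lists are those given above; the two F0 figures are placeholders and should be measured for the speaker in question)

     ((pitch_accents H* !H* L* L+H* L*+H L+!H* H+!H* HiF0 X*? *?)
      (phrase_accents H- L-)
      (boundary_tones H-H% L-H% L-L% H-L%)
      (target_method LR)
      (target_f0mean 170)
      (target_f0std 30))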

If the target method is LR, a list of three linear regression models should be assigned to the variable tobi_lrf0_model. These predict the start, mid-vowel and end values for a syllable. The feature-name/weight `pairs' may optionally have a third element specifying a feature map. Feature maps allow category-valued features to be mapped to binary ones: if the value returned by a feature is in the named feature map, the value used is 1; otherwise it is 0. Example linear regression models are in `lib/data/f2b_lrf0.ch' for English and `lib/data/mht_lrf0.ch' for Japanese (JToBI). The following parameters are only used if the target_method is APL. Currently no mechanism is provided to automatically tune them.

topval
Size in Hertz above refval for maximum-sized accents (speaker-dependent).
baseval
Size in Hertz below refval for minimum-sized accents (speaker-dependent).
refval
The mid-value in Hertz. For most speakers this is best set to the speaker's mean F0.
h1
Factor which is multiplied by topval to position the step before H accents (it is unclear whether this is speaker-dependent).
l1
Factor which is multiplied by baseval to position the step before L accents (it is unclear whether this is speaker-dependent).
prom1
Factor which is multiplied by topval to position the top of H accents.
prom2
Factor which is multiplied by topval (or baseval) to position the top of !H accents, H accents in compound accents, and the H and L phrase accents.
prom3
Factor which is multiplied by topval (or baseval) to position the end of phrase accents.
HiF0_factor
Factor by which H* accents are increased when marked with HiF0 (default is 1.3).
decline_range
Value in Hertz of the total decline to be made over a phrase.
hamwin_size
Size in milliseconds of the smoothing window applied to the target points to produce the smoothed F0. This is typically around 240 to 400 ms.
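
Similarly, an APL configuration sets target_method to APL and supplies the tuning parameters listed above. The figures in this sketch are placeholders only and must be tuned by hand for the speaker

     ((target_method APL)
      (refval 120)             ; roughly the speaker's mean F0, in Hz
      (topval 60)              ; Hz above refval for the largest accents
      (baseval 40)             ; Hz below refval for the smallest accents
      (HiF0_factor 1.3)
      (decline_range 20)       ; total decline over a phrase, in Hz
      (hamwin_size 300))       ; smoothing window, in ms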

The actual method used in the implementation was strongly influenced by (incomplete) example code from AT&T Bell Labs, with significant input from Mary Beckman. Hence it follows their model (and parameter names) very closely. The APL technique is also described in Anderson 84.

The ToBI intonation method is selected by the following command

     (Parameter Int_Method ToBI) 

JToBI

JToBI is an implementation of the work described in Pierrehumbert 88b. The implementation was done in conjunction with Mary Beckman. The following parameters may be set through the variable mb_params.

Although many parameters are available for controlling the prediction of F0 target points, the same linear regression method used by the English ToBI system produces better results and, more importantly, can be trained. A linear regression model consists of three separate models predicting the start, mid-vowel and end target points for each syllable. A fourth item in the variable tobi_lrf0_model is the source mean F0 and standard deviation, to allow F0 pitch mapping between speakers. The format is exactly the same as that used for the English ToBI. A Japanese (JToBI) LR model example is given in `lib/data/mht_lrf0.ch'.
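
Purely as a hypothetical sketch of the shape such a value takes (the feature names, weights and exact layout here are invented; take the real format from `lib/data/f2b_lrf0.ch' or `lib/data/mht_lrf0.ch'), tobi_lrf0_model holds the three models followed by the optional source statistics

     (((Intercept 160.0) (stress  8.0) (tobi_accent 25.0 (H* L+H*)))   ; start
      ((Intercept 155.0) (stress 10.0) (tobi_accent 30.0 (H* L+H*)))   ; mid-vowel
      ((Intercept 150.0) (stress  6.0) (tobi_accent 20.0 (H* L+H*)))   ; end
      (170.0 30.0))                            ; source mean F0 and std dev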

The JToBI intonation method is selected by the following command

     (Parameter Int_Method JToBI) 

Fujisaki

An implementation of the Fujisaki model (see Fujisaki 83) is available (for Japanese). It is still experimental but does produce F0 contours. Parameters are set through the variable fujisaki_model. Details of the parameters and their values may be found by looking at the actual code in `src/intonation/fujisaki.c'.
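
For reference, the standard formulation in Fujisaki 83 generates log F0 as a base value plus superimposed phrase and accent components; how these quantities map onto the parameter names used in `fujisaki.c' should be checked against the code itself. In the usual notation,

     \ln F_0(t) = \ln F_b
                + \sum_i A_{p,i} \, G_p(t - T_{0,i})
                + \sum_j A_{a,j} \, [\, G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \,]

     G_p(t) = \alpha^2 t \, e^{-\alpha t}                          (t \ge 0, else 0)
     G_a(t) = \min[\, 1 - (1 + \beta t) e^{-\beta t},\ \gamma \,]  (t \ge 0, else 0)

where F_b is the base frequency, the A_{p,i} and A_{a,j} are the magnitudes of the phrase and accent commands, the T_{0,i} are the phrase command times, the T_{1,j} and T_{2,j} are the accent command onset and offset times, and \alpha, \beta and \gamma are constants of the phrase and accent control mechanisms (\gamma is usually around 0.9).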

The Fujisaki intonation method is selected by the following command

     (Parameter Int_Method Fujisaki) 

Tilt Theory

Based on the work described in Taylor 93b, this model offers a labeling system which may be automatically derived from waveforms or phoneme labels.

The Tilt intonation method is selected using the command

     (Parameter Int_Method Tilt) 

Adding a New Speaker

The most difficult part about adding a new speaker is labeling the data. Once the data is in the form that CHATR requires, everything else is simple.

CHATR requires a Syllable utterance type description for each utterance. This consists of a list of phrases, each with a start F0. Within each phrase is a list of syllables, each of which may have one or more events marked. An example is given below

     (Utterance  
        (Syllable (space rfc)(format feature)(dimen num)) (
         (:C ()
           ((hh  60) (eh  65)                        ((E)))
           ((l   33) (ow 207)                        ())
         )
         (:C ()
           ((dh  27) (ih  56)                        ((E)))
           ((s   75) (ih  56) (z   44)               ())
           ((dh  42) (ax  36)                        ())
           ((k   95) (aa 129) (n   44)               ((E)))
           ((f   77) (r   36) (en  57)               ())
           ((s   77) (ao 156)                        ())
           ((f   83) (eh 105) (s  203)               ())
        )
        ))

In this type of description, only the presence of an event need be marked.

In addition, an RFC input description is required. An example is given below

     (Utterance RFC(
     (sil    303     ( ( sil 0 166 ) ))
     (hh     60      ())
     (eh     65      ( ( fall 21 166 )))
     (l      33      ())
     (ow     207     ( ( conn 67 125 ) ( sil 197 120 )))
     (sil    155     ())
     (dh     27      ( ( rise 0 149 )))
     (ih     56      ())
     (s      75      ( ( fall 60 173 )))
     (ih     56      ())
     (z      44      ())
     (dh     42      ( ( conn 4 151 )))
     (ax     36      ())
     (k      95      ())
     (aa     129     ())
     (n      44      ( ( fall 5 142 )))
     (f      77      ())
     (r      36      ())
     (en     57      ())
     (s      77      ( ( conn 74 95 )))
     (ao     156     ())
     (f      83      ())
     (eh     105     ())
     (s      203     ( ( sil 91 91 )))
     (sil    524     ())
     ))

The CHATR user function train_input takes these two utterance descriptions and produces a syllable description in the RFC event space

     (Utterance  (Syllable (space rfc)(format num)(dimen linear)) (
     (:C ((Start 166))
        ((hh  60) (eh  65)                       
                      ((C   0.00) (E 0.00 0.00 -41.00 144.00 21.00)))
        ((l   33) (ow 207)                        ((C   0.00) ))
     )
     (:C ((Start 149))
        ((dh  27) (ih  56)                       
                      ((C   0.00) (E 24.00 143.00 -22.00 119.00 116.00)))
        ((s   75) (ih  56) (z   44)               ((C  -9.00) ))
        ((dh  42) (ax  36)                        ())
        ((k   95) (aa 129) (n   44)              
                      ((E 0.00 0.00 -47.00 283.00 134.00)))
        ((f   77) (r   36) (en  57)               ((C  -4.00) ))
        ((s   77) (ao 156)                        ())
        ((f   83) (eh 105) (s  203)               ())
     )
     ))

Next, the function Rfc_to_Tilt is called, which transforms this into tilt space. With a sufficient number of utterances in tilt space, statistics can be collected on each of the four tilt parameters and the phrase start F0 parameter. The means and standard deviations need to be calculated, which can be done using S or any other utility. The tilt descriptions can be derived from the utterance file. Alternatively, the Int_Stats function returns a list of all the events or connections (or both) for an utterance; by calling mapc, one can obtain all the statistics for a database.

Once the statistics have been collected, a speaker table can be constructed by entering the means and standard deviations in the appropriate places. A typical speaker file is given below

     (Stats Intonation (
             (Element E (def tilt E)(
                      (amp      = 47  Hz)
                      (dur      = 291 ms)
                      (tilt     = 0.0 rel)
                      (peak_pos = 59  ms)
                     ))
             (Element E (var tilt E)(
                      (amp      = 31  Hz)
                      (dur      = 141 ms)
                      (tilt     = 0.75 rel)
                      (peak_pos = 136 ms)
                     ))
             (Element C (def any C)(
                     (amp       = 0.0 Hz)
                     ))
             (Element C (var any C)(
                      (amp      = 10 Hz)
                     ))
             (Element P (def any P)(
                     (amp       = 151.0 Hz)
                     ))
             (Element P (var any P)(
                      (amp      = 20 Hz)
                     ))
     ))

Defining a New Feature Set

A little care needs to be taken here, as the system will accept inappropriate feature sets but become confused by them.

An example feature set is

     (Stats Intonation (
             (Feature rise (binary tilt C)(
                      (amp       +=  10 rel)
                      ))
             (Feature fall (binary tilt C)(
                      (amp       -=  10 rel)
                      ))
             (Feature amp (scalar tilt E)(
                      (amp       +=  1 rel)
                      (dur       +=  1 rel)
                      ))
             (Feature early (binary tilt E)(
                      (peak_pos -= 1.1 rel)
                      ))
             (Feature late (binary tilt E)(
                      (peak_pos += 1.1 rel)
                      ))
             (Feature rise (binary tilt E)(
                      (tilt      += +1 rel)
                      ))
             (Feature fall (binary tilt E)(
                      (tilt      += -1 rel)
                      ))
             )
     )

Feature headers are defined in the form

     <name> (<type> <space> <element>)

The name need only be unique to the space and element, so connection and event features can have the same name without confusion. type refers to scalar or binary. space refers to rfc or tilt, though presently only tilt is fully implemented. element refers to whether the feature should operate on an event or a connection.

Feature bodies are defined in the form

     (variable operator value dimension)

variable specifies which tilt variable is to be affected. operator should always be += or -=. Note that value is measured in standard deviations, so very large values are inadvisable. dimension is not presently used.
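
For example, a hypothetical emph feature for events, which raises the event amplitude by half a standard deviation, could be added to the feature set above as

             (Feature emph (binary tilt E)(
                      (amp      +=  0.5 rel)
                      ))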

