Waveform Synthesizers

The synthesis method is set using the Parameter command, which takes the argument Synth_Method followed by an atomic name identifying the synthesis type. Although calling this command alone changes the synthesis method, each method may require further setting up. This section describes those dependencies.
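
In general the command takes the form

     (Parameter Synth_Method METHODNAME)

where METHODNAME is one of the atomic names described in the sections below.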

A number of waveform synthesis methods are already available. The number should increase, and the current ones be improved. See section CHATR Commands, (Parameter Synth_Method) for a reasonably up-to-date list. That section is created from a source which can be found in file `~/src/chatr/commands.c' (and may even be up-to-date). Most modules are optional. To determine what is installed, look for the existence of the relevant database under `~/src/chatr/src/'.

Formant Synthesizer

This is selected using the command

     (Parameter Synth_Method FORMANT_SYN)

No other setup is required. This method uses a public domain version of a formant synthesizer based on techniques described in Allen 87. The output is surprisingly bad, perhaps partly due to a mismatch in phoneme names (phonemes are mapped ad hoc in this method), or perhaps due to actual bugs in the code (probably porting bugs). Whatever the cause, the result is difficult to understand.

LPC Diphone Synthesizer

This is selected using the command

     (Parameter Synth_Method ISARD)

Before this method may be used, it is necessary to load the diphone index and the LPC representation of the diphones themselves. This is achieved using the Load_Isard command, which takes two filename arguments: the first is the index file and the second is the diphone data. The given files are accessed via the library load-path, so if they exist in one of the directories listed in the value of the variable load-path, no absolute path names are required. In practice the two files will almost always reside in the same directory.

     (Load_Isard "../dbs/isard/diphlocs.txt" 
                 "../dbs/isard/engcdn.stf")

This synthesizer was originally written by Steve Isard of Edinburgh University. It should be stressed this is NOT the `CSTR synthesizer'. The diphones are LPC encoded, allowing easy modification of pitch and duration at concatenation time. The diphones are British English (RP) and hence will not always sound good with American English pronunciation.

PSOLA Diphone Synthesizer

A second diphone synthesizer is also included. It is the waveform synthesizer developed at the University of Edinburgh's CSTR. The system is designed to use a number of different diphone sets, though currently only one is available. It allows use of these sets in different encodings.

Before this method can be used it is necessary to load the diphones. The main function to do this is

     (Load_Taylor)

This function acts on the values of certain Lisp variables, which should be set beforehand. The variables it recognizes are

`T_Index_Name'
Full path name of diphone index.
`T_Dictionary_Name'
Full path name of diphone dictionary.
`T_Vox_Path'
Directory where waveform files are kept.
`T_Pm_Path'
Directory where pitchmark files are kept.
`T_Sample_Rate'
Sample rate of diphones in Hz.
`T_Diphone_Storage'
This may be either GROUPED, indicating all diphone waveforms and pitchmarks are compiled into a single dictionary, or SEPARATE, indicating there is one waveform and pitchmark file per nonsense word.
`T_Diphone_Type'
Diphone waveforms can be coded in many ways, usually dictated by memory requirements and/or availability. Note that many of these options are now redundant or have not been fully incorporated into the system.
WAVEFORM
16 bit PCM waveforms.
SHORTWAVEFORM
16 bit PCM waveforms (not tested).
FRAMES
Stored as separate frames in file (not tested).
LPC
Stored as LPC coeffs (not tested).
CODED_4
16 ==> 4 bit compression.
CODED_5
16 ==> 5 bit compression (not tested).
CODED_6
16 ==> 6 bit compression.
CODED_ALAW
16 ==> 8 bit compression.
PITCH_LPC
Stored as pitch sync LPC coeffs (not tested).
RES_LPC
Stored as pitch sync residual LPC (not tested).
`MAX_DIPHONES'
This sets the size of the internal cache of frequently used diphones. Note this refers to the number of decoded diphones. For a large machine, 100 - 500 is reasonable.
`AVAILABLE_DIPHONES'
This sets the number of coded diphones in RAM.

Not all of the above variables need to be set. An adequate setting is

     (set T_Dictionary_Name "/usr/pi/data/diphones/gw/group/gw.vox.diph")
     (set T_Index_Name "/usr/pi/data/diphones/gw/dictionary/diphdic.grp")
     (set T_Sample_Rate "20000")
     (set T_Diphone_Type "WAVEFORM")
     (set T_Diphone_Storage "GROUPED")
     (Load_Taylor)
     (Parameter Synth_Method TAYLOR) 

Unit Database Concatenative Synthesizer

This module is the most developed synthesis system within CHATR. Waveform synthesis is achieved by concatenating labeled units from a database of natural speech. Only general aspects are covered in this section, for a full description see section Unit Databases.

The UDB (Unit DataBase) module tries to deal with speech databases in a uniform abstract way. Once a database is described and loaded, it may be selected as the synthesis method using the command

     (Parameter Synth_Method UDB)

Unit selection strategy is usually set up at database definition time. It may be changed using the command

     (Database Set Strategy Simple)

There are several strategies available, though the two most usable are

Simple
Selects the longest phoneme match possible (ignoring all other criteria) in a left to right search.
Generic
Allows more detailed selection based on available database features.
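
For example, once a database has been described and loaded, the following selects it for synthesis and switches to the more detailed strategy:

     (Parameter Synth_Method UDB)
     (Database Set Strategy Generic)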

After selection, units may be concatenated by a number of methods. See section Unit Concatenation Methods, for details.

NUUTALK (Japanese) Synthesizer

An initial port of the non-uniform unit concatenative synthesizer developed previously at ATR is also included in this version of CHATR. The port is still a little buggy, but is beginning to be functional. Note that this synthesizer is completely different in code (though not so much in spirit) from the unit selection system described above. This system supports only Japanese, but different databases may be loaded.

Different databases may be selected at run time. Currently there are two available, one male (MHT) and one female (FKN). The high level NUUTALK module has MHT intonation statistics hardwired, so it does not synthesize a female voice with appropriate intonation, but it may be used to synthesize female voices from lower levels of input (e.g. `segF0').

The example speaker `nuu_mht' sets up synthesis for MHT using the NUUTALK system

     (speaker_nuu_mht)

A typical romaji input for this synthesis method is

     (Utterance Nuutalk
        ((Ninput  arayuru geNjituwo,
                  subete, jibuNnohouhe 
                  nejimagetanoda)))

More examples are available in files `lib/utterance/jpex**.utt'. (Substitute numbers for **.) File `jpex02.utt' is a `segF0'-input example of an original MHT spoken phrase. Files `jpex03.utt' and `jpex04.utt' contain FKN `segF0' examples. There may still be problems with the CHATR port of female speech, as FKN does not sound as good as MHT.

After selection, the same concatenation methods as for standard unit selection (see section Unit Concatenation Methods) can be used, namely NUUCEP, PS_PSOLA, DUMB, DUMB+ and NULL. These are set through the Parameter Concat_Method command.
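
For example, to select the NUUCEP concatenation method

     (Parameter Concat_Method NUUCEP)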

Some parameter settings affect the process as follows

cep_dist/vq_dist
In unit selection, two distance measures are possible. The first, selected by the Lisp command
     (set NT_cost_type 'cep_dist)
causes each candidate unit to be checked against possible matches with a distance measure for whole cepstrum vectors. This typically causes the system to be slow, as many cepstrum files must be accessed. The second option is selected using
     (set NT_cost_type 'vq_dist)
This uses a set of vector quantizations for the cepstrum vectors (actually MFCC). It allows a much faster selection process. So far not much experimentation has been done in this area, but even this attempt produces similar results in half the processing time. In order to use this, a vector quantization table must be included in the data at build time.
garbage collection
The cepstrum files are read into a cache for the duration of an utterance. Some control is given over when (and if) this cache should be flushed. The control is through the Lisp variable NT_cep_gc_strategy as in
     (set NT_cep_gc_strategy 'NONE)
     (set NT_cep_gc_strategy '500)
If set to NONE, the cache is never flushed, which will cause you to run out of space after some time. If set to a number, after that number of cepstrum files have been loaded the cache will be completely flushed at the end of the following utterance. If set to any other value, the cache is fully flushed at the end of each utterance. When running with vq_dist, the cep cache is mostly useless. Only a few files are actually read, so the default is perfectly adequate. When the cep_dist strategy is used, the cache becomes more useful but has to be pretty large (hundreds) before it has any effect.

Phrase by Phrase Synthesis

Although CHATR can synthesize an utterance in less time than it takes to say it, if the utterance contains 30 seconds of speech you still need to wait around 20 seconds before the first word is heard. As utterances are typically `sentences', their size can vary quite drastically, from a single word to a whole paragraph. A more practical method is to synthesize prosodic phrase by prosodic phrase rather than sentence by sentence. Prosodic phrases (assuming we can predict them adequately) do have an upper limit (based on the size of a speaker's lungs), so in general should not last for tens of seconds.

CHATR has an option (at waveform synthesis time) to synthesize the utterance in parts rather than as a whole, thus reducing the time until the first waveform is generated. It does not actually do this by prosodic phrase, but sections the utterance into parts separated by silence, as predicted by higher levels of the system. Moreover, the silences themselves may be generated by a number of options--this is because although it would be nice to select natural pauses from a database, in practice our databases do not contain a good distribution of natural inter-sentential pauses.

There are disadvantages, however. When using the DATLINK as standard audio output (as many researchers do), this technique fails to produce natural sounding speech. This is because the DATLINK always introduces a substantial pause between waveforms, typically over a second and often longer. This length of pause may be acceptable at utterance major phrase boundaries, but not at those of minor phrases. The second disadvantage is that although the main utterance is split into sub-utterances, the information that would normally be available in an utterance after synthesis--in particular the units selected and the unit/target costs--is not copied from the sub-utterances back into the main utterance. This second problem is less important, as phrase-by-phrase synthesis will normally only be used in time critical applications (such as text-to-speech), when investigating the details of the synthesis is not of interest.

Phrase by phrase synthesis is controlled through the parameter variable syn_params. As with other parameter variables, it takes a list of pairs as its value, each pair consisting of a parameter name followed by a parameter value; an example setting is given after the list. The parameter names are

phrase_by_phrase
If the value is non-nil, synthesis (by whatever method is selected in Synth_Method) will happen phrase by phrase. Default is `off'.
whole_wave
If the value is non-nil, all sub-utterances synthesized will have their waveforms copied back into a single whole wave. NOTE the Unit and Cand streams will not be filled. Default is `on'.
silence_method
If the value is zeros, the silences will be synthesized not by selecting units from the database, but by creating small waveforms of zeros.
hardware_silence
This is intended to inform CHATR how long a time a particular piece of audio hardware takes between waveforms. However, this parameter is currently ignored.
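
As a minimal sketch of such a setting (whether the variable is set with set or through the Parameter command, and the exact quoting of the pair list, are assumptions based on the other Lisp examples in this manual):

     (set syn_params
          '((phrase_by_phrase t)
            (silence_method zeros)))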

An example use of phrase by phrase synthesis is given in the Lisp function ntts defined in `lib/data/tts.ch'.

If synth_hook is set, the use of this method is a little more complicated: the functions defined there will be applied to the synthesized sub-utterances rather than to the whole utterance.

Filter Selection

After synthesis, the output waveform may be passed through a number of filters. One of the most common is a filter that changes the volume. When multiple speakers are used in the same session, different inherent volumes in their databases may make one speaker sound much quieter than another, so a volume change is desired.

There are other filters, including high and low pass.

The filter selection command is

     (Filter_Wave UTT FILTERNAME [optional arguments])

Calling Filter_Wave with no arguments gives a list of available filters and their arguments.
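
That is, the bare call

     (Filter_Wave)

reports the available filters and their arguments.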

For an example try

     (set utt1 (Utterance Text "Good morning"))
     (Synth utt1)
     (Say (Filter_Wave utt1 'Chorus))
     (Say (Filter_Wave utt1 'Backwards))

These filters destructively modify the waveform in their utterance argument.

Two utterances may also be combined using

     (set utt3 (Merge_Waves utt1 utt2))

Different sample rates are catered for automatically.

For volume control there is a specific function which will modify the volume between maximum and minimum

     (Regain_Wave utt1 '0.9)

Maximum volume is 1.0, minimum 0.0.

Waveforms may be changed to a different sample rate using the function

     (Resamp_Wave utt1 12000)

Note that all of these functions may be called on every utterance by using the synth_hook variable. If this contains a list of functions, they will be automatically applied to the utterance after waveform synthesis.
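
As a hypothetical sketch (my_fixup is an assumed user-defined function, not part of CHATR):

     (set synth_hook '(my_fixup))

Each function in the list would then be applied to the utterance after its waveform is synthesized; my_fixup might, for instance, call Regain_Wave on its argument.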

