

Making a Speech Synthesizer Database

This chapter runs through an example of building the necessary files CHATR requires in order to build a synthesizer based on a speech base. The process is long and requires much disk space and CPU time. Although it is mostly automatic, there are a number of stages where informed decisions need to be made. A familiarity with the operation will greatly aid you in successfully building a usable synthesis database.

Before proceeding further, a short explanation of some database-building specific terminology follows.

Each waveform file is identified by a short identifier, a fileid. This will typically be the name of the file minus any extension. For example, if the files are called

     sc001.wav
     sc002.wav
     sc003.wav
     ...

then the fileids are

     sc001
     sc002
     sc003
     ...
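The mapping from file names to fileids can be sketched with `basename' (a self-contained demo using empty dummy files):

```shell
# Demo setup: a wav/ directory containing dummy waveform files.
mkdir -p wav
touch wav/sc001.wav wav/sc002.wav
# A fileid is the file name minus its directory and extension.
for f in wav/*.wav; do
  basename "$f" .wav   # prints sc001, then sc002
done
```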

Directory Structure

The following are brief descriptions of the contents of sub-directories created while making a synthesis database.

`wav/'
The waveform files. These are unheadered, in native byte format, with `.wav' extensions.
`lab/'
Phoneme labels. These are in XWAVES label format. Their names must match the waveform file names (without extensions).
`db_utils/'
Shell scripts and programs used in creation of the following files. Starting versions should be copied from the CHATR src directories.
`stats/'
Unit statistics files. These files contain duration, mean pitch, mean power and mean voicing for each unit in the database.
`units/'
Unit descriptions containing all features to be used in a database.
`f0/'
F0 files.
`pm/'
Pitch mark files.
`cep/'
Cepstrum parameter files.
`vq/'
Various vector quantization setup files, and vector quantization for all files in the database.
`chatr/seg/'
CHATR representations of the utterances in the database. These are used for resynthesis tests.
`index/'
Where all the final generated files are gathered together and the eventual CHATR compiled index is made.

Preparation

This processing will eventually create a fully trained database with index files. To use the resulting database within CHATR, only one definition command needs to be executed. See section Defining a Speaker, for details. The construction process requires access to the following programming systems

Choose a short name for your database and create a directory for it somewhere. All files will be generated in that directory by default. Only one place in the ultimate database definition refers to this directory, so it may easily be moved afterwards.

In your newly created database directory, you will need

Copy the waveform and phoneme label files into the `wav/' and `lab/' directories respectively.

Copy all the files in `/DB/PI/chatr/db_utils/' to your own `db_utils/'. Then make all the files writable using

     chmod +w db_utils/*

These files are the scripts and binaries used to build a database. If all goes well you will not need to change any of these files, but then again you may have to.

Note you will need about 2.5 to 3 times the disk space of the `wav/' directory, plus space for the training distance files. These can be anywhere from 2Mb to 1.5Gb, depending on the size of the database. The amount is related to the square of the number of occurrences of each phoneme.

Create a list of all the waveform files in the database. Assuming only the waveform files are called `*.wav', you can use the following command

     ls wav/*.wav >files.all
     cp files.all files

In the following shell scripts, the file `files' is used to list all files to be processed. If (when?) things go wrong, `files' may be modified to isolate only the files you wish to process.

Three files need to be created before proceeding: `db_description', `index/DBNAME_synth.ch' and `index/DBNAME_train.ch'. Templates are available in `db_utils'.

Create the `db_description' file in the top directory. Copy the example from the `db_utils/' directory using the command

     cp db_utils/db_description db_description

Edit that file specifying the database name, the phoneme set, and the sample rate for the database. You may also wish to change the values in the second section depending on your environment.

It may be useful to modify the variable GET_F0_PARAMS. This specifies the minimum and maximum expected F0 for the speaker. It helps to set the range to the likely limits for a particular speaker database, so the defaults may not necessarily be suitable. This is especially so for male speakers, where the lowest value may be below the default minimum.
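The exact option syntax depends on the pitch extraction tool invoked by the scripts, so the following is a purely hypothetical sketch of what such a setting might look like for a low-pitched male speaker (the option names and values are illustrative, not real defaults):

```shell
# Hypothetical illustration only -- check the option names your pitch
# extraction tool actually accepts before using anything like this.
GET_F0_PARAMS="-min_f0 50 -max_f0 250"
```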

The two other configuration files must be created from templates in the `db_utils/' directory before commencing database make. See section Database Parameters File, for details about the `index/DBNAME_synth.ch' file. See section Training Setup, for details about the `index/DBNAME_train.ch' file.

It is highly recommended to thoroughly check the database before proceeding further. See section Checking the Database, for details.

Checking the Database

Databases are large and therefore very likely to contain errors. A number of specific tests are provided to try to detect the most likely problems. You are strongly advised to run these tests and study the results. Remember, database errors are probably the most common cause of bad synthesis.

The first test is

     db_utils/check_labs

This function checks the unit labels, identifying the number of occurrences of each, and finding any with unusually short durations.
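The short-duration part of this check can be sketched with awk, assuming XWAVES (xlabel) data lines of the form `<end-time> <color> <label>' following the `#' header line; the label file below is fabricated for the demo:

```shell
# Demo setup: a fabricated xlabel-style file containing one 5 ms unit.
mkdir -p lab
printf 'signal demo\n#\n0.100 121 a\n0.105 121 b\n0.300 121 c\n' > lab/demo.lab
# Flag units shorter than 10 ms; a unit's duration is the difference
# between its end time and the previous end time.
awk 'body == 1 { if ($1 - prev < 0.010) print "short unit " $3; prev = $1 }
     /^#/      { body = 1; prev = 0 }' lab/demo.lab   # prints: short unit b
```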

Do fix any problems before continuing.

The second test is

     db_utils/check_phoneset

This function checks that the labels are all in the defined phoneme set. This only works when using a CHATR standard phoneme set. See section Defining a New Phoneme Set, if you are defining your own phoneme set. Again, do fix any problems before continuing.

The third test is

     db_utils/check_align

This checks the location of labels within waveforms. It is intended to detect offsets in label files, mismatches of label files to waveforms, and possibly mistaken sample rate.

Once more, do fix any problems before continuing.

The fourth test is

     db_utils/check_labwav FILEID

Strictly speaking this is not actually a test; rather, it uses XWAVES to display an example waveform and label file as described in the database. Check that the labels match the waveform, and that the waveform itself is in the right byte order. You should check all files using this method, but of course, you checked the database before you started this process, didn't you?

At least check three files--one near the beginning, one in the middle, and one near the end.

Before any processing occurs it is also wise to check that the waveform files in the database are of similar quality. Different waveform files often have quite different mean power, and it may be useful to normalize them. The CHATR function to do this is

     db_utils/normal_power

It may also be necessary to exclude some waveform files because of extraneous noise -- background music, etc. In that case, remove the waveform file name from the `files.all' file and re-copy `files.all' to `files'.
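For example, assuming a noisy file with fileid sc042 (a made-up id for this demo):

```shell
# Demo setup: a files.all listing in which sc042 is the noisy file.
printf 'wav/sc041.wav\nwav/sc042.wav\nwav/sc043.wav\n' > files.all
# Drop the noisy file and rebuild the working `files' list.
grep -v 'sc042' files.all > files
cat files   # prints wav/sc041.wav and wav/sc043.wav
```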

Database Parameters File

All parameters in a database are described in the file `DBNAME_synth.ch'. This file describes both the database itself, plus other aspects of the voice, such as lexicon, intonation and duration parameters. There are many many options in CHATR, and it is not always easy to know which subsystems have parameters, let alone what values they should take.

Create a file called `DBNAME_synth.ch' in the directory `index/', where `DBNAME' is the name of the database you are building. A template example can be found in `db_utils/DBNAME_synth.ch'. This should be copied to the `index/' directory with the appropriate name.

This file must be edited.

Every occurrence of `<>' marks a part that requires specific attention.

The file can be viewed in two parts, initialization and selection. When loaded, this file should initialize and load all necessary parameters for the use of this database as a CHATR speaker. The function speaker_DBNAME (at the end of this file) should, when called, select the database and auxiliary parameters actually required. The idea is that users will call that function when changing between alternate speakers.

In general, change occurrences of DBNAME to the database name, PHONESET to the phoneme set name, and DICTNAME to the appropriate dictionary name.

Let us go through the main parts. Note that the directory where the data resides is defined when the function defspeaker is called; DBNAME_data_dir will be set before the file `DBNAME_synth.ch' is loaded.

First, decide on the phoneme set you wish to use and ensure it is loaded. If the phoneme set is a standard set (i.e. `radio2', `mrpa', `BEEP', or `nuuph') you may simply require the definition file. If not, you must define your phoneme set. Note all unit names in the database must be a member of this phoneme set. A commented-out example of loading a definition specific to a database is available. See section Defining a New Phoneme Set for more details.

The main database declaration is next. It defines the name of the index file, and the format of the waveform, pitch mark and cepstrum files. If you use the standard database as your platform, the example names will be satisfactory--but remember to remove the `<>' marks. This section also defines the wave file type and sample rate (you must set this), as well as the phoneme set.

The Silence definition is used in unit selection as a context for units which come at the end (or start) of a file. Ensure that the example silence entry has a reasonable value for all fields that exist in your database. It is assumed that there is an effective silence before and after each file in the database--although good database design should ensure there are actually some silence units in the waveform as well. Note that the phone used to represent silence must be declared in the phoneme set as a non-vowel and of consonant type 0 in the phoneme definition.

The next section of `DBNAME_synth.ch' defines which distance functions are to be used (and trained). See section Training Setup, for the details.

The next significant section is defining nus_DBNAME_params, which defines some general parameters for the unit selection process. Their current values are probably acceptable, although the beam and candidate widths could possibly be reduced. See the Variables appendix for the details of their values.

The next section is not executed during training mode, as it is then that the files it loads are generated. The weight sets to be loaded should be selected depending on pruning choice. See section Training Setup, for the details.

Setting up the lexicon typically requires two steps: definition and selection. A number of lexicons are already built into CHATR

cmu
An American English lexicon, based on the CMU lexicon (0.1) containing about 100,000 words. It also uses the US Naval Research letter to sound rules for words not explicitly listed.
beep
A British English lexicon, based on the BEEP lexicon (0.5) containing about 160,000 words. It also uses the US Naval Research letter to sound rules for words not explicitly listed.
mrpa
A British English lexicon, developed at CSTR containing about 23,000 entries. It also uses the US Naval Research letter to sound rules for words not explicitly listed.
japanese
A lexicon containing no explicit words at all; it depends completely on a set of letter to phoneme rules for changing romaji into nuuph phonemes.

For these lexicons there are set-up functions defined in `lib/data/lexicons.ch'. Remove the comment characters from the line for the lexicon you wish to use. For more details on building your own lexicons see section Lexicon.

For duration set-up a number of choices exist. A linear regression model can be used which is trainable for multiple languages. Examples built from f2b (American English) and MHT (Japanese) exist and can be used with other databases (of the same language). A neural net based model is also available but does not train well, though the example included from f2b is acceptable. For Japanese, a built in model exists. It has no external parameters and requires no set up. See section Duration, for more details, including training new duration models. Specifically for an example of tuning a linear regression model, see section Training a New Duration Model.

The second part of the duration definition defines pause durations at phrase boundaries. An appropriate definition must be loaded.

There are a number of intonation systems built into CHATR. The most stable ones are based around ToBI, and hence ToBI is recommended for building working speech synthesis voices. The appropriate parameters should be set for that speaker. See the appendix on variables for the appropriate values of ToBI_params and mb_params. For more details on building intonation models see section Intonation. One stable method for predicting F0 from ToBI labels is a model using linear regression. Two models have been included with the system, one for English (from f2b) and one for Japanese (from MHT). These models can be mapped to other speakers' F0 ranges given the target speaker's F0 mean and standard deviation. These speaker-specific parameters are generated by the script make_tobif0_params during the building of a database.

The final part of this file defines a function that when called will select the appropriate parameters to cause the synthesizer to use that voice. For efficiency you should try to ensure everything is loaded into CHATR and this function need only set some variables. Again follow the comments and modify (comment out/uncomment) the sections appropriate to the database you are building.

Note that if you set any other variables in your speaker_DBNAME function you have to ensure that the values are reset when the synthesizer switches to another speaker. In order to do this, without editing all other speaker synth files, you can redefine the speaker_reset function. See section Speaker Reset Function, for an example.

Training Setup

This section describes the setup for the automatic training for unit selection. Two set-up files need modification. The first is `index/DBNAME_synth.ch'. See section Database Parameters File, for a full description of this file -- here we will discuss the points that specifically affect training. The second file is `index/DBNAME_train.ch', which describes the training parameters themselves.

First ensure that the declaration of the coefficients files is correct.

     (Database Set CoefFileSkeleton (strcat DBNAME_data_dir
"cep/%s.mcep3f02"))
     (Database Set CoefType HTK)

A silence entry is necessary for this method of training so that distances can still be taken for contexts even when selected units lie at the end or start of files. Check that a silence entry exists and has the right number (and type) of fields.

Make sure the appropriate cepstrum control parameters are loaded, and then selected. This should be done within the speaker_DBNAME function.

     (set cep_dist_params DBNAME_cep_dist_params)

Define clusters of phonemes which will share Discrete distance functions and weights for all distance functions. Groups with similar articulatory characteristics work well. Note the names and number of clusters are arbitrary. For example, in the nuuph (Japanese) phoneme set, a possible clustering is

     (set DBNAME_PhoneSets '(
          (nasal_n (n ny))
          (nasal_m (m my))
          (nasal_N (N))
          (bilabial (b by p pp py ppy))
          (alveolar (d dy dd t tt ts tts ch cch j))
          (velar (g gy k kk ky kky))
          (fricative1 (ff f h hy))
          (fricative2 (sh ssh s ss z))
          (glide (r ry w y))
          (a (a))
          (e (e))
          (i (i))
          (o (o))
          (u (u))
          (PAU (PAU))
        ))

This is of course phoneme-set dependent. Note that all unit names in the database must be in at least one class. It is important that the groups have a reasonable number of members. If there are too few, then training will not be possible; likewise, if there are too many occurrences within a group, it may require too much disk space (and swap space) to calculate.
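To judge whether each proposed cluster will have enough members, the label occurrences can be counted directly from the label files (again assuming xlabel-style `<end-time> <color> <label>' data lines; the two files here are fabricated, and in practice you would run over `lab/*.lab'):

```shell
# Demo setup: two fabricated xlabel-style label files.
mkdir -p lab
printf 'signal one\n#\n0.1 121 a\n0.2 121 b\n' > lab/one.lab
printf 'signal two\n#\n0.1 121 a\n' > lab/two.lab
# Count occurrences of each label across the files; very small
# counts suggest a cluster will be too sparse to train.
awk 'FNR == 1  { body = 0 }
     body == 1 { count[$3]++ }
     /^#/      { body = 1 }
     END { for (p in count) print count[p], p }' lab/one.lab lab/two.lab |
  sort -rn   # prints: 2 a, then 1 b
```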

The distance functions fall into two classes: shared and table. Table distance functions are discrete fields which will be trained in the phonetic groups defined above. Shared distance functions are in general continuous. The default set of shared distance definitions is

     (set DBNAME_SharedDFs
        '(
          (p_phone_ident -1 phone   ident eql)
          (n_phone_ident  1 phone   ident eql)
          (duration       0 dur_z   ident abs)
          (pitch          0 pitch_z ident abs)
          (p_pitch       -1 pitch_z ident abs)
          (n_pitch        1 pitch_z ident abs)
          ))

While the default set of table distance functions is

     (set DBNAME_TableDFs
       '(
         (p_vc      -1 phone ph_vc      2)
         (p_height  -1 phone ph_height  4)
         (p_length  -1 phone ph_length  6)
         (p_front   -1 phone ph_front   4)
         (p_v_rnd   -1 phone ph_v_rnd   2)
         (p_c_type  -1 phone ph_c_type  7)
         (p_c_place -1 phone ph_c_place 7)
         (p_c_vox   -1 phone ph_c_vox   2)
 
         (n_vc       1 phone ph_vc      2)
         (n_height   1 phone ph_height  4)
         (n_length   1 phone ph_length  6)
         (n_front    1 phone ph_front   4)
         (n_v_rnd    1 phone ph_v_rnd   2)
         (n_c_type   1 phone ph_c_type  7)
         (n_c_place  1 phone ph_c_place 7)
         (n_c_vox    1 phone ph_c_vox   2)
         ))

These lists are automatically expanded into actual distance function definitions. See section Distance Functions, for a full description. Their fields are: distance name, offset (-1 previous phone, 0 current, 1 next phone), the field name in the database to apply it to, the mapping function (ident, log or map name for table functions), and the distance type to use (or table size).

Further down the `DBNAME_synth.ch' file you should check the parameters in the variable nus_DBNAME_params. Two values are relevant to training (but ignored during normal synthesis).

     (dur_penalty 1.0)
     (endpoint_weight 0.0)

Training produces a number of files which need to be loaded at synthesis time but not during training. In the example `DBNAME_synth.ch' file, a set of commands are only executed in non-training mode. But you need to select which set of weights to include when this file is actually used. The first one is mandatory.

     (set DBNAME_DiscTables (load (strcat DBNAME_data_dir 
                                   "index/DiscreteTables.ch")))

If no pruning is to be done, only the 0 level weights are needed.

     (set DBNAME_Weights (load (strcat DBNAME_data_dir 
                                   "index/weights0.ch")))

If pruning is required, instead load the weights appropriate for the pruning level you desire. See section Pruning, for more information.

     (set DBNAME_Weights (load (strcat DBNAME_data_dir 
                                   "index/weights2.ch")))

Note: only load the level that is required (probably weights0).

The next functions set up the trained weights for synthesis.

     (SetTableDFs DBNAME_PhoneSets DBNAME_TableDFs DBNAME_DiscTables)
     (SetSharedDFs DBNAME_SharedDFs)
     (Database Set Weights DBNAME_Weights)

The second file you must set up before training is `index/DBNAME_train.ch'. A template of this file is included in `db_utils/DBNAME_train.ch'. A number of configuration parameters exist within that file which you should consider. First, all occurrences of DBNAME should be replaced with your database name. General comments about other configuration issues are given throughout the script.

The configuration of the udb_train_params is primarily a research issue and beyond the scope of this manual.

Training of a database is a computationally expensive process. It can take from 20 minutes for a small database (e.g. gsw200 with 14 minutes of speech) to over 10 hours (e.g. f3a with 2.5 hours of speech). The most CPU-intensive process is the calculation of the acoustic distance tables (or phoneme tables). These tables are calculated in the first major training step. The DISTFILE_FILEBASE variable defines where the copies of the tables will be stored on disk (by default `dist/DBNAME_'). You should ensure that there is plenty of free space in that partition. The disk space requirement increases roughly with the square of the number of units in the database: e.g.

     gsw: approximately 8700  units ==> 12.7Mb
     f2b: approximately 41000 units ==> 243Mb
     f3a: approximately 97200 units ==> 1300Mb
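As a rough sanity check of the square law against the figures above: f2b has about 4.7 times as many units as gsw, so its distance files should be roughly 4.7 squared, i.e. about 22 times larger; the observed ratio (243Mb / 12.7Mb, about 19) is in that ballpark.

```shell
# Predicted f2b/gsw distance-file size ratio under the square law.
echo $(( (41000 * 41000) / (8700 * 8700) ))   # prints 22
```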

Once the data is stored on disk, it can be reloaded quickly to speed up multiple training runs and the multiple stages in the training script given above.

Note: do NOT use the `/tmp' directory--it is not big enough.

If the clean_up parameter is set in the udb_train_params LISP variable, then the memory copy of the distance tables will be deleted after each time it is used. When the distance table is next required it can be calculated again from scratch (very slow), or loaded from the disk copy (strongly recommended). If the clean_up variable is not set, then the training procedure will keep a copy of all the distance tables in memory. The internal distance tables are twice the size of those stored on disk (e.g. 2.6Gb for `Ef3a'), so you may need lots of swap space. Except for the smallest databases, the clean_up parameter should be set.

If no training is possible for some reason, a weights file should still be created to name the distance functions that are to be used. Reasonable guesses for weights are possible. The format of the weights is a list of weights for each phoneme class. Each entry consists of a single phone or list of phones in the class, followed by a list of distance-function and weight pairs. A special phone named any may be used to cover all phonemes not otherwise specified. One suitable default weights file might contain

     (quote
       ((any
        (p_phone_ident 0.3)
        (n_phone_ident 0.3)
        (duration 0.5)
        (pitch 1.0)
        (p_pitch 0.5)
     )))

Making the Database

The script db_utils/make_db shows the main subprocesses involved in building a database. If everything is set up properly, this script will build a fully trained database. It is best called (in BASH or SH) by

     db_utils/make_db >make.log 2>&1

However, there are usually problems and it will be necessary to go through each stage by hand, especially the first time a database is built. This section describes each of these steps and problems that may occur during those stages. In general the order is significant unless otherwise explicitly stated.

First make all the directories that are used in the process.

     db_utils/make_alldirs

Although this creates a directory called `dist/' to contain the unit distances used in training, you may want to change this to a symbolic link pointing to another partition with lots of free space.
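For example, `dist/' can be replaced by a symbolic link (the target directory here is only a local stand-in for a large partition):

```shell
# Replace an empty dist/ directory with a symlink to somewhere roomier.
mkdir -p bigdisk_demo        # stand-in for a partition with free space
mkdir -p dist && rmdir dist  # ensure the empty dist/ is removed first
ln -s bigdisk_demo dist      # training output now lands on the big partition
```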

The next stage is to do the basic signal processing of the database: pitch extraction and mel-cepstrum parameter (MFCC) calculation. These could be run in parallel on different machines. Use the commands

     db_utils/make_melcep
     db_utils/make_f0s

Vector quantization of the MFCC parameters, pitch and power is generated in 10ms frames for the whole database. Use the command

     db_utils/make_acoustic_params

Pitch marks for each file are generated by the following script

     db_utils/make_pitchmarks

Depending on the method used for generating pitch marks (fz_track or other), this may be run in parallel with the creation of the F0 and MFCC files. Note that warnings of the form `No peak found: N N' may be generated, but they can be disregarded. More important is that fz_track may crash if the pitch of the waveform being tracked moves outside the specified range. This can happen particularly with male speech, where the default is 70Hz to 228Hz. It is uncommon but not impossible for the pitch to go as low as 30Hz. You should specify an operations file appropriate for the speaker.

The MFCC files are merged with F0 information for training with the script

     db_utils/make_traincep

Next, the label files may be processed to produce unit description files, that is, one line per unit with all fields specified.

     db_utils/make_units

If new fields are to be added to a database, they should be added to the files in the `units/' directory at this point. See section Adding a New Feature to a Database for details.

If the ToBI F0 prediction by linear regression is desired, but a full training is not possible (i.e. your database does not have ToBI labels), mapping parameters are required. The following script generates those parameters for loading in the `DBNAME_synth.ch' file.

     db_utils/make_tobif0_params

If the linear regression model is to be used to predict durations, the following script will create the parameters to map the model durations to the target's duration range.

     db_utils/make_lrdurstats

Now that we have the unit descriptions, a CHATR representation of them can be made using

     db_utils/make_unitindex

Now that all the information has been collected together, a binary representation of the full database index, pitch marks, acoustic parameters etc. may be created using

     db_utils/make_indexout

For the testing of a database with natural targets, a CHATR representation of each utterance is required for the test_seg function. This is done using

     db_utils/make_segs

The final stage is the training of weights for unit selection. This requires that both the `index/DBNAME_synth.ch' and `index/DBNAME_train.ch' files be created and edited. Training can take some time and may use a lot of disk space. The time to train is related to the square of the size of the database, so the bigger the database, the longer it takes to train. For example, the gsw 200-sentence English database takes about 20 minutes to train, while the Japanese 503-sentence database takes about 6 hours.

     db_utils/make_training

Note: If training fails during the making of the distance tables, you should delete the last made table from `dist/'. It may be incomplete and hence reloading it later will cause an error.

A fully trained and described database should now exist. Before it can be used by CHATR it must be defined. See section Defining a Speaker, for details.

To use this newly created database, call the function

     (speaker_DBNAME)

This will autoload your `DBNAME_synth.ch' file and execute the speaker_DBNAME function defined in that file.

Initial tests of the database are best made using natural targets. After defining the speaker in CHATR, you can test it with a command like

     (Say (test_seg "fileid1"))

where fileid1 is a fileid from your newly created database.

Once a database is proven to be stable, its defspeaker definition may be added to the file `lib/data/itlspeakers.ch' in the CHATR distribution so others may use it. Initial tests should be done directly in a user's own installation of CHATR (i.e. from your `.chatrrc', or directly at the command line).

Minor Customization

Defining a New Phoneme Set

It may be desired to define a new phoneme set particular to a new database. This has been considered and some support is given. First you must create a CHATR file in the `index/' directory defining the phoneme set, called `PHONESET_def.ch'. See section Phoneme Sets, for details about how to define a phoneme set.

The desired phoneme set must be loaded in `DBNAME_synth.ch'. A line, commented out, shows the format.

Note that when a new phoneme set is used, that database will not work directly with the higher levels of the system. A new lexicon, and possibly new intonation and duration modules, will be required, especially if this is a new language. Of course, natural target resynthesis will work without any of these higher levels; in that case, simply do not define any lexicon, intonation or duration in the `DBNAME_synth.ch' file.

Alternatively, a phoneme map may be defined between an existing phoneme set and the new phoneme set. The Phoneme Internal set can be an existing one, and a mapping will occur automatically. Although this will work, the mapping system is probably not powerful enough to get the best results, so it should only be used as an intermediate step.

Pruning

This method of building a speech synthesis database allows for the pruning of units from the database which are found to be unpredictable. There are two reasons for pruning: first, to reduce the size of the database so synthesis will be faster; and second, to remove units whose properties do not reflect the features they are labeled with. Pruning is still very much in its initial stages; this area deserves much more work before it can improve databases as much as we feel is possible.

The training algorithm provides options for levels of pruning. See the setting of train_level near the top of `DBNAME_synth.ch'. Setting the variable to non-nil will cause training to perform levels of pruning. Pruning parameters are set in the variable udb_train_params, further down the training file.

Once a set of units to be pruned is generated (they will be saved in `index/DBNAME_prune*.ch'), the index must be rebuilt without the pruned units. This is done via the following command

     db_utils/make_pruning LEVEL

Note that it is only the index from which the pruned units are removed; the actual entries themselves still exist within the database, but will never be selected. They must remain because their neighbors may require information about their context, and hence refer to these pruned units.

More serious pruning, e.g. removal of whole bad files, should really be done before CHATR processes the data.

Pruning does not happen by default when building databases, as we currently feel the advantage from it is minimal; more experimentation is really required.

Changing Format of Waveform Files

It is possible to reduce the size of a database significantly by resampling the waveform files. For example, changing the waveform files from a 16kHz sampling, 16-bit linear database to 8K ulaw will take only one quarter of the space of the original. If the eventual output is to be played on a low-quality audio system (e.g. Sun's /dev/audio), very little loss in quality will occur. Likewise, if a higher sample rate version is available, you could use that.

The format of the waveform files may be changed without recompilation of any part of the database index. All information is time based not sample based (even pitch mark files).

Given the example template of `DBNAME_synth.ch' for a 16kHz, 16-bit linear waveform, we would have a declaration like

     (Database Set WaveFileType raw)
     (Database Set WaveSampleRate 16000)
     (Database Set WaveEncoding lin16MSB)
     (Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.wav"))

To change to a database of 8K ulaw, first convert all the files in `wav/' to 8K ulaw (using some external program, or using CHATR). Then change the above lines in `DBNAME_synth.ch' to

     (Database Set WaveFileType raw)    ;; i.e. unheadered
     (Database Set WaveSampleRate 8000)
     (Database Set WaveEncoding ulaw)
     (Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.au"))
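As a sketch of the conversion step itself, one commonly available external tool is sox; the options below are for modern sox versions and should be checked against your installation. The demo generates a dummy input file; in practice you would loop over the fileids listed in `files'.

```shell
command -v sox >/dev/null || exit 0  # skip the demo if sox is unavailable
mkdir -p wav
# Demo input: 0.1 s of silence as 16kHz 16-bit big-endian raw samples.
dd if=/dev/zero of=wav/demo.wav bs=2 count=1600 2>/dev/null
# 16kHz 16-bit big-endian linear raw  ->  8kHz u-law raw.
sox -t raw -r 16000 -e signed -b 16 -B wav/demo.wav \
    -t raw -r 8000 -e u-law wav/demo.au
```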

See the description of the command Database in the commands appendix for details of the formats supported.

Speaker Reset Function

All speaker functions defined in `DBNAME_synth.ch' call the function speaker_reset. Currently that function is defined but does nothing. Its purpose is to reset any variables set for a particular speaker before another speaker is selected. Of course, all of the speakers' description files could be edited, but that would be a lot of work. Instead, you should change the speaker_reset function defined in `lib/data/speakutils.ch'.

If you don't have access to that file or don't wish to modify it, you can get the same effect by redefining the function speaker_reset in your own `DBNAME_synth.ch' file. Since someone else may already have done the same, the following method is recommended: define a new version of speaker_reset that calls the existing definition and then performs your own resets. If everyone uses this technique, resets will happen properly.

Suppose your new speaker `zaphod' requires the variable spareheads to be set to one, but that needs to be nil for all other speakers. In `zaphod_synth.ch', after the definition of speaker_zaphod (which sets spareheads to one), you should add

     (set zaphod_previous_speaker_reset speaker_reset)

     (define speaker_reset ()
        "New speaker reset that calls the previous definition and
     then clears zaphod-specific settings"
     (zaphod_previous_speaker_reset) ;; the previously defined speaker_reset
     (set spareheads nil)
     )

Adding a New Feature to a Database

A common requirement is the addition of a new feature to an existing database. This is most common within our own research group, where we wish to test the suitability of some new feature in the selection process. This section walks through what you have to change in an existing database to achieve this.

A new field may be added, trained and tested without any change to the CHATR C source code. However, if this field is to be added to the full synthesis process, you must of course modify the C source code in order to be able to predict this field.

The first stage is to generate the values of the new field(s) for each unit in the database. Unfortunately this is not quite as easy as it sounds: you must ensure that the generated fields align with the unit labels in the `lab/*.lab' files. Take adequate time to ensure this is the case.

Note that the following process is destructive: it modifies an existing database in place. Only the files in the directories `units/', `chatr/seg/' and `index/' are modified, so a set of shadow links can be set up if desired. You should of course not experiment with a database that others may currently be using.

First create files in `units/', one for each fileid in the system, using a new file extension of your choice. The files may contain more than one new field. These fields can be appended to the end of the existing units files in that directory with the command

     db_utils/add_newfields <newfield_fileextension>

This will modify all `.units' files in that directory appending the new fields.
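As a purely illustrative sketch (the extension `tobi' and the values shown are assumptions, not a fixed format), a new-field file `units/sc001.tobi' for the ToBI fields used below would contain one line per unit in `units/sc001.units', with the field values whitespace separated:

```
NONE NONE
H*   NONE
NONE L-L%
L*+H H-H%
```

Running `db_utils/add_newfields tobi' would then append these columns to the end of each `.units' file.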

Now create the file `index/DBNAME_extrafields' containing the field declarations for the new fields you wish to add to the database. Fields can be floats, ints, or categories. For example, if two new fields are added, one for ToBI accents and one for ToBI ending tones, the file `DBNAME_extrafields' may look like

     (tobi_accent (NONE H* !H* L+H* L+!H* L* L*+H OTHER))
     (tobi_tone (NONE L-L% L-H% H- L- H-H% OTHER))

Note it is necessary (for a later shell script) to have leading spaces on the above lines.

Now a new index can be created with those new fields.

     db_utils/make_unitindex

This is then compiled by

     db_utils/make_indexout

If problems occur in making this index you will need to fix them before continuing.

Next, a CHATR utterance representation of the database entries should be created (i.e. for use with test_seg). The format of these files includes all fields in a database entry, even if there is currently no way to predict a field's value during text to speech.

     db_utils/make_segs

You will also need to amend the silence entry definition in `DBNAME_synth.ch' to give values for the new fields you have created. For example

     (Database Set Silence 
        ("pau" 0 67 0.0 120 0.0 0.210 0.0 5.369 0.0 NONE NONE 0))

Note that your new fields are placed immediately before the final field.

Training of the new fields is also automatic. You need to edit `index/DBNAME_synth.ch' to define new distance functions for the new fields (and possibly delete existing distance functions you no longer want). Note that a database may contain more fields than are actually used in selection; therefore, when comparing competing fields, the same compiled index may be used and only the training (and hence the weights files) need change.

For full details of distance functions see section Distance Functions. Here we will only deal with a limited form of customization. There are two major classes of distance functions: continuous (float or int) and categorical. These are trained differently. New continuous distance functions should be listed in the variable DBNAME_SharedDFs, while categorical distance functions are listed in DBNAME_TableDFs. These lists are expanded automatically into full distance function definitions during training.

A continuous listing consists of 5 fields

distance name
Must be unique.
position offset
-1 means previous, 0 means current and 1 means next unit.
field name
The field name (for new fields it is the name introduced in DBNAME_extrafields).
mapping
The mapping function used. For continuous fields this can currently be ident, i.e. no mapping, or log, for logarithm.
difference measure
This defines the function used to give the distance between the target and database unit fields. eql returns 0 if the two values are equal and 1 otherwise (this is only reasonable for int valued fields); abs means the absolute difference between the two values; and sqr means the squared difference.
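As a hypothetical illustration (the field name dur and the distance name dur_dist are assumptions, not part of this walkthrough), a continuous listing for a log-scaled duration field on the current unit, using the squared difference, would look like

```lisp
(dur_dist   0 dur log sqr)
```

Such a line would be added to the variable DBNAME_SharedDFs rather than DBNAME_TableDFs.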

A categorical listing also consists of 5 fields

distance name
Must be unique.
position offset
-1 means previous, 0 means current and 1 means next unit.
field name
The field name (for new fields it is the name introduced in DBNAME_extrafields).
mapping
The mapping function used. For new fields this is probably ident, but if some further quantization is desired it may be achieved by defining a new Discrete and Map. See section Discretes and Maps.
size
The number of members in the category. If the mapping is ident, this is the number of members in the field declaration. If the mapping is something other than ident, this is the number of items in the category being mapped to.

Note that categorical distance functions are trained in phone groups. This will rarely be wrong, but may sometimes give more differentiation than is necessary.

In our example we have two new fields we wish to train. Both are category fields, so we add their descriptions to the variable DBNAME_TableDFs. The additions would look like

     (tobi_accent   0 tobi_accent ident 8)
     (tobi_tone     0 tobi_tone ident 7)

That is (for the first line): the new distance function is called tobi_accent, it applies to the current phone (0), it uses the field name tobi_accent with no mapping (ident), and the field has 8 members.

If we wished to have a distance function on not only the current phone but also on the context, we could add distance functions that include the tobi accents of the left and right context

     (p_tobi_accent   -1 tobi_accent ident 8)
     (n_tobi_accent    1 tobi_accent ident 8)

Only the distance name and the offset field change; the field name, of course, remains the same.

Once the new distance measures have been defined, you can train the new weights. The standard database script can be used

     db_utils/make_training

The distance measures calculated in `dist/' are by default reused if they exist, so keeping them is useful when training on different fields, as it makes retraining much faster. Note that if you change the phoneme set you must delete the old distance files and re-create them.

After training you should check the training log file in `index/DBNAME_train.log' to see the contribution of the new fields you have introduced.

Modifying CHATR to Predict a New Field

Once you decide that a new field is worth predicting, you will need to modify CHATR to actually predict it. All target fields are generated using functions in the file `src/udb/udb_targfuncs.c'. These functions return a Lisp cell (to deal generically with the appropriate type: float, int or categorical) from a segment stream cell. An entry should be added to the table df_targ_val_name2func relating the new fieldname to a function. The function may simply access a field in the segment stream cell (or one related to it), or do some calculation. It may be that a feature function already exists to generate the appropriate value, in which case it may simply be called through a thin wrapper function (cf. udb_tf_sylpos).

Objective Distance Measure

The object of training is to find the weighting that minimizes the distance of the selected units from the original. We do not yet know the ideal distance measure: ideally it would be a signal processing measure that directly reflects human perception of good and bad synthesis. However, approximations of such a measure are possible, and CHATR supports a mechanism for choosing which measure to use. The distance method used is defined through the variable cep_dist_parms (as it will most likely involve some form of cepstrum parameters).

The assumption is that a set of parameters is defined for each frame (at some increment) in the database. This is specified through the CoefFileSkeleton setting of the Database command. The format of these files may vary; HTK headered and ATR improved cepstrum files are currently supported. New formats should be added to `file/cep_io.c'.

The distance measures themselves are defined in `chatr/cep_dist.c'. Currently supported are Euclidean and weighted Euclidean. Two alignment options are also provided: naive, which does no time alignment between the selected units' cepstrum vectors and the original (just taking the shortest), and tw, which linearly interpolates the selected units' cepstrum vectors to the original.(8)


Go to the first, previous, next, last section, table of contents.