This chapter runs through an example of building the files CHATR requires in order to build a synthesizer based on a speech database. The process is long and requires much disk space and CPU time. Although it is mostly automatic, there are a number of stages where informed decisions need to be made. Familiarity with the operation will greatly aid you in successfully building a usable synthesis database.
Before proceeding further, a short explanation of some database-building specific terminology follows.
Each waveform file is identified by a short identifier, a fileid. This will typically be the name of the file minus any extension. For example, if the files are called

sc001.wav sc002.wav sc003.wav ...

then the fileids are

sc001 sc002 sc003 ...
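The mapping from waveform file name to fileid can be sketched in shell (a sketch only; the build scripts derive the fileids themselves):

```shell
# fileid: strip the directory and extension from a waveform file name
# to get the short identifier used throughout the database build.
fileid() { b=${1##*/}; printf '%s\n' "${b%.*}"; }

fileid wav/sc001.wav   # prints: sc001
```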
The following are brief descriptions of the contents of sub-directories created while making a synthesis database.

`wav/'
     The waveform files.
`lab/'
     The phoneme label files.
`db_utils/'
     The scripts and binaries used to build the database.
`stats/'
     Statistics collected during building and training.
`units/'
     The unit description files (one line per unit).
`f0/'
     The extracted F0 files.
`pm/'
     The pitch mark files.
`cep/'
     The mel-cepstrum (MFCC) files.
`vq/'
     The vector quantized acoustic parameters.
`chatr/seg/'
     CHATR utterance representations of each file, for use with test_seg.
`index/'
     The database index and parameter files.
This processing will eventually create a fully trained database with index files. To use the resulting database within CHATR, only one definition command needs to be executed. See section Defining a Speaker, for details. The construction process requires access to the following programming systems
Choose a short name for your database and create a directory for it somewhere. All files will be generated in that directory by default. Only one place in the ultimate database definition refers to this directory, so it may easily be moved afterwards.
In your newly created database directory, you will need
Copy the waveform and phoneme label files into the `wav/' and `lab/' directories respectively.
Copy all the files in `/DB/PI/chatr/db_utils/' to your own `db_utils/'. Then make all the files writable using
chmod +w db_utils/*
These files are the scripts and binaries used to build a database. If all goes well you will not need to change any of these files, but then again you may have to.
Note you will need about 2.5 to 3 times the disk space of the `wav/' directory, plus space for the training distance files. These can be 2Mb to 1.5Gb, depending on the size of the database. The amount is related to the square of the number of occurrences of each phoneme.
Create a list of all the waveform files in the database. Assuming only the waveform files are called `*.wav', you can use the following command
ls wav/*.wav >files.all
cp files.all files
In the following shell scripts, the file `files' is used to list all files to be processed. If (when?) things go wrong, `files' may be modified to isolate only the files you wish to process.
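For example, a subset can be isolated with a pattern over `files.all' (the fileids and pattern below are purely illustrative):

```shell
# Rebuild `files' so only a subset of the database is processed.
# The file names here are made up for demonstration.
printf '%s\n' wav/sc001.wav wav/sc049.wav wav/sc102.wav > files.all
grep 'sc0[0-4][0-9]\.wav' files.all > files
cat files
# prints wav/sc001.wav and wav/sc049.wav, but not wav/sc102.wav
```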
Three files need to be created before proceeding: `db_description', `index/DBNAME_synth.ch' and `index/DBNAME_train.ch'. Templates are available in `db_utils'.
Create the `db_description' file in the top directory. Copy the example from the `db_utils/' directory using the command
cp db_utils/db_description db_description
Edit that file specifying the database name, the phoneme set, and the sample rate for the database. You may also wish to change the values in the second section depending on your environment.
It may be useful to modify the variable GET_F0_PARAMS. This specifies the minimum and maximum expected F0 for the speaker. It helps to set the range to the likely limits for a particular speaker database, so the defaults may not necessarily be suitable. This is especially so for male speakers, where the lowest value may be below the default minimum.
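For instance, for a low-pitched male speaker one might widen the range. The parameter syntax below is an assumption, not CHATR fact; check the comments in the `db_description' template for the exact form your version expects:

```shell
# Hypothetical setting for a male speaker whose F0 may fall below the
# default 70Hz minimum; option names are assumptions -- verify against
# the template's comments.
GET_F0_PARAMS="-min_f0 40 -max_f0 200"
```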
The two other configuration files must be created from templates in the `db_utils/' directory before commencing the database build. See section Database Parameters File, for details about the `index/DBNAME_synth.ch' file. See section Training Setup, for details about the `index/DBNAME_train.ch' file.
It is highly recommended to thoroughly check the database before proceeding further. See section Checking the Database, for details.
Databases are large and therefore very likely to contain errors. A number of specific tests are provided to try to detect the most likely problems. You are strongly advised to run these tests and study the results. Remember, database errors are probably the most common cause of bad synthesis.
db_utils/check_labs
This function checks the unit labels, identifying the number of occurrences of each, and finding any with unusually short durations.
Do fix any problems before continuing.
db_utils/check_phoneset
This function checks that the labels are all in the defined phoneme set. This only works when using a CHATR standard phoneme set. See section Defining a New Phoneme Set, if you are defining your own phoneme set. Again, do fix any problems before continuing.
db_utils/check_align
This checks the location of labels within waveforms. It is intended to detect offsets in label files, mismatches of label files to waveforms, and possibly a mistaken sample rate.
Once more, do fix any problems before continuing.
db_utils/check_labwav FILEID
Strictly speaking this is not a test; it uses XWAVES to display an example waveform and label file as described in the database. Check that the labels match the waveform, and that the waveform itself is in the right byte order. You should check all files using this method, but of course, you checked the database before you started this process, didn't you?
At least check three files--one near the beginning, one in the middle, and one near the end.
Before any processing occurs it is also wise to check that the waveform files in the database are of similar quality. Different waveform files often have quite different mean power, and it may be useful to normalize them. The CHATR function to do this is
db_utils/normal_power
It may also be necessary to exclude some waveform files because of extraneous noise -- background music, etc. In that case, remove the waveform file name from the `files.all' file and re-copy `files.all' to `files'.
All parameters in a database are described in the file `DBNAME_synth.ch'. This file describes both the database itself and other aspects of the voice, such as lexicon, intonation and duration parameters. There are very many options in CHATR, and it is not always easy to know which subsystems have parameters, let alone what values they should take.
Create a file called `DBNAME_synth.ch' in the directory `index/', where `DBNAME' is the name of the database you are building. A template example can be found in `db_utils/DBNAME_synth.ch'. This should be copied to the `index/' directory with the appropriate name.
This file must be edited.
Every occurrence of `<>' marks a part that requires specific attention.
The file can be viewed in two parts: initialization and selection. When loaded, this file should initialize and load all necessary parameters for the use of this database as a CHATR speaker. The function speaker_DBNAME (at the end of this file) should, when called, select the database and auxiliary parameters actually required. The idea is that users will call that function when changing between alternate speakers.
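Schematically, the file has the shape below (an illustrative sketch only; the real contents come from the `db_utils/' template):

```lisp
;; --- initialization: executed once, when this file is loaded ---
;; phoneme set, database, distance function and lexicon declarations,
;; for example:
(Database Set WaveSampleRate 16000)
;; ...
;; --- selection: called each time a user switches to this speaker ---
(define speaker_DBNAME ()
  "Select the DBNAME voice"
  ;; set only the variables that make DBNAME the current speaker
  )
```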
In general, change occurrences of DBNAME to the database name, PHONESET to the phoneme set name, and DICTNAME to the appropriate dictionary name.
Let us go through the main parts. Note the directory where the data is stored is defined when the function defspeaker is called. DBNAME_data_dir will be set before the file `DBNAME_synth.ch' is loaded.
First, decide on the phoneme set you wish to use and ensure it is loaded. If the phoneme set is a standard set (i.e. `radio2', `mrpa', `BEEP', or `nuuph') you may simply require the definition file. If not, you must define your phoneme set. Note all unit names in the database must be members of this phoneme set. A commented-out example of loading a definition specific to a database is available. See section Defining a New Phoneme Set for more details.
The main database declaration is next. It defines the name of the index file, and the format of the waveform, pitch mark and cepstrum files. If you use the standard database as your platform, the example names will be satisfactory--but remember to remove the `<>' marks. This section also defines the wave file type and sample rate (you must set this), as well as the phoneme set.
The Silence definition is used in unit selection as a context for units which come at the end (or start) of a file. Ensure that the example silence entry has a reasonable value for all fields that exist in your database. It is assumed that there is an effective silence before and after each file in the database--although good database design should ensure there are actually some silence units in the waveform as well. Note that the phone used to represent silence must be declared in the phoneme set as a non-vowel and of consonant type 0 in the phoneme definition.
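As a concrete illustration, a silence entry has the shape below. The field layout is database specific; this example mirrors the one used later in this chapter:

```lisp
;; "pau" is the silence phone; the remaining fields must match the
;; fields defined for units in this particular database.
(Database Set Silence ("pau" 0 67 0.0 120 0.0 0.210 0.0 5.369 0.0 0))
```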
The next section of `DBNAME_synth.ch' defines which distance functions are to be used (and trained). See section Training Setup, for the details.
The next significant section defines nus_DBNAME_params, which sets some general parameters for the unit selection process. Their current values are probably acceptable, although the beam and candidate widths could possibly be reduced. See the Variables appendix for the details of their values.
The next section is not executed during training mode, as it is then that the files it loads are generated. The weight sets to be loaded should be selected depending on pruning choice. See section Training Setup, for the details.
Setting up the lexicon typically requires two steps: definition and selection. A number of lexicons are already built into CHATR:
cmu
beep
mrpa
japanese
For these lexicons there are set-up functions defined in `lib/data/lexicons.ch'. Remove the comment characters from the line for the lexicon you wish to use. For more details on building your own lexicons see section Lexicon.
For duration set-up a number of choices exist. A linear regression model can be used, which is trainable for multiple languages. Examples built from f2b (American English) and MHT (Japanese) exist and can be used with other databases (of the same language). A neural net based model is also available but does not train well, though the example included from f2b is acceptable. For Japanese, a built-in model exists. It has no external parameters and requires no set up. See section Duration, for more details, including training new duration models. Specifically, for an example of tuning a linear regression model, see section Training a New Duration Model.
The second part of the duration definition defines pause durations at phrase boundaries. An appropriate definition must be loaded.
There are a number of intonation systems built into CHATR. The most stable ones are based around ToBI, hence for building working speech synthesis voices ToBI is recommended. The appropriate parameters should be set for that speaker. See the appendix on variables for the appropriate values of ToBI_params and mb_params. For more details on building intonation models see section Intonation. One stable method for predicting F0 from ToBI labels is a model using linear regression. Two models have been included with the system, one for English (from f2b) and one for Japanese (from MHT). These models can be mapped to another speaker's F0 range given the target speaker's F0 mean and standard deviation. These speaker-specific parameters are generated by the script make_tobif0_params during the building of a database.
The final part of this file defines a function that, when called, will select the appropriate parameters to cause the synthesizer to use that voice. For efficiency you should try to ensure everything is loaded into CHATR so that this function need only set some variables. Again, follow the comments and modify (comment out/uncomment) the sections appropriate to the database you are building.
Note that if you set any other variables in your speaker_DBNAME function, you have to ensure that the values are reset when the synthesizer switches to another speaker. In order to do this without editing all other speaker synth files, you can redefine the speaker_reset function. See section Speaker Reset Function, for an example.
This section describes the setup for the automatic training for unit selection. Two set-up files need modification. The first is `index/DBNAME_synth.ch'. See section Database Parameters File, for a full description of this file -- here we will discuss the points that specifically affect training. The second file is `index/DBNAME_train.ch', which describes the training parameters themselves.
First ensure that the declaration of the coefficients files is correct.
(Database Set CoefFileSkeleton (strcat DBNAME_data_dir "cep/%s.mcep3f02"))
(Database Set CoefType HTK)
A silence entry is necessary for this method of training so that distances can still be taken for contexts even when selected units lie at the end or start of files. Check that a silence entry exists and has the right number (and type) of fields.
Make sure the appropriate cepstrum control parameters are loaded, and then selected. This should be done within the speaker_DBNAME function.
(set cep_dist_params DBNAME_cep_dist_params)
Define clusters of phonemes which will share Discrete distance functions and weights for all distance functions. Groups with similar articulatory characteristics work well. Note the names and number of clusters are arbitrary. For example, in the nuuph (Japanese) phoneme set, a possible clustering is
(set DBNAME_PhoneSets
     '((nasal_n (n ny))
       (nasal_m (m my))
       (nasal_N (N))
       (bilabial (b by p pp py ppy))
       (aveolar (d dy dd t tt ts tts ch cch j))
       (velar (g gy k kk ky kky))
       (fricative1 (ff f h hy))
       (fricative2 (sh ssh s ss z))
       (glide (r ry w y))
       (a (a)) (e (e)) (i (i)) (o (o)) (u (u))
       (PAU (PAU))))
This is of course phoneme-set dependent. Note that all unit names in the database must be in at least one class. It is important that the groups have a reasonable number of members. If there are too few, training will not be possible; if there are too many occurrences within a group, it may require too much disk space (and swap space) to calculate.
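A quick way to gauge group sizes is to count label occurrences directly, a sketch assuming xlabel-style files whose last whitespace-separated field is the phone (check_labs does this more thoroughly):

```shell
# count_labels: print how often each label occurs, most frequent first,
# assuming the label is the last field of each line.
count_labels() { awk '{print $NF}' "$@" | sort | uniq -c | sort -rn; }

# A tiny made-up label file to demonstrate:
printf '0.100 125 a\n0.300 125 n\n0.500 125 a\n' > example.lab
count_labels example.lab
# prints:   2 a
#           1 n
```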
The distance functions fall into two classes: shared and table. Table distance functions are discrete fields which will be trained in the phonetic groups defined above. Shared distance functions are in general continuous. The default set of shared distance definitions is
(set DBNAME_SharedDFs
     '((p_phone_ident -1 phone ident eql)
       (n_phone_ident 1 phone ident eql)
       (duration 0 dur_z ident abs)
       (pitch 0 pitch_z ident abs)
       (p_pitch -1 pitch_z ident abs)
       (n_pitch 1 pitch_z ident abs)))
The default set of table distance functions is
(set DBNAME_TableDFs
     '((p_vc -1 phone ph_vc 2)
       (p_height -1 phone ph_height 4)
       (p_length -1 phone ph_length 6)
       (p_front -1 phone ph_front 4)
       (p_v_rnd -1 phone ph_v_rnd 2)
       (p_c_type -1 phone ph_c_type 7)
       (p_c_place -1 phone ph_c_place 7)
       (p_c_vox -1 phone ph_c_vox 2)
       (n_vc 1 phone ph_vc 2)
       (n_height 1 phone ph_height 4)
       (n_length 1 phone ph_length 6)
       (n_front 1 phone ph_front 4)
       (n_v_rnd 1 phone ph_v_rnd 2)
       (n_c_type 1 phone ph_c_type 7)
       (n_c_place 1 phone ph_c_place 7)
       (n_c_vox 1 phone ph_c_vox 2)))
These lists are automatically expanded into actual distance function definitions. See section Distance Functions, for a full description. Their fields are: distance name; offset (-1 previous phone, 0 current, 1 next phone); the field name in the database to apply it to; the mapping function (ident, log, or map name for table functions); and the distance type to use (or table size).
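Reading one of the shared entries against those fields (annotation only):

```lisp
;; (p_pitch  -1  pitch_z  ident  abs)
;;   name:           p_pitch
;;   offset:         -1      (previous phone)
;;   database field: pitch_z
;;   mapping:        ident   (no mapping)
;;   distance type:  abs     (absolute difference)
```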
Further down the `DBNAME_synth.ch' file you should check the parameters in the variable nus_DBNAME_params. Two values are relevant to training (but ignored during normal synthesis).
(dur_penalty 1.0)
(endpoint_weight 0.0)
Training produces a number of files which need to be loaded at synthesis time but not during training. In the example `DBNAME_synth.ch' file, a set of commands is only executed in non-training mode. You need to select which set of weights to include when this file is actually used. The first one is mandatory.
(set DBNAME_DiscTables (load (strcat DBNAME_data_dir "index/DiscreteTables.ch")))
If no pruning is to be done, only the 0 level weights are needed.
(set DBNAME_Weights (load (strcat DBNAME_data_dir "index/weights0.ch")))
If pruning is required, instead load the weights appropriate for the pruning level you desire. See section Pruning, for more information.
(set DBNAME_Weights (load (strcat DBNAME_data_dir "index/weights2.ch")))
Note: only load the level that is required (probably weights0).
The next functions set up the trained weights for synthesis.
(SetTableDFs DBNAME_PhoneSets DBNAME_TableDFs DBNAME_DiscTables)
(SetSharedDFs DBNAME_SharedDFs)
(Database Set Weights DBNAME_Weights)
The second file you must set up before training is `index/DBNAME_train.ch'. A template of this file is included in `db_utils/DBNAME_train.ch'. A number of configuration parameters exist within that file which you should consider. First, all occurrences of DBNAME should be replaced with your database name. General comments about other configuration issues are given throughout the script.
The configuration of the udb_train_params is primarily a research issue and beyond the scope of this manual.
Training of a database is a computationally expensive process. It can take from 20 minutes for a small database (e.g. gsw200 with 14 minutes of speech) to over 10 hours (e.g. f3a with 2.5 hours of speech). The most CPU intensive process is the calculation of the acoustic distance tables (or phoneme tables). These tables are calculated in the first major training step. The DISTFILE_FILEBASE variable defines where the copies of the tables will be stored on disk (by default `dist/DBNAME_'). You should ensure that there is a lot of free space in that partition. The disk space requirement increases roughly with the square of the number of units in the database, e.g.
gsw: approximately  8700 units ==>   12.7Mb
f2b: approximately 41000 units ==>  243Mb
f3a: approximately 97200 units ==> 1300Mb
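A back-of-envelope estimator fitted to those three figures, roughly 0.15 bytes per unit pair; this constant is derived here from the examples, not taken from CHATR itself, so treat it as planning guidance only:

```shell
# est_mb: crude disk estimate in Mb for the distance tables, given the
# number of units.  The 0.15 bytes-per-unit-pair constant is fitted to
# the example figures above.  Division first keeps the arithmetic small.
est_mb() { echo $(( $1 / 100 * $1 * 15 / 1000000 )); }

est_mb 41000   # prints 252 (the actual f2b figure above is 243Mb)
```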
Once the data is stored on disk, it can be reloaded quickly to speed up multiple training runs and the multiple stages in the training script given above.
Note: do NOT use the `/tmp' directory--it is not big enough.
If the clean_up parameter is set in the udb_train_params LISP variable, then the memory copy of the distance tables will be deleted after each time it is used. When the distance table is next required it can be calculated again from scratch (very slow), or loaded from the disk copy (strongly recommended). If the clean_up variable is not set, then the training procedure will keep a copy of all the distance tables in memory. The internal distance tables are twice the size of those stored on disk (e.g. 2.6Gb for `Ef3a'), so you may need lots of swap space. Except for the smallest databases, the clean_up parameter should be set.
If no training is possible for some reason, a weights file should still be created to name the distance functions that are to be used. Reasonable guesses for weights are possible. The format of the weights is a list of weights for each phoneme class. Each weight consists of a single phone or list of phones in the class, followed by a list of distance-function/weight pairs. A special phone named any may be used to cover all phonemes not otherwise specified. One suitable default weights file might contain
(quote ((any (p_phone_ident 0.3)
             (n_phone_ident 0.3)
             (duration 0.5)
             (pitch 1.0)
             (p_pitch 0.5))))
The script db_utils/make_db shows the main subprocesses involved in the process of building. If everything is set up properly, this script will build a fully trained database. It is best called (in BASH or SH) by
db_utils/make_db >make.log 2>&1
However, there are usually problems and it will be necessary to go through each stage by hand, especially the first time a database is built. This section describes each of these steps and problems that may occur during those stages. In general the order is significant unless otherwise explicitly stated.
First make all the directories that are used in the process.
db_utils/make_alldirs
Although this creates a directory called `dist/' to contain the unit distances used in training, you may want to change this to a symbolic link pointing to another partition with lots of free space.
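For example (the target path is a placeholder for your own roomy partition):

```shell
# Replace dist/ with a symbolic link into a partition with plenty of
# free space; /bigdisk/chatr_dist is a placeholder path.
mkdir -p dist            # stands in for the directory make_alldirs made
rmdir dist               # must be empty before it is replaced
ln -s /bigdisk/chatr_dist dist
```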
The next stage is to do the basic signal processing of the database: pitch extraction and mel-cepstrum parameter (MFCC) calculation. These could be run in parallel on different machines. Use the commands
db_utils/make_melcep
db_utils/make_f0s
Vector quantized versions of the MFCC parameters, pitch and power are generated for 10ms frames across the whole database. Use the command
db_utils/make_acoustic_params
Pitch marks for each file are generated by the following script
db_utils/make_pitchmarks
Depending on the method used for generating pitch marks (fz_track or other), this may be run in parallel with the creation of the F0 and MFCC files. Note that warnings of the form `No peak found: N N' may be generated, but they can be disregarded. More important is that fz_track may crash if the pitch of the waveform being tracked moves outside the specified range. This can happen particularly with male speech, where the default is 70Hz to 228Hz. It is uncommon but not impossible for the pitch to go as low as 30Hz. You should specify an operations file appropriate for the speaker.
The MFCC files are merged with F0 information for training with the script
db_utils/make_traincep
Next, the label files may be processed to produce unit description files, that is, one line per unit with all fields specified.
db_utils/make_units
If new fields are to be added to a database, they should be added to the files in the `units/' directory at this point. See section Adding a New Feature to a Database for details.
If the ToBI F0 prediction by linear regression is desired, but a full training is not possible (i.e. your database does not have ToBI labels), mapping parameters are required. The following script generates those parameters for loading in the `DBNAME_synth.ch' file.
db_utils/make_tobif0_params
If the linear regression model is to be used to predict durations, the following script will create the parameters to map the model durations to the target's duration range.
db_utils/make_lrdurstats
Now that we have the unit descriptions, a CHATR representation of them can be made using
db_utils/make_unitindex
Now that all the information has been collected together, a binary representation of the full database index, pitch marks, acoustic parameters etc. may be created using
db_utils/make_indexout
For the testing of a database with natural targets, a CHATR representation of each utterance is required for the test_seg function. This is done using
db_utils/make_segs
The final stage is the training of weights for unit selection. This requires that both the `index/DBNAME_synth.ch' and `index/DBNAME_train.ch' files be created and edited. Training can take some time and may use lots of disk space. The time to train is related to the square of the size of the database, so the bigger the database, the longer it takes to train. For example, the gsw 200-sentence English database takes about 20 minutes to train, while the Japanese 503-sentence database takes about 6 hours.
db_utils/make_training
Note: If training fails during the making of the distance tables, you should delete the last made table from `dist/'. It may be incomplete and hence reloading it later will cause an error.
A fully trained and described database should now exist. Before it can be used by CHATR it must be defined. See section Defining a Speaker, for details.
To use this newly created database, call the function
(speaker_DBNAME)
This will autoload your `DBNAME_synth.ch' file and execute the speaker_DBNAME function defined in that file.
Initial tests of the database are best made using natural targets. After defining the speaker in CHATR, you can test it with a command like
(Say (test_seg "fileid1"))
where fileid1 is a fileid from your newly created database.
Once a database is proven to be stable, its defspeaker definition may be added to the file `lib/data/itlspeakers.ch' in the CHATR distribution so others may use it. Initial tests should be done directly in a user's own installation of CHATR (i.e. from your `.chatrrc', or directly at the command line).
It may be desired to define a new phoneme set particular to a new database. This has been considered and some support is given. First you must create a CHATR file in the `index/' directory defining the phoneme set, called `PHONESET_def.ch'. See section Phoneme Sets, for details about how to define a phoneme set.
The desired phoneme set must be loaded in `DBNAME_synth.ch'. A line, commented out, shows the format.
Note that when a new phoneme set is used, that database will not work with the higher levels of the system directly. A new lexicon, and possibly new intonation and duration modules, will be required--especially if this is a new language. Of course, natural target resynthesis will work without any of these higher levels. In this case simply do not define any lexicon, intonation or duration in the `DBNAME_synth.ch' file.
Alternatively, a phoneme map may be defined between an existing phoneme set and the new phoneme set. The Phoneme Internal set can be an existing one and a mapping will occur automatically. Although this will work, the mapping system is probably not powerful enough to get the best results, so this should only be used as an intermediate step.
This method of building a speech synthesis database allows for the pruning of units from the database which are found to be unpredictable. There are two reasons for pruning: first, to reduce the size of the database so synthesis will be faster; and second, to remove units whose properties do not reflect the features they are labeled with. Pruning is still very much in its initial stages; this area deserves much more work before it can improve databases as much as we feel is possible.
The training algorithm provides options for levels of pruning. See the setting of train_level near the top of `DBNAME_synth.ch'. Setting the variable to non-nil will cause training to do levels of pruning. Pruning parameters are set in the variable udb_train_params, set further down the training file.
Once a set of units to be pruned is generated (they will be saved in `index/DBNAME_prune*.ch'), the index must be rebuilt without the pruned units. This is done via the following command
db_utils/make_pruning LEVEL
Note that it is only the index from which the pruned units are removed; the actual entries themselves still exist within the database, but will never be selected. They must remain because their neighbors may require information about their context and hence refer to these pruned units.
More serious pruning, e.g. removal of whole bad files, should really be done before CHATR processes the data.
Pruning does not happen by default when building databases, as we currently feel the advantage from it is minimal; more experimentation is really required.
It is possible to reduce the size of a database significantly by resampling the waveform files. For example, changing the waveform files from 16kHz 16-bit linear to 8kHz ulaw will take only one quarter of the space of the original. If the eventual output is to be played on a low quality audio system (e.g. Sun's /dev/audio), very little loss in quality will occur. Likewise, if a higher sample rate version is available you could use that.
The format of the waveform files may be changed without recompilation of any part of the database index. All information is time based, not sample based (even the pitch mark files).
Given the example template of `DBNAME_synth.ch' for a 16kHz, 16-bit linear waveform, we would have a declaration like
(Database Set WaveFileType raw)
(Database Set WaveSampleRate 16000)
(Database Set WaveEncoding lin16MSB)
(Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.wav"))
To change to a database of 8K ulaw first convert all the files in `wav/' to 8K ulaw (using some external program, or using CHATR). Then change the above lines in `DBNAME_synth.ch' to
(Database Set WaveFileType raw)  ;; i.e. unheadered
(Database Set WaveSampleRate 8000)
(Database Set WaveEncoding ulaw)
(Database Set WaveFileSkeleton (strcat DBNAME_data_dir "wav/%s.au"))
See the description of the command Database in the commands appendix for details of the formats supported.
All speaker functions defined in `DBNAME_synth.ch' call the function speaker_reset. Currently that function is defined but does nothing. The use for the function is to reset any variables set for a particular speaker before another speaker is selected. Of course, all of the speaker description files could be edited, but that would be a lot of work. So instead you should change the speaker_reset function defined in `lib/data/speakutils.ch'.
If you don't have access to that function or don't wish to modify it, you can still get the same effect by redefining the function speaker_reset in your own `DBNAME_synth.ch' file. In case someone else has already done that, the following method is recommended. In this instance, you define a new version of speaker_reset which calls the existing definition and also includes your own reset information. If everyone uses this technique, resets will happen properly.
Suppose your new speaker `zaphod' requires the variable spareheads to be set to one, but that needs to be nil for all other speakers. In `zaphod_synth.ch', after the definition of speaker_zaphod (which sets spareheads to one), you should add
(set zaphod_previous_speaker_reset speaker_reset)
(define speaker_reset ()
  "New speaker reset that adds resets for speakers after calling zaphod"
  (zaphod_previous_speaker_reset)  ;; the previously defined speaker_reset
  (set spareheads nil))
A common requirement may be the addition of a new feature to an existing database. This would be most common within our own research group where we wish to test the suitability of some new feature in the selection process. This section offers a walk through of what you have to change in an existing database to achieve this.
A new field may be added, trained and tested without any change to the CHATR C source code. However, if this field is to be added to the full synthesis process, you must of course modify the C source code in order to be able to predict this field.
The first stage is to generate the values for the new field(s) for each unit in the database. This unfortunately is not quite as easy as it sounds. You must ensure that the generated fields align with the unit labels in the `lab/*.lab' files. You should take an adequate amount of time to ensure this is the case.
Note the following process is destructive, in that it modifies the database already existing in a database directory. This only modifies the files in directories `units/', `chatr/seg/' and `index/', so a set of shadow links can be set up if desired. You should of course not be experimenting with a database that others may be currently using.
First create files in `units/', one for each fileid in the system, using a new file extension. The files may contain more than one new field. These fields can be pasted onto the end of the existing units files in that directory by the command
db_utils/add_newfields <newfield_fileextention>
This will modify all `.units' files in that directory appending the new fields.
Now create the file `index/DBNAME_extrafields' containing the field declarations for the new fields you wish to add to the database. Fields can be floats, ints, or categories. For example, if two new fields are added, one for ToBI accents and one for ToBI ending tones, the file `DBNAME_extrafields' may look like
 (tobi_accent (NONE H* !H* L+H* L+!H* L* L*+H OTHER))
 (tobi_tone (NONE L-L% L-H% H- L- H-H% OTHER))
Note it is necessary (for a later shell script) to have leading spaces on the above lines.
Now a new index can be created with those new fields.
db_utils/make_unitindex
The index is then compiled by
db_utils/make_indexout
If problems occur in making this index you will need to fix them before continuing.
Next a CHATR utterance representation of the database entries should be created (i.e. for use with test_seg). The format of these files includes all fields in a database entry, even if there is currently no way to predict a field's value during text-to-speech.
db_utils/make_segs
You will also need to amend the silence entry definition in `DBNAME_synth.ch' to give values for the new fields you have created. For example
(Database Set Silence ("pau" 0 67 0.0 120 0.0 0.210 0.0 5.369 0.0 NONE NONE 0))
Note that the new fields appear one position before the end of the entry.
Training of the new fields is also automatic. You need to edit `index/DBNAME_synth.ch' to define new distance functions for the new fields (and possibly delete existing distance functions you do not want). Note that a database may contain more fields than are actually used in selection; therefore, when comparing competing fields, the same compiled index may be used and only the training (and hence the weights files) need change.
For full details of distance functions see section Distance Functions.
Here we will only deal with a limited form of customization. There are two major classes of distance functions: continuous (float or int) and categorical. These are trained differently. New continuous distance functions should be listed in the variable DBNAME_SharedDFs, while categorical distance functions are listed in DBNAME_TableDFs. These lists are expanded automatically into full distance function definitions during training.
A continuous listing consists of 5 fields:

distance name
position offset
field name
    (as declared in `DBNAME_extrafields').
mapping
    Either ident, i.e. no mapping, or log, for logarithm.
difference measure
    eql returns 0 if the two values are equal, and 1 otherwise (this is only reasonable for int valued fields). abs means the absolute difference between the two values, and sqr means the squared difference.
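For example, a continuous listing for a hypothetical duration field named dur (the field name is invented here for illustration; use whatever you declared in `DBNAME_extrafields'), compared on a log scale by absolute difference, might look like

```lisp
(dur 0 dur log abs)
```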
A categorical listing also consists of 5 fields:

distance name
position offset
field name
    (as declared in `DBNAME_extrafields').
mapping
    Normally ident, but if some further quantization is desired it may be achieved by defining a new Discrete and Map. See section Discretes and Maps.
size
    If the mapping is ident, this is the number of members in the field declaration. If the mapping is something other than ident, this is the number of items in the category being mapped to.
Note that categorical distance functions will be trained in phone groups. This will rarely be wrong, but may sometimes provide more differentiation than is necessary.
In our example we have two new fields we wish to train. Both are category fields, so we add their descriptions to the variable DBNAME_TableDFs. The additions would look like
(tobi_accent 0 tobi_accent ident 8)
(tobi_tone 0 tobi_tone ident 7)
That is (for the first line), the new distance function is called tobi_accent; it applies to the current phone (offset 0), uses the field named tobi_accent, with no mapping (ident), and has 8 members.
If we wished to have a distance function not only on the current phone but also on its context, we could add distance functions that use the ToBI accents of the left and right context
(p_tobi_accent -1 tobi_accent ident 8)
(n_tobi_accent 1 tobi_accent ident 8)
Thus only the distance name and the position offset change; the field name of course remains the same.
Once the new distance measures have been defined, you can train the new weights. The standard database script can be used
db_utils/make_training
The distance measures calculated in `dist/' by default are reused if they exist, so keeping them is useful when training on different fields, as it makes retraining much faster. Note if you change the phoneme set you must delete the old distance files and re-create them.
After training you should check the training log file in `index/DBNAME_train.log' to see the contribution of the new fields you have introduced.
Once you decide that a new field is worth predicting, you will need to modify CHATR to actually predict it. All target fields are generated using functions in the file `src/udb/udb_targfuncs.c'. These functions return a Lisp cell (to deal generically with the appropriate type: float, int or categorical) from a segment stream cell. An entry should be added to the table df_targ_val_name2func relating the new field name to a function. The function may simply access a field in the segment stream cell (or one related to it), or do some calculation. It may be that a feature function already exists to generate the appropriate value, in which case it may simply be called from a thin wrapper function (cf. udb_tf_sylpos).
The object of training is to find the weighting that minimizes the distance of the selected units from the original. We do not yet know the ideal distance measure. The ideal measure would be a signal processing measure that would directly follow humans' perceptions of good and bad synthesis. However, approximations of this measure are possible, and CHATR supports a mechanism for choosing what measure to use. The distance method used is defined through the variable cep_dist_parms (as it will most likely involve some form of cepstrum parameters).
The assumption is that a set of parameters is defined for each frame (at some increment) in the database. This is specified through the CoefFileSkeleton setting of the Database command. The format of these files may vary, but HTK headered and ATR improved cepstrum files are currently supported. Remember to add new formats to `file/cep_io.c'.
The distance measures themselves are defined in `chatr/cep_dist.c'. Currently supported are Euclidean and weighted Euclidean. Two alignment options are also provided: naive, which does no time alignment between the selected units' cepstrum vectors and the original (just taking the shortest), and tw, which linearly interpolates the selected units' cepstrum vectors to the original.(8)