This appendix contains a list of variables used by CHATR,
with a description of their values and functions. Note that in many
cases these variables are unset, which usually means they have
no effect. Obviously, if set, they will affect CHATR's
operation.
cep_dist_params
-
Contains parameters for the cepstrum distance functions used in unit
selection weight training (and possibly selection too). If set, this
variable should contain an a-list of parameter names and values. The
possible parameter names are
v_start (0)
-
Which parameter to start from within a vector.
v_end (16)
-
Which parameter to end at within a vector.
cep_no_db (0)
-
If 1, the filename given to Compare_Cepstrums is treated as a full
pathname.
filetype (NUUTALK)
-
The file type of the cepstrum files (may also be HTK).
align_type (naive)
-
What time alignment should be applied to the two strings of cepstrum vectors.
naive
-
No time alignment, match to shorter one.
tw
-
Interpolate selected to target.
dtw
-
Dynamic time warping (not yet implemented).
frame_sds
-
Standard deviations for each parameter in a frame. Used in weighted
Euclidean distance.
frame_weights
-
Weight for each parameter in a frame. Used in weighted Euclidean
distance.
dist_type (euclidean)
-
Distance metric used for frame comparisons.
euclidean
-
Simple Euclidean distance (squared error).
weighted_euclidean
-
The difference is divided by the standard deviation squared and multiplied
by the weight. HACK: parameter 0 is assumed to be F0, where 0 means unvoiced.
mahalanobis
-
Not yet implemented.
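Putting the parameters above together, a cep_dist_params setting might look like this (the parameter names are from the list above; the values are purely illustrative):
```lisp
(set cep_dist_params
     '((v_start 0)              ;; first vector parameter to use
       (v_end 16)               ;; last vector parameter to use
       (filetype NUUTALK)      ;; or HTK
       (align_type naive)      ;; no time alignment
       (dist_type euclidean))) ;; simple squared error
```
Parameters not given a value keep their defaults.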
chatr_confirm_exit
-
If value is set and non-nil, CHATR will prompt for confirmation
before exiting.
chatr_hush_startup
-
If value is set and non-nil, CHATR will not display the startup
copyright message.
chatr_max_clients
-
If set to an integer, limits the number of clients to that number.
If unset (or nil), no limit is specified.
chatr_secure_functions
-
A list of function names that may be called at top level while in
server mode. If set to non-nil value, only those functions named in
this list may be called by a client program. This is for security,
as although CHATR should be run as user nobody in server mode, it should
only be used for synthesis--not for devious things. Note that `set'
cannot be included in the list, as then you could simply change it.
Neither should defvar, define, system or Audio be included. Some would
say even this is not secure enough.
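For example, server-mode clients could be restricted to a minimal set of synthesis functions (the particular function names below are illustrative; choose whichever your installation actually needs):
```lisp
(set chatr_secure_functions
     '(Utterance Synth Say))  ;; note: set, defvar, define, system
                              ;; and Audio should NOT be listed
```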
chatr_server_portnum
-
If the value is set and is an integer, it is used as the port on which
CHATR listens in server mode. Note this has to be set either in init.ch,
the user's .chatrrc file, or a file loaded on the command line, as the
number is used before anything is read interactively.
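A minimal sketch, placed for example in the user's .chatrrc (the port number is arbitrary):
```lisp
(set chatr_server_portnum 4567)  ;; illustrative port number
```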
dumbplus_params
-
If a unit selection synthesizer is being used and the concat method
is set to DUMB+, this variable defines various parameters for
the DUMB+ module. The value should be an a-list of parameter
names and values. Currently supported parameters are (defaults
shown in parentheses)
strategy (mds)
-
Defines join point strategy.
z_crossings
-
Join at zero crossings (ignoring direction).
mds
-
Minimal distance splicing. Look at a short window of samples to find
closest fit. (This is quite good.)
dumb
-
Just butt them, *no* modification.
breaks
-
Add break_size in mS (default 100) between each unit.
mds_search_window_ms (4.0)
-
Number of mS of the unit in which to look for the mds join point.
mds_diff_wind_samp (7)
-
Number of sample points to test for minimal distance.
pm_align (OFF)
-
If ON will first align unit edges to pitch marks before join strategy
is applied.
break_size (100)
-
Size of pause in mS between units when the breaks strategy is used.
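A possible dumbplus_params setting, repeating the defaults listed above (values illustrative):
```lisp
(set dumbplus_params
     '((strategy mds)             ;; minimal distance splicing
       (mds_search_window_ms 4.0)
       (mds_diff_wind_samp 7)
       (pm_align OFF)))
```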
dur_lr_model
-
When Duration_Method is set to LR_DUR, this variable should hold
a pair: a list of features and a decision tree whose leaves are
linear regression models. The number of weights in the model should
equal the number of characters when the values of all the
features are concatenated; the feature values must therefore all be
single digits.
etc-dir
-
A directory containing CHATR-specific executables. This is initialized
to <chatr_lib>/.../etc, though it may need to be changed for certain
installations.
*features*
-
Contains a list of atoms identifying features of this particular
installation (cf. Common Lisp's *features* variable). This variable
can be used to check the availability of various features, e.g.
NIST SPHERE support, (direct) DAT-Link support, CSTR diphones, etc.
feature_maps
-
A rather crude way for some systems to map category features to
binary ones. The value consists of a list of names, each with a list of
members. If a feature map is applied to a feature value, and that value
is a member of the specified set, then the feature value becomes 1,
otherwise 0. This is used for F0 prediction in the lr models.
An example is
(set feature_maps
'((tobi_accent_0 H*)
(tobi_accent_1 !H*)
(tobi_accent_2 L*)
(tobi_accent_3 L+H* L+!H* H+!H* L*+!H L*+H)
(tobi_accent_4 *? * X*?)))
f0_no_jitter
-
If set to non-nil, no jitter will be added to generated F0s. The
method used to generate jitter is random, but not quite in the
right way. Most speakers set this to 't, so no jitter is generated.
HLP_Pattern
-
HLP_Patterns (and HLP_Rules) must be set before HLP functions will
work. HLP functions are used to predict intonation parameters from
discourse labeled input. HLP_Patterns add features (based on IFT
feature values) on categories dominated by (CAT S) labeled nodes.
HLP_Patterns consist of a list of pattern rules. Refer to the User
Guide for more details.
HLP_prosodic_strategy
-
The value of this variable defines the strategy to be used to predict
intonation event positions. This is used for discourse input, including
tts. If unset the strategy defaults to Hirschberg. Possible values are
Hirschberg
-
Predict accent position using a heuristic-based algorithm.
Monaghan
-
A phrase based algorithm (not as fully implemented as Hirschberg).
DiscTree
-
Use a decision tree to predict accent position (0.7 old version of
trees).
None
-
Do not do any prediction. Either the input already contains features
that can be used to realise accents, or you just don't want
any predicted.
HLP_phr_disc_tree
-
A decision tree for predicting phrase boundaries when
HLP_phrase_strategy is set to DiscTree. It should return values
0, 1, 2, 3, or 4 for a given word. Note that at this point in the
synthesis process only some features will work: no syllable, phoneme, or
intonation prediction has taken place yet, and phrase prediction must
happen before those later stages can succeed.
HLP_phrase_strategy
-
The value of this determines which phrase prediction algorithm
should be used in HLP processing and TTS. Possible values are
Bachenko_Fitzpatrick
-
Use the Bachenko and Fitzpatrick algorithm.
DiscTree
-
Use a decision tree to predict boundaries. See HLP_phr_disc_tree
for the tree.
None
-
Don't predict anything. (Though input in HLP mode may already contain
explicit phrase marking.)
HLP_realise_strategy
-
Defines the method used to realise predicted prominence as accents.
If the value is Simple_Rules, HLP_Rules (and HLP_Patterns) are used to
realise prominence and phrasing as the specified accents.
If unset, this method is still used, but ToBI and JToBI ignore it
and re-predict accents by their own methods. Setting this
to Simple_Rules and using PhonoWord input allows you to completely
bypass the ToBI (and JToBI) accent and boundary prediction methods,
thus allowing hand control of them.
HLP_Rules
-
HLP_Rules (and HLP_Patterns) must be set before HLP functions will
work. HLP functions are used to predict intonation parameters from
discourse labeled input. HLP_Rules are the first stage in adding
default features to existing feature categories in the Sphrase
structure. The value of HLP_Rules should be a list of defaults.
Defaults are of the form
( FLIST1 -> FLIST2 )
where FLIST1 and FLIST2 are feature lists. If a category contains
all the features in FLIST1, the features in FLIST2 are added if they
do not already exist in the category. Later extensions should allow
variables and conditions in these patterns.
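To sketch the form described above, a defaults list might look like the following (the feature names and values here are invented purely to show the shape of a `( FLIST1 -> FLIST2 )` rule; they are not taken from a real rule set):
```lisp
(set HLP_Rules
     '(((CAT N) -> (ACCENTED +))    ;; hypothetical feature names
       ((CAT S) -> (TUNE decl))))   ;; added only if not already present
```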
jlts_no_unvoiced
-
For Japanese text-to-speech. If set to true, U and I in romaji input
are converted to their voiced versions u and i. If unset or nil,
the unvoiced phones are left as is. (This implies the udb synth method,
or whatever else can deal with them.)
KDD_full_kan2rom
-
If set to non-nil, the romaji input for Japanese synthesis is assumed
to come from the KDD conversion program. The consequence is that
in the KDD case, numerals are treated as break levels rather than
numbers.
load-path
-
A list of directories that are searched for files when using
load_library (and some other standard functions). This follows the
usage of load-path in Emacs. It is by default initialized to the
standard library directory as defined when CHATR is installed.
lexicon_syllabify
-
If non-nil during lexicon compilation, the entries are assumed to be
unsyllabified, with vowels terminated by 0, 1, or 2 denoting stress.
CHATR will automatically syllabify the entries and extract the stressing.
This is to deal with the format in which the BEEP and CMU lexicons are
distributed.
mb_params
-
Parameters for Beckman and Pierrehumbert Japanese Intonation module.
Users can set values through this variable to affect the operation.
The value should be an a-list of parameter names and values. Currently
supported parameter names are (defaults shown in parentheses)
phrase_top (180Hz)
-
refval (90Hz)
-
hamwin_size (240ms)
-
Length of smoothing window.
PhrHProm (0.8)
-
Default prominence for H-.
WeakLParam (0.85)
-
Prominence of weak L% relative to strong L%.
UPHRASELProm (1.0)
-
Default prominence of an utt-final L%.
DPHRASELProm (1.0)
-
Default prominence of an absolute utt-final L%.
IphrLProm (0.9)
-
Default prominence of a medial Iphrase boundary L%.
AccPLProm (0.8)
-
Default prominence of a mere accent-phrase boundary L%.
KernLProm (0.7)
-
Relative prominence of L in H*+L accent.
declinAmount (0.01)
-
Declination amount over the utterance.
finLowAmount (0.1)
-
Final lowering constant stated as the ratio to reduce by (i.e. the
default means reduce to 90% of the otherwise expected value by the end).
PhraseDownstep (0.8)
-
Downstep amount between accent phrases (AccentP).
target_method (rule)
-
Defines whether rules or linear regression (lr) are used to predict F0
target values.
target_f0mean
-
If target_method is lr, this value is used to map the model pitch
range onto the target speaker's pitch range. This should be the mean F0
for all vowels in a significant sample of speech (or the whole
database if possible).
target_f0std
-
The standard deviation of the speaker's F0 pitch range, taken from all
vowels. This is used to map the lr F0 model pitch range to a
particular speaker's range.
An example would be
(set mb_params
'((phrase_top 355)
(refval 185)
(hamwin_size 300)))
Parameters not given a value will be set to their default.
nn_params
-
Parameters for neural network training. If set, this should be an
a-list of parameter names and values. Current parameters are
n_hidden N
-
Number of hidden units (default 10).
check_pt N
-
Number of iterations between check points.
check_pt_func func
-
Lisp function to be run (no arguments) at check point.
check_pt_actions LIST
-
What to do when a check point occurs. There are three possible
actions, all or none may be selected
save
-
Save the current net in the output file.
error
-
Display the mean error at this point.
list
-
Display one cycle of input and output vectors.
start_net NNet
-
Lisp description of a net. This is used as a starting point. It also
allows training to start with a partially trained net. Example use is
(set nn_params '((n_hidden 5) (check_pt 1000)
(check_pt_action save error)
(i_type binary))).
nnd_nets
-
Neural net descriptions and types for predicting syllable and phoneme
durations. This should be a list of four items
SYL_ITYPE
-
List of features (for syls) defining input to SYL_NET.
SYL_NET
-
Neural network, as saved by NN_Train, for syllable durations.
PH_ITYPE
-
List of features (for segs) defining input to PH_NET.
PH_NET
-
Neural network for phoneme durations, as saved by NN_Train.
nnd_params
-
Parameters for neural network duration module. If set, this should be
an a-list of parameter names and values. Current parameters are
syl_stretch N
-
Number (float) multiplied to predicted syllable duration for globally
changing durations.
phoneonly 0
-
If 1, the syllable net is ignored and it is assumed the phone net alone
can do the work.
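For instance, to stretch all predicted syllable durations by 10% (values illustrative):
```lisp
(set nnd_params
     '((syl_stretch 1.1)   ;; 10% longer durations overall
       (phoneonly 0)))     ;; still use the syllable net
```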
no_smooth
-
If set to non-nil, an intonation system that uses smoothing (ToBI and
JToBI) will not smooth the target values.
NT_cep_gc_strategy
-
In original NUUTALK, this variable controls the garbage collection
strategy for unit cepstrum files. This is not very useful when
acoustic costs are done in a vq table. When cepstrum distance
measurements are used, many cepstrum files are read. This
variable specifies the size of cache for keeping cepstrum files.
It may be set to a number, or to NONE (the default) if no caching
is required. A typical value is 500.
NT_cost_type
-
In original NUUTALK unit selection, this variable determines the
acoustic cost function used in unit selection. If set to
cep_dist, it uses a Euclidean distance. If set to vq_dist it
uses vector quantization matching. Vector quantization is much
faster than cepstrum distance, but the current database must have
vq information for this to work.
nuu_female_f0
-
If set to non-nil value, causes NUUTALK f0 values to be increased
by a factor of 1.7. This is a quick solution to generating Japanese
female intonation.
nus_params
-
In UDB unit synthesis in NUS mode, the value of this variable (an a-list)
specifies the weightings for the cost function used to score unit
selections. The possible values are
exclude_list <list of file ids>
-
Units from these files are excluded from the selection process.
beam_width <num>
-
Number of candidates to carry forward at each segment.
cand_width <num>
-
Number of new candidates to consider at each segment.
context_wt <num>
-
Weighting for segmental context.
join_wt <num>
-
Weighting for acoustic join in unit score function.
pros_wt <num>
-
Weight for overall prosody (power, pitch, and duration).
power_wt <num>
-
Weighting for power.
dur_wt <num>
-
Weighting for segmental duration.
pitch_wt <num>
-
Weighting for F0 pitch.
zdist_fact 1.0
-
When using z-scores, the targets are multiplied by this factor to reduce
the extreme cases and hopefully result in more average units. A value
of 0.0 causes selection of mean pitch, power, and duration. Larger
factors allow the actual targets to influence the selection.
All of the above have defaults if left unspecified. The overall cost
function is
(((dur + pitch + power)/3 + context)/2 + join)
Thus if a weighting is set to 0.0, that feature is ignored; if it is
increased with respect to the other weights, that feature will count for
more.
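A sketch of a nus_params setting (all weights and file ids below are illustrative; unspecified parameters keep their defaults):
```lisp
(set nus_params
     '((beam_width 10)        ;; candidates carried forward
       (cand_width 20)        ;; new candidates per segment
       (context_wt 1.0)
       (join_wt 1.0)
       (pros_wt 1.0)
       (zdist_fact 1.0)
       (exclude_list (fileA fileB))))  ;; invented file ids
```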
nus_phones
-
This variable is used when compiling unit database indexes (using
the command Database Units). The value should be a list of entries,
one for each phoneme in the database phoneme set. Each entry
contains the phone name plus means and standard deviations for duration,
pitch, voicing and power, i.e. it should consist of the following fields
phone_name mean_dur dur_sd mean_pitch pitch_sd
mean_voice voice_sd mean_power power_sd
The entries are optional, but they are necessary for any database to use
the Generic selection strategy (which makes them pretty much mandatory).
pause_prediction_method
-
Determines the method used for pause prediction. Possible values are
by_phrase_break
-
Pauses will be inserted after phrase breaks if the pause size for the
level set by Stats Pause is non-zero.
disctree
-
Use the decision tree in variable pause_prediction_tree.
pause_prediction_tree
-
A decision tree used when pause_prediction_method is set to disctree.
This tree predicts, for a given word, whether or not a pause exists.
An example is in lib/data/tobi.ch.
power_modify
-
If set to a non-nil value, this will cause power modification of selected
units when the synth method is UDB. Note this will only modify units
whose power targets have values other than 0.0. If no power prediction
module is used in synthesis, this will only be useful when values are
provided by other means, e.g. natural units.
ps_params
-
This sets up parameters for the PS_PSOLA module used for unit
concatenation. The value should be an a-list of parameter names and
values. Currently supported parameters are (defaults shown in
parentheses)
x_pitch (1.0)
-
Global pitch modification factor.
x_duration (1.0)
-
Global duration modification factor.
modify_power (no)|yes
-
Modify power through a segment (poor).
pitch_min_delta (0.0)
-
If pitch change is less than this ratio, make no pitch modification.
pitch_max_delta (1.0)
-
If pitch change is greater than this only make this amount of change
(1.0 == 100%).
pitch_delta (1.0)
-
Amount to change pitch by between target and selection. 1.0 means full
change, 0.0 means no change, 0.5 means move 50% towards the target value.
dur_min_delta (0.0)
-
If the duration change is less than this ratio, make no duration
modification.
dur_max_delta (1.0)
-
If the duration change is greater than this, only make this amount of
change (1.0 == 100%).
dur_delta (1.0)
-
Amount to change duration by between target and selection. 1.0 means full
change, 0.0 means no change, 0.5 means move 50% towards the target value.
percent_win_dim (2.0)
-
Size of Hanning window in pitch periods.
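For example, to halve all durations globally while leaving pitch untouched (values illustrative; unlisted parameters keep their defaults):
```lisp
(set ps_params
     '((x_pitch 1.0)         ;; no global pitch change
       (x_duration 0.5)      ;; halve durations
       (pitch_delta 1.0)     ;; move fully to pitch targets
       (percent_win_dim 2.0)))
```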
reduce_tree
-
A decision tree for syllables to predict whether the vowel should be
reduced.
schwas
-
A mapping of full vowel to reduced form. This should consist of an
a-list indexed by phone set name to a-lists of full vowel to reduced
form. An example is in lib/data/reduce.ch.
syn_params
-
This sets up parameters for the waveform synthesis process. The value
should be an a-list of parameter names and values. Currently supported
parameters are (defaults shown in parentheses)
phrase_by_phrase (NIL)
-
Any non-nil value causes synthesis of an utterance to be chunk by
chunk. A chunk is defined as being a string of segments terminated by
a silence (or end of utterance). If this variable is nil (or unset),
the other parameters have no effect.
whole_wave (t)
-
Means the whole wave, made by concatenating the chunks (separated by
silences), is returned. In tts mode it is useful to set this to nil and
use synth_hook to say each individual chunk.
silence_method zeros
-
Generate the silence as a wave of zeros.
natural
-
Let the synthesizer method do it (i.e. in udb mode, let silences be
selected from the db).
delete
-
Don't create them at all.
noise
-
Generate silence with some noise (not implemented).
hardware_silence 0
-
Length in milliseconds of the time it takes (roughly) for the audio
output hardware to start playing a waveform after it has been given
to it. DAT-Links, for example, add a delay of 750 mS. (This parameter
should affect the splitting and generation of silence, but this hasn't
been implemented yet.)
These parameters were designed to solve two problems: latency in
start-up time for the first sentence to be synthesized in TTS, and the
bad distribution of silences in most of our databases.
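Combining the parameters above, a syn_params setting for phrase-by-phrase TTS might look like this (values illustrative):
```lisp
(set syn_params
     '((phrase_by_phrase t)
       (whole_wave nil)           ;; say each chunk via synth_hook
       (silence_method zeros)))   ;; silences as zero samples
```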
synth_hook
-
If the value is a function or list of functions, run these function(s)
on an utterance immediately after a waveform has been generated (i.e.
in the C function synthesis()). These functions should take
an utterance as their only argument. For example, if you wished to
normalize the gain and add an echo to all synthesized utterances:
(define echo (utt) (Filter_Wave utt 'Delay))
(set synth_hook (list Regain echo))
ToBI_accent_tree
-
When ToBI intonation method is selected, this must contain a decision
tree for syllables to predict accents.
ToBI_boundary_tone_tree
-
When ToBI intonation method is selected, this must contain a decision
tree for syllables to predict boundary tones (and pitch accents).
tobi_lrf0_model
-
When ToBI intonation method is selected and target method is set (in
tobi_params), this should contain a four item list. The first three
items are linear regression models for predicting F0 at the start,
middle, and end points of syllables. Each model consists of a list of
elements. An element consists of a feature name, a weight, and
optionally a feature map name (see feature_maps). The fourth item in
the list is a parameter list with only two possible parameters,
model_f0mean and model_f0std; these should contain the overall mean
and standard deviation of the speaker from whom this model was built.
This allows mapping to other speakers' pitch ranges.
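The overall shape of the value might be sketched as follows (the feature names, weights, and numbers are invented purely to illustrate the four-item structure described above, and the exact bracketing may differ in a real model):
```lisp
(set tobi_lrf0_model
     (list
      '((Intercept 160.0)               ;; start-of-syllable model
        (syl_accented 10.0 accent_map)) ;; feature, weight, feature map
      '((Intercept 150.0))              ;; mid-syllable model
      '((Intercept 140.0))              ;; end-of-syllable model
      '((model_f0mean 120.0)            ;; stats of the model speaker
        (model_f0std 30.0))))
```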
ToBI_params
-
When ToBI intonation method is selected, these parameters affect
various aspects of the F0 generation process. If set, this should be
an a-list of parameter names and values. Current parameters are
pitch_accents
-
A list of the pitch accents valid for this version (these must be
specified even though they are currently only used for validation).
phrase_accents
-
A list of phrase accents (H- L-).
boundary_tones
-
A list of boundary tones (actually phrase+boundary) H-H% L-H% L-L% H-L%.
target_method
-
If set to lr, linear regression is used to generate the F0 contour. If
set to apl, the multi-factor method (Anderson, Pierrehumbert and
Liberman) is used.
target_f0mean
-
If target_method is lr, this value is used to map the model pitch range
onto the target speakers pitch range. This should be the mean F0 for
all vowels in a significant example of speech (or the whole database if
possible).
target_f0std
-
The standard deviation of the speaker's F0 pitch range, taken from all
vowels. This is used to map the lr F0 model pitch range to a
particular speaker's range.
The following are only used if target_method is unset, or set to apl
topval
-
Step (in Hz) above the reference line, used as the reference for
calculating target points (i.e. H*, etc.).
baseval
-
Step (in Hz) below the reference line, used as the reference for
calculating target points (i.e. L*, etc.).
refval
-
Start point (in Hz) for reference line.
h1
-
Factor of topval for uprise before H*.
l1
-
Factor of baseval for downstep before L*.
prom1
-
Factor (times top/baseval) for magnitude of H*/L* (above/below ref.
line). Also for endpoint in H-H%/L-L% boundary tones.
prom2
-
Factor (times top/baseval) for magnitude of H-/L- (above/below ref.
line), when immediately followed by an opposite boundary tone (L%/H%).
prom3
-
Factor (times top/baseval) for magnitude of H-/L- (above/below ref.
line), when not immediately followed by a boundary tone.
HiF0_factor
-
Factor to increase H*'s when marked with HiF0 (default is 1.3).
decline
-
Declination as a drop factor per millisecond, i.e. 0.01 means drop to
99% every millisecond. (This only applies if decline_range has a value
other than 0.0.)
decline_range
-
Number of Hz the ref. line should drop over a phrase (i.e. a phrase
ended by a phrase accent or boundary tone).
The following is used for all target_methods
hamwin_size
-
Size of Hamming Window used in smoothing the F0 made from the target
points.
The lr method of target generation is much easier to deal with than the
explicit setting of values. A typical example is
(set ToBI_params
'((target_method lr)
(target_f0mean 113)
(target_f0std 31)
(hamwin_size 240)))
These values can also be created automatically at database build time.
udb_all_means
-
If set to a non-nil value, segments are set with the mean pitch, power,
and duration for that particular phoneme before unit selection.
udb_nus_phones
-
If set, this is included in UDB databases at compile time. It contains
a list of phonemes with mean and standard deviation values for
duration, pitch, power and voicing for each one. This table allows
Z-score references in a database. There are two formats: an old one,
which you should not use, and the new format, which is simply a list of
entries, one for each phoneme. Each entry consists of nine fields, as
follows
phone_name duration_mean duration_sd
pitch_mean pitch_sd voicing_mean
voicing_sd power_mean power_sd
The phone_name should be a phoneme in the phoneme set of the database
to be compiled. The other values should be floating point numbers. An
example entry is (there should be one entry for each phone)
(a 89.612 29.045 108.090 41.588 0.876 0.211 1369.200 903.460)
udb_prune_units
-
List of unit numbers to be pruned from a database, used at udb index
compile time. It should be an a-list of phones, each with a list of unit
numbers to be removed from the index. Note the entries themselves
will not be removed (as they are part of other units' contexts), but
they will never be selected.
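For example, to stop particular units from ever being selected (the phones and unit numbers below are invented):
```lisp
(set udb_prune_units
     '((a 5 17 23)   ;; never select units 5, 17, 23 of phone a
       (i 4)))       ;; never select unit 4 of phone i
```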
utt_hook
-
If the value is a function or list of functions, run these function(s) on
an utterance as it is generated (via the Utterance function). For
example, if you wish all utterances to be synthesized and said
automatically without explicit calls:
(set utt_hook (list Synth Say))