The purpose of a lexicon is to translate `words' (arbitrary atomic tokens) into syllables with pronunciation and stress. The lexicon system within CHATR, like many of the other parts of the system, is designed to be very powerful but currently only offers minimal functionality. The system is designed such that we can later replace this module with a more sophisticated one. This also allows users to incorporate a lexicon of their own choice if they so wish.
There are currently four lexicons utilized within CHATR. Switching between lexicons, and thereby changing language or dialect, is easy and can be done within a session if required. Definitions in the file `lib/data/lexicons.ch' list which lexicons are selected for each speaker.
The lexicon names are
mrpa
beep
cmu
japanese
A lexicon contains a compiled set of entries and/or a (typically) small set of addenda items. There is also a flag to define what should be done if a word cannot be found in the lexicon--either fail or apply some form of letter-to-sound rules. Although a lexicon has a specific phoneme set, mapping may be performed between a lexicon's phoneme set and the currently selected CHATR internal phoneme set, if such a map is defined.
The basic form of a lexical entry is
(word (syllable~1... syllable~n) [ features ])
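where those elements are defined as
word
    the token being defined
syllable~1 - ~n
    each syllable has the form ((phoneme~1... phoneme~n) (1 | 0))
phoneme~1 - phoneme~n
    the phonemes of the syllable, in the lexicon's phoneme set
(1 | 0)
    the stress mark: 1 if the syllable is stressed, 0 if it is not
features
    an optional list of feature-pairs
feature-pair
    a pair of the form (feature-name feature-value)
feature-name
    the name of the feature, for example CAT
feature-value
    the value of the feature, for example N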
A typical example is
(beautiful (((b y uu) (1)) ((t i) (0)) ((f u l) (0))))
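To make the nesting of an entry explicit, the following is a minimal sketch in Python of a reader for this form. It is not part of CHATR; the function names `parse_sexp' and `unpack_entry' are invented for this illustration, and atoms are simply kept as strings.

```python
def parse_sexp(text):
    """Parse one s-expression into nested Python lists of string atoms."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume the ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing input after the first expression")
    return result


def unpack_entry(entry):
    """Split a parsed entry into (word, [(phonemes, stress), ...], features)."""
    word, syllables = entry[0], entry[1]
    features = entry[2] if len(entry) > 2 else []  # features are optional
    return word, [(phones, int(stress[0])) for phones, stress in syllables], features


entry = parse_sexp("(beautiful (((b y uu) (1)) ((t i) (0)) ((f u l) (0))))")
word, syllables, features = unpack_entry(entry)
print(word, syllables, features)
```

Each syllable comes out as a pair of a phoneme list and a stress value, which is the shape assumed by the entry definition above.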
The `feature-pair' part of a lexical entry allows the specification of homographs (words with different pronunciation but same spelling). For example, consider the phonetic difference in the word `lives' between the sentences `Cats have nine lives' and `He lives in Japan'. In the lexicon the word is represented in two variants thus
(lives (((l ai v z) (1))) ((CAT N) (PLU +)))
(lives (((l i v z) (1))) ((CAT V)))
Of course such entries could equally be distinguished with different citation forms, for example
(lives-n (((l ai v z) (1))))
(lives-v (((l i v z) (1))))
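How CHATR itself chooses between homograph variants is not described here. As an illustration only, the sketch below (in Python, with the invented name `select_entry') picks the first variant whose features contain all the requested feature-pairs. The two entries mirror the feature-marked `lives' variants.

```python
# Each entry: (word, [(phonemes, stress), ...], features as a dict).
ENTRIES = [
    ("lives", [(["l", "ai", "v", "z"], 1)], {"CAT": "N", "PLU": "+"}),
    ("lives", [(["l", "i", "v", "z"], 1)], {"CAT": "V"}),
]


def select_entry(word, wanted, entries=ENTRIES):
    """Return the first entry for `word` whose features include all of `wanted`.

    This matching rule is an assumption made for the example, not CHATR's rule.
    """
    for w, syllables, feats in entries:
        if w == word and all(feats.get(k) == v for k, v in wanted.items()):
            return (w, syllables, feats)
    return None


noun = select_entry("lives", {"CAT": "N"})
verb = select_entry("lives", {"CAT": "V"})
print(noun[1], verb[1])
```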
Currently there is no morphological analysis, which means all words and their inflections (and derivations) need to be explicitly included in the lexicon. This is tedious and is an area for future improvement.
A given set of entries may be used in one of two ways: compiled or directly. For large lists of entries, compiling is highly recommended, as access will be significantly faster than with the direct method. Although access takes time, loading a full lexicon (tens of thousands of entries) is by far the bigger cost.
The function Lexicon Compile takes two filenames as arguments.(6) The first file should contain one s-expression. This should be a call to the function Lexicon Add. It should be of the form
(Lexicon Add [phoneme set] entry~1 entry~2... entry~n)
The function Lexicon Compile takes a file containing the single lexicon s-expression and generates a file suitable as an argument to the function Lexicon Use. Basically it checks the format of the entries and sorts them, ensuring a binary search will be possible. The exact format of the compiled form may change, so you should not depend on the actual output.
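The check-and-sort step, and the binary search it makes possible, can be sketched as follows in Python. This is an illustration of the idea only: the names `compile_entries' and `lookup' are invented, entries are reduced to (word, syllables) pairs, and nothing here reflects CHATR's actual compiled file format.

```python
from bisect import bisect_left


def compile_entries(entries):
    """Check the basic shape of each entry and return them sorted by word."""
    for entry in entries:
        if len(entry) != 2 or not isinstance(entry[0], str):
            raise ValueError(f"malformed entry: {entry!r}")
    ordered = sorted(entries, key=lambda e: e[0])
    keys = [word for word, _ in ordered]  # parallel list of sort keys
    return keys, ordered


def lookup(keys, ordered, word):
    """Binary search for `word`; return its entry or None if absent."""
    i = bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return ordered[i]
    return None


keys, ordered = compile_entries([
    ("word", [(["w", "@@", "d"], 1)]),
    ("beautiful", [(["b", "y", "uu"], 1), (["t", "i"], 0), (["f", "u", "l"], 0)]),
    ("lives-n", [(["l", "ai", "v", "z"], 1)]),
])
print(lookup(keys, ordered, "word"))
```

Sorting once at compile time is what lets each later lookup run in logarithmic rather than linear time.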
If the Lisp variable lexicon_syllabify is set, the entries can be in a different format and CHATR will attempt to syllabify them automatically. The result will not be perfect, but it does offer a way to deal automatically with large imported lexicons where we have little control over the input form. In this format the phonemes are given not as bracketed syllables but simply as a flat list of phonemes (the format in which the CMU and BEEP lexicons are distributed). If the digits 1 and 2 are appended to vowels, they are removed and the syllable containing them is marked as stressed. As an example, an input entry like this
("abductive" (ae b d ah1 k t ih v))
would automatically be converted to
("abductive" (((ae b) (0)) ((d ah k) (1)) ((t ih v) (0))))
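The rules CHATR uses for syllabification are not given here. The Python sketch below is a deliberately naive stand-in that happens to reproduce the example: stress digits are stripped, and when consonants fall between two vowels only the last one starts the next syllable. The vowel test (a phoneme whose name begins with a, e, i, o or u) is an assumption that suits CMU-style symbols only.

```python
def is_vowel(phone):
    # Assumption: CMU-style names, where vowels start with a vowel letter.
    return phone[0] in "aeiou"


def syllabify(phones):
    """Return [(phonemes, stress), ...] from a flat list such as ae b d ah1 ..."""
    # Strip trailing stress digits; 1 and 2 mark the vowel as stressed.
    items = []
    for p in phones:
        if p[-1].isdigit():
            items.append((p[:-1], p[-1] in "12"))
        else:
            items.append((p, False))

    vowel_idx = [i for i, (p, _) in enumerate(items) if is_vowel(p)]
    if not vowel_idx:
        return [([p for p, _ in items], 0)]

    # The first syllable starts at 0; each later one starts at the last
    # consonant before its vowel, or at the vowel itself if none intervenes.
    starts = [0]
    for prev, nxt in zip(vowel_idx, vowel_idx[1:]):
        starts.append(nxt - 1 if nxt - prev > 1 else nxt)

    syllables = []
    for k, start in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else len(items)
        chunk = items[start:end]
        stress = int(any(stressed for _, stressed in chunk))
        syllables.append(([p for p, _ in chunk], stress))
    return syllables


print(syllabify(["ae", "b", "d", "ah1", "k", "t", "ih", "v"]))
```

A real syllabifier would need knowledge of legal onsets and codas; this heuristic will split many words differently from a hand-built lexicon.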
A distinct lexicon may be created using the command
(Lexicon Select name)
If the named lexicon already exists it is selected; if it does not, a new (empty) one is created.
Four items are required in a complete lexicon: a phoneme set, a set of addenda, a compiled lexicon, and an instruction stating what to do if a word is not found in the lexicon.
After creation and compilation, the lexicon is accessed using the following four commands
(Lexicon Phone_Set phoneset-name)
(Lexicon Use file-name)
    where file-name is a file generated by the Lexicon Compile command.
(Lexicon Add phoneset-name entries)
(Lexicon Fail fail_action)
    where fail_action is one of Error, LTS or JLTS.
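To show how the four items of a lexicon fit together, here is a small Python model. It is a sketch under several assumptions, not CHATR's implementation: the class and function names are invented, a dictionary stands in for the compiled file, the addenda are assumed to be consulted before the compiled entries, and `letter_to_sound' is a placeholder rather than real letter-to-sound rules.

```python
def letter_to_sound(word):
    """Placeholder for letter-to-sound rules: one unstressed syllable
    whose 'phonemes' are simply the letters of the word."""
    return [(list(word), 0)]


class Lexicon:
    FAIL_ACTIONS = ("Error", "LTS", "JLTS")

    def __init__(self, phone_set, compiled=None, fail_action="Error"):
        if fail_action not in self.FAIL_ACTIONS:
            raise ValueError(f"unknown fail action: {fail_action}")
        self.phone_set = phone_set            # name of the phoneme set
        self.addenda = {}                     # small set of added entries
        self.compiled = dict(compiled or {})  # stand-in for the compiled file
        self.fail_action = fail_action

    def add(self, word, syllables):
        self.addenda[word] = syllables

    def lookup(self, word):
        # Assumption: addenda take priority over compiled entries.
        if word in self.addenda:
            return self.addenda[word]
        if word in self.compiled:
            return self.compiled[word]
        if self.fail_action == "Error":
            raise KeyError(f"word not in lexicon: {word}")
        return letter_to_sound(word)  # both LTS and JLTS use the placeholder


lex = Lexicon("mrpa", compiled={"word": [(["w", "@@", "d"], 1)]})
print(lex.lookup("word"))
```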
It is possible to directly access a lexicon without creating an utterance. The command is
(Lexicon Lookup word)
If the currently selected speaker uses the `mrpa' phoneset, the system will respond with
(word (((w @@ d) (1))))
Substituting another word for `word' in the above example returns the entry for that word instead.
It may be that a particular phonetic rendering of a word doesn't suit an application. Features may need to change to represent a dialect or speech manner. This may be achieved using the command
(Lexicon Add [phoneset-name] entry~1 entry~2... entry~n)
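where the terms are
phoneset-name
    the (optional) name of the phoneme set in which the entries are coded
entry~1 - entry~n
    lexical entries in the basic form described above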
Referring to the previous example, some might prefer a stronger sounding of the `r' in `word'. Such a new entry would be
(Lexicon Add mrpa (word (((w @@ r d) (1)))))
Note that modifications must be made in the phonetic coding of the phone set used by the currently selected speaker. For instance, entering `@@' (used by `mrpa' but not by `BEEP') when a `BEEP'-coded speaker is selected will result in an error. See section Phoneme Set Definitions for information on obtaining phoneme lists.
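The kind of check that rejects `@@' for a `BEEP'-coded speaker can be pictured with the Python sketch below. The two phoneme sets are tiny illustrative subsets chosen for this example, not the real inventories, and `check_entry' is an invented name.

```python
# Toy subsets for illustration only; see the phoneme set definitions
# for the actual lists.
PHONE_SETS = {
    "mrpa": {"w", "@@", "r", "d"},
    "beep": {"w", "er", "r", "d"},
}


def check_entry(phone_set_name, entry):
    """Raise ValueError if the entry uses phonemes outside the named set."""
    word, syllables = entry
    allowed = PHONE_SETS[phone_set_name]
    unknown = [p for phones, _ in syllables for p in phones if p not in allowed]
    if unknown:
        raise ValueError(
            f"{word}: phonemes {unknown} are not in phone set {phone_set_name}")
    return entry


new_entry = ("word", [(["w", "@@", "r", "d"], 1)])
check_entry("mrpa", new_entry)      # accepted
try:
    check_entry("beep", new_entry)  # rejected: @@ is not in the toy beep set
except ValueError as err:
    print(err)
```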
Modifications stay in effect until they are changed again, a different speaker is selected, or the current session of CHATR is quit.