
Lexicon

The purpose of a lexicon is to translate `words' (arbitrary atomic tokens) into syllables with pronunciation and stress. The lexicon system within CHATR, like many of the other parts of the system, is designed to be very powerful but currently only offers minimal functionality. The system is designed such that we can later replace this module with a more sophisticated one. This also allows users to incorporate a lexicon of their own choice if they so wish.

Current CHATR Lexicons

CHATR currently uses four lexicons. Switching between them is easy, even within a session, and changes the language or dialect. Definitions in the file `lib/data/lexicons.ch' list which lexicon is selected for each speaker.

The lexicon names are

mrpa: The CSTR created lexicon of around 23000 entries.
beep: The Cambridge University BEEP lexicon consisting of around 163000 entries.
cmu: The CMU darpa lexicon (converted to the radio2 phoneme set) consisting of around 99000 entries.
japanese: A lexicon with no entries but a letter-to-sound function to convert from romaji to the nuuph phoneme set.

Lexicon Entries

A lexicon contains a compiled set of entries and/or a (typically) small set of addenda items. There is also a flag to define what should be done if a word cannot be found in the lexicon--either fail or apply some form of letter-to-sound rules. Although a lexicon has a specific phoneme set, mapping may be performed between a lexicon's phoneme set and the currently selected CHATR internal phoneme set, if such a map is defined.

The basic form of a lexical entry is

     (word (syllable~1... syllable~n) [ features ])

where those elements are defined as

word: An atom - the word to be defined.
syllable~1 - ~n: ((phoneme~1... phoneme~n) (1 | 0))
phoneme~1 - phoneme~n: A series of atoms describing the phonetic sound of that portion of the word.
(1 | 0): An atom indicating whether the syllable formed by the preceding phonemes is stressed (1) or not (0).
features: (feature-pair...). Word category information to help distinguish homographs.
feature-pair: (feature-name feature-value)
feature-name: An atom naming the feature to be defined, such as the grammatical category or numeric value of the word.
feature-value: An atom defining the value or status of the named category, such as `N' for `Noun' or `+' for `affirmative'.

A typical example is

     (beautiful (((b y uu) (1))
                 ((t i) (0))
                 ((f u l) (0))))
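
The bracketing of an entry is easy to get wrong by hand. The following is a minimal Python sketch, not part of CHATR, that renders an entry in the form described above; the function name `format_entry` and its argument layout are assumptions made for illustration.

```python
def format_entry(word, syllables, features=None):
    """Render one lexical entry as an s-expression string.

    syllables: list of (phonemes, stress) pairs, where stress is 1 or 0.
    features:  optional list of (feature-name, feature-value) pairs.
    """
    syl_text = " ".join(
        "(({}) ({}))".format(" ".join(phonemes), stress)
        for phonemes, stress in syllables
    )
    entry = "({} ({})".format(word, syl_text)
    if features:
        feat_text = " ".join("({} {})".format(n, v) for n, v in features)
        entry += " ({})".format(feat_text)
    return entry + ")"


beautiful = format_entry(
    "beautiful",
    [(["b", "y", "uu"], 1), (["t", "i"], 0), (["f", "u", "l"], 0)],
)
print(beautiful)
```

Generating entries this way keeps the parentheses balanced, which matters because a malformed entry would be rejected when the lexicon is checked.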

The `feature-pair' part of a lexical entry allows the specification of homographs (words with the same spelling but different pronunciations). For example, consider the phonetic difference in the word `lives' between the sentences `Cats have nine lives' and `He lives in Japan'. In the lexicon the word is represented by two variants, as follows

     (lives (((l ai v z) (1))) ((CAT N) (PLU +)))
     (lives (((l i v z) (1))) ((CAT V)))

Of course such entries could equally be distinguished with different citation forms, for example

     (lives-n (((l ai v z) (1))))
     (lives-v (((l i v z) (1))))
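
How CHATR itself matches features against homograph variants is not described here. As a rough illustration only, the Python sketch below picks the first variant whose features contain every requested pair; the data layout and the name `select_variant` are hypothetical.

```python
# Two variants of one spelling, with feature pairs held in a dict.
LIVES = [
    {"syllables": [(["l", "ai", "v", "z"], 1)],
     "features": {"CAT": "N", "PLU": "+"}},
    {"syllables": [(["l", "i", "v", "z"], 1)],
     "features": {"CAT": "V"}},
]


def select_variant(variants, wanted):
    """Return the first variant whose features include every wanted pair."""
    for variant in variants:
        feats = variant["features"]
        if all(feats.get(name) == value for name, value in wanted.items()):
            return variant
    return None


noun = select_variant(LIVES, {"CAT": "N"})
verb = select_variant(LIVES, {"CAT": "V"})
print(noun["syllables"])
print(verb["syllables"])
```

With distinct citation forms such as `lives-n' and `lives-v', no feature matching is needed, at the cost of requiring the caller to supply the right form.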

Currently there is no morphological analysis, which means all words and their inflections (and derivations) need to be explicitly included in the lexicon. This is tedious and is an area for future improvement.

Lexicon Compilation

A given set of entries may be used in one of two ways: compiled or directly. For large lists of entries, compiling is highly recommended, as access will be significantly faster than with direct specification. Although lookups take time, loading a full lexicon (tens of thousands of entries) is by far the bigger cost.

The function Lexicon Compile takes two filenames as arguments. The first file should contain a single s-expression: a call to the function Lexicon Add of the form

     (Lexicon Add [phoneme set] entry~1 entry~2... entry~n)

Lexicon Compile reads the file containing the single lexicon s-expression and generates a file suitable as an argument to the function Lexicon Use. In essence, it checks the format of the entries and sorts them so that a binary search is possible. The exact format of the compiled form may change, so you should not depend on the actual output.
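
To show why sorting matters, here is a small Python sketch of the general idea: check the entries, sort them by word, then look words up by binary search. It does not reproduce CHATR's compiled file format, and the names `compile_entries` and `lookup` are assumptions.

```python
import bisect


def compile_entries(entries):
    """Check (word, syllables) pairs and return (sorted keys, sorted entries)."""
    for word, syllables in entries:
        if not isinstance(word, str) or not syllables:
            raise ValueError("malformed entry: {!r}".format(word))
    ordered = sorted(entries, key=lambda entry: entry[0])
    keys = [word for word, _ in ordered]
    return keys, ordered


def lookup(compiled, word):
    """Binary-search the sorted keys; return the syllables or None."""
    keys, ordered = compiled
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return ordered[i][1]
    return None


compiled = compile_entries([
    ("word", [(["w", "@@", "d"], 1)]),
    ("beautiful", [(["b", "y", "uu"], 1), (["t", "i"], 0), (["f", "u", "l"], 0)]),
])
print(lookup(compiled, "word"))
print(lookup(compiled, "missing"))
```

Each lookup inspects only about log2(n) keys, which is why a sorted, compiled lexicon scales to tens of thousands of entries. If a word has several variants, this sketch returns only the first.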

If the Lisp variable lexicon_syllabify is set, the entries can be in a different format and CHATR will attempt to syllabify them automatically. The result will not be perfect, but it offers a way to deal automatically with large imported lexicons where we have little control over the input form. In this format the phonemes are given simply as a flat list (the format in which the CMU and BEEP lexicons are distributed) rather than as bracketed syllables. If the digit 1 or 2 is appended to a vowel, it is removed and the syllable containing that vowel is marked as stressed. As an example, an input entry like this

     ("abductive" (ae b d ah1 k t ih v))

would automatically be converted to

     ("abductive" (((ae b) (0)) ((d ah k) (1)) ((t ih v) (0))))
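
The stress-marking step can be sketched in Python as below. This is only an illustration: CHATR's syllabification algorithm is not described here, so the sketch assumes the syllable boundaries are already known and handles only the digits 1 and 2 mentioned above. The name `mark_stress` is hypothetical.

```python
def mark_stress(syllables):
    """Strip a trailing 1 or 2 from each phoneme and flag stressed syllables.

    syllables: list of syllables, each a list of non-empty phoneme strings.
    Returns a list of (phonemes, stress) pairs, where stress is 1 or 0.
    """
    result = []
    for syllable in syllables:
        stressed = 0
        cleaned = []
        for phoneme in syllable:
            if phoneme[-1] in ("1", "2"):
                stressed = 1
                phoneme = phoneme[:-1]
            cleaned.append(phoneme)
        result.append((cleaned, stressed))
    return result


# Syllable boundaries are supplied by hand here (an assumption).
abductive = mark_stress([["ae", "b"], ["d", "ah1", "k"], ["t", "ih", "v"]])
print(abductive)
```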

Creating a Lexicon

A distinct lexicon may be created using the command

     (Lexicon Select name)

If the named lexicon already exists it is selected; if it does not, a new (empty) one is created.

Four items are required in a complete lexicon: a phoneme set, a set of addenda, a compiled lexicon, and an instruction on what to do if a word is not found in the lexicon.

After creation and compilation, the lexicon is set up using the following four commands

(Lexicon Phone_Set phoneset-name)

Define the phoneme set for the lexicon.

(Lexicon Use file-name)

Optional. Identify a file compiled using the Lexicon Compile command.

(Lexicon Add phoneset-name entries)

Optional. Add one or more entries to the lexicon. The phoneset-name need not be the same as the current lexicon's phone-set name.

(Lexicon Fail fail_action)

This identifies what will happen if a given word is not found in the lexicon. Possible actions are

Error: Signal an error (the default).
LTS: Use letter-to-sound rules to provide a pronunciation. The rules used by CHATR are those developed by the US Naval Research Laboratory, Washington DC.
JLTS: Use Japanese letter-to-sound rules. This assumes the word is in romaji.
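
The effect of the fail action can be pictured with the Python sketch below. It is a simplified model, not CHATR's implementation: the lexicon is a plain dictionary, and `spell_out` is a toy stand-in that bears no relation to the actual NRL or Japanese rule sets. All names are hypothetical.

```python
TOY_LEXICON = {"word": [(["w", "@@", "d"], 1)]}


def spell_out(word):
    """Toy letter-to-sound stand-in: one unstressed syllable of letters."""
    return [(list(word), 0)]


def lookup_with_fallback(lexicon, word, fail_action="Error", lts=None):
    """Return the pronunciation of word, applying fail_action if it is absent."""
    if word in lexicon:
        return lexicon[word]
    if fail_action == "Error":
        raise KeyError("word not found in lexicon: " + word)
    if fail_action in ("LTS", "JLTS"):
        if lts is None:
            raise ValueError("no letter-to-sound function supplied")
        return lts(word)
    raise ValueError("unknown fail action: " + fail_action)


print(lookup_with_fallback(TOY_LEXICON, "word"))
print(lookup_with_fallback(TOY_LEXICON, "cat", "LTS", spell_out))
```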

Accessing a Lexicon

Lexicon Interrogation

It is possible to directly access a lexicon without creating an utterance. The command is

     (Lexicon Lookup word)

If the currently selected speaker uses the `mrpa' phoneset, the system will respond with

     (word (((w @@ d) (1))))

Of course, replacing `word' in the above example with another word will return the details for that word.

Lexicon Modification

It may be that a particular phonetic rendering of a word doesn't suit an application. Features may need to change to represent a dialect or speech manner. This may be achieved using the command

     (Lexicon Add [phoneset-name] entry~1 entry~2... entry~n)

where the terms are

phoneset-name: Optional. If omitted, the phoneme set of the currently selected speaker is used.
entry~1 - entry~n: Lexical entries, each giving a word with its phonemes and stress levels in the form described under Lexicon Entries.

Referring to the previous example, some might prefer a stronger sounding of the `r' in `word'. Such a new entry would be

     (Lexicon Add mrpa (word (((w @@ r d) (1)))))

Note that modifications must use the phonetic coding of the phone-set used by the currently selected speaker. For instance, entering `@@' (used by `mrpa' but not by `BEEP') when a `BEEP'-coded speaker is selected will result in an error. See section Phoneme Set Definitions for information on obtaining phoneme lists.

Modifications stay in effect until they are changed again, a different speaker is selected, or the current CHATR session ends.
