

System Architecture

The CHATR system tries to be many things. It tries to be modular, portable and efficient, and even to produce good synthesis. These goals are not always complementary. However, the authors strongly feel that in order to ultimately improve on the classic `pipeline' TTS structure, it is necessary to introduce a new, more flexible architecture.

Design Philosophy

CHATR can best be thought of as a general system in which modules act on utterances. Each utterance has a complex internal structure. A module may access (read or write) any part of an utterance, though typically it reads many parts but writes only one type of information. A typical module may be a waveform synthesizer, something that takes a description of phonemes, durations and fundamental frequency, and generates a waveform. Or it may be something that takes words, looks them up in a lexicon and returns their pronunciation. Each module may have its own internal structure if necessary; however, its communication with the rest of the system is via an utterance object.

There are many references in this manual to modules and functions, so the meaning of these terms as used in CHATR will now be defined.

A function is a piece of code which performs a small low-level task, such as byte-swapping, file-to-file transfer of phonemes or words (stream building), or removing unneeded phrase parts. Functions are called by modules. Functions can (and often do) call other functions.

A module is a collection of functions, each of which may be called by many other modules. It is a stand-alone unit which takes an input, performs a major task and supplies a complete output, ready for use by the user. Modules are called in the form of functions. A module does not call another module.

Since modules are defined in the form of Lisp functions, the flow of control for synthesis may be fully specified within a Lisp function. Similarly, testing of some functions may be controlled directly by a user without the need to recompile the system.

By default the Synth command decides which modules are to be called based on the utterance type. See `src/chatr/chatr.c:chatr()' for the actual mapping.

There are various levels at which a system like CHATR can be used. At one extreme, CHATR can simply be used as a black box that generates speech; thus it can serve as the speech synthesizer for a general natural language processing system. At the other extreme, a user can add and change modules in the system, adding new features to the synthesizer. Other levels exist: for instance, redefining the HLP rules or adding a new speech database is possible without recompiling the system. CHATR is designed to be both a speech synthesizer and a tool for researching speech synthesis.

In order to offer a uniform environment between the internals of CHATR and the outside world, almost all data and command i/o is done via Lisp s-expressions (bracketed structures). S-expressions offer a very simple but uniform representation for complex objects. This means we need only define one main function for reading and writing data. No special `read-intonation-stats' or `print-segment-stream' function (or syntax) is required. All non-binary data is conventionally represented in this form.

As in Lisp, commands have the generic form

     ( <command name> <arg1> <arg2> ... )

Commands are interpreted by an evaluator. A table of commands relates the Lisp-level command name to a C function which interprets that command. Again, this means there is a uniform method for specifying actions (playing data, saving data, setting parameters etc) within the system.
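For example, creating an utterance and asking for it to be spoken use exactly the same bracketed form (these particular commands are described later in this chapter):

     (set utt1 (Utterance Text "Hello."))
     (Say (Synth utt1))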

Processing Sequence

The processing sequence of CHATR is as follows

            --------------------- 
           |                     |
           |     Text Input      |-----------------
           |                     |                 |
            ---------------------  Voice Input     |
                   |         |                     |
                  \|/         ---------------------
            --------------------- 
           |                     |
           | Phoneme Conversion  |
           |                     |
            --------------------- 
                   |
                  \|/
            --------------------- 
           |                     |
           | Prosody Prediction  |-----------------
           |                     |                 |
            --------------------- Prosody Input    |
                   |         |                     |
                  \|/         ---------------------
            --------------------- 
           |                     |
           |   Unit Selection    |
           |                     |
            --------------------- 
                   |                        ______
                  \|/                      /      \
            ---------------------         /        \
           |                     |       |  Speech  |
           | Waveform Processing | <---> |          |
           |                     |       |   Data   |
            ---------------------         \        /
                   |                       \______/
                  \|/
            --------------------- 
           |                     | ___|\
           |    Audio Output     | ___  )  "Hello, I am CHATR."
      -----|                     |    |/
     |      --------------------- 
       --------------------- 
      |                     |
      |    Text Display     |
      |                     |
       ---------------------

There are several different methods for performing each process. Users may use the system defaults (except for Audio Output (see section Audio Setup - Software)), or, using CHATR commands, select a particular preferred method. Present system defaults are

Text Input
Convert to HLP format.
Phoneme Conversion
Use `mrpa' phoneset.
Prosody Prediction
duration_method lr_dur.
int_method ToBI.
Unit Selection
Use CHATR Library.
Waveform Processing
synth_method UDB
concat_method DUMB+.
duration_stretch 1.0000
pitch_stretch 1.0000
Audio Output
No default. Must be selected in .chatrrc file. See section Audio Setup - Software, for details.
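Most of the defaults above can be changed with CHATR commands. As a purely illustrative sketch (the actual commands for selecting each method are described in the relevant sections of this manual; treating the parameter names listed above as variables that can be changed with the set command is an assumption made here only for illustration):

     ; illustrative sketch only: assumes these parameters may be
     ; changed with the set command, using the names listed above
     (set duration_stretch 1.2)
     (set pitch_stretch 0.9)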

Text Input

There are many forms in which text can be presented to a system.

Voice Input

Voice input may come from several sources.

Utterance Representation

Each utterance is represented internally as an object, many of which may exist in the system at once. The synthesis of the utterance (ultimately a waveform) is generated with respect to various parameters set up beforehand. An utterance consists of a number of levels, called streams. The number of streams may vary depending on the type of synthesis being used. There is a method for declaring which currently defined streams are to be used for utterances. Each stream (at least notionally) consists of a number of ordered cells.(3) Each stream cell has contents which are dependent on the type of stream.

Streams can easily be added to the system, but the version of CHATR described by this manual provides the following

Input
Contains the input form of the utterance, perhaps as a simple list of labeled words, or as a phrase structure tree from a generation system. This level will always have a value, even though the actual form of it may vary from utterance to utterance.
Phrase
A normalized phrasal form of the given utterance. It is a standard tree structure whose leaves are word cells and nodes are phrase cells. Pointers allow algorithms to traverse this structure (up or down, left or right) as required.
Word
A simple left to right specification of `words'. Various syntactic and intonational features may also be specified on these words.
Syllable
An ordered list of syllables.
Phoneme
An ordered list of phonemes.
Intones
An ordered list of intonation features. The form depends on the intonation method being used.
RFC
An ordered list of RFC elements.
Segment
An ordered list of segments with full specification suitable for whatever synthesis method is to be used. Typically each segment will contain a phoneme, duration, pitch etc.
Wave
The waveform itself. The exact coding of the waveform is a function of the synthesis method used, but the waveform is represented in a structure that contains enough information for the various play functions to deal with it.
Unit
An ordered list of speech database unit descriptions. These are used in the concatenative synthesis method to hold the descriptions of the sections that are to be joined (by a selected process) to form one whole waveform.

The basic architecture enables each cell on each level to be linked to any number of cells on any other level. As a result it is easy to find, for instance from a phoneme, which syllable it is part of. Through that (or even directly), the word it lies within can be found. Likewise, the phonemes in each word are available by following a pointer. Importantly, these levels are not in a simple hierarchy. Although there is an obvious hierarchical relationship between words, syllables and phonemes, there is no such obvious relationship between intones and phonemes. Therefore a strict hierarchy is not built in; any level may be related to any other level as required.

It is very important to state that the above levels are not strict. More could be added, or the existing ones ignored. Currently we have not fixed any, though it does appear the segment stream can be thought of as an important level between the high level aspects of synthesis and the underlying waveform generating synthesis method. We have not followed other systems, which are defined as a strict pipeline of processes where each module feeds data to the next module in the pipe. Such a model means that one module fixes what information is available to later modules in the pipe; a requirement for more data in a later module might require changes in all previous modules so that the necessary information is available. Here all modules can access all levels (though typically they do not), without any dependency on other modules.

Simply put, the overall synthesizer system takes an utterance object in which most levels are not yet filled in. Various modules are called (depending on parameters) to fill in these levels, eventually producing a waveform (if requested) that can be played by several mechanisms.

Basic Utterance Types

Input to CHATR is in the form of an utterance created by the Utterance command. Several types of input may be specified at quite different levels, varying from raw text to a simple waveform. The current possibilities are

Text
At the highest and most abstract level, simple strings of words may be given as utterances. These are in the form
     (Utterance
      Text
      "You can pay for the hotel with a credit card.")
Of course, with such a high level input, little control may be exercised over the prosodic form. This is, however, the simplest input type.
HLP
A high level `linguistic' structure. Basically a tree where each node is labeled with a feature structure, a set of feature value pairs. Leaf nodes represent words. See section HLP Processing, for more detail on this form of input. An example is
     (Utterance 
      HLP
      (((CAT S) (IFT Statement))
       (((CAT NP) (LEX you)))
       (((CAT VP))
        (((CAT Aux) (LEX can)))
        (((CAT V) (LEX pay)))
        (((CAT PP))
         (((CAT Prep) (LEX for)))
         (((CAT NP))
          (((CAT Det) (LEX the)))
          (((CAT N) (LEX hotel)))))
        (((CAT PP))
         (((CAT Prep) (LEX with)))
         (((CAT NP))
          (((CAT Det) (LEX a)))
          (((CAT Adj) (LEX credit) (Focus +)))
          (((CAT N) (LEX card))))))))
PhonoWord
A lower level representation, which explicitly states prosodic phrase, pitch range and intonation features for a set of words. This format is intended to capture the level of information available in a ToBI labeled utterance. The format is again a tree, though its depth is limited by the number of phrase levels (currently defined to be four: Discourse, Sentence, Clause, and Phrase). Each phrase may have an optional PitchRange feature. Also, each word may be labeled with intonational features. A typical example is
     (Utterance
      PhonoWord
      (:D ()
          (:S ()
              (:C ()
                  (my (B (i)) )
                  (sister (H (l)))
                  (who)
                  ((lives (CAT V)))
                  (in)
                  (edinburgh (H(d))(B ())))
              (:C ((PitchRange one))
                  (knows (B(i)))
                  (an)
                  (electrician (H (d)) ))
              )
          )
      ))
If the intonation method is set to ToBI then it is possible to specify ToBI-like utterances in this form. No direct representation of break levels is currently possible in this mode, but the bracketed four-level structure offers the same level of representation as numbered break levels. Note that in order to stop the ToBI (and JToBI) modules from ignoring your specification, you must set HLP_realise_strategy to Simple_Rules.
     (Utterance
      PhonoWord
      (:D ()
          (:S ()
              (:C ()
                  (marianna (H*))
                  (made)
                  (the)
                  (marmalade (H*) (L-L%))))))
PhonoForm
This format allows specification of prosodic phrases, words, syllables, intonation labels, segments, duration, power and pitch--in fact, almost everything that CHATR itself might predict during synthesis of a text string. This form is typically used to represent database information. See section PhonoForm Utterance Types, for information on how to build such representations automatically. An example is
     (Utterance
      PhonoForm
      (:D nil
       (:S ((PauseLength 65))
        (Word Attorney nil
         (Syl ax () (Phoneme ax 70 8.5100 ((187.0000 35))))
         (Syl t.er ((Stress 1) (Intones HiF0 H*))
          (Phoneme t 110 7.1200 ((242.0000 55)))
          (Phoneme er 80 8.7500 ((255.0000 40))))
         (Syl n.iy nil
          (Phoneme n 50 8.6700 ((233.0000 25)))
          (Phoneme iy 60 8.3400 ((193.0000 30)))))
        (Word General ((Break 1))
         (Syl d.jh.eh.n ((Stress 1) (Intones !H*))
          (Phoneme d 60 7.7300 ((173.0000 30)))
          (Phoneme jh 40 7.4100 ((226.0000 20)))
          (Phoneme eh 110 8.4300 ((205.0000 55)))
          (Phoneme n 30 8.2700 ((196.0000 15))))
         (Syl axr () (Phoneme axr 130 8.2600 ((158.0000 65))))
         (Syl el ((Intones L-H%))
          (Phoneme el 110 7.9000 ((180.0000 55))))))
       (:S nil
        (Word James nil
         (Syl d.jh.ey.m.z ((Stress 1) (Intones H*))
          (Phoneme d 70 7.0900 ((182.0000 35)))
          (Phoneme jh 50 7.0800 ((184.0000 25)))
          (Phoneme ey 150 8.2200 ((154.0000 75)))
          (Phoneme m 100 7.7600 ((143.0000 50)))
          (Phoneme z 30 6.7700 ((200.0000 15)))))
        (Word Shannon ((Break 1))
         (Syl sh.ae.n ((Stress 1) (Intones HiF0 H*))
          (Phoneme sh 90 6.9900 ((200.0000 45)))
          (Phoneme ae 150 8.3700 ((172.0000 75)))
          (Phoneme n 80 8.1000 ((144.0000 40))))
         (Syl ax.n ((Intones L-L%))
          (Phoneme ax 30 7.4400 ((104.0000 15)))
          (Phoneme n 50 7.1100 ((145.0000 25))))))))
Segment
A low level specification of an utterance is also possible. The utterance format allows the specification of segments (phonemes) with durations and F0 target values. This input type is also the same format as that generated by the save segment command. This allows fine control over what is actually to be synthesized. The fields in each segment are: segment name, duration in milliseconds, power, and a list of F0 targets. Each target consists of a frequency in Hz, followed by an index in milliseconds into the segment at which that target frequency is desired. An example is
     (Utterance 
      Segment
      (
       ( #    50      0       ((80 0)))
       ( m    58      0       ((80 0)))
       ( ai   148     0       ((140 42) (135 101)))
       ( s    105     0       ())
       ( i    94      0       ((205 18)))
       ( s    61      0       ((145 44)))
       ( t    45      0       ())
       (     60      0       ())
       ( h    59      0       ())
       ( uu   140     0       ())
       ( l    80      0       ())
       ( i    97      0       ())
       ( v    51      0       ())
       ( z    60      0       ())
       ( i    97      0       ((80 78)))
       ( n    58      0       ())
       ( e    115     0       ((130 23)))
       ( d    54      0       ((80 28)))
       ( i    43      0       ())
       ( n    39      0       ())
       ( b    74      0       ())
       ( uh   100     0       ())
       ( r    22      0       ((80 2)))
       (     80      0       ())
       ( #    200     0       ((130 0)))))
SegF0
A lower level representation more suitable for representing natural utterances. This allows the F0 to be specified as a separate file (i.e. as generated by a pitch tracker). An example is
     (Utterance
      SegF0
      ("MHT01.f0"
       (
       ( PAU 255 0 () )
       ( a 85 0 () )
       ( r 10 0 () )
       ( a 95 0 () )
       ( y 45 0 () )
       ( u 105 0 () )
       ( r 20 0 () )
       ( u 95 0 () )
       ( g 55 0 () )
       ( e 85 0 () )
       ( N 100 0 () )
       ( j 45 0 () )
       ( i 75 0 () )
       ( ts 125 0 () )
       ( u 90 0 () )
       ( o 177.5 0 () )
       ( PAU 447 0 () )
       ( s 117 0 () )
       ( u 42 0 () )
       ...
       )
      )
     )
In this case the F0 is specified in a separate file, with F0 points specified one per line. Each line should consist of two numbers, a position in milliseconds from the start of the utterance, and the desired F0 value in Hz. Each may optionally be surrounded by parentheses. Alternatively, the F0 may be specified directly inline in the utterance: instead of a file name, that part may be a list of bracketed pairs of positions in milliseconds and Hz values. The pairs, whether given in an explicit list or in a file, need not be at regular intervals, but should be in order. (A sketch of such an F0 file appears after this list of utterance types.)
RFC
Similar to the segment type, but F0 information is represented as RFC elements. RFC elements can be automatically extracted from natural speech, thus we can impose natural intonation onto a string of segments. Each segment contains: the segment name, duration in milliseconds, and the list of RFC elements for that segment. An example is
     (Utterance
      RFC (
      (sil    335     ( ( sil 0 135 ) ))
      (hh     48      ( ( conn 28 135 )))
      (ax     23      ())
      (l      30      ( ( fall 21 138 )))
      (ow     224     ( ( conn 192 82 )))
      (sil    327     ( ( sil 56 87 )))
      (ls     77      ( ( conn 74 121 )))
      (ih     90      ( ( rise 47 119 )))
      (z      42      ())
      (dh     29      ())
      (ih     56      ())
      (s      72      ( ( fall 58 163 )))
      (dh     32      ())
      (iy     54      ())
      (h#     103     ( ( conn 54 111 )))
      (ao     22      ())
      (f      54      ())
      (ax     35      ())
      (s      66      ())
      (f      53      ())
      (er     45      ())
      (dh     23      ())
      (ax     32      ())
      (k      90      ())
      (aa     87      ())
      (n      26      ())
      (f      59      ())
      (r      45      ())
      (ax     44      ())
      (n      45      ( ( sil 16 115 )))
      (s      134     ())
      (sil    542     ()) ))
Syllable
As many aspects of intonation are syllable aligned, another input method allows explicit syllable representation with associated intonational information. (This method currently only really supports Tilt intonation, but will eventually be changed to support all intonation methods.) An example is
     (Utterance 
      (Syllable (space rfc) (format feature))
      (
       (:C ((PitchRange one) (start top))
         ((hh  48)  (ax  23)           ())
         ((l   30)  (ow 224)           ((H (ds))))
       )
       (:C ((PitchRange one))
         ((ih 90)  (z  42)             ())
         ((dh 29) (ih 56) (s 72)       ((H (us))))
         ((dh 32) (iy 54)              ())
         ((ao 126)                     (H (ds)))
         ((f  54) (ax 35) (s  66)      ((C (r))))
         ((f  53) (er 45)              ())
         ((dh 23) (ax 32)              ())
         ((k  90) (aa 87)(n  26)       ((H (ds))))
         ((f  59) (r  45) (ax 44) (n  45) (s 134)      ((B ())))
       )
      ))
This example also shows how the utterance type may include other features identifying sub-type information. If the type is non-atomic, it may include a feature list which may be accessed later during synthesis.
Wave
Waveforms may also be specified as input. Of course, when synthesis of an utterance of this type is requested, nothing significant happens as the waveform already exists. The third argument in the input type consists of the filename of the waveform, followed by a list of features identifying the waveform type. Features are file_type, sample_rate, and coding. If no features are specified, the value of the global wave file-type (set by the command Wave_Filetype) is used. An example is
     (Utterance Wave 
        ("/usr/pi/data/cmu/maem/wav/C01.03.wav"
         (file_type "nist")))
If file_type is `raw', a sample rate and coding type should be specified. If no sample rate is given, the global rate (set by the command Sampling_Rate) is used. No default is available for coding, so one of the following must be given
lin16MSB
linear 16 bits, most significant byte first--sparcs, 68000, hppa etc.
lin16LSB
linear 16 bits, least significant byte first--DEC mips, alpha, intel 386 etc.
An example is
     (Utterance Wave 
        ("/usr/pi/data/cmu/maem/wav/C01.03.raw"
         (file_type "raw") (sample_rate 16000) (coding lin16LSB)))

Utterances are created by the Utterance command. They may be saved in variables (using the set command) and then given to other commands as arguments. Thus one can do

     (set utt1 (Utterance HLP ...))
     #<Utt 349078>
     (Synth utt1)
     #<Utt 349078>
     (Say utt1)

Or you can pass the result of the Utterance command directly as an argument to another function, thus

     (Say (Synth (Utterance HLP ...)))
     #<Utt 349078>

However, it is often useful to save the utterance in a variable so it may be referenced later. A common form is

     (set utt1 (Utterance HLP ...))
     #<Utt 349078>
     (Say (Synth utt1))
     #<Utt 349078>

There is a notion of a current utterance. Many commands that take an utterance as an argument will use the current utterance if no argument is actually given. The current utterance is the utterance generated by the most recent Utterance command--irrespective of any other utterances that have been referenced in between.
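As a hedged illustration of the current utterance (this assumes Synth and Say are among the commands that fall back to the current utterance when called with no argument):

     ; assumes Synth and Say accept the no-argument form and then
     ; operate on the current (most recently created) utterance
     (Utterance Text "Good morning.")
     (Synth)
     (Say)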

There is a system which allows an utterance to be synthesized and played, and any other arbitrary function to be called, with the newly created utterance as an argument. This follows the ideas of EMACS by offering hooks. If the variable utt_hook is set to either a function name or a list of function names, these functions are called, in order, with the new utterance as an argument. For example, if you wish all new utterances to be synthesized and played at the time they are created, you may use the command

     (set utt_hook (list Synth Say))

A similar hook (synth_hook) also exists for use after the full waveform is synthesized by the Synth or Synthesize commands. This is intended for specifying low-level waveform manipulations, such as altering the gain or sampling frequency.
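By analogy with the utt_hook example above, a hedged sketch of setting synth_hook is given below; my_adjust_gain is a hypothetical user-defined function, not a built-in CHATR command.

     ; my_adjust_gain is hypothetical; any function taking the newly
     ; synthesized utterance as its argument could be named here
     (set synth_hook (list my_adjust_gain))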

Text-to-Speech Input

Another utterance input method is provided by a very simple text-to-speech CHATR module. Text in files (or from standard input) may be directly synthesized via the Textfile command, which takes one argument, a filename. The file is assumed to be a text file. This command reads in `sentences' and builds HLP utterances from them (a little crudely!). Sentences are defined as a string of tokens terminated by a full stop, question mark, exclamation mark or blank line. This input is really too low level for normal use, and hence a number of wrap-around functions are offered. The functions tts, jtts and mtts offer English, Japanese and mixed text-to-speech respectively.

All tts functions take a single file name as an argument. If the file name `-' is given, CHATR will read from standard input. When in this interactive sub-mode, a different prompt is used. See section The Command-line Prompt, for an example. To exit from keyboard tts mode, enter an empty sentence. That is, after finishing a sentence, enter a single full stop.

The CHATR tts system synthesizes on a sentence-by-sentence basis. Ends of sentences are identified by blank lines, full stops, question or exclamation marks. Note that CHATR does not yet support Japanese input in Kana or Kanji form while in interactive mode. Romaji may be used. Japanese is fully supported from files, however, and also by using the EMACS interface.
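For example, a file of Japanese text may be synthesized with jtts (the file name here is purely illustrative)

     (jtts "japanese-text")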

A simple example is

     (tts "-")
     Hello, this is a speech synthesis system.
     Sentences may be broken over lines,
     but will not be synthesized until the end
     of the actual sentence.
     .
     (tts "War-and-Peace")

Note that this sub-system is still pretty minimal. There is work to be done in adding a parser to the text-to-CHATR sub-system, and later work will add better treatment of numbers, acronyms etc. At present, although it works, there are still many things that could be done to improve it.

In text mode, audio output is asynchronous, allowing synthesis of the next utterance while the previous one is still being played. This goes some way toward reducing the pauses between utterances. This incremental form of synthesis is still a little crude, but is quite adequate.

Multi-lingual Text Processing


     input text        _________ 
      _________       |         |
     |         |      |........ |
     |.........|      |         |  filters
     |...      |      |         |
     |........ |      |......   |\   ___
     |.........|      |.......  | \ |   |
     |.......  |     /|_________|  \|L1 |________________________
     |.....    |    /               |___|        \               \
     |         |   /   _________                  \               \
     |......   |  /   |.........|    ___           \               \
     |........ | /    |         |   |   |           \               \
     |........ |/     |         |---|L2 |     _ _ _ _\_ _ _ _ _ _ _ _\_ _ _
     |.........|------|..       |   |___|\   |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
     |.......  |\     |         |         \___/____________/_____/
     |.....    | \    |         |    ___                  /
     |         |  \   |_________|   |   |________________/
     |.......  |   \                |L3 |
     |........ |    \  _________   /|___|             phone sequence
     |...      |     \|         | /
     |    .    |      |         |/   ___                   /
     |    .    |      |....     |   |   |                 /
     |    .    |      |         |   |Ln |                /
     |         |      |         |   |___|               /
     |_________|      |         |                      /
                      |_________|    /                /
    (multilingual)                  /                /
                                   /                /
                                  /  word          /  speaker-specific
                                 /   sequence     /   phone mappings
                                /                /
                               /                /
              ________________/____________    /
             |___|_______|_|___________|___|  /
                                             /
              _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /
             |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|

Utterance Modules

Several modules are called, each taking an utterance and additional information as parameters; which modules are called depends on the utterance itself and various global settings. The modules are called from a high level function invoked when the synthesis of an utterance is requested. There are many modules, but each is designed to be self-contained (though each typically depends on various lower level architecture access routines). Some of the modules are

HLP
The High Level Prosodic module. Converts arbitrary feature structure descriptions of utterances to a structure with lower level prosodic aspects filled in (particularly, tune, phrasing and pitch range). The conversion is rule based, automatically adding intonation and phrasing information.
Lexicon
Looks up the `words' in a lexicon to provide syllable and phoneme information.
Phoneme-to-Segment
Creates the segment stream from the phoneme stream. This module deals with the various results of assimilation that occur when the phonemes of individual words are concatenated. It also performs reduction of vowels (and whole syllables) in unaccented positions.
Intonation
Constructs an F0 from an abstract representation, such as ToBI labels, Tilt Labels, or whatever.
Duration
Currently five versions of this module are provided. These assign durations to segments: one is based on the Klatt work from MITalk, one uses neural nets, two use linear regression, and another predicts durations for Japanese.
Synthesis
Generates the waveform for the utterance. This may call a number of different sub-modules. It may use the formant synthesizer, or concatenation of units from a speech database. There are a number of options to select from here, including non-uniform unit selection.
Audio
Deals with the playing of the synthesized waveform. Again, there are a number of options available that allow the waveforms to be played by different mechanisms depending on your environment.

There are other modules too. The point to be made here is that they can vary, and synthesis researchers may wish to add their own. See section Developing New Modules for CHATR, for how to modify the system and add your own modules.

