Recognition of Dialog Speech
-- A Robust Spoken Dialog System --

Seiichi NAKAGAWA, Mikio YAMAMOTO, and Atsuhiko KAI

Department of Information and Computer Sciences, Toyohashi University of Technology
Tempaku-cho, Toyohashi, Aichi, 441, JAPAN
e-mail: nakagawa@say1gw.tutics.tut.ac.jp

We study robust recognition and interpretation methods for spontaneous speech. Many speech recognition systems employ syntactic constraints to reduce the search space of candidate strings corresponding to the speech. This is a good approach for read speech. In spontaneous speech, however, the syntactic constraints are much weaker than in read speech. In addition, interjections, repairs, and similar phenomena make speech recognition more difficult.
Our research falls into three parts. (1) We compare several recognition methods for spontaneous speech to examine the robustness of each method. (2) We conduct experiments to estimate the vocabulary size needed to recognize spontaneous speech and to observe the human ability to correct misrecognized results. (3) We develop a robust spoken dialog system whose interpreter receives recognition results that may include recognition errors. The interpretation system is based on the error-correction strategy that humans showed in the experiments of part (2).
Although spontaneous speech recognition and understanding have been studied extensively, the main approaches that conventional systems use to analyze and verify spoken language have not been sufficiently evaluated on spontaneous speech. We compare speech understanding systems that differ in recognition strategy, that is, in how they analyze and verify the speech input, and that can process several significant spontaneous speech phenomena, under equivalent syntactic and semantic constraints. The island-driven parsing strategy showed a sentence understanding rate comparable to that of the left-to-right parsing strategy when the poorer acoustic model was used. However, the One-Pass (left-to-right parsing) method consistently obtained better phrase recognition accuracy and was significantly superior when the better acoustic model was used. We therefore found that a more refined acoustic model, and a more optimized verification process between the utterance and the concatenation of acoustic models under the assumed linguistic constraints, are important for spontaneous speech with weak syntactic constraints, just as they are for read speech with strong constraints.
We conduct two experiments on spontaneous speech dialog systems. The first concerns the size of the vocabulary that appears in the recognition of spontaneous speech. We examine the relationship between the number of different words and the total number of input sentences. The results show that a system-initiative dialog system requires a vocabulary of reasonable size, whereas a user-initiative dialog system requires an unlimited vocabulary.
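The relationship between the number of different words and the number of input sentences can be tabulated as a cumulative count. The following is a minimal sketch, not the procedure used in the experiment: the function name `vocabulary_growth` and the toy sentences are assumptions made for illustration, and the input is assumed to be already split into words.

```python
from typing import Iterable, List


def vocabulary_growth(sentences: Iterable[List[str]]) -> List[int]:
    """Return the cumulative number of distinct words after each sentence."""
    seen = set()
    growth = []
    for words in sentences:
        seen.update(words)        # add any words not seen before
        growth.append(len(seen))  # vocabulary size so far
    return growth


# Toy, made-up input (not data from the experiment).
toy_sentences = [
    ["where", "is", "the", "lake"],
    ["where", "is", "the", "hotel"],
    ["is", "the", "lake", "near"],
]
growth = vocabulary_growth(toy_sentences)
print(growth)  # [4, 5, 6]
```

A curve that flattens as more sentences arrive suggests a bounded vocabulary, while a curve that keeps rising suggests that new words continue to appear.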
The purpose of the second experiment is to observe the human ability to correct misrecognized results. The results show that humans can correctly understand many misrecognized sentences. In particular, when they can refer to the context in which the utterance was produced, they can correct about half of the misrecognized sentences. We can also say that humans easily correct misrecognized post-positions, but that misrecognized content words are very difficult to correct.
The interpretation system that receives the recognition results faces two difficulties. (1) Because spontaneous speech does not consist of well-formed sentences, interpretation is difficult even when the recognition result is correct. (2) The recognition results for spontaneous speech may contain recognition errors. The accuracy of the recognition system used in the above experiments is about 50%; that is, about half of the inputs contain some recognition errors. The interpretation part of the dialog system has to identify misrecognized words, correct them, and extract the correct meaning representation.
We developed a robust interpretation method and applied it to the dialog system. The method uses heuristics for post-position omissions and for inversions, together with a top-down strategy based on context knowledge.
The interpretation system interprets a recognition result in the following steps.

  • Morphological analysis: uses the JUMAN system developed at Kyoto University.
  • Bunsetsu phrase analysis: translates the morpheme sequence into a bunsetsu phrase sequence.
  • Syntactic analysis: uses a chart-based parser with Japanese KAKARI-UKE rules.
  • Semantic analysis: translates the parse tree into a semantic representation.
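
The data flow through the four steps above can be sketched as a chain of stages in which each stage consumes the output of the previous one. This is only an illustration under stated assumptions: every function below is a hypothetical placeholder, and none of them reproduces JUMAN, the chart parser, or the KAKARI-UKE rules.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], recognition_result: Any) -> Any:
    """Feed the output of each stage into the next one, in order."""
    data = recognition_result
    for _name, stage in stages:
        data = stage(data)
    return data


# Hypothetical stand-ins for the real analysis components.
def morphological_analysis(text: str) -> List[str]:
    return text.split()  # placeholder: whitespace tokens as "morphemes"


def bunsetsu_analysis(morphemes: List[str]) -> List[List[str]]:
    return [[m] for m in morphemes]  # placeholder: one morpheme per phrase


def syntactic_analysis(phrases: List[List[str]]) -> Dict[str, Any]:
    return {"phrases": phrases}  # placeholder for a parse tree


def semantic_analysis(tree: Dict[str, Any]) -> Dict[str, int]:
    return {"n_phrases": len(tree["phrases"])}  # placeholder representation


STAGES: List[Stage] = [
    ("morphological", morphological_analysis),
    ("bunsetsu", bunsetsu_analysis),
    ("syntactic", syntactic_analysis),
    ("semantic", semantic_analysis),
]

result = run_pipeline(STAGES, "fuji-san e ikitai")
print(result)  # {'n_phrases': 3}
```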


    Heuristics for interpreting spontaneous speech and for correcting errors are used in the syntactic analysis part (step 3). The heuristics are of three kinds: post-position and inversion rules, filtering rules, and key-word analysis rules.
    The post-position and inversion rules cope with omissions and substitutions of post-positions and with inversions. We extracted these rules from a corpus of spoken dialog transcripts. The rules can analyze ninety percent of the post-position omissions and inversions in spontaneous speech.
    The filtering process receives the semantic representation and either translates it into the correct representation or rejects it if needed. If the input representation is correct, the filtering process does nothing. Each filtering rule has a triggering pattern and a correction method.
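
One way to picture a rule that pairs a triggering pattern with a correction method is sketched here. It is an assumption-laden illustration rather than the system's actual rule format: the `FilterRule` class, the dictionary-based meaning representation, and both example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Meaning = Dict[str, str]


@dataclass
class FilterRule:
    trigger: Callable[[Meaning], bool]               # triggering pattern
    correct: Callable[[Meaning], Optional[Meaning]]  # correction; None means reject


def apply_filters(rules: List[FilterRule], meaning: Meaning) -> Optional[Meaning]:
    """Apply the first rule whose trigger fires; otherwise pass the input through."""
    for rule in rules:
        if rule.trigger(meaning):
            return rule.correct(meaning)
    return meaning  # no trigger fired: the representation is left as it is


# Hypothetical rules, invented for this sketch.
RULES = [
    # Reject a price question that names no object.
    FilterRule(
        trigger=lambda m: m.get("act") == "ask_price" and "object" not in m,
        correct=lambda m: None,
    ),
    # Rewrite an abbreviated object name to its full form.
    FilterRule(
        trigger=lambda m: m.get("object") == "fuji",
        correct=lambda m: {**m, "object": "mt_fuji"},
    ),
]

print(apply_filters(RULES, {"act": "ask_route", "object": "fuji"}))
```

Returning `None` for rejection keeps the three outcomes (corrected, rejected, unchanged) distinguishable by the caller.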
    The key-word analysis process is invoked when all other methods fail to interpret the recognition result. It receives the chart database, which stores the analysis results for parts of the input sentence, and decides the meaning of the whole input without using syntactic relations. Each key-word analysis rule has a key-word list and a skeleton of meaning.
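
A rule made of a key-word list and a skeleton of meaning might be matched as in the sketch below. The details are assumptions for illustration only: the chart database is reduced to a flat list of recognized words, the `KeywordRule` class and the example rules are hypothetical, and preferring the rule with the most key-words is a choice made here, not something stated for the original system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass
class KeywordRule:
    keywords: FrozenSet[str]   # key-word list
    skeleton: Dict[str, str]   # skeleton of meaning


def keyword_analysis(rules: List[KeywordRule],
                     words: Iterable[str]) -> Optional[Dict[str, str]]:
    """Pick the matching rule with the most key-words, ignoring word order."""
    found = set(words)
    best: Optional[KeywordRule] = None
    for rule in rules:
        if rule.keywords <= found:  # every key-word of the rule was found
            if best is None or len(rule.keywords) > len(best.keywords):
                best = rule
    return dict(best.skeleton) if best is not None else None


# Hypothetical rules, invented for this sketch.
RULES = [
    KeywordRule(frozenset({"hotel"}), {"act": "ask_info", "object": "hotel"}),
    KeywordRule(frozenset({"hotel", "price"}), {"act": "ask_price", "object": "hotel"}),
]

print(keyword_analysis(RULES, ["uh", "price", "hotel", "is"]))
```

Because only set membership is tested, the result does not depend on the order or the syntactic relations of the recognized words.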
    We applied this robust spoken dialog understanding system to the task of "sight-seeing guidance for Mt. Fuji." The vocabulary size of the system is about 250 words, and the perplexity of the grammar is about 70. The interpreter developed for the system showed almost the same performance as humans.

    Keywords: spontaneous speech recognition, robustness, ill-formed sentence, spoken dialog system