Recognition of Dialog Speech
-- A Robust Spoken Dialog System --
Seiichi NAKAGAWA, Mikio YAMAMOTO, and Atsuhiko KAI
Department of Information and Computer Sciences,
Toyohashi University of Technology
Tempaku-cho, Toyohashi, Aichi, 441, JAPAN
e-mail: nakagawa@say1gw.tutics.tut.ac.jp
We study robust recognition and interpretation methods for spontaneous
speech. Many speech recognition systems employ syntactic constraints
to reduce the search space of candidate word strings for an utterance.
This is a good approach for read speech. In spontaneous speech, however,
the syntactic constraint is much weaker than in read speech. Also,
interjections, repairs, and similar phenomena make speech recognition
more difficult.
We can classify our research into three parts. (1) We compare several
recognition methods for spontaneous speech to examine the robustness of
each method. (2) We conduct experiments to estimate the vocabulary size
needed for recognizing spontaneous speech and to observe the human
ability to correct misrecognition results. (3) We develop a robust
spoken dialog system whose interpreter accepts recognition results that
may contain recognition errors. The interpretation system is based on
the human error-correction strategy observed in experiment (2).
While spontaneous speech recognition and understanding have been studied
extensively, the main approaches that conventional systems use to
analyze and verify spoken input have not been sufficiently evaluated on
spontaneous speech. We compare speech understanding systems that differ
in how they analyze and verify the speech input, and that can process
several significant spontaneous speech phenomena, using equivalent
syntactic and semantic constraints. With the poorer acoustic model, the
island-driven parsing strategy achieved a sentence understanding rate
comparable to that of the left-to-right parsing strategy. However, the
One-Pass (left-to-right parsing) method consistently obtained better
phrase recognition accuracy and showed a significant advantage with the
better acoustic model. We therefore found that a more refined acoustic
model, and a more optimized verification process matching the utterance
against concatenations of acoustic models under the assumed linguistic
constraints, are as important for spontaneous speech with weak syntactic
constraints as for read speech with strong constraints.
We conducted two experiments concerning spontaneous speech dialog
systems. The first concerns the vocabulary size that appears in the
recognition of spontaneous speech: we examine the relationship between
the number of distinct words and the total number of input sentences.
The results show that a system-initiative dialog system requires a
vocabulary of reasonable size, whereas a user-initiative dialog system
requires an essentially unlimited vocabulary.
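The first experiment amounts to tracing a vocabulary growth curve: the number of distinct words seen after each successive input sentence. The following is a minimal sketch, assuming the transcripts are already segmented into words; the function name `vocabulary_growth` and the toy romanized answers are hypothetical and are not data from the experiment.

```python
from typing import Iterable, List, Set


def vocabulary_growth(sentences: Iterable[List[str]]) -> List[int]:
    """Number of distinct words seen after each successive sentence."""
    seen: Set[str] = set()
    curve: List[int] = []
    for words in sentences:
        seen.update(words)
        curve.append(len(seen))
    return curve


# Hypothetical system-initiative answers (romanized, word-segmented).
answers = [
    ["hai"],
    ["kawaguchiko", "desu"],
    ["hai"],
    ["yamanakako", "desu"],
    ["iie"],
]
print(vocabulary_growth(answers))  # [1, 3, 3, 4, 5]
```

A curve that levels off as sentences accumulate indicates that a bounded vocabulary suffices; a curve that keeps rising corresponds to the user-initiative case.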
The purpose of the second experiment is to observe the human ability to
correct misrecognition results. The results show that humans can
correctly understand many misrecognized sentences. In particular, when
they can refer to the context in which an utterance was produced, they
can correct about half of the misrecognized sentences. Humans also
easily correct misrecognized postpositions, but find it very difficult
to correct misrecognized content words.
The interpretation system that receives the recognition results faces two
difficulties. (1) Since spontaneous speech does not consist of
well-formed sentences, the result is difficult to interpret even when it
is recognized correctly. (2) The recognition results of spontaneous
speech may contain recognition errors. The accuracy of the recognition
system used in the above experiments is about 50%, so about half of the
inputs contain some recognition errors. The interpretation part of the
dialog system therefore has to identify misrecognized words, correct
them, and extract the correct meaning representation.
We developed a robust interpretation method and applied it to the
dialog system. The method uses heuristics for omitted postpositions and
inverted word order, together with a top-down strategy based on context
knowledge.
The interpretation system interprets a recognition result in the
following steps.
Heuristics for interpreting spontaneous speech and correcting errors
are used in the syntactic analysis part (step 3). They fall into three
kinds: postposition and inversion rules, filtering rules, and keyword
analysis rules.
The postposition and inversion rules cope with omitted and substituted
postpositions and with inversions. We extracted these rules from a
corpus of spoken dialog transcripts. They can analyze ninety percent of
the postposition omissions and inversions in spontaneous speech.
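The rules themselves are not listed here, so the following is only a hypothetical sketch of how omission, substitution, and inversion heuristics of this kind could work. It assumes each phrase arrives as a (content word, postposition) pair, with `None` for an omitted postposition, and that each content word has a semantic category which the predicate's case frame maps to a case role. The lexicon, the case frame, the function `assign_roles`, and the romanized examples are illustrative assumptions, not the authors' rule set.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon: semantic category of each content word.
CATEGORY: Dict[str, str] = {"fujisan": "place", "basu": "vehicle", "ashita": "time"}

# Hypothetical case frame: the case role each category fills for a predicate.
CASE_FRAME: Dict[str, Dict[str, str]] = {
    "iku": {"place": "goal", "vehicle": "instrument", "time": "time"},
}

# Case role signalled by each postposition when it is present.
PARTICLE_ROLE: Dict[str, str] = {"ni": "goal", "de": "instrument", "wa": "topic"}

# (content word, postposition or None when the postposition was omitted)
Phrase = Tuple[str, Optional[str]]


def assign_roles(phrases: List[Phrase], predicate: str) -> Dict[str, str]:
    """Map each phrase to a case role of `predicate`, tolerating omitted or
    substituted postpositions and inverted word order."""
    frame = CASE_FRAME[predicate]
    roles: Dict[str, str] = {}
    # Inversion: each phrase is handled independently of its position relative
    # to the predicate, so arguments after the verb are treated like the rest.
    for word, particle in phrases:
        expected = frame.get(CATEGORY.get(word, ""))
        given = PARTICLE_ROLE.get(particle) if particle is not None else None
        if given is None or (given != "topic" and expected is not None and given != expected):
            # Omitted, unknown, or implausible (possibly substituted)
            # postposition: trust the word's semantic category instead.
            role = expected if expected is not None else "unknown"
        else:
            role = given
        roles[role] = word
    return roles


# Postpositions omitted on "fujisan" and "ashita".
print(assign_roles([("fujisan", None), ("basu", "de"), ("ashita", None)], "iku"))
# {'goal': 'fujisan', 'instrument': 'basu', 'time': 'ashita'}

# "basu ni", where "ni" was probably substituted for "de".
print(assign_roles([("basu", "ni")], "iku"))  # {'instrument': 'basu'}
```

Because roles are assigned per phrase rather than by position, an inverted utterance yields the same roles as the canonical order; the design trades some precision for tolerance of ill-formed input.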
The filtering process receives a semantic representation and either
translates it into the correct representation or, if needed, rejects it.
If the input representation is already correct, the filtering process
does nothing. Each filtering rule has a triggering pattern and a
correction method.
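A filtering rule pairs a triggering pattern with a correction method. The sketch below assumes the semantic representation is a flat slot-value dictionary and models each rule as a (trigger, correction) pair of functions, where a correction returns `None` to reject the input. The two rules shown (`same_origin_and_goal`, `missing_goal`), the slot names, and `filter_frame` are hypothetical examples, not rules taken from the system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Frame = Dict[str, str]
# (triggering pattern, correction method); a correction returns None to reject.
Rule = Tuple[Callable[[Frame], bool], Callable[[Frame], Optional[Frame]]]


def same_origin_and_goal(f: Frame) -> bool:
    # Trigger: origin and destination are identical, a likely misrecognition.
    return "from" in f and f["from"] == f.get("to")


def drop_origin(f: Frame) -> Optional[Frame]:
    # Correction: discard the suspicious origin so that it can be
    # supplied later (e.g. from the dialog context).
    return {k: v for k, v in f.items() if k != "from"}


def missing_goal(f: Frame) -> bool:
    # Trigger: a route question without a destination.
    return f.get("act") == "ask_route" and "to" not in f


def reject(f: Frame) -> Optional[Frame]:
    return None


RULES: List[Rule] = [(same_origin_and_goal, drop_origin), (missing_goal, reject)]


def filter_frame(frame: Frame, rules: List[Rule] = RULES) -> Optional[Frame]:
    """Apply every rule whose trigger matches; None means the input is rejected."""
    current = frame
    for trigger, correct in rules:
        if trigger(current):
            corrected = correct(current)
            if corrected is None:
                return None  # rejected
            current = corrected
    return current  # unchanged if no rule fired


print(filter_frame({"act": "ask_route", "from": "kawaguchiko", "to": "kawaguchiko"}))
# {'act': 'ask_route', 'to': 'kawaguchiko'}
print(filter_frame({"act": "ask_route", "from": "gotemba"}))  # None
```

Rules are applied in sequence so that one correction can feed the next; an input that triggers no rule passes through unchanged.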
The keyword analysis process is invoked when all other methods fail to
interpret the recognition result. It receives the chart database, which
stores the analysis results for parts of the input sentence, and decides
the meaning of the whole input without using syntactic relations. Each
keyword analysis rule has a keyword list and a meaning skeleton.
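The following is one hypothetical reading of keyword analysis, offered as a sketch rather than the system's implementation. It assumes the chart is a flat list of partial results, each tagged with a semantic category, and that a rule fires when all of its keywords occur in the chart. The `ChartItem` and `KeywordRule` types, the `keyword_analysis` function, the romanized keywords ("ikura" for "how much", "dono kurai" for "how long"), and the slot names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChartItem:
    word: str
    category: str  # semantic category assigned by partial analysis
    start: int     # span in the input; kept but unused, since syntax is ignored
    end: int


@dataclass
class KeywordRule:
    keywords: List[str]    # all must occur somewhere in the chart
    act: str               # meaning skeleton: the dialog act
    slots: Dict[str, str]  # meaning skeleton: slot name -> filler category


RULES: List[KeywordRule] = [
    KeywordRule(["ikura"], "ask_fare", {"to": "place"}),
    KeywordRule(["dono", "kurai"], "ask_time", {"to": "place"}),
]


def keyword_analysis(chart: List[ChartItem],
                     rules: List[KeywordRule] = RULES) -> Optional[Dict[str, str]]:
    """Return the filled meaning skeleton of the first matching rule, or None."""
    words = {item.word for item in chart}
    for rule in rules:
        if all(k in words for k in rule.keywords):
            meaning = {"act": rule.act}
            for slot, category in rule.slots.items():
                # Take the first chart item of the required category,
                # regardless of its position or syntactic role.
                filler = next((i.word for i in chart if i.category == category), None)
                if filler is not None:
                    meaning[slot] = filler
            return meaning
    return None


chart = [ChartItem("kawaguchiko", "place", 0, 1),
         ChartItem("made", "postposition", 1, 2),
         ChartItem("ikura", "question", 3, 4)]
print(keyword_analysis(chart))  # {'act': 'ask_fare', 'to': 'kawaguchiko'}
```

Slots are filled purely by semantic category, ignoring where the fillers occur in the input, which reflects the point that this last-resort step does not rely on syntactic relations.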
We applied this robust spoken dialog understanding system to the task
of "sight-seeing guidance for Mt. Fuji." The system's vocabulary is about
250 words, and the perplexity of the grammar is about 70. The interpreter
in the system performed almost as well as human subjects.
Keywords: spontaneous speech recognition, robustness, ill-formed sentence, spoken dialog system,