Understanding Spontaneous Speech in Spoken Dialogue

Hiroaki SAITO

Department of Mathematics, Keio University
Yokohama 223, Japan
e-mail: hxs@nak.math.keio.ac.jp

It is hard to write context-free grammar rules that cover the free word order phenomena of spontaneous speech. Even if recognition performance improves, the utterance itself might be ungrammatical. Thus, it is not practical to pursue precise, fully specified syntactic rules.
An approach which combines a context-free principle and case frame instantiation has been claimed to be robust against ungrammaticality. This approach, however, is not effective in speech applications, because its strategy leans heavily on words which specify case, and such words are often short and unstressed in pronunciation, especially in English. Case frame instantiation also relies on a verb. Depending on such particular words is risky in speech applications.
For instance, it would be easy to parse the written sentence ``I send a mail to Smith.'' When that sentence is spoken, the word `to' is often hard to recognize. This makes it very difficult for the system to understand the sentence, because that tiny word plays an important semantic role. If the task is small, we can easily hand-code the knowledge that `send' customarily puts `to' before a destination. As the task gets bigger, however, building such knowledge becomes hard and time-consuming. Thus such knowledge should be extracted from a corpus automatically.
This research proposes a method which handles syntax loosely and extracts the meaning of an utterance with the help of word co-occurrence as important semantic information. Co-occurrence information is obtained by parsing corpus sentences with a generalized LR parser. A function which detects word connectivity is attached to each rule; e.g., the co-occurrence $<$send, to$>$ is found by the rule ``S $-->$ S PP'' and its semantic function ``(cooccur (x1 head) (x2 prep)).'' This method can extract syntactic connectivity between words, whereas the conventional statistical measures, bigram and trigram, simply scan surface adjacency.
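The rule-attached co-occurrence function described above can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary-based constituents and the names `reduce_s_pp`, `surface_bigrams` and `cooccurrence_counts` are hypothetical and are not taken from the original system. The sketch also contrasts the syntactic pair with plain surface bigrams.

```python
from collections import Counter

# Hypothetical store of co-occurrence observations: (head, prep) -> count.
cooccurrence_counts = Counter()

def cooccur(head, prep):
    """Record one observation of the pair <head, prep>."""
    cooccurrence_counts[(head, prep)] += 1

def reduce_s_pp(x1, x2):
    """Assumed semantic function for the rule  S --> S PP.

    x1 is the inner S constituent and x2 is the PP constituent.
    """
    cooccur(x1["head"], x2["prep"])
    # The new S keeps the head of the inner S.
    return {"cat": "S", "head": x1["head"]}

def surface_bigrams(words):
    """Adjacent word pairs, as a bigram model would see them."""
    return list(zip(words, words[1:]))

# Constituents a parser might have built for "I send a mail to Smith."
s = {"cat": "S", "head": "send"}
pp = {"cat": "PP", "prep": "to", "head": "Smith"}
new_s = reduce_s_pp(s, pp)

bigrams = surface_bigrams("I send a mail to Smith".split())
```

In this toy run the rule records the pair (send, to), while the surface bigrams contain (mail, to) but never (send, to), because the two words are not adjacent.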
It is impossible to write context-free grammar rules that parse all the corpus sentences. Thus, the generalized LR parser should be equipped with the following four error recovery techniques.
Situation: Suppose S is the top state of a parsing stack. Suppose X is the current input symbol. Action function `action(S,X)' returns possible actions (shift, reduce, accept or error) by looking up the action table of a grammar. Multiple actions might be returned because the parser is a generalized LR parser. Suppose action(Si,Xi) returns `error' while parsing an input string X1, X2, ..., Xn.
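A minimal Python sketch of such an action function is given here for concreteness. The table contents, the state numbers and the tuple encoding of actions are made up for illustration; only the idea that a lookup may return several actions, or an error, comes from the text.

```python
ERROR = ("error",)

# Hypothetical toy action table: (state, terminal) -> list of actions.
# A conflicting entry simply keeps both actions, as in generalized LR parsing.
ACTION_TABLE = {
    (0, "n"): [("shift", 1)],
    (1, "v"): [("shift", 2), ("reduce", "NP --> n")],
    (2, "$"): [("accept",)],
}

def action(state, symbol):
    """Return every action listed for (state, symbol), or a single error."""
    return ACTION_TABLE.get((state, symbol), [ERROR])

def is_error(actions):
    """True when the lookup produced no usable action."""
    return actions == [ERROR]
```

Under these assumptions, `action(1, "v")` returns two actions and `action(0, "v")` returns the error marker, which is the situation the recovery techniques address.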

[word substitution]
Substitute a terminal symbol Yi for Xi on the condition that action(Si,Yi) does not return error.

[word deletion]
Ignore Xi on the condition that action(Si,Xi+1) does not return error.

[word insertion]
Insert a terminal symbol Yi just before Xi on the condition that action(Si,Yi) does not return error.

[gap filling (putting a dummy nonterminal)]
When state Si is a shift-able entry, consult the goto table to see whether a nonterminal D can be reduced at state Si, yielding a new state Sj. If so, push a dummy nonterminal D onto the stack and go to state Sj.
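The four techniques above can be sketched as a function that lists the recovery moves available at an error position. This is a simplified illustration, not the author's implementation: the function names, the table layout and the toy tables are assumptions, and gap filling is reduced to "the goto table has an entry for (Si, D)".

```python
ERROR = ["error"]

def action(action_table, state, symbol):
    """Look up the actions for (state, symbol); a missing entry means error."""
    return action_table.get((state, symbol), ERROR)

def is_ok(action_table, state, symbol):
    return action(action_table, state, symbol) != ERROR

def recovery_candidates(action_table, goto_table, terminals, state, inputs, i):
    """List recovery moves at position i, where action(state, inputs[i]) is error."""
    candidates = []
    for y in terminals:
        if is_ok(action_table, state, y):
            candidates.append(("substitute", y))  # replace Xi with Yi
            candidates.append(("insert", y))      # put Yi just before Xi
    # Deletion: skip Xi if the following symbol is acceptable in this state.
    if i + 1 < len(inputs) and is_ok(action_table, state, inputs[i + 1]):
        candidates.append(("delete", inputs[i]))
    # Gap filling (simplified): any nonterminal D with a goto entry from Si.
    for (s, nonterminal), target in goto_table.items():
        if s == state:
            candidates.append(("gap_fill", nonterminal, target))
    return candidates

# Hypothetical tables for the state reached after "I send a mail".
action_table = {(3, "to"): ["shift 4"], (3, "$"): ["accept"]}
goto_table = {(3, "PP"): 5}
inputs = ["I", "send", "a", "mail", "Smith", "$"]

candidates = recovery_candidates(
    action_table, goto_table, ["to", "Smith"], 3, inputs, 4
)
```

With these toy tables, the error on `Smith` yields four candidates: substituting or inserting `to`, deleting `Smith`, and pushing a dummy PP. A real parser would have to explore or rank them, which is why the search can grow quickly.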

The first three techniques are described for substitution, deletion, or insertion of a single word. They can of course be extended to two or more words. In practice, however, that may expand the search space too much. Applying the gap filling technique loosely may also blow up the search. Thus a heuristic such as ``Two consecutive dummy nonterminals must not be created'' should be adopted.
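One possible way to enforce the quoted heuristic is to mark dummy entries on the parsing stack and to check the top entry before gap filling. The `StackEntry` type and the function `may_push_dummy` below are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass

@dataclass
class StackEntry:
    symbol: str
    state: int
    is_dummy: bool = False  # True if the entry was created by gap filling

def may_push_dummy(stack):
    """Allow gap filling only if the top of the stack is not already a dummy."""
    return not (stack and stack[-1].is_dummy)
```

A parser would call `may_push_dummy` before applying gap filling and skip that recovery move when it returns False.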
With the four error recovery mechanisms above, co-occurrence data are extracted from the corpus. We now see how the data are used in parsing the erroneous speech input ``I send a mail Smith.'' Parsing proceeds with no trouble through ``I send a mail.'' On the next word, `Smith,' the action function returns error. The same four error recovery mechanisms are also used in this parsing phase. Although the actual action depends on the grammar, [word insertion] interposes a preposition. If the terminal symbols of the grammar are lexical items, actual prepositions are chosen. If the terminals are parts of speech, a preterminal like *preposition is chosen. In either case, if the co-occurrence data show that `to' is frequently used after `send' in a particular rule, `to' can be inserted with high probability.
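The choice of which preposition to insert can be illustrated with a small ranking function. The counts below are invented for the example, and scoring by relative frequency is an assumption of this sketch; the text only states that a frequently co-occurring word can be inserted with high probability.

```python
from collections import Counter

# Invented counts for pairs <head, prep> collected under the rule S --> S PP.
counts = Counter({("send", "to"): 42, ("send", "with"): 5, ("send", "from"): 3})

def rank_insertions(head, candidates, counts):
    """Rank candidate words by their relative co-occurrence frequency with head."""
    total = sum(counts[(head, w)] for w in candidates)
    if total == 0:
        # No evidence at all: every candidate gets a score of zero.
        return [(w, 0.0) for w in candidates]
    scored = [(w, counts[(head, w)] / total) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_insertions("send", ["with", "to", "from"], counts)
best = ranking[0][0]
```

With these invented counts, `to` ranks first, so [word insertion] would try ``I send a mail to Smith'' before the alternatives.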
Checking word co-occurrence in the semantic action of each rule enhances the robustness of our error-recoverable generalized LR parser, especially against the ungrammaticalities and ellipses of spontaneous speech.
----------------------------------------------------------------------
As related work, a parser generator called NLyacc has been announced as free software. NLyacc accepts an arbitrary context-free grammar (cyclic rules like A $-->$ A are excluded) written in the yacc format and produces a generalized LR parser for it. Unlike yacc, NLyacc accepts multiple values from a lexical analyzer, which is useful for handling words with ambiguous parts of speech in natural language applications.

Keywords: spontaneous speech, parsing, word cooccurrence