Modeling Dialogue Control Strategies to Relieve Speech Recognition Errors

Yasuhisa NIIMI and Yutaka KOBAYASHI

Department of Electronics and Information Science, Kyoto Institute of Technology
Matsugasaki, Sakyo-ku, Kyoto, 606 JAPAN
e-mail: niimi@dj.kit.ac.jp

A number of attempts have been made to study spoken dialogue systems. However, current speech recognition technology, although it has made remarkable progress, is still insufficient for complete recognition of utterances in spoken dialogue, so dialogue systems need to confirm recognized utterances. This paper considers three dialogue control strategies to relieve speech recognition errors: the prompt to speak again, the direct confirmation, and the indirect confirmation. Suppose that the dialogue system has recognized an utterance as the sentence, ``Please tell me the entrance fee of Kinkakuji temple.'' If the system cannot accept the sentence reliably, it has three options: it prompts the user to speak again; confirms directly by saying, ``You mean the entrance fee of Kinkakuji temple?''; or makes an indirect confirmation by answering, ``You can enter Kinkakuji temple for 500 yen,'' instead of answering, ``It's 500 yen.''
The purpose of modeling the dialogue control strategies is to estimate two quantities, $P_{ac}$ and $N$: $P_{ac}$ is the probability that the information included in the user's utterance is conveyed to the system correctly, and $N$ is the average number of turns exchanged between the user and the system until the subdialogue on the user's first utterance terminates.
The first dialogue control strategy, the simplest of the three, is that the dialogue system accepts the user's utterances when their recognition scores are greater than a threshold value, but rejects them otherwise and prompts the user to speak again. The dialogue system using this strategy is called model 0. Now assume we know the probability, denoted by $a$, that the user's utterances are accepted, and the probability, denoted by $p$, that accepted utterances have been recognized correctly. The estimation of these two parameters is explained below. Then $P_{ac}^{(0)}$ and $N^{(0)}$ (the superscripts indicate the model index) are given by the following formulae: \[ P_{ac}^{(0)}=p, \makebox[30mm]{and} N^{(0)}=\frac{2}{a}-1. \] Since $p$ is expected to decrease as $a$ increases, $a$ must be made small in order to increase $P_{ac}^{(0)}=p$. This, however, makes $N^{(0)}$ large, so some tradeoff is needed between $P_{ac}^{(0)}$ and $N^{(0)}$.
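The turn count for model 0 can be checked with a short calculation. The sketch below assumes a particular way of counting turns that is not spelled out in the text: every rejected attempt costs two turns (the user's utterance and the system's prompt), and the accepted attempt costs one. Under that assumption the number of attempts is geometric with success probability $a$, and the truncated series reproduces $2/a-1$. The function names are illustrative only.

```python
def model0_closed_form(a: float, p: float) -> tuple[float, float]:
    """Return (P_ac, N) for model 0 from the formulae in the text."""
    if not 0.0 < a <= 1.0:
        raise ValueError("acceptance probability a must lie in (0, 1]")
    return p, 2.0 / a - 1.0


def model0_turns_by_series(a: float, max_attempts: int = 10_000) -> float:
    """Expected turns, summed over the attempt on which acceptance occurs.

    Assumption (not stated in the text): a rejected attempt costs two
    turns (utterance + prompt) and the accepted attempt costs one, so
    acceptance on attempt k costs 2*(k - 1) + 1 turns.
    """
    total = 0.0
    for k in range(1, max_attempts + 1):
        prob_accept_on_k = (1.0 - a) ** (k - 1) * a
        total += prob_accept_on_k * (2 * (k - 1) + 1)
    return total


if __name__ == "__main__":
    for a in (0.9, 0.7, 0.5):
        _, n_closed = model0_closed_form(a, p=0.95)
        n_series = model0_turns_by_series(a)
        print(f"a={a}: closed form {n_closed:.4f}, series {n_series:.4f}")
```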
Now we consider how to estimate $a$ and $p$. Let $A$ denote the acoustic data stream of an utterance, and $W$ a string of words. We can adopt the conditional probability $P(W/A)$ of $W$ given $A$ as the recognition score. The recognized string of words is the string that maximizes $P(W/A)$ under the given linguistic constraints. By Bayes' theorem, \[ P(W/A)=P(A/W)P(W)/P(A).\] The quantity $P(A/W)P(W)$, which is the conventional criterion in speech recognition, is computed by using the hidden Markov model (HMM) and the language model. Two methods can be considered for estimating $P(A)$: the first is to approximate $P(A)$ by $\max_X\{P(X)P(A/X)\}$, where $X$ is a string of phonemes, and the second is to use the HMM to compute $P(A)$ directly. Computing $P(W/A)$ in this way for many training utterances, we can create a distribution of $P(W/A)$. Selecting a threshold value $\theta$, we can estimate $a$ as the area of the portion of the distribution in which the inequality $P(W/A)\geq\theta$ is satisfied. The probability $p$ is estimated in a similar way by using separate distributions created from correct recognitions and incorrect recognitions.
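The estimation of $a$ and $p$ from training data can be sketched as follows. The sketch assumes that each training utterance comes with its score $P(W/A)$ and a flag saying whether the recognition was correct. The data values and the function name are made up for illustration, and empirical fractions stand in for the areas under the score distributions.

```python
from typing import Sequence


def estimate_a_and_p(
    scores: Sequence[float],
    is_correct: Sequence[bool],
    theta: float,
) -> tuple[float, float]:
    """Estimate a = P(score >= theta) and p = P(correct | score >= theta)."""
    if len(scores) != len(is_correct) or not scores:
        raise ValueError("need equally long, non-empty score and label lists")
    # Correctness flags of the utterances whose score passes the threshold.
    accepted = [ok for s, ok in zip(scores, is_correct) if s >= theta]
    a = len(accepted) / len(scores)
    # p is undefined when nothing is accepted; return NaN in that case.
    p = sum(accepted) / len(accepted) if accepted else float("nan")
    return a, p


if __name__ == "__main__":
    # Hypothetical scores P(W/A) for eight training utterances.
    scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30]
    is_correct = [True, True, True, False, True, False, False, False]
    for theta in (0.5, 0.8):
        a, p = estimate_a_and_p(scores, is_correct, theta)
        print(f"theta={theta}: a={a:.3f}, p={p:.3f}")
```

On this toy data, raising $\theta$ from 0.5 to 0.8 lowers $a$ from 0.75 to 0.375 and raises $p$ from about 0.67 to 1.0, which is the tradeoff described for model 0.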
Now we return to the dialogue control strategies. The second strategy is the direct confirmation. The system using this strategy is called model 1. Under this strategy the system confirms recognized utterances when their recognition scores are less than the threshold value $\theta$, and accepts them otherwise. The user's response to this confirmation is assumed to be either ``yes'' or ``no'' for simplicity. When the response cannot be accepted, the user is asked to repeat the original utterance. Assuming we know the probability, denoted by $q$, that the utterances for which the confirmation is made have been recognized correctly, $P_{ac}^{(1)}$ and $N^{(1)}$ of model 1 are given by the following formulae: \[ P_{ac}^{(1)} = \frac{p\{1+(1-\alpha)q\}}{1+(1-\alpha)\beta} \] and \[ N^{(1)} = \frac{\alpha + (1-\alpha)(4-\alpha\beta)} {\alpha + (1-\alpha)\alpha\beta} \] where $\beta=1+2pq-p-q$ and $\alpha$ is the acceptance probability written $a$ for model 0.
A simple calculation shows that $N^{(1)}>N^{(0)}$, and that $P_{ac}^{(1)}>P_{ac}^{(0)}$ if $q>1/2$.
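The calculation behind both inequalities is short. The following is a sketch, assuming $0<\alpha<1$ and $0<p<1$, taking $\alpha$ to be the acceptance probability written $a$ for model 0 (so that $N^{(0)}=2/\alpha-1$), and using the identity $\beta=1+2pq-p-q=pq+(1-p)(1-q)$.

```latex
\begin{align*}
P_{ac}^{(1)}-P_{ac}^{(0)}
  &= p\,\frac{1+(1-\alpha)q}{1+(1-\alpha)\beta}-p
   = p\,\frac{(1-\alpha)(q-\beta)}{1+(1-\alpha)\beta}, \\
q-\beta
  &= q(1-p)-(1-p)(1-q) = (1-p)(2q-1), \\
N^{(1)}-N^{(0)}
  &= \frac{\alpha+(1-\alpha)(4-\alpha\beta)-(2-\alpha)\{1+(1-\alpha)\beta\}}
          {\alpha\{1+(1-\alpha)\beta\}}
   = \frac{2(1-\alpha)(1-\beta)}{\alpha\{1+(1-\alpha)\beta\}}, \\
1-\beta
  &= p(1-q)+q(1-p) > 0 .
\end{align*}
```

The sign of $P_{ac}^{(1)}-P_{ac}^{(0)}$ is therefore the sign of $2q-1$, and $N^{(1)}-N^{(0)}$ is positive for every $q$ under the stated assumptions.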
Finally we consider a more complex strategy: the indirect confirmation. The dialogue system using this strategy is called model 2. Model 2 uses indirect confirmations as well as direct confirmations. Assume the following about the performance of the system and the user's response to the indirect confirmation.

The system selects a direct confirmation with probability $\gamma$ and an indirect confirmation with probability $1-\gamma$.

The user proceeds to a new utterance without commenting on an indirect confirmation of a correct recognition, but makes some correction to an incorrect recognition. Therefore, among the user's responses, new utterances occur with probability $q$ and corrections occur with probability $1-q$.

These two kinds of utterance are accepted and recognized correctly with equal probability.

When the user's new utterance is recognized correctly, we regard the turns taken for the confirmation and the user's new utterance as not being spent on conveying the information of the first utterance.

Under these assumptions, $P_{ac}^{(2)}$ and $N^{(2)}$ of model 2 are given by the following formulae: \[ P_{ac}^{(2)} = \frac{p\{1+(1-\alpha)[1+(q-1)\gamma]\}} {1+(1-\alpha)\{1+(\beta-1)\gamma\}} \] and \[ N^{(2)} = \frac{\alpha+(1-\alpha) [(4-\alpha\beta)\gamma+ (4-\alpha-2\alpha pq)(1-\gamma)]} {\alpha+(1-\alpha)[\alpha\beta\gamma+\alpha(1-\gamma)]} \] With $\gamma=1$ both formulae reduce to those of model 1. In this case it can be shown that $P_{ac}^{(2)}>P_{ac}^{(0)}$ if $q>1/2$, and that $P_{ac}^{(2)}=P_{ac}^{(0)}$ and $N^{(2)}<N^{(0)}$ if $\gamma=0$, that is, if the system adopts only the indirect confirmation.
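The turn-count formulae can be evaluated numerically to see how the mix of confirmations affects $N^{(2)}$. The sketch below is a direct transcription of the expressions for $N^{(1)}$ and $N^{(2)}$ given in the text; the function names and the parameter values are chosen only for illustration. Setting $\gamma=1$ recovers $N^{(1)}$, and for these values lowering $\gamma$ (more indirect confirmations) lowers $N^{(2)}$.

```python
def beta(p: float, q: float) -> float:
    """beta = 1 + 2pq - p - q, as defined for model 1."""
    return 1.0 + 2.0 * p * q - p - q


def turns_model1(alpha: float, p: float, q: float) -> float:
    """N^(1): average number of turns with direct confirmation only."""
    b = beta(p, q)
    numerator = alpha + (1.0 - alpha) * (4.0 - alpha * b)
    denominator = alpha + (1.0 - alpha) * alpha * b
    return numerator / denominator


def turns_model2(alpha: float, p: float, q: float, gamma: float) -> float:
    """N^(2): direct confirmation with probability gamma, indirect otherwise."""
    b = beta(p, q)
    direct = (4.0 - alpha * b) * gamma
    indirect = (4.0 - alpha - 2.0 * alpha * p * q) * (1.0 - gamma)
    numerator = alpha + (1.0 - alpha) * (direct + indirect)
    denominator = alpha + (1.0 - alpha) * (alpha * b * gamma + alpha * (1.0 - gamma))
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative parameter values, not taken from any experiment.
    alpha, p, q = 0.7, 0.9, 0.6
    print(f"N1 = {turns_model1(alpha, p, q):.4f}")
    for gamma in (1.0, 0.5, 0.0):
        print(f"gamma={gamma}: N2 = {turns_model2(alpha, p, q, gamma):.4f}")
```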
This paper has reported three dialogue control strategies to relieve errors in speech recognition, and analyzed them mathematically. The analysis has shown that the direct confirmation can increase the probability that the information included in the user's utterances is conveyed to the system correctly, and that the indirect confirmation can reduce the average number of turns exchanged between the user and the system.