Recognition of Dialogue Speech
Shozo MAKINO, Akinori ITO, Yukihiro OSAKA, Takashi OTUKI,
Yoshiyuki OKIMOTO, and Motoyuki SUZUKI
Graduate School of Information Sciences, Tohoku University
Aoba-ku, Sendai 980-77, JAPAN
e-mail: makino@rcais.tohoku.ac.jp
We have been developing high-accuracy phoneme recognition systems for
dialogue speech. In dialogue speech, the variations of coarticulation
effects and speaking rate are larger than in isolated spoken words or
continuous speech. To deal with these problems, we are constructing
the following three phoneme recognition systems: (1) phoneme
recognition using an HM-net made with a new successive state splitting
algorithm, (2) phoneme recognition using discriminative training and
reference patterns of variable length, and (3) phoneme recognition
that adapts to the speaking rate of the input speech. In this report,
we outline the first of these systems.
Many methods have been proposed for constructing context-dependent
phoneme models with HMMs to obtain better performance. These
conventional methods require contextual factors to be given in
advance; if the given factors are not sufficient, the constructed
models do not give good recognition performance. We propose a new
algorithm that constructs an HM-net without contextual factors being
given. The new algorithm is very similar to the Successive State
Splitting (SSS) algorithm (called SSS-original) proposed by J. Takami
and S. Sagayama. The new algorithm (called SSS-free) differs from the
SSS-original in the way a state is split in the contextual domain. In
the SSS-original, the training samples accepted by a state are
distributed between the two new states by splitting the phoneme-context
classes belonging to the contextual factor that achieves the maximum
likelihood. However, if the given factors are not sufficient, it often
happens that all of the training samples are assigned to one of the
two new states, and the state splitting cannot be continued. In the
SSS-free, on the other hand, each training sample accepted by a state
is distributed to whichever of the two new states gives the higher
likelihood. The SSS-free does not suffer from the problem described
above because no contextual factors need to be given for the
splitting; therefore, an HM-net can be constructed in any case.
Training samples passing through a single path in an HM-net made by
the SSS-free have similar acoustic characteristics.
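
As an illustration, the following Python sketch shows the assignment
rule that distinguishes the SSS-free from the SSS-original. How the
two candidate states are initialized is our assumption (a small
perturbation of the old state's Gaussian), as are all names; only the
likelihood-based distribution of samples reflects the rule described
above.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_likelihood(x, mean, var):
        # Log-density of a diagonal Gaussian, standing in for an
        # HM-net state's output distribution.
        return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

    def split_state_free(samples, mean, var, eps=1e-2):
        # One SSS-free split (sketch): perturb the old distribution
        # into two candidate states, then send each training sample to
        # whichever candidate scores it higher; no contextual factor
        # is consulted, so the split never depends on given factors.
        mean_a, mean_b = mean + eps, mean - eps
        assign_a, assign_b = [], []
        for x in samples:
            if log_likelihood(x, mean_a, var) >= log_likelihood(x, mean_b, var):
                assign_a.append(x)
            else:
                assign_b.append(x)
        return assign_a, assign_b

    # Toy usage: two acoustic clusters mixed in one state separate.
    samples = np.concatenate([rng.normal(-1.0, 0.3, (50, 2)),
                              rng.normal(+1.0, 0.3, (50, 2))])
    a, b = split_state_free(samples, samples.mean(axis=0), samples.var(axis=0))
    print(len(a), len(b))  # roughly 50 and 50

In the actual algorithm, the two new states would be retrained after
the assignment and the best split over all states chosen; the sketch
shows only the sample distribution step.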
One of the advantages of the SSS-original is that the resulting HM-net
expresses not only the phoneme-context classes that appear in the
training samples but also phoneme-context classes that do not. An
HM-net made by the SSS-free cannot express phoneme-context classes
that do not appear in the training samples. This is one of the
disadvantages of the SSS-free, but because a huge number of training
samples can be used, very few phoneme-context classes are missing from
them. We therefore conclude that this disadvantage has little
influence on recognition performance.
The HM-net made by the SSS-free is not equivalent to context-dependent
models, because the SSS-free does not explicitly consider contextual
factors. However, if the phoneme-context class of the test sample is
given in the recognition stage, the HM-net can be regarded as a set of
context-dependent models by restricting its paths with a ``context
table''. Each row of the context table consists of a path name in the
HM-net and the list of phoneme-context classes corresponding to that
path. These lists are defined during the final retraining: the phoneme
context of each sample is added to the list corresponding to the path
through which the sample passed. In the recognition stage, we
calculate the likelihood after restricting the paths with the context
table. However, one phoneme-context class is not necessarily limited
to a single path; the likelihood of a phoneme is calculated as the
maximum likelihood among the corresponding paths.
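
As a minimal sketch of this restriction, the following assumes the
context table is a dictionary from path names to the sets of
phoneme-context classes collected during the final retraining, and
that the log-likelihood of the test sample along each path is already
available (e.g. from Viterbi scoring); all names and the
triphone-style notation are illustrative.

    def restricted_score(context_table, path_log_lik, test_context):
        # Recognition-stage scoring with the context table: keep only
        # the paths whose class list contains the test sample's
        # phoneme-context class. One class may map to several paths,
        # so return the maximum likelihood among them.
        candidates = [path_log_lik[p]
                      for p, classes in context_table.items()
                      if test_context in classes]
        return max(candidates) if candidates else float("-inf")

    # Hypothetical table for one phoneme with three paths; the class
    # lists are filled in during the final retraining.
    table = {"path1": {"k-a+i", "t-a+i"},
             "path2": {"s-a+u", "k-a+i"},
             "path3": {"h-a+a"}}
    scores = {"path1": -120.5, "path2": -118.2, "path3": -131.0}
    print(restricted_score(table, scores, "k-a+i"))  # -118.2: max over path1 and path2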
To show the effectiveness of the SSS-free, we carried out
speaker-dependent phoneme recognition experiments on 24 Japanese
phonemes in ATR Japanese sentence speech uttered by two male (MHO and
MTK) and two female (FKN and FTK) speakers. The training data were 400
sentences from the ATR Japanese sentence speech database, and the test
data were another 103 sentences. The maximum number of states per
phoneme was set to 30, and the total number of states was set to 500.
The recognition scores of the SSS-free for the four speakers (MHO,
MTK, FKN, and FTK) were 87.8%, 95.9%, 93.9%, and 94.8%, respectively;
those of the SSS-original were 85.7%, 94.4%, 91.3%, and 93.9%. The
average recognition score of the SSS-free was 93.1% and that of the
SSS-original was 91.3%, an improvement of 1.8 points. We also found
that, in the SSS-original, there were several phonemes for which the
state splitting could not be continued before the number of states
reached the predefined limit.