Recognition of Dialogue Speech
Shozo MAKINO, Akinori ITO, Yukihiro OSAKA, Takashi OTUKI,
Yoshiyuki OKIMOTO, and Motoyuki SUZUKI
Graduate School of Information Sciences, Tohoku University
Aoba-ku, Sendai 980-77, JAPAN
e-mail: makino@rcais.tohoku.ac.jp
We have been developing high-accuracy phoneme recognition systems for
dialogue speech. In dialogue speech, the variations of coarticulation
effects and speaking rate are larger than in isolated spoken words or
continuous speech. To deal with these problems, we are constructing
the following three phoneme recognition systems: (1) phoneme
recognition using an HM-net made with a new successive state splitting
algorithm, (2) phoneme recognition using discriminative training and
reference patterns of variable length, and (3) phoneme recognition
that adapts to the speaking rate of the input speech. In this report,
we outline the first of these systems.
Many methods have been proposed for constructing context-dependent
phoneme models with HMMs to obtain better performance. These
conventional methods require contextual factors to be given in
advance; if the given factors are not sufficient, the constructed
models do not give good recognition performance. We propose a new
algorithm that constructs an HM-net without contextual factors being
given. The new algorithm is very similar to the Successive State
Splitting (SSS) algorithm (called SSS-original) proposed by J. Takami
and S. Sagayama. The new algorithm (called SSS-free) differs from the
SSS-original in the way a state is split in the contextual domain. In
the SSS-original, the training samples accepted by a state are
distributed between the two new states by splitting the phoneme-context
classes belonging to the contextual factor that achieves the maximum
likelihood. However, if the given factors are not sufficient, it often
happens that all of the training samples are assigned to one of the
two new states, and the state splitting cannot be continued. In the
SSS-free, on the other hand, each training sample accepted by a state
is distributed to whichever of the two new states gives the higher
likelihood. The SSS-free does not suffer from the problem described
above because no contextual factors need to be given for the
splitting; therefore, an HM-net can be constructed in any case.
Training samples passing through a single path in an HM-net made by
the SSS-free have similar acoustic characteristics.
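
As an illustration, the following Python sketch shows the assignment
rule that distinguishes the SSS-free from the SSS-original. How the
two candidate states are initialized is our assumption (a small
perturbation of the old state's Gaussian), as are all names; only the
likelihood-based distribution of samples reflects the rule described
above.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_likelihood(x, mean, var):
        # Log-density of a diagonal Gaussian, standing in for an
        # HM-net state's output distribution.
        return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

    def split_state_free(samples, mean, var, eps=1e-2):
        # One SSS-free split (sketch): perturb the old distribution
        # into two candidate states, then send each training sample to
        # whichever candidate scores it higher; no contextual factor
        # is consulted, so the split never depends on given factors.
        mean_a, mean_b = mean + eps, mean - eps
        assign_a, assign_b = [], []
        for x in samples:
            if log_likelihood(x, mean_a, var) >= log_likelihood(x, mean_b, var):
                assign_a.append(x)
            else:
                assign_b.append(x)
        return assign_a, assign_b

    # Toy usage: two acoustic clusters mixed in one state separate.
    samples = np.concatenate([rng.normal(-1.0, 0.3, (50, 2)),
                              rng.normal(+1.0, 0.3, (50, 2))])
    a, b = split_state_free(samples, samples.mean(axis=0), samples.var(axis=0))
    print(len(a), len(b))  # roughly 50 and 50

In the actual algorithm, the two new states would be retrained after
the assignment and the best split over all states chosen; the sketch
shows only the sample distribution step.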
One of the advantages of the SSS-original is that the resulting HM-net
expresses not only the phoneme-context classes that appear in the
training samples but also phoneme-context classes that do not. An
HM-net made by the SSS-free cannot express phoneme-context classes
that do not appear in the training samples. This is one of the
disadvantages of the SSS-free, but because a huge number of training
samples can be used, very few phoneme-context classes are missing from
them. We therefore conclude that this disadvantage has little
influence on recognition performance.
The HM-net made by the SSS-free is not equivalent to context-dependent
models, because the SSS-free does not explicitly consider contextual
factors. However, if the phoneme-context class of the test sample is
given in the recognition stage, the HM-net can be regarded as a set of
context-dependent models by restricting its paths with a ``context
table''. Each row of the context table consists of a path name in the
HM-net and the list of phoneme-context classes corresponding to that
path. These lists are defined during the final retraining: the phoneme
context of each sample is added to the list corresponding to the path
through which the sample passed. In the recognition stage, we
calculate the likelihood after restricting the paths with the context
table. However, one phoneme-context class is not necessarily limited
to a single path; the likelihood of a phoneme is calculated as the
maximum likelihood among the corresponding paths.
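
As a minimal sketch of this restriction, the following assumes the
context table is a dictionary from path names to the sets of
phoneme-context classes collected during the final retraining, and
that the log-likelihood of the test sample along each path is already
available (e.g. from Viterbi scoring); all names and the
triphone-style notation are illustrative.

    def restricted_score(context_table, path_log_lik, test_context):
        # Recognition-stage scoring with the context table: keep only
        # the paths whose class list contains the test sample's
        # phoneme-context class. One class may map to several paths,
        # so return the maximum likelihood among them.
        candidates = [path_log_lik[p]
                      for p, classes in context_table.items()
                      if test_context in classes]
        return max(candidates) if candidates else float("-inf")

    # Hypothetical table for one phoneme with three paths; the class
    # lists are filled in during the final retraining.
    table = {"path1": {"k-a+i", "t-a+i"},
             "path2": {"s-a+u", "k-a+i"},
             "path3": {"h-a+a"}}
    scores = {"path1": -120.5, "path2": -118.2, "path3": -131.0}
    print(restricted_score(table, scores, "k-a+i"))  # -118.2: max over path1 and path2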
To show the effectiveness of the SSS-free, we carried out
speaker-dependent phoneme recognition experiments on 24 Japanese
phonemes in ATR Japanese sentence speech uttered by two male (MHO and
MTK) and two female (FKN and FTK) speakers. The training data were 400
sentences from the ATR Japanese sentence speech database, and the test
data were another 103 sentences. The maximum number of states per
phoneme was set to 30, and the total number of states was set to 500.
The recognition scores of the SSS-free for the four speakers (MHO,
MTK, FKN, and FTK) were 87.8%, 95.9%, 93.9%, and 94.8%, respectively;
those of the SSS-original were 85.7%, 94.4%, 91.3%, and 93.9%. The
average recognition score of the SSS-free was 93.1% and that of the
SSS-original was 91.3%, an improvement of 1.8 points. We also found
that, in the SSS-original, there were several phonemes for which the
state splitting could not be continued before the number of states
reached the predefined limit.