Speech Synthesis Method for Spoken Dialogue System and Psychological Assessment of Synthetic Speech
Hideki KASUYA, Chang-Sheng YANG, Wen DING, Jin-Lin LU,
Yoshihisa WATANABE, and Takamitsu MATSUSHITA
Department of Electrical and Electronic Engineering,
Utsunomiya University
2753 Ishii-machi, Utsunomiya 321, JAPAN
e-mail: kasuya@utsunomiya-u.ac.jp
In order to encourage comfortable spoken dialogue between man and
machine, synthetic speech generated by a voice response system
must sound natural and should have variable voice qualities and prosodies
depending on the dialogue situation. A formant-type speech synthesizer
has a significant advantage over LPC- or cepstrum-based synthesis methods
in that it inherently possesses the ability to manipulate voice qualities
associated with voice source properties independently of those related
to the vocal tract transfer characteristics. A waveform-concatenation
method such as PSOLA has recently been shown to generate natural-sounding
synthetic speech, provided that a sufficiently large number of waveform
segments are prepared to cope with the considerable variations of the
speech waveforms that occur in ordinary spoken dialogue. However, in order
to change the voice qualities of synthetic speech with this method, an
immense number of waveform entries would be needed, an unrealistic
requirement. We believe that a synthesizer based on the source-filter
model is the only realistic solution for flexibly controlling the
voice-quality and prosodic variations of synthetic speech.
The formant-type synthesizer that realizes the source-filter model,
however, has long suffered from unnatural-sounding speech quality,
primarily because of the incomplete strategies employed in controlling
the formant and source parameters of the synthesizer. In order to overcome
this flaw while preserving its advantage, we have proposed a concatenative
formant-type synthesizer based on formant/antiformant parameter templates
of VCV (vowel-consonant-vowel) segments.
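For concreteness, the core building block of such a formant-type synthesizer is a second-order digital resonator per formant. The sketch below is a simplified Python illustration using the standard resonator difference equation (the form popularized by Klatt's software formant synthesizer), not the authors' implementation; antiformants would use the corresponding antiresonator form.

```python
import math

def klatt_resonator_coeffs(f, bw, fs):
    """Second-order digital resonator coefficients for the difference
    equation y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with unity gain at DC.

    f  : formant (center) frequency in Hz
    bw : formant bandwidth in Hz
    fs : sampling rate in Hz
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw * t)
    b = 2.0 * math.exp(-math.pi * bw * t) * math.cos(2.0 * math.pi * f * t)
    a = 1.0 - b - c                      # normalizes DC gain to 1
    return a, b, c

def resonate(x, f, bw, fs):
    """Filter a signal through one formant resonator."""
    a, b, c = klatt_resonator_coeffs(f, bw, fs)
    y = [0.0, 0.0]                       # zero initial conditions
    for s in x:
        y.append(a * s + b * y[-1] + c * y[-2])
    return y[2:]
```

Cascading one such resonator per formant, and driving the cascade with a glottal source, is what lets the source and vocal tract parameters be controlled independently.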
In this method, a stable algorithm is essential for extracting
formant/antiformant and voice source parameters simultaneously from
the natural speech of a target speaker. We have devised a novel adaptive
pitch-synchronous analysis method that simultaneously estimates the vocal
tract (formant/antiformant) and voice source parameters. We use the
parametric Rosenberg-Klatt model to generate a glottal waveform and an
autoregressive-exogenous (ARX) model to represent a voiced speech
production process. The Kalman filter algorithm is used to estimate
the formant/antiformant parameters from the coefficients of the ARX
model, and the simulated annealing method is employed as a nonlinear
optimization approach to estimate the voice source parameters. The
two approaches work together in a system identification procedure to
find the best set of parameters for both models. Using synthetic
speech, the new method has been compared with several previously
proposed approaches in terms of the accuracy of the estimated
parameter values, and has proved superior. We have also shown that
the proposed method can accurately estimate the parameters from
natural speech.
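As a concrete illustration of the voiced-source component, the Rosenberg-Klatt glottal flow can be sketched in a few lines of Python. This is a minimal sketch under the common KLGLOTT88 parameterization (amplitude of voicing, open quotient, fundamental frequency); it omits spectral tilt and aspiration, as well as the ARX/Kalman/simulated-annealing estimation machinery described above, and the function name and arguments are illustrative rather than the authors' code.

```python
import numpy as np

def rosenberg_klatt_pulse(f0, oq, av, fs):
    """One period of the Rosenberg-Klatt (KLGLOTT88) glottal flow.

    f0 : fundamental frequency in Hz
    oq : open quotient (fraction of the period the glottis is open)
    av : amplitude of voicing (peak flow)
    fs : sampling rate in Hz
    """
    t0 = 1.0 / f0                          # fundamental period
    te = oq * t0                           # instant of glottal closure
    t = np.arange(int(round(fs * t0))) / fs
    # Open phase: U(t) = a*t^2 - b*t^3 with b = a/te, so the flow
    # returns to zero exactly at closure (t = te); the peak occurs at
    # t = 2*te/3, and a = 27*av/(4*te^2) scales that peak to av.
    a = 27.0 * av / (4.0 * te ** 2)
    return np.where(t < te, a * t ** 2 - (a / te) * t ** 3, 0.0)
```

Driving an all-pole (formant) filter with a train of such pulses, as in the ARX production model above, separates source control (f0, oq, av) from vocal tract control.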
Preliminary experiments have been performed to synthesize a few
Japanese sentences consisting of various voiced consonants and vowels
including three types of nasal consonants. All the VCV segments
appearing in the sentences were pronounced in the carrier sentence,
"sorewa VCV desu" (that is VCV). VCV sound segments of an appropriate
duration were edited out of the sentence utterances and were subjected to
the analysis of the formant/antiformant and voice source parameters by the
proposed method. The parameter sequences of all the VCV segments were stored
as the templates. VCV parameter templates were then concatenated according
to an input text and edited on the basis of the isochronism principle of
Japanese moraic durations. F0 contours measured from the sentence utterances
were directly used to control the F0 pattern of the synthetic speech.
An informal perceptual evaluation test has shown the synthetic speech
to be quite natural.
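The concatenate-and-edit step could be sketched as follows; resampling every stored VCV parameter template to a common frame count is only a crude stand-in for the moraic-isochronism editing described above, and all names here are illustrative, not the system's actual code.

```python
import numpy as np

def isochronize(template, n_frames):
    """Linearly resample a parameter track (frames x params), e.g. formant
    frequencies per frame, onto a fixed number of frames."""
    src = np.asarray(template, dtype=float)
    old = np.linspace(0.0, 1.0, len(src))
    new = np.linspace(0.0, 1.0, n_frames)
    return np.stack(
        [np.interp(new, old, src[:, k]) for k in range(src.shape[1])], axis=1
    )

def concatenate_vcv(templates, n_frames):
    """Concatenate VCV parameter templates after time-normalizing each one,
    approximating roughly equal moraic durations."""
    return np.concatenate([isochronize(t, n_frames) for t in templates], axis=0)
```

A synthesizer would then read the concatenated parameter track frame by frame, with the F0 contour, taken here directly from natural utterances, supplied separately.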