Speech Synthesis Method for Spoken Dialogue System and Psychological Assessment of Synthetic Speech
Hideki KASUYA, Chang-Sheng YANG, Wen DING, Jin-Lin LU,
Yoshihisa WATANABE, and Takamitsu MATSUSHITA
Department of Electrical and Electronic Engineering,
Utsunomiya University
2753 Ishii-machi, Utsunomiya 321, JAPAN
e-mail: kasuya@utsunomiya-u.ac.jp
In order to encourage comfortable spoken dialogue between man and
machine, synthetic speech generated by a voice response system
must sound natural and should have variable voice qualities and prosodies
depending on the dialogue situation. A formant-type speech synthesizer
has a significant advantage over LPC- or cepstrum-based synthesis methods
in that it inherently possesses the ability to manipulate voice qualities
associated with voice source properties independently of those related
to the vocal tract transfer characteristics. A waveform-concatenation
method such as PSOLA has recently been shown to generate natural-sounding
synthetic speech, provided that a sufficiently large number of waveform
segments are prepared to cope with the considerable variations of the
speech waveforms that occur in ordinary spoken dialogue. However, in order
to change the voice qualities of synthetic speech with this method, an
immense number of waveform entries would be needed, an unrealistic
requirement. We believe that a synthesizer based on the source-filter
model is the only realistic solution for flexibly controlling the
voice-quality and prosodic variations of synthetic speech.
The formant-type synthesizer that realizes the source-filter model,
however, has long suffered from unnatural-sounding speech quality,
primarily because of the incomplete strategies employed in controlling
the formant and source parameters of the synthesizer. In order to overcome
this flaw while preserving its advantage, we have proposed a concatenative
formant-type synthesizer based on formant/antiformant parameter templates
of VCV (vowel-consonant-vowel) segments.
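For concreteness, the core building block of such a formant-type synthesizer is a second-order digital resonator per formant. The sketch below is a simplified Python illustration using the standard resonator difference equation (the form popularized by Klatt's software formant synthesizer), not the authors' implementation; antiformants would use the corresponding antiresonator form.

```python
import math

def klatt_resonator_coeffs(f, bw, fs):
    """Second-order digital resonator coefficients for the difference
    equation y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with unity gain at DC.

    f  : formant (center) frequency in Hz
    bw : formant bandwidth in Hz
    fs : sampling rate in Hz
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw * t)
    b = 2.0 * math.exp(-math.pi * bw * t) * math.cos(2.0 * math.pi * f * t)
    a = 1.0 - b - c                      # normalizes DC gain to 1
    return a, b, c

def resonate(x, f, bw, fs):
    """Filter a signal through one formant resonator."""
    a, b, c = klatt_resonator_coeffs(f, bw, fs)
    y = [0.0, 0.0]                       # zero initial conditions
    for s in x:
        y.append(a * s + b * y[-1] + c * y[-2])
    return y[2:]
```

Cascading one such resonator per formant, and driving the cascade with a glottal source, is what lets the source and vocal tract parameters be controlled independently.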
In this method, a stable algorithm is essential for extracting
formant/antiformant and voice source parameters simultaneously from
the natural speech of a target speaker. We have devised a novel adaptive
pitch-synchronous analysis method that simultaneously estimates the vocal
tract (formant/antiformant) and voice source parameters. We use the
parametric Rosenberg-Klatt model to generate a glottal waveform and an
autoregressive-exogenous (ARX) model to represent a voiced speech
production process. The Kalman filter algorithm is used to estimate
the formant/antiformant parameters from the coefficients of the ARX
model, and the simulated annealing method is employed as a nonlinear
optimization approach to estimate the voice source parameters. The
two approaches work together in a system identification procedure to
find the best set of parameters for both models. Using synthetic
speech, the new method has been compared with several previously
proposed approaches in terms of the accuracy of the estimated
parameter values, and has proved superior. We have also shown that
the proposed method can accurately estimate the parameters from
natural speech.
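As a concrete illustration of the voiced-source component, the Rosenberg-Klatt glottal flow can be sketched in a few lines of Python. This is a minimal sketch under the common KLGLOTT88 parameterization (amplitude of voicing, open quotient, fundamental frequency); it omits spectral tilt and aspiration, as well as the ARX/Kalman/simulated-annealing estimation machinery described above, and the function name and arguments are illustrative rather than the authors' code.

```python
import numpy as np

def rosenberg_klatt_pulse(f0, oq, av, fs):
    """One period of the Rosenberg-Klatt (KLGLOTT88) glottal flow.

    f0 : fundamental frequency in Hz
    oq : open quotient (fraction of the period the glottis is open)
    av : amplitude of voicing (peak flow)
    fs : sampling rate in Hz
    """
    t0 = 1.0 / f0                          # fundamental period
    te = oq * t0                           # instant of glottal closure
    t = np.arange(int(round(fs * t0))) / fs
    # Open phase: U(t) = a*t^2 - b*t^3 with b = a/te, so the flow
    # returns to zero exactly at closure (t = te); the peak occurs at
    # t = 2*te/3, and a = 27*av/(4*te^2) scales that peak to av.
    a = 27.0 * av / (4.0 * te ** 2)
    return np.where(t < te, a * t ** 2 - (a / te) * t ** 3, 0.0)
```

Driving an all-pole (formant) filter with a train of such pulses, as in the ARX production model above, separates source control (f0, oq, av) from vocal tract control.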
Preliminary experiments have been performed to synthesize a few
Japanese sentences consisting of various voiced consonants and vowels
including three types of nasal consonants. All the VCV segments
appearing in the sentences were pronounced in the carrier sentence,
"sorewa VCV desu" (that is VCV). VCV sound segments of an appropriate
duration were edited out of the sentence utterances and were subjected to
the analysis of the formant/antiformant and voice source parameters by the
proposed method. The parameter sequences of all the VCV segments were stored
as the templates. VCV parameter templates were then concatenated according
to an input text and edited on the basis of the isochronism principle of
Japanese moraic durations. F0 contours measured from the sentence utterances
were directly used to control the F0 pattern of the synthetic speech.
An informal perceptual evaluation test has shown the synthetic speech
to be quite natural.
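The concatenate-and-edit step could be sketched as follows; resampling every stored VCV parameter template to a common frame count is only a crude stand-in for the moraic-isochronism editing described above, and all names here are illustrative, not the system's actual code.

```python
import numpy as np

def isochronize(template, n_frames):
    """Linearly resample a parameter track (frames x params), e.g. formant
    frequencies per frame, onto a fixed number of frames."""
    src = np.asarray(template, dtype=float)
    old = np.linspace(0.0, 1.0, len(src))
    new = np.linspace(0.0, 1.0, n_frames)
    return np.stack(
        [np.interp(new, old, src[:, k]) for k in range(src.shape[1])], axis=1
    )

def concatenate_vcv(templates, n_frames):
    """Concatenate VCV parameter templates after time-normalizing each one,
    approximating roughly equal moraic durations."""
    return np.concatenate([isochronize(t, n_frames) for t in templates], axis=0)
```

A synthesizer would then read the concatenated parameter track frame by frame, with the F0 contour, taken here directly from natural utterances, supplied separately.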