A Method on Speech Synthesis for Spoken Dialogue Systems and Psychological Assessment of the Synthetic Speech

Keikichi HIROSE, Noboru TAKAHASHI, Nobuaki MINEMATSU,
Toru SENOO, and Mayumi SAKATA

Department of Electronic Engineering, Faculty of Engineering, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, 113 Tokyo, JAPAN
e-mail: hirose@gavo.t.u-tokyo.ac.jp

The current research project has been organized to develop a technology for generating response speech in advanced spoken dialogue systems. To enable smooth communication between man and machine, the response speech should not only be of high quality but also be easily understood by users. To satisfy these requirements, the following three items were selected as the major topics of the research:

1. Generate sentences from response contents given as a deep-level semantic representation, which may include information on focal position, ellipsis, and anaphora. The generated sentences should include high-level linguistic information, such as syntactic and discourse structures, and information on the intentions to be transmitted.

2. Synthesize response speech with prosodic features that sound natural as dialogue speech. The prosodic features should convey the syntactic and discourse structures well, as well as the lexical meaning.

3. Synthesize high-quality speech also from the viewpoint of segmental features. The conventional terminal-analogue synthesizer will be improved and used for the synthesis.

As for the first item, we have already constructed a preliminary method of generating surface sentences for a dialogue system that provides guidance on ski resorts; this was reported last year as one of the results of the project. This year, we have developed a method of controlling ellipses and focal positions based on the degree of novelty of the information. As for the second item, we have newly recorded dialogue speech and analyzed its prosodic features. Based on the results, preliminary prosodic rules were constructed and evaluated through speech synthesis. As for the last item, the improvements are under way.
If words carrying information that is already known and useless to the user are included in the response speech, they not only lengthen the time needed to transmit the necessary information, but also occasionally obscure its location in the sentence. Therefore, to make the spoken dialogue system usable, the response sentences should include appropriate elliptic and anaphoric expressions. Conversely, excessive use of these expressions may cause misunderstanding between the system and the user. Information known to the user should sometimes be included in the response sentences as confirmation. In this case, the user can easily extract the necessary information from the response speech if a prosodic focus is placed on the key words. Although precise control of these expressions requires various kinds of knowledge bases, such as one for the user's knowledge, a preliminary control method was constructed for the current study.
By restricting the dialogue to questions and answers, and by forbidding sudden jumps in the dialogue topic, the dialogue flow can be represented by pairs of questions and answers called "fundamental routines of dialogue (FRD)." Case elements in the user's questions are stored sequentially in the corresponding stacks together with the FRD number, which increases as the dialogue proceeds. For each case element of the semantic representation, the FRD number attached to the element is compared with the number of the latest FRD in the dialogue. If the difference between the numbers exceeds 2, flag "0" is generated; otherwise, flag "-2" is generated. For the case element corresponding to the key information of the answer, flag "+1" is assigned. These flags are used to control elliptical expressions and prosodic emphases as follows:

(Flag +1) Include the case element in the surface sentence with a prosodic emphasis.

(Flag 0) Include the case element in the surface sentence without prosodic emphasis.

(Flag -1) Include the case element in the surface sentence with a prosodic de-emphasis.

(Flag -2) Exclude the case element from the surface sentence.

If the speech recognition / understanding process is not complete, there may be cases where the system is not sure about the contents of some case elements. Even if such case elements have flag "-2," they should be included in the surface sentence to avoid misunderstanding between the system and the user. Flag "-1" is assigned to these elements. The above method has been tested in a spoken dialogue system that supports patrolmen of electric power facilities, in which information on weather conditions is supplied to the user upon request.
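The flag assignment described above can be sketched as follows. This is only an illustrative sketch, not the system's actual implementation: the `CaseElement` fields, the function names, the action labels, and the order in which the rules are checked (key information first, then the FRD difference, then recognition uncertainty) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CaseElement:
    """A case element with the bookkeeping the flag rules need (hypothetical fields)."""
    name: str                   # illustrative case label, e.g. "place"
    frd_number: int             # FRD number stored with the element
    is_key: bool = False        # carries the key information of the answer
    is_uncertain: bool = False  # system is not sure about its contents


def assign_flag(element: CaseElement, latest_frd: int, threshold: int = 2) -> int:
    """Return +1, 0, -1, or -2 following the rules stated in the text."""
    if element.is_key:
        return 1
    if latest_frd - element.frd_number > threshold:
        return 0
    # The element would normally be omitted (flag -2); keep it, de-emphasized,
    # when its contents are uncertain.
    return -1 if element.is_uncertain else -2


# Assumed labels for the prosodic treatment of included elements.
ACTIONS = {1: "emphasis", 0: "neutral", -1: "de-emphasis"}


def plan_surface(elements: list[CaseElement], latest_frd: int) -> list[tuple[str, str]]:
    """List the elements kept in the surface sentence with their prosodic treatment."""
    plan = []
    for element in elements:
        flag = assign_flag(element, latest_frd)
        if flag == -2:
            continue  # excluded from the surface sentence
        plan.append((element.name, ACTIONS[flag]))
    return plan
```

For instance, with the latest FRD number equal to 5, a key element is emphasized, an element stored at FRD 4 is dropped, an element stored at FRD 1 is kept without emphasis, and an uncertain element stored at FRD 4 is kept with de-emphasis.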
Prosodic rules have already been constructed for the rule-based synthesis of reading speech. However, because the prosodic features of dialogue speech are rather different, these rules cannot be used as they are for the synthesis of response speech. To construct prosodic rules for dialogue speech, a comparative study was conducted on the prosodic features of dialogue speech and those of reading speech. Thirteen male and five female speakers of common Japanese, trained as actors and actresses, simulated a spontaneous dialogue by referring to a written text of questions and answers on ski resorts. They also uttered the individual sentences of the text in a normal reading style. These utterances were recorded and digitized to serve as the material for the analysis of prosodic features. The utterances of the simulated dialogue shall henceforth be referred to as the dialogue-style samples and those of text reading as the reading-style samples. For these speech samples, analyses were conducted on F0 contours, speech rates, and segmental waveform powers. The following are some of the results obtained:
1. For the dialogue-style samples, larger values and wider dynamic ranges were observed in the fundamental frequencies and in the speech rates. For the fundamental frequencies, these correspond respectively to larger baseline values and to larger phrase and accent commands of the F0 contours.
2. In the case of dialogue-style samples, larger accent commands were observed for words conveying the key information. No change in the speech rate was observed for these words.
3. In both the reading-style and the dialogue-style samples, F0 contour rises at the sentence-final position were usually observed for interrogative sentences. Although these can be represented as accent components, their onsets are sometimes delayed in the dialogue-style samples.
4. For the dialogue-style samples, each utterance starts with a rather low speech rate. The rate then increases toward the utterance center and decreases to a rather large extent at the utterance end. The variation in the speech rate was small for the reading-style samples.
Based on these results, the prosodic rules for text-to-speech conversion were modified into rules for the synthesis of response speech in spoken dialogue systems.
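The terms "baseline," "phrase command," and "accent command" used in the results above are those of the command-response model of F0 contours, in which the logarithm of F0 is the sum of a baseline term, phrase components, and accent components. The text gives no equations or parameter values, so the following is only a sketch under assumptions: the constants `ALPHA`, `BETA`, and `GAMMA` are commonly used typical values, and all command timings and amplitudes are hypothetical numbers chosen to illustrate how a higher baseline and larger commands raise the contour.

```python
import math

ALPHA = 3.0   # phrase control constant in 1/s (assumed typical value)
BETA = 20.0   # accent control constant in 1/s (assumed typical value)
GAMMA = 0.9   # ceiling of the accent response (assumed typical value)


def phrase_response(t: float) -> float:
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def accent_response(t: float) -> float:
    """Step response of the accent control mechanism, limited by GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def f0(t, baseline_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset, amplitude)
    accent_cmds: list of (onset, offset, amplitude)
    """
    log_f0 = math.log(baseline_hz)
    for onset, amplitude in phrase_cmds:
        log_f0 += amplitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (
            accent_response(t - onset) - accent_response(t - offset)
        )
    return math.exp(log_f0)


# Hypothetical settings: the dialogue style uses a higher baseline and
# larger phrase and accent commands than the reading style.
reading = f0(0.5, 100.0, [(0.0, 0.4)], [(0.3, 0.7, 0.3)])
dialogue = f0(0.5, 120.0, [(0.0, 0.5)], [(0.3, 0.7, 0.5)])
```

With these hypothetical settings, `dialogue` is larger than `reading` at the same time point, which mirrors the direction of the first two observations without reproducing any measured value.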

Keywords: spoken dialogue system, response speech, sentence generation, dialogue speech, speech synthesis, fundamental frequency contours, prosodic emphasis