A Method on Speech Synthesis for Spoken Dialogue Systems and Psychological Assessment of the Synthetic Speech
Keikichi HIROSE, Noboru TAKAHASHI, Nobuaki MINEMATSU,
Toru SENOO, and Mayumi SAKATA
Department of Electronic Engineering, Faculty of Engineering, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, 113 Tokyo, JAPAN
e-mail: hirose@gavo.t.u-tokyo.ac.jp
The current research project has been organized to
develop a technology for generating response speech in
advanced spoken dialogue systems. To enable smooth
communication between human and machine, the
response speech should not only be of high quality but
also be easily understood by the users. To satisfy
these requirements, the following three items were selected
as the major topics of the research:
1. Generate sentences from response contents given as a
deep-level semantic representation, which may include
information on focal position, ellipsis, and anaphora.
The generated sentences should carry high-level linguistic
information, such as syntactic and discourse structures,
and information on the intentions to be transmitted.
2. Synthesize response speech with prosodic features that
sound natural as dialogue speech. The prosodic
features should clearly convey the syntactic and discourse
structures, as well as the lexical meaning.
3. Synthesize speech that is also of high quality from the
viewpoint of segmental features. The conventional
terminal-analogue synthesizer will be improved and used for
the synthesis.
As for the first item, we have already constructed a preliminary
method of generating surface sentences for a dialogue system that
provides guidance on ski resorts; it was reported last year as one
of the results of the project. This year, we have developed a method
of controlling ellipses and focal positions based on the degree of
novelty of the information. As for the second item, we
have newly recorded dialogue speech and analyzed
its prosodic features. Based on the results, preliminary
prosodic rules were constructed and evaluated through
speech synthesis. As for the last item, the improvements
are under way.
If words carrying information that is already known and of no
use to the user are included in the response speech, they not only
lengthen the time needed to transmit the necessary
information, but also occasionally obscure its location in
the sentence. Therefore, to make the spoken
dialogue system easy to use, the response sentences
should include appropriate elliptic and anaphoric expressions.
Conversely, excessive use of these expressions may
cause misunderstanding between the system and the user.
Information known to the user should therefore sometimes be
included in the response sentences as confirmation. In
this case, the user can easily extract the necessary information
from the response speech if a prosodic focus is placed
on the key words. Although precise control of these
expressions requires various kinds of knowledge bases,
such as one on the user's knowledge, a preliminary control
method was constructed for the current study.
By restricting the dialogue to questions and
answers, and by forbidding sudden jumps in the dialogue
topic, the dialogue flow can be represented by pairs
of answers and questions called "fundamental routines of
dialogue (FRD)." Case elements in the user's questions are
stored sequentially in the corresponding stacks together with
the FRD number, which increases as the
dialogue proceeds. For each case element of the semantic
representation, the FRD number attached to the element
is compared with the number of the latest FRD in the dialogue.
If the difference between the numbers exceeds 2, flag "0"
is generated; otherwise, flag "-2" is generated. For the
case element corresponding to the key information of the
answer, flag "+1" is assigned. These flags are utilized for
the control of elliptical expressions and prosodic emphases
as follows:
If the speech recognition / understanding process is imperfect,
there may be cases where the system is not sure of the contents of
the case elements. Even if such case elements have flag "-2," they
should be included in the surface sentence to avoid
misunderstanding between the system and the user. Flag "-1" is
assigned to these elements. The above method has been tested on a
spoken dialogue system that supports patrolmen of electric power
facilities by supplying information on weather conditions to the
user upon request.
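The flag assignment described above can be sketched as follows. This is
only an illustrative sketch, not the implementation used in the study:
the names CaseElement, frd_number, is_key, uncertain, and assign_flag are
hypothetical, and the order in which the conditions are checked (key
information first, then the FRD distance, then recognition uncertainty)
is an assumption, since the text does not state how the rules interact.

```python
from dataclasses import dataclass


@dataclass
class CaseElement:
    """A case element of the semantic representation (hypothetical type)."""
    name: str
    frd_number: int          # FRD number stored with the element
    is_key: bool = False     # carries the key information of the answer
    uncertain: bool = False  # recognition/understanding result is not reliable


def assign_flag(elem: CaseElement, latest_frd: int, max_gap: int = 2) -> int:
    """Return the flag (+1, 0, -1, or -2) for one case element."""
    if elem.is_key:
        return 1                      # key information of the answer
    if latest_frd - elem.frd_number > max_gap:
        return 0                      # difference in FRD numbers exceeds 2
    # Recently mentioned element: "-2" normally, "-1" when the system is
    # unsure of its contents and it must stay in the surface sentence.
    return -1 if elem.uncertain else -2


if __name__ == "__main__":
    latest = 5
    elements = [
        CaseElement("place", frd_number=1),
        CaseElement("date", frd_number=4),
        CaseElement("time", frd_number=5, uncertain=True),
        CaseElement("weather", frd_number=5, is_key=True),
    ]
    for e in elements:
        print(e.name, assign_flag(e, latest))
```

Keeping the threshold as a parameter (max_gap) makes it easy to try other
values than the 2 given in the text.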
Prosodic rules have already been constructed for the rule-based
synthesis of reading speech. However, because the prosodic features
of dialogue speech are rather different, these rules cannot be used
as they are for the synthesis of response speech. In order to construct
prosodic rules for dialogue speech, a comparative study was conducted
on the prosodic features of dialogue speech and those of reading
speech. Thirteen male and five female speakers of common
Japanese, trained as actors and actresses, simulated a
spontaneous dialogue by referring to a written text of questions and
answers on ski resorts. They also uttered the individual sentences of
the text in a normal reading style. These utterances were recorded
and digitized to serve as the material for the analysis of prosodic
features. The utterances of simulated dialogue shall henceforth be
referred to as the dialogue-style samples and those of text reading as
the reading-style samples. For these speech samples, analyses were
conducted on F0 contours, speech rates, and segmental waveform
powers. The following are some of the results obtained:
1. For dialogue-style samples, larger values and wider
dynamic ranges were observed in the fundamental
frequencies and in the speech rates. As for the fundamental
frequencies, these respectively correspond to
larger baseline values and to larger phrase and accent
commands of the F0 contours.
2. In the case of dialogue-style samples, larger accent
commands were observed for words conveying the key
information. No change in the speech rate was
observed for these words.
3. In both reading-style and dialogue-style samples,
sentence-final rises of the F0 contour were usually
observed for interrogative sentences. Although
these can be represented as accent components, their
onsets are sometimes delayed in the case of
dialogue-style samples.
4. For the dialogue-style samples, each utterance starts
at a rather low speech rate. The rate then increases toward
the center of the utterance and decreases considerably
at the utterance end. The variation in the speech
rate was small for the reading-style samples.
Based on these results, prosodic rules for the text-to-speech
conversion were modified to those for the synthesis of response speech
in spoken dialogue systems.
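The terms "baseline," "phrase command," and "accent command" used in the
results above are consistent with the command-response (Fujisaki) model
of F0 contours. Assuming that model, the sketch below shows how a larger
accent command on a key word raises the F0 contour over that word while
the baseline and phrase command stay the same. The constants ALPHA, BETA,
and GAMMA are commonly used typical values, and all command timings and
magnitudes are invented for illustration; none of them are taken from
the analysis reported here.

```python
import math

ALPHA = 3.0   # natural angular frequency of phrase control (1/s), typical value
BETA = 20.0   # natural angular frequency of accent control (1/s), typical value
GAMMA = 0.9   # ceiling of the accent component, typical value


def phrase_response(t: float) -> float:
    """Impulse response of the phrase control mechanism."""
    return ALPHA ** 2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def accent_response(t: float) -> float:
    """Step response of the accent control mechanism."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def f0(t, baseline_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset, magnitude)
    accent_cmds: list of (onset, offset, amplitude)
    """
    log_f0 = math.log(baseline_hz)
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (
            accent_response(t - onset) - accent_response(t - offset)
        )
    return math.exp(log_f0)


if __name__ == "__main__":
    phrase = [(0.0, 0.5)]           # illustrative phrase command
    plain = [(0.3, 0.6, 0.4)]       # accent command of an ordinary word
    emphasized = [(0.3, 0.6, 0.8)]  # larger accent command for a key word
    for t in (0.2, 0.4, 0.5, 0.8):
        print(t, round(f0(t, 120.0, phrase, plain), 1),
              round(f0(t, 120.0, phrase, emphasized), 1))
```

In this form, the difference between reading-style and dialogue-style
rules would be expressed through the baseline value and the command
magnitudes passed to f0, not through the response functions themselves.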
Keywords: spoken dialogue system, response speech, sentence generation, dialogue speech, speech synthesis, fundamental frequency contours, prosodic emphasis