Advanced Research for Modeling of a Spontaneous Spoken Dialogue Understanding System Using Suprasegmental Features
Akira ICHIKAWA, Atsushi IMIYA,
Ken-ji YAGI, Shinnji SATOU and Naoya WATANABE
Department of Information and Computer Sciences, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba 263, Japan
e-mail: ichikawa@ics.tj.chiba-u.ac.jp
In this research, it is claimed that natural dialogue language has
important characteristics, differing from those of written language,
that allow its utterances to be comprehended easily under real-time
communication conditions.
In dialogue, prosody appears to play three essential roles. The first
is to indicate the semantic structure of utterances so that they can
be understood immediately. The second is to provide real-time control
information for turn-taking. The third is to convey information about
the speaker's psychological state.
From this point of view, 128 spontaneous conversations (totaling
about 24 hours) were collected from 64 subjects through a map task.
The design, materials, and procedures followed those of the HCRC Map
Task Corpus, but the feature names were carefully scrutinized so that
they could represent interesting phonological modifications in
Japanese. The recording system was enhanced: the Giver's and
Follower's speech were recorded through independent microphones and
stored separately on DAT. In addition, the subjects' maps and faces
were video-recorded so as to capture the Giver's hand movements on
the map, the Follower's drawing of the route, and their eye-contact
behavior.
From this corpus, some characteristics of prosody and cushion words
were then analyzed. Concerning the control function of turn-taking,
three points will be reported: (1) the timing of a nod-back or
response, (2) cues for discriminating between an interruption and a
nod-back, and (3) cues for discriminating between the end of an
utterance and a pause for thought. We also found three types of
cushion words: (1) indication of interrogation, (2) request for
affirmation, and (3) indication of self-affirmation.
These results will be useful for developing a spontaneous spoken
dialogue human-machine interface that does not obstruct the user's
attention to the task. We therefore propose a new concept for a
real-time spontaneous dialogue understanding system model that uses
prosodic information.
The system is organized as a multi-agent system. Each agent carries
out its own job in parallel: for example, one agent extracts sentence
structure from prosody under real-time conditions, another recognizes
phonemes, another predicts the next utterance, and another analyzes
the intention of an utterance.
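The abstract describes the agents only informally. As an illustration of the architecture, here is a minimal Python sketch in which several independent analysis agents process the same utterance in parallel threads and report into a shared queue; the agent names and their toy analyses are our own illustrative assumptions, not the authors' implementation.

```python
import threading
import queue

def run_agents(utterance, agents):
    """Run each analysis agent on the same utterance in parallel
    and collect their (agent-name, result) pairs."""
    results = queue.Queue()

    def worker(name, fn):
        results.put((name, fn(utterance)))

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in agents.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results.queue)

# Toy stand-ins for two of the agents named in the text (hypothetical).
agents = {
    "sentence_structure": lambda u: u.count(","),    # crude phrase count
    "phonemes": lambda u: len(u.replace(" ", "")),   # crude segment count
}
out = run_agents("turn left, then go straight", agents)
```

In a real system each agent would of course run continuously on the incoming speech stream rather than on a finished string; the point of the sketch is only the parallel, independent decomposition of the analysis tasks.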
As an example, the sentence-structure estimation agent extracts pitch
frequencies and approximates the pitch pattern with piecewise linear
segments, using a newly proposed algorithm based on the Randomized
Hough Transform for real-time line detection; it then constructs the
sentence structure from the approximated patterns using a newly
proposed modified multi-resolution algorithm.
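The paper does not give the details of its line-detection algorithm beyond its basis in the Randomized Hough Transform. As an illustration of that underlying technique, here is a minimal sketch that recovers the dominant line of a noisy contour (e.g. pitch samples) by letting random point pairs vote in a quantized (slope, intercept) accumulator; all function and parameter names are hypothetical.

```python
import random

def randomized_hough_line(points, n_trials=500, slope_q=0.1, icpt_q=5.0, seed=0):
    """Estimate the dominant line through 2-D points with a Randomized
    Hough Transform: repeatedly pick two random points, compute the line
    through them, and vote in a quantized (slope, intercept) accumulator.
    The cell with the most votes wins; outliers scatter their votes."""
    rng = random.Random(seed)
    acc = {}
    for _ in range(n_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:                       # skip vertical pairs
            continue
        slope = (y2 - y1) / (x2 - x1)
        icpt = y1 - slope * x1
        cell = (round(slope / slope_q), round(icpt / icpt_q))
        acc[cell] = acc.get(cell, 0) + 1
    (s_cell, i_cell) = max(acc, key=acc.get)
    return s_cell * slope_q, i_cell * icpt_q   # de-quantized line

# Synthetic falling "pitch contour" y = -2x + 200 plus two outliers.
pts = [(x, -2 * x + 200) for x in range(50)] + [(10, 500), (30, 900)]
slope, icpt = randomized_hough_line(pts)
```

A piecewise linear approximation, as described in the text, would repeat this step on successive segments of the contour; the random-pair sampling is what makes the method cheap enough for real-time use.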
To confirm the importance of suprasegmental features in natural
dialogue languages, we have planned a comparison of spoken language
with sign language, both regarded as natural dialogue languages.
Through a sentence recognition experiment using the loci of hand
movements under specific task conditions, it is shown that the
prosody of sign language carries important information about its
sentence structure.
Keywords: real-time dialogue understanding model, prosody, similarity of sentence structure, corpus, multi-agent system, sign language