Advanced Research for Modeling of a Spontaneous Spoken Dialogue Understanding System Using Suprasegmental Features
Akira ICHIKAWA, Atsushi IMIYA,
Ken-ji YAGI, Shinnji SATOU and Naoya WATANABE
Department of Information and Computer Sciences, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba 263, Japan
e-mail: ichikawa@ics.tj.chiba-u.ac.jp
In this research, it is claimed that natural dialogue language has
important characteristics, differing from those of written language,
that allow its utterances to be comprehended easily under real-time
communication conditions.
In dialogue, prosody appears to play three essential roles. The first
is to indicate the semantic structure of utterances so that they can
be understood immediately. The second is to provide real-time control
information for turn-taking. The third is to convey information about
the speaker's psychological state.
From this point of view, 128 spontaneous conversations (totaling
about 24 hours) were collected from 64 subjects through a map task.
The design, materials, and procedures followed those of the HCRC Map
Task Corpus, but the feature names were carefully scrutinized so that
they could represent interesting phonological modifications in
Japanese. The recording system was enhanced: the Giver's and
Follower's speech were recorded through independent microphones and
stored separately on DAT. In addition, the subjects' maps and faces
were video-recorded so as to capture the Giver's hand movements on
the map, the Follower's drawing of the route, and their eye-contact
behavior.
From this corpus, some characteristics of prosody and cushion words
were then analyzed. Concerning the control function of turn-taking,
three points will be reported: (1) the timing of a nod-back or
response, (2) cues for discriminating between an interruption and a
nod-back, and (3) cues for discriminating between the end of an
utterance and a pause for thought. We also found three types of
cushion words: (1) indication of interrogation, (2) request for
affirmation, and (3) indication of self-affirmation.
These results will be useful for developing a spontaneous spoken
dialogue human-machine interface that does not obstruct the user's
attention to the task. We therefore propose a new concept for a
real-time spontaneous dialogue understanding system model that uses
prosodic information.
The system is organized as a multi-agent system. Each agent carries
out its own job in parallel: for example, one agent extracts sentence
structure from prosody under real-time conditions, another recognizes
phonemes, another predicts the next utterance, and another analyzes
the intention of an utterance.
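The abstract describes the agents only informally. As an illustration of the architecture, here is a minimal Python sketch in which several independent analysis agents process the same utterance in parallel threads and report into a shared queue; the agent names and their toy analyses are our own illustrative assumptions, not the authors' implementation.

```python
import threading
import queue

def run_agents(utterance, agents):
    """Run each analysis agent on the same utterance in parallel
    and collect their (agent-name, result) pairs."""
    results = queue.Queue()

    def worker(name, fn):
        results.put((name, fn(utterance)))

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in agents.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results.queue)

# Toy stand-ins for two of the agents named in the text (hypothetical).
agents = {
    "sentence_structure": lambda u: u.count(","),    # crude phrase count
    "phonemes": lambda u: len(u.replace(" ", "")),   # crude segment count
}
out = run_agents("turn left, then go straight", agents)
```

In a real system each agent would of course run continuously on the incoming speech stream rather than on a finished string; the point of the sketch is only the parallel, independent decomposition of the analysis tasks.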
As an example, the sentence-structure estimation agent extracts pitch
frequencies and approximates the pitch pattern with piecewise linear
segments, using a newly proposed algorithm based on the Randomized
Hough Transform for real-time line detection; it then constructs the
sentence structure from the approximated patterns using a newly
proposed modified multi-resolution algorithm.
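The paper does not give the details of its line-detection algorithm beyond its basis in the Randomized Hough Transform. As an illustration of that underlying technique, here is a minimal sketch that recovers the dominant line of a noisy contour (e.g. pitch samples) by letting random point pairs vote in a quantized (slope, intercept) accumulator; all function and parameter names are hypothetical.

```python
import random

def randomized_hough_line(points, n_trials=500, slope_q=0.1, icpt_q=5.0, seed=0):
    """Estimate the dominant line through 2-D points with a Randomized
    Hough Transform: repeatedly pick two random points, compute the line
    through them, and vote in a quantized (slope, intercept) accumulator.
    The cell with the most votes wins; outliers scatter their votes."""
    rng = random.Random(seed)
    acc = {}
    for _ in range(n_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:                       # skip vertical pairs
            continue
        slope = (y2 - y1) / (x2 - x1)
        icpt = y1 - slope * x1
        cell = (round(slope / slope_q), round(icpt / icpt_q))
        acc[cell] = acc.get(cell, 0) + 1
    (s_cell, i_cell) = max(acc, key=acc.get)
    return s_cell * slope_q, i_cell * icpt_q   # de-quantized line

# Synthetic falling "pitch contour" y = -2x + 200 plus two outliers.
pts = [(x, -2 * x + 200) for x in range(50)] + [(10, 500), (30, 900)]
slope, icpt = randomized_hough_line(pts)
```

A piecewise linear approximation, as described in the text, would repeat this step on successive segments of the contour; the random-pair sampling is what makes the method cheap enough for real-time use.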
To confirm the importance of suprasegmental features in natural
dialogue languages, we have planned a comparison of spoken language
with sign language, both regarded as natural dialogue languages.
Through a sentence recognition experiment using the loci of hand
movements under specific task conditions, it is shown that the
prosody of sign language carries important information about its
sentence structure.
Keywords: real-time dialogue understanding model, prosody, similarity of sentence structure, corpus, multi-agent system, sign language