Marking Up the Dialogs with $<$utterance$>$ Tags:
The Unit of Utterance in the Technical Sense

Syun TUTIYA

Department of Philosophy, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba, 263 Japan
tutiya@cogsci.l.chiba-u.ac.jp

Developing corpora of spoken dialogs always involves the task of tagging the text as a sequence of utterances. Interestingly enough, an utterance is not necessarily a sentence in the grammatical sense, nor is a sentence necessarily an utterance. Besides, sentences may be left incomplete for various obvious reasons. Responses from the interlocutor might not be verbal but could be kinetic. Most typically, one interlocutor's utterance might interrupt the other's, giving the observer the impression that the utterances do not follow each other but overlap. All these casual observations lead us to a serious consideration of the notion of ``utterance'' and of the way to tag utterances. We still need to reach a compromise on the notion of ``utterance'' in the technical sense.
By definition, a dialog is a sequence of utterances. In terms of SGML, the mark-up language we have decided to adopt as the basis of our tags for the transcription of dialogs, the content model of the element $<$text type=dialog$>$ is an ordered sequence of one or more $<$u$>$ elements. Technically put, the problem is where to put the start tag $<$u$>$ and the end tag $<$/u$>$ for each utterance. Take for example the difficult case of semi-interruption. In dialogs in Japanese, speakers tend, more often than is observed in dialogs in English, to utter interjective phrases or non-lexical vocal sounds as signs of assent. Even assuming we are equipped with an SGML mechanism for handling overlapping phenomena along the lines of TEI, we still have problems deciding on the status of such assenting sounds and on the continuity of the utterance to which such assents are addressed.
The literature from preceding attempts to mark up dialogs, mainly in English, shows two distinct policies for handling the unity of the utterance. In the tradition of conversation analysis and discourse understanding, where the notion of ``turn taking'' plays an important role, the utterance is more or less synonymous with the turn in their sense. An utterance continues until the other interlocutor takes up the right of utterance. In their analysis, the kinds of assenting voices mentioned above would not mark the end of the interlocutor's utterance. On the other hand, in the tradition of cognitive science and discourse analysis, it is more customary to ``chop'' the dialogs into smaller pieces, searching for a unit which is just a little longer than a linguistic phrase.
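The contrast between the two policies can be illustrated with a minimal Python sketch. It is not the tagging procedure used in this work: the `Segment` class, the `backchannel` flag, the `who` attribute on $<$u$>$, and the sample segments are all assumptions made up for illustration. Under the turn-based policy, an assenting sound from the listener does not close the speaker's utterance (here it is simply left out of the sequence, whereas a real transcription would record it through an overlap mechanism). Under the short-unit policy, every segment becomes its own $<$u$>$ element.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One stretch of speech; all fields are hypothetical, for illustration only."""
    speaker: str
    text: str
    backchannel: bool = False  # True for an assenting sound such as "hai"


def tag_turn_based(segments: List[Segment]) -> List[str]:
    """Policy 1 (assumed reading): an utterance is a turn.

    A listener's backchannel does not end the current speaker's utterance,
    so the speaker's next segment is merged into the same <u> element.
    """
    turns = []  # each entry is [speaker, [texts]]
    for seg in segments:
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker still holds the turn: extend the utterance.
            turns[-1][1].append(seg.text)
        elif seg.backchannel and turns:
            # Assent from the other party: the turn is not taken over.
            continue
        else:
            # The other party takes up the turn: start a new utterance.
            turns.append([seg.speaker, [seg.text]])
    return [f'<u who="{s}">{" ".join(t)}</u>' for s, t in turns]


def tag_short_units(segments: List[Segment]) -> List[str]:
    """Policy 2 (assumed reading): every short segment is its own utterance."""
    return [f'<u who="{s.speaker}">{s.text}</u>' for s in segments]


# Invented sample data, loosely in the style of a map task exchange.
segments = [
    Segment("A", "go left at the pond"),
    Segment("B", "hai", backchannel=True),
    Segment("A", "then up to the bridge"),
    Segment("B", "where is the bridge"),
]

turn_based = tag_turn_based(segments)
short_units = tag_short_units(segments)
```

With the invented sample, the turn-based policy yields two $<$u$>$ elements and the short-unit policy yields four, which shows how strongly the choice of policy affects the number and length of the units that later analysis has to work with.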
We have experimentally tagged sample dialogs from recordings of map task dialogs in Japanese according to the two different policies and collected statistical data. After analysis, we are inclined to conclude that dialog transcriptions tagged with the notion of utterance from the second tradition would provide a more reliable basis for further research in both speech recognition/generation and discourse understanding.