Marking Up the Dialogs with $<$utterance$>$ Tags:
The Unit of Utterance in the Technical Sense
Syun TUTIYA
Department of Philosophy, Chiba University
1-33 Yayoi-cho, Inage-ku, Chiba, 263 Japan
tutiya@cogsci.l.chiba-u.ac.jp
Developing a corpus of spoken dialogs always involves tagging the
transcribed text as a sequence of utterances. Interestingly enough,
an utterance is not necessarily a sentence in the grammatical sense,
nor is a sentence necessarily an utterance. Moreover, sentences may
be left incomplete for various obvious reasons. Responses from the
interlocutor may be kinetic rather than verbal. Most typically, one
interlocutor's utterance interrupts the other's, giving the observer
the impression that the two utterances overlap rather than follow one
another. These casual observations call for a serious reconsideration
of the notion of ``utterance'' and of how utterances should be tagged.
We still need to settle on a compromise notion of ``utterance'' in
the technical sense.
By definition, a dialog is a sequence of utterances. In SGML, the
markup language we have adopted as the basis for tagging our dialog
transcriptions, the content model of the element $<$text type=dialog$>$
is an ordered sequence of one or more $<$u$>$ elements. Put
technically, the problem is where to place the start tag $<$u$>$ and
the end tag $<$/u$>$ of each utterance.
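As a minimal sketch of this content model, the Python function below wraps a sequence of (speaker, text) pairs in $<$u$>$ elements inside $<$text type=dialog$>$ and rejects an empty sequence, since the model requires at least one utterance. The function name `to_dialog_markup` and the `who` attribute are hypothetical, modeled loosely on TEI practice. The sketch only serializes boundaries it is given; where those boundaries should fall is exactly the open question.

```python
from xml.sax.saxutils import escape


def to_dialog_markup(utterances):
    """Serialize (speaker, text) pairs as <text type=dialog> containing u+."""
    if not utterances:
        # The content model requires one or more <u> elements.
        raise ValueError("a dialog needs at least one utterance")
    lines = ["<text type=dialog>"]
    for speaker, text in utterances:
        # 'who' is a hypothetical attribute, modeled loosely on TEI usage.
        lines.append(f"<u who={speaker}>{escape(text)}</u>")
    lines.append("</text>")
    return "\n".join(lines)
```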
Take, for example, the difficult case of semi-interruption. In
Japanese dialogs, speakers tend to utter interjective phrases or
non-lexical vocalizations as signs of assent more often than speakers
in English dialogs do. Even assuming an SGML mechanism for handling
overlap along the lines of the TEI, we still face the problem of
deciding the status of such assenting sounds and whether the utterance
they address continues across them.
Previous attempts to mark up dialogs, mainly in English, show two
distinct policies for delimiting the utterance as a unit. In the
tradition of conversation analysis and discourse understanding, where
the notion of ``turn taking'' plays an important role, the utterance
is more or less synonymous with the turn. An utterance continues
until the other interlocutor takes up the right to speak. In this
analysis, assenting vocalizations of the kind described above do not
mark the end of the current speaker's utterance. In the tradition of
cognitive science and discourse analysis, by contrast, it is more
customary to ``chop'' dialogs into smaller pieces, seeking a unit
only slightly longer than a linguistic phrase.
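The two policies can be contrasted with a small, hypothetical Python sketch. It assumes each transcribed segment has already been flagged as an assenting backchannel or not; the `assent` flag and all names here are illustrative and not part of any existing tool. Under the turn-based policy, an assent from the listener does not close the current speaker's utterance and is set aside, indexed by the utterance it addresses. Under the chopped policy, every segment becomes its own utterance.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    text: str
    assent: bool = False  # hypothetical flag: True for a backchannel such as "un" or "hai"


def turn_policy(segments):
    """Turn-based units: an utterance continues until the other interlocutor
    takes the floor; the listener's assenting sounds do not end it."""
    utterances = []    # [speaker, [texts]] for each turn
    backchannels = []  # (index of the utterance addressed, Segment)
    for seg in segments:
        current = utterances[-1] if utterances else None
        if seg.assent and current is not None and current[0] != seg.speaker:
            # Set the assent aside instead of closing the current utterance.
            backchannels.append((len(utterances) - 1, seg))
        elif current is not None and current[0] == seg.speaker:
            # Same speaker resumes: extend the open utterance.
            current[1].append(seg.text)
        else:
            # A different speaker takes the floor: open a new utterance.
            utterances.append([seg.speaker, [seg.text]])
    return [(spk, " ".join(texts)) for spk, texts in utterances], backchannels


def chopped_policy(segments):
    """Short units: every segment, assents included, is its own utterance."""
    return [(seg.speaker, seg.text) for seg in segments]
```

Keeping the set-aside assents in a separate list is only one design choice; with a TEI-style overlap mechanism they might instead be anchored to points inside the utterance they address.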
We have experimentally tagged sample dialogs from recordings of
Japanese map task dialogs according to both policies and collected
statistical data. Our analysis inclines us to conclude that
transcriptions tagged with the notion of utterance from the second
tradition provide a more reliable basis for further research in both
speech recognition/generation and discourse understanding.