Problems in Creating Tagged Orthographical Transcriptions of Spoken Dialogs in Japanese

Syun Tutiya

Department of Philosophy, Chiba University

1-33 Yayoi-cho, Inage-ku, Chiba, 263 Japan

tutiya@cogsci.l.chiba-u.ac.jp

Problems in creating orthographical transcriptions of spoken dialogs in Japanese are discussed. The methods of sociolinguists and conversation/discourse analysts are well motivated and well meant, but their transcriptions are not readily digitized or ported. The TEI encoding scheme for spoken language is introduced and discussed, along with annotative tags for morphological categories. TEI encourages the use of header information as an inline means of cataloging. In the case of a speech corpus, information internal to it, such as the speakers and situations, is easily and consistently expressible in the header part of the document. Dialogs are full of interruptions. Turn-taking does not take place as described in drama scripts. Sociolinguistic transcriptions of these phenomena are readable by humans but cannot be encoded in electronic form in any straightforward manner. We propose adopting the TEI-style encoding, which, by utilizing empty tags with pointers, enables us to encode non-linear phenomena such as overlapping, interruption and interleaving. The mechanism requires further elaboration in the development and implementation of handling tools, but that elaboration will also be put to use in linking locations to digital sound frames. "Pretty Printing" of the fully tagged texts will be available. Japanese orthographical transcription poses an interesting challenge to corpus encoders. Japanese orthography does not have a built-in mechanism for making word boundaries visible. That is, words, or whatever corresponds to Western "words," are not separated by spaces. This fact requires the encoder to mark some morphological segmentation in an elegant and manageable way, so that a search for a "word" does not pick up every string that merely contains it. Segmented units are marked with category indicators in a way that conforms to SGML/TEI. This will facilitate the use of spoken dialog corpora for a variety of purposes.
A detailed account of morphological units and their categories is given in terms of the written corpus being developed by the author.
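
The search problem caused by unsegmented orthography can be illustrated with a minimal Python sketch. The markup here is a simplified, hypothetical stand-in for SGML/TEI-style tagging, not the actual tag set of the corpus: each morphological unit is assumed to be wrapped in a <w> element whose "type" attribute carries a category indicator, and the two utterances, the category names, and the function names are invented for the example.

```python
import re

# Hypothetical, simplified markup (an assumption for illustration only):
# each morphological unit is a <w> element with a category in "type".
TAGGED = (
    '<u who="A"><w type="noun">東京</w><w type="suffix">都</w>'
    '<w type="particle">に</w><w type="verb">住む</w></u>'
    '<u who="B"><w type="noun">京都</w><w type="particle">に</w>'
    '<w type="verb">行く</w></u>'
)

# Matches one tagged unit and captures its category and its text.
W_PATTERN = re.compile(r'<w type="([^"]+)">([^<]+)</w>')


def units(tagged):
    """Return the (text, category) pairs of all tagged units, in order."""
    return [(text, cat) for cat, text in W_PATTERN.findall(tagged)]


def plain_text(tagged):
    """Strip all tags, leaving the unsegmented orthographic string."""
    return re.sub(r'<[^>]+>', '', tagged)


def count_substring(tagged, word):
    """Naive search: count occurrences of `word` in the untagged string."""
    return plain_text(tagged).count(word)


def count_unit(tagged, word, category=None):
    """Segment-aware search: count units whose text equals `word`,
    optionally restricted to a given category."""
    return sum(
        1
        for text, cat in units(tagged)
        if text == word and (category is None or cat == category)
    )


if __name__ == "__main__":
    # The naive search also finds 京都 (Kyoto) inside 東京都 (Tokyo-to).
    print(count_substring(TAGGED, "京都"))
    # The segment-aware search counts only the unit 京都 itself.
    print(count_unit(TAGGED, "京都", "noun"))
```

In this sketch the naive substring search counts the string 京都 twice, once spuriously inside 東京都, whereas the search over segmented units counts it once. Restricting the match by category is the kind of query that category indicators on segmented units are meant to support.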