A Language Analysis Method for Spoken Dialogue Understanding
-- Discourse Segmentation Based on Lexical Cohesion --

Manabu OKUMURA

School of Information Science, Japan Advanced Institute of Science and Technology
Tatsunokuchi, Ishikawa 923-12, Japan
e-mail: oku@jaist.ac.jp

A text is not a mere set of unrelated sentences. Rather, the sentences in a text are about the same thing and are connected to each other. Lexical cohesion is said to contribute to this connection between sentences. We call a sequence of words that are in a lexical cohesion relation with each other a lexical chain. Lexical chains tend to indicate portions of a text that form a semantic unit, and therefore provide a clue for determining the segment boundaries of the text. We consider identifying segment boundaries a crucial first step in constructing the structure of a text. Here we use the Japanese thesaurus `Bunrui-goihyo' and treat a sequence of words that belong to the same thesaurus category as a lexical chain.
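As an illustration of the chain definition above, the following minimal sketch groups words by thesaurus category. The toy English word-to-category table, the function name `build_chains`, and the `min_length` cutoff are all assumptions made for this example; they are not entries or rules taken from Bunrui-goihyo or from the system described here.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: word -> category label.
# These are NOT actual Bunrui-goihyo entries.
TOY_THESAURUS = {
    "river": "water", "lake": "water", "rain": "water",
    "teacher": "school", "student": "school", "exam": "school",
}


def build_chains(sentences, thesaurus, min_length=2):
    """Group words by category; return {category: [(sentence_no, word), ...]}.

    `sentences` is a list of word lists; sentence numbers start at 1.
    """
    chains = defaultdict(list)
    for sent_no, words in enumerate(sentences, start=1):
        for word in words:
            category = thesaurus.get(word)
            if category is not None:
                chains[category].append((sent_no, word))
    # Assumption: a category seen only once does not form a chain.
    return {c: m for c, m in chains.items() if len(m) >= min_length}


sentences = [
    ["rain", "river"],
    ["lake"],
    ["teacher", "student"],
    ["exam"],
]
chains = build_chains(sentences, TOY_THESAURUS)
for category, members in chains.items():
    print(category, members)
```

In this toy input the "water" chain covers sentences 1 to 2 and the "school" chain covers sentences 3 to 4, so the point between sentences 2 and 3 is where one chain ends and another begins.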
In this paper, we describe how the segment boundaries of a text can be determined with the aid of lexical cohesion. When a portion of a text forms a semantic unit, related words tend to be used within it. Therefore, if lexical chains can be found, they will tend to indicate the segment boundaries of the text. When a lexical chain ends, a segment tends to end; when a new chain begins, this may indicate that a new segment has begun. Taking this correspondence between lexical chain boundaries and segment boundaries into account, we measure the plausibility of each point in the text as a segment boundary: for each point between sentences $n$ and $n+1$ (where $n$ ranges from 1 to the number of sentences in the text minus 1), we compute the sum of the number of lexical chains that end at sentence $n$ and the number of lexical chains that begin at sentence $n+1$. We call this naive measure of the degree of agreement between the start and end points of lexical chains the boundary strength, $w(n,n+1)$. Points in the text are selected as candidate segment boundaries in decreasing order of boundary strength. Furthermore, since lexical chains contain gaps, we refine the measure by taking these gaps into account: the start and end points of the gaps can also contribute to the boundary strength.
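The naive boundary strength described above can be sketched as follows. Each chain is reduced to a hypothetical `(start, end)` pair of sentence numbers; the function names, the example spans, and the tie-breaking rule (earlier point first) are assumptions of this sketch. The gap-based refinement is not modelled, because its weighting is not specified here.

```python
def boundary_strengths(chain_spans, num_sentences):
    """Return {n: w(n, n+1)} for n = 1 .. num_sentences - 1.

    w(n, n+1) = (number of chains ending at sentence n)
              + (number of chains beginning at sentence n + 1)
    """
    strengths = {}
    for n in range(1, num_sentences):
        ends = sum(1 for _, end in chain_spans if end == n)
        begins = sum(1 for start, _ in chain_spans if start == n + 1)
        strengths[n] = ends + begins
    return strengths


def rank_boundaries(strengths):
    """Order candidate points by decreasing strength.

    Assumption: ties are broken in favour of the earlier point.
    """
    return sorted(strengths, key=lambda n: (-strengths[n], n))


# Hypothetical chains over a six-sentence text, as (start, end) pairs.
chain_spans = [(1, 2), (1, 4), (3, 4), (3, 5), (5, 6)]
strengths = boundary_strengths(chain_spans, num_sentences=6)
ranking = rank_boundaries(strengths)
print(strengths)  # {1: 0, 2: 3, 3: 0, 4: 3, 5: 1}
print(ranking)    # [2, 4, 5, 1, 3]
```

For example, the point between sentences 2 and 3 scores 3 because one chain ends at sentence 2 and two chains begin at sentence 3.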
To evaluate our system, we selected 22 texts taken from Japanese language exam questions that ask the examinee to partition a text into a given number of segments. The system's performance is judged by comparison with the segment boundaries marked in the attached model answer. Since human subjects do not always agree with each other on segmentation, evaluating against exam texts with model answers is considered a good simplification. The system's performance is evaluated in two ways. In the first, the system generates the given number of segment boundaries in decreasing order of strength. From this evaluation, we can compute the marks the system would receive as an examinee on a test consisting of these questions. In the second, segment boundaries are generated down to half of the maximum strength. In this case we use the following metrics: Recall is the number of correctly identified boundaries divided by the total number of correct boundaries. Precision is the number of correctly identified boundaries divided by the number of generated boundaries.
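The two evaluation settings and the two metrics can be sketched as below. The strength values and the model answer are made-up numbers for illustration, not results from the 22 texts. Treating "down to half of the maximum strength" as an inclusive threshold, and breaking ties by position, are assumptions of this sketch.

```python
def select_top_k(strengths, k):
    """First setting: take the k strongest points (k is given by the question)."""
    ranked = sorted(strengths, key=lambda n: (-strengths[n], n))
    return set(ranked[:k])


def select_above_half_max(strengths):
    """Second setting: keep points whose strength is at least half the maximum.

    Assumption: the threshold is inclusive.
    """
    threshold = max(strengths.values()) / 2
    return {n for n, w in strengths.items() if w >= threshold}


def recall_precision(generated, correct):
    """Recall = hits / |correct|; Precision = hits / |generated|."""
    hits = len(generated & correct)
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(generated) if generated else 0.0
    return recall, precision


# Hypothetical boundary strengths {n: w(n, n+1)} and model answer.
strengths = {1: 0, 2: 4, 3: 2, 4: 3, 5: 1}
model_answer = {2, 4}

top2 = select_top_k(strengths, k=len(model_answer))
half = select_above_half_max(strengths)

print(sorted(top2), recall_precision(top2, model_answer))
print(sorted(half), recall_precision(half, model_answer))
```

With these toy numbers the first setting returns points 2 and 4, while the second also admits point 3, which leaves recall unchanged and lowers precision to 2/3.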
Here we do not take paragraph boundary information, such as indentation, into account at all, for the following reasons:

Because our texts are taken from exam questions, many of them have no paragraph boundary markers;

In the case of Japanese, it has been pointed out that paragraph and segment boundaries do not always coincide with each other.

Keywords: lexical chain, contextual analysis, discourse structure