A Language Analysis Method for Spoken Dialogue Understanding
-- Discourse Segmentation Based on Lexical Cohesion --
Manabu OKUMURA
School of Information Science,
Japan Advanced Institute of Science and Technology
Tatsunokuchi, Ishikawa 923-12, Japan
e-mail: oku@jaist.ac.jp
A text is not a mere set of unrelated sentences. Rather, the sentences in
a text are about the same thing and are connected to each other. Lexical
cohesion is said to contribute to this connectedness of the sentences. We
call a sequence of words that are in a lexical cohesion relation with
each other a lexical chain. Lexical chains tend to indicate portions
of a text that form a semantic unit. Therefore, lexical chains provide
a clue for determining the segment boundaries of the text. We
consider identifying the segment boundaries to be a crucial first step
in constructing the structure of a text.
Here we use a Japanese thesaurus, Bunrui-goihyo, and regard a
sequence of words that belong to the same thesaurus category as a lexical
chain.
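The idea above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the word-to-category table stands in for a real Bunrui-goihyo lookup, and a chain is recorded simply as the list of sentence indices where its category's words occur.

```python
from collections import defaultdict

# Hypothetical word -> thesaurus-category mapping (stand-in for Bunrui-goihyo).
CATEGORY = {
    "dog": "animal", "cat": "animal", "bark": "animal",
    "rain": "weather", "storm": "weather",
}

def lexical_chains(sentences):
    """Return {category: [sentence indices]} for each lexical chain,
    i.e., each thesaurus category whose words occur more than once.
    `sentences` is a list of tokenized sentences."""
    chains = defaultdict(list)
    for i, sent in enumerate(sentences):
        for word in sent:
            cat = CATEGORY.get(word)
            if cat is not None:
                chains[cat].append(i)
    # Keep only sequences of two or more related words.
    return {c: idxs for c, idxs in chains.items() if len(idxs) > 1}
```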
In this paper, we describe how segment boundaries of a text can be
determined with the aid of lexical cohesion.
When a portion of a text forms a semantic unit, related words tend to be
used in it. Therefore, if lexical chains can be found, they will tend to
indicate the segment boundaries of the text: when a lexical chain ends,
a segment tends to end, and when a new chain begins, this may be an
indication that a new segment has begun.
Taking into account this correspondence between lexical chain boundaries
and segment boundaries, we measure the plausibility of each point in the
text as a segment boundary: for each point between sentences $n$ and
$n+1$ (where $n$ ranges from 1 to the number of sentences in the text
minus 1), we compute the sum of the number of lexical chains that end at
sentence $n$ and the number of lexical chains that begin at
sentence $n+1$. We call this naive measure of the degree of agreement
between the start and end points of lexical chains, $w(n,n+1)$, the
boundary strength. Points in the text are then selected in descending
order of boundary strength as candidates for segment boundaries.
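The computation of $w(n,n+1)$ can be sketched as below. As an illustrative simplification (not the authors' code), each chain is represented by a pair of 0-based sentence indices, and boundary $n$ lies between sentences $n$ and $n+1$.

```python
def boundary_strength(chains, num_sentences):
    """Return w, where w[n] is the strength of the boundary between
    sentences n and n+1 (0-based), i.e., the number of chains that
    end at sentence n plus the number that begin at sentence n+1.
    `chains` is a list of (start_sentence, end_sentence) pairs."""
    w = [0] * (num_sentences - 1)
    for start, end in chains:
        if 0 <= end < num_sentences - 1:
            w[end] += 1        # a chain ends at sentence n
        if 1 <= start <= num_sentences - 1:
            w[start - 1] += 1  # a chain begins at sentence n+1
    return w

# Example: 6 sentences, three chains.
strengths = boundary_strength([(0, 2), (3, 5), (2, 4)], 6)  # → [0, 1, 2, 0, 1]
```

The strongest boundary here falls between sentences 2 and 3, where one chain ends and another begins.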
Furthermore, since lexical chains may contain gaps, we refine the measure
by taking these gaps into account: the start and end points of a gap also
contribute to the boundary strength.
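This refinement can be sketched as follows, again under an assumed representation: a chain is given as the sorted sentence indices where its words occur, and any stretch of one or more sentences inside the chain with no occurrence counts as a gap whose edges behave like chain endpoints.

```python
def gap_boundaries(chain_occurrences):
    """Yield (end, restart) pairs for each gap in a chain: the chain
    'ends' at the last sentence before the gap and 'begins' again at
    the first sentence after it. `chain_occurrences` is a sorted list
    of 0-based sentence indices where the chain's words occur."""
    for prev, nxt in zip(chain_occurrences, chain_occurrences[1:]):
        if nxt - prev > 1:  # at least one intervening sentence with no chain word
            yield prev, nxt

# A chain occurring in sentences 1, 2, 5, 6 has one gap (sentences 3-4),
# so it also adds strength at the boundaries after sentence 2 and before 5.
gaps = list(gap_boundaries([1, 2, 5, 6]))  # → [(2, 5)]
```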
To evaluate our system, we selected 22 texts taken from Japanese
language examination questions that ask the examinee to partition a text
into a given number of segments. The system's performance is judged by
comparison with the segment boundaries marked in the attached model
answers.
Since human subjects do not always agree with each other on
segmentation, we consider this evaluation method, which uses the texts
from such questions together with their model answers, to be a good
simplification.
The system's performance is evaluated in two ways. In the first, the
system generates the given number of segment boundaries in descending
order of boundary strength. From this evaluation, we can compute the
system's marks as if it were an examinee taking a test consisting of
these questions. In the second, segment boundaries are generated down to
half of the maximum boundary strength. In this case we use the following
metrics for the evaluation: recall is the number of correctly identified
boundaries divided by the total number of correct boundaries, and
precision is the number of correctly identified boundaries divided by
the number of generated boundaries.
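The two metrics can be computed as below, assuming boundaries are represented as sets of positions between sentences; the example figures are made up for illustration.

```python
def recall_precision(generated, correct):
    """Recall = hits / |correct|; precision = hits / |generated|."""
    hits = len(set(generated) & set(correct))
    return hits / len(correct), hits / len(generated)

# Hypothetical example: 2 of 3 correct boundaries found among 4 generated.
r, p = recall_precision({1, 4, 6, 9}, {1, 4, 7})  # r ≈ 0.667, p = 0.5
```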
Here we do not take into account information about paragraph
boundaries, such as indentation, for the following reasons:
Keywords: lexical chain, contextual analysis, discourse structure