Thesaurus Construction and Analysis Method for Dialogue Understanding
--Thesaurus Construction for Dialogue Understanding--

Hiroaki TSURUMARU, Hideyuki MAEDA, and Akihiro KAWASHIMA

Department of Electrical Engineering and Computer Science, Nagasaki University
1-14 Bunkyou-machi, Nagasaki 852, JAPAN
e-mail: turumaru@ec.nagasaki-u.ac.jp

Many ellipses and demonstrative pronouns occur in dialogue. In general, omitted words or phrases and pronominal references are resolved by using common sense and discourse information. A serious problem for dialogue understanding is that common sense has no clear definition. A thesaurus consisting of semantic (hierarchical) relations between words, such as the upper/lower relation and the part/whole relation, can be regarded as an approximate model of common sense. Thesauri such as ``Bunrui-Goi-Hyo (Word List by Semantic Principles)'' and ``Roget's Thesaurus'' already exist. However, they are not always sufficient for natural language processing, because they are intended mainly for human use.
This study aims to clarify a method for constructing a thesaurus based on hierarchical relations between the concepts of words, such as the upper/lower relation and the part/whole relation, and to address the problems of applying the thesaurus to dialogue understanding. Here we regard each sense of a word as one concept. How, and from what source, to obtain these hierarchical relations is one of the most important problems in constructing the thesaurus.
We have been studying how to acquire these hierarchical relations from the definition sentences in an on-line Japanese dictionary, and have been developing a software system for computer-aided thesaurus construction. This year's work covers three main topics. First, we review the algorithm for extracting the hierarchical relations. Second, we discuss the evaluation of the trial thesaurus that was built experimentally from the results of this work. Third, we discuss the application of the thesaurus to inferring omitted words in dialogue. We describe these three topics in more detail below.

(1) We have reviewed the extraction algorithm from a theoretical viewpoint. The basic idea of extracting the hierarchical relations is as follows. A definition sentence generally contains one or more core words expressing the central meaning of the word sense, which we call the definition word(s). We extract the definition word(s) and the relational information, and then decide the semantic relation between the entry word and the definition word. In addition to the upper/lower relation and the part/whole relation, the semantic relations include the synonymous relation and the element/set relation. We regard these two additional relations as hierarchical relations in a broad sense.
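
The idea in (1) can be illustrated with a minimal sketch. It is not the algorithm used in this study: the study works on Japanese definition sentences, whereas the sketch uses toy English definitions, and the cue phrases in RELATION_CUES, the word lists, and the function names are all hypothetical choices made for illustration. A leading cue phrase plays the role of the relational information, and the head noun of the remaining phrase plays the role of the definition word.

```python
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    entry: str
    definition_word: Optional[str]
    relation: str


# Hypothetical English cue phrases standing in for the relational information.
RELATION_CUES = [
    ("a part of ", "part/whole"),
    ("a member of ", "element/set"),
    ("same as ", "synonym"),
    ("a kind of ", "upper/lower"),
]
ARTICLES = {"a", "an", "the"}
# Words assumed to start a modifier that follows the head noun.
CLAUSE_MARKERS = {"that", "which", "who", "with", "for", "of"}


def head_noun(phrase: str) -> Optional[str]:
    """Return the last word before the first clause marker, skipping articles."""
    tokens = phrase.split()
    while tokens and tokens[0] in ARTICLES:
        tokens = tokens[1:]
    head = []
    for token in tokens:
        if token in CLAUSE_MARKERS:
            break
        head.append(token)
    return head[-1] if head else None


def extract_relation(entry: str, definition: str) -> Relation:
    """Pick the definition word and decide its relation to the entry word."""
    text = definition.lower().strip().rstrip(".")
    for cue, relation in RELATION_CUES:
        if text.startswith(cue):
            return Relation(entry, head_noun(text[len(cue):]), relation)
    # Without an explicit cue, assume the head noun is a superordinate.
    return Relation(entry, head_noun(text), "upper/lower")


print(extract_relation("cat", "A small domesticated animal that catches mice."))
print(extract_relation("wheel", "A part of a vehicle that rotates."))
```

Treating a definition with no cue phrase as an upper/lower relation is an assumption of this sketch; a real system would need further evidence before deciding the relation.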

(2) Concerning the evaluation of the trial thesaurus, we have investigated the following:

[a] the number of superordinates (upper words) and the number of hyponyms (lower words) under each superordinate

[b] the definition of depth, the number of paths at each depth, and the average depth in the conceptual hierarchy

[c] the use of by-passes to infer the valid path among multiple paths

[d] the assignment of superordinates to the locally maximal words

[e] the logical extension of the word pairs satisfying the part/whole relation with the aid of the upper/lower relation
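
Items [b] and [e] can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the toy hierarchy UPPER, the definition of depth as the number of upper/lower links on a path from a word to a top word, and the reading of [e] as "a part of a superordinate is also a part of each of its hyponyms". The definitions actually adopted in this study may differ.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy upper/lower relation: each word maps to its superordinates (hypothetical data).
UPPER: Dict[str, List[str]] = {
    "sedan": ["car"],
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["artifact"],
    "knife": ["tool", "weapon"],  # a word with multiple upward paths
    "tool": ["artifact"],
    "weapon": ["artifact"],
    "artifact": [],
}


def paths_to_top(word: str, upper: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every upward path from word to a word with no superordinate."""
    parents = upper.get(word, [])
    if not parents:
        return [[word]]
    return [[word] + path for p in parents for path in paths_to_top(p, upper)]


def depth_histogram(upper: Dict[str, List[str]]) -> Counter:
    """Count paths by depth, with depth assumed to be the number of links."""
    return Counter(
        len(path) - 1 for word in upper for path in paths_to_top(word, upper)
    )


def average_depth(upper: Dict[str, List[str]]) -> float:
    hist = depth_histogram(upper)
    return sum(depth * n for depth, n in hist.items()) / sum(hist.values())


def hyponyms(word: str, upper: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect lower words of word."""
    direct = {child for child, parents in upper.items() if word in parents}
    result = set(direct)
    for child in direct:
        result |= hyponyms(child, upper)
    return result


def extend_part_whole(
    pairs: Set[Tuple[str, str]], upper: Dict[str, List[str]]
) -> Set[Tuple[str, str]]:
    """Propagate each (part, whole) pair to every hyponym of the whole."""
    extended = set(pairs)
    for part, whole in pairs:
        extended |= {(part, lower) for lower in hyponyms(whole, upper)}
    return extended


print(paths_to_top("knife", UPPER))
print(sorted(depth_histogram(UPPER).items()), average_depth(UPPER))
print(sorted(extend_part_whole({("wheel", "vehicle")}, UPPER)))
```

The word "knife" has two upward paths in this toy data, which is the kind of multi-path situation that item [c] is concerned with; the sketch only enumerates such paths and does not choose between them.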

(3) Concerning the application of the thesaurus, we have studied an algorithm for inferring omitted words or phrases in dialogue, using the thesaurus and the IPAL basic verbs dictionary, in order to verify the validity of the trial thesaurus. The outline of the algorithm is as follows:

morphological analysis of the input sentence,

estimation of the deep structure (case structure),

recognition of the existence of the omitted words or phrases,

inference of the omitted words or phrases.

This algorithm can also be used to handle pronominal and anaphoric reference. We collected the dialogue data for the experiments from the texts of NHK Sequel Basic English (1991).
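
The last two steps of the outline in (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of this study: morphological analysis and case-structure estimation are assumed to have been done already, CASE_FRAMES is a hypothetical stand-in for entries of a verb dictionary, UPPER is a toy thesaurus, and preferring the most recently mentioned candidate is a heuristic chosen only for the sketch.

```python
from typing import Dict, List

# Hypothetical case frames: verb -> case -> semantic restriction on the filler.
CASE_FRAMES: Dict[str, Dict[str, str]] = {
    "eat": {"agent": "animate", "object": "food"},
}

# Toy upper/lower relation: each word maps to its superordinates (hypothetical data).
UPPER: Dict[str, List[str]] = {
    "Tom": ["person"],
    "person": ["animate"],
    "apple": ["fruit"],
    "fruit": ["food"],
}


def is_a(word: str, concept: str, upper: Dict[str, List[str]]) -> bool:
    """True if concept is word itself or one of its direct or indirect upper words."""
    if word == concept:
        return True
    return any(is_a(parent, concept, upper) for parent in upper.get(word, []))


def missing_cases(verb: str, filled: Dict[str, str]) -> List[str]:
    """Recognize omissions: cases required by the frame but absent from the input."""
    return [case for case in CASE_FRAMES[verb] if case not in filled]


def infer_omitted(
    verb: str, filled: Dict[str, str], context: List[str]
) -> Dict[str, str]:
    """Fill each missing case with the most recent context word that fits it."""
    inferred: Dict[str, str] = {}
    for case in missing_cases(verb, filled):
        restriction = CASE_FRAMES[verb][case]
        for candidate in reversed(context):  # most recent mention first
            if candidate in filled.values():
                continue
            if is_a(candidate, restriction, UPPER):
                inferred[case] = candidate
                break
    return inferred


# Previous utterance mentioned "Tom" and "apple"; the reply is just "Ate (it) already."
print(infer_omitted("eat", {}, ["Tom", "apple"]))
```

The thesaurus enters only through is_a: a candidate can fill a case when the semantic restriction of that case is reachable from the candidate by following upper/lower links.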

Keywords: thesaurus, semantic dictionary, word knowledge, conceptual hierarchy