Integrated Processing of Linguistic and Speech Information

Tanaka Hozumi

Department of Computer Science, Tokyo Institute of Technology

e-mail: tanaka@cs.titech.ac.jp

Morphological analysis of Japanese is very different from that of English because no spaces are placed between words. This is also the case in many Asian languages such as Korean, Chinese, and Thai. In the Indo-European family, some languages such as German show the same phenomenon when forming complex noun phrases. Processing such languages first requires identifying word boundaries. This process, often called segmentation, is one of the most important tasks of morphological analysis for these languages, since a wrong segmentation causes fatal errors in later stages such as syntactic, semantic and contextual analysis. However, correct segmentation is not always possible with morphological information alone; syntactic, semantic and contextual information may help resolve segmentation ambiguities. Over the past few decades, a number of studies have addressed the morphological and syntactic analysis of Japanese. They can be classified into three approaches: Cascade, Interleave and Single Framework. The first two, Cascade and Interleave, represent the morphological and syntactic constraints separately, which makes the constraints easier to maintain and extend, and many natural language processing systems have used them. Their drawbacks are that a separate algorithm is required for each analysis and that all ambiguities from the morphological analysis are retained until the syntactic analysis begins. From a processing viewpoint, on the other hand, it is preferable to integrate morphological and syntactic analysis into a single framework, since some syntactic constraints are useful for morphological analysis and vice versa. The Single Framework approach fulfills this requirement.
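To make the segmentation ambiguity described above concrete, the following is a minimal sketch that enumerates every dictionary-consistent segmentation of an unspaced string. The function name `segmentations` and the toy lexicon are assumptions introduced purely for illustration; they are not taken from the paper.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words contained in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (illustrative only).
TOY_LEXICON = {"くるま", "くる", "ま", "まで", "で", "まつ"}

if __name__ == "__main__":
    for candidate in segmentations("くるまでまつ", TOY_LEXICON):
        print(" / ".join(candidate))
```

With this toy lexicon the input has more than one segmentation, including くるま / で / まつ ("wait in the car") and くる / まで / まつ ("wait until someone comes"). The dictionary alone cannot choose between them, which is why information from later stages is needed.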
There have been several attempts to develop a context-free grammar (CFG) that covers both the morphological and syntactic constraints. However, it is empirically difficult to describe both kinds of constraints using only a CFG. It is therefore desirable to represent the morphological and syntactic constraints separately, as in Cascade and Interleave, while integrating the execution of both analyses into a single process, as in Single Framework. Our method captures both advantages by representing the morphological constraints in connection matrices and the syntactic constraints in CFGs, and then compiling both into an LR table. Existing, efficient LR parsing algorithms can be used with minor modifications, enabling us to apply the morphological and syntactic constraints at the same time. This approach is also promising for incorporating phonetic-level constraints into the system.

Keywords: morphological analysis, syntactic analysis, Japanese analysis, generalized LR parsing
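The role of a connection matrix can be illustrated with a small sketch. It shows only the adjacency check between word categories applied during segmentation; it does not implement the compilation of connection matrices and CFGs into an LR table described above. The lexicon, the category names and the `ALLOWED` pairs are hypothetical and deliberately restrictive, and they are not a linguistically accurate account of Japanese.

```python
# Hypothetical toy lexicon: surface form -> possible categories.
LEXICON = {
    "くるま": ["noun"],
    "くる": ["verb"],
    "ま": ["noun"],
    "まで": ["particle"],
    "で": ["particle"],
    "まつ": ["verb"],
}

# Hypothetical connection matrix, stored as the set of permitted
# (left category, right category) pairs. START and END mark the
# boundaries of the input.
ALLOWED = {
    ("START", "noun"), ("START", "verb"),
    ("noun", "particle"),
    ("verb", "particle"),
    ("particle", "noun"), ("particle", "verb"),
    ("verb", "END"),
}


def analyses(text, prev="START"):
    """Return all segmentations of `text` whose adjacent categories connect."""
    if not text:
        return [[]] if (prev, "END") in ALLOWED else []
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        for category in LEXICON.get(word, []):
            if (prev, category) not in ALLOWED:
                continue  # pruned by the connection matrix
            for rest in analyses(text[end:], category):
                results.append([(word, category)] + rest)
    return results


if __name__ == "__main__":
    for analysis in analyses("くるまでまつ"):
        print(" / ".join(f"{w}:{c}" for w, c in analysis))
```

Under these assumptions the dictionary-consistent candidate くる / ま / で / まつ is rejected as soon as the verb-noun pair is looked up, while the two remaining analyses survive and would have to be resolved by syntactic or later-stage constraints.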