Identification and Analysis of Chinese Names of Organizations and Institutions

Xiaoheng ZHANG, Hong Kong Polytechnic University
Lingling WANG, Hong Kong Polytechnic University

Abstract

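Chinese names of organizations and institutions are vast in number and constantly being coined, and the great majority are not listed in dictionaries, which creates difficulties for natural language processing. From a linguistic point of view, however, an organization name is a modifier-head compound proper noun and, at the same time, a relatively simple kind of modifier-head noun phrase, with its own structural regularities and morphological markers. Focusing on names of colleges and universities, and drawing on authentic corpus data from Mainland China, Hong Kong and Taiwan, this paper discusses the identification and analysis of organization names from both the linguistic and the computational perspective, and summarizes the corresponding rules. When these rules were applied to identify college and university names in a corpus of more than six million characters from the three regions, precision (counting a name as correct only when both its left and right boundaries are located correctly) reached 97.3% and recall reached 96.9%. The rules can also be applied in other areas, such as intelligent pinyin-to-character conversion and machine translation.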

As important proper nouns, Chinese names of organizations and institutions play an indispensable role in language communication. Unfortunately, because they are practically unlimited in number, are constantly being created and abandoned, and tend to be long and complex, most of these names have not found their way into the Chinese dictionaries of computer systems. Linguistically, however, these proper nouns can be viewed as a special group of compound nouns and as a simple category of noun phrase, with their own formation rules and physical markers. This paper presents a pioneering discussion of the analysis of Chinese names of organizations and institutions from a computational point of view. Useful linguistic rules have been drawn from the discussion and applied to the identification of names of organizations and institutions in the 6,000,000-character Mainland-Hong Kong-Taiwan corpus of modern Chinese developed by Hong Kong Polytechnic University. Preliminary experiments show that both the precision and the recall rates for identifying names of colleges and universities are over 96%.
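
The precision and recall figures reported in the abstract count an identified name as correct only when both its left and right boundaries are located correctly. The following is a minimal sketch of how such exact-boundary scores could be computed. It is not the authors' evaluation code: the function name `precision_recall`, the representation of a name as a `(start, end)` character-offset pair, and the example sentence are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: a name is a (start, end) pair of character
# offsets into the text, with `end` exclusive.
Span = Tuple[int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-boundary precision and recall.

    A predicted span counts as correct only if an identical span
    (same start AND same end) appears in the gold annotation.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative sentence (not taken from the paper's corpus).
text = "他毕业于香港理工大学,后来在北京大学任教。"

# Gold annotation: 香港理工大学 at [4, 10) and 北京大学 at [14, 18).
gold = {(4, 10), (14, 18)}

# A hypothetical system output: the first name is exact, the second
# starts one character too early (在北京大学), so it is counted as wrong.
predicted = {(4, 10), (13, 18)}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case one of the two predicted spans matches the gold annotation exactly, so the second span, although it overlaps a real university name, contributes nothing to either score. This is the strict sense of "correct" used for the reported figures.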