Chinese Language Technology

Chinese computational linguistics is a subset of computational linguistics; it is the scientific study and information processing of the Chinese language by means of computers. The purpose is to obtain a better understanding of how the language works and to make language applications more convenient. The term ''Chinese computational linguistics'' is often used interchangeably with ''Chinese information processing'', though the former may sound more theoretical and the latter more technical.

Rather than introducing computational linguistics in a general sense, this article focuses on the issues that are specific to processing Chinese by computer, as compared with other languages. The topics include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation.


Chinese character information processing

''Chinese character information technology (IT)'' is the technology of computer processing of Chinese characters. While the English writing system makes use of a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the Xinhua Dictionary. In the Unicode multilingual character set of 149,813 characters, 98,682 (about two thirds) are Chinese characters. This makes computer processing of Chinese characters more demanding than that of alphabetic scripts, in terms of the size of the character inventory that input, encoding and output must handle.


Chinese character input

Computer input of Chinese characters is more complicated than input of languages with simpler character systems. For example, the English language is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Theoretically, Chinese characters could be input in a similar way, but this approach is impractical for most applications because of the number of characters: it would require a massive keyboard with thousands of keys, and the user would find it difficult and time-consuming to locate individual characters on it. The alternative is to use the English keyboard layout and encode each Chinese character as a string of English characters; this is the predominant method of Chinese character input today.

''Sound-based encoding'' is normally based on an existing Latin-letter scheme for Chinese phonetics, such as the Pinyin scheme for Mandarin Chinese, or Putonghua, and the Jyutping scheme for Cantonese. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all of which can easily be typed on an English keyboard.

A Chinese character can alternatively be input by ''form-based encoding''. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the designer of an input method obtains a letter string that can be used as an input code on the English keyboard. If the string is too long, the designer can also define a rule for selecting representative letters from it. For example, in the Cangjie input method, the character 疆 (border) is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted. Popular form-based encoding methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong.

The most important feature of ''intelligent input'' is the use of contextual constraints to select among candidate characters. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", the result is "大学教授" (university professor), whereas typing "daxuepiaopiao" yields "大雪飘飘" (heavy snow flying). Although the toneless Pinyin of both 大学 and 大雪 is "daxue", the computer can make a reasonable selection based on the words that follow.
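The context-based selection described for intelligent input can be illustrated with a minimal Python sketch. It is not how Microsoft Pinyin or any real input method works internally: the candidate table, the co-occurrence counts and the function name `convert` are all invented for illustration, and the input is assumed to be already split into pinyin words.

```python
from itertools import product

# Hypothetical toy data: candidate words for each toneless pinyin code.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Invented co-occurrence counts standing in for statistics from a corpus.
BIGRAM_COUNTS = {
    ("大学", "教授"): 50,
    ("大雪", "飘飘"): 20,
}


def convert(pinyin_words):
    """Return the candidate word sequence with the highest bigram score."""
    options = [CANDIDATES[code] for code in pinyin_words]

    def score(words):
        # Sum the counts of every pair of adjacent words in the sequence.
        return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))

    return "".join(max(product(*options), key=score))


print(convert(["daxue", "jiaoshou"]))
print(convert(["daxue", "piaopiao"]))
```

The same code "daxue" is resolved differently in the two calls because only the following word changes the score. A practical system would need a much larger lexicon, smoothed probabilities and a search that does not enumerate every combination.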


Chinese character encoding for information interchange

Inside the computer, each character is represented by an internal code. When a character is sent between two machines, it is represented in an information interchange code. Nowadays, information interchange codes such as ASCII and Unicode are often used directly as internal codes.

The first ''GB Chinese character encoding standard'' is GB2312, which was released by the PRC in 1980. It includes 6,763 Chinese characters: 3,755 frequently used ones sorted by Pinyin, and the rest sorted by radicals (indexing components). GB2312 was designed for simplified Chinese characters; traditional characters which have been simplified are not covered. Each character is encoded in two bytes, usually written as a four-digit hexadecimal number; for instance, the GB codes of 香 and 港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and on the Web, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. The latest version of the GB encoding is GB18030, which supports both simplified and traditional Chinese characters and is consistent with the Unicode character set.

The ''Big5 encoding'' standard was designed by five big IT companies in Taiwan in the early 1980s and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with no simplified characters of the Mainland. Each character is encoded in two bytes, for example 香 (ADBB), 港 (B4E4) and 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which add some simplified characters and Hong Kong Cantonese characters.

The ''Unicode standard'' assigns each character a code point that fits in a 4-byte code (as in UTF-32), providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is the 2-byte core of Unicode, with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.1, the multilingual character set contains 149,813 characters, of which 98,682 (about two thirds) are Chinese characters, sorted by Kangxi radicals. Even very rarely used characters are available. For example: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).

Unicode is becoming more and more popular. UTF-8, an encoding of Unicode, is reported to be used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, putting an end to confusion between encodings.
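The codes quoted above can be checked with a short Python sketch. It relies only on the `gb2312`, `big5`, `utf-8` and `utf-16-be` codecs that ship with Python; the variable names are chosen here for illustration.

```python
text = "香港"

# Legacy two-byte encodings: each character becomes two bytes.
gb_hex = text.encode("gb2312").hex().upper()
big5_hex = text.encode("big5").hex().upper()
print(gb_hex, big5_hex)

# Unicode code points of the same characters, as four hexadecimal digits.
code_points = [f"{ord(ch):04X}" for ch in text]
print(code_points)

# The same text in UTF-8, the encoding most websites use for Unicode.
print(text.encode("utf-8").hex().upper())

# U+2A6A5 lies outside the BMP, so two bytes are not enough for it:
# it takes four bytes in both UTF-16 and UTF-8.
rare = chr(0x2A6A5)
rare_sizes = (len(rare.encode("utf-16-be")), len(rare.encode("utf-8")))
print(rare_sizes)
```

The GB2312 and Big5 results differ for the same two characters, which is why text decoded with the wrong legacy codec turns into unreadable output, whereas the Unicode code points are the same whichever encoding form carries them.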


Chinese character output

As with text in English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families. Fonts come in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called ''zihao'', 字号), which were invented by an American for Chinese printing in 1859.


Word segmentation

It is straightforward to recognize words in English text because they are separated by spaces. Chinese words, however, are not separated by any boundary markers. Hence, word segmentation is the first step in the text analysis of Chinese. For example:

中文信息学报 (Chinese original text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English translation)
Journal of Chinese Information Processing (English name)

Chinese word segmentation on a computer is carried out by matching characters in the Chinese text against a lexicon (a list of Chinese words), either forward from the beginning of the sentence or backward from the end.

There are two kinds of segmentation ambiguity: the intersection type (交集型歧义字段) and the polysemous type (多义型歧义字段). Typically an intersection ambiguity has the form ABC, where A, AB, BC and C are all words in the lexicon. The original character string can be divided into the word AB followed by C, or into A followed by BC. For example, ‘美国会’ may mean ‘美 国会’ (the US Congress) or ‘美国 会’ (the US can/will). The most common form of polysemous segmentation ambiguity is AB, where A, B and AB are all words. That means the character string can be regarded as a single word or be divided into two. For example, the string ‘可以’ in the following sentences:

(1) 你 可以 坐下。
    you can sit down.
    You can sit down.
(2) 你 可 以 他们 为 样板。
    you can take them as example.
    You can take them as an example.

Word segmentation ambiguities can be resolved with contextual information, using linguistic rules and the probabilities of word co-occurrence derived from Chinese corpora. Matches with longer words are usually more reliable. The accuracy of automatic word segmentation has reached 95%. However, 100% accuracy cannot be guaranteed in the foreseeable future, because that would require a complete understanding of the text. An alternative solution is to encourage people to write with words separated, as in English. But that does not mean computer word segmentation would no longer be needed, because even in English, word segmentation is required for speech analysis.
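Lexicon matching from the front and from the back of a sentence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production segmenter: the function names, the `max_len` parameter and the tiny lexicon are invented here, and any character not covered by a lexicon word is simply emitted as a one-character word.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the lexicon.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Segment text right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end, so reverse them


# Toy lexicon, just large enough for the examples in this section.
LEXICON = {"美", "美国", "国会", "会", "中文", "信息", "学报"}

print(forward_max_match("中文信息学报", LEXICON))
print(forward_max_match("美国会", LEXICON))
print(backward_max_match("美国会", LEXICON))
```

On ‘美国会’ the two directions disagree: forward matching takes ‘美国’ first and backward matching takes ‘国会’ first. Such a disagreement is a simple signal that the string contains an intersection ambiguity that has to be resolved from context.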


Proper noun recognition

A proper noun is the name of a person, a place, an institution, etc. In English it is written with the initial letter of each word capitalized, for example ‘Mr. John Nealon’, ‘America’ and ‘Cambridge University’. Chinese proper nouns, however, are usually not marked in any way.

Recognition of the names of people and places in Chinese text can be supported by a list of names. However, such a list can never be complete, considering the huge number of places and people all over the world, not to mention that names constantly appear, change and disappear. Moreover, some names coincide with common nouns. For example, there is a town named 民众 (Minzhong) in southern China, and 民众 is also a common noun meaning ‘people’. Therefore, recognition of the names of people and places has to make use of their distinguishing features in internal composition and external context. Corpora annotated with proper nouns can also serve as a useful reference.

A person’s name not found in the dictionary can be recognized with a list of surnames and titles, for example ‘张大方先生’ and ‘李经理’, where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be recognized as a person’s name by the rule that a Chinese given name normally follows the surname and consists of one or two characters, and by the fact that people can speak (说). Place names also have characteristics useful for computer recognition. For example, in ‘在广东省中山市民众镇’, the component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition that frequently appears in front of a location. The accuracy of computer recognition has reached around 90% for personal names and 95% for place names.
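The surname, title and end-marker cues above can be turned into a small rule-based sketch in Python using regular expressions. Everything here is an assumption made for illustration: the word lists contain only the items mentioned in the text, the function names are invented, and the Han character range covers only the basic CJK block. Real systems combine far larger lists with statistics from annotated corpora.

```python
import re

# Basic CJK ideograph block only; rarer characters are ignored in this sketch.
HAN = r"[\u4e00-\u9fff]"
# Toy lists taken from the examples in the text.
SURNAMES = "张李"
TITLES = "先生|经理"
SPEECH_VERBS = "说"
PLACE_SUFFIXES = "省市镇"

# Surname plus 0-2 given-name characters, directly followed by a title.
NAME_BEFORE_TITLE = re.compile(rf"([{SURNAMES}]{HAN}{{0,2}}?)(?={TITLES})")
# Surname plus 1-2 given-name characters, directly followed by a speech verb.
NAME_BEFORE_VERB = re.compile(rf"([{SURNAMES}]{HAN}{{1,2}}?)(?=[{SPEECH_VERBS}])")
# The preposition 在 followed by one or more units that end in a place suffix.
PLACE_CHAIN = re.compile(rf"在((?:{HAN}+?[{PLACE_SUFFIXES}])+)")
PLACE_UNIT = re.compile(rf"{HAN}+?[{PLACE_SUFFIXES}]")


def find_person_names(text):
    """Return name candidates licensed by a following title or speech verb."""
    return NAME_BEFORE_TITLE.findall(text) + NAME_BEFORE_VERB.findall(text)


def find_place_names(text):
    """Return place-name candidates that follow 在 and end in 省, 市 or 镇."""
    places = []
    for match in PLACE_CHAIN.finditer(text):
        places.extend(PLACE_UNIT.findall(match.group(1)))
    return places


print(find_person_names("张大方先生"))
print(find_person_names("李经理"))
print(find_person_names("张大方说"))
print(find_place_names("在广东省中山市民众镇"))
```

The rules are deliberately naive. They recover 民众镇 only because the town marker 镇 is present; the substring 市民 (citizen) inside ‘中山市民众镇’ is a reminder that surface patterns alone can be misleading and that context is still needed.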


Journals and proceedings

* Journal of Chinese Information Processing (http://jcip.cipsc.org.cn/CN/home)
* International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP) (https://www.aclclp.org.tw/journal/index.php)
* China National Conference on Chinese Computational Linguistics (https://link.springer.com/conference/cncl)
* Rocling Proceedings (https://www.aclclp.org.tw/pub_proce.php)


See also

* Computational linguistics
* Natural language processing
* Chinese language
* Chinese characters
* Chinese character IT

