Chinese computational linguistics is a subfield of computational linguistics; it is the scientific study and information processing of the Chinese language by means of computers. The purpose is to obtain a better understanding of how the language works and to make language applications more convenient. The term ''Chinese computational linguistics'' is often used interchangeably with Chinese information processing, though the former may sound more theoretical and the latter more technical.

Rather than introducing computational linguistics in a general sense, this article focuses on the issues that are specific to processing Chinese on computers, as compared with other languages. The contents include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation.

Chinese character information processing

''Chinese character Information Technology (IT)'' is the technology of computer processing of Chinese characters. While the English writing system makes use of a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the Xinhua Dictionary. In the Unicode multilingual character set of 149,813 characters, 98,682 (about 2/3) are Chinese characters. In terms of character set size, this makes Chinese the most demanding language for computers to process.

Chinese character input

Computer input of Chinese characters is more complicated than input for languages with simpler character systems. For example, the English language is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Theoretically, Chinese characters could be input in a similar way, but this approach is impractical for most applications because of the number of characters: it would require a massive keyboard with thousands of keys, and the user would find it difficult and time-consuming to locate individual characters on it.

An alternative method is to use the English keyboard layout and encode each Chinese character in English characters; this is the predominant method of Chinese character input today.

''Sound-based encoding'' is normally based on an existing Latin character scheme for Chinese phonetics, such as the Pinyin scheme for Mandarin Chinese (Putonghua) and the Jyutping scheme for the Cantonese dialect. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all of which can be easily input via an English keyboard.

A Chinese character can alternatively be input by ''form-based encoding''. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the input method creator obtains a letter string that can be used as an input code on the English keyboard. The creator can also design a rule to select representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 (border) is encoded as "NGMWM", corresponding to components "弓土一田一", with some components omitted. Popular form-based encoding methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong.

The most important feature of ''intelligent input'' is the application of contextual constraints to candidate character selection. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", they will get "大学教授" (university professor); when the user types "daxuepiaopiao", the computer will suggest "大雪飘飘" (heavy snow flying). Though the non-toned Pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words.
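The three ideas in this section (sound-based codes, form-based codes, and contextual selection) can be sketched in a few lines of Python. This is a toy illustration and not how Microsoft Pinyin or any real input method is implemented: the candidate table, the pair scores, and the function names are all hypothetical, and the input is assumed to be already split into one code per word.

```python
import re
from itertools import product

# Toy data for illustration only; a real input method uses a large lexicon
# and statistics estimated from corpora.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
    "xianggang": ["香港"],
}
# Hypothetical co-occurrence scores for adjacent words.
PAIR_SCORES = {("大学", "教授"): 5, ("大雪", "飘飘"): 5}

# Cangjie key letters for the components named in the text.
CANGJIE_KEYS = {"N": "弓", "G": "土", "M": "一", "W": "田"}


def strip_tones(code):
    """Drop the optional tone digits from a sound-based input code."""
    return re.sub(r"\d", "", code)


def cangjie_components(code):
    """Map a Cangjie letter string to its sequence of components."""
    return "".join(CANGJIE_KEYS[letter] for letter in code)


def convert(codes):
    """Choose one candidate per input code so that adjacent words fit best."""
    options = [CANDIDATES[strip_tones(code)] for code in codes]

    def score(words):
        # Sum the scores of every pair of neighbouring words.
        return sum(PAIR_SCORES.get(pair, 0) for pair in zip(words, words[1:]))

    # Exhaustive search is fine for a toy table; ties keep the first candidate.
    return "".join(max(product(*options), key=score))


print(convert(["daxue", "jiaoshou"]))   # 大学教授
print(convert(["daxue", "piaopiao"]))   # 大雪飘飘
print(cangjie_components("NGMWM"))      # 弓土一田一
```

Enumerating every combination only works for very short inputs; practical systems would use dynamic programming over a statistical language model instead.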

Chinese character encoding for information interchange

Inside the computer, each character is represented by an internal code. When a character is sent between two machines, it is in an information interchange code. Nowadays, information interchange codes such as ASCII and Unicode are often directly employed as internal codes.

The first ''GB Chinese character encoding standard'' is GB2312, which was released by the PRC in 1980. It includes 6,763 Chinese characters, with the 3,755 frequently used ones sorted by Pinyin and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters. Traditional characters which have been simplified are not covered. The code of a character is represented by a two-byte hexadecimal number; for instance, the GB codes of 香 and 港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. The latest version of GB encoding is GB18030, which supports both simplified and traditional Chinese characters and is consistent with the Unicode character set.

The standard of ''Big5 encoding'' was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with no simplified characters of the Mainland. Each character is encoded with a two-byte hexadecimal code, for example, 香 (ADBB), 港 (B4E4), 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters.

The full version of the ''Unicode standard'' represents a character with a 4-byte digital code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.1, there is a multilingual character set of 149,813 characters, of which 98,682 (about 2/3) are Chinese characters, sorted by Kangxi radicals. Even very rarely used characters are available. For example: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).

Unicode is becoming more and more popular. It is reported that UTF-8 (Unicode) is used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, and that there will be no more code confusion.
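The codes quoted above can be checked with the codecs that ship with Python. The sketch below is only an illustration: the helper names `hex_code` and `code_point` are invented for this example, while `gb2312`, `big5` and `utf-8` are codec names from the Python standard library.

```python
def hex_code(char, encoding):
    """Return the bytes of `char` in `encoding` as uppercase hex.

    Returns None when the character is not part of that character set.
    """
    try:
        return char.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        return None


def code_point(char):
    """Return the Unicode code point of `char` as uppercase hex."""
    return f"{ord(char):04X}"


# One row per character: Unicode code point, then GB2312, Big5 and UTF-8 bytes.
for char in "香港龍龙":
    print(char, code_point(char),
          hex_code(char, "gb2312"),
          hex_code(char, "big5"),
          hex_code(char, "utf-8"))

# A character outside the Basic Multilingual Plane needs two 16-bit units
# (a surrogate pair) in UTF-16, that is, four bytes.
print(code_point("𪚥"), len("𪚥".encode("utf-16-be")))
```

A `None` in the GB2312 or Big5 column means the legacy character set has no code for that character, which is the coverage gap that Unicode was designed to close.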

Chinese character output

As with English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families. Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called ''zihao'', 字号), invented by an American for Chinese printing in 1859.

Word segmentation

It is straightforward to recognize words in English text because they are separated by spaces. Chinese words, however, are not separated by any boundary markers, so word segmentation is the first step in the text analysis of Chinese. For example,

中文信息学报 (Chinese original text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English translation)
Journal of Chinese Information Processing (English name)

Chinese word segmentation on a computer is carried out by matching characters in the Chinese text against a lexicon (a list of Chinese words), either forward from the beginning of the sentence or backward from the end.

There are two kinds of segmentation ambiguity: the intersection type (交集型歧义字段) and the polysemy type (多义型歧义字段). Typically, an intersection ambiguity has the form ABC, where A, AB, BC and C are all words in the lexicon. The original character string can be divided into the word AB followed by C, or into A followed by BC. For example, ‘美国会’ may mean ‘美 国会’ (the US Congress) or ‘美国 会’ (the US can/will).

The most common form of polysemy-type segmentation ambiguity is AB, where A, B and AB are all words. That means the character string can be regarded as one single word or be divided into two. For example, consider the string ‘可以’ in the following sentences:

(1) 你 可以 坐下。
you can sit down
You can sit down.

(2) 你 可 以 他们 为 样板。
you can take them as example
You can take them as an example.

Word segmentation ambiguities can be resolved with contextual information, using linguistic rules and the probabilities of word co-occurrences derived from Chinese corpora. Matches against longer words are usually more reliable.

The accuracy of automatic word segmentation has reached 95%. However, 100% accuracy cannot be guaranteed in the foreseeable future, because that would require a complete understanding of the text. An alternative solution is to encourage people to write with words separated, as is done in English. But that does not mean computer word segmentation will no longer be needed, because even in English, word segmentation is required for speech analysis.
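Lexicon matching in both directions can be sketched as a greedy longest-match loop. The code below is a minimal illustration and not a production segmenter: the function names, the tiny lexicon, and the `max_len` limit are assumptions made for the example, and any character not covered by the lexicon is emitted as a one-character word.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment `text` left to right, always taking the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Segment `text` right to left, always taking the longest lexicon word."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


lexicon = {"中文", "信息", "学报", "美", "美国", "国会", "会"}
print(forward_max_match("中文信息学报", lexicon))   # ['中文', '信息', '学报']
print(forward_max_match("美国会", lexicon))         # ['美国', '会']
print(backward_max_match("美国会", lexicon))        # ['美', '国会']
```

The two directions disagree on ‘美国会’, which is exactly the intersection-type ambiguity described above; a disagreement between forward and backward matching is a simple signal that context is needed to choose a segmentation.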

Proper noun recognition

A proper noun is the name of a person, a place, an institution, etc., and is written in English with the initial letter of each word capitalized, for example, ‘Mr. John Nealon’, ‘America’ and ‘Cambridge University’. Chinese proper nouns, however, are usually not marked in any way.

Recognition of the names of people and places in Chinese text can be supported by a list of names. However, such a list can never be complete, considering the huge number of places and people all over the world, not to mention that names constantly appear, change and disappear. There are also names that coincide with common nouns. For example, there is a town named 民众 (Minzhong) in southern China, which is also a common noun meaning ‘people’. Therefore, recognition of the names of people and places has to make use of their distinguishing features in internal composition and external context. Corpora annotated with proper nouns can also serve as a useful reference.

A person’s name not found in the dictionary can be recognized with a list of surnames and titles, for example ‘张大方先生’ and ‘李经理’, where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be successfully recognized as a person’s name by the rule that a Chinese given name normally follows the surname and consists of 1 or 2 characters, and the fact that people can speak (说).

Place names also have characteristics useful for computer recognition. For example, in ‘在广东省中山市民众镇’, the component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition that frequently appears in front of a location.

The accuracy of computer recognition has reached around 90% for personal names and 95% for place names.
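The surname, title and end-marker rules above can be approximated with regular expressions. The following is a rough sketch under strong assumptions: the word lists contain only the items mentioned in the text, the function names are invented for this example, and real systems combine such rules with large name lists and annotated corpora.

```python
import re

SURNAMES = "张李"              # tiny illustrative surname list
RIGHT_CONTEXT = ("先生", "经理", "说")  # titles, plus a verb that people can do
PLACE_SUFFIXES = "省市镇"      # end markers of place names
HAN = r"\u4e00-\u9fff"         # basic range of Chinese characters

# A surname, then a given name of 0 to 2 characters (as short as possible),
# followed by a title or speech verb that is not part of the name.
PERSON = re.compile(
    rf"([{SURNAMES}][{HAN}]{{0,2}}?)(?:{'|'.join(RIGHT_CONTEXT)})"
)
# The preposition 在 followed by a chain of units that each end in a marker.
PLACE_CHAIN = re.compile(rf"在((?:[{HAN}]+?[{PLACE_SUFFIXES}])+)")
PLACE_UNIT = re.compile(rf"[{HAN}]+?[{PLACE_SUFFIXES}]")


def find_person_names(text):
    """Return the name strings that precede a title or speech verb."""
    return [match.group(1) for match in PERSON.finditer(text)]


def find_place_names(text):
    """Return the place names in the first location phrase introduced by 在."""
    match = PLACE_CHAIN.search(text)
    return PLACE_UNIT.findall(match.group(1)) if match else []


print(find_person_names("张大方先生和李经理"))   # ['张大方', '李']
print(find_person_names("张大方说"))             # ['张大方']
print(find_place_names("在广东省中山市民众镇"))  # ['广东省', '中山市', '民众镇']
```

Because 民众 is followed by the end marker 镇 and sits inside a phrase introduced by 在, the sketch treats it as part of a place name rather than as the common noun meaning ‘people’.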

Journals and proceedings

* Journal of Chinese Information Processing (http://jcip.cipsc.org.cn/CN/home)
* International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP) (https://www.aclclp.org.tw/journal/index.php)
* China National Conference on Chinese Computational Linguistics (https://link.springer.com/conference/cncl)
* Rocling Proceedings (https://www.aclclp.org.tw/pub_proce.php)
