Chinese computational linguistics is a subset of computational linguistics; it is the scientific study and information processing of the Chinese language by means of computers. The purpose is to obtain a better understanding of how the language works and to bring more convenience to language applications. The term ''Chinese computational linguistics'' is often employed interchangeably with Chinese information processing, though the former may sound more theoretical while the latter more technical. Rather than introducing computational linguistics in a general sense, this article will focus on the unique issues involved with implementing the Chinese language compared to other languages. The contents include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation.

Chinese character information processing

''Chinese character Information Technology (IT)'' is the technology of computer processing of Chinese characters. While the English writing system makes use of a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the ''Xinhua Dictionary''. In the Unicode multilingual character set of 149,813 characters, 98,682 (about 2/3) are Chinese characters. This means that computer processing of Chinese characters is the most intensive among all languages.

Chinese character input

Computer input of Chinese characters is more complicated than input for languages which have simpler character systems. For example, the English language is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Theoretically, Chinese characters could be input in a similar way, but this approach is impractical for most applications due to the number of characters; it would require a massive keyboard with thousands of keys, and the user would find it difficult and time-consuming to locate individual characters on the keyboard. An alternative method is to use the English keyboard layout and encode each Chinese character in English characters; this is the predominant method of Chinese character input today.

''Sound-based encoding'' is normally based on an existing Latin character scheme for Chinese phonetics, such as the Pinyin Scheme for Mandarin Chinese (Putonghua) and the Jyutping Scheme for the Cantonese dialect. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all of which can be easily input via an English keyboard.

A Chinese character can alternatively be input by ''form-based encoding''. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the input method creator obtains a letter string ready to be used as an input code on the English keyboard. The creator can also design a rule to select representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 (border) is encoded as "NGMWM", corresponding to components "弓土一田一", with some components omitted. Popular form-based encoding methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong.

The most important feature of ''intelligent input'' is the application of contextual constraints for candidate character selection. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", the result is "大学教授" (university professor); when the user types "daxuepiaopiao", the computer suggests "大雪飘飘" (heavy snow flying). Though the non-toned Pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words.
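The contextual selection used by intelligent input can be illustrated with a toy sketch. Everything named here is an assumption made for illustration: the `LEXICON` and `BIGRAM` tables, their scores, and the function `convert` are hypothetical, and real input methods such as Microsoft Pinyin rely on far larger dictionaries and statistical language models.

```python
from functools import lru_cache

# Hypothetical toy lexicon: untoned pinyin -> candidate words (homophones).
LEXICON = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Hypothetical co-occurrence scores for adjacent word pairs.
BIGRAM = {
    ("大学", "教授"): 5,
    ("大雪", "飘飘"): 5,
}


def convert(code: str) -> str:
    """Return the best-scoring word sequence whose pinyin spells `code`."""

    @lru_cache(maxsize=None)
    def best(pos: int, prev: str):
        # Best (score, text) for code[pos:], given the previous word `prev`;
        # None if the rest of the code cannot be spelled from the lexicon.
        if pos == len(code):
            return (0, "")
        result = None
        for end in range(pos + 1, len(code) + 1):
            for word in LEXICON.get(code[pos:end], []):
                tail = best(end, word)
                if tail is None:
                    continue
                score = BIGRAM.get((prev, word), 0) + tail[0]
                if result is None or score > result[0]:
                    result = (score, word + tail[1])
        return result

    found = best(0, "")
    # Fall back to the raw input code when no conversion exists.
    return found[1] if found else code


print(convert("daxuejiaoshou"))
print(convert("daxuepiaopiao"))
```

The same code "daxue" is resolved differently in the two calls because the score depends on the word that follows it.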

Chinese character encoding for information interchange

Inside the computer each character is represented by an internal code. When a character is sent between two machines, it is in information interchange code. Nowadays, information interchange codes, such as ASCII and Unicode, are often directly employed as internal codes.

The first ''GB Chinese character encoding standard'' is GB2312, which was released by the PRC in 1980. It includes 6,763 Chinese characters, with 3,755 frequently-used ones sorted by Pinyin, and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters. Traditional characters which have been simplified are not covered. The code of a character is represented by a two-byte hexadecimal number; for instance, the GB codes of 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. The latest version of GB encoding is GB18030, which supports both simplified and traditional Chinese characters, and is consistent with the Unicode character set.

The standard of ''Big5 encoding'' was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is popularly used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with no simplified characters of the Mainland. Each character is encoded with a two-byte hexadecimal code, for example, 香 (ADBB), 港 (B4E4), 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters.

The full version of the ''Unicode standard'' represents a character with a 4-byte digital code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.0, there is a multilingual character set of 149,813 characters, among which 98,682 (about 2/3) are Chinese characters sorted by Kangxi Radicals. Even very rarely-used characters are available. For example: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5). Unicode is becoming more and more popular. It is reported that UTF-8 (Unicode) is used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, putting an end to confusion between codes.
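The codes quoted above can be checked with a short sketch using the codecs bundled with Python (`gb2312`, `big5`, `gb18030`, `utf-8`). The helper name `hex_code` is an assumption introduced for this illustration.

```python
def hex_code(ch: str, encoding: str) -> str:
    """Return the bytes of `ch` in `encoding` as upper-case hexadecimal."""
    return ch.encode(encoding).hex().upper()


# The same two characters under GB2312, Big5 and Unicode.
for ch in "香港":
    print(
        ch,
        "GB2312:", hex_code(ch, "gb2312"),
        "Big5:", hex_code(ch, "big5"),
        "Unicode: %04X" % ord(ch),
        "UTF-8:", hex_code(ch, "utf-8"),
    )

# GB2312 covers simplified characters only, so the traditional form 龍
# cannot be encoded in it, while GB18030 can represent it.
try:
    "龍".encode("gb2312")
except UnicodeEncodeError:
    print("龍 is not in GB2312; GB18030 bytes:", hex_code("龍", "gb18030"))
```

Note that UTF-8 is a variable-length serialization of Unicode code points, so the UTF-8 bytes printed for a character differ from its hexadecimal code point.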

Chinese character output

As with English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families. Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called ''zihao'', 字号) invented by an American for Chinese printing in 1859.

Word segmentation

It is straightforward to recognize words in English text because they are separated by spaces. However, Chinese words are not separated by any boundary markers. Hence, word segmentation is the first step for text analysis of Chinese. For example,

中文信息学报 (Chinese original text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English translation)
Journal of Chinese Information Processing (English name)

Chinese word segmentation on a computer is carried out by matching characters in the Chinese text against a lexicon (list of Chinese words), either forward from the beginning of the sentence or backward from the end.

There are two kinds of segmentation ambiguities: the intersection type (交集型歧义字段) and the polynomial type (多义型歧义字段). Typically an intersection ambiguity is in the format of ABC, where A, AB, BC and C are all words in the lexicon. It is possible to divide the original character string into word AB followed by C, or A followed by BC. For example, ‘美国会’ may mean ‘美 国会’ (the US Parliament) or ‘美国 会’ (the US can/will). The most common form of polynomial segmentation ambiguity is AB, where A, B, and AB are all words. That means the character string can be regarded as one single word or be divided into two. For example, consider the string ‘可以’ in the following sentences:

(1) 你 可以 坐下。
you can sit down.
You can sit down.

(2) 你 可 以 他们 为 样板。
you can take them as example.
You can take them as an example.

Word segmentation ambiguities can be resolved with contextual information, using linguistic rules and word collocation probabilities derived from Chinese corpora. Usually, matching longer words is more reliable. The correctness rate of automatic word segmentation has reached 95%. However, there will be no guarantee of 100% correctness in the foreseeable future, because that would involve a complete understanding of the text. An alternative solution is to encourage people to write in a word-segmented way, as is done in English. But that does not mean computer word segmentation will no longer be needed, because even in English, word segmentation is required for speech analysis.
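Lexicon matching in the forward and backward directions can be sketched as greedy maximum matching, one common way to realize it. The tiny `LEXICON`, the `max_len` limit and the function names are assumptions made for this illustration; practical segmenters use much larger lexicons together with statistical disambiguation.

```python
# Hypothetical miniature lexicon covering the examples in this section.
LEXICON = {"中文", "信息", "学报", "美", "美国", "国会", "会"}


def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the lexicon.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


print(forward_max_match("中文信息学报", LEXICON))
print(forward_max_match("美国会", LEXICON))
print(backward_max_match("美国会", LEXICON))
```

For ‘美国会’ the two directions disagree (美国 + 会 versus 美 + 国会), which is one simple way to flag an intersection-type ambiguity for later resolution by context.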

Proper noun recognition

A proper noun is the name of a person, a place, an institution, etc. and is written in English with the initial letter of each word capitalized, for example, "Mr. John Nealon", "America" and "Cambridge University". However, Chinese proper nouns are usually not marked in any style. Recognition of names of people and places in Chinese text can be supported by a list of names. However, such a list can never be complete, considering the huge number of places and people all over the world, not to mention that names constantly appear, change and disappear. There are also names identical to common nouns. For example, there is a town named 民众 (Minzhong) in southern China, which is also a common noun meaning "people". Therefore, recognition of names of people and places has to make use of their distinguishing features in internal composition and external context. Corpora with proper nouns annotated can also serve as a useful reference.

A person's name not found in the dictionary can be recognized with a list of surnames and titles, for example "张大方先生" and "李经理", where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be successfully recognized as a person's name by the rule that a Chinese given name normally follows the surname and consists of 1 or 2 characters, and the fact that people can speak (说). Names of places also have characteristics useful for computer recognition. For example, in "在广东省中山市民众镇", component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition frequently appearing in front of a location. The correctness rate of computer recognition has reached around 90% for persons' names and 95% for place names.
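The surname, title and end-marker cues described above can be sketched with regular expressions. This is a minimal illustration under stated assumptions: the surname list, the patterns and the function names `find_persons` and `find_places` are hypothetical, and the rules are far too crude for real text, where statistical models and annotated corpora are needed.

```python
import re

SURNAMES = "张李王刘陈"      # hypothetical, tiny surname list
TITLES = ["先生", "经理"]     # titles that may follow a name
PLACE_SUFFIXES = "省市镇"     # end markers of place names
HAN = "\u4e00-\u9fff"         # basic CJK Unified Ideographs range

# A surname plus an optional 1-2 character given name, accepted only when
# followed by a title or by the verb 说 (the context is not captured).
PERSON = re.compile(
    "([{s}][{h}]{{0,2}}?)(?={t}|说)".format(s=SURNAMES, h=HAN, t="|".join(TITLES))
)

# One place unit: 1-3 non-marker characters followed by an end marker.
UNIT = "(?:(?![{m}])[{h}]){{1,3}}[{m}]".format(m=PLACE_SUFFIXES, h=HAN)
PLACE_UNIT = re.compile(UNIT)
# A run of place units introduced by the preposition 在.
PLACE_RUN = re.compile("在((?:{u})+)".format(u=UNIT))


def find_persons(sentence):
    """Return candidate person names found by the surname/context rule."""
    return PERSON.findall(sentence)


def find_places(sentence):
    """Return candidate place names found after the preposition 在."""
    places = []
    for run in PLACE_RUN.finditer(sentence):
        places.extend(PLACE_UNIT.findall(run.group(1)))
    return places


print(find_persons("张大方先生和李经理"))
print(find_persons("张大方说"))
print(find_places("在广东省中山市民众镇"))
```

Because the place pattern stops at the first end marker, 中山市民众镇 is split as 中山市 + 民众镇 rather than being misread through the common nouns 市民 or 民众.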

Journals and proceedings

* Journal of Chinese Information Processing (http://jcip.cipsc.org.cn/CN/home)
* International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP) (https://www.aclclp.org.tw/journal/index.php)
* China National Conference on Chinese Computational Linguistics (https://link.springer.com/conference/cncl)
* Rocling Proceedings (https://www.aclclp.org.tw/pub_proce.php)
