Chinese character IT is the information technology for computer processing of Chinese characters.
While the English writing system uses a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the ''Xinhua Dictionary''. In the Unicode multilingual character set of 149,813 characters, 98,682 (about two-thirds) are Chinese. Computer processing of Chinese characters is therefore considerably more demanding than that of alphabetic scripts.
Compared to other languages, Chinese raises special issues in the computer input, internal encoding and output of characters.
Character input
Computer input of Chinese characters is by no means as easy as that of English. English is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Chinese can be input in a similar way. However, that would involve a huge keyboard with thousands of keys, and searching for a character on it would be a daunting job.
People did try to 'shrink' the Chinese keyboard by putting multiple characters on one key. That turned the original one-step input procedure into two steps for the writer:
# pressing the key for the character group of the target character,
# selecting the target character in the group.
The resulting keyboard still remained clumsy, because if you put more characters on one key, the key becomes bigger to make the characters recognizable, and selecting a character from a large group is difficult. Additionally, it is not easy to group the characters evenly in a reasonable and easy-to-learn way. Another drawback of a Chinese keyboard for direct whole character input is its inconsistency with English input.
An alternative way is to encode each Chinese character in English characters, enabling Chinese input on an English keyboard. As a matter of fact, this method has become predominant for Chinese computer input.
The software of an encoding input method includes a character-code table. When an ASCII input code is typed on the English keyboard, the software will search for matching Chinese characters in the table. If there are multiple characters sharing the same code, they will be presented to the user for selection.
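The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of any particular input method: the table `CODE_TABLE`, its handful of pinyin-style entries and the helper names `lookup` and `choose` are all assumptions made for the example.

```python
# Hypothetical character-code table: each ASCII input code maps to the
# Chinese characters that share it (entries chosen only for illustration).
CODE_TABLE = {
    "xiang": ["香", "想", "向", "像"],
    "gang": ["港", "刚", "钢"],
}


def lookup(code: str) -> list[str]:
    """Return all characters in the table that match the typed code."""
    return CODE_TABLE.get(code.lower(), [])


def choose(code: str, choice: int = 0) -> str:
    """Pick one candidate; `choice` stands in for the user's selection."""
    candidates = lookup(code)
    if not candidates:
        raise KeyError(f"no character for input code {code!r}")
    return candidates[choice]
```

A real input method would hold tens of thousands of entries and order the candidates by frequency; the sketch only shows the search-then-select flow.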
To make the input method easy to learn, encoding must be based on distinctive features in forms, sounds or meanings of Chinese characters. Because the meanings of characters tend to be more abstract and complicated, input encoding is normally based on the sound or form.
Sound-based encodings
Sound-based encoding is normally based on an existing Latin character scheme for Chinese phonetics, such as pinyin for Putonghua and Jyutping for Cantonese. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua pinyin input code of 香港 (Hong Kong) is ''xianggang'' or ''xiang1gang3'', and the Cantonese Jyutping code is ''hoenggong'' or ''hoeng1gong2'', all of which can be easily input via an English keyboard.
In Putonghua pinyin, there are two letters not appearing on the English keyboard: ê and ü. According to the national standard, ê should be represented by 'ea', and ü by 'v' in the pinyin input code. In some Chinese input software ê is also represented as 'e^', and ü as 'u:' or 'uu'.
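As a rough sketch of how a tone-marked pinyin syllable could be turned into such a keyboard code, the function below decomposes each letter with Unicode normalization, turns the tone mark into a trailing digit, and applies the 'v' and 'ea' substitutions mentioned above. The function name and the handling of untoned syllables (no digit appended) are assumptions for illustration.

```python
import unicodedata

# Combining tone marks and the tone numbers they stand for.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_input_code(syllable: str) -> str:
    """Convert one pinyin syllable (e.g. 'xiāng') to an ASCII input code."""
    letters: list[str] = []
    tone = ""
    # NFD splits a letter such as 'ǚ' into 'u' + diaeresis + caron.
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        elif ch == "\u0308" and letters:   # diaeresis: ü is typed as v
            letters[-1] = "v"
        elif ch == "\u0302" and letters:   # circumflex: ê is typed as ea
            letters[-1] = "ea"
        else:
            letters.append(ch)
    return "".join(letters) + tone
```

For example, `syllable_to_input_code("xiāng") + syllable_to_input_code("gǎng")` gives the tone-numbered code for 香港 quoted earlier.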
Popular sound-based input methods in China include Microsoft Pinyin, Sogou Pinyin, Google Pinyin and Jyutping on the mainland and Hong Kong, and bopomofo in Taiwan.
There are a number of advantages for sound-based encoding:
# Easy to learn because most Chinese writers have already got a good command of Putonghua and pinyin.
# Consistent with Chinese language learning.
# Allows simplified and traditional Chinese characters to be input in a similar way.
# Allows writing Chinese and English on the same keyboard.
The shortcoming of sound-based encoding lies in its high degree of duplicate encoding, with homophonous Chinese characters sharing the same code. A Chinese character is normally pronounced with one syllable. Putonghua has only about 400 different syllables if tones are ignored, or approximately 1,200 when tones are considered, whereas there are tens of thousands of Chinese characters. On average, therefore, each syllable has to cover over 10 characters. This problem can be largely solved by inputting Chinese word by word instead of character by character, because most words in modern Chinese consist of more than one character and duplicate encoding is much less frequent at the word level. For example, the pinyin of 香港 (Hong Kong) is unique to the word, while each of the characters 香 and 港 shares its pronunciation with many other characters. Another limitation of sound-based input is that the user must know the pronunciation of a character before inputting it. This issue can be solved by form-based encoding.
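The difference between character-level and word-level input can be made concrete with a toy comparison. The two tables below are invented for the example and are far smaller than those of a real input method; the point is only that character-by-character lookup multiplies the candidates, while a word-level code narrows them down.

```python
from itertools import product

# Hypothetical tables, far smaller than those of a real input method.
CHAR_TABLE = {
    "xiang": ["香", "想", "向", "像"],
    "gang": ["港", "刚", "钢"],
}
WORD_TABLE = {"xianggang": ["香港"]}


def char_level_candidates(syllables: list[str]) -> list[str]:
    """Every combination the user might face when typing character by character."""
    return ["".join(combo) for combo in product(*(CHAR_TABLE[s] for s in syllables))]


def word_level_candidates(syllables: list[str]) -> list[str]:
    """Candidates when the whole word is looked up with one code."""
    return WORD_TABLE.get("".join(syllables), [])
```

With these tables, `char_level_candidates(["xiang", "gang"])` yields 4 × 3 = 12 strings, whereas `word_level_candidates(["xiang", "gang"])` yields the single word 香港.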
Form-based encodings
A Chinese character can alternatively be input according to its form (or shape) and structure. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. For example, the character 福 ('good fortune', 'happiness') can be decomposed into the components 礻, 一, 口 and 田.
There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the creator of an input method obtains a letter string ready to be used as an input code on the English keyboard. The creator can also design a rule to select representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 ('border') is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted.
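The component-to-letter step can be sketched as a simple mapping. The dictionary below contains only the four Cangjie assignments that can be read off the 疆 example above; it is a deliberately partial table, and the function name is an assumption. Decomposing a character into components and deciding which ones to omit is the hard part and is not attempted here.

```python
# Partial component-to-letter table, limited to the pairs implied by the
# example 疆 = 弓土一田一 -> NGMWM. A full table covers the whole alphabet.
COMPONENT_TO_LETTER = {"弓": "N", "土": "G", "一": "M", "田": "W"}


def encode_components(components: str) -> str:
    """Map an already-decomposed component sequence to its letter code."""
    try:
        return "".join(COMPONENT_TO_LETTER[c] for c in components)
    except KeyError as missing:
        raise ValueError(f"component {missing.args[0]} is not in the partial table") from None
```

Calling `encode_components("弓土一田一")` reproduces the code quoted for 疆.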
Stroke-based coding is simpler than component-based coding, but the codes tend to be longer. There are approximately 30 to 40 distinctive strokes in Chinese characters. They are usually classified into five categories, heng (一), shu (丨), pie (丿), dian (丶) and zhe (𠃍), for dictionary lookup and Chinese input on a mobile phone. For Chinese input with an ASCII keyboard, two strokes can be combined to form 5 × 5 = 25 different pairs for mapping to the English letters. For example, in the input method ZYQ, the stroke pairs '一一, 一丨, 一丿, ..., 𠃍丿, 𠃍丶, 𠃍𠃍' are represented by 'a, b, c, ..., w, x, y' respectively. Popular form-based encoding methods include Wubi on the mainland and Cangjie in Taiwan and Hong Kong.
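The 5 × 5 pairing described above amounts to writing each pair as a two-digit number in base 5 and using it as an index into the alphabet. The sketch below follows the listed order of categories and the 'a' to 'y' assignment given for ZYQ; how a real method treats a character with an odd number of strokes is not stated here, so raising an error in that case is purely an assumption of the example.

```python
# The five stroke categories in the order listed above.
STROKES = ["一", "丨", "丿", "丶", "𠃍"]


def pair_to_letter(first: str, second: str) -> str:
    """Map a pair of stroke categories to one of the letters a..y."""
    index = 5 * STROKES.index(first) + STROKES.index(second)
    return chr(ord("a") + index)


def encode_strokes(strokes: str) -> str:
    """Encode an even-length stroke sequence two strokes at a time."""
    seq = list(strokes)
    if len(seq) % 2:
        # Assumption: odd-length input is rejected rather than padded.
        raise ValueError("expected an even number of strokes")
    return "".join(pair_to_letter(a, b) for a, b in zip(seq[0::2], seq[1::2]))
```

Since only 25 of the 26 letters are needed, one letter ('z') stays free in this scheme.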
The pros and cons of form-based input methods are complementary to sound-based methods. The major advantage of form-based methods lies in their low degree of duplicate encoding, enabling high speed input of Chinese characters. And the major shortcoming is difficulty of learning. Normally students have to remember over one hundred components and their corresponding English letters. In addition, they have to learn the complicated rules for breaking a character into a sequence of components and making a selection among them.
Optical character recognition
Chinese characters can also be input into the computer by optical character recognition (OCR), handwriting recognition and speech recognition based on technology similar to that of English.
Compared with English, Chinese OCR and handwriting recognition are more difficult, because there are thousands of commonly used characters instead of 26 letters. Generally speaking, recognition of printed characters is more accurate than recognition of handwritten ones, because printed forms are more standardized. There are OCR tools for different fonts, including the popular Song, Kai and Hei. Compared with offline handwriting recognition, online handwriting recognition is more efficient, because the computer 'sees' not only the written character but also the process of writing it.
Speech recognition
Speech recognition converts a continuous speech signal into a sequence of words. There are two problems: the variation in pronunciation of words by different speakers and the existence of homophones such as 'pair', 'pear' and 'pare' in English, and 攻势, 公式, 公示 (gong1shi4) in Chinese. Speech recognition relies on corpus statistical methods and linguistic rules. A helpful feature of Chinese is that each character is pronounced with one syllable.
Both Chinese character recognition and speech recognition have reached application level. However, neither can guarantee 100% correctness without human proofreading or online character selection.
Intelligent input engines
The most important feature of intelligent input is the application of contextual constraints to the selection of candidate characters. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", they get 大学教授 (university professor), and when they type "daxuepiaopiao", the computer suggests 大雪飘飘 (heavy snow flying). Though the non-diacritical pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words.
Intelligent Chinese input also makes use of corpus information and linguistic rules. The computer's selection among ambiguous Chinese characters is not always correct, and further improvement is required.
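One simple way to picture contextual selection is a word-pair (bigram) score: among all candidate word sequences, prefer the one whose neighbouring words have been seen together most often. The sketch below is a toy under strong assumptions. The candidate table, the co-occurrence counts and the pre-segmented input codes are all invented for illustration, and nothing here is claimed to reflect how Microsoft Pinyin or any other engine actually works.

```python
from itertools import product

# Hypothetical word candidates for already-segmented input codes.
WORD_TABLE = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Invented co-occurrence counts standing in for corpus statistics.
BIGRAM_COUNTS = {("大学", "教授"): 50, ("大雪", "飘飘"): 20}


def score(words: tuple[str, ...]) -> int:
    """Sum the counts of each adjacent word pair in the sequence."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))


def best_sentence(codes: list[str]) -> str:
    """Pick the candidate sequence with the highest bigram score."""
    options = product(*(WORD_TABLE[code] for code in codes))
    return "".join(max(options, key=score))
```

Here `best_sentence(["daxue", "jiaoshou"])` prefers 大学 and `best_sentence(["daxue", "piaopiao"])` prefers 大雪, mirroring the example above.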
Other input
In the Chinese writing system, there are graphemes other than complete Chinese characters, such as punctuation marks (e.g. '。', '、' and '《》'), strokes (e.g. '丿', '𠃍' and '乚'), radicals (e.g. '氵', '宀' and '刂'), and letters used for romanization, like the vowel letters with diacritics used in pinyin and the Yale romanization of Cantonese (e.g. 'ā', 'á', 'ǎ', 'à').
Facilities available on Microsoft Windows, Office and the web make it possible to input almost all of these auxiliary characters. They range from the input of punctuation marks in general Chinese input methods, to inputting diacritical pinyin with soft keyboards, to inputting strokes and radicals from the Unicode website and by Unicode-character conversion, as well as the use of special tools on the web to input pinyin and other characters. More information on non-logogram input can be found in a paper which includes a list of 280 non-ASCII non-logograms, each annotated with its Unicode code point and an input code of the author's design. It is also possible to input a character in Microsoft Word by typing its Unicode code point and pressing Alt+X.
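The code-point conversion behind a shortcut such as Alt+X can be imitated with Python's built-in `chr` and `ord`. The two helper names below are assumptions for the example; the conversion itself is just a change between a hexadecimal number and the character it identifies.

```python
def from_code_point(hex_code: str) -> str:
    """Turn a hexadecimal Unicode code point such as '3002' into its character."""
    return chr(int(hex_code, 16))


def to_code_point(char: str) -> str:
    """Return the code point of a single character as at least four hex digits."""
    return f"{ord(char):04X}"
```

For instance, `from_code_point("3002")` returns the Chinese full stop '。', and `to_code_point("、")` returns `"3001"`.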
Chinese character encoding for information interchange
Inside the computer each character is represented by an internal code. When a character is sent between two machines, it is in information interchange code. Nowadays, information interchange codes, such as ASCII and Unicode, are often directly employed as internal codes. The following sections will introduce the most important encoding standards used in Chinese information technology, including GB, Big5 and Unicode.
GB
GB stands for Guobiao, short for "Guojia Biaozhun" (国家标准, or 'national standard') in Putonghua, and is the prefix for reference numbers of official standards issued by the People's Republic of China.
The first GB Chinese character encoding standard is GB 2312, which was released in 1980. It includes 6,763 Chinese characters, with 3,755 frequently used ones sorted by pinyin, and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters. Traditional characters which have been simplified are not covered. The code of a character is represented by a two-byte hexadecimal number; for instance, the GB codes of 香 and 港 in 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released.
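The two-byte GB codes can be checked with Python's standard `gb2312` and `gb18030` codecs. The helper names below are assumptions for the example; the sketch prints each character's byte code in hexadecimal and shows that a traditional character whose simplified form exists is rejected by GB2312 but accepted by GB18030.

```python
def byte_codes(text: str, encoding: str = "gb2312") -> list[str]:
    """Return the encoded bytes of each character as an uppercase hex string."""
    return [ch.encode(encoding).hex().upper() for ch in text]


def can_encode(char: str, encoding: str) -> bool:
    """Report whether the character exists in the given character set."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

`byte_codes("香港")` gives the two codes quoted above, while `can_encode("龍", "gb2312")` is false and `can_encode("龍", "gb18030")` is true.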
The latest version of GB encoding is GB18030. GB18030 supports both simplified and traditional Chinese characters, and is consistent with Unicode's character set.
Big5
Big5 encoding was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is popularly used in Taiwan, Hong Kong and Macau.
The original Big5 standard included 13,053 Chinese characters, with no simplified characters of the Mainland. Each character is encoded with a two-byte hexadecimal code, for example, 香 (ADBB), 港 (B4E4) and 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order.
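The Big5 codes quoted above can be turned back into characters with Python's standard `big5` codec. The helper name is an assumption for the example; it simply interprets four hexadecimal digits as two bytes and decodes them.

```python
def big5_char(hex_code: str) -> str:
    """Decode a two-byte Big5 code written in hexadecimal, e.g. 'ADBB'."""
    return bytes.fromhex(hex_code).decode("big5")


def big5_text(hex_codes: list[str]) -> str:
    """Decode a sequence of Big5 codes into a string."""
    return "".join(big5_char(code) for code in hex_codes)
```

For example, `big5_text(["ADBB", "B4E4"])` returns 香港.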
Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters.
Unicode
Unicode is the most influential international standard for multilingual character encoding. It is consistent with (or virtually equivalent to) the standard ISO/IEC 10646. The full version of Unicode can represent each character with a 4-byte code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5.
In Unicode 15.0, there is a multilingual character set of 149,813 characters, among which 98,682, about two-thirds, are Chinese characters, sorted by Kangxi radicals. Even very rarely used characters are available. The following are some example characters with their Unicode code points in brackets:
H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).
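The code points listed above, and the difference between characters inside and outside the Basic Multilingual Plane, can be inspected with a few built-in calls. The function name `describe` and the layout of its result are assumptions for the example; the byte counts come from Python's standard UTF-8, UTF-16 and UTF-32 codecs.

```python
def describe(char: str) -> dict:
    """Summarise a character's code point and its size in common Unicode encodings."""
    code_point = ord(char)
    return {
        "code_point": f"{code_point:04X}",
        "in_bmp": code_point <= 0xFFFF,          # the BMP holds 2**16 code points
        "utf8_bytes": len(char.encode("utf-8")),
        "utf16_bytes": len(char.encode("utf-16-be")),
        "utf32_bytes": len(char.encode("utf-32-be")),
    }
```

A BMP character such as 香 takes 2 bytes in UTF-16, whereas 𪚥 lies outside the BMP and needs 4 bytes in both UTF-8 and UTF-16; every character takes 4 bytes in UTF-32.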
All the 5,009 characters of the Hong Kong Supplementary Character Set (HKSCS) are included in Unicode. HKSCS was developed by the Hong Kong government as a collection of locally specific Chinese characters not available on the computer in the early days, for instance 咗 (already), 嘢 (thing), 脷 (tongue), and 曱甴 (cockroach).
As GB, Big5 and Unicode are used concurrently for Chinese encoding, a computer may interpret a text with an encoding standard different from the one it was written in. The text is then displayed with wrong characters, a phenomenon called "luànmǎ" (乱码, garbled characters), which occasionally happens on the Web or in emails. This problem is often solved by manual selection of the encoding or character set (as in Web browsers) or by code conversion beforehand.
Unicode is becoming more and more popular. It is reported that UTF-8 (an encoding of Unicode) is used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, and that garbled text of this kind will disappear.
Output
Typefaces
Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families, for example,
汉字字体 (Song)
汉字字体 (Kai)
汉字字体 (Hei or Black)
汉字字体 (FangSong)
Font size
Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called ''zihao'', 字号) invented by an American for Chinese printing in 1859. Table 1 lists all the numbered font sizes available in the Chinese version of MS Word and their equivalents in points.
Table 1: Chinese font sizes in numbers, points and mm
字号 (Number) 点数 (pt) 毫米 (mm) Example
八号 (#8) 5 1.76 中文
七号 (#7) 5.5 1.93 中文
小六号 (#small 6) 6.5 2.28 中文
六号 (#6) 7.5 2.64 中文
小五号 (#small 5) 9 3.16 中文
五号 (#5) 10.5 3.69 中文
小四号 (#small 4) 12 4.22 中文
四号 (#4) 14 4.92 中文
小三号 (#small 3) 15 5.27 中文
三号 (#3) 16 5.62 中文
小二号 (#small 2) 18 6.33 中文
二号 (#2) 22 7.73 中文
小一号 (#small 1) 24 8.44 中文
一号 (#1) 26 9.14 中文
小初号 (#small primary) 36 12.65 中文
初号 (#primary) 42 14.76 中文
This table is particularly useful for Chinese typesetting on computers that do not support numbered font sizes. For example, the table shows that Chinese size number 3 (三号) is equivalent to 16 points, or 5.62 mm high, as shown by the example characters.
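For software that only accepts point sizes, the table can be turned into a small lookup. The point values below are copied from Table 1; the conversion factor of 0.3514598 mm per point (the traditional printer's point) is an assumption that happens to reproduce the millimetre column of the table when rounded to two decimals, and the names are invented for the example.

```python
# Point sizes copied from Table 1.
ZIHAO_TO_PT = {
    "八号": 5, "七号": 5.5, "小六号": 6.5, "六号": 7.5,
    "小五号": 9, "五号": 10.5, "小四号": 12, "四号": 14,
    "小三号": 15, "三号": 16, "小二号": 18, "二号": 22,
    "小一号": 24, "一号": 26, "小初号": 36, "初号": 42,
}

MM_PER_POINT = 0.3514598  # assumed: one printer's point in millimetres


def zihao_to_points(name: str) -> float:
    """Look up the point size of a numbered Chinese font size."""
    return ZIHAO_TO_PT[name]


def points_to_mm(points: float) -> float:
    """Convert points to millimetres, rounded as in Table 1."""
    return round(points * MM_PER_POINT, 2)
```

For instance, `points_to_mm(zihao_to_points("三号"))` gives the 5.62 mm mentioned above.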
The image of a Chinese character in a particular font is represented in the computer by a matrix of dots (called a dot matrix font or bitmapped font) or by outlines (called an outline font), as is the case in English.
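A dot matrix glyph can be pictured as rows of bits, one bit per dot. The 8 × 8 pattern below is a hand-drawn approximation of the character 十 made up for this example, not data from any real font, and the function name is likewise an assumption.

```python
# Hand-drawn 8x8 bitmap resembling 十: each integer is one row of dots,
# with the most significant bit as the leftmost dot.
GLYPH_SHI = [
    0b00010000,
    0b00010000,
    0b00010000,
    0b11111110,
    0b00010000,
    0b00010000,
    0b00010000,
    0b00000000,
]


def render(rows: list[int], width: int = 8, on: str = "#", off: str = ".") -> str:
    """Draw the bitmap as text, one line per row of dots."""
    lines = []
    for row in rows:
        bits = [(row >> (width - 1 - col)) & 1 for col in range(width)]
        lines.append("".join(on if bit else off for bit in bits))
    return "\n".join(lines)
```

Printing `render(GLYPH_SHI)` shows the cross shape. An outline font would instead store the curves of the glyph and rasterize them to dots at whatever size is requested.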
See also
* Chinese character encoding
* Chinese Character Code for Information Interchange
* Chinese computational linguistics
* Japanese language and computers
* Unihan
* List of CJK fonts