In computing, Chinese character encodings can be used to represent text written in the CJK languages (Chinese, Japanese, Korean) and, rarely, obsolete Vietnamese, all of which use Chinese characters. Several general-purpose character encodings accommodate Chinese characters, and some of them were developed specifically for Chinese.

In addition to Unicode (with the set of CJK Unified Ideographs), local encoding systems exist. The two primary "legacy" local encoding systems are the Chinese Guobiao (or GB, "national standard") system, used in Mainland China and Singapore, and the (mainly) Taiwanese Big5 system, used in Taiwan, Hong Kong and Macau. Guobiao is usually displayed using simplified characters and Big5 is usually displayed using traditional characters. There is, however, no mandated connection between the encoding system and the font used to display the characters; font and encoding are usually tied together for practical reasons.

The issue of which encoding to use can also have political implications, as GB is the official standard of the People's Republic of China and Big5 is a de facto standard of Taiwan. In contrast to the situation with Japanese, there has been relatively little overt opposition to Unicode, which solves many of the issues involved with GB and Big5. Unicode is widely regarded as politically neutral, has good support for both simplified and traditional characters, and can be easily converted to and from GB and Big5. Furthermore, Unicode has the advantage of not being limited to Chinese, since it can also represent many other character sets.
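As a rough illustration of converting to and from GB and Big5 through Unicode, the sketch below decodes legacy bytes into a Unicode string and re-encodes them. It assumes Python's built-in `gb2312` and `big5` codecs; the helper name `transcode` is hypothetical and not part of any standard.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one legacy encoding to another, using Unicode as the pivot."""
    text = data.decode(source)   # legacy bytes -> Unicode string
    return text.encode(target)   # Unicode string -> other legacy bytes


# "中文" is written the same way in simplified and traditional Chinese,
# so both legacy character sets contain it.
gb_bytes = "中文".encode("gb2312")
big5_bytes = transcode(gb_bytes, "gb2312", "big5")
print(gb_bytes.hex(), "->", big5_bytes.hex())
```

This only changes the byte representation. It does not turn simplified characters into traditional ones, so a character that exists in only one of the two sets raises `UnicodeEncodeError`.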


Guobiao

The Guobiao (GB) line of character encodings starts with the Simplified Chinese charset GB 2312, published in 1980. Two encoding schemes existed for GB 2312: the commonly used EUC-CN, an 8-bit encoding that uses one or two bytes per character, and HZ, a 7-bit encoding used for Usenet posts. A traditional variant called GB/T 12345 was published in 1990. In 1993, the EUC-CN form was extended into GBK to include all Unicode 1.1 CJK ideographs, abandoning the ISO-2022 model. GBK therefore includes Traditional Chinese characters in addition to the simplified ones in GB 2312. GBK gained popularity through the widespread Code page 936 implementation found in Microsoft Windows 95.

In 2000, GB 18030 was published as GBK's successor. This new encoding adds a four-byte form, in effect a Unicode transformation format (UTF), which encodes all Unicode code points not previously encoded. In 2005, a revised GB 18030 was published to contain reference glyphs for scripts used by ethnic minorities in China, as well as glyphs from CJK Unified Ideographs Extension B, following the corresponding update of Unicode. Adobe-GB1 is the corresponding PostScript charset for GB encodings.
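The growth in coverage from GB 2312 to GBK to GB 18030 can be observed with a small sketch. It assumes Python's built-in `gb2312`, `gbk`, `gb18030` and `hz` codecs; the helper `gb_coverage` is a hypothetical name introduced here for illustration.

```python
GB_CODECS = ("gb2312", "gbk", "gb18030")


def gb_coverage(char: str) -> dict:
    """Map each GB codec name to the encoded bytes, or None if it cannot encode char."""
    result = {}
    for codec in GB_CODECS:
        try:
            result[codec] = char.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result


# A simplified character, its traditional form, and a CJK Extension B character.
for ch in ("汉", "漢", "\U00020000"):
    row = gb_coverage(ch)
    print(f"U+{ord(ch):04X}", {k: (v.hex() if v else None) for k, v in row.items()})

# HZ carries GB 2312 text using 7-bit bytes only.
hz_bytes = "汉字".encode("hz")
print(hz_bytes, all(b < 0x80 for b in hz_bytes))
```

With these codecs, the simplified character is encodable everywhere, the traditional one only from GBK onward, and the Extension B character only in GB 18030, where it takes four bytes.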


Big5

The Big5 family of character encodings starts with the initial definition by the consortium of five companies in Taiwan that developed it. It is a double-byte character set (DBCS) somewhat similar to Shift JIS, often combined with a single-byte character set (SBCS) such as ASCII. Quite a few vendor and official extensions exist, of which ETEN, HKSCS (Hong Kong) and Big5-2003 (published by Taiwan as a part of CNS 11643) are the best known. Adobe-CNS1 is the PostScript charset corresponding to the Big5 family of encodings.
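The mix of single-byte ASCII and double-byte Big5 characters can be sketched as follows. This is a simplified illustration, not a validating decoder: it assumes that any byte of 0x81 or above is the lead byte of a two-byte character, and the helper `split_big5` is a hypothetical name. Python's built-in `big5` codec is used to produce and check the bytes.

```python
def split_big5(data: bytes) -> list:
    """Split Big5 text into per-character chunks: one byte for ASCII, two for Big5."""
    chunks = []
    i = 0
    while i < len(data):
        # Assumption: a byte >= 0x81 is a lead byte and consumes the next byte too.
        width = 2 if data[i] >= 0x81 else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


sample = "Big5 中文".encode("big5")
chunks = split_big5(sample)
print([chunk.hex() for chunk in chunks])
print([chunk.decode("big5") for chunk in chunks])
```

Because the width of each character depends on its lead byte, such text has to be scanned from the start; slicing at an arbitrary byte offset may cut a character in half.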


Conversion

Prior to GBK, which includes both traditional and simplified characters, conversion between Traditional Chinese and Simplified Chinese charsets was complicated by the need to transcribe text between the two variants of Chinese, as each charset covers many of the other's characters only in its own variant. Conversion between traditional and simplified Chinese is often problematic because simplification merged two or more distinct traditional characters into a single simplified form. Traditional-to-simplified (many-to-one) conversion is technically simple, but it discards information, notably when converting to GB 2312. The opposite conversion is one-to-many: when traditional characters are assigned to simplified ones, some choices will inevitably be wrong in some usages. Simplified-to-traditional conversion therefore often requires usage context or common phrase lists to resolve conflicts. This is less of a problem with newer standards such as GBK, GB 18030 and Unicode, which have separate code points for simplified and traditional characters.

Another issue is that many of the encoding systems are missing characters. Although the missing characters are often literary and uncommon in ordinary text, this becomes a problem because people's names often contain them. Examples include the Taiwanese politician Wang Chien-shien, whose name contains a character (煊) that is not in some character systems, and former Premier of the People's Republic of China Zhu Rongji, whose name contains a character (镕) that is not in GB 2312. The newest GB standard, GB 18030, has the complete character repertoire of Unicode 4.0, including the Unihan extensions in the Supplementary Ideographic Plane.
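The many-to-one and one-to-many mappings described above can be sketched with a toy example. The mapping table and phrase list below are tiny, hand-picked and purely illustrative (real converters use far larger data), and the function names are hypothetical. Ambiguous characters that no phrase resolves are left as a bracketed list of candidates rather than guessed.

```python
# Illustrative data only: 發 ("to emit") and 髮 ("hair") both simplify to 发.
TRAD_TO_SIMP = {"發": "发", "髮": "发", "頭": "头"}

# Invert the table: one simplified character may map back to several traditional ones.
SIMP_TO_TRAD = {}
for trad, simp in TRAD_TO_SIMP.items():
    SIMP_TO_TRAD.setdefault(simp, []).append(trad)

# Hypothetical phrase list used to resolve ambiguous characters from context.
PHRASES = {"头发": "頭髮", "发展": "發展"}
MAX_PHRASE = max(len(p) for p in PHRASES)


def to_simplified(text: str) -> str:
    """Many-to-one: a plain per-character lookup is enough."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def to_traditional(text: str) -> str:
    """One-to-many: try the longest known phrase first, then single characters."""
    out = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_PHRASE, len(text) - i), 1, -1):
            piece = text[i:i + n]
            if piece in PHRASES:
                out.append(PHRASES[piece])
                i += n
                break
        else:
            candidates = SIMP_TO_TRAD.get(text[i], [text[i]])
            # Unresolved ambiguity is made visible instead of silently guessed.
            out.append(candidates[0] if len(candidates) == 1
                       else "[" + "|".join(candidates) + "]")
            i += 1
    return "".join(out)


print(to_simplified("頭髮"), to_traditional("头发"), to_traditional("发"))
```

Marking unresolved characters is only one possible design choice; a practical converter might instead pick the most frequent candidate and accept occasional errors.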


See also

* Chinese input methods for computers
* Han unification
* Four corner method


External links


* Chinese Encoding Converter
* ICU's Converter Explorer
* Chinese Character Codes



