GB/T 2312 is a key official character set of the People's Republic of China, used for Simplified Chinese characters. GB2312 is the registered internet name for EUC-CN, which is its usual encoded form. ''GB'' refers to the Guobiao standards (国家标准), whereas the ''T'' suffix (推荐, tuījiàn, "recommendation") denotes a non-mandatory standard.

GB/T 2312 was originally a mandatory national standard designated GB 2312. However, following a National Standard Bulletin in 2017, GB 2312 is no longer mandatory, and its standard code was changed to GB/T 2312. It has been superseded by GBK and GB 18030, which include additional characters, but remains in widespread use as a subset of those encodings.

GB2312 is the second-most popular encoding served from China and its territories (after UTF-8), with 5.5% of web servers serving a page declaring it. Globally, GB2312 is declared on 0.1% of all web pages. However, all major web browsers decode GB2312-marked documents as if they were marked with the superset GBK encoding, except for Safari and Edge on the label GB_2312.

There is an analogous character set known as GB/T 12345, closely related to GB/T 2312, but with traditional character forms replacing simplified forms, and with 62 supplemental characters. GB-encoded fonts often come in pairs, one with the GB/T 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.

Character range in rows

While GB/T 2312 covers over 99.99% of contemporary Chinese text usage, historical texts and many names remain out of scope. The old standard includes 6,763 Chinese characters (on two levels: the first is arranged by reading, the second by radical and then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. In the later version, GB/T 2312-1980, there are 7,445 characters.

Characters in GB/T 2312 are arranged in a 94×94 grid (as in ISO 2022), and the two-byte code point of each character is expressed in the ''kuten'' (or qūwèi, 区位) form, which specifies a row (''ku'' or qū, 区) and the position of the character within the row (cell, ''ten'' or wèi, 位). For example, the character "外" (meaning: foreign) is located in row 45, position 66, so its ''kuten'' code is 45-66.

The rows (numbered from 1 to 94) contain characters as follows:
* 01–09: punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, and Bopomofo
* 16–55: the first level of Chinese characters, arranged according to Pinyin reading (3,755 characters)
* 56–87: the second level of Chinese characters, arranged according to radical and strokes (3,008 characters)

Rows 10–15 and 88–94 are unassigned. GB/T 2312-1980 contains 682 signs and 6,763 Chinese characters.
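The row layout described above can be sketched as a small lookup. This is only an illustration of the ranges listed in this section; the function name `classify_row` and the returned labels are hypothetical and not part of the standard.

```python
def classify_row(row):
    """Return a rough description of what a GB/T 2312 row (1-94) holds.

    The ranges follow the list in this section; the labels are illustrative.
    """
    if not 1 <= row <= 94:
        raise ValueError("row must be between 1 and 94")
    if row <= 9:
        return "symbols and non-hanzi scripts"
    if 16 <= row <= 55:
        return "level-1 hanzi"
    if 56 <= row <= 87:
        return "level-2 hanzi"
    return "unassigned"  # rows 10-15 and 88-94


# The example character from the text sits at kuten 45-66, i.e. row 45.
print(classify_row(45))

# The per-level counts quoted above add up to the stated hanzi total,
# and adding the 682 signs gives the 7,445 figure.
print(3755 + 3008, 3755 + 3008 + 682)
```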

Encodings of GB/T 2312

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB/T 2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The value of the first byte is from 0xA1–0xF7 (161–247), while the value of the second byte is from 0xA1–0xFE (161–254). Since all of these ranges are beyond ASCII, it is possible, as with UTF-8, to check whether a byte is part of a multi-byte construct when using EUC-CN, but not whether it is the first or the last byte of its pair.

Compared to UTF-8, GB/T 2312 (whether native or encoded in EUC-CN) is more storage efficient: while UTF-8 uses three bytes per CJK ideograph, GB/T 2312 uses only two. However, GB/T 2312 does not cover as many ideographs as Unicode does.

To map the ''kuten'' code points to EUC bytes, add 160 (0xA0) to both the row number (''ku'' or qū, 区) and the cell/column number (''ten'' or wèi, 位). The sum for the row number forms the high byte, and the sum for the cell number forms the low byte. For example, to encode the character "外" at ''kuten'' cell 45-66, the high byte uses the row number 45: 45+160=205=0xCD, and the low byte uses the cell number 66: 66+160=226=0xE2. So, the full encoding is <CD E2>.
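The ''kuten'' to EUC-CN arithmetic can be sketched in a few lines of Python. The helper names below are hypothetical; the cross-check at the end assumes that Python's built-in `gb2312` codec is available, as it is in standard CPython builds.

```python
def kuten_to_euc_cn(row, cell):
    """Map a kuten pair (row, cell), each in 1..94, to two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    return bytes([row + 0xA0, cell + 0xA0])


def euc_cn_to_kuten(pair):
    """Inverse mapping: two EUC-CN bytes back to (row, cell)."""
    high, low = pair
    if not (0xA1 <= high <= 0xFE and 0xA1 <= low <= 0xFE):
        raise ValueError("both bytes must be in 0xA1..0xFE")
    return high - 0xA0, low - 0xA0


encoded = kuten_to_euc_cn(45, 66)
print(encoded.hex())               # cde2
print(euc_cn_to_kuten(encoded))    # (45, 66)

# Cross-check against the codec, and compare storage size with UTF-8.
print(encoded == "外".encode("gb2312"))
print(len("外".encode("gb2312")), len("外".encode("utf-8")))
```

Keeping the arithmetic separate from the codec call makes it easy to see that the offset alone reproduces the byte pair given in the text.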

ISO-2022-CN

ISO-2022-CN is another encoding form of GB/T 2312, which is also the encoding specified in the official documentation. This encoding references the ISO-2022 standard, which also uses two bytes to encode characters not found in ASCII. However, instead of using the byte range above ASCII, ISO-2022 uses the same byte range as ASCII: the value of the first byte is from 0x21–0x77 (33–119), while the value of the second byte is from 0x21–0x7E (33–126). As the byte range overlaps ASCII significantly, special control characters, namely the Shift Out and Shift In functions, are required to indicate whether a byte is an ASCII character or part of a two-byte sequence. This poses a risk of misencoding, as improper handling of text can result in missing information.

To map the ''kuten'' code points to ISO-2022 bytes, add 32 (0x20) to both the row number (''ku'' or qū, 区) and the cell/column number (''ten'' or wèi, 位). As with the EUC encoding, the sum for the row number forms the high byte, and the sum for the cell number forms the low byte. For example, to encode the character "外" at ''kuten'' cell 45-66, the high byte uses the row number 45: 45+32=77=0x4D, and the low byte uses the cell number 66: 66+32=98=0x62. So, the full encoding is <4D 62>.
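A minimal sketch of the 7-bit mapping, with hypothetical helper names, shows both the offset arithmetic and why shifting is needed: the resulting byte pair is indistinguishable from two ASCII letters. The sketch covers only the byte pair itself, not the escape and shift sequences of a complete ISO-2022-CN stream.

```python
def kuten_to_gl(row, cell):
    """Map a kuten pair (row, cell), each in 1..94, to the 7-bit byte pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def gl_to_gr(pair):
    """Set the eighth bit on each byte, giving the EUC-CN form of the pair."""
    return bytes(b | 0x80 for b in pair)


pair = kuten_to_gl(45, 66)
print(pair.hex())             # 4d62
print(pair.decode("ascii"))   # Mb  (the same bytes read as plain ASCII)
print(gl_to_gr(pair).hex())   # cde2
```

Since 0x20 + 0x80 = 0xA0, setting the eighth bit on the 7-bit pair gives the same result as adding 160 to the ''kuten'' values directly.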

HZ

HZ is another encoding of GB/T 2312 that is used mostly for Usenet postings; characters are represented with the same byte pairs as in ISO-2022-CN, but the byte sequences denoting the beginning and end of a range of GB 2312 text differ.
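As an illustration, the sketch below wraps the 7-bit byte pairs in the delimiters `~{` and `~}`. These delimiters are not given in the text above; they are taken here from the HZ specification (RFC 1843) and should be read as an assumption of this example, as should the helper name. The last line prints what Python's built-in `hz` codec produces for comparison.

```python
# Assumed HZ delimiters (from RFC 1843, not from the text above).
GB_START = b"~{"
GB_END = b"~}"


def hz_encode_kuten(pairs):
    """Encode a run of kuten pairs as one HZ segment of GB 2312 text."""
    body = bytes(value + 0x20 for pair in pairs for value in pair)
    return GB_START + body + GB_END


print(hz_encode_kuten([(45, 66)]))

# For comparison: the standard library codec applied to the same character.
print("外".encode("hz"))
```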

Code charts

In the tables below, where a pair of hexadecimal numbers is given for a prefix byte or a coding byte, the smaller (with the eighth bit unset or unavailable) is used when encoded over GL (0x21-0x7E), as in ISO-2022-CN or HZ-GB-2312, and the larger (with the eighth bit set) is used in the more typical case of it being encoded over GR (0xA1-0xFE), as in EUC-CN, GBK or GB 18030. Qūwèi numbers are given in decimal.

When GB/T 2312 is encoded over GR, both bytes have the eighth bit set (i.e. are greater than 0x7F). GBK and GB 18030 also make use of two-byte codes in which only the first byte has the eighth bit set for extension purposes: such codes are outside of the GB/T 2312 plane, and are not tabulated here.
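The distinction between codes inside and outside the GB/T 2312 plane can be sketched as a range check on both bytes. The function name is hypothetical; 0xA960 is used as the outside example because this article cites it as a GBK/GB 18030 addition.

```python
def in_gb2312_plane(lead, trail):
    """True if a GR-encoded two-byte code lies in the 94x94 GB/T 2312 plane.

    Both bytes must fall in 0xA1..0xFE, which implies the eighth bit is set
    on each. Whether the position is actually assigned is not checked.
    """
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


for code in (0xCDE2, 0xA960):
    lead, trail = code >> 8, code & 0xFF
    print(hex(code), in_gb2312_plane(lead, trail))
```

The second code has a trail byte below 0x80, so only its lead byte has the eighth bit set and it falls outside the plane.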

Lead byte

This chart details the overall layout of the main plane of the GB/T 2312 character set by lead byte. For lead bytes used for characters other than hanzi, links are provided to charts on this page listing the characters encoded under that lead byte. For lead bytes used for hanzi, links are provided to the appropriate section of Wiktionary's hanzi index.

Non-Hanzi rows

The following charts list the non-hanzi characters available in GB/T 2312, in GB/T 12345, and in double-byte region 1 of GB 18030 (which roughly corresponds to the non-hanzi region of GB/T 2312). Notes are made where these differ, and where GB 6345.1 and ISO-IR-165 differ from these. Cross-references are made to articles on other CJK national character sets for comparison.

Two implementations of GB2312

Unicode mappings of the interpunct and em dash in the subset of GBK and GB 18030 corresponding to GB/T 2312 differ from those listed in GB2312.TXT, a data file which was previously provided by the Unicode Consortium, although it has been designated as obsolete since August 2011 and is no longer hosted as of September 2016.

As of 2015, Microsoft .NET Framework follows GB 18030 mappings when mapping those two characters in data labelled GB2312, whereas ICU, iconv-1.14, php-5.6, ActivePerl-5.20, Java 1.7 and Python 3.4 follow GB2312.TXT in response to the label. Ruby 2.2 is compatible with both implementations; it internally converts the conflicting characters to the GB 18030 subset. The W3C/WHATWG technical recommendation for use with HTML5 specifies a GBK encoding to be inferred for streams labelled gb2312, which in turn uses a GB18030 decoder.

Other differing mappings have been defined and used by individual vendors, including one from Apple.
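The difference between the two implementations can be observed from Python by decoding the same byte pairs under different labels. This is a sketch under assumptions: the byte pairs 0xA1A4 and 0xA1AA are taken here to be the EUC-CN codes of the two characters in question (they are not listed in the text above), and the behaviour shown depends on the codec tables shipped with the Python build in use.

```python
# Assumed EUC-CN byte pairs for the interpunct and the dash (row 1).
PAIRS = (b"\xa1\xa4", b"\xa1\xaa")
LABELS = ("gb2312", "gbk", "gb18030")


def code_points(raw):
    """Decode one byte pair under each label; return {label: 'U+XXXX'}."""
    return {label: "U+%04X" % ord(raw.decode(label)) for label in LABELS}


for raw in PAIRS:
    print(raw.hex(), code_points(raw))
```

If the `gb2312` column differs from the `gbk` and `gb18030` columns, the build follows the GB2312.TXT mappings for that label, as the text reports for Python 3.4.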

Character set 0x21/0xA1 (row 1: punctuation and symbols)

This row contains punctuation, mathematical operators, and other symbols. The following table shows the GB 18030 mappings for these GB/T 2312 characters first, followed by any other documented mappings.

Character set 0x22/0xA2 (row 2: list markers)

This row contains various types of list marker. Lowercase forms of the Roman numerals were not included in the original GB/T 2312 nor in GB/T 12345, but are included in both Windows code page 936 and GB 18030. A euro sign was also added by GB 18030.

Character set 0x23/0xA3 (row 3: ISO 646-CN)

This row contains ISO 646-CN (GB/T 1988-80), a national counterpart to ASCII. Compare row 3 of KS X 1001, which does the same with South Korea's ISO 646 version, and row 3 of JIS X 0208 and of KPS 9566, which include only the alphanumeric subset, but in the same layout.

The following chart lists ISO 646-CN. When used in an encoding allowing combination with ASCII such as EUC-CN (and its superset GB 18030), these characters are usually implemented as fullwidth characters, hence mappings to the Halfwidth and Fullwidth Forms block are used as shown below. GB 6345.1 also handles this row as fullwidth, and adds the halfwidth forms (as above) as row 10. Apple mostly maps this row to fullwidth code points as below, but uses non-fullwidth mappings for the overline and yuan sign as above.
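For the alphanumeric part of this row, the fullwidth mapping can be sketched with simple offsets. Two assumptions are made in this example and are not stated in the text: that cell n of row 3 corresponds to the 7-bit character 0x20+n, and that fullwidth forms sit at a fixed offset of 0xFEE0 from their ASCII counterparts in Unicode. The function name is hypothetical, and symbols such as the yuan sign and overline are deliberately left out.

```python
FULLWIDTH_OFFSET = 0xFEE0  # assumed: U+FF21 - U+0041, and so on


def row3_cell_to_fullwidth(cell):
    """Map a row-3 cell holding an ASCII letter or digit to its fullwidth form."""
    ch = chr(0x20 + cell)
    if not (ch.isascii() and ch.isalnum()):
        raise ValueError("this sketch handles letters and digits only")
    return chr(ord(ch) + FULLWIDTH_OFFSET)


cell = ord("A") - 0x20                       # cell 33
print("U+%04X" % ord(row3_cell_to_fullwidth(cell)))

# Compare with the standard codec for the same position (row 3, cell 33).
raw = bytes([0xA0 + 3, 0xA0 + cell])
print("U+%04X" % ord(raw.decode("gb2312")))
```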

Character set 0x24/0xA4 (row 4: Hiragana)

This set contains Hiragana for writing the Japanese language. Compare with row 4 of JIS X 0208, which this row matches, and with row 10 of KS X 1001 and of KPS 9566, which use the same layout, but in a different row.

Character set 0x25/0xA5 (row 5: Katakana)

This set contains Katakana for writing the Japanese language. However, the Japanese long vowel mark, which is used in katakana text and included in row 1 of JIS X 0208, is not included in GB/T 2312, although it is added in GBK and GB 18030 outside of the main GB/T 2312 plane, at 0xA960. Compare with row 5 of JIS X 0208, which this row matches, and with row 11 of KS X 1001 and of KPS 9566, which use the same layout, but in a different row.

Character set 0x26/0xA6 (row 6: Greek and vertical extensions)

This row contains basic support for the modern Greek alphabet, without diacritics or the final sigma. The highlighted characters are presentation forms of punctuation marks for vertical writing, and are not included in GB/T 2312 proper, but are included in this row by GB/T 12345, Windows code page 936, Mac OS Simplified Chinese, and GB 18030. They are seen as "standard extensions to GB 2312". Conversely, ISO-IR-165 includes patterned semigraphic characters in this row (mostly without exact counterparts in Unicode), colliding with the code positions used for the vertical extensions.

Compare with row 6 of JIS X 0208, which this row matches when the vertical forms are not included, and with row 6 of KPS 9566, which includes the same Greek letters in the same layout, but adds Roman numerals rather than vertical forms. Contrast row 5 of KS X 1001, which offsets the Greek letters to include the Roman numerals first.

Character set 0x27/0xA7 (row 7: Cyrillic)

This set includes both cases of 33 letters from the Cyrillic script, sufficient to write the modern Russian alphabet and Bulgarian alphabet, although other forms of Cyrillic require additional letters. Compare with row 7 of JIS X 0208, which this row matches, and with row 12 of KS X 1001 and row 5 of KPS 9566, which use the same layout but in different rows.

Character set 0x28/0xA8 (row 8: zhuyin and non-ASCII pinyin)

This row contains bopomofo and pinyin characters, excluding ASCII letters (which are in row 3). The highlighted characters are those which are not in the base GB 2312 set but are added by GB 6345.1, and also included in GB/T 12345, Windows code page 936, Mac OS Simplified Chinese and GB 18030. They are seen as "standard extensions to GB 2312". GB 6345.1 treats the pinyin in this row as fullwidth, and includes halfwidth counterparts as row 11; GB 18030 does not do this.

Character set 0x29/0xA9 (row 9: box drawing)

Hanzi rows

Inclusion of non-standard Simplified Chinese characters and Traditional Chinese characters

GB/T 2312 included 2 non-standard Simplified Chinese characters:
* (68–41): Simplified from “”, but the ''Complete List of Simplified Characters'' (简化字总表, Jiǎnhuà Zì Zǒng Biǎo) has merged “” with “”. Old versions of ''Xinhua Zidian'' (新华字典, Xīnhuá Zìdiǎn) included this character and noted it as juice (汁, zhì); newer versions have removed it and merged “” with “”.
* (79–64): Simplified from “”, but the ''Complete List of Simplified Characters'' has merged “” with “”.

GB/T 2312 also included 3 Traditional Chinese characters:
* (79–81): The original document used the character “” with a traditional component, but the ''Complete List of Simplified Characters'' has merged “” with “” and simplified it to “”; later templates changed the character to “”. It was noted in 1964 that the character can be used in names and when citing Classical Chinese texts, and the ''Table of General Standard Chinese Characters'' (通用規範漢字表, Tōngyòng Guīfàn Hànzì Biǎo) accepted it in 2013 (2013:7679) for use in names.
* (65–65): The character has been merged with “” (26-83) in the ''Complete List of Simplified Characters'', which has no note about unclear usage, but GB/T 2312 nevertheless included this character.
* (84–80): The original document used the character “” with a traditional component, but the ''Complete List of Simplified Characters'' states that “” should be simplified to “”; the corresponding Simplified Chinese character “” was submitted to Unicode by Japan as Shinjitai “”. Although GB 5007.1–85 replaced “” with “”, the later amendments (GB 5007.1–2001 and GB/T 5007.1–2010) keep the unsimplified form. The ''Table of General Standard Chinese Characters'' included “” as 2013:7748.

Corrections

GB 5007.1-85 ''24x24 Bitmap Font Set of Chinese Characters for Information Exchange'' (信息交换用汉字 24x24 点阵字模集) is the earliest font template based on GB/T 2312 that features corrections and extensions, including:
* changing the glyph shape of the Latin letter "g"
* adding 6 Hanyu Pinyin characters: ɑ, ḿ, ń, ň, ǹ, ɡ
* changing “” to “”
* including 94 half-width glyphs in row 10 (the half-width form of row 3, equivalent to GB 1988–80)
* including the half-width forms of 32 Hanyu Pinyin characters from row 8 in row 11

GB/T 2312 itself was not corrected, but these corrections are included in font templates that are based on GB/T 2312, including GB/T 12345; its supersets GBK and GB 18030 also included these corrections.

GB/T 2312 is also used in


External links

Graphical View of GB2312 in ICU's Converter Explorer

Chinese Character Codes

Coded Chinese Graphic Character Set for Information Interchange ISO-IR 58

C code generates 6763 basic characters with output
