GB/T 2312-1980 is a key official character set of the People's Republic of China, used for Simplified Chinese characters. GB2312 is the registered internet name for EUC-CN, which is its usual encoded form. ''GB'' refers to the Guobiao standards (国家标准), whereas the ''T'' suffix (推荐, tuījiàn, "recommendation") denotes a non-mandatory standard.

GB/T 2312-1980 was originally a mandatory national standard designated GB 2312-1980. However, following a National Standard Bulletin of the People's Republic of China in 2017, GB 2312 is no longer mandatory, and its standard code was changed to GB/T 2312-1980. It has been superseded by GBK and GB 18030, which include additional characters, but remains in widespread use as a subset of those encodings.

GB2312 is the second-most popular encoding served from China and territories (after UTF-8), with 5.5% of web servers serving a page declaring it. Globally, GB2312 is declared on 0.1% of all web pages. However, all major web browsers decode GB2312-marked documents as if they were marked with the superset GBK encoding, except for Safari and Edge on the label GB_2312.

There is an analogous character set known as GB/T 12345, closely related to GB/T 2312, but with traditional character forms replacing simplified forms, and 62 supplemental characters. GB-encoded fonts often come in pairs, one with the GB/T 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.

Character range in rows

While GB/T 2312 covers over 99.99% of contemporary Chinese text usage, historical texts and many names remain out of scope. The standard includes 6,763 Chinese characters (on two levels: the first is arranged by reading, the second by radical then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. GB/T 2312-1980 contains 7,445 characters in total.

Characters in GB/T 2312 are arranged in a 94×94 grid (as in ISO 2022), and the two-byte code point of each character is expressed in the ''kuten'' (or qūwèi, 区位) form, which specifies a row (''ku'' or qū, 区) and the position of the character within the row (cell, ''ten'' or wèi, 位). For example, the character "外" (meaning: foreign) is located in row 45, position 66, thus its ''kuten'' code is 45-66.

The rows (numbered from 1 to 94) contain characters as follows:
* 01–09: punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, and Bopomofo
* 16–55: the first level of Chinese characters, arranged according to Pinyin (3,755 characters)
* 56–87: the second level of Chinese characters, arranged according to radical and strokes (3,008 characters)

Rows 10–15 and 88–94 are unassigned. Of the characters in GB/T 2312-1980, 682 are signs and 6,763 are Chinese characters.
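The row layout described above can be sketched as a small lookup. This is only an illustration: the function name `describe_row` and the region labels it returns are hypothetical and are not part of the standard.

```python
def describe_row(row: int) -> str:
    """Return the GB/T 2312 region that a kuten row (1-94) falls into."""
    if not 1 <= row <= 94:
        raise ValueError("row must be between 1 and 94")
    if row <= 9:
        # Punctuation, kana, Greek, Cyrillic, Pinyin and Bopomofo.
        return "symbols"
    if 16 <= row <= 55:
        # First level of Chinese characters, ordered by Pinyin.
        return "hanzi level 1"
    if 56 <= row <= 87:
        # Second level of Chinese characters, ordered by radical and strokes.
        return "hanzi level 2"
    # Rows 10-15 and 88-94 are unassigned.
    return "unassigned"


# Row 45 holds the example character "外" (kuten 45-66).
print(describe_row(45))
```

The first level spans 40 rows of 94 cells (3,760 cells for 3,755 characters), so a row number alone does not say whether a given cell is actually assigned.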

Encodings of GB/T 2312

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB/T 2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The value of the first byte is from 0xA1–0xF7 (161–247), while the value of the second byte is from 0xA1–0xFE (161–254). Since all of these ranges are beyond ASCII, it is possible, as in UTF-8, to check whether a byte is part of a multi-byte construct when using EUC-CN; unlike in UTF-8, however, it is not possible to tell whether a byte is the first or the last.

Compared to UTF-8, GB/T 2312 (whether native or encoded in EUC-CN) is more storage efficient: while UTF-8 uses three bytes per CJK ideograph, GB/T 2312 only uses two. However, GB/T 2312 does not cover as many ideographs as Unicode does.

To map the ''kuten'' code points to EUC bytes, add 160 (0xA0) to both the row number (''ku'' or qū, 区) and cell/column number (''ten'' or wèi, 位). The result of addition to the row number of the code point will form the high byte, and the result of addition to the cell number of the code point will form the low byte. For example, to encode the character "外" at ''kuten'' cell 45-66, the high byte will use the row number 45: 45+160=205=0xCD, and the low byte will come from the cell number 66: 66+160=226=0xE2. So, the full encoding is <CD E2>.
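The ''kuten'' to EUC-CN arithmetic can be checked with a minimal sketch. The helper names `kuten_to_euc_cn` and `euc_cn_to_kuten` are hypothetical; the cross-check relies on Python's built-in `gb2312` codec, which is assumed here to implement EUC-CN.

```python
from typing import Tuple


def kuten_to_euc_cn(row: int, cell: int) -> bytes:
    """Add 0xA0 to the row and the cell to obtain the two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be between 1 and 94")
    return bytes([row + 0xA0, cell + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> Tuple[int, int]:
    """Inverse mapping: subtract 0xA0 from each byte of a two-byte code."""
    high, low = pair
    return high - 0xA0, low - 0xA0


encoded = kuten_to_euc_cn(45, 66)
print(encoded.hex(" ").upper())         # the two bytes in hexadecimal
print(encoded.decode("gb2312"))         # decoded through Python's codec
print(len("外".encode("utf-8")), "bytes in UTF-8 versus", len(encoded))
```

The last line illustrates the storage comparison made above: the same ideograph takes three bytes in UTF-8 and two in EUC-CN.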

ISO-2022-CN

ISO-2022-CN is another encoding form of GB/T 2312, which is also the encoding specified in the official documentation. This encoding references the ISO-2022 standard, which also uses two bytes to encode characters not found in ASCII. However, instead of using the extended region of ASCII, ISO-2022 uses the same byte range as ASCII: the value of the first byte is from 0x21–0x77 (33–119), while the value of the second byte is from 0x21–0x7E (33–126). As the byte range overlaps ASCII significantly, special characters are required to indicate whether a character is in the ASCII range or is part of the two-byte sequence of extended region, namely the Shift Out and Shift In functions. This poses a risk for misencoding as improper handling of text can result in missing information. To map the ''kuten'' code points to ISO-2022 bytes, add 32 (0x20) to both the row number (''ku'' or qū, 区) and cell/column number (''ten'' or wèi, 位). The result of addition to the row number of the code point will form the high byte, and the result of addition to the cell number of the code point will form the low byte similar to EUC encoding. For example, to encode the character "外" at ''kuten'' cell 45-66, the high byte will use the row number 45: 45+32=77=0x4D, and the low byte will come from the cell number 66: 66+32=98=0x62. So, the full encoding is <4D 62>.
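The same calculation with an offset of 0x20 can be sketched as follows. The helper names are hypothetical, and the sketch covers only the two-byte code itself, not the shift functions or escape sequences that a complete ISO-2022-CN stream also needs.

```python
def kuten_to_gl(row: int, cell: int) -> bytes:
    """Add 0x20 to the row and the cell to obtain the 7-bit (GL) byte pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be between 1 and 94")
    return bytes([row + 0x20, cell + 0x20])


def gr_to_gl(pair: bytes) -> bytes:
    """Clear the eighth bit of an EUC-CN (GR) pair to get the GL pair."""
    return bytes(b & 0x7F for b in pair)


gl_pair = kuten_to_gl(45, 66)
print(gl_pair.hex(" ").upper())   # the two bytes in hexadecimal
print(gl_pair)                    # the same bytes shown as ASCII text
```

Printing the pair as text shows why shifting is required: the two bytes for "外" are indistinguishable from the ASCII letters "Mb" unless the decoder knows it is in two-byte mode.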

HZ

HZ is another encoding of GB/T 2312 that is used mostly for Usenet postings; characters are represented with the same byte pairs as in ISO-2022-CN, but the byte sequences denoting the beginning and end of a range of GB 2312 text differ.
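As a rough illustration, Python ships an `hz` codec that can be used to inspect the byte pairs. The delimiters `~{` and `~}` come from RFC 1843 rather than from the text above, and the slicing below assumes a string made up only of GB 2312 characters, so that the output is a single delimited run.

```python
text = "外"

# Encode with Python's built-in HZ codec.
hz_bytes = text.encode("hz")
print(hz_bytes)

# Strip the two-byte opening and closing delimiters (assumes one run only).
payload = hz_bytes[2:-2]

# Setting the eighth bit of each byte yields the EUC-CN form of the same code.
euc_cn = bytes(b | 0x80 for b in payload)
print(euc_cn.hex(" ").upper())
print(euc_cn.decode("gb2312"))
```

The payload between the delimiters is the same 7-bit pair that the ISO-2022-CN mapping produces for kuten 45-66.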

Code charts

In the tables below, where a pair of hexadecimal numbers is given for a prefix byte or a coding byte, the smaller (with the eighth bit unset or unavailable) is used when encoded over GL (0x21-0x7E), as in ISO-2022-CN or HZ-GB-2312, and the larger (with the eighth bit set) is used in the more typical case of it being encoded over GR (0xA1-0xFE), as in EUC-CN, GBK or GB 18030. Qūwèi numbers are given in decimal.

When GB/T 2312 is encoded over GR, both bytes have the eighth bit set (i.e. are greater than 0x7F). GBK and GB 18030 also make use of two-byte codes in which only the first byte has the eighth bit set for extension purposes: such codes are outside of the GB/T 2312 plane, and are not tabulated here.
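The distinction between codes inside and outside the GB/T 2312 plane can be sketched as a byte-range test. The function name `in_gb2312_plane` is hypothetical, and the check says nothing about whether a position inside the plane is actually assigned.

```python
def in_gb2312_plane(pair: bytes) -> bool:
    """True when both bytes lie in GR (0xA1-0xFE), i.e. inside the 94x94 plane."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    return all(0xA1 <= b <= 0xFE for b in pair)


# "外" in EUC-CN: both bytes have the eighth bit set.
print(in_gb2312_plane(b"\xcd\xe2"))

# 0xA960, a GBK/GB 18030 extension code: the second byte is below 0x80.
print(in_gb2312_plane(b"\xa9\x60"))
```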

Lead byte

This chart details the overall layout of the main plane of the GB/T 2312 character set by lead byte. For lead bytes used for characters other than hanzi, links are provided to charts on this page listing the characters encoded under that lead byte. For lead bytes used for hanzi, links are provided to the appropriate section of Wiktionary's hanzi index.

Non-Hanzi rows

The following charts list the non-hanzi characters available in GB/T 2312, in GB/T 12345, and in double-byte region 1 of GB 18030 (which roughly corresponds to the non-hanzi region of GB/T 2312). Notes are made where these differ, and where GB 6345.1 and ISO-IR-165 differ from these. Cross-references are made to articles on other CJK national character sets for comparison.

Two implementations of GB2312

Unicode mappings of the interpunct and em dash in the subset of GBK and GB 18030 corresponding to GB/T 2312 differ from those listed in GB2312.TXT, a data file previously provided by the Unicode Consortium, although it has been designated as obsolete since August 2011 and is no longer hosted as of September 2016.

As of 2015, Microsoft .NET Framework follows GB 18030 mappings when mapping those two characters in data labelled gb2312, whereas ICU, iconv-1.14, php-5.6, ActivePerl-5.20, Java 1.7 and Python 3.4 follow GB2312.TXT in response to the label. Ruby 2.2 is compatible with both implementations; it internally converts the conflicting characters to the GB 18030 subset. The W3C/WHATWG technical recommendation for use with HTML5 specifies a GBK encoding to be inferred for streams labelled gb2312, which in turn uses a GB18030 decoder.

Other differing mappings have been defined and used by individual vendors, including one from Apple.
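The divergence can be observed with a short experiment in Python, whose `gb2312` codec is reported above to follow GB2312.TXT. The byte pairs 0xA1A4 and 0xA1AA are an assumption of this sketch: they are taken to be the EUC-CN codes of the interpunct and the em dash (row 1, cells 4 and 10) and are not stated in the text above.

```python
# Assumed EUC-CN byte pairs for the two characters in question.
SAMPLES = {"interpunct": b"\xa1\xa4", "em dash": b"\xa1\xaa"}
LABELS = ("gb2312", "gbk", "gb18030")

# Decode each sample under each label and collect the resulting characters.
results = {
    label: [raw.decode(label) for raw in SAMPLES.values()]
    for label in LABELS
}

for label, chars in results.items():
    code_points = ", ".join(f"U+{ord(ch):04X}" for ch in chars)
    print(f"{label:8} -> {code_points}")
```

If the assumption holds, the `gb2312` row should differ from the `gb18030` row for both samples, which is the mismatch a converter has to account for when round-tripping data labelled gb2312.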

Character set 0x21/0xA1 (row 1: punctuation and symbols)

This row contains punctuation, mathematical operators, and other symbols. The following table shows the GB 18030 mappings for these GB/T 2312 characters first, followed by any other documented mappings.

Character set 0x22/0xA2 (row 2: list markers)

This row contains various types of list marker. Lowercase forms of the Roman numerals were not included in the original GB/T 2312 nor in GB/T 12345, but are included in both Windows code page 936 and GB 18030. A euro sign was also added by GB 18030.

Character set 0x23/0xA3 (row 3: ISO 646-CN)

This row contains ISO 646-CN (GB/T 1988-80), a national counterpart to ASCII. Compare row 3 of KS X 1001, which does the same with South Korea's ISO 646 version, and row 3 of JIS X 0208 and of KPS 9566, which include only the alphanumeric subset, but in the same layout.

The following chart lists ISO 646-CN. When used in an encoding allowing combination with ASCII such as EUC-CN (and its superset GB 18030), these characters are usually implemented as fullwidth characters, hence mappings to the Halfwidth and Fullwidth Forms block are used as shown below. GB 6345.1 also handles this row as fullwidth, and adds the halfwidth forms (as above) as row 10. Apple mostly maps this row to fullwidth code points as below, but uses non-fullwidth mappings for the overline and yuan sign as above.
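For the alphanumeric part of this row, the fullwidth mapping can be sketched as a fixed offset. The helper names are hypothetical, and the offset rule is an assumption that is only claimed here for letters and digits: the yuan sign and overline positions of ISO 646-CN are mapped differently and are left out.

```python
def row3_euc_cn(ascii_char: str) -> bytes:
    """EUC-CN bytes of the row 3 cell that mirrors a printable ASCII character."""
    code = ord(ascii_char)
    if not 0x21 <= code <= 0x7E:
        raise ValueError("expected a printable ASCII character")
    # Row 3 gives lead byte 0xA3; the cell (code - 0x20) plus 0xA0 is code + 0x80.
    return bytes([0xA3, code + 0x80])


def fullwidth(ascii_char: str) -> str:
    """Fullwidth counterpart in the Halfwidth and Fullwidth Forms block."""
    return chr(ord(ascii_char) + 0xFEE0)


for ch in "A", "z", "7":
    decoded = row3_euc_cn(ch).decode("gb2312")
    print(ch, row3_euc_cn(ch).hex(" ").upper(), decoded, decoded == fullwidth(ch))
```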

Character set 0x24/0xA4 (row 4: Hiragana)

This set contains Hiragana for writing the Japanese language. Compare with row 4 of JIS X 0208, which this row matches, and with row 10 of KS X 1001 and of KPS 9566, which use the same layout, but in a different row.

Character set 0x25/0xA5 (row 5: Katakana)

This set contains Katakana for writing the Japanese language. However, the Japanese long vowel mark, which is used in katakana text and included in row 1 of JIS X 0208, is not included in GB/T 2312, although it is added in GBK and GB 18030 outside of the main GB/T 2312 plane, at 0xA960. Compare with row 5 of JIS X 0208, which this row matches, and with row 11 of KS X 1001 and of KPS 9566, which use the same layout, but in a different row.

Character set 0x26/0xA6 (row 6: Greek and vertical extensions)

This row contains basic support for the modern Greek alphabet, without diacritics or the final sigma. The highlighted characters are presentation forms of punctuation marks for vertical writing, and are not included in GB/T 2312 proper, but are included in this row by GB/T 12345, Windows code page 936, Mac OS Simplified Chinese, and GB 18030. They are seen as "standard extensions to GB 2312". Conversely, ISO-IR-165 includes patterned semigraphic characters in this row (mostly without exact counterparts in Unicode), colliding with the code positions used for the vertical extensions.

Compare with row 6 of JIS X 0208, which this row matches when the vertical forms are not included, and with row 6 of KPS 9566, which includes the same Greek letters in the same layout, but adds Roman numerals rather than vertical forms. Contrast row 5 of KS X 1001, which offsets the Greek letters to include the Roman numerals first.

Character set 0x27/0xA7 (row 7: Cyrillic)

This set includes both cases of 33 letters from the Cyrillic script, sufficient to write the modern Russian alphabet and Bulgarian alphabet, although other forms of Cyrillic require additional letters. Compare with row 7 of JIS X 0208, which this row matches, and with row 12 of KS X 1001 and row 5 of KPS 9566, which use the same layout but in different rows.

Character set 0x28/0xA8 (row 8: zhuyin and non-ASCII pinyin)

This row contains bopomofo and pinyin characters, excluding ASCII letters (which are in row 3). The highlighted characters are those which are not in the base GB 2312 set but are added by GB 6345.1, and also included in GB/T 12345, Windows code page 936, Mac OS Simplified Chinese and GB 18030. They are seen as "standard extensions to GB 2312". GB 6345.1 treats the pinyin in this row as fullwidth, and includes halfwidth counterparts as row 11; GB 18030 does not do this.

Character set 0x29/0xA9 (row 9: box drawing)

Hanzi rows

Inclusion of non-standard Simplified Chinese characters and Traditional Chinese characters

GB/T 2312 included 2 non-standard Simplified Chinese characters:
* (68–41): Simplified from “”, but the ''Complete List of Simplified Characters'' (简化字总表, Jiǎnhuà Zì Zǒng Biǎo) has merged “” with “”. Old versions of ''Xinhua Zidian'' (新华字典, Xīnhuá Zìdiǎn) included this character and glossed it as juice (汁, zhì); newer versions have removed it and merged “” with “”.
* (79–64): Simplified from “”, but the ''Complete List of Simplified Characters'' has merged “” with “”.

GB/T 2312 also included 3 Traditional Chinese characters:
* (79–81): The original document used the character “” with a traditional component, but the ''Complete List of Simplified Characters'' has merged “” with “” and simplified it to “”; later templates changed the character to “”. It was noted in 1964 that the character can be used in names and when citing Classical Chinese texts, and the ''Table of General Standard Chinese Characters'' (通用規範漢字表, Tōngyòng Guīfàn Hànzì Biǎo) accepted it in 2013 (2013:7679) for use in names.
* (65–65): The character has been merged with “” (26-83) in the ''Complete List of Simplified Characters'', which has no note about unclear usage, but GB/T 2312 nevertheless included this character.
* (84–80): The original document used the character “” with a traditional component, but the ''Complete List of Simplified Characters'' states that “” should be simplified to “”; the corresponding Simplified Chinese character “” was submitted to Unicode by Japan as Shinjitai “”. Although GB 5007.1–85 replaced “” with “”, the subsequent amendments (GB 5007.1–2001 and GB/T 5007.1–2010) keep the unsimplified form. The ''Table of General Standard Chinese Characters'' included “” at 2013:7748.

Corrections

GB 5007.1-85 ''24x24 Bitmap Font Set of Chinese Characters for Information Exchange'' (信息交换用汉字 24x24 点阵字模集) is the earliest font template based on GB/T 2312 that features corrections and extensions, including:
* changing the glyph shape of the Latin alphabet "g"
* adding 6 Hanyu Pinyin characters: ɑ, ḿ, ń, ň, ǹ, ɡ
* changing “” to “”
* including 94 half-width glyphs in row 10 (the half-width form of row 3, equivalent to GB 1988–80)
* including the half-width form of 32 Hanyu Pinyin characters from row 8 in row 11

GB/T 2312 itself was not amended, but these corrections are included in font templates that are based on GB/T 2312, including GB/T 12345; its supersets GBK and GB 18030 also included these corrections. GB/T 2312 is also used in ISO-IR-165.

External links

Graphical View of GB2312 in ICU's Converter Explorer

Chinese Character Codes

Coded Chinese Graphic Character Set for Information Interchange ISO-IR 58

C code generates 6763 basic characters with output
