CJK ideographs

The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. In the process called Han unification, the common (shared) characters were identified and named CJK Unified Ideographs. As of Unicode 15.0, Unicode defines a total of 97,058 CJK Unified Ideographs. The term ''ideographs'' is a misnomer, as the Chinese script is not ideographic but rather logographic. Historically, Vietnam used Chinese characters too, so sometimes the abbreviation CJKV is used. Vietnamese use was replaced by the Latin-based Vietnamese alphabet in the 1920s.


Sources

The Ideographic Research Group (IRG) is responsible for developing extensions to the encoded repertoires of CJK unified ideographs. IRG processes proposals for new CJK unified ideographs submitted by its member bodies. After several rounds of expert review, it submits a consolidated set of characters to ISO/IEC JTC 1/SC 2 Working Group 2 (WG2) and the Unicode Technical Committee (UTC) for consideration for inclusion in the ISO/IEC 10646 and Unicode standards. The following IRG member bodies have been involved in the standardization of CJK unified ideographs:

* China
* Hong Kong
* Japan
* South Korea
* North Korea
* Macau
* Taiwan, liaison member represented by the Taipei Computer Association (TCA)
* Vietnam
* Unicode Technical Committee (liaison member)
* United Kingdom
* SAT (liaison member)

The ideographs submitted by the UTC and the United Kingdom are not specific to any particular region, but are characters which have been suggested for encoding by individual experts. The ideographs submitted by SAT are required for the SAT Daizōkyō text database.

The table below gives the numbers of encoded CJK unified ideographs for each IRG source for Unicode 15.0. The total number of characters (223,653) far exceeds the number of encoded CJK unified ideographs (97,058) as many characters have more than one source.


UTC sources

The majority of characters submitted by the UTC to the IRG are derived from Unicode Technical Committee (UTC) documents. Other sources include:

* ''ABC Chinese-English Dictionary'' by John DeFrancis
* The Adobe-CNS1 glyph collection
* The Adobe-Japan1 glyph collection
* A Complete Checklist of Species and Subspecies of Chinese Birds (中国鸟类系统检索)
* The Great Nom Dictionary (Đại Tự Điển Chữ Nôm)
* Annotations to ''Shuowen Jiezi'' (annotated by Duan Yucai)
* GB18030-2000
* Required Character List Supplied by the Church of Jesus Christ of Latter-day Saints (Hong Kong)
* New Commercial Dictionary (商务新词典), Hong Kong
* Modern Chinese Dictionary (现代汉语词典), by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
* Working Group (WG2) documents
* Wenlin (文林) http://www.wenlin.com/


CJK Unified Ideographs blocks


CJK Unified Ideographs

The basic block named ''CJK Unified Ideographs'' (4E00–9FFF) contains 20,992 basic Chinese characters in the range U+4E00 through U+9FFF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chữ Hán are also used in Vietnam's chữ Nôm (now obsolete).

The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system, within each radical, the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.

The block is the result of Han unification, which was somewhat controversial within East Asia. Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the ''source separation rule'' states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.

Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences, is an extreme example of the use of variation selectors.
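As a rough illustration of how a variation selector attaches to a base ideograph, the Python sketch below builds a two-code-point sequence. The pairing of U+845B (葛) with VARIATION SELECTOR-17 (U+E0100) is chosen purely as an illustrative assumption; whether a given pair is a registered sequence has to be checked against the relevant variation database. The helper names are hypothetical.

```python
import unicodedata

# Variation selectors occupy two ranges: U+FE00..U+FE0F (VS1..VS16) and
# U+E0100..U+E01EF (VS17..VS256, the Variation Selectors Supplement).
def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def make_variation_sequence(base: str, selector: str) -> str:
    """Return base followed by selector; the selector only requests a glyph variant."""
    if not is_variation_selector(selector):
        raise ValueError("not a variation selector")
    return base + selector


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving the plain base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


BASE = "\u845b"       # illustrative base ideograph (assumption, not from the text)
VS17 = "\U000e0100"   # VARIATION SELECTOR-17

sequence = make_variation_sequence(BASE, VS17)
print([f"U+{ord(ch):04X}" for ch in sequence])  # ['U+845B', 'U+E0100']
print(unicodedata.name(VS17))                   # VARIATION SELECTOR-17
```

A renderer that does not know the sequence is expected to fall back to the default glyph of the base character, which is why stripping the selectors is a reasonable step before plain-text comparison.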


Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.


Sources

Note: Most characters appear in multiple sources, so the sum of individual character counts (102,794) is far greater than the number of encoded characters (20,992). In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to code points between U+9FA6 and U+9FBB. Since then, further characters have been added to this block for various reasons, all summarized in the version history section below.


CJK Unified Ideographs Extension A

The block named ''CJK Unified Ideographs Extension A'' (3400–4DBF) contains 6,592 additional characters in the range U+3400 through U+4DBF.


Charts

3400-4DBF.


Sources

Note: Most characters appear in more than one source, so the sum of individual character counts (18,832) is far greater than the number of encoded characters (6,592).


CJK Unified Ideographs Extension B

The block named ''CJK Unified Ideographs Extension B'' (20000–2A6DF) contains 42,720 characters in the range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese.


Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.


Sources

Note: Many characters appear in more than one source, so the sum of individual character counts (74,204) is far greater than the number of encoded characters (42,720).


CJK Unified Ideographs Extension C

The block named ''CJK Unified Ideographs Extension C'' (2A700–2B73F) contains 4,154 characters in the range U+2A700 through U+2B739. It was initially added in Unicode 5.2 (2009).


Charts

2A700-2B73F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,570) is greater than the number of encoded characters (4,154).


CJK Unified Ideographs Extension D

The block named ''CJK Unified Ideographs Extension D'' (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).


Charts

2B740–2B81F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (229) is greater than the number of encoded characters (222).


CJK Unified Ideographs Extension E

The block named ''CJK Unified Ideographs Extension E'' (2B820–2CEAF) contains 5,762 characters in the range U+2B820 through U+2CEA1 that were added in Unicode 8.0 (2015).


Charts

2B820–2CEAF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,828) is greater than the number of encoded characters (5,762).


CJK Unified Ideographs Extension F

The block named ''CJK Unified Ideographs Extension F'' (2CEB0–2EBEF) contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0 (2017). It includes more than 1,000 Sawndip characters for Zhuang.


Charts

2CEB0–2EBEF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (7,774) is greater than the number of encoded characters (7,473).


CJK Unified Ideographs Extension G

A block named ''CJK Unified Ideographs Extension G'' was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters.


Charts

30000–3134F.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,081) is greater than the number of encoded characters (4,939).


CJK Unified Ideographs Extension H

A block named ''CJK Unified Ideographs Extension H'' was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters.


Charts

31350–323AF.


Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,305) is greater than the number of encoded characters (4,192).


CJK Compatibility Ideographs

The block named ''CJK Compatibility Ideographs'' (F900–FAFF) was created to retain round-trip compatibility with other standards. Only twelve of its characters have the "Unified Ideograph" property: U+FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29. None of the other characters in this and other "Compatibility" blocks relate to CJK Unification.
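One practical consequence of this distinction can be sketched in Python. The example assumes, based on the usual behaviour of the Unicode Character Database rather than on anything stated above, that true compatibility ideographs carry a canonical decomposition to a unified ideograph, while the twelve unified characters in this block carry none. The function names are hypothetical.

```python
import unicodedata

# The twelve code points in F900-FAFF that have the "Unified Ideograph" property.
UNIFIED_IN_COMPAT_BLOCK = (
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
)


def is_unified_in_compat_block(cp: int) -> bool:
    """True for the twelve unified ideographs encoded in the compatibility block."""
    return cp in UNIFIED_IN_COMPAT_BLOCK


def canonical_target(cp: int):
    """Return the single code point that cp canonically decomposes to, or None."""
    parts = unicodedata.decomposition(chr(cp)).split()
    if len(parts) == 1:
        return int(parts[0], 16)
    return None


# U+F900 is an ordinary compatibility ideograph: normalization replaces it.
print(f"U+{canonical_target(0xF900):04X}")

# The twelve unified characters are left untouched by normalization.
for cp in UNIFIED_IN_COMPAT_BLOCK:
    assert canonical_target(cp) is None
    assert unicodedata.normalize("NFC", chr(cp)) == chr(cp)
```

Text containing the twelve characters therefore survives normalization unchanged, whereas other characters from the block may be replaced by their unified counterparts.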


Charts

F900–FAFF.


Sources

Note: All characters appear in more than one source, so the sum of individual character counts (36) is greater than the number of encoded characters (12).
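The per-block figures given in the sections above can be cross-checked against the overall total of 97,058 with a few lines of Python. This is only a sketch: the table uses the first and last assigned code points stated for each block, and the last assigned code point of Extension G (U+3134A) is an assumption inferred from the stated count of 4,939, since the text gives only the block boundary U+3134F. The names `BLOCKS` and `block_of` are hypothetical.

```python
# (block name, first code point, last assigned code point as of Unicode 15.0)
BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B739),
    ("Extension D", 0x2B740, 0x2B81D),
    ("Extension E", 0x2B820, 0x2CEA1),
    ("Extension F", 0x2CEB0, 0x2EBE0),
    ("Extension G", 0x30000, 0x3134A),  # assumption: derived from the count 4,939
    ("Extension H", 0x31350, 0x323AF),
]

# Unified ideographs that sit in the CJK Compatibility Ideographs block.
UNIFIED_IN_COMPAT_BLOCK = 12


def assigned_count(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def block_of(cp: int):
    """Return the name of the unified block containing cp, or None."""
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return None


counts = {name: assigned_count(first, last) for name, first, last in BLOCKS}
total = sum(counts.values()) + UNIFIED_IN_COMPAT_BLOCK

for name, count in counts.items():
    print(f"{name}: {count:,}")
print(f"Total: {total:,}")  # Total: 97,058
```

A range lookup of this kind does not cover the twelve unified characters in the compatibility block, which have to be listed individually.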


Known issues


Disunification


U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings. The proposal of disunification of U+4039 was accepted and the new character is encoded at U+9FC3 (鿃) in Unicode 5.1.


Three other characters in Extension B

In CJK Unified Ideographs Extension B, some characters were incorrectly unified with others. These include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). The first two wrongly unify the Chinese Mainland and Vietnamese source glyphs, while the last wrongly unifies the Chinese Mainland and Taiwanese source glyphs.


Unifiable variants and exact duplicates in Extension B

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded. In addition to the deliberate encoding of close glyph variants, six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a ''de facto'' disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:

* U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
* U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
* U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the Chinese Mainland-, Taiwan- and Japan-source glyphs for U+8641
* U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
* U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
* U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
* U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
* U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
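The six exact duplicates listed above could be folded for search or comparison with a small lookup table. The Python sketch below is illustrative only: the choice of which member of each pair to keep (the BMP character where there is one, otherwise the lower code point) is an arbitrary assumption, and the two semi-duplicates are left alone because their glyphs differ between sources. Unicode normalization does not perform this folding, so an application-level table such as the hypothetical `EXACT_DUPLICATES` is needed.

```python
import unicodedata

# Maps the duplicate code point to the one kept (direction is an assumption).
EXACT_DUPLICATES = {
    0x2420E: 0x3DB7,
    0x23515: 0x204F2,
    0x249E9: 0x249BC,
    0x2A415: 0x24BD2,
    0x26866: 0x26842,
    0x27EAF: 0xFA23,
}


def fold_duplicates(text: str) -> str:
    """Replace each known duplicate with its counterpart; other text is unchanged."""
    return text.translate(EXACT_DUPLICATES)


duplicate = chr(0x2420E)

# Normalization leaves the duplicate as it is...
print(unicodedata.normalize("NFKC", duplicate) == duplicate)  # True

# ...while the explicit table folds it onto U+3DB7.
print(f"U+{ord(fold_duplicates(duplicate)):04X}")  # U+3DB7
```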


Other CJK ideographs in Unicode, not Unified

Apart from the nine blocks of "Unified Ideographs," Unicode has about a dozen more blocks with not-unified CJK-characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have their (decomposable) counterparts in other blocks, the usages can be different. An example of a not-unified CJK-character is 〇 (U+3007) in the CJK Symbols and Punctuation block. Although it is not covered under "CJK Unified Ideographs", it is treated as a CJK-character for all other intents and purposes.

Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets:

* CJK Compatibility (3300–33FF)
* CJK Compatibility Forms (FE30–FE4F)
* CJK Compatibility Ideographs (F900–FAFF)
* CJK Compatibility Ideographs Supplement (2F800–2FA1F)

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.


Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of the CJK fonts. However, Japanese and Korean fonts usually have fewer characters (about 13,000 and 8,000, respectively) than Chinese fonts. Extensions B, C and D are supported by the additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB and SimSun-ExtB, which have been included in Microsoft Windows since Vista.
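The split between the Basic Multilingual Plane and the higher planes also shows up at the encoding level, which is one plausible reason why support for the extensions has lagged. The following Python sketch, with hypothetical helper names, computes the plane of a code point and the number of UTF-16 code units needed to store it.

```python
def plane(cp: int) -> int:
    """Each plane holds 65,536 code points, so the plane is the value above bit 16."""
    return cp >> 16


def utf16_code_units(cp: int) -> int:
    """Count 16-bit units in the UTF-16 encoding (2 means a surrogate pair)."""
    return len(chr(cp).encode("utf-16-le")) // 2


# First code point of a few blocks described in this article.
SAMPLES = {
    "CJK Unified Ideographs": 0x4E00,
    "Extension A": 0x3400,
    "Extension B": 0x20000,
    "Extension G": 0x30000,
}

for name, cp in SAMPLES.items():
    print(f"{name}: U+{cp:04X}, plane {plane(cp)}, "
          f"{utf16_code_units(cp)} UTF-16 code unit(s)")
```

Characters from Extension B onwards need two UTF-16 code units each, so software that assumes one unit per character can mishandle them even when a suitable font is installed.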


Unicode version history


See also

* Han Unification
* List of Unicode characters
* List of CJK fonts
* Ideographic Research Group
* Chinese cultural sphere


Notes


External links


UK-Source Ideographs (Documents IRG N2107R2 and IRG N2232R)