
The Chinese Character Code for Information Interchange (CCCII) is a character set developed by the Chinese Character Analysis Group in Taiwan. It was first published in 1980, and significantly expanded in 1982 and 1987. It is used mostly by library systems. It is one of the earliest established and most sophisticated encodings for traditional Chinese, predating the establishment of Big5 in 1984 and CNS 11643 in 1986. It is distinguished by its unique system for encoding simplified versions and other variants of its main set of hanzi characters. A variant of an earlier version of CCCII is used by the Library of Congress as part of MARC-8, under the name East Asian Character Code (EACC, ANSI/NISO Z39.64), where it forms part of MARC 21's JACKPHY support. However, EACC contains fewer characters than the most recent versions of CCCII.


Design


Byte ranges

CCCII is designed as a 94ⁿ set, as defined by ISO/IEC 2022. Each Chinese character is represented by a 3-byte code in which each byte is 7-bit, between 0x21 and 0x7E inclusive. Thus, the maximum number of Chinese characters representable in CCCII is 94×94×94 = 830584. The number of characters actually encodable is smaller than this, because variant characters are encoded in related ISO 2022 planes under CCCII, so most of the code points have to be reserved for variants.

In practice, however, bytes outside of these ranges are sometimes used. The code 0x212320 is used by some implementations as an ideographic space. A CCCII specification used by libraries in Hong Kong uses codes starting with 0x2120 for punctuation and symbols. Some variants use the first byte 0x7F to encode otherwise unavailable hanzi from the Unified Repertoire and Ordering or CJK Unified Ideographs Extension A (e.g. 0x7F3449 for U+3449 or 0x7F796E for U+796E; note that the continuation bytes match the UCS-2BE code). These continuation bytes may fall outside the 0x21–0x7E or even the 0x20–0x7F range, e.g. 0x7F551C for U+551C, 0x7F5AA4 for U+5AA4 or 0x7F8EDA for U+8EDA.
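The byte structure described above can be illustrated with a minimal Python sketch. The names `split_code` and `decode_7f_extension` are hypothetical helpers, not part of any CCCII specification or library. The second function assumes that the two bytes after 0x7F are the UCS-2BE code unit, as in the examples quoted above.

```python
from typing import NamedTuple, Optional

LOW, HIGH = 0x21, 0x7E  # the 94 byte values allowed in a strict 94^n set


class CCCIIPosition(NamedTuple):
    plane: int
    row: int
    cell: int


def split_code(code: int) -> CCCIIPosition:
    """Split a strict 3-byte CCCII code into 1-based plane, row and cell."""
    raw = code.to_bytes(3, "big")
    if any(not LOW <= b <= HIGH for b in raw):
        raise ValueError(f"0x{code:06X} has a byte outside 0x21-0x7E")
    # Byte 0x21 is position 1, so subtract 0x20 from each byte.
    plane, row, cell = (b - 0x20 for b in raw)
    return CCCIIPosition(plane, row, cell)


def decode_7f_extension(code: int) -> Optional[str]:
    """Return the character for a 0x7F-prefixed code, or None otherwise.

    Assumption: the two trailing bytes are the UCS-2BE code unit.
    """
    raw = code.to_bytes(3, "big")
    if raw[0] != 0x7F:
        return None
    return chr(int.from_bytes(raw[1:], "big"))
```

Under these assumptions, `split_code(0x212A46)` yields plane 1, row 10, cell 38, while `split_code(0x212320)` raises `ValueError` because 0x20 lies outside the strict range, which is why such codes need separate handling.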


Interaction with ISO 2022

CCCII/EACC is not registered in the International Registry of Coded Character Sets to be Used with Escape Sequences and therefore has no standard designation escape for use with ISO 2022. MARC-8 assigns EACC the private-use final byte (F-byte) 0x31 (ASCII "1") in its implementation of ANSI X3.41 (ISO 2022).
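As a rough illustration of how a private-use final byte fits into an ISO 2022 designation, the sketch below builds an escape sequence from the final byte 0x31. It assumes the short multi-byte G0 form, ESC 0x24 followed by the final byte, and that final bytes 0x30–0x3F are the private-use range. The function names are hypothetical, and the exact sequences MARC-8 permits should be checked against the MARC 21 specifications.

```python
ESC = 0x1B
MULTIBYTE = 0x24  # "$", the intermediate byte that marks a multi-byte set
EACC_FINAL_BYTE = 0x31  # the private-use final byte MARC-8 assigns to EACC


def is_private_use_final(final_byte: int) -> bool:
    """ISO 2022 reserves final bytes 0x30-0x3F for private use."""
    return 0x30 <= final_byte <= 0x3F


def designate_multibyte_g0(final_byte: int) -> bytes:
    """Build ESC $ F (assumed form) designating a multi-byte set as G0."""
    if not 0x30 <= final_byte <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")
    return bytes([ESC, MULTIBYTE, final_byte])
```

With these assumptions, `designate_multibyte_g0(EACC_FINAL_BYTE)` produces the three bytes 0x1B 0x24 0x31.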


Layers and variant characters

The 94 ISO 2022 planes are grouped into 16 layers of 6 planes each (except for layer 16, which contains the four planes 91–94). Layer 1 contains both non-hanzi and hanzi characters, with the non-hanzi and most frequently used hanzi being placed in plane 1, and with the remaining five planes consisting of less common hanzi. Layer 2 contains simplified Chinese characters, with their row and cell numbers being the same as their traditional Chinese equivalents in layer 1. Layers 3 through 12 contain further variant forms, at row and cell numbers homologous to the first two layers. The last four layers are used for other purposes. Specifically, layer 13 contains additional characters for Japanese support (kana and Japanese kokuji), and layer 14 contains additional characters for Korean support (hangul). Layer 15 is unused (reserved), while layer 16 is used for other characters.

This distinctive design has been criticized by Christian Wittern of the International Research Institute for Zen Buddhism at Hanazono University, who asserts that the relationship of character variants "is very complex and can not be expressed in a fixed, one-dimensional, hard-wired codetable". Ken Lunde describes it as "one of the most well thought-out character set standards from Taiwan", describing its structure as "to be truly admired", but concluding that OpenType variant form substitution can provide the same level of functionality.

CCCII defines roughly 53940 code points as of its 1987 edition, although a more recent draft from 1989 extends this to 75684 code points (comprising 44167 unique characters and 31517 variants). EACC, the variant used by the Library of Congress, includes only a smaller set of 15686 characters.
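The layer arithmetic described in this section can be sketched in a few lines of Python. The function names are hypothetical, and the code 0x213021 in the usage note is an arbitrary position chosen for illustration, not a statement about which character is assigned there.

```python
PLANES_PER_LAYER = 6
PLANE_COUNT = 94
LAST_VARIANT_LAYER = 12  # layers 1-12 hold root forms and their variants


def layer_of(plane: int) -> int:
    """Return the 1-based layer (1-16) that contains a 1-based plane."""
    if not 1 <= plane <= PLANE_COUNT:
        raise ValueError("plane must be in 1..94")
    return (plane - 1) // PLANES_PER_LAYER + 1


def variant_code(root: int, layer: int) -> int:
    """Move a layer-1 code to the same row and cell in another layer."""
    raw = root.to_bytes(3, "big")
    plane = raw[0] - 0x20
    if layer_of(plane) != 1:
        raise ValueError("root must lie in layer 1")
    if not 1 <= layer <= LAST_VARIANT_LAYER:
        raise ValueError("variant layers are 1 through 12")
    new_plane = plane + (layer - 1) * PLANES_PER_LAYER
    return int.from_bytes(bytes([new_plane + 0x20, raw[1], raw[2]]), "big")


def root_code(code: int) -> int:
    """Map a code in layers 1-12 back to the homologous layer-1 code."""
    raw = code.to_bytes(3, "big")
    plane = raw[0] - 0x20
    if layer_of(plane) > LAST_VARIANT_LAYER:
        raise ValueError("layers 13-16 do not hold variants of layer 1")
    root_plane = (plane - 1) % PLANES_PER_LAYER + 1
    return int.from_bytes(bytes([root_plane + 0x20, raw[1], raw[2]]), "big")
```

For example, `variant_code(0x213021, 2)` gives 0x273021, the layer-2 (simplified) slot at the same row and cell, and `root_code(0x273021)` returns 0x213021. Because the relationship is pure arithmetic on the first byte, no lookup table is needed to relate a variant to its root, which is the property both the criticism and the praise quoted above are about.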


Adoption

As of 1995, CCCII or EACC was used mostly in libraries in the United States, Hong Kong and Taiwan. Although CCCII promised pan-CJK coverage, its support was limited to specialized hardware. Difficulty ascertaining when the root versus variant character should be used, exacerbated by a lack of firmly established reference glyphs, further limited its adoption. As a result, Big5 was more commonly used for Chinese in those territories outside of library use, since Unicode had yet to become widely adopted at the time.

EACC is still in extensive use for specialized bibliographic purposes. It was also an important precursor to Unicode. Unicode hanzi characters are referenced to their corresponding CCCII and EACC codes in dedicated keys of the Unihan database. Mapping tables for hanzi, hangul, kana and punctuation between EACC and Unicode are available from the Library of Congress.


Punctuation, symbol, kana and jamo charts

Following are charts for punctuation, symbols, kana and Hangul jamo, showing the characters and giving possible Unicode mappings. Where possible, these are referenced against published mapping data. Unicode mappings for Hangul syllables are omitted below for brevity, but are documented by the Library of Congress. CCCII hanzi number in the tens of thousands and are not shown below (except where they are also included in the non-hanzi range, as radicals or numerals), but mappings to Unicode are available from the Unihan database and from elsewhere.


Character set 0x2120 (plane 1, row 0: Hong Kong punctuation)

Although CCCII is usually a 94ⁿ set, and therefore does not usually use codes starting with 0x2120, a variant used by libraries in Hong Kong assigns punctuation and symbols to this row.


Character set 0x2121 (plane 1, row 1: reserved for controls)

No characters are assigned in plane 1 row 1, which is reserved for control codes.


Character set 0x2122 (plane 1, row 2: mathematical operators)

This row contains mathematical operators. EACC leaves this row empty. One layout for this row is referenced against sources from Taiwan. A second layout is referenced against CCCII data provided by the Hong Kong Innovative Users Group, a group of libraries in Hong Kong, and hosted by the University of Hong Kong. It uses an entirely different layout in this row.


Character set 0x2123 (plane 1, row 3: Roman and punctuation)

This row includes punctuation, western Arabic numerals and Roman letters. Compare row 3 of Wansung code and row 3 of GB 2312. Variants differ in where they encode the ideographic space (U+3000): at 0x212320 (which the MARC specification acknowledges), at 0x212321 (which is listed in the ANSI standard, and is also acknowledged by MARC), or at 0x21635F. EACC includes only the hyphen-minus, parentheses and ideographic space in this set.
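Because the ideographic space can appear at three different codes, a decoder has to accept all of them. The following is a minimal sketch under stated assumptions: `decode_triplets` is a hypothetical helper, the mapping table is supplied by the caller (for example, built from the Library of Congress mapping data), and unmapped codes fall back to U+FFFD purely as an illustrative choice.

```python
from typing import Mapping

# The three codes listed above, all treated as U+3000.
IDEOGRAPHIC_SPACE_CODES = frozenset({0x212320, 0x212321, 0x21635F})
REPLACEMENT = "\ufffd"  # assumed fallback for codes missing from the table


def decode_triplets(data: bytes, table: Mapping[int, str]) -> str:
    """Decode a run of 3-byte CCCII/EACC codes using a caller-supplied table."""
    if len(data) % 3:
        raise ValueError("data must be a whole number of 3-byte codes")
    out = []
    for i in range(0, len(data), 3):
        code = int.from_bytes(data[i:i + 3], "big")
        if code in IDEOGRAPHIC_SPACE_CODES:
            out.append("\u3000")
        else:
            out.append(table.get(code, REPLACEMENT))
    return "".join(out)
```

Checking the space codes before the table lookup keeps the three spellings equivalent even when the supplied table lists only one of them.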


Character set 0x212A (plane 1, row 10: internal IME characters and geta mark)

In EACC, this row includes several characters mapped to the Private Use Area, which are used internally by the RLIN input method to represent character components. The Library of Congress uses this input method for non-Roman cataloging. These component characters should only be used internally by an IME and, if encountered elsewhere, may be replaced with the geta mark (U+3013), which this row also includes at 0x212A46. This row is unassigned in CCCII, but the geta mark is also listed at that location in some mappings for CCCII.


Character set 0x212B (plane 1, row 11: punctuation)

This row contains various punctuation marks used in Chinese, in addition to other symbols. CCCII includes a set of 35 punctuation marks in this row. EACC includes only 13 characters in this row (shown boxed below).


Character sets 0x212C–0x212E (plane 1, rows 12–14: radicals and ordinals)

These rows contain Chinese radicals, Roman numerals, celestial stems and terrestrial branches.


Character set 0x212F (plane 1, row 15: Chinese numerals and bopomofo)

This row includes Chinese numerals and bopomofo characters. EACC includes only the ideographic zero (〇).


Character set 0x272B (plane 7, row 11: reference mark)

This row contains the reference mark (kome jirushi).


Character sets 0x272E–0x272F (plane 7, rows 14–15: alternative bopomofo)

A variant used by libraries in Hong Kong does not include bopomofo characters in plane 1 row 15, but includes them in a different layout in plane 7.


Character set 0x6921 (plane 73, row 1: Japanese punctuation)

This row is in plane 73, the first plane of layer 13, which contains characters included for Japanese support. It contains punctuation. Compare row 1 of JIS X 0208, whose layout this row tends to follow for the characters it includes.


Character set 0x6924 (plane 73, row 4: hiragana)

This row contains hiragana. Compare row 4 of JIS X 0208.


Character set 0x6925 (plane 73, row 5: katakana)

This row contains katakana. Compare row 5 of JIS X 0208, to which this row corresponds apart from the addition of the separate dakuten and handakuten.


Character sets 0x6F24–0x6F25 (plane 79, rows 4–5: jamo)

These rows contain Korean jamo.


Character set 0x6F76 (plane 79, row 86: archaic Hangul)

This row contains several historic Hangul characters no longer in regular use. Several of these are mapped to the Private Use Area.


Character set 0x7B25 (plane 91, row 5: supplementary Katakana)

This row contains additional katakana used to write foreign phonemes.


References

* Some information on this page is based on the information on the CNS official website


External links


* CNS 11643 official web site (English version of pages available), which has information about the CCCII character set in the "Chinese Information Code" section
* Full mapping of EACC to Unicode, from Library of Congress