A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page. For example, Unicode is a code page that has several encoding (so-called "transformation") forms, such as UTF-8, UTF-16 and UTF-32, but it may or may not actually be accompanied by a CCSID number to indicate that this encoding is being used.

Difference between a code page and a CCSID

The terms code page and CCSID are often used interchangeably, even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions from IBM help to illustrate this point:

* A glyph is the actual physical pattern of pixels or ink that shows up on a display or printout.
* A character is a concept that covers all glyphs associated with a certain symbol. For instance, an "F" rendered in plain, bold, italic, underlined or colored type, or in a different font, is a different glyph each time, but all of these glyphs use the same character. The various modifiers (bold, italic, underline, color, and font) do not change the F's essential F-ness.
* A character set contains the characters necessary to allow a particular human to carry on a meaningful interaction with the computer. It does not specify how those characters are represented in a computer. This level is the first one to separate characters into various alphabets (Latin, Arabic, Hebrew, Cyrillic, and so on) or ideographic groups (e.g., Chinese, Korean). It corresponds to a "character repertoire" in the Unicode encoding model.
* A code page represents a particular assignment of code point values to characters. It corresponds to a "coded character set" in the Unicode encoding model. A code point for a character is the computer's internal representation of that character in a given code page. Many characters are represented by different code points in different code pages. Certain character sets can be adequately represented with single-byte code pages (which have a maximum of 256 code points, hence a maximum of 256 characters), but many require more than that. Examples include JIS X 0208 and Unicode.
* An encoding scheme is the byte format of a code page. It maps code point values to sequences of one or more byte values in a computer. For example, UTF-8 and UTF-16BE are two encodings of the same Unicode code page. They vary only in how many bytes are needed to represent a particular Unicode character value, how that value is contained within those bytes, and how the presence of Unicode information is indicated. In IBM's character data representation architecture (CDRA), the encoding scheme is typically represented with an ESID (encoding scheme identifier). EUC and ISO-2022 are other examples of encoding schemes.
* A coded character set identifier (CCSID) contains all of the information necessary to assign and preserve the meaning and rendering of characters through various stages of processing and interchange. This information always includes at least one code page, but may include multiple code pages of differing byte-lengths. The CCSID also has an associated encoding scheme that governs how various code points are to be handled. This mechanism allows a program to recognize bidirectional orientation, character shaping (mainly of Arabic characters), and other complex encoding information.
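The distinction between a code page and an encoding scheme drawn in the definitions above can be illustrated with a short Python sketch. It relies on assumptions that are not part of the IBM definitions: Python's built-in codecs `cp037` and `cp437` stand in for two single-byte code pages, and the helper names `code_point_in` and `encoded_bytes` are hypothetical. Python codecs bundle the code page and its byte format together, so this is only an approximation of the CDRA model.

```python
def code_point_in(char: str, code_page: str) -> int:
    """Return the code point of `char` in a single-byte code page.

    `code_page` is the name of a Python codec (an assumption of this
    sketch, not an IBM identifier).
    """
    encoded = char.encode(code_page)
    if len(encoded) != 1:
        raise ValueError(f"{code_page!r} is not single-byte for {char!r}")
    return encoded[0]


def encoded_bytes(char: str, scheme: str) -> bytes:
    """Return the byte sequence an encoding scheme produces for `char`."""
    return char.encode(scheme)


if __name__ == "__main__":
    # One character, two code pages, two different code points.
    print(hex(code_point_in("A", "cp037")))  # 0xc1
    print(hex(code_point_in("A", "cp437")))  # 0x41

    # One Unicode code point (U+00E9), two encoding schemes.
    print(encoded_bytes("\u00e9", "utf-8"))      # b'\xc3\xa9'
    print(encoded_bytes("\u00e9", "utf-16-be"))  # b'\x00\xe9'
```

The first pair of calls shows the same character receiving different code points in different code pages; the second pair shows the same code point serialized as different byte sequences by two encoding schemes.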

Examples

The following examples show how some CCSIDs are made up of other CCSIDs. CCSIDs 932, 942 and 5028 are three variant Shift-JIS CCSIDs, and all three are multi-byte character sets (MBCS). The single-byte character set (SBCS) portion of each CCSID is different, while the double-byte character set (DBCS) portion is the same across each CCSID.

CCSID 5028 uses an updated version of code page 897, identified as CCSID 4993. CCSID 932 uses the original code page 897, which is CCSID 897. CCSID 942 uses a different SBCS from the other two CCSIDs, namely 1041. Note also that CCSIDs 5028 and 4993 each differ by 4096 (1000 in hexadecimal) from the predecessor CCSID with the same code page identifier (932 and 897, respectively). This is a common way that CDRA denotes an upgraded CCSID.

There are a few reasons for this complexity:

* Many of the CCSIDs are used in IBM databases, like IBM Db2, where a database field only supports an SBCS, DBCS or MBCS string. CCSIDs allow programs to tell which one is being used.
* When characters are added or replaced, as with the introduction of the euro currency sign, one can tell whether stored strings support those additions, because a different CCSID is used. This versioning is important for the integrity of the data.
* It enables reuse of resources among similar CCSIDs.
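The arithmetic relating an upgraded CCSID to its predecessor can be checked with a small Python sketch. The CCSID numbers come from the examples in this section; the names `CCSID_UPGRADE_OFFSET`, `SBCS_PART`, `is_valid_ccsid` and `predecessor` are hypothetical and exist only for illustration. The sketch assumes a single upgrade step of 4096 and does not model the shared DBCS portion.

```python
# Offset between an upgraded CCSID and its predecessor, per the examples
# above: 4096 decimal, 1000 hexadecimal.
CCSID_UPGRADE_OFFSET = 0x1000

# SBCS portion of each Shift-JIS variant discussed in this section.
SBCS_PART = {932: 897, 942: 1041, 5028: 4993}


def is_valid_ccsid(value: int) -> bool:
    """A CCSID is a 16-bit number, so it must fit in 0..0xFFFF."""
    return 0 <= value <= 0xFFFF


def predecessor(ccsid: int) -> int:
    """Return the CCSID that `ccsid` upgrades, assuming one 4096 step."""
    if ccsid < CCSID_UPGRADE_OFFSET:
        raise ValueError(f"CCSID {ccsid} has no predecessor at this offset")
    return ccsid - CCSID_UPGRADE_OFFSET


if __name__ == "__main__":
    print(predecessor(5028))            # 932
    print(predecessor(SBCS_PART[5028])) # 897
    print(hex(932), hex(5028))          # 0x3a4 0x13a4
```

In hexadecimal the relationship is easy to see: 5028 is 0x13A4 and 932 is 0x3A4, so the two values differ only in the leading hexadecimal digit.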

External links

IBM CDRA (character data representation architecture) glossary of terms

IBM globalization terminology

Complete description of IBM CDRA
(This includes a more detailed description of the architecture surrounding CCSIDs.)

List of CCSIDs supported on the IBM System i computer (https://web.archive.org/web/20150406044931/http://www-03.ibm.com/systems/i/software/globalization/ccsid_list.html)