ISO/IEC 2022 ''Information technology—Character code structure and extension techniques'', is an
ISO/IEC standard (equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202) in the field of character encoding. Originating in 1971, it was most recently revised in 1994.
ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes (0x00–1F and 0x7F–9F) to be used for non-printing control codes for formatting and in-band instructions (such as line breaks or formatting instructions for text terminals), rather than graphical characters. It also specifies a syntax for escape sequences, multiple-byte sequences beginning with the ESC control code, which can likewise be used for in-band instructions.
Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429, portions of which are implemented by ANSI.SYS and terminal emulators.
ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets (for example, between ASCII and the Japanese JIS X 0208) so that several of them can be used in a single document, effectively combining them into a single stateful encoding (a feature less important since the advent of Unicode). It is designed to be usable in both 8-bit environments and 7-bit environments (those where only seven bits are usable in a byte, such as e-mail without 8BITMIME).
Encodings and conformance
Writing systems with relatively few characters, such as Greek, Cyrillic, Arabic, or Hebrew, as well as forms of the Latin alphabet using diacritics or letters absent in the ISO Basic Latin alphabet, have historically been represented on computers with different 8-bit, single-byte, extended ASCII encodings. Some of these, such as the ISO 8859 series, conform to ISO 2022, while others such as DOS code page 437 do not, usually due to not reserving the bytes 0x80–9F for control codes.
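The difference in how the 0x80–9F range is treated can be observed with a small check. The following is only a rough sketch: it relies on the mapping tables of Python's built-in `iso8859_1` and `cp437` codecs as a stand-in for the encodings' definitions, and the helper name `reserves_c1_range` is made up for this illustration.

```python
import unicodedata

def reserves_c1_range(codec_name: str) -> bool:
    """Return True if no byte in 0x80-0x9F decodes to a graphic character."""
    for value in range(0x80, 0xA0):
        try:
            char = bytes([value]).decode(codec_name)
        except UnicodeDecodeError:
            # An unassigned byte is not a graphic character, so it does not break the rule.
            continue
        if unicodedata.category(char) != "Cc":
            return False
    return True

print(reserves_c1_range("iso8859_1"))  # True: 0x80-0x9F are left to control codes
print(reserves_c1_range("cp437"))      # False: code page 437 puts letters and symbols there
```

Passing this check is necessary but not sufficient for ISO 2022 conformance; it tests only the one condition named in the paragraph above.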
Certain East Asian languages, specifically Chinese, Japanese, and Korean (collectively "CJK"), are written using far more characters than the maximum of 256 which can be represented in a single byte, and were first represented on computers with language-specific double-byte encodings or variable-width encodings; some of these (including encodings for Simplified Chinese) conform to ISO 2022, while others (such as the Traditional Chinese encoding Big5) do not. Control codes in ISO 2022 are always represented with a single byte, regardless of the number of bytes used for graphical characters. CJK encodings used in 7-bit environments which use mechanisms to switch between character sets are often given names starting with "ISO-2022-", most notably ISO-2022-JP, although some other CJK encodings such as EUC-JP also make use of ISO 2022 mechanisms.
Since the first 256 code points of Unicode were taken from ISO 8859-1, Unicode inherits the concept of C0 and C1 control codes from ISO 2022, although it adds other non-printing characters besides the ISO 2022 control codes. However, Unicode transformation formats such as UTF-8 generally deviate from the ISO 2022 structure in various ways, including:
* Using 8-bit bytes, but not representing the C1 codes in their single-byte forms specified in ISO 2022 (most UTFs, one exception being the obsolete UTF-1)
* Representing all characters, including control codes, with multiple bytes (e.g. UTF-16, UTF-32)
* Mixing bytes with the most significant bit set and unset within the coded representation for a single code point (e.g. UTF-1)
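The first two deviations can be made concrete with Python's built-in codecs. This is a small illustration rather than a conformance test; NEL (U+0085) is chosen here simply as a representative C1 control code.

```python
nel = "\u0085"  # NEL (Next Line), a C1 control code

# ISO 8859-1 follows the ISO 2022 layout: the C1 code is the single byte 0x85.
print(nel.encode("iso8859_1").hex())   # 85

# UTF-8 spends two bytes on it; 0x85 appears only as a continuation byte.
print(nel.encode("utf-8").hex())       # c285

# UTF-16 and UTF-32 use several bytes even for a C0 control such as LF.
print("\n".encode("utf-16-be").hex())  # 000a
print("\n".encode("utf-32-be").hex())  # 0000000a
```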
ISO 2022 escape sequences do, however, exist for switching to and from UTF-8 as a "coding system different from that of ISO 2022", which are supported by certain terminal emulators such as xterm.
Overview
Elements
ISO/IEC 2022 specifies the following:
* An infrastructure of multiple character sets with particular structures which may be included in a single character encoding system, including multiple graphical character sets and multiple sets of both primary (C0) and secondary (C1) control codes,
* A format for encoding these sets, assuming that 8 bits are available per byte,
* A format for encoding these sets in the same encoding system when only 7 bits are available per byte, and a method for transforming any conformant character data to pass through such a 7-bit environment,
* The general structure of ANSI escape codes, and
* Specific escape code formats for identifying individual character sets, for announcing the use of particular encoding features or subsets, and for interacting with or switching to other encoding systems.
Code versions
A specific implementation does not have to implement all of the standard; the conformance level and the supported character sets are defined by the implementation. Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system.
In particular, 7-bit encoding systems using ISO/IEC 2022 mechanisms include ISO-2022-JP (or JIS encoding), which has primarily been used in Japanese-language e-mail. 8-bit encoding systems conforming to ISO/IEC 2022 include ISO/IEC 4873 (ECMA-43), which is in turn conformed to by ISO/IEC 8859, and Extended Unix Code, which is used for East Asian languages.
More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.
Designation escape sequences
The escape sequences for switching to particular character sets or encodings are registered with the ISO-IR registry (except for those set apart for private use, the meanings of which are defined by vendors, or by protocol specifications such as ARIB STD-B24) and follow the patterns defined within the standard. Character encodings making use of these escape sequences require data to be processed sequentially in a forward direction, since the correct interpretation of the data depends on previously encountered escape sequences.
Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of a line. Furthermore, the escape sequences declaring the national character sets may be absent if a specific ISO-2022-based encoding permits or requires this, and dictates that particular national character sets are to be used. For example, ISO-8859-1 states that no defining escape sequence is needed.
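The stateful behaviour described above can be observed with Python's built-in `iso2022_jp` codec. The sketch below uses that codec's output as an example of one ISO-2022-JP implementation; the sample string is arbitrary.

```python
text = "Aあ"
encoded = text.encode("iso2022_jp")

# 'A' is sent as plain ASCII, ESC $ B switches to JIS X 0208,
# and ESC ( B switches back to ASCII at the end of the data.
print(encoded)                       # b'A\x1b$B$"\x1b(B'
print(encoded.decode("iso2022_jp"))  # Aあ

# The same two bytes read without the preceding escape sequence
# are interpreted as ASCII, which is why decoding must run forward
# from a known state.
print(encoded[4:6].decode("iso2022_jp"))  # $"
```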
Multi-byte characters
To represent large character sets, ISO/IEC 2022 builds on ISO/IEC 646's property that one seven-bit character will normally define 94 graphic (printable) characters (in addition to space and 33 control characters); if only the C0 control codes (narrowly defined) are excluded, this can be expanded to 96 characters. Using two bytes, it is thus possible to represent up to 8,836 (94×94) characters; and, using three bytes, up to 830,584 (94×94×94) characters. Though the standard defines it, no registered character set uses three bytes (although EUC-TW's unregistered G2 does, as does the similarly unregistered CCCII).
For the two-byte character sets, the code point of each character is normally specified in so-called ''row-cell'' or ''kuten'' (Chinese: 区位, ''qūwèi'') form, which comprises two numbers between 1 and 94 inclusive, specifying a row (区, ''qū'') and cell (位, ''wèi'', "position") of that character within the zone. For a three-byte set, an additional ''plane'' number is included at the beginning.
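As a worked illustration of the capacity figures and of row-cell numbering, the sketch below converts a row-cell pair to bytes. It assumes the usual convention that each number is offset by 0x20 for the 7-bit (GL) form and by 0xA0 for the 8-bit (GR) form; the helper name `kuten_to_bytes` is hypothetical, and Python's `euc_jp` codec is used only to check one JIS X 0208 character.

```python
def kuten_to_bytes(row: int, cell: int, high_bit: bool = False) -> bytes:
    """Convert a row-cell pair (each 1-94) to its two-byte coded form."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be between 1 and 94")
    offset = 0xA0 if high_bit else 0x20  # GR form or GL form
    return bytes([row + offset, cell + offset])

print(94 ** 2, 94 ** 3)                  # 8836 830584

# Row 4, cell 2 of JIS X 0208 is the hiragana letter あ.
print(kuten_to_bytes(4, 2).hex())        # 2422
print(kuten_to_bytes(4, 2, True).hex())  # a4a2
print(kuten_to_bytes(4, 2, True).decode("euc_jp"))  # あ
```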
The escape sequences do not only declare which character set is being used, but also whether the set is single-byte or multi-byte (although not how many bytes it uses if it is multi-byte), and also whether each byte has 94 or 96 permitted values.
Code structure
Notation and nomenclature
ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. Escape sequences allow any of a large registry of graphic character sets to be "designated" into one of four working sets, named G0 through G3, and shorter control sequences specify the working set that is "invoked" to interpret bytes in the stream.
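The two layers can be sketched as a small state machine for a 7-bit stream. This is a simplified illustration under several assumptions that are not spelled out in the text above: only single-byte 94-character sets are handled, designation uses ESC followed by 0x28–0x2B (for G0 through G3) and a final byte, and the control codes SO (0x0E) and SI (0x0F) invoke G1 and G0 over GL. The three-entry `REGISTRY` and the function name `trace_sets` are made up for the example.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Tiny illustrative registry: final byte -> name of a 94-character set.
REGISTRY = {0x42: "ASCII", 0x49: "JIS X 0201 katakana", 0x4A: "JIS X 0201 Roman"}

def trace_sets(data: bytes) -> list:
    """Return (set name, byte) for each graphic byte in a 7-bit stream."""
    g = ["ASCII", None, None, None]  # layer 1: sets designated to G0..G3
    gl = 0                           # layer 2: working set invoked over GL
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC and i + 2 < len(data) and 0x28 <= data[i + 1] <= 0x2B:
            # Designate the set named by the final byte into G0..G3.
            g[data[i + 1] - 0x28] = REGISTRY.get(data[i + 2], "not in this sketch")
            i += 3
            continue
        if b == SO:
            gl = 1                   # invoke G1 over GL
        elif b == SI:
            gl = 0                   # invoke G0 over GL
        elif 0x21 <= b <= 0x7E:
            out.append((g[gl], b))   # graphic byte read through the invoked set
        i += 1
    return out

# 'A', designate katakana to G1, shift out, one katakana byte, shift in, 'B'.
print(trace_sets(b"A\x1b)I\x0e1\x0fB"))
```

Designation alone changes nothing visible: the byte 0x31 is read as katakana only after SO has invoked G1.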
Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash. Hence, for instance, codes 2/0 (0x20) through 2/15 (0x2F) inclusive may be referred to as "column 02". This is the notation used in the ISO/IEC 2022 / ECMA-35 standard itself. They may be described elsewhere using hexadecimal, as is often used in this article, or using the corresponding ASCII characters, although the escape sequences are actually defined in terms of byte values, and the graphic assigned to that byte value may be altered without affecting the control sequence.
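Conversion between column-line notation and byte values is simple arithmetic: the column is the high four bits and the line is the low four bits. A minimal sketch follows; the function names are hypothetical, and the output is written without zero padding, as in the "2/15" example above.

```python
def to_column_line(byte: int) -> str:
    """Format a byte value in column-line notation, e.g. 0x2F -> '2/15'."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    return f"{byte >> 4}/{byte & 0x0F}"

def from_column_line(text: str) -> int:
    """Parse column-line notation (padded or not) back into a byte value."""
    column, line = (int(part) for part in text.split("/"))
    if not (0 <= column <= 15 and 0 <= line <= 15):
        raise ValueError("column and line must each be between 0 and 15")
    return column * 16 + line

print(to_column_line(0x1B))           # 1/11 (the ESC character)
print(hex(from_column_line("02/15"))) # 0x2f
```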
Byte values from the 7-bit ASCII graphic range (hexadecimal 0x20–0x7F), being on the left side of a character code table, are referred to as "GL" codes ''(with "GL" standing for "graphics left")'' while bytes from the "high ASCII" range (0xA0–0xFF), if available (i.e. in an 8-bit environment), are referred to as the "GR" codes ''("graphics right")''.
The terms "CL" (0x00–0x1F) and "CR" (0x80–0x9F) are defined for the control ranges, but the CL range always invokes the primary (C0) controls, whereas the CR range always either invokes the secondary (C1) controls or is unused.
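The four regions can be summarised as a lookup on the byte value. The sketch below follows the ranges as given above, so 0x20 and 0x7F are counted as part of GL; the function name `region` is hypothetical.

```python
def region(byte: int) -> str:
    """Name the code-table region (CL, GL, CR or GR) that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    if byte <= 0x1F:
        return "CL"  # primary (C0) controls
    if byte <= 0x7F:
        return "GL"  # graphics left
    if byte <= 0x9F:
        return "CR"  # secondary (C1) controls, 8-bit environments only
    return "GR"      # graphics right, 8-bit environments only

print([region(b) for b in (0x1B, 0x41, 0x8E, 0xA4)])  # ['CL', 'GL', 'CR', 'GR']
```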
Fixed coded characters
The delete character DEL (0x7F), the escape character ESC (0x1B) and the space character SP (0x20) are designated "fixed" coded characters and are always available when G0 is invoked over GL, irrespective of what character sets are designated. They may not be included in graphical character sets, although other sizes or types of whitespace character may be.
General syntax of escape sequences
Sequences using the ESC (escape) character take the form ESC [I...] F, where the ESC character is followed by zero or more intermediate bytes (I) from the range 0x20–0x2F, and one final byte (F) from the range 0x30–0x7E.
The first intermediate byte, or the absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function. In all types of escape sequences, final bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.
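The general syntax translates directly into a regular expression. The following is a minimal sketch; the names `ESCAPE_SEQUENCE` and `parse_escape` are hypothetical, and the private-use flag assumes that the reserved range 0x30–0x3F applies to the final byte.

```python
import re

# ESC, then zero or more intermediate bytes, then exactly one final byte.
ESCAPE_SEQUENCE = re.compile(rb"\x1b([\x20-\x2f]*)([\x30-\x7e])")

def parse_escape(data: bytes):
    """Split an escape sequence at the start of data into its parts, or return None."""
    match = ESCAPE_SEQUENCE.match(data)
    if match is None:
        return None
    intermediates, final = match.group(1), match.group(2)
    return {
        "intermediates": intermediates,
        "final": final,
        "private": 0x30 <= final[0] <= 0x3F,  # reserved for private use
        "length": match.end(),
    }

print(parse_escape(b"\x1b$B"))  # one intermediate byte, registered final byte
print(parse_escape(b"\x1b7"))   # no intermediate byte, private-use final byte
```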
Control functions from some sets may make use of further bytes following the escape sequence proper. For example, the ISO 6429 control function "Control Sequence Introducer" (CSI), which can be represented using an escape sequence, is followed by zero or more bytes in the range 0x30–0x3F, then zero or more bytes in the range 0x20–0x2F, then by a single byte in the range 0x40–0x7E, the entire sequence being called a "control sequence".
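The structure of a control sequence can be sketched in the same way. This example assumes that the introducing control function is CSI in its 7-bit escape-sequence form, ESC followed by 0x5B ("["); the names `CONTROL_SEQUENCE` and `parse_control_sequence` are hypothetical.

```python
import re

# ESC [ , then bytes from 0x30-0x3F, then bytes from 0x20-0x2F, then one byte from 0x40-0x7E.
CONTROL_SEQUENCE = re.compile(rb"\x1b\x5b([\x30-\x3f]*)([\x20-\x2f]*)([\x40-\x7e])")

def parse_control_sequence(data: bytes):
    """Split a control sequence at the start of data into its three parts, or return None."""
    match = CONTROL_SEQUENCE.match(data)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

# A colour-setting sequence familiar from terminal emulators.
print(parse_control_sequence(b"\x1b[1;31m"))  # (b'1;31', b'', b'm')
# A sequence with no leading bytes, one 0x20 byte, and the final byte 0x5F.
print(parse_control_sequence(b"\x1b[ _"))     # (b'', b' ', b'_')
```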
Graphical character sets
Each of the four working sets G0 through G3 may be a 94-character set or a 94^n-character multi-byte set. Additionally, G1 through G3 may be a 96- or 96^n-character set.
In a 96- or 96^n-character set, the bytes 0x20 through 0x7F when GL-invoked, or 0xA0 through 0xFF when GR-invoked, are allocated to and may be used by the set. In a 94- or 94^n-character set, the bytes 0x20 and 0x7F are not used.
When a 96- or 96^n-character set is invoked in the GL region, the space and delete characters (codes 0x20 and 0x7F) are not available until a 94- or 94^n-character set (such as the G0 set) is invoked in GL.
96-character sets cannot be designated to G0.
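These rules can be summarised in two small helpers. The sketch makes two assumptions beyond the text: a 94-character set invoked over GR omits 0xA0 and 0xFF in the same way that it omits 0x20 and 0x7F over GL, and each byte of a multi-byte set is subject to the same per-byte range. The function names are hypothetical.

```python
def graphic_byte_range(set_size: int, region: str) -> range:
    """Byte values usable by each byte of a 94- or 96-type set in GL or GR."""
    base = {"GL": 0x20, "GR": 0xA0}[region]
    if set_size == 96:
        return range(base, base + 0x60)      # 0x20-0x7F or 0xA0-0xFF
    if set_size == 94:
        return range(base + 1, base + 0x5F)  # 0x21-0x7E or 0xA1-0xFE
    raise ValueError("set_size must be 94 or 96")

def can_designate(set_size: int, working_set: int) -> bool:
    """Check whether a 94- or 96-type set may be designated to G0..G3."""
    if working_set not in (0, 1, 2, 3):
        raise ValueError("working_set must be 0, 1, 2 or 3")
    return set_size == 94 or (set_size == 96 and working_set != 0)

print(len(graphic_byte_range(94, "GL")), len(graphic_byte_range(96, "GR")))  # 94 96
print(can_designate(96, 0), can_designate(96, 1))  # False True
```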
Registration of a set as a 96-character set does not necessarily mean that the 0x20/A0 and 0x7F/FF bytes are actually assigned by the set; some examples of graphical character sets which are registered as 96-sets but do not use those bytes include the G1 set of I.S. 434, the box drawing set from ISO/IEC 10367, and ISO-IR-164 (a subset of the G1 set of ISO-8859-8 with only the letters, used by CCITT).
Combining characters
Characters are expected to be spacing characters, not combining characters, unless specified otherwise by the graphical set in question.
ISO 2022 / ECMA-35 also recognizes the use of the backspace and carriage return control characters as means of combining otherwise spacing characters, as well as the CSI sequence "Graphic Character Combination" (GCC) (CSI 0x20 (SP) 0x5F (_)).
Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43 and by ISO/IEC 8859, on the basis that it leaves the graphical character repertoire undefined. ISO/IEC 4873 / ECMA-43 does, however, permit the use of the GCC function provided that the sequence of characters is kept the same and merely displayed in one space, rather than being over-stamped to form a character with a different meaning.
Control character sets
Control character sets are classified as "primary" or "secondary" control code sets,
respectively also called "C0" and "C1" control code sets.
A C0 control set must contain the ESC (escape) control character at 0x1B
(a C0 set containing only ESC is registered as ISO-IR-104), whereas a C1 control set may not contain the escape control whatsoever.
Hence, they are entirely separate registrations, with a C0 set being only a C0 set and a C1 set being only a C1 set.
If codes from the C0 set of ISO 6429 / ECMA-48, i.e. the
ASCII control codes, appear in the C0 set, they are required to appear at their ISO 6429 / ECMA-48 locations.
Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB), or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.
A C0 control set is invoked over the CL range 0x00 through 0x1F,
whereas a C1 control function may be invoked over the CR range 0x80 through 0x9F (in an 8-bit environment) or by using escape sequences (in a 7-bit or 8-bit environment),
but not both. Which style of C1 invocation is used must be specified in the definition of the code version.
For example,
ISO/IEC 4873
specifies CR bytes for the C1 controls which it uses (SS2 and SS3).
If necessary, which invocation is used may be communicated using
announcer sequences.
In the latter case, single control functions from the C1 control code set are invoked using "type Fe" escape sequences,
meaning those where the ESC control character is followed by a byte from columns 04 or 05 (that is to say,
ESC 0x40 (@)
through
ESC 0x5F (_)
).
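The correspondence between the two C1 representations can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual fixed offset of 0x40 between a C1 byte and the final byte of its "type Fe" escape sequence; the function names are hypothetical and not taken from the standard.

```python
ESC = 0x1B

def c1_to_escape(byte: int) -> bytes:
    """Return the 7-bit 'type Fe' escape sequence for an 8-bit C1 byte."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 byte (0x80-0x9F)")
    return bytes([ESC, byte - 0x40])

def escape_to_c1(seq: bytes) -> int:
    """Return the 8-bit C1 byte represented by a 'type Fe' escape sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a type Fe escape sequence")
    return seq[1] + 0x40
```

Under this mapping, the single shift SS2 at 0x8E corresponds to ESC 0x4E (N). A code version would use only one of the two forms, as described above.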
Other control functions
Additional control functions are assigned to "type Fs" escape sequences (in the range
ESC 0x60 (`)
through
ESC 0x7E (~)
); these have permanently assigned meanings rather than depending on the C0 or C1 designations.
Registration of control functions to type "Fs" sequences must be approved by
ISO/IEC JTC 1/SC 2
.
Other single control functions may be registered to type "3Ft" escape sequences (in the range
ESC 0x23 (#) [I...] 0x40 (@)
through
ESC 0x23 (#) [I...] 0x7E (~)
),
although no "3Ft" sequences are currently assigned (as of 2019).
The following escape sequences are assigned for single control functions:
Escape sequences of type "Fp" (
ESC 0x30 (0)
through
ESC 0x3F (?)
) or of type "3Fp" (
ESC 0x23 (#) [I...] 0x30 (0)
through
ESC 0x23 (#) [I...] 0x3F (?)
) are reserved for single private use control codes, by prior agreement between parties.
Several such sequences of both types are used by
DEC terminals such as the
VT100
, and are thus supported by
terminal emulator
s.
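The byte ranges given above are enough to classify an escape sequence by type. The following Python sketch is illustrative only: the function name is hypothetical, and the generalisation of the "3F" prefix to other leading intermediate bytes (naming the type after the low four bits of the first intermediate byte) is an assumption made for the example.

```python
def classify_escape(seq: bytes) -> str:
    """Classify an escape sequence as Fp, Fe, Fs, or an nF type such as 3Fp/3Ft."""
    if len(seq) < 2 or seq[0] != 0x1B:
        raise ValueError("not an escape sequence")
    body, final = seq[1:-1], seq[-1]
    if any(not 0x20 <= b <= 0x2F for b in body):
        raise ValueError("intermediate bytes must be in 0x20-0x2F")
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in 0x30-0x7E")
    if not body:
        if final <= 0x3F:
            return "Fp"  # private use
        if final <= 0x5F:
            return "Fe"  # C1 control in 7-bit form
        return "Fs"      # registered single control function
    # With intermediates, name the type after the first one ('#' gives "3F").
    prefix = f"{body[0] - 0x20}F"
    return prefix + ("p" if final <= 0x3F else "t")
```

For instance, the two-byte sequence ESC 7 falls in the private-use "Fp" range and ESC # 8 in the "3Fp" range, which is where several DEC terminal sequences sit.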
Shift functions
By default, GL codes specify G0 characters and GR codes (where available) specify G1 characters; this may be otherwise specified by prior agreement. The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below.
An 8-bit code may have GR codes specifying G1 characters, i.e. with its corresponding 7-bit code using
Shift In
and
Shift Out
to switch between the sets (e.g.
JIS X 0201
), although some instead have GR codes specifying G2 characters, with the corresponding 7-bit code using a single-shift code to access the second set (e.g.
T.51).
The codes shown in the table below are the most common encodings of these control codes, conforming to
ISO/IEC 6429
. The LS2, LS3, LS1R, LS2R and LS3R shifts are registered as single control functions and are always encoded as the escape sequences listed below,
whereas the others are part of a C0 or C1 control code set (as shown below, SI (LS0) and SO (LS1) are C0 controls and SS2 and SS3 are C1 controls), meaning that their coding and availability may vary depending on which control sets are designated: they must be present in the designated control sets if their functionality is used.
The C1 controls themselves, as mentioned above, may be represented using escape sequences or 8-bit bytes, but not both.
Alternative encodings of the single-shifts as C0 control codes are available in certain control code sets. For example, SS2 and SS3 are usually available at 0x19 and 0x1D respectively in
T.51 and
T.61.
This coding is currently recommended by ISO/IEC 2022 / ECMA-35 for applications requiring 7-bit single-byte representations of SS2 and SS3, and may also be used for SS2 only,
although older code sets with SS2 at 0x1C also exist,
and were mentioned as such in an earlier edition of the standard. The 0x8E and 0x8F coding of the single shifts as shown below is mandatory for
ISO/IEC 4873
levels 2 and 3.
Although officially considered shift codes and named accordingly, single-shift codes are not always viewed as shifts,
and they may simply be viewed as prefix bytes (i.e. the first bytes in a multi-byte sequence),
since they do not require the encoder to keep the currently active set as
state
, unlike locking shift codes. In 8-bit environments, either GL or GR, but not both, may be used as the single-shift area. This must be specified in the definition of the code version.
For instance,
ISO/IEC 4873
specifies GL, whereas
packed EUC specifies GR. In 7-bit environments, only GL is used as the single-shift area.
If necessary, which single-shift area is used may be communicated using
announcer sequences.
The names "locking shift zero" (LS0) and "locking shift one" (LS1) refer to the same pair of C0 control characters (0x0F and 0x0E) as the names "shift in" (SI) and "shift out" (SO). However, the standard refers to them as LS0 and LS1 when they are used in 8-bit environments and as SI and SO when they are used in 7-bit environments.
The ISO/IEC 2022 / ECMA-35 standard permits, but discourages, invoking G1, G2 or G3 in both GL and GR simultaneously.
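The difference between locking and single shifts can be illustrated with a small state machine. The Python sketch below makes several simplifying assumptions: a 7-bit environment, single-byte sets only, SI and SO as the locking shifts for G0 and G1, and the single shifts SS2 and SS3 in their escape-sequence forms ESC N and ESC O. Bytes outside 0x21–0x7E are skipped, and the function name is hypothetical.

```python
SI, SO, ESC = 0x0F, 0x0E, 0x1B

def tag_bytes(data: bytes) -> list[tuple[str, int]]:
    """Label each graphic byte with the working set it is read from."""
    locked = "G0"   # state kept by locking shifts
    single = None   # pending single shift, consumed by the next graphic byte
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            locked = "G0"
        elif b == SO:
            locked = "G1"
        elif b == ESC and i + 1 < len(data) and data[i + 1] in (0x4E, 0x4F):
            single = "G2" if data[i + 1] == 0x4E else "G3"
            i += 1  # skip the final byte of the escape sequence
        elif 0x21 <= b <= 0x7E:
            out.append((single or locked, b))
            single = None  # a single shift applies to one character only
        i += 1
    return out
```

The locked set persists until the next locking shift, whereas a single shift affects only the character that immediately follows, which is why it can be treated as a prefix rather than as state.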
Registration of graphical and control code sets
The ''ISO International register of coded character sets to be used with escape sequences'' (ISO-IR) lists graphical character sets, control code sets, single control codes and so forth which have been registered for use with ISO/IEC 2022. The procedure for registering codes and sets with the ISO-IR registry is specified by ISO/IEC 2375. Each registration receives a unique escape sequence, and a unique registry entry number to identify it.
For example, the
CCITT
character set for
Simplified Chinese
is known as
ISO-IR-165
.
Registration of coded character sets with the ISO-IR registry identifies the documents specifying the character set or control function associated with an ISO/IEC 2022 non‑private-use escape sequence. This may be a standard document; however, registration does not create a new ISO standard, does not commit the ISO or IEC to adopt it as an international standard, and does not commit the ISO or IEC to add any of its characters to the
Universal Coded Character Set
.
ISO-IR registered escape sequences are also used encapsulated in a
Formal Public Identifier
to identify character sets in
SGML
(ISO 8879). For example, such a string can be used to identify the International Reference Version of
ISO 646
-1983,
and the
HTML
4.01 specification uses one to identify Unicode. The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.
Character set designations
Escape sequences to designate character sets take the form
ESC I [I...] F
. As mentioned above, the intermediate (I) bytes are from the range 0x20–0x2F, and the final (F) byte is from the range 0x30–0x7E. The first I byte (or, for a multi-byte set, the first two) identifies the type of character set and the working set it is to be designated to, whereas the F byte (and any additional I bytes) identify the character set itself, as assigned in the ISO-IR register (or, for the private-use escape sequences, by prior agreement).
Additional I bytes may be added before the F byte to extend the F byte range. This is currently only used with 94-character sets, where codes of the form
ESC ( ! F
have been assigned.
At the other extreme, no multibyte 96-sets have been registered, so the sequences below are strictly theoretical.
As with other escape sequence types, the range 0x30–0x3F is reserved for private-use F bytes,
in this case for private-use character set definitions (which might include unregistered sets defined by protocols such as
ARIB STD-B24 or
MARC-8
,
or vendor-specific sets such as
DEC Special Graphics
). However, in a graphical set designation sequence, if the second I byte (for a single-byte set) or the third I byte (for a double-byte set) is 0x20 (space), the set denoted is a "
dynamically redefinable character set" (DRCS) defined by prior agreement,
which is also considered private use.
A graphical set being considered a DRCS implies that it represents a font of exact glyphs, rather than a set of abstract characters.
The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with byte 0x40 (
@
);
however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as
World System Teletext
.
There are also three special cases for multi-byte codes. The code sequences
ESC $ @
,
ESC $ A
, and
ESC $ B
were all registered when the contemporary version of the standard allowed multi-byte sets only in G0, so must be accepted in place of the sequences
ESC $ ( @
through
ESC $ ( B
to designate to the G0 character set.
There are additional (rarely used) features for switching control character sets, but this is a single-level lookup, in that (as noted above) the C0 set is always invoked over CL, and the C1 set is always invoked over CR or by using escape codes. As noted above, it is required that any C0 character set include the ESC character at position 0x1B, so that further changes are possible. The control set designation sequences (as opposed to the graphical set ones) may also be used from within
ISO/IEC 10646
(UCS/Unicode), in contexts where processing
ANSI escape code
s is appropriate, provided that each byte in the sequence is padded to the code unit size of the encoding.
A table of escape sequence bytes and the designation or other function which they perform is below.
Note that the registry of bytes is independent for the different types. The 94-character graphic set designated by
ESC ( A
through
ESC + A
is not related in any way to the 96-character set designated by
ESC - A
through
ESC / A
. And neither of those is related to the 94ⁿ-character set designated by
ESC $ ( A
through
ESC $ + A
, and so on; the final bytes must be interpreted in context. (Indeed, without any intermediate bytes,
ESC A
is a way of specifying the C1 control code 0x81.)
Also note that C0 and C1 control character sets are independent; the C0 control character set designated by
ESC ! A
(which happens to be the NATS control set for newspaper text transmission) is not the same as the C1 control character set designated by
ESC " A
(the
CCITT
attribute control set for
Videotex
).
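The structure of a designation sequence, including the three legacy multi-byte forms, can be sketched as a small parser. This Python example is an illustration under stated assumptions rather than a complete implementation: it covers only the intermediate bytes used in the examples of this section (`!` and `"` for C0 and C1 sets, `(` `)` `*` `+` for 94-sets in G0 to G3, `-` `.` `/` for 96-sets in G1 to G3, with a leading `$` marking a multi-byte set), it ignores additional intermediate bytes and DRCS handling, and the names `TARGETS` and `parse_designation` are hypothetical.

```python
ESC = 0x1B

# First I byte (after any '$') -> (working set, size of a graphical set)
TARGETS = {
    0x21: ("C0", None), 0x22: ("C1", None),
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}

def parse_designation(seq: bytes) -> dict:
    """Split a designation sequence into target set, set size and final byte."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("expected ESC, at least one I byte and an F byte")
    inter, final = seq[1:-1], seq[-1]
    if any(not 0x20 <= b <= 0x2F for b in inter) or not 0x30 <= final <= 0x7E:
        raise ValueError("I bytes must be 0x20-0x2F and the F byte 0x30-0x7E")
    multibyte = inter[0] == 0x24  # '$'
    if multibyte:
        inter = inter[1:]
        if not inter and final in b"@AB":
            inter = b"("  # legacy ESC $ @ / A / B designate to G0
    if not inter or inter[0] not in TARGETS:
        raise ValueError("not a designation handled by this sketch")
    target, size = TARGETS[inter[0]]
    return {
        "target": target,
        "size": size,
        "multibyte": multibyte,
        "final": chr(final),
        "private_use": final <= 0x3F,
    }
```

Because the final byte is only meaningful together with the intermediate bytes, the parser returns both; `ESC ( A`, `ESC - A` and `ESC $ ( A` all share the final byte `A` but denote unrelated sets.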
Interaction with other coding systems
The standard also defines a way to specify coding systems that do not follow its own structure.
A sequence is also defined for returning to ISO/IEC 2022; the registrations which support this sequence as encoded in ISO/IEC 2022 comprise (as of 2019) various
Videotex
formats,
UTF-8
, and
UTF-1
.
A second byte of 0x2F (
/
) is included in the designation sequences of codes which do not use that byte sequence to return to ISO 2022; they may have their own means to return to ISO 2022 (such as a different or padded sequence) or none at all.
All existing registrations of the latter type (as of 2019) are either transparent raw data,
Unicode/UCS formats, or subsets thereof.
Of particular interest are the sequences which switch to
ISO/IEC 10646
(
Unicode
) formats which do not follow the ISO/IEC 2022 structure. These include UTF-8 (which does not reserve the range 0x80–0x9F for control characters), its predecessor UTF-1 (which mixes GR and GL bytes in multi-byte codes), and UTF-16 and UTF-32 (which use wider coding units).
Several codes were also registered for subsets (levels 1 and 2) of UTF-8, UTF-16 and UTF-32, as well as for three levels of
UCS-2
.
However, the only codes currently specified by ISO/IEC 10646 are the level-3 codes for UTF-8, UTF-16 and UTF-32 and the unspecified-level code for UTF-8, with the rest being listed as deprecated.
ISO/IEC 10646 stipulates that the
big-endian
formats of UTF-16 and UTF-32 are designated by their escape sequences.
Of the sequences switching to UTF-8,
ESC % G
is the one supported by, for example,
xterm
.
Although use of a variant of the standard return sequence from UTF-16 and UTF-32 is permitted, the bytes of the escape sequence must be padded to the size of the code unit of the encoding (i.e.
001B 0025 0040
for UTF-16), i.e. the coding of the standard return sequence does not conform exactly to ISO/IEC 2022. For this reason, the designations for UTF-16 and UTF-32 use a without-standard-return syntax.
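The padding of the return sequence can be illustrated as follows. This is a minimal Python sketch, assuming big-endian code units as stipulated for the UTF-16 and UTF-32 designations; the helper name `pad_escape` is hypothetical.

```python
RETURN_SEQUENCE = b"\x1b%@"  # ESC % @, as in the UTF-16 example 001B 0025 0040

def pad_escape(seq: bytes, unit_bytes: int) -> bytes:
    """Widen each byte of an escape sequence to one big-endian code unit."""
    return b"".join(b.to_bytes(unit_bytes, "big") for b in seq)

utf16_form = pad_escape(RETURN_SEQUENCE, 2)  # three 16-bit code units
utf32_form = pad_escape(RETURN_SEQUENCE, 4)  # three 32-bit code units
```

The padded form is the same as encoding the three characters ESC, % and @ in the target encoding, which is why it no longer matches the three-byte sequence an ISO/IEC 2022 decoder would look for.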
Code structure announcements
The sequence "announce code structure" (
ESC SP (0x20)
) is used to ''announce'' a specific code structure, or a specific group of ISO 2022 facilities which are used in a particular code version. Although announcements can be combined, certain contradictory combinations (specifically, using locking shift announcements 16–23 with announcements 1, 3 and 4) are prohibited by the standard, as is using additional announcements on top of
ISO/IEC 4873
level announcements 12–14
(which fully specify the permissible structural features). Announcement sequences are as follows:
ISO/IEC 2022 code versions
Six 7-bit ISO 2022 code versions (ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2 and ISO-2022-KR) are defined by
IETF RFC
s, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past.
A number of other variants are defined by vendors, including
IBM.
Although UTF-8 is the preferred encoding in
HTML5
, legacy content in ISO-2022-JP remains sufficiently widespread that the
WHATWG
encoding standard retains support for it,
in contrast to mapping ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT
entirely to the
replacement character
,
due to concerns about
code injection
attacks such as
cross-site scripting
.
8-bit code versions include
Extended Unix Code
.
The
ISO/IEC 8859
encodings also follow ISO 2022, in a subset stipulated in ISO/IEC 4873.
Japanese e-mail versions
ISO-2022-JP
is a widely used encoding for Japanese, in particular in
e-mail
. It was introduced for use on the JUNET network and later codified in
IETF RFC
1468, dated 1993.
It has an advantage over other
encodings for Japanese in that it does not require
8-bit clean
transmission. Microsoft calls it Code page 50220.
It starts in ASCII and includes the following escape sequences:
*
ESC ( B
to switch to ASCII (1 byte per character)
*
ESC ( J
to switch to
JIS X 0201-1976 (ISO/IEC 646:JP) Roman set (1 byte per character)
*
ESC $ @
to switch to
JIS X 0208-1978 (2 bytes per character)
*
ESC $ B
to switch to
JIS X 0208-1983 (2 bytes per character)
Use of the two characters added in JIS X 0208-1990 is permitted, but without including the IRR sequence, i.e. using the same escape sequence as JIS X 0208-1983.
Also, because they were registered before multi-byte sets could be designated to anything other than G0, the escapes for JIS X 0208 do not include the second I byte, "(".
The RFC notes that some existing systems did not distinguish
ESC ( B
from
ESC ( J
, or did not distinguish
ESC $ @
from
ESC $ B
, but stipulates that the escape sequences should not be changed by systems simply relaying messages such as e-mails.
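The way the four escape sequences partition an ISO-2022-JP byte stream can be sketched as below. This Python example is illustrative: it recognises only the four sequences listed above, rejects anything else, and does not map the bytes of each run to characters. The names `ESCAPES` and `split_runs` are hypothetical.

```python
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}

def split_runs(data: bytes) -> list[tuple[str, bytes]]:
    """Split an ISO-2022-JP stream into (character set, raw bytes) runs."""
    current = "ASCII"  # the stream starts in ASCII
    runs = []
    start = i = 0
    while i < len(data):
        if data[i] == 0x1B:
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unsupported escape sequence {seq!r}")
            if i > start:
                runs.append((current, data[start:i]))
            current = ESCAPES[seq]
            i += 3
            start = i
        else:
            i += 1
    if len(data) > start:
        runs.append((current, data[start:]))
    return runs

# Four ASCII bytes, then a six-byte (three-character) JIS X 0208-1983 run.
sample = b"JIS \x1b$BF|K\\8l\x1b(B"
runs = split_runs(sample)
```

Every byte in the sample is below 0x80, which is the property that lets the encoding pass through channels that are not 8-bit clean. A relay that rewrote `ESC ( J` as `ESC ( B`, or `ESC $ @` as `ESC $ B`, would change the labels of the runs, which is the behaviour the RFC asks relaying systems to avoid.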
The
WHATWG
Encoding Standard referenced by
HTML5
handles
ESC ( B
and
ESC ( J
distinctly, but treats
ESC $ @
the same as
ESC $ B
when decoding, and uses only
ESC $ B
for JIS X 0208 when encoding.
The RFC also notes that some past systems had made erroneous use of the sequence
ESC ( H
to switch away from JIS X 0208, which is actually registered for
ISO-IR-11
(a Swedish variant of
ISO 646
and
World System Teletext
).
Versions with halfwidth katakana
Use of
ESC ( I
to switch to the
JIS X 0201-1976 Kana set (1 byte per character) is not part of the ISO-2022-JP profile,
but is also sometimes used.
Python
allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below, completing coverage of
EUC-JP
); this is close in both name and structure to an encoding denoted ISO-2022-JPext by
DEC, which furthermore adds a two-byte
user-defined region accessed with
ESC $ ( 0
to complete the coverage of
Super DEC Kanji.
The WHATWG/HTML5 variant permits decoding JIS X 0201 katakana in ISO-2022-JP input, but converts the characters to their JIS X 0208 equivalents upon encoding.
Microsoft's code page for ISO-2022-JP with JIS X 0201 kana additionally permitted is Code page 50221.
Other, older variants known as JIS7 and JIS8 build directly on the 7-bit and 8-bit encodings defined by
JIS X 0201
and allow use of JIS X 0201 kana from G1 without escape sequences, using
Shift Out and Shift In or setting the eighth bit (GR-invoked), respectively.
They are not widely used;
JIS X 0208 support in extended 8-bit JIS X 0201 is more commonly achieved via
Shift JIS
. Microsoft's code page for JIS X 0201-based ISO 2022 with single-byte katakana via Shift Out and Shift In is Code page 50222.
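The relationship between the JIS8 and JIS7 treatments of the kana set can be shown with a short conversion routine. This Python sketch assumes single-byte JIS X 0201 data only, with kana occupying 0xA1–0xDF in the 8-bit form; it does not handle JIS X 0208 escape sequences, and the function name is hypothetical.

```python
SO, SI = 0x0E, 0x0F

def jis8_to_jis7(data: bytes) -> bytes:
    """Rewrite GR-invoked kana bytes as 7-bit bytes bracketed by SO and SI."""
    out = bytearray()
    in_kana = False
    for b in data:
        is_kana = 0xA1 <= b <= 0xDF
        if b >= 0x80 and not is_kana:
            raise ValueError(f"byte 0x{b:02X} is outside the kana range")
        if is_kana != in_kana:
            out.append(SO if is_kana else SI)  # locking shift on each change
            in_kana = is_kana
        out.append(b & 0x7F)  # clear the eighth bit
    if in_kana:
        out.append(SI)  # return to the Roman set at the end
    return bytes(out)
```

The same G1 set is addressed in both cases; only the invocation differs, with the 8-bit form using the eighth bit and the 7-bit form using Shift Out and Shift In.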
ISO-2022-JP-2
is a multilingual extension of ISO-2022-JP, defined in
RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones. The
ISO/IEC 8859
parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:
* ESC $ A to switch to GB 2312-1980 (2 bytes per character)
* ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
* ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
* ESC . A to switch to ISO/IEC 8859-1 high part, Extended Latin 1 set (1 byte per character; designated to G2)
* ESC . F to switch to ISO/IEC 8859-7 high part, Basic Greek set (1 byte per character; designated to G2)
ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by
RFC 2237, dated 1997.
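The ISO-2022-JP-2 additions listed above can be recognised with a small table-driven scan. The following is a minimal sketch, not a decoder: the names `JP2_EXTRA_DESIGNATIONS` and `find_designations` are illustrative, and the working-set column assumes (per the text) that the three double-byte sets go to G0 while the ISO/IEC 8859 high parts go to G2.

```python
# Designation escape sequences that ISO-2022-JP-2 adds to ISO-2022-JP.
# Each entry maps the raw bytes to (set name, working set, bytes per character).
JP2_EXTRA_DESIGNATIONS = {
    b"\x1b$A": ("GB 2312-1980", "G0", 2),
    b"\x1b$(C": ("KS X 1001-1992", "G0", 2),
    b"\x1b$(D": ("JIS X 0212-1990", "G0", 2),
    b"\x1b.A": ("ISO/IEC 8859-1 high part", "G2", 1),
    b"\x1b.F": ("ISO/IEC 8859-7 high part", "G2", 1),
}


def find_designations(data: bytes):
    """Return (offset, set name, working set) for each known designation found."""
    found = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:  # ESC
            for seq, (name, working_set, _width) in JP2_EXTRA_DESIGNATIONS.items():
                if data.startswith(seq, i):
                    found.append((i, name, working_set))
                    i += len(seq)
                    break
            else:
                i += 1  # some other escape sequence; skip the ESC byte
        else:
            i += 1
    return found


print(find_designations(b"abc\x1b$A12\x1b.A\x1bNi"))
```

The scan only reports where the extra designations occur; interpreting the bytes that follow would additionally require tracking the designated sets and the SS2 single shift.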
IBM Japanese TCP
IBM implements nine 7-bit ISO 2022 based encodings for Japanese, each using a different set of escape sequences: IBM-956, IBM-957, IBM-958, IBM-959, IBM-5052, IBM-5053, IBM-5054, IBM-5055 and ISO-2022-JP, which are collectively termed "TCP/IP Japanese coded character sets". CCSID 9148 is the standard (RFC 1468) ISO-2022-JP.
JIS X 0213
The JIS X 0213 standard, first published in 2000, defines an updated version of ISO-2022-JP, without the ISO-2022-JP-2 extensions, named ISO-2022-JP-3. The additions made by JIS X 0213 compared to the base JIS X 0208 standard resulted in a new registration being made for the extended JIS plane 1, while the new plane 2 received its own registration. The further additions to plane 1 in the 2004 edition of the standard resulted in an additional registration being added to a further revision of the profile, dubbed ISO-2022-JP-2004. In addition to the basic ISO-2022-JP designation codes, the following designations are recognized:
* ESC ( I to switch to JIS X 0201-1976 Kana set (1 byte per character)
* ESC $ ( O to switch to JIS X 0213-2000 Plane 1 (2 bytes per character)
* ESC $ ( P to switch to JIS X 0213-2000 Plane 2 (2 bytes per character)
* ESC $ ( Q to switch to JIS X 0213-2004 Plane 1 (2 bytes per character, ISO-2022-JP-2004 only)
Other 7-bit versions
ISO-2022-KR is defined in RFC 1557, dated 1993. It encodes ASCII and the Korean double-byte KS X 1001-1992, previously named KS C 5601-1987. Unlike ISO-2022-JP-2, it makes use of the Shift Out and Shift In characters to switch between them, after including ESC $ ) C once at the start of a line to designate KS X 1001 to G1.
ISO-2022-CN and ISO-2022-CN-EXT are defined in RFC 1922, dated 1996. They are 7-bit encodings making use both of the Shift Out and Shift In functions (to shift between G0 and G1), and of the 7-bit escape code forms of the single-shift functions SS2 and SS3 (to access G2 and G3). They support the character sets GB 2312 (for simplified Chinese) and CNS 11643 (for traditional Chinese).
The basic ISO-2022-CN profile uses ASCII as its G0 (shift in) set, and also includes GB 2312 and the first two planes of CNS 11643 (these two planes being sufficient to represent all traditional Chinese characters from common Big5, to which the RFC provides a correspondence in an appendix):
* ESC $ ) A to switch to GB 2312-1980 (2 bytes per character; designated to G1)
* ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character; designated to G1)
* ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character; designated to G2)
The ISO-2022-CN-EXT profile permits the following additional sets and planes.
* ESC $ ) E to switch to ISO-IR-165 (2 bytes per character; designated to G1)
* ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character; designated to G3)
* ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character; designated to G3)
* ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character; designated to G3)
* ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character; designated to G3)
* ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character; designated to G3)
The ISO-2022-CN-EXT profile further lists additional Guobiao standard graphical sets as being permitted, but conditional on their being assigned registered ISO 2022 escape sequences:
* GB 12345 in G1
* GB 7589 or GB 13131 in G2
* GB 7590 or GB 13132 in G3
The character after the ESC (for single-byte character sets) or ESC $ (for multi-byte character sets) specifies the type of character set and the working set that it is designated to. In the above examples, the character ( (0x28) designates a 94-character set to the G0 character set, whereas ), * or + (0x29–0x2B) designates to the G1–G3 character sets.
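The rule above can be sketched as a small parser. This is an illustrative helper (the names `WORKING_SETS` and `parse_designation` are not from any library) and it deliberately covers only the 94-set forms with an explicit intermediate byte 0x28–0x2B; the legacy forms without an intermediate (such as ESC $ A) and the 96-set forms (such as ESC . A) return `None`.

```python
# Intermediate byte after ESC (or ESC $) selects the working set for a 94-set.
WORKING_SETS = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}


def parse_designation(seq: bytes):
    """Return (working set, is_multibyte, final byte) or None if not recognised."""
    if len(seq) < 3 or seq[0] != 0x1B:
        return None
    body = seq[1:]
    multibyte = body[0] == 0x24  # '$' marks a multi-byte set
    if multibyte:
        body = body[1:]
    if len(body) == 2 and body[0] in WORKING_SETS:
        return WORKING_SETS[body[0]], multibyte, chr(body[1])
    return None


for seq in (b"\x1b(B", b"\x1b$)A", b"\x1b$*H", b"\x1b$+I", b"\x1b$A"):
    print(seq, parse_designation(seq))
```

The final byte is returned unchanged; mapping it to a concrete character set would require a lookup in the registry of escape sequences, which this sketch does not attempt.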
ISO-2022-KR and ISO-2022-CN are used less frequently than ISO-2022-JP, and are sometimes deliberately not supported due to security concerns. Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder, which maps all input to the replacement character (�), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server. Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.
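The effect of routing these labels to a "replacement" decoder can be sketched as follows. This is an illustration of the idea rather than an implementation of the Encoding Standard: the function name `decode_for_web` and the label set are assumptions for this example, and the sketch assumes that any non-empty input collapses to a single U+FFFD.

```python
# Labels that, per the text above, are not decoded but replaced wholesale.
REPLACEMENT_LABELS = {"iso-2022-kr", "iso-2022-cn", "iso-2022-cn-ext", "hz-gb-2312"}


def decode_for_web(data: bytes, label: str) -> str:
    """Decode data, refusing to interpret the risky stateful 7-bit encodings."""
    if label.strip().lower() in REPLACEMENT_LABELS:
        # Never interpret the bytes: emit U+FFFD for any non-empty input.
        return "\ufffd" if data else ""
    return data.decode(label)


print(repr(decode_for_web(b"\x1b$)C<script>", "ISO-2022-KR")))
print(repr(decode_for_web(b"abc", "utf-8")))
```

Since the input is never interpreted, ASCII-looking bytes hidden behind an escape sequence cannot reach the page as markup.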
ISO/IEC 4873
A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43. ISO/IEC 8859 defines 8-bit codes for ISO/IEC 4873 (or ECMA-43) level 1.
ISO/IEC 4873 / ECMA-43 defines three levels of encoding:
* Level 1, which includes a C0 set, the ASCII G0 set, an optional C1 set and an optional single-byte (94-character or 96-character) G1 set. G0 is invoked over GL, and G1 is invoked over GR. Use of shift functions is not permitted.
* Level 2, which includes a (94-character or 96-character) single-byte G2 and/or G3 set in addition to a mandatory G1 set. Only the single-shift functions SS2 and SS3 are permitted (i.e. locking shifts are forbidden), and they invoke over the GL region (including 0x20 and 0x7F in the case of a 96-set). SS2 and SS3 must be available in C1 at 0x8E and 0x8F respectively. This minimal required C1 set for ISO 4873 is registered as ISO-IR-105.
* Level 3, which permits the GR locking-shift functions LS1R, LS2R and LS3R in addition to the single shifts, but otherwise has the same restrictions as level 2.
Earlier editions of the standard permitted non-ASCII assignments in the G0 set, provided that the ISO/IEC 646 invariant positions were preserved, that the other positions were assigned to spacing (not combining) characters, that 0x23 was assigned to either £ or #, and that 0x24 was assigned to either $ or ¤. For instance, the 8-bit encoding of JIS X 0201 is compliant with earlier editions. This was subsequently changed to fully specify the ISO/IEC 646:1991 IRV / ISO-IR No. 6 set (ASCII).
The use of the ISO/IEC 646 IRV (synchronised with ASCII since 1991) at ISO/IEC 4873 Level 1 with no C1 or G1 set, i.e. using the IRV in an 8-bit environment in which shift codes are not used and the high bit is always zero, is known as ISO 4873 DV, in which DV stands for "Default Version".
In cases where duplicate characters are available in different sets, the current edition of ISO/IEC 4873 / ECMA-43 only permits using these characters in the lowest numbered working set which they appear in.
For instance, if a character appears in both the G1 set and the G3 set, it must be used from the G1 set. However, use from other sets is noted as having been permitted in earlier editions.
ISO/IEC 8859 defines complete encodings at level 1 of ISO/IEC 4873, and does not allow for use of multiple ISO/IEC 8859 parts together. It stipulates that ISO/IEC 10367 should be used instead for levels 2 and 3 of ISO/IEC 4873. ISO/IEC 10367:1991 includes G0 and G1 sets matching those used by the first 9 parts of ISO/IEC 8859 (i.e. those which existed as of 1991, when it was published), and some supplementary sets.
Character set designation escape sequences are used for identifying or switching between versions during information interchange only if required by a further protocol, in which case the standard requires an ISO/IEC 2022 announcer sequence specifying the ISO/IEC 4873 level, followed by a complete set of escapes specifying the character set designations for C0, C1, G0, G1, G2 and G3 respectively (but omitting G2 and G3 designations for level 1), with an F-byte of 0x7E denoting an empty set. Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence.
Extended Unix Code
Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese. It is based on ISO 2022, and only character sets which conform to the ISO 2022 structure can have EUC forms. Up to four coded character sets can be represented (in G0, G1, G2 and G3). The G0 set is invoked over GL, the G1 set is invoked over GR, and the G2 and G3 sets are (if present) invoked using the single shifts SS2 and SS3, which are used as CR bytes (i.e. 0x8E and 0x8F respectively) and invoke over GR (not GL). Locking shift codes are not used.
The code assigned to the G0 set is ASCII, or the country's national ISO 646 character set such as KS-Roman (KS X 1003) or JIS-Roman (the lower half of JIS X 0201). Hence, 0x5C (backslash in US-ASCII) is used to represent a Yen sign in some versions of EUC-JP and a Won sign in some versions of EUC-KR.
G1 is used for a 94×94 coded character set represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes (i.e. SS3 plus two bytes) whereas a single character in EUC-TW can take up to four bytes (i.e. SS2 plus three bytes).
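The byte structure described above can be illustrated by splitting an EUC-JP stream into characters and labelling the working set each one comes from. This is a minimal sketch specific to EUC-JP (where G2 characters are one byte after SS2 and G3 characters are two bytes after SS3); the function name `segment_euc_jp` is illustrative, and truncated input is not validated.

```python
SS2, SS3 = 0x8E, 0x8F  # single shifts, used as bytes in the CR region


def segment_euc_jp(data: bytes):
    """Split EUC-JP bytes into (working set, character bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # G0 over GL: one byte
            out.append(("G0", data[i:i + 1]))
            i += 1
        elif b == SS2:                    # G2: SS2 plus one GR byte
            out.append(("G2", data[i + 1:i + 2]))
            i += 2
        elif b == SS3:                    # G3: SS3 plus two GR bytes
            out.append(("G3", data[i + 1:i + 3]))
            i += 3
        elif 0xA1 <= b <= 0xFE:           # G1 over GR: two bytes
            out.append(("G1", data[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
    return out


# "A", hiragana A (U+3042) and half-width katakana A (U+FF71).
sample = "A\u3042\uff71".encode("euc_jp")
print(segment_euc_jp(sample))
```

Because no locking shifts or designation escapes are involved, each character can be classified from its own lead byte, without any state carried over from earlier in the stream.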
The EUC code itself does not make use of the announcer or designation sequences from ISO 2022; however, it corresponds to a sequence of four ISO 2022 announcer sequences.
Comparison with other encodings
Advantages
* As ISO/IEC 2022's entire range of graphical character encodings can be invoked over GL, the available glyphs are not significantly limited by an inability to represent GR and C1, such as in a system limited to 7-bit encodings. It accordingly enables the representation of a large set of characters in such a system. Generally, this 7-bit compatibility is not really an advantage, except for backwards compatibility with older systems. The vast majority of modern computers use 8 bits for each byte.
* As compared to Unicode, ISO/IEC 2022 sidesteps Han unification by using sequence codes to switch between discrete encodings for different East Asian languages. This avoids the issues associated with unification, such as difficulty supporting multiple CJK languages with their associated character variants in a single document and font.
Disadvantages
* Since ISO/IEC 2022 is a stateful encoding, a program cannot jump into the middle of a block of text to search, insert or delete characters. This makes manipulation of the text very cumbersome and slow when compared to non-stateful encodings. Any jump into the middle of the text may require backing up to the previous escape sequence before the bytes following the escape sequence can be interpreted.
* Due to the stateful nature of ISO/IEC 2022, an identical and equivalent character may be encoded in different character sets, which may be designated to any of G0 through G3, which may be invoked using single shifts or by using locking shifts to GL or GR. Consequently, characters can be represented in multiple ways, meaning that two visually identical and equivalent strings cannot be reliably compared for equality.
* Some systems, like DICOM and several e-mail clients, use a variant of ISO-2022 (e.g. "ISO 2022 IR 100") in addition to supporting several other encodings. This type of variation makes it difficult to portably transfer text between computer systems.
* UTF-1, the multi-byte Unicode transformation format compatible with ISO/IEC 2022's representation of 8-bit control characters, has various disadvantages in comparison with UTF-8, and switching from or to other charsets, as supported by ISO/IEC 2022, is typically unnecessary in Unicode documents.
* Because of its escape sequences, it is possible to construct attack byte sequences in which a malicious string (such as cross-site scripting) is masked until it is decoded to Unicode, which may allow it to bypass sanitisation. Use of this encoding is thus treated as suspicious by malware protection suites, and 7-bit ISO 2022 data (except for ISO-2022-JP) is mapped in its entirety to the replacement character in HTML5 to prevent attacks. Restricted ISO 2022 8-bit code versions which do not use designation escapes or locking shift codes, such as Extended Unix Code, do not share this problem.
* Concatenation can pose issues. Profiles such as ISO-2022-JP specify that the stream starts in the ASCII state and must end in the ASCII state. This is necessary to ensure that characters in concatenated ISO-2022-JP and/or ASCII streams will be interpreted in the correct set. This has the consequence that if a stream that ends in a multi-byte character is concatenated with one that starts with a multi-byte character, a pair of escape codes are generated switching to ASCII and immediately away from it. However, as stipulated in Unicode Technical Report #36 ("Unicode Security Considerations"), pairs of ISO 2022 escape sequences with no characters between them should generate a replacement character ("�") to prevent them from being used to mask malicious sequences such as cross-site scripting. Implementing this measure, e.g. in Mozilla Thunderbird, has led to interoperability issues, with unexpected "�" characters being generated where two ISO-2022-JP streams have been concatenated.
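Several of the disadvantages above (state dependence, multiple byte representations of the same text, and back-to-back escape sequences after concatenation) can be observed with Python's built-in `iso2022_jp` codec. This is a small demonstration under that assumption; the variable names are illustrative, and Python's decoder is not claimed to follow the Unicode Technical Report #36 recommendation.

```python
CODEC = "iso2022_jp"

# Two separately encoded streams, each starting and ending in the ASCII state.
first = "日".encode(CODEC)
second = "本".encode(CODEC)
joined = first + second             # naive byte-level concatenation
canonical = "日本".encode(CODEC)    # the same text encoded in one go

# The double-byte payload sits between a 3-byte designation and a 3-byte return to ASCII.
payload = first[3:-3]

print("joined:   ", joined)
print("canonical:", canonical)
print("same bytes:", joined == canonical)
print("same text: ", joined.decode(CODEC) == canonical.decode(CODEC))
print("adjacent escapes present:", b"\x1b(B\x1b$B" in joined)
print("payload read as ASCII:", payload.decode("ascii"))
```

Comparing the byte strings directly reports a difference even though both decode to the same text, which is why equality checks on ISO 2022 data are usually done after decoding.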
See also
* ISO 2709
* ISO/IEC 646
* ISO-IR-102
* C0 and C1 control codes
* CJK
* MARC standards
* Mojibake
* luit
* ISO/IEC JTC 1/SC 2
External links
* ISO/IEC 2022:1994 equivalent to ISO/IEC 2022 and freely downloadable.
* International Register of Coded Character Sets to be Used with Escape Sequences, a full list of assigned character sets and their escape sequences.
* Ken Lunde's CJK.INF (http://users.monash.edu/~jwb/cjk.inf), a document on encoding Chinese, Japanese, and Korean (CJK) languages, including a discussion of the various variants of ISO/IEC 2022.