Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The most commonly used EUC codes are variable-length encodings in which a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII) takes one byte, and a character belonging to a 94×94 coded character set (such as GB 2312) is represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code, whereas a single character in EUC-TW can take up to four bytes.
Modern applications are more likely to use UTF-8, which supports all of the glyphs of the EUC codes, and more, and is generally more portable with fewer vendor deviations and errors. EUC is, however, still very popular, especially EUC-KR for South Korea.
Encoding structure
The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets which can be represented with a sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8836 (94²) characters, or 830584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete characters and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.
EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as ISO-2022-JP. As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared).
If US-ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a Yen sign in EUC-JP (see below) and a won sign in EUC-KR.
The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.
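The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `kuten_to_euc` and `is_gr_byte` are made up for this example, and the sample character (row 16, cell 1 of JIS X 0208) is decoded with Python's built-in `euc_jp` codec only as a sanity check.

```python
def kuten_to_euc(row: int, cell: int) -> bytes:
    """Return the two-byte EUC (GR) form of a 94x94 code point given as kuten."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # kuten + 0x20 is the 7-bit (GL) byte; setting the high bit adds 0x80.
    # Combined, that is kuten + 0xA0 (160 decimal).
    return bytes([row + 0xA0, cell + 0xA0])


def is_gr_byte(byte: int) -> bool:
    """True if the most significant bit is set, i.e. the byte is not a GL byte."""
    return byte & 0x80 != 0


encoded = kuten_to_euc(16, 1)        # JIS X 0208 row 16, cell 1
print(encoded.hex())                 # b0a1
print(encoded.decode("euc_jp"))      # 亜
print([is_gr_byte(b) for b in b"A" + encoded])  # [False, True, True]
```

Because every byte of a code set 1 character has its high bit set, a scanner can tell single-byte G0 characters from extended characters by looking at one byte at a time.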
The EUC code itself does not make use of the announcement and designation sequences from ISO 2022.
However, the code specification is equivalent to a sequence of four ISO 2022 announcement sequences.
Fixed-length format
The ISO-2022-based variable-length encoding described above is sometimes referred to as the "EUC packed format", which is the encoding format usually labelled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format. This represents:
* Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
* Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
* Code set 2 as a byte in the range 0x21–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
* Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.
Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.
These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.
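As a rough sketch of the mapping listed above, the following Python converts packed-format EUC-JP bytes into complete two-byte units. The function name `packed_to_two_byte` is hypothetical, the input is assumed to be well formed, and the sketch assumes the EUC-JP layout in which code set 2 carries one byte after the shift and code set 3 carries two.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift bytes used by the packed format


def packed_to_two_byte(data: bytes) -> list[bytes]:
    """Convert packed EUC-JP bytes to a list of complete two-byte units."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == SS2:
            # Code set 2: single byte in 0xA0-0xFF, padded with a leading 0x00.
            units.append(bytes([0x00, data[i + 1]]))
            i += 2
        elif lead == SS3:
            # Code set 3: first byte keeps its high bit, second byte loses it.
            units.append(bytes([data[i + 1], data[i + 2] & 0x7F]))
            i += 3
        elif lead >= 0xA0:
            # Code set 1: both bytes keep their high bits.
            units.append(bytes([lead, data[i + 1]]))
            i += 2
        else:
            # Code set 0: single byte, padded with a leading 0x00.
            units.append(bytes([0x00, lead]))
            i += 1
    return units


packed = b"A\xb0\xa1\x8e\xb1\x8f\xb0\xa1"
print([u.hex() for u in packed_to_two_byte(packed)])
# ['0041', 'b0a1', '00b1', 'b021']
```

Every unit has the same width, so the code set can be read off the high bits of the two bytes without tracking shift state, which is what makes the format convenient for internal processing.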
EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".
Only the packed format is included in the WHATWG Encoding Standard used by HTML5.
EUC-CN
EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit code version, although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET.
An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
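A short Python sketch of this layout follows. It relies on Python's built-in `gb2312` codec, which produces EUC-CN bytes; the checker `is_euc_cn_structured` is a hypothetical helper written for this example and validates only the byte structure, not whether each code point is actually assigned in GB 2312.

```python
def is_euc_cn_structured(data: bytes) -> bool:
    """Check that data is a run of ASCII bytes and 0xA1-0xFE byte pairs."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1                      # one-byte ASCII character
        elif (i + 1 < len(data)
              and 0xA1 <= data[i] <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2                      # two-byte GB 2312 character
        else:
            return False
    return True


euc_cn = "中文 OK".encode("gb2312")
print(euc_cn.hex(" "))                  # d6 d0 ce c4 20 4f 4b
print(is_euc_cn_structured(euc_cn))     # True

# Clearing the high bit of each double-byte value gives the 7-bit GB 2312 code.
print(bytes(b & 0x7F for b in euc_cn[:4]))  # b'VPND'
```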
Related Mainland Chinese encoding systems
748 code
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
IBM code pages 1380, 1381, 1382 and 1383
IBM code page 1381 (CCSID 1381) comprises the single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380), which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0.
IBM code page 1383 (CCSID 1383) comprises the single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382), which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312. The alternative CCSID 5479 is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes the IBM-selected and user-defined characters.
GBK and GB 18030
GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.
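The difference from the EUC structure can be observed with Python's built-in `gbk` codec. The sketch below is illustrative only: `pair_fits_euc` is a hypothetical helper, and the search simply picks the first character in the basic CJK Unified Ideographs range whose GBK trail byte falls in the ASCII range.

```python
def pair_fits_euc(pair: bytes) -> bool:
    """True if both bytes of a double-byte code lie in the EUC range 0xA1-0xFE."""
    lead, trail = pair
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


# "中" is in GB 2312, so its GBK code is also a valid EUC-CN code.
print(pair_fits_euc("中".encode("gbk")))

# Find a character whose GBK trail byte is an ASCII byte.
ascii_trail = next(
    chr(cp) for cp in range(0x4E00, 0x9FA6)
    if chr(cp).encode("gbk")[1] < 0x80
)
encoded = ascii_trail.encode("gbk")
print(f"U+{ord(ascii_trail):04X}", encoded.hex(), pair_fits_euc(encoded))
```

A byte-oriented scanner written for EUC-CN can therefore misread GBK text, for example by treating such a trail byte as a standalone ASCII character.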
Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.
The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
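The variable length is easy to observe with Python's built-in `gb18030` codec. This is a small demonstration rather than a description of the standard's mapping tables; the sample characters are arbitrary choices for this example.

```python
samples = ["A", "中", "😀"]   # ASCII, a GB 2312 hanzi, a supplementary-plane emoji

for ch in samples:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# GB 18030 covers the same repertoire as UTF-8, so text round-trips between them.
text = "".join(samples)
assert text.encode("gb18030").decode("gb18030") == text.encode("utf-8").decode("utf-8")
```

The three samples encode to one, two and four bytes respectively, and the two-byte code for "中" is the same as its EUC-CN code.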
Mac OS Chinese Simplified
Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp).
It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively.
This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).
This use of 0xA0, 0xFD, 0xFE and 0xFF matches
Apple's Shift_JIS variant.
Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.
These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1, both extensions are included by GB/T 12345 (the Traditional Chinese variant of GB 2312), and both extensions are included by GB 18030 (the successor to GB 2312).
EUC-JP
EUC-JP is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS. Since August 2018, 0.1% of all web pages have used EUC-JP, while 2.5% of websites in Japanese use this encoding (less than Shift JIS or UTF-8). It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding (51932 and 20932).
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213 (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).
Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.
Characters are encoded as follows:
* As an EUC/ISO 2022 compliant encoding, the C0 control characters, space and DEL are represented as in ASCII.
* A graphical character from ASCII (code set 0) is represented as its usual one-byte representation, in the range 0x21–0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII, including the W3C/WHATWG Encoding Standard used by HTML5, and so does EUC-JIS-2004. While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS (the ASCII backslash), U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.
* A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.
* A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual representation in the range 0xA1–0xDF. This set may contain IBM vendor extensions in some variants.
* A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF. In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here, which does not collide with the allocated rows in standard JIS X 0212. Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.
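The rules in the list above can be sketched as a small byte scanner. The generator `split_euc_jp` is a hypothetical helper for this example: it checks only the byte structure (lead byte, length and 0xA1–0xFE trail range) and does not verify that a code point is assigned in the underlying standard. The sample string is encoded with Python's built-in `euc_jp` codec.

```python
def split_euc_jp(data: bytes):
    """Yield (code_set, raw_bytes) for each character in packed EUC-JP data."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, size = 0, 1          # ASCII, C0 controls, space, DEL
        elif lead == 0x8E:
            code_set, size = 2, 2          # SS2 + half-width kana byte
        elif lead == 0x8F:
            code_set, size = 3, 3          # SS3 + two JIS X 0212 bytes
        elif 0xA1 <= lead <= 0xFE:
            code_set, size = 1, 2          # two JIS X 0208 bytes
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + size]
        if len(chunk) != size or any(not 0xA1 <= b <= 0xFE for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        yield code_set, chunk
        i += size


sample = "A亜ｱ丂".encode("euc_jp")
for code_set, chunk in split_euc_jp(sample):
    print(code_set, chunk.hex())
```

The four sample characters fall into code sets 0, 1, 2 and 3 in that order, and no trail byte ever collides with an ASCII value.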
Related Japanese encoding methods
Vendor extensions to EUC-JP (from, for example, the Open Software Foundation, IBM or NEC) were often allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).
However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.
DEC Kanji
Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0 is not required to be left-padded with null bytes (similarly to the packed format).
JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212.
In the basic "DEC Kanji" encoding, only the first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1.
The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets.
It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters.
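Read literally, the description above suggests that a decoder can pick the code set from the lead byte and the high bit of the following byte. The sketch below follows that reading only; it has not been checked against DEC documentation, `split_super_dec_kanji` and its labels are hypothetical names, and well-formed input is assumed.

```python
def split_super_dec_kanji(data: bytes):
    """Yield (label, raw_bytes) per character, assuming well-formed input."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            label, size = "code set 0", 1            # unpadded single byte
        elif lead == 0x8E:
            label, size = "packed code set 2", 2     # SS2 + kana byte
        elif lead == 0x8F:
            label, size = "packed code set 3", 3     # SS3 + two bytes
        elif data[i + 1] >= 0x80:
            label, size = "code set 1", 2            # both high bits set
        else:
            label, size = "DEC user-defined", 2      # only first high bit set
        yield label, data[i:i + size]
        i += size


sample = b"A\xb0\xa1\xb0\x21\x8e\xb1\x8f\xb0\xa1"
for label, chunk in split_super_dec_kanji(sample):
    print(f"{label}: {chunk.hex()}")
```

Dropping the two single-shift branches gives the corresponding sketch for the basic "DEC Kanji" encoding, which has no packed code sets 2 and 3.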
HP-16
Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which is a variant of Shift JIS. HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure:
* Lead bytes 0xA1–C2, trail bytes 0x21–7E
* Lead bytes 0xC3–E3, trail bytes 0x21–3F
* Lead bytes 0xC3–E1, trail bytes 0x40–64
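The three regions listed above can be expressed as a lookup. This is a sketch built directly from the byte ranges in the list; the function name `hp16_user_region` and the region numbering (1 to 3, in list order) are assumptions made for this example.

```python
# (lead_low, lead_high, trail_low, trail_high) for each user-defined region
HP16_USER_REGIONS = [
    (0xA1, 0xC2, 0x21, 0x7E),
    (0xC3, 0xE3, 0x21, 0x3F),
    (0xC3, 0xE1, 0x40, 0x64),
]


def hp16_user_region(lead: int, trail: int):
    """Return the 1-based region number for a byte pair, or None if outside all."""
    for number, (l_lo, l_hi, t_lo, t_hi) in enumerate(HP16_USER_REGIONS, start=1):
        if l_lo <= lead <= l_hi and t_lo <= trail <= t_hi:
            return number
    return None


print(hp16_user_region(0xA1, 0x21))   # 1
print(hp16_user_region(0xC3, 0x3F))   # 2
print(hp16_user_region(0xC3, 0x40))   # 3
print(hp16_user_region(0xB0, 0xA1))   # None (a regular JIS X 0208 code)
```

All three regions use trail bytes with the high bit cleared, which is why they fall outside the packed-format EUC structure.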
IKIS
The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with the box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters.
Adaptations of EUC-JP for EBCDIC
KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi, with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, one sequence switches to single-byte mode and another switches to double-byte mode. However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space: 0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters, and the remainder are used for corporate-defined characters, including both kanji and non-kanji.
JEF (Japanese-processing Extended Feature) is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching between single-byte mode and a double-byte DBCS-Host mode using shifting sequences. Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP. The lead byte range is extended back to 0x41, with 0x80–A0 designated for user definition; lead bytes 0x41–7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E) is unused. Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji.
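The row numbering for the extended JEF lead bytes can be worked through with a little arithmetic. The sketch below assumes the assignment is linear (lead byte 0x41 is row 101, each subsequent lead byte is the next row), which is consistent with the stated endpoints and with row 162 corresponding to lead byte 0x7E, but is not confirmed by the text beyond those points. The function names are hypothetical.

```python
def jef_extended_row(lead: int) -> int:
    """Map a JEF extended lead byte (0x41-0x7F) to its kuten row number.

    Assumes a linear assignment: 0x41 -> row 101, ..., 0x7F -> row 163.
    """
    if not 0x41 <= lead <= 0x7F:
        raise ValueError(f"not an extended JEF lead byte: {lead:#04x}")
    return lead - 0x41 + 101


def jef_row_kind(row: int) -> str:
    """Classify an extended row according to the ranges given above."""
    if row == 162:
        return "unused"
    if 101 <= row <= 148:
        return "extended kanji"
    if 149 <= row <= 163:
        return "extended non-kanji"
    raise ValueError(f"not an extended JEF row: {row}")
```

Under this assumption, the extended kanji rows 101 through 148 correspond to lead bytes 0x41 through 0x70, and the extended non-kanji rows begin at lead byte 0x71.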
EUC-KR
EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either the Korean national variant of ISO 646 or US-ASCII, depending on variant. A Korean national standard stipulates the encoding, which came to be known as EUC-KR.
A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from the ISO 646 variant or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
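The byte structure described above can be checked with a few lines of code. This is a minimal sketch, not a full decoder: `classify_euc_kr` is a hypothetical helper that looks only at byte ranges and never consults the KS X 1001 tables. It treats every byte below 0x80 (including controls and space) as code set 0. The example input is produced with Python's built-in `euc_kr` codec.

```python
def classify_euc_kr(data: bytes):
    """Split EUC-KR bytes into (code_set, unit) pairs by structure alone.

    Code set 0: one byte below 0x80 (ASCII or the ISO 646 variant).
    Code set 1: two bytes, both in GR (0xA1-0xFE).
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            units.append((0, data[i:i + 1]))
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            units.append((1, data[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"byte {b:#04x} at offset {i} breaks EUC structure")
    return units


# "A" comes from code set 0; the hangul syllable comes from KS X 1001.
sample = "A한".encode("euc_kr")
for code_set, unit in classify_euc_kr(sample):
    print(code_set, unit.hex())
```

Because single-byte and double-byte codes occupy disjoint byte ranges, the classification needs no state beyond the current position, unlike the stateful KEIS and JEF encodings described earlier.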
EUC-KR is usually referred to as Wansung (Korean: 완성; Wanseong; lit. "precomposed") in the Republic of Korea. IBM refers to the double-byte component as Code page 971, and to EUC-KR with ASCII as Code page 970. It is implemented as Code page 20949 ("Korean Wansung") and Code page 51949 ("EUC Korean") by Microsoft.
Less than 0.07% of all web pages globally use EUC-KR, but 4.5% of South Korean web pages use EUC-KR. Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS.
As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.
Related Korean encoding systems
Unified Hangul Code
A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu, or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261 or 1363 by IBM. IBM's code page 949 is a different, unrelated, EUC-KR extension.
Unified Hangul Code extends EUC-KR by using codes which do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.
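The difference between plain EUC-KR codes and the Unified Hangul Code extensions can be observed with Python's built-in `cp949` codec. The sketch below is illustrative only: `follows_euc_structure` is a hypothetical helper, and the syllable 똠 is used as an example on the assumption that it is one of the syllable blocks missing from KS X 1001 and therefore encoded in the extension area.

```python
def follows_euc_structure(pair: bytes) -> bool:
    """True if a two-byte code has both bytes in GR (0xA1-0xFE)."""
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)


# A syllable present in KS X 1001: same two GR bytes as in EUC-KR.
in_ksx1001 = "한".encode("cp949")

# A syllable assumed to be absent from KS X 1001: Unified Hangul Code
# places it at a code point outside the EUC structure.
extension = "똠".encode("cp949")

print(in_ksx1001.hex(), follows_euc_structure(in_ksx1001))
print(extension.hex(), follows_euc_structure(extension))
```

A decoder that accepts only the strict EUC structure would reject the second code, which is one reason the Encoding Standard's definition of EUC-KR includes the extensions.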
Mac OS Korean (HangulTalk)
Other encodings incorporating EUC-KR as a subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean), which was used by HangulTalk (MacOS-KH), the Korean localisation of the classic Mac OS. It was developed by Elex Computer (일렉스), who were at the time the authorised distributor of Apple Macintosh computers in South Korea.
HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within the EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylised dingbats.
Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences, to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters.
Apple also uses certain single-byte codes outside of the EUC-KR plane for additional characters: 0x80 for a required space, 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (…). Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above), some are within the lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84).
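The collision described above can be reproduced with a simple range check. In this sketch the EUC-KR lead byte range (0xA1–0xFE) follows from the GR structure, while the Unified Hangul Code lead byte range of 0x81–0xFE is an assumption made for illustration; all names are hypothetical.

```python
# Single-byte additions used by HangulTalk, as listed above.
HANGULTALK_SINGLE_BYTES = {
    0x80: "required space",
    0x81: "won sign",
    0x82: "en dash",
    0x83: "copyright sign",
    0x84: "wide underscore",
    0xFF: "ellipsis",
}

EUC_KR_LEAD = range(0xA1, 0xFF)  # 0xA1-0xFE inclusive
UHC_LEAD = range(0x81, 0xFF)     # 0x81-0xFE inclusive (assumed)


def colliding_bytes(lead_range: range):
    """Return the HangulTalk single bytes that fall in a lead byte range."""
    return sorted(b for b in HANGULTALK_SINGLE_BYTES if b in lead_range)


print([hex(b) for b in colliding_bytes(EUC_KR_LEAD)])
print([hex(b) for b in colliding_bytes(UHC_LEAD)])
```

The practical consequence is that a byte such as 0x81 cannot be interpreted without knowing which of the two extensions the text uses.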
EUC-KP
Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP. More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code.
EUC-TH
Although certain single-byte encodings such as the ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labelled as EUC. However, the label EUC-TH is used on Solaris for TIS-620.
EUC-TW
EUC-TW is a variable-length encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Variants of Big5 are much more common than EUC-TW, although Big5 only encodes the first two planes of CNS 11643 hanzi, while UTF-8 is becoming more common.
* As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space and DEL are encoded as in ASCII.
* A graphical character from US-ASCII (G0, code set 0) is encoded in GL as its usual single byte representation (0x21–0x7E).
* A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).
* A character in plane 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:
** The first byte is always 0x8E (Single Shift 2).
** The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.
** The third and fourth bytes are in GR (0xA1–0xFE).
Note that plane 1 of CNS 11643 is encoded twice: once as code set 1 and once as part of code set 2.
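The rules above translate directly into code. The following is a minimal sketch of encoding a CNS 11643 position, given as plane, row and cell (each row and cell numbered 1 to 94), into EUC-TW bytes and back. The function names and the `prefer_short` parameter are hypothetical, and no character tables are involved, only the byte arithmetic.

```python
SS2 = 0x8E  # Single Shift 2, introduces a four-byte code set 2 sequence


def euc_tw_encode(plane: int, row: int, cell: int,
                  prefer_short: bool = True) -> bytes:
    """Encode a CNS 11643 (plane, row, cell) position as EUC-TW bytes."""
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("plane must be 1-16; row and cell must be 1-94")
    if plane == 1 and prefer_short:
        # Code set 1: two bytes in GR.
        return bytes([0xA0 + row, 0xA0 + cell])
    # Code set 2: SS2, plane byte (0xA1-0xB0), then two bytes in GR.
    return bytes([SS2, 0xA0 + plane, 0xA0 + row, 0xA0 + cell])


def euc_tw_decode_unit(unit: bytes):
    """Decode one two- or four-byte EUC-TW unit to (plane, row, cell)."""
    if len(unit) == 4 and unit[0] == SS2:
        plane, row, cell = (b - 0xA0 for b in unit[1:])
    elif len(unit) == 2:
        plane = 1
        row, cell = (b - 0xA0 for b in unit)
    else:
        raise ValueError("expected a 2-byte or SS2-prefixed 4-byte unit")
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("bytes outside the EUC-TW structure")
    return plane, row, cell
```

For example, `euc_tw_encode(1, 1, 1)` and `euc_tw_encode(1, 1, 1, prefer_short=False)` produce different byte sequences that both decode to plane 1, row 1, cell 1, which illustrates the duplicate encoding of plane 1.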
See also
* CJK
* Japanese language and computers
* Korean language and computers
* Chinese character encoding
External links
* EUC-JP codeset table (minus the ASCII and halfwidth parts)
* Code Page Identifiers, mentions the 748 code
* Description of the EUC-TW code (in Chinese)
* Manual page of EUC-JISX0213 in the Perl Encode module
* International Register of Coded Character Sets to be Used With Escape Sequence, section 2.4 (p.14f.) with the coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
* Chinese, Japanese, and Korean character set standards and encoding systems