Code page 949 (IBM)

IBM code page 949 (IBM-949) is a character encoding which has been used by IBM to represent Korean language text on computers. It is a variable-width encoding which represents the characters from the Wansung code defined by the South Korean standard KS X 1001 in a format compatible with EUC-KR, but adds IBM extensions for additional hanja, additional precomposed Hangul syllables, and user-defined characters.

Giving values in hexadecimal, bytes 0x00 through 0x7F are used for single-byte KS X 1003 (ISO 646:KR) characters, a set similar to ASCII but with a won sign rather than a backslash. Bytes 0x80 through 0x84 are used for IBM single-byte extension characters. Lead bytes 0x8F through 0xA0 are used for IBM double-byte extension characters. Lead bytes 0xA1 through 0xFE are used for Wansung code (KS X 1001 characters in EUC-KR form, double byte), but with some unused space opened up for user-defined use.

Although both are sometimes named "cp949", IBM-949 is different from Windows code page 949 (IBM-1363), which is Microsoft's Unified Hangul Code, a different extension of EUC-KR. It should also not be confused with IBM's implementation of plain EUC-KR (IBM-970). Code page 949 in OS/2 is the IBM code page; however, a third-party patch exists to change this.
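As a rough illustration of the byte layout described above, the following Python sketch classifies the first byte of a character and splits a byte string into one-byte and two-byte units. The function names are hypothetical. Bytes 0x85–0x8E and 0xFF are not covered by the description above, so treating them as standalone bytes is an assumption, and trail bytes are not validated.

```python
def classify_first_byte(b: int) -> str:
    """Classify the first byte of an IBM-949 character by the ranges in the text."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single-byte KS X 1003"
    if b <= 0x84:
        return "single-byte IBM extension"
    if 0x8F <= b <= 0xA0:
        return "lead byte: IBM double-byte extension"
    if 0xA1 <= b <= 0xFE:
        return "lead byte: Wansung (KS X 1001)"
    # Assumption: 0x85-0x8E and 0xFF are not described, so report them as such.
    return "not described"


def split_units(data: bytes) -> list[bytes]:
    """Split bytes into 1- or 2-byte units; trail bytes are NOT validated."""
    units = []
    i = 0
    while i < len(data):
        width = 2 if classify_first_byte(data[i]).startswith("lead byte") else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte character")
        units.append(data[i:i + width])
        i += width
    return units


if __name__ == "__main__":
    for unit in split_units(b"A\xb0\xa1\x9e\xfcB"):
        print(unit.hex(), classify_first_byte(unit[0]))
```

A real converter would additionally check the trail byte and look the resulting code up in a mapping table; this sketch only shows how the lead byte determines the width of each character.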


Terminology and encoding labelling

Both IBM-949 and Unified Hangul Code (UHC, Windows-949) are known as "code page 949" (or "cp949"), although they share only the EUC-KR subset in common. Neither has a standardised IANA-registered label to identify it. Although UHC is included in the WHATWG Encoding Standard, with labels including "windows-949", IBM-949 is not. IBM-949 therefore is not permitted in HTML5.

Although the meaning of the label "ibm-949" (and conversely "windows-949" and "ms949") is unambiguous where these labels are supported, the interpretation of the encoding labels "949" and "cp949" consequently varies between implementations. For example, International Components for Unicode (ICU) uses "cp949", "949", "ibm-949" and "x-IBM949" to refer to IBM-949, and additionally the labels "cp949c", "ibm-949c" and "x-IBM949C" to refer to a variant which uses unmodified ASCII mappings for 0x20–7E (resulting in duplicate mappings for the backslash), while (of the labels incorporating the code page number 949) only "ms949" and "windows-949" are assigned to UHC. This is in contrast to Python, which recognises both "cp949" and "949" (in addition to the more explicit "ms949" and "uhc", but not "windows-949") as labels for UHC, and does not include an IBM-949 codec.

The code page 949 used by Korean language versions of OS/2 is the IBM code page; to add support for the entire Unicode set of Korean syllables, a third-party patch exists to replace it with the Microsoft code page.

IBM-949 is a variable-width encoding defined as the combination of two fixed-width code pages, the single-byte Code page 1088 and the double-byte Code page 951.
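The Python behaviour described above can be checked with the standard `codecs` module. The sketch below uses only `codecs.lookup`; the helper name `resolve` is hypothetical. Labels that Python does not recognise are reported as `None` rather than asserted, since alias tables can differ between Python versions and platforms.

```python
import codecs


def resolve(label: str):
    """Return the canonical name of the codec Python selects for a label, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


if __name__ == "__main__":
    for label in ("cp949", "949", "ms949", "uhc", "windows-949", "ibm-949"):
        print(f"{label!r:>14} -> {resolve(label)}")

    # Python's "cp949" is UHC, so byte 0x5C decodes to a backslash (U+005C),
    # not to the won sign that IBM-949's single-byte set places there.
    print(repr(b"\x5c".decode("cp949")))
```

Code that must exchange data with ICU-based or IBM software should therefore avoid the bare labels "949" and "cp949" and name the intended encoding explicitly.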


History

A version of Code page 951 (a DBCS-PC, i.e. double-byte non-EUC non-EBCDIC, code), the double-byte component for IBM-949, is defined in the September 1992 revision of IBM Corporate Specification C-H 3-3220-125, along with Code page 834 (a DBCS-Host, i.e. double-byte EBCDIC, code), which is the double-byte component of Code page 933. This version of Code page 949/951 considered the entire lead byte range 0x8F–A0 to be a user-defined region, and included only standard Wansung assignments and user-defined areas, thus not including some characters which Code page 933/834 included. Some later versions, such as that implemented by International Components for Unicode (ICU), shrink the user-defined region to include these characters as extensions.

The earlier October 1989 revision of C-H 3-3220-125 had instead defined Code page 926 as its DBCS-PC code. It encoded the same characters as IBM-834, in a layout which differed from both IBM-951 and IBM-834, used a different lead byte range, and was not an EUC-KR extension. IBM-926 was combined with Code page 891 or Code page 1040 (respectively 8-bit N-byte Hangul Code and an extension thereof; compare how Shift JIS extends 8-bit JIS X 0201) to form IBM-934 or IBM-944 respectively.

Code pages 944 and 926 are now deprecated in favour of IBM-949. The 1992 revision designates code page 926 as "restricted" ("limited to the particular environment for which it is registered") and does not give its chart or mappings from the other code pages, and CCSID 944 is categorised as "coexistence and migration" (contrast "interoperable" for CCSID 949). International Components for Unicode includes Unicode mappings for IBM-949 and IBM-933, but its IBM-944 mapping was removed in 2001.


Single byte codes


Double byte codes


Lead bytes 0x8F–99, 0xC9, 0xFE (user defined ranges)

IBM-949 is designed to support a maximum of 1880 UDC (user-defined characters), including ranges within unused rows of the Wansung plane, and ranges outside the Wansung plane. In this version, the lead bytes 0x8F–A0 contain a maximum of 1692 UDC, and lead bytes 0xC9 and 0xFE contain a maximum of 94 each (i.e. with trail bytes 0xA1–FE). However, when the extensions to support the entire double-byte repertoire of IBM-933 are implemented, they use lead bytes 0x9A–A0, resulting in a smaller maximum number of characters left for user definition.

When mapped to Unicode, 0xC9A1–C9FE (between the syllable and hanja ranges) are mapped to the Unicode Private Use Area code points U+E000–E05D, while 0xFEA1–FEFE (between the end of the hanja range and the end of the plane) are mapped to U+E05E–E0BB. Outside the Wansung plane, 0x8FA1–9AA5 (where the second byte is in the range 0xA1–FE) are mapped to the Private Use Area code points U+E0BC–E4CA. The last of these ranges cuts into the start of the 0x9A row (shown below). Collectively these private use ranges cover the code points U+E000..E4CA, allowing 1227 UDC (94 + 94 + 1039) to be mapped from IBM-949 to Unicode.

The separate private use area range U+F843..F86E is used by IBM to map some characters within the extended hanja range. This follows early recommendations from the Unicode Consortium that corporate characters be allocated from U+F8FF downward and user-defined characters be allocated from U+E000 upward, and is part of a larger corporate private use area scheme which is defined internally by IBM, and uses the range U+F83D..F8FF.
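The arithmetic behind these ranges can be sketched in Python. The function name `udc_to_pua` is hypothetical, and the sketch assumes that each range is mapped linearly in row-major order with 94 cells per row (trail bytes 0xA1–FE); only the endpoints are stated in the text, although the counts (94, 94 and 1039) are consistent with that assumption. The first range outside the Wansung plane is taken to start at 0x8FA1, in line with the stated trail byte range.

```python
from typing import Optional

ROW = 94  # trail bytes 0xA1-0xFE give 94 cells per lead byte


def udc_to_pua(code: int) -> Optional[int]:
    """Map a two-byte IBM-949 user-defined code to a Private Use Area code point.

    Returns None for codes outside the user-defined ranges described above.
    Assumes a linear, row-major mapping between the stated endpoints.
    """
    if not 0 <= code <= 0xFFFF:
        return None
    lead, trail = code >> 8, code & 0xFF
    if not 0xA1 <= trail <= 0xFE:
        return None
    cell = trail - 0xA1
    if lead == 0xC9:                       # 0xC9A1-C9FE -> U+E000-E05D
        return 0xE000 + cell
    if lead == 0xFE:                       # 0xFEA1-FEFE -> U+E05E-E0BB
        return 0xE05E + cell
    if 0x8F <= lead and code <= 0x9AA5:    # 0x8FA1-9AA5 -> U+E0BC-E4CA
        return 0xE0BC + (lead - 0x8F) * ROW + cell
    return None


if __name__ == "__main__":
    mapped = [c for c in range(0x10000) if udc_to_pua(c) is not None]
    print(len(mapped))                     # number of UDC with a PUA mapping
    print(hex(udc_to_pua(0x9AA5)))         # last code of the third range
```

Counting the mapped codes reproduces the figure of 1227 user-defined characters given above, which is a useful sanity check on the range endpoints.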


Lead bytes 0x9A–9D (extended symbols and hanja)

According to the 1992 specification, this entire range is user-defined. As implemented in the codec contributed to ICU by IBM, however, 0x9AA1 through 0x9AA5 are the end of the user-defined range. The remainder of this range includes some non-Hangul characters included in Code page 933 but not in Wansung code. 0x9AA6 through 0x9AAB contain miscellaneous technical or mathematical symbols. The remainder contains hanja additional to those included in KS X 1001, although some are mapped by IBM to the Private Use Area.


Lead bytes 0x9E–A0 (extended hanja and syllables)

According to the 1992 specification, this entire range is user-defined. As implemented in the codec contributed to ICU by IBM, 0x9EA1 through 0x9EAC contain the remainder of the extended hanja. The rest of the range contains a few additional Hangul syllables which are not available in pre-composed form in pure EUC-KR. Unlike Unified Hangul Code, this is insufficient to support all non-partial Johab syllables absent in Wansung code. Significant amongst these are 뢔 (rwae, 0x9EFC), 쌰 (ssya, 0x9FE6), 쎼 (ssye, 0x9FED), 쓔 (ssyu, 0x9FF3) and 쬬 (jjyo, 0xA0C1), which correspond to the beginnings of the standard Wansung characters 뢨, 썅, 쏀, 쓩, and 쭁 respectively, when partly entered in an input method editor.
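The relationship between these five syllables and their Wansung counterparts can be illustrated with the standard Unicode Hangul syllable arithmetic, in which each precomposed syllable is U+AC00 plus an index and the index modulo 28 selects the final consonant. The sketch below is only illustrative: the helper name `without_final` is hypothetical, and it works on Unicode code points rather than on IBM-949 bytes.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable, U+AC00
S_COUNT = 11172   # number of precomposed Hangul syllables
T_COUNT = 28      # final-consonant slots per initial+vowel pair (0 = no final)


def without_final(syllable: str) -> str:
    """Return the syllable with its final consonant removed (initial + vowel only)."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    return chr(S_BASE + index - index % T_COUNT)


if __name__ == "__main__":
    # Wansung syllable -> the intermediate form an IME passes through while typing it
    for full in "뢨썅쏀쓩쭁":
        print(full, "->", without_final(full))
```

Each Wansung syllable in the list reduces to the corresponding IBM extension syllable once its final consonant is dropped, which is the intermediate state an input method editor needs to display.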


Lead bytes 0xA1–C8, 0xCA–FD (standard Wansung)


See also

* LMBCS-17
* Code page 951
* Windows-949

