ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded. ISO/IEC 8859 parts 1, 2, 3, and 4 were originally Ecma International standard ECMA-94.


Introduction

While the bit patterns of the 95 printable ASCII characters are sufficient to exchange information in modern English, most other languages that use Latin alphabets need additional symbols not covered by ASCII. ISO/IEC 8859 sought to remedy this problem by utilizing the eighth bit in an 8-bit byte to allow positions for another 96 printable characters. Early encodings were limited to 7 bits because of restrictions of some data transmission protocols, and partially for historical reasons. However, more characters were needed than could fit in a single 8-bit character encoding, so several mappings were developed, including at least ten suitable for various Latin alphabets.

The ISO/IEC 8859 standard parts only define printable characters, although they explicitly set apart the byte ranges 0x00–1F and 0x7F–9F as "combinations that do not represent graphic characters" (i.e. which are reserved for use as control characters) in accordance with ISO/IEC 4873; they were designed to be used in conjunction with a separate standard defining the control functions associated with these bytes, such as ISO 6429 or ISO 6630. To this end a series of encodings registered with the IANA add the C0 control set (control characters mapped to bytes 0 to 31) from ISO 646 and the C1 control set (control characters mapped to bytes 128 to 159) from ISO 6429, resulting in full 8-bit character maps with most, if not all, bytes assigned. These sets have ISO-8859-''n'' as their preferred MIME name or, in cases where a preferred MIME name is not specified, their canonical name. Many people use the terms ISO/IEC 8859-''n'' and ISO-8859-''n'' interchangeably. ISO/IEC 8859-11 did not get such a charset assigned, presumably because it was almost identical to TIS 620.
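
The byte layout described above can be sketched in a few lines of Python. This is only an illustration of the ranges named in the text; the function name `classify_byte` and the region labels are hypothetical and do not come from the standard.

```python
def classify_byte(value: int) -> str:
    """Return the region of the 8-bit code space that a byte value falls into.

    The labels are illustrative names, not terms defined by ISO/IEC 8859.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("not an 8-bit value")
    if value <= 0x1F:
        return "C0 control"            # bytes 0 to 31
    if value <= 0x7E:
        return "ASCII printable"       # 0x20 to 0x7E, 95 positions
    if value == 0x7F:
        return "non-graphic (DEL)"     # first byte of the 0x7F-9F range
    if value <= 0x9F:
        return "C1 control"            # bytes 128 to 159
    return "8859 printable"            # 0xA0 to 0xFF, 96 positions


# Count how many positions each region offers.
counts = {}
for b in range(256):
    region = classify_byte(b)
    counts[region] = counts.get(region, 0) + 1
print(counts)
```

Counting the regions reproduces the figures in the text: 95 printable ASCII positions and 96 additional printable positions in the upper half.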


Characters

The ISO/IEC 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO/IEC 8859 standards, or use Unicode instead.

An inexact rule based on practical experience states that if a character or symbol was not already part of a widely used data-processing character set and was also not usually provided on typewriter keyboards for a national language, it did not get in. Hence the directional double quotation marks ''«'' and ''»'' used for some European languages were included, but not the directional double quotation marks ''“'' and ''”'' used for English and some other languages. French did not get its ''œ'' and ''Œ'' ligatures because they could be typed as 'oe'. Likewise, ''Ÿ'', needed for all-caps text, was dropped as well. Albeit under different code points, these three characters were later reintroduced with ISO/IEC 8859-15 in 1999, which also introduced the new euro sign character €. Likewise, Dutch did not get the ''ij'' and ''IJ'' letters, because Dutch speakers had become used to typing these as two letters instead. Romanian did not initially get its ''Ș''/''ș'' and ''Ț''/''ț'' (with comma) letters, because these letters were initially unified with ''Ş''/''ş'' and ''Ţ''/''ţ'' (with cedilla) by the Unicode Consortium, considering the shapes with comma beneath to be glyph variants of the shapes with cedilla. However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO/IEC 8859-16.

Most of the ISO/IEC 8859 encodings provide diacritic marks required for various European languages using the Latin script. Others provide non-Latin alphabets: Greek, Cyrillic, Hebrew, Arabic and Thai. Most of the encodings contain only spacing characters, although the Thai, Hebrew, and Arabic ones do also contain combining characters.

The standard makes no provision for the scripts of East Asian languages (''CJK''), as their ideographic writing systems require many thousands of code points. Although it uses Latin-based characters, Vietnamese does not fit into 96 positions (without using combining diacritics such as in Windows-1258) either. Each Japanese syllabic alphabet (hiragana or katakana, see Kana) would fit, as in JIS X 0201, but like several other alphabets of the world they are not encoded in the ISO/IEC 8859 system.
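
The inclusions and omissions listed above can be checked against the codec tables that ship with Python. The sketch below assumes Python's built-in `iso8859_1`, `iso8859_2`, `iso8859_15` and `iso8859_16` codecs faithfully reflect the corresponding parts; the helper name `encodable` is hypothetical.

```python
def encodable(char: str, codec: str) -> bool:
    """Return True if `char` has a code point in the given single-byte codec."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Guillemets made it into Part 1; curly quotes and the French ligatures did not.
for ch in "«»“”œŒŸ€":
    print(repr(ch),
          "8859-1:", encodable(ch, "iso8859_1"),
          "8859-15:", encodable(ch, "iso8859_15"))

# Romanian comma-below letters are absent from Part 2 but present in Part 16.
for ch in "ȘșȚț":
    print(repr(ch),
          "8859-2:", encodable(ch, "iso8859_2"),
          "8859-16:", encodable(ch, "iso8859_16"))
```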


The parts of ISO/IEC 8859

ISO/IEC 8859 is divided into numbered parts. Each part is designed to support languages that often borrow from each other, so the characters needed by each language are usually accommodated by a single part. However, there are some characters and language combinations that are not accommodated without transcriptions. Efforts were made to make conversions as smooth as possible. For example, German has all of its seven special characters at the same positions in all Latin variants (1–4, 9, 10, 13–16), and in many positions the characters only differ in the diacritics between the sets. In particular, variants 1–4 were designed jointly, and have the property that every encoded character appears either at a given position or not at all.
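
The claim about the German special characters can be verified with a short script. This is a sketch that relies on Python's bundled `iso8859_N` codecs; the names `LATIN_PARTS`, `GERMAN_SPECIALS` and `positions` are chosen here for illustration.

```python
# Latin variants named in the text: parts 1-4, 9, 10 and 13-16.
LATIN_PARTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]
GERMAN_SPECIALS = "ÄÖÜäöüß"


def positions(text: str, part: int) -> bytes:
    """Encode `text` with the given ISO/IEC 8859 part, one byte per character."""
    return text.encode(f"iso8859_{part}")


reference = positions(GERMAN_SPECIALS, 1)
same_everywhere = all(
    positions(GERMAN_SPECIALS, part) == reference for part in LATIN_PARTS
)

print([hex(b) for b in reference])
print("identical in all Latin parts:", same_everywhere)
```

Because every listed part yields the same byte sequence, German text limited to these letters and ASCII can be relabelled between the Latin parts without re-encoding.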


Table

Position 0xA0 is always the non-breaking space, and 0xAD is in most parts the soft hyphen, which is only shown at line breaks. Other empty fields are either unassigned or cannot be displayed by the system in use. The ISO/IEC 8859-7:2003 and ISO/IEC 8859-8:1999 versions contain new additions. LRM stands for left-to-right mark (U+200E) and RLM stands for right-to-left mark (U+200F).
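
A quick way to inspect the two positions mentioned above is to decode the same byte under every part. The following sketch assumes Python's bundled codecs for parts 1 to 11 and 13 to 16; `decode_byte` is a hypothetical helper, and `errors="replace"` is used so that an unassigned position yields U+FFFD instead of raising.

```python
# All published parts; part 12 was abandoned and has no codec.
PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]


def decode_byte(value: int, part: int) -> str:
    """Decode a single byte under the given ISO/IEC 8859 part."""
    return bytes([value]).decode(f"iso8859_{part}", errors="replace")


# 0xA0 should map to U+00A0 NO-BREAK SPACE in every part.
nbsp_everywhere = all(decode_byte(0xA0, p) == "\u00a0" for p in PARTS)

# 0xAD maps to U+00AD SOFT HYPHEN in most parts, but not all.
soft_hyphen_parts = [p for p in PARTS if decode_byte(0xAD, p) == "\u00ad"]
other_parts = [p for p in PARTS if p not in soft_hyphen_parts]

print("0xA0 is NBSP everywhere:", nbsp_everywhere)
print("0xAD is soft hyphen in parts:", soft_hyphen_parts)
print("0xAD is something else in parts:", other_parts)
```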


Relationship to Unicode and the UCS

Since 1991, the Unicode Consortium has been working with ISO and IEC to develop the Unicode Standard and ISO/IEC 10646: the Universal Character Set (UCS) in tandem. Newer editions of ISO/IEC 8859 express characters in terms of their Unicode/UCS names and the ''U+nnnn'' notation, effectively causing each part of ISO/IEC 8859 to be a Unicode/UCS character encoding scheme that maps a very small subset of the UCS to single 8-bit bytes. The first 256 characters in Unicode and the UCS are identical to those in ISO/IEC 8859-1 (Latin-1).

Single-byte character sets including the parts of ISO/IEC 8859 and derivatives of them were favoured throughout the 1990s, having the advantages of being well-established and more easily implemented in software: the equation of one byte to one character is simple and adequate for most single-language applications, and there are no combining characters or variant forms. As Unicode-enabled operating systems became more widespread, ISO/IEC 8859 and other legacy encodings became less popular. While remnants of ISO/IEC 8859 and single-byte character models remain entrenched in many operating systems, programming languages, data storage systems, networking applications, display hardware, and end-user application software, most modern computing applications use Unicode internally, and rely on conversion tables to map to and from other encodings when necessary.
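
Both points above, the Latin-1 identity and table-driven conversion, can be illustrated with Python's built-in codecs. This is a minimal sketch; the helper name `transcode` is hypothetical, and the example bytes are chosen only for demonstration.

```python
# Decoding every byte value as ISO/IEC 8859-1 yields the code points 0..255.
# Python's codec also maps the C0 and C1 control ranges one-to-one.
all_bytes = bytes(range(256))
identical = [ord(c) for c in all_bytes.decode("iso8859_1")] == list(range(256))
print("Latin-1 bytes equal the first 256 code points:", identical)


def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes between encodings by going through Unicode text."""
    return data.decode(source).encode(target)


# The same byte means different characters in different parts,
# so the source part must be known before converting.
print(transcode(b"\xe9", "iso8859_1"))  # Latin small letter e with acute
print(transcode(b"\xe9", "iso8859_7"))  # Greek small letter iota
```

Going through Unicode text as the pivot is what lets a program support many legacy encodings with one conversion table per encoding rather than one per pair.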


Current status

The ISO/IEC 8859 standard was maintained by ISO/IEC Joint Technical Committee 1, Subcommittee 2, Working Group 3 (ISO/IEC JTC 1/SC 2/WG 3). In June 2004, WG 3 disbanded, and maintenance duties were transferred to SC 2. The standard is not currently being updated, as the Subcommittee's only remaining working group, WG 2, is concentrating on development of Unicode's Universal Coded Character Set.

The WHATWG Encoding Standard, which specifies the character encodings that are permitted in HTML5 and that compliant browsers must support, includes most parts of ISO/IEC 8859, except for parts 1, 9 and 11, which are instead interpreted as Windows-1252, Windows-1254 and Windows-874 respectively. Authors of new pages and the designers of new protocols are instructed to use UTF-8 instead.
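
The practical effect of that reinterpretation shows up in the 0x80 to 0x9F range, where ISO/IEC 8859-1 has C1 controls and Windows-1252 has printable characters. The sketch below uses Python's `iso8859_1`, `cp1252`, `cp1254` and `cp874` codecs; the `WEB_OVERRIDES` table and `web_codec` function are hypothetical names that model only the three substitutions named in the text, not the full label table of the Encoding Standard.

```python
# Substitutions described in the text, mapped to Python codec names.
WEB_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "iso-8859-9": "cp1254",
    "iso-8859-11": "cp874",
}


def web_codec(label: str) -> str:
    """Return the Python codec a browser-like decoder would use for `label`."""
    key = label.strip().lower()
    return WEB_OVERRIDES.get(key, key)


data = b"\x93quoted\x94 \x80"

strict = data.decode("iso8859_1")               # C1 control characters
as_browser = data.decode(web_codec("ISO-8859-1"))  # curly quotes and euro sign

print([hex(ord(c)) for c in strict if ord(c) > 0x7F])
print(as_browser)
```

Bytes that a strict ISO/IEC 8859-1 decoder turns into invisible control characters become curly quotation marks and the euro sign under the browser interpretation, which is why content labelled ISO-8859-1 on the web often contains them.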


See also

* List of computer character sets
* Number Forms
* RPL character set (an ISO/IEC 8859-1 superset on HP calculators, also referred to as "ECMA-94")
* DEC Multinational Character Set (MCS)
* DEC National Replacement Character Set (NRCS)

