Several binary representations of 8-bit character sets for common Western European languages are compared in this article. These encodings were designed for representation of Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic, which use the Latin alphabet, a few additional letters and ones with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). Although they are called "Western European", many of these languages are spoken all over the world. These character sets also happen to support many other languages, such as Malay, Swahili, and Classical Latin.

''This material is technically obsolete, having been functionally replaced by Unicode. However, it continues to have historical interest.''

Summary

The ISO-8859 series of 8-bit character sets encodes all Latin character sets used in Europe, although the same code points have multiple uses, which caused some difficulty (including mojibake, or garbled characters, and communication issues). The arrival of Unicode, with a unique code point for every character, resolved these issues.

* ISO/IEC 8859-1, or Latin-1, is the most used and also defines the first 256 codes in Unicode.
* ISO/IEC 8859-15 modifies ISO-8859-1 to fully support Estonian, Finnish and French and to add the euro sign.
* Windows-1252 is a superset of ISO-8859-1 that includes the printable characters from ISO/IEC 8859-15 and popular punctuation such as curved quotation marks (also known as smart quotes, as in Microsoft Word settings and similar programs). Web page tools for Windows commonly use Windows-1252 but label the page as ISO-8859-1. HTML5 addresses this by mandating that pages labeled as ISO-8859-1 be interpreted as Windows-1252.
* IBM CP437, being intended for English only, has very few accented letters (particularly uppercase) but has far more graphics characters than the other IBM code pages listed here, along with some mathematical and Greek characters that are useful as technical symbols.
* IBM CP850 has all the printable characters that ISO-8859-1 has (albeit arranged differently) and still manages to have enough graphics characters to build a usable text-mode user interface.
* IBM CP858 differs from CP850 by only one character: the ''dotless i'' (ı), rarely used outside Turkey and provided with no uppercase equivalent, was replaced by the ''euro currency sign'' (€).
* IBM CP859 contains all the printable characters that ISO/IEC 8859-15 has, so unlike CP850 it supports the euro sign, Estonian, Finnish and French.
* IBM code pages 037, 500, and 1047 are EBCDIC encodings that include all of the ISO-8859-1 characters.
* The Mac OS Roman character set (often referred to as MacRoman and known by the IANA as simply MACINTOSH) has most, but not all, of the same characters as ISO-8859-1 but in a very different arrangement. It also adds many technical and mathematical characters (though it lacks the important ×) and more diacritics. Older Macintosh web browsers were known to munge the few characters that were in ISO-8859-1 but not in their native character set when editing text from Web sites. Conversely, in Web material prepared on an older Macintosh, many characters were displayed incorrectly when read by other operating systems. The Macintosh Latin encoding, a modification of Mac OS Roman to support ISO/IEC 8859-1, was created by the creators of Kermit to solve this problem.
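The overlapping use of code points described in this summary can be illustrated with a minimal Python sketch. The codec names are Python's standard-library spellings, chosen here as assumed stand-ins for the character sets above, and the variable and function names are purely illustrative. The sketch decodes one stored byte sequence under several code pages, checks that ISO-8859-1 bytes coincide with the first 256 Unicode code points, and shows why mislabeling Windows-1252 text as ISO-8859-1 loses the curved quotation marks.

```python
# Codec names are Python standard-library spellings (an assumption of this
# sketch), not the IANA names used elsewhere in the article.
CODECS = ["latin-1", "iso8859-15", "cp1252", "cp437", "cp850", "mac_roman"]


def decode_all(data):
    """Decode the same bytes under every codec; undefined bytes become U+FFFD."""
    return {name: data.decode(name, errors="replace") for name in CODECS}


# "café" stored as ISO-8859-1: the é is the single byte 0xE9.
data = "café".encode("latin-1")
views = decode_all(data)
for name, text in views.items():
    print(f"{name:12} {text}")

# ISO-8859-1 bytes map one-to-one onto the first 256 Unicode code points.
latin1_is_identity = (
    bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
)

# Bytes 0x93/0x94 are curved quotes in Windows-1252 but C1 controls in ISO-8859-1.
quoted = b"\x93smart\x94"
as_cp1252 = quoted.decode("cp1252")
as_latin1 = quoted.decode("latin-1")
```

The ISO-8859 and Windows codecs agree on byte 0xE9, while the IBM and Macintosh code pages assign it a different character, which is the mojibake effect in miniature.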

History

The earlier seven-bit U.S. American Standard Code for Information Interchange (ASCII) encoding has characters sufficient to properly represent only a few languages, such as English, Latin, Malay and Swahili. It is missing some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, since there was no other choice on most US-supplied computer platforms, use of ASCII was unavoidable except where there was a strong national computing industry. The ISO 646 group of encodings replaced some of the symbols in ASCII with local characters, but space was very limited, and some of the symbols replaced were quite common in things like programming languages.

Most computers internally used eight-bit bytes, but communication (seen as inherently unreliable) used seven data bits plus one parity bit. In time, it became common to use all eight bits for data, creating space for another 128 characters. In the early days most of these were system specific, but gradually the ISO/IEC 8859 standards emerged to provide some cross-platform similarity to enable information interchange.

Towards the end of the 20th century, as storage and memory costs fell, the problems caused by multiple meanings of a given eight-bit code (there are seven ISO-Latin code sets alone) could no longer be justified. All major operating systems have moved to Unicode as their main internal representation. However, as Windows did not support the UTF-8 method of encoding Unicode (preferring UTF-16), many applications continued to be restricted to these legacy character sets.
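The "seven data bits plus one parity bit" arrangement mentioned above can be sketched as follows. This is an illustrative example assuming even parity carried in the top bit; the function names are hypothetical, and real links varied in which parity convention they used.

```python
def add_even_parity(code):
    """Return an 8-bit value: 7 data bits plus an even-parity bit in bit 7."""
    if not 0 <= code < 128:
        raise ValueError("expected a 7-bit value")
    parity = bin(code).count("1") % 2  # 1 if the data bits hold an odd number of ones
    return code | (parity << 7)


def has_even_parity(byte):
    """Check that a received 8-bit value holds an even number of one bits."""
    return bin(byte).count("1") % 2 == 0


sent_a = add_even_parity(ord("A"))  # 0x41 has two one bits, so bit 7 stays clear
sent_c = add_even_parity(ord("C"))  # 0x43 has three one bits, so bit 7 is set
corrupted = sent_c ^ 0x01           # flip one bit in transit

# Once all eight bits carry data, the same value is an ordinary character.
as_latin1 = bytes([sent_c]).decode("latin-1")
```

With the eighth bit reused for data, a value such as 0xC3 no longer signals "C with parity" but names a character of its own in ISO-8859-1, which is how the extra 128 positions became available.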

The euro sign

The introduction of the euro and its associated euro sign (€) put significant pressure on computer systems developers to support this new symbol, and most 8-bit character sets had to be adapted in some way.

* Apple, with MacRoman, and Sun Microsystems, with Solaris OS, simply replaced the generic currency sign (¤). This caused difficulty in some places because organisations had found other uses for its code point, such as the company logo.
* ISO introduced a further variant of ISO 8859, ISO 8859-15, which replaced the generic currency sign with the euro sign and also replaced some other symbols with letters with diacritics. ISO 8859-15 never received widespread adoption.
* With Windows-1252, Microsoft placed the euro sign in a gap (position 0x80) in the existing C1 control codes, a decision that other vendors considered counter-architectural.

Whilst these decisions had limited effect for documents that were only used within a single computer (or at least within a single vendor's "digital ecosystem"), they meant that documents containing a euro sign would fail to render as expected when interchanged between ecosystems. All of these issues have been resolved as operating systems have been upgraded to support Unicode as standard, which encodes the euro sign at U+20AC (decimal 8364).
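The divergent placements of the euro sign can be inspected with a short Python sketch. The codec names are Python's standard-library spellings, used here as assumed equivalents of the character sets discussed, and the variable names are illustrative only.

```python
EURO = "\u20ac"  # U+20AC, decimal 8364

# Where each adapted character set stores the euro sign.
euro_bytes = {}
for codec in ["iso8859-15", "cp1252", "mac_roman", "cp858", "utf-8"]:
    euro_bytes[codec] = EURO.encode(codec)

# Character sets that predate the euro cannot represent it at all.
unsupported = []
for codec in ["latin-1", "cp437", "cp850"]:
    try:
        EURO.encode(codec)
    except UnicodeEncodeError:
        unsupported.append(codec)

# The byte that means "euro" in one set means something else in its predecessor.
currency_sign = b"\xa4".decode("latin-1")  # generic currency sign in ISO-8859-1
dotless_i = b"\xd5".decode("cp850")        # dotless i in CP850

for codec, raw in euro_bytes.items():
    print(f"{codec:12} {raw.hex()}")
```

Each single-byte set uses a different position, so a euro sign written under one vendor's encoding and read under another's would render as a different character, as the section describes.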

Comparison table

Code points up to U+007F are not currently shown in this table, as they are directly mapped in all character sets listed here. The ASCII coding standard defines the original specification for the mapping of the first 128 characters (0-127). The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.

* The mappings for the IBM code pages are from the Unicode site supplied by Microsoft. The Unicode Consortium's document has links to sources giving the differences between IBM's and Microsoft's mappings for these code pages.
* IBM437 and IBM850 defined printable characters for the control code ranges. While these could not be used when printing text through DOS, as they would be trapped before reaching the screen, they could be used by applications that used screen memory directly.
* Macintosh has an Apple logo at 0xF0, and translates it to U+F8FF in the Private Use Area for Unicode.
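Two of the statements above can be checked with a brief Python sketch: that the range up to U+007F is mapped identically, and that the Macintosh byte 0xF0 lands in the Private Use Area. The codec list uses Python's standard-library names as an assumed approximation of the table's columns; Python's mappings may differ in detail from the IBM and Microsoft sources the table cites.

```python
import unicodedata

# Python codec names standing in for the character sets compared in the table.
CODECS = ["latin-1", "iso8859-15", "cp1252", "cp437", "cp850", "cp858", "mac_roman"]

# Bytes 0x00-0x7F should decode to the same characters as ASCII everywhere.
ascii_text = bytes(range(128)).decode("ascii")
ascii_compatible = {
    name: bytes(range(128)).decode(name) == ascii_text for name in CODECS
}

# The Apple logo at 0xF0 is translated to a Private Use Area code point.
apple_logo = b"\xf0".decode("mac_roman")
is_private_use = unicodedata.category(apple_logo) == "Co"

print(ascii_compatible)
print(f"U+{ord(apple_logo):04X}", is_private_use)
```

Python's cp437 and cp850 codecs treat bytes below 0x20 as control codes, so the printable characters that IBM437 and IBM850 defined for that range are not visible through this sketch.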

