HOME

TheInfoList



OR:

Extended ASCII is a repertoire of
character encoding Character encoding is the process of assigning numbers to graphical character (computing), characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical v ...
s that include (most of) the original 96
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
character set, plus up to 128 additional characters. There is no formal definition of "extended ASCII", and even use of the term is sometimes criticized, because it can be mistakenly interpreted to mean that the
American National Standards Institute The American National Standards Institute (ANSI ) is a private nonprofit organization that oversees the development of voluntary consensus standards for products, services, processes, systems, and personnel in the United States. The organiz ...
(ANSI) had updated its standard to include more characters, or that the term identifies a single unambiguous encoding, neither of which is the case. The ISO standard
ISO 8859 ISO/IEC 8859 is a joint International Organization for Standardization, ISO and International Electrotechnical Commission, IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC ...
was the first international standard to formalise a (limited) expansion of the ASCII character set: of the many language variants it encoded,
ISO 8859-1 ISO/IEC 8859-1:1998, ''Information technology— 8-bit single-byte coded graphic character sets—Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 19 ...
("ISO Latin 1")which supports most Western European languages is best known in the West. There are many other extended ASCII encodings (more than 220 DOS and Windows codepages).
EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding si ...
("the other" major character code) likewise developed many extended variants (more than 186 EBCDIC codepages) over the decades. All modern
operating system An operating system (OS) is system software that manages computer hardware and software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ...
s use
Unicode Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
which supports thousands of characters. However, extended ASCII remains important in the history of computing, and supporting multiple extended ASCII character sets required software to be written in ways that made it much easier to support the
UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,0 ...
encoding method later on.


History

ASCII was designed in the 1960s for
teleprinter A teleprinter (teletypewriter, teletype or TTY) is an electromechanical device that can be used to send and receive typed messages through various communications channels, in both point-to-point (telecommunications), point-to-point and point- ...
s and
telegraphy Telegraphy is the long-distance transmission of messages where the sender uses symbolic codes, known to the recipient, rather than a physical exchange of an object bearing the message. Thus flag semaphore is a method of telegraphy, whereas pi ...
, and some computing. Early teleprinters were electromechanical, having no microprocessor and just enough electromechanical memory to function. They fully processed one character at a time, returning to an idle state immediately afterward; this meant that any control sequences had to be only one character long, and thus a large number of codes needed to be reserved for such controls. They were typewriter-derived impact printers, and could only print a fixed set of glyphs, which were cast into a metal type element or elements; this also encouraged a minimum set of glyphs. Seven-bit ASCII improved over prior five- and six-bit codes. Of the 27=128 codes, 33 were used for controls, and 95 carefully selected
printable character In ISO/IEC 646 (commonly known as ASCII) and related standards including ISO 8859 and Unicode, a graphic character, also known as printing character (or printable character), is any character intended to be written, printed, or otherwise display ...
s (94
glyph A glyph ( ) is any kind of purposeful mark. In typography, a glyph is "the specific shape, design, or representation of a character". It is a particular graphical representation, in a particular typeface, of an element of written language. A ...
s and one space), which include the English alphabet (uppercase and lowercase), digits, and 31 punctuation marks and symbols: all of the symbols on a standard US typewriter plus a few selected for programming tasks. Some popular peripherals only implemented a 64-printing-character subset: Teletype Model 33 could not transmit "a" through "z" or five less-common symbols (, , , , and ). and when they received such characters they instead printed "A" through "Z" (forced
all caps In typography, text or font in all caps (short for "all capitals") contains capital letters without any lowercase letters. For example: All-caps text can be seen in legal documents, advertisements, newspaper headlines, and the titles on book co ...
) and five other mostly-similar symbols (, , , , and ). The ASCII character set is barely large enough for US English use, lacks many glyphs common in
typesetting Typesetting is the composition of text for publication, display, or distribution by means of arranging physical ''type'' (or ''sort'') in mechanical systems or '' glyphs'' in digital systems representing '' characters'' (letters and other ...
, and is far too small for universal use. Many more letters and symbols are desirable, useful, or required to directly represent letters of alphabets other than English, more kinds of punctuation and spacing, more mathematical operators and symbols (× ÷ ⋅ ≠ ≥ ≈ π etc.), some unique symbols used by some programming languages,
ideogram An ideogram or ideograph (from Ancient Greek, Greek 'idea' + 'to write') is a symbol that is used within a given writing system to represent an idea or concept in a given language. (Ideograms are contrasted with phonogram (linguistics), phono ...
s,
logogram In a written language, a logogram (from Ancient Greek 'word', and 'that which is drawn or written'), also logograph or lexigraph, is a written character that represents a semantic component of a language, such as a word or morpheme. Chine ...
s, box-drawing characters, etc. The biggest problem for computer users around the world was the needs of their local alphabets. ASCII's English alphabet almost accommodates European languages, if accented letters are written without accents or two-character approximations, such as for , are used. Modified local variants of 7-bit ASCII appeared promptly, trading some lesser-used symbols for highly desired symbols or letters, such as replacing with on UK Teletypes, with in Japan or in Korea, etc. At least 29 variant sets resulted. Twelve codepoints were modified by at least one national set, leaving only 82 "invariant" codes. Programming languages however had assigned meaning to many of the replaced characters, work-arounds were devised such as C three-character sequences and to represent and . Languages with dissimilar basic alphabets could use transliteration, such as replacing all the Latin letters with the closest match Cyrillic letters (resulting in odd but somewhat readable text when English was printed in Cyrillic or vice versa). Schemes were also devised so that two letters could be overprinted (often with the
backspace Backspace (, ⌫) is the keyboard key that in typewriters originally pushed the carriage one position backwards, and in modern computer systems typically moves the display cursor one position backwards,The meaning of "backwards" depends on the dir ...
control between them) to produce accented letters. Users were not comfortable with any of these compromises and they were often poorly supported. When computers and peripherals standardized on eight-bit
byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable un ...
s in the 1970s, it became obvious that computers and software could handle text that uses 256-character sets at almost no additional cost in programming, and no additional cost for storage (assuming that the unused 8th bit of each byte was not reused in some way, such as error checking, Boolean fields, or packing 8 characters into 7 bytes). This would allow ASCII to be used unchanged and provide 128 more characters. Many manufacturers devised 8-bit character sets consisting of ASCII plus up to 128 of the unused codes. Thus encodings which covered all the major Western European (and Latin American) languages and more could be made. 128 additional characters is still not enough to cover all purposes, all languages, or even all European languages, so the emergence of many proprietary and national ASCII-derived 8-bit character sets was inevitable. Translating between these sets (
transcoding Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for video data files, audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859). This is usually done in cases where a target ...
) is complex (especially if a character is not in both sets); and was often not done, producing
mojibake Mojibake (; , 'character transformation') is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often ...
(semi-readable resulting text, often users learned how to manually decode it). There were eventually attempts at cooperation or coordination by national and international standards bodies in the late 1990s, but manufacturer-proprietary sets remained the most popular by far, primarily because the international standards excluded characters popular in or peculiar to specific cultures.


Proprietary extensions

Various proprietary modifications and extensions of ASCII appeared on
mainframe computer A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise ...
s and
minicomputer A minicomputer, or colloquially mini, is a type of general-purpose computer mostly developed from the mid-1960s, built significantly smaller and sold at a much lower price than mainframe computers . By 21st century-standards however, a mini is ...
s especially in universities, to meet their need to support teaching of mathematics, science and languages.
Hewlett-Packard The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company. It was founded by Bill Hewlett and David Packard in 1939 in a one-car garage in Palo Alto, California ...
started to add European characters to their extended 7-bit / 8-bit ASCII character set HP Roman Extension around 1978/1979 for use with their workstations, terminals and printers. This later evolved into the widely used regular 8-bit character sets HP Roman-8 and HP Roman-9 (as well as a number of variants).
Atari Atari () is a brand name that has been owned by several entities since its inception in 1972. It is currently owned by French holding company Atari SA (formerly Infogrames) and its focus is on "video games, consumer hardware, licensing and bl ...
and Commodore
home computer Home computers were a class of microcomputers that entered the market in 1977 and became common during the 1980s. They were marketed to consumers as affordable and accessible computers that, for the first time, were intended for the use of a s ...
s added many graphic symbols to their non-standard ASCII (Respectively, ATASCII and PETSCII, based on the original ASCII standard of 1963). The TRS-80 character set for the
TRS-80 The TRS-80 Micro Computer System (TRS-80, later renamed the Model I to distinguish it from successors) is a desktop microcomputer developed by American company Tandy Corporation and sold through their Radio Shack stores. Launched in 1977, it is ...
home computer added 64 semigraphics characters (0x80 through 0xBF) that implemented low-resolution block graphics. (Each block-graphic character displayed as a 2x3 grid of pixels, with each block pixel effectively controlled by one of the lower 6 bits.)
IBM International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American Multinational corporation, multinational technology company headquartered in Armonk, New York, and present in over 175 countries. It is ...
introduced eight-bit extended ASCII codes on the original
IBM PC The IBM Personal Computer (model 5150, commonly known as the IBM PC) is the first microcomputer released in the List of IBM Personal Computer models, IBM PC model line and the basis for the IBM PC compatible ''de facto'' standard. Released on ...
and later produced variations for different languages and cultures. IBM called such character sets '' code pages'' and assigned reference numbers both to those they themselves invented as well as to many invented and used by other manufacturers. Accordingly, character sets are very often indicated by their IBM code page number. In ASCII-compatible code pages, the lower 128 characters maintained their standard ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used
code page 437 Code page 437 ( CCSID 437) is the character set of the original IBM PC (personal computer). It is also known as CP437, OEM-US, OEM 437, PC-8, or MS-DOS Latin US. The set includes all printable ASCII characters as well as some accented letters (di ...
, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters. The larger character set made it possible to create documents in a combination of languages such as English and French (though French computers usually use code page 850), but not, for example, in English and
Greek Greek may refer to: Anything of, from, or related to Greece, a country in Southern Europe: *Greeks, an ethnic group *Greek language, a branch of the Indo-European language family **Proto-Greek language, the assumed last common ancestor of all kno ...
(which required code page 737).
Apple Computer Apple Inc. is an American multinational corporation and technology company headquartered in Cupertino, California, in Silicon Valley. It is best known for its consumer electronics, software, and services. Founded in 1976 as Apple Computer Co ...
introduced their own eight-bit extended ASCII codes in
Mac OS Mac operating systems were developed by Apple Inc. in a succession of two major series. In 1984, Apple debuted the operating system that is now known as the classic Mac OS with its release of the original Macintosh System Software. The system ...
, such as Mac OS Roman. The Apple LaserWriter also introduced the Postscript character set.
Digital Equipment Corporation Digital Equipment Corporation (DEC ), using the trademark Digital, was a major American company in the computer industry from the 1960s to the 1990s. The company was co-founded by Ken Olsen and Harlan Anderson in 1957. Olsen was president until ...
(DEC) developed the Multinational Character Set, which had fewer characters but more letter and diacritic combinations. It was supported by the VT220 and later DEC
computer terminal A computer terminal is an electronic or electromechanical hardware device that can be used for entering data into, and transcribing data from, a computer or a computing system. Most early computers only had a front panel to input or display ...
s. This later became the basis for other character sets such as the Lotus International Character Set (LICS), ECMA-94 and
ISO 8859-1 ISO/IEC 8859-1:1998, ''Information technology— 8-bit single-byte coded graphic character sets—Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 19 ...
.


ISO 8859

In 1987, the
International Organization for Standardization The International Organization for Standardization (ISO ; ; ) is an independent, non-governmental, international standard development organization composed of representatives from the national standards organizations of member countries. M ...
(ISO) published a set of standards for eight-bit ASCII extensions, ISO 8859. The most popular of these was
ISO 8859-1 ISO/IEC 8859-1:1998, ''Information technology— 8-bit single-byte coded graphic character sets—Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 19 ...
(also called "ISO Latin 1") which contains characters sufficient for the most common Western European languages. Other standards in the 8859 group included
ISO 8859-2 ISO/IEC 8859-2:1999, ''Information technology — 8-bit single-byte coded graphic character sets — Part 2: Latin alphabet No. 2'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. I ...
for Eastern European languages using the
Latin script The Latin script, also known as the Roman script, is a writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae in Magna Graecia. The Gree ...
and ISO 8859-5 for languages using the
Cyrillic script The Cyrillic script ( ) is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic languages, Slavic, Turkic languages, Turkic, Mongolic languages, Mongolic, Uralic languages, Uralic, C ...
, and others. One notable way in which the ISO standards differ from some vendor-specific extended ASCII character sets, is that the first 32 codepoints in the extension block are reserved in the ISO standard for control use and are not available for printable characters. This policy emulated the C0 control codes block that occupy the first 32 codepoints of ASCII. This aspect of the standard was almost universally ignored by other extended ASCII sets.


Windows-1252

Microsoft intended to use ISO 8859 standards in Windows, but soon replaced C1 control codes with additional characters, making the proprietary Windows-1252 character set. The added characters included "curly"
quotation mark Quotation marks are punctuation marks used in pairs in various writing systems to identify direct speech, a quotation, or a phrase. The pair consists of an opening quotation mark and a closing quotation mark, which may or may not be the sam ...
s, the
em dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen ...
, the
euro sign The euro sign () is the currency sign used for the euro, the official currency of the eurozone. The design was presented to the public by the European Commission on 12 December 1996. It consists of a stylized letter E (or epsilon), crossed by ...
, and the French and Finnish letters from ISO-8859-15. This became the most-used extended ASCII in the world, and often is used on the web even when 8859-1 is specified.


Character set confusion

In order to correctly interpret and display text data (sequences of characters) that includes extended codes, software that reads or receives the text must use the ''specific'' encoding that text was written in. Choosing the wrong encoding causes the display of often wildly-incorrect characters, known by the Japanese term
mojibake Mojibake (; , 'character transformation') is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often ...
. Because ASCII is common between all "extended ASCII" encodings, using the wrong one results in readable English (or any language that is using only A-Z), as well as numbers and most punctuation surviving. Many
communications protocol A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any variation of a physical quantity. The protocol defines the rules, syntax, semantics (computer science), sem ...
s, most importantly
SMTP The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typi ...
and
HTTP HTTP (Hypertext Transfer Protocol) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, wher ...
, require the character encoding of content to be tagged with
IANA The Internet Assigned Numbers Authority (IANA) is a standards organization that oversees global IP address allocation, autonomous system number allocation, root zone management in the Domain Name System (DNS), media types, and other Internet P ...
-assigned character set identifiers, in an attempt to get software to interpret multiple encodings correctly. However the vast majority of software relies on a system setting indicating the user's preferred encoding, or compiles in an assumed setting. In modern times, Unicode has replaced almost all uses of non-ASCII encodings. Because many Internet standards use ISO 8859-1, and because Microsoft Windows (for most languages used in Western Europe and the Americas) uses the CP1252 superset of ISO 8859-1, in general it is safe to assume that any byte stream that is not valid
UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,0 ...
is in CP1252 or the system setting.


See also

* * Digraphs and trigraphs (programming) * *
List of Unicode characters As of Unicode version 16.0, there are 292,531 assigned character (computing), characters with code points, covering 168 modern and historical Script (Unicode), scripts, as well as multiple symbol sets. As it is WP:CHOKING, not technically possib ...
*


Notes


References


External links


A short page on ASCII, with the OEM 8-bit chart and the ANSI 8-bit chart
{{character encoding Character sets ASCII