Extended ASCII is a repertoire of character encodings that include (most of) the original 96-character ASCII set, plus up to 128 additional characters. There is no formal definition of "extended ASCII", and even use of the term is sometimes criticized, because it can be mistakenly interpreted to mean that the American National Standards Institute (ANSI) had updated its standard to include more characters, or that the term identifies a single unambiguous encoding, neither of which is the case.

The ISO standard ISO 8859 was the first international standard to formalise a (limited) expansion of the ASCII character set: of the many language variants it encoded, ISO 8859-1 ("ISO Latin 1"), which supports most Western European languages, is best known in the West. There are many other extended ASCII encodings (more than 220 DOS and Windows codepages). EBCDIC ("the other" major character code) likewise developed many extended variants (more than 186 EBCDIC codepages) over the decades.

The technology has largely been rendered obsolete by Unicode, which has code points for all the characters encoded in the various attempts to extend ASCII. All modern operating systems use Unicode. Nevertheless, the topic remains important in the history of computing.

History

ASCII was designed in the 1960s for teleprinters and telegraphy, and some computing. Early teleprinters were electromechanical, having no microprocessor and just enough electromechanical memory to function. They fully processed one character at a time, returning to an idle state immediately afterward; this meant that any control sequences had to be only one character long, and thus a large number of codes needed to be reserved for such controls. They were typewriter-derived impact printers, and could only print a fixed set of glyphs, which were cast into a metal type element or elements; this also encouraged a minimum set of glyphs.

Seven-bit ASCII improved over prior five- and six-bit codes. Of the 2⁷=128 codes, 33 were used for controls and 95 for carefully selected printable characters (94 glyphs and one space), which include the English alphabet (uppercase and lowercase), digits, and 32 punctuation marks and symbols: all of the symbols on a standard US typewriter plus a few selected for programming tasks. Some popular peripherals only implemented a 64-printing-character subset: the Teletype Model 33 could not transmit "a" through "z" or five less-common symbols ("`", "{", "|", "}", and "~"), and when it received such characters it instead printed "A" through "Z" (forced all caps) and five other mostly-similar symbols ("@", "[", "\", "]", and "^").

The ASCII character set is barely large enough for US English use, lacks many glyphs common in typesetting, and is far too small for universal use. Many more letters and symbols are desirable, useful, or required to directly represent letters of alphabets other than English, more kinds of punctuation and spacing, more mathematical operators and symbols (× ÷ ⋅ ≠ ≥ ≈ π etc.), some unique symbols used by some programming languages, ideograms, logograms, box-drawing characters, etc.

For years, applications were designed around the 64-character set and/or the 95-character set, so several characters acquired new uses. For example, ASCII lacks "÷", so most programming languages use "/" to indicate division.

The biggest problem for computer users around the world was other alphabets. ASCII's English alphabet almost accommodates European languages, if accented letters are replaced by non-accented letters or two-character approximations. Modified variants of 7-bit ASCII appeared promptly, trading some lesser-used symbols for highly desired symbols or letters, such as replacing "#" with "£" on UK Teletypes, "\" with "¥" in Japan or "₩" in Korea, etc. At least 29 variant sets resulted. 12 code points were modified by at least one modified set, leaving only 82 "invariant" codes. Programming languages, however, had assigned meaning to many of the replaced characters, so work-arounds were devised, such as the C three-character sequences "??<" and "??>" to represent "{" and "}". Languages with dissimilar basic alphabets could use transliteration, such as replacing all the Latin letters with the closest-matching Cyrillic letters (resulting in odd but somewhat readable text when English was printed in Cyrillic or vice versa). Schemes were also devised so that two letters could be overprinted (often with the backspace control between them) to produce accented letters. Users were not comfortable with any of these compromises, and they were often poorly supported.

When computers and peripherals standardized on eight-bit bytes in the 1970s, it became obvious that computers and software could handle text that uses 256-character sets at almost no additional cost in programming, and no additional cost for storage (assuming that the unused 8th bit of each byte was not reused in some way, such as error checking, Boolean fields, or packing 8 characters into 7 bytes). This would allow ASCII to be used unchanged and provide 128 more characters. Many manufacturers devised 8-bit character sets consisting of ASCII plus up to 128 of the unused codes. Since Eastern Europe was politically separated at the time, 8-bit encodings could be made which covered all the more widely used European (and Latin American) languages, such as Danish, Dutch, French, German, Portuguese, Spanish, Swedish and more; these were often called "Latin" or "Roman".

128 additional characters is still not enough to cover all purposes, all languages, or even all European languages, so the emergence of many proprietary and national ASCII-derived 8-bit character sets was inevitable. Translating between these sets (transcoding) is complex (especially if a character is not in both sets) and was often not done, producing mojibake (semi-readable resulting text; users often learned how to decode it manually). There were eventually attempts at cooperation or coordination by national and international standards bodies in the late 1990s, but manufacturers' proprietary sets remained the most popular by far, primarily because the standards excluded many popular characters.
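As a rough illustration of the figures above, the following Python sketch counts the control and printable codes in 7-bit ASCII and mimics the forced-uppercase behaviour described for the Teletype Model 33. The function name `model33_fold` is hypothetical, and the assumption that the folding amounts to subtracting 0x20 from codes 0x60 through 0x7E is inferred from the character pairs listed above, not taken from Teletype documentation.

```python
# Codes 0x00-0x1F and 0x7F (DEL) are controls; 0x20-0x7E are printable.
controls = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]


def model33_fold(text: str) -> str:
    """Map codes 0x60-0x7E onto 0x40-0x5E (assumed folding rule)."""
    return "".join(
        chr(ord(ch) - 0x20) if 0x60 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


print(len(controls), len(printable))   # 33 95
print(model33_fold("a{b|c}d~`"))
```

Under this assumed rule, lowercase letters come out as capitals and the five extra symbols come out as their counterparts 0x20 positions lower in the table, while digits, spaces, and uppercase text pass through unchanged.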

Proprietary extensions

Various proprietary modifications and extensions of ASCII appeared on non-EBCDIC mainframe computers and minicomputers, especially in universities.

Hewlett-Packard started to add European characters to their extended 7-bit / 8-bit ASCII character set HP Roman Extension around 1978/1979 for use with their workstations, terminals and printers. This later evolved into the widely used regular 8-bit character sets HP Roman-8 and HP Roman-9 (as well as a number of variants).

Atari and Commodore home computers added many graphic symbols to their non-standard ASCII (respectively, ATASCII and PETSCII, based on the original ASCII standard of 1963).

The TRS-80 character set for the TRS-80 home computer added 64 semigraphics characters (0x80 through 0xBF) that implemented low-resolution block graphics. (Each block-graphic character displayed as a 2x3 grid of pixels, with each block pixel effectively controlled by one of the lower 6 bits.)

IBM introduced eight-bit extended ASCII codes on the original IBM PC and later produced variations for different languages and cultures. IBM called such character sets code pages and assigned numbers both to those it invented itself and to many invented and used by other manufacturers. Accordingly, character sets are very often indicated by their IBM code page number. In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters. The larger character set made it possible to create documents in a combination of languages such as English and French (though French computers usually use code page 850), but not, for example, in English and Greek (which required code page 737).
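To make the code page idea concrete, the sketch below decodes the same bytes under the three code pages mentioned above, using the `cp437`, `cp850`, and `cp737` codecs that ship with Python's standard library. The sample byte values are chosen purely for illustration.

```python
# The same byte strings interpreted under three DOS code pages.
CODE_PAGES = ["cp437", "cp850", "cp737"]

ascii_bytes = bytes(range(0x20, 0x7F))   # the printable ASCII range
high_bytes = bytes([0x80, 0x81, 0x82])   # three codes from the upper half

decoded = {}
for name in CODE_PAGES:
    # The lower 128 codes decode identically in every ASCII-compatible page.
    assert ascii_bytes.decode(name) == ascii_bytes.decode("ascii")
    # The upper 128 codes depend on which page is selected.
    decoded[name] = high_bytes.decode(name)
    print(name, decoded[name])
```

The two Western pages yield accented Latin letters for these three codes, while code page 737 yields Greek capitals, which is why a single DOS document could not mix, say, French and Greek.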

Apple Computer introduced their own eight-bit extended ASCII codes in Mac OS, such as Mac OS Roman. The Apple LaserWriter also introduced the PostScript character set.

Digital Equipment Corporation (DEC) developed the Multinational Character Set, which had fewer characters but more letter and diacritic combinations. It was supported by the VT220 and later DEC computer terminals. This later became the basis for other character sets such as the Lotus International Character Set (LICS), ECMA-94 and ISO 8859-1.

ISO 8859 and proprietary adaptations

Eventually, ISO released its own set of eight-bit ASCII extensions as the ISO 8859 standard. The most popular is ISO 8859-1, also called ISO Latin 1, which contained characters sufficient for the most common Western European languages. Variations were standardized for other languages as well: ISO 8859-2 for Eastern European languages and ISO 8859-5 for Cyrillic languages, for example.

One notable way in which ISO character sets differ from code pages is that the character positions 128 to 159, corresponding to ASCII control characters with the high-order bit set, are specifically unused and undefined in the ISO standards, though they had often been used for printable characters in proprietary code pages, a breaking of ISO standards that was almost universal.

Microsoft later created code page 1252, a compatible superset of ISO 8859-1 with extra characters in the ISO unused range. Code page 1252 is the standard character encoding of western European language versions of Microsoft Windows, including English versions. ISO 8859-1 is the common 8-bit character encoding used by the X Window System, and most Internet standards used it before Unicode.
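The relationship between ISO 8859-1 and code page 1252 can be checked byte by byte. The sketch below uses Python's built-in `iso-8859-1` and `cp1252` codecs; note that Python's ISO 8859-1 codec maps positions 128 to 159 to the C1 control code points rather than leaving them undefined, and that a few positions are unassigned in its code page 1252 table, so those are decoded with `errors="replace"`.

```python
# Compare ISO 8859-1 and Windows code page 1252 over the upper 128 codes.
same, different = [], []
for code in range(0x80, 0x100):
    raw = bytes([code])
    latin1 = raw.decode("iso-8859-1")
    # Unassigned cp1252 positions become U+FFFD instead of raising an error.
    cp1252 = raw.decode("cp1252", errors="replace")
    (same if latin1 == cp1252 else different).append(code)

print(f"{len(same)} codes agree, starting at 0x{same[0]:02X}")
print(f"codes that differ: 0x{different[0]:02X}-0x{different[-1]:02X}")
print(repr(b"\x80".decode("iso-8859-1")), repr(b"\x80".decode("cp1252")))
```

Every difference falls in the range 128 to 159, which matches the description of code page 1252 as ISO 8859-1 plus printable characters in the otherwise unused range.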

Character set confusion

The meaning of each extended code point can be different in every encoding. In order to correctly interpret and display text data (sequences of characters) that includes extended codes, hardware and software that reads or receives the text must use the specific extended ASCII encoding that applies to it. Applying the wrong encoding causes irrational substitution of many or all extended characters in the text.

Software can use a fixed encoding selection, or it can select from a palette of encodings by defaulting, checking the computer's nation and language settings, reading a declaration in the text, analyzing the text, asking the user, letting the user select or override, and/or defaulting to the last selection. When text is transferred between computers that use different operating systems, software, and encodings, applying the wrong encoding can be commonplace.

Because the full English alphabet and the most-used characters in English are included in the seven-bit code points of ASCII, which are common to all encodings (even most proprietary encodings), English-language text is less damaged by interpreting it with the wrong encoding, but text in other languages can display as mojibake (complete nonsense). Because many Internet standards use ISO 8859-1, and because Microsoft Windows (using the code page 1252 superset of ISO 8859-1) is the dominant operating system for personal computers today, unannounced use of ISO 8859-1 is quite commonplace, and may generally be assumed unless there are indications otherwise.

Many communications protocols, most importantly SMTP and HTTP, require the character encoding of content to be tagged with IANA-assigned character set identifiers.
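The effect of a declared versus a guessed encoding can be sketched in a few lines of Python. The helper `charset_from_content_type` is a hypothetical name; it relies on the standard library's `email.message.Message.get_content_charset()` to read the `charset` parameter, and its ISO 8859-1 fallback simply follows the assumption described above and is not a rule taken from any protocol specification.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


text = "café £5"
payload = text.encode("cp1252")      # bytes as a sender might transmit them

declared = charset_from_content_type("text/plain; charset=windows-1252")
right = payload.decode(declared)     # decoded with the declared encoding
wrong = payload.decode("cp437")      # a receiver guessing the wrong code page

print(right)
print(wrong)                         # ASCII characters survive, the others do not
```

Only the two extended characters are damaged by the wrong guess, which illustrates why mostly-ASCII English text tends to survive misinterpretation better than text in other languages.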

Usage in computer-readable languages

For programming languages and document languages such as C and HTML, the principle of Extended ASCII is important, since it enables many different encodings and therefore many human languages to be supported with little extra programming effort in the software that interprets the computer-readable language files.

The principle of Extended ASCII means that:
* all ASCII bytes (0x00 to 0x7F) have the same meaning in all variants of extended ASCII,
* bytes that are not ASCII bytes are used only for free text and not for tags, keywords, or other features that have special meaning to the interpreting software.

A computer language that supports Extended ASCII can also support UTF-8 without any changes; this was a major factor in UTF-8's popularity.
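A minimal sketch of this principle, using a made-up `key=value` format and a hypothetical `parse_pairs` function: the parser looks only at ASCII delimiter bytes and treats everything else as opaque free text, so the same code handles single-byte code pages and UTF-8 alike. This works for UTF-8 because every byte of a multi-byte UTF-8 sequence has its high bit set and therefore can never be mistaken for an ASCII delimiter.

```python
def parse_pairs(data: bytes) -> dict:
    """Split 'key=value' lines using only ASCII delimiter bytes."""
    pairs = {}
    for line in data.split(b"\n"):
        if b"=" in line:
            key, _, value = line.partition(b"=")
            pairs[key] = value   # the value is opaque free text
    return pairs


source = "city=Málaga\nfood=crème brûlée"
results = {}
for encoding in ("iso-8859-1", "cp437", "utf-8"):
    parsed = parse_pairs(source.encode(encoding))
    # Only the final display step needs to know the actual encoding.
    results[encoding] = {
        k.decode("ascii"): v.decode(encoding) for k, v in parsed.items()
    }
    print(encoding, results[encoding])
```

The parser itself never needs to be told which encoding is in use; only the code that finally displays the values does.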

External links

* Roman Czyborra's Unicode and extended ASCII information pages
* A short page on ASCII, with the OEM 8-bit chart and the ANSI 8-bit chart