Web pages authored using HyperText Markup Language (

HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...

) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in a HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes. In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standard defaults to

Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. It ...

encoding). It was extended to

ISO 10646 ISO is the most common abbreviation for the International Organization for Standardization. ISO or Iso may also refer to: Business and finance * Iso (supermarket), a chain of Danish supermarkets incorporated into the SuperBest chain in 2007 * Iso ...

(which is basically equivalent to Unicode) by . It does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references. The relationship between

Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...

and HTML tends to be a difficult topic for many computer professionals, document authors, and

web Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...

users alike. The accurate representation of text in web pages from different

natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...

s and

writing system A writing system is a method of visually representing verbal communication, based on a script and a set of rules regulating its use. While both writing and speech are useful in conveying messages, writing differs in also being a reliable form ...

s is complicated by the details of

character encoding Character encoding is the process of assigning numbers to Graphics, graphical character (computing), characters, especially the written characters of Language, human language, allowing them to be Data storage, stored, Data communication, transmi ...

markup language Markup language refers to a text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts. Markup is often used to control the display of the document ...

syntax,

font In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a "sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mod ...

, and varying levels of support by

web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used on ...

HTML document characters

Web pages are typically

XHTML Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior ...

documents. Both types of documents consist, at a fundamental level, of

character Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...

s, which are

grapheme In linguistics, a grapheme is the smallest functional unit of a writing system. The word ''grapheme'' is derived and the suffix ''-eme'' by analogy with ''phoneme'' and other names of emic units. The study of graphemes is called ''graphemics' ...

s and grapheme-like units, independent of how they manifest in

computer storage Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a compute ...

systems and

network Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematics ...

s. An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML ''document character set'' : a character repertoire wherein each character is assigned a unique, non-negative integer ''code point''. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by

and ISO/IEC 10646: the

Universal Character Set The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, ''Information technology — Universal Coded Character Set (UCS)'' (plus amendments to that standard), whi ...

(UCS). Like HTML documents, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an

XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable ...

document, which, while not having an explicit "document character" layer of

abstraction Abstraction in its main sense is a conceptual process wherein general rules and concepts are derived from the usage and classification of specific examples, literal ("real" or "concrete") signifiers, first principles, or other methods. "An abstr ...

, nevertheless relies upon a similar definition of permissible characters that cover most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author. Regardless of whether the document is HTML or XHTML, when stored on a

file system In computing, file system or filesystem (often abbreviated to fs) is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one larg ...

or transmitted over a network, the document's characters are ''encoded'' as a sequence of

bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...

octet Octet may refer to: Music * Octet (music), ensemble consisting of eight instruments or voices, or composition written for such an ensemble ** String octet, a piece of music written for eight string instruments *** Octet (Mendelssohn), 1825 compos ...

s (''

byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit ...

s'') according to a particular character encoding. This encoding may either be a

Unicode Transformation Format Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...

, like

UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...

, that can directly encode any Unicode character, or a legacy encoding, like

, that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of

numeric character references A numeric character reference (NCR) is a common markup (computer programming), markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of character (computing), characters that, in turn ...

. For example, ☺ (☺) is used to indicate a smiling face character in the Unicode character set.

Character encoding

In order to support all Unicode characters without resorting to numeric character references, a web page must have an encoding covering all of Unicode. The most popular is

, where the

ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...

characters, such as English letters, digits, and some other common characters are preserved unchanged against ASCII. This makes HTML code (such as <br> and </div>) unchanged compared to ASCII. Characters outside the ASCII range are stored in 2–4 bytes. It is also possible to use

UTF-16 UTF-16 (16-bit computing, 16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variab ...

where most characters are stored as two bytes with varying

endianness In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most sig ...

, which is supported by modern browsers but less commonly used.

Numeric character references

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a

numeric character reference A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSgml, XML ...

: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a

decimal The decimal numeral system (also called the base-ten positional numeral system and denary or decanary) is the standard system for denoting integer and non-integer numbers. It is the extension to non-integer numbers of the Hindu–Arabic numeral ...

number for the Unicode code point, or a

hexadecimal In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, hexa ...

number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet. The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example 合 instead of 合).

Named character entities

In HTML 4, there is a standard set of 252 named ''character entities'' for characters - some common, some obscure - that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers. Character entities can be included in an HTML document via the use of ''entity references'', which take the form &EntityName;, where EntityName is the name of the entity. For example, —, much like — or —, represents : the

em dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen b ...

character "—" even if the character encoding used doesn't contain that character. For the full list, see:

List of XML and HTML character entity references In SGML, HTML and XML documents, the logical constructs known as ''character data'' and ''attribute values'' consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series ...

Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used.

Encoding information

When a document is transmitted via a

MIME Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message ...

message or a transport that uses MIME content types such as an

HTTP The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, ...

response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=UTF-8. Other external means of declaring encoding are permitted but rarely used. If the document uses a Unicode encoding, the encoding info might also be present in the form of a

byte order mark The byte order mark (BOM) is a particular usage of the special Unicode character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or endianness, of th ...

(BOM). Finally, the encoding can be declared via the HTML syntax. For the text/html serialisation then, as long as the page is encoded in an extension of

(such as

, and thus, not if the page is using

), a meta element, like <meta http-equiv="content-type" content="text/html; charset=UTF-8"> or (starting with

HTML5 HTML5 is a markup language used for structuring and presenting content on the World Wide Web. It is the fifth and final major HTML version that is a World Wide Web Consortium (W3C) recommendation. The current specification is known as the HTML ...

) <meta charset="UTF-8"> can be used. For HTML pages serialized as XML, then declaration options is to either rely on the encoding default (which for XML documents is UTF-8), or to use an XML encoding declaration. The meta attribute plays no role in HTML served as XML.

Encoding defaults

An encoding default applies when there is no external or internal encoding declaration and also no byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is: for HTML pages serialized as text/html) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be

. For Cyrillic alphabet locales, the default is typically

Windows-1251 Windows-1251 is an 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Ukrainian, Belarusian, Bulgarian, Serbian Cyrillic, Macedonian and other languages. On the web, it is the second most-used si ...

. For a browser from a location where ''legacy'' multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.

Encoding trends

Because of the legacy of 8-bit text representations in

programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...

s and

operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also in ...

s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration affects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), is also a reason for this editor attitude. Another factor contributing in the same direction, is the arrival of UTF-8 which greatly diminishes the need for other encodings, and thus modern editors tends to default, as recommended by the HTML5 specification, to UTF-8.

Byte order mark/Unicode sniffing

For both serializations of HTML (content-type "text/html" and content/type "application/xhtml+xml"), the byte order mark (BOM) is an effective way to transmit encoding information within an HTML document. For UTF-8, the BOM is optional, while it is a must for the UTF-16 and the UTF-32 encodings. (Note: UTF-16 and UTF-32 without the BOM are formally known under different names, they are different encodings, and thus needs some form of encoding declaration – see UTF-16BE,

UTF-16LE UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as cod ...

, UTF-32LE and

UTF-32BE UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode cod ...

.) The use of the BOM character (U+FEFF) means that the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte stream to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively. No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications. In most circumstances, the byte-order mark character is handled by editing applications separately from the other characters so there is little risk of an author removing or otherwise changing the byte order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.

Encoding overriding

Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding auto-detection algorithm that works in concert with or ''in the case of the BOM and in case of HTML served as XML'' against the manual override. For HTML documents which are text/html serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari for both XML and text/html serializations do not permit the encoding to be overridden whenever the page includes the BOM. For HTML documents serialized with the preferred XML label application/xhtml+xml, manual encoding override is not permitted. To override the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox, abide to this rule, whereas the bulk of the other common browsers that support HTML as XML, such as Webkit browsers (Chrome/Safari) do allow the encoding of XHTML documents to be manually overridden.

Web browser support

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points: Some web browsers, such as

Mozilla Firefox Mozilla Firefox, or simply Firefox, is a free and open-source web browser developed by the Mozilla Foundation and its subsidiary, the Mozilla Corporation. It uses the Gecko rendering engine to display web pages, which implements current and a ...

Opera Opera is a form of theatre in which music is a fundamental component and dramatic roles are taken by singers. Such a "work" (the literal translation of the Italian word "opera") is typically a collaboration between a composer and a librett ...

Safari A safari (; ) is an overland journey to observe wild animals, especially in eastern or southern Africa. The so-called "Big Five" game animals of Africa – lion, leopard, rhinoceros, elephant, and Cape buffalo – particularly form an importa ...

and

Internet Explorer Internet Explorer (formerly Microsoft Internet Explorer and Windows Internet Explorer, commonly abbreviated IE or MSIE) is a series of graphical user interface, graphical web browsers developed by Microsoft which was used in the Microsoft Wind ...

(from version 7 on), are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of

Unicode block A Unicode block is one of several contiguous ranges of numeric character codes (code points) of the Unicode character set that are defined by the Unicode Consortium for administrative and documentation purposes. Typically, proposals such as the ad ...

s, as long as appropriate

fonts In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a "sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mode ...

are present in the

. Older browsers, such as

Netscape Navigator Netscape Navigator was a web browser, and the original browser of the Netscape line, from versions 1 to 4.08, and 9.x. It was the flagship product of the Netscape Communications Corp and was the dominant web browser in terms of usage share in ...

4.77 and

Internet Explorer 6 Microsoft Internet Explorer 6 (IE6) is a graphical web browser developed by Microsoft for Windows operating systems. Released on August 24, 2001, it is the sixth, and by now discontinued, version of Internet Explorer and the successor to Internet ...

, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they ''will'' display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others. For displaying characters outside the

Basic Multilingual Plane In the Unicode standard, a plane is a continuous group of 65,536 (216) code points. There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–1016 of the first two positions in six position hexadecimal ...

, such as the Gothic letter faihu, which is a variant of the runic letter fehu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.

Frequency of usage

According to internal data from

Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...

's web index, in December 2007 the

Unicode encoding became the most frequently used encoding on web pages, overtaking both

(US) and 8859-1/

1252 Year 1252 ( MCCLII) was a leap year starting on Monday (link will display the full calendar) of the Julian calendar. Events By place Europe * April 6 – Saint Peter of Verona is assassinated by Carino of Balsamo. * May 15 – P ...

(Western European).

Mark Davis Mark Davis may refer to: Entertainers *Mark Davis (talk show host), American radio talk show host * Mark Jonathan Davis (born 1965), American actor/singer and creator of Richard Cheese *Mark Davis, American bassist and founding member for the band ...

Moving to Unicode 5.1
Official Google blog, 5 May 2008

References

External links

Unicode in XML and other Markup Languages
- a W3C & Unicode Consortium joint publication that describes issues and provides guidelines relating to Unicode in markup languages
Latin-1"Special"
an
Mathematical, Greek and Symbolic
named character entity definitions for HTML 4.01
UnicodeMap.org
- Browse Unicode characters, ranges, and other information
SIL's freeware fonts, editors and documentationAlan Wood’s Unicode Resources
- Unicode fonts and information. *http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm The International Phonetic Alphabet in Unicode *http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs *http://www.unicode.org/charts/ Unicode character charts; hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
Table of Unicode characters from 1 to 65535
- shows how they look in one's browser

* ttp://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode- how to fix display problems
w3.org via web.archive.org
- Original HTML5 Citation Reference saved via Wayback Machine {{Unicode navigation HTML Unicode