Character encoding detection

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific metadata, such as an HTTP header, is either not available or is assumed to be untrustworthy. The algorithm usually involves statistical analysis of byte patterns, such as the frequency distribution of trigraphs of the various languages encoded in each code page to be detected; such statistical analysis can also be used to perform language detection. This process is not foolproof because it depends on statistical data. In general, incorrect charset detection leads to mojibake.
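
As a rough illustration of the statistical approach, the sketch below (in Python) scores a byte sequence against per-code-page trigraph frequency tables. The table entries, the floor probability, and the function names are hypothetical stand-ins; a real detector would train such tables from large samples of text known to be in each encoding.

import math
from collections import Counter

# Hypothetical trigraph frequency tables: candidate code page -> relative
# frequency of common 3-byte sequences. The values shown are made up for
# illustration only.
TRIGRAPH_FREQUENCIES = {
    "windows-1252": {b"th ": 0.012, b" th": 0.011, b"he ": 0.010},
    "windows-1251": {b"\xe3\xee ": 0.008, b"\xed\xe0 ": 0.007},  # "go ", "na " in Cyrillic
}

def trigraph_score(data, table):
    # Sum of log-probabilities of every overlapping 3-byte sequence;
    # trigraphs not in the table get a small floor probability.
    floor = 1e-9
    counts = Counter(data[i:i + 3] for i in range(len(data) - 2))
    return sum(n * math.log(table.get(tri, floor)) for tri, n in counts.items())

def guess_code_page(data):
    # Return the candidate whose trigraph model explains the bytes best.
    return max(TRIGRAPH_FREQUENCIES,
               key=lambda cp: trigraph_score(data, TRIGRAPH_FREQUENCIES[cp]))
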
One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is ''extremely'' unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common for web sites in UTF-8 containing the name of the German city München to be shown as MÃ¼nchen, because the code decided on an ISO-8859 encoding before (or without) even testing whether the text was UTF-8.
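
A UTF-8 validity test is simple to state; a minimal sketch using Python's built-in decoder, with the pure-ASCII case separated out since it carries no evidence either way, might look like this:

def looks_like_utf8(data):
    # Invalid byte sequences are common in UTF-8, so surviving a strict
    # decode is strong evidence that the text really is UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def detect_utf8_first(data):
    # Running the UTF-8 test before any other heuristic avoids the
    # "MÃ¼nchen"-style misdetection described above.
    if not any(byte & 0x80 for byte in data):
        return "ASCII"  # no high-bit bytes: compatible with UTF-8 and ISO-8859 alike
    return "UTF-8" if looks_like_utf8(data) else None
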
UTF-16 is fairly reliable to detect due to the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and the large number of NUL bytes all at even or odd locations. Common characters ''must'' be checked for; relying on a test that the text is valid UTF-16 fails: the Windows operating system would mis-detect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the bytes form assigned Unicode characters in UTF-16.
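
A sketch of this positional heuristic is shown below; the 40%/5% cut-offs are illustrative assumptions, not values taken from any particular detector.

def count_common_chars(data, units, endianness):
    # Count newline (U+000A) and space (U+0020) code units, which should
    # be frequent in genuine text of the guessed byte order.
    lo, hi = (0, 1) if endianness == "UTF-16LE" else (1, 0)
    return sum(1 for i in range(0, 2 * units, 2)
               if data[i + lo] in (0x0A, 0x20) and data[i + hi] == 0)

def guess_utf16_byte_order(data):
    # For mostly Latin-range text in UTF-16, NUL bytes cluster at even
    # offsets (big-endian) or at odd offsets (little-endian).
    units = len(data) // 2
    if units == 0:
        return None
    even_nuls = sum(1 for i in range(0, 2 * units, 2) if data[i] == 0)
    odd_nuls = sum(1 for i in range(1, 2 * units, 2) if data[i] == 0)
    if even_nuls > 0.4 * units and odd_nuls < 0.05 * units:
        candidate = "UTF-16BE"
    elif odd_nuls > 0.4 * units and even_nuls < 0.05 * units:
        candidate = "UTF-16LE"
    else:
        return None
    # Require some spaces or newlines so that arbitrary binary data full
    # of NUL bytes is not reported as UTF-16.
    return candidate if count_common_chars(data, units, candidate) > 0 else None
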
Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share an overlap in their lower half with ASCII, and in which all arrangements of bytes are valid. There is no technical way to tell these encodings apart; recognising them relies on identifying language features, such as letter frequencies or spellings.
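
One way to exploit such language features is to decode the data under each candidate ISO-8859 part and score how plausible the result looks for an associated language. In the sketch below the letter sets are tiny illustrative assumptions; a real detector would use full letter-frequency or word models.

EXPECTED_LETTERS = {
    ("iso-8859-1", "German"): set("äöüßÄÖÜ"),
    ("iso-8859-2", "Polish"): set("ąćęłńóśźż"),
}

def guess_iso_8859_part(data):
    # Every byte sequence decodes under each of these ISO-8859 parts, so
    # the only evidence is which decoding yields plausible text.
    def plausibility(candidate):
        encoding, _language = candidate
        text = data.decode(encoding)
        return sum(1 for ch in text if ch in EXPECTED_LETTERS[candidate])
    encoding, _language = max(EXPECTED_LETTERS, key=plausibility)
    return encoding

# Example: "Grüße aus München" encoded in ISO-8859-1 scores for German
# (ü, ß) and not for Polish, so iso-8859-1 is returned.
print(guess_iso_8859_part("Grüße aus München".encode("iso-8859-1")))
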
Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding. HTML documents served across the web by HTTP should have their encoding stated out-of-band using the Content-Type header:

Content-Type: text/html;charset=UTF-8

An isolated HTML document, such as one being edited as a file on disk, may imply such a header by a meta tag within the file:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

or with a new meta type in HTML5:

<meta charset="utf-8">

If the document is Unicode, then some UTF encodings explicitly label the document with an embedded initial byte order mark (BOM).
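
Checking for such a BOM is the one step that is a plain byte comparison rather than a statistical guess; a minimal sketch:

def detect_bom(data):
    # The UTF-32 marks must be tested before the UTF-16 marks because
    # they share the same two-byte prefix.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    return None  # no BOM: fall back to heuristics or out-of-band metadata
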


See also

* International Components for Unicode - a library that can perform charset detection (a brief usage sketch follows this list)
* Language identification
* Content sniffing
* Browser sniffing, a similar heuristic technique for determining the capabilities of a web browser before serving content to it
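
The sketch below uses the third-party Python package chardet, a descendant of the Mozilla detectors listed under External links, rather than ICU itself; ICU exposes a comparable CharsetDetector API. It assumes chardet is installed (pip install chardet).

import chardet

raw = "Unter den Linden, München".encode("iso-8859-1")
result = chardet.detect(raw)
# result is a dict with 'encoding' and 'confidence' keys (newer versions
# also report 'language'); guesses can easily be wrong for short samples.
print(result)
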



External links


* IMultiLanguage2::DetectInputCodepage
* Reference for ''cpdetector'' charset detection
* Java port of Mozilla Charset Detectors
* Delphi/Pascal port of Mozilla Charset Detectors
* ''uchardet'', C++ fork of Mozilla Charset Detectors; includes a Bash command-line tool
* C# port of Mozilla Charset Detectors
* HEBCI, a technique for detecting the character set used in form submissions
* Frequency distributions of English trigraphs