HOME

TheInfoList



OR:

Character encoding detection, charset detection, or code page detection is the process of
heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
ally guessing the
character encoding Character encoding is the process of assigning numbers to Graphics, graphical character (computing), characters, especially the written characters of Language, human language, allowing them to be Data storage, stored, Data communication, transmi ...
of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
, such as a HTTP header is either not available, or is assumed to be untrustworthy. This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform
language detection In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, s ...
. This process is not foolproof because it depends on statistical data. In general, incorrect charset detection leads to
mojibake Mojibake ( ja, 文字化け; , "character transformation") is the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, ofte ...
. One of the few cases where charset detection works reliably is detecting
UTF-8 UTF-8 is a variable-width encoding, variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit'' ...
. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is ''extremely'' unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common that web sites in UTF-8 containing the name of the German city
München Munich ( ; german: München ; bar, Minga ) is the capital and most populous city of the German state of Bavaria. With a population of 1,558,395 inhabitants as of 31 July 2020, it is the third-largest city in Germany, after Berlin and Ha ...
were shown as München, due to the code deciding it was an
ISO-8859 ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-1 ...
encoding before (or without) even testing to see if it was UTF-8.
UTF-16 UTF-16 (16-bit computing, 16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variab ...
is fairly reliable to detect due to the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and large numbers of NUL bytes all at even or odd locations. Common characters ''must'' be checked for, relying on a test to see that the text is valid UTF-16 fails: the
Windows operating system Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ser ...
would mis-detect the phrase "
Bush hid the facts Bush hid the facts is a common name for a bug present in some versions of Microsoft Windows, which causes text encoded in ASCII to be interpreted as if it were UTF-16LE, resulting in garbled text. When the string "Bush hid the facts", without q ...
" (without a newline) in ASCII as Chinese
UTF-16LE UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as cod ...
, since all the bytes for assigned Unicode characters in UTF-16. Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share an overlap in their lower half with
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
and all arrangements of bytes are valid. There is no technical way to tell these encodings apart and recognising them relies on identifying language features, such as letter frequencies or spellings. Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding. HTML documents served across the web by
HTTP The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, ...
should have their encoding stated
out-of-band Out-of-band activity is activity outside a defined telecommunications frequency band, or, metaphorically, outside of any primary communication channel. Protection from falsing is among its purposes. Examples General usage * Out-of-band agreement ...
using the header. Content-Type: text/html;charset=UTF-8 An isolated HTML document, such as one being edited as a file on disk, may imply such a header by a meta tag within the file: or with a new meta type in HTML5 If the document is Unicode, then some UTF encodings explicitly label the document with an embedded initial
byte order mark The byte order mark (BOM) is a particular usage of the special Unicode character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or endianness, of th ...
(BOM).


See also

*
International Components for Unicode International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environm ...
- A library that can perform charset detection. *
Language identification In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solv ...
*
Content sniffing Content sniffing, also known as media type sniffing or MIME sniffing, is the practice of inspecting the content of a byte stream to attempt to deduce the file format of the data within it. Content sniffing is generally used to compensate for a lack ...
*
Browser sniffing Browser sniffing (also known as browser detection) is a set of techniques used in websites and web applications in order to determine the web browser a visitor is using, and to serve browser-appropriate content to the visitor. It is also used to de ...
, a similar heuristic technique for determining the capabilities of a web browser, before serving content to it.


References


External links


IMultiLanguage2::DetectInputCodepage



Reference for ''cpdetector'' charset detection



Java port of Mozilla Charset Detectors

Delphi/Pascal port of Mozilla Charset Detectors

''uchardet'', C++ fork of Mozilla Charset Detectors; includes Bash command-line tool

C# port of Mozilla Charset Detectors

HEBCI, a technique for detecting the character set used in form submissions

Frequency distributions of English trigraphs
{{character encoding, state=collapsed Character encoding