Popularity of text encodings

There are many methods of translating text into digital data, such as Baudot code, EBCDIC, and UTF-8. Their relative usage levels can provide insight into their usability, and historical trends can show the progress of new methods. Exact measurements are not possible. Counts of documents differ from counts weighted by actual use or visibility of those documents. The popularity of an encoding varies with the language of the documents, the locale they come from, and their purpose. Text may also be ambiguous as to which encoding it is in: for instance, pure ASCII text is equally valid as ASCII, ISO-8859-1, CP1252, or UTF-8. "Tags" may indicate a document's encoding, but when a tag is incorrect, display software may silently correct it (for instance, the HTML specification says that the tag for ISO-8859-1 should be treated as CP1252), so counts of tags may not be accurate.
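
As a small illustration of the ambiguity described above, the following Python sketch decodes the same bytes under several encodings. It relies only on the codecs shipped with Python's standard library; the sample bytes are made up for the example.

```python
# Pure ASCII bytes decode to the same string under all four encodings,
# so the bytes alone cannot reveal which encoding the author intended.
ascii_bytes = b"plain text"
decoded = {
    enc: ascii_bytes.decode(enc)
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}
all_same = len(set(decoded.values())) == 1

# Byte 0x93 shows why a mislabeled tag matters: ISO-8859-1 maps it to a
# C1 control character, while CP1252 maps it to a left double quotation mark.
as_latin1 = b"\x93".decode("iso-8859-1")
as_cp1252 = b"\x93".decode("cp1252")

print(all_same)
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

A counter that trusts the declared tag would file a page containing byte 0x93 under ISO-8859-1, even though software following the HTML rule would display it as CP1252.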


Popularity on the World Wide Web

UTF-8 has been the most common encoding for the World Wide Web since 2008. UTF-8 accounts for on average 97.8% (previously up to 98.0%) of all web pages, 99.1% of the top 10,000 pages, and 984 of the top 1,000 highest-ranked web pages; the next most popular encoding, ISO-8859-1, is used by 15 of those sites. Although many pages use only ASCII characters to display content, few websites now declare their encoding to be ASCII rather than UTF-8. Virtually all countries, and over 97% of the tracked languages, have 95% or more use of UTF-8 encodings on the web. The major alternative encodings are described below. The second-most popular encoding varies by locale and is typically more efficient for the associated language. One such encoding is the Chinese GB 18030 standard, which is a full Unicode Transformation Format; still, 95.0% of websites in China and its territories use UTF-8, with GB 18030 (effectively) the next most popular encoding.
Big5 is another popular Chinese encoding (for traditional characters); it is the next most popular encoding in Taiwan, where UTF-8 use is 96.0%, and it is also second, though less used, elsewhere, such as in Hong Kong. The single-byte Windows-1251 is twice as efficient for the Cyrillic script, yet 93.7% of Russian websites use UTF-8. Greek and Hebrew encodings, for example, are also twice as efficient, yet UTF-8 has over 99% use for those languages. Japanese-language websites have relatively low UTF-8 use compared with most other countries, at 95.0%, followed by the legacy Shift JIS and then EUC-JP encodings. South Korea has 95.3% UTF-8 use, with the rest of its websites mainly using EUC-KR, which is more efficient for Korean text. With the exception of GB 18030 (and UTF-16 and UTF-8), these other (legacy) encodings were designed for specific languages and do not support all Unicode characters.
Breton has the lowest UTF-8 use on the web of any tracked language, with 87% use. Well over a third of the languages tracked have 100.0% use of UTF-8 on the web, such as Vietnamese, Marathi, Telugu, Tamil, Javanese, Pañjābī/Punjabi, Gujarati, Farsi/Persian, Hausa, Pashto, Kannada, Lao, Kurdish, Tagalog, Somali, Khmer/Cambodian, isiZulu/Zulu, Turkmen, and Tajik (which has 4 different scripts), as well as many of the languages with the fewest speakers (often with their own scripts), such as Armenian, Mongolian (which has a top-to-bottom script), Maldivian (Thaana), Greenlandic (Kalaallisut), and also sign languages.
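
The efficiency claims above can be checked with a rough sketch in Python. The sample words and the helper names `encoded_sizes` and `can_encode` are invented for this illustration; the codec names are those provided by Python's standard library. The measurement says nothing about real web pages, which also contain ASCII markup that takes one byte per character in every encoding shown.

```python
# Short sample words, chosen only for illustration.
samples = {
    "Russian": ("Привет", "cp1251"),
    "Greek": ("Γεια", "iso-8859-7"),
    # Hebrew "shalom", written with escapes to avoid right-to-left display issues.
    "Hebrew": ("\u05e9\u05dc\u05d5\u05dd", "iso-8859-8"),
    "Korean": ("한국어", "euc-kr"),
}


def encoded_sizes(text, legacy):
    """Return (bytes in the legacy encoding, bytes in UTF-8) for text."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))


def can_encode(text, encoding):
    """Report whether text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sizes = {
    lang: encoded_sizes(text, legacy)
    for lang, (text, legacy) in samples.items()
}
for lang, (legacy_len, utf8_len) in sizes.items():
    print(f"{lang}: legacy={legacy_len} bytes, UTF-8={utf8_len} bytes")

# GB 18030 covers all of Unicode; the language-specific encodings do not.
emoji = "\U0001F600"
print(can_encode(emoji, "gb18030"), can_encode(emoji, "cp1251"))
```

For these samples the single-byte encodings need half as many bytes as UTF-8, and EUC-KR needs two bytes per syllable where UTF-8 needs three.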


Popularity for local text files

Local storage on computers makes considerably more use of "legacy" single-byte encodings than the web does. Attempts to update to UTF-8 have been blocked by editors that do not display or write UTF-8 unless the first character in a file is a byte order mark (BOM), making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output. UTF-16 files are also fairly common on Windows, but not on other systems.
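
The "ignore on input, add on output" behaviour mentioned above can be sketched with Python's standard `utf-8-sig` codec. This is only one way a program might cope with BOM-prefixed files, not a description of how any particular editor works; the sample text is made up.

```python
import codecs

text = "naïve café"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a U+FEFF character at the start.
kept = with_bom.decode("utf-8")

# "utf-8-sig" drops a leading BOM if present and is harmless otherwise.
stripped = with_bom.decode("utf-8-sig")
unchanged = without_bom.decode("utf-8-sig")

# Encoding with "utf-8-sig" writes the BOM that BOM-dependent editors expect.
rewritten = text.encode("utf-8-sig")

print(ascii(kept[0]))
print(stripped == text, unchanged == text)
print(rewritten == with_bom)
```

Software that decodes with plain `utf-8` ends up with a stray U+FEFF at the start of the text, which is the incompatibility the paragraph describes.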


Popularity internally in software

In the memory of a computer program, usage of UTF-16 is very common, particularly in Windows but also in JavaScript, Qt, and many other cross-platform software libraries. Compatibility with the Windows API is a major reason for this. At one time it was believed by many (and is still believed today by some) that having fixed-size code units offers computational advantages, which led many systems, in particular Windows, to use the fixed-size UCS-2 with two bytes per character. This belief is false: strings are almost never randomly accessed, and sequential access is the same speed. In addition, even UCS-2 was not "fixed size" if combining characters are considered, and when Unicode exceeded 65,536 code points it had to be replaced with the variable-width UTF-16 anyway. Recently it has become clear that the overhead of translating from and to UTF-8 on input and output, and of dealing with potential encoding errors in the input UTF-8, vastly overwhelms any savings UTF-16 could offer. So newer software systems are starting to use UTF-8.
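
The point that UTF-16 is not fixed-size can be seen in a minimal Python sketch. The helper `utf16_code_units` is a name invented here; it counts 16-bit code units by encoding to UTF-16 without a BOM.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"               # inside the first 65,536 code points
astral_char = "\U0001F600"   # beyond them, so UTF-16 needs a surrogate pair
combined = "e\u0301"         # "e" followed by a combining acute accent

print(utf16_code_units(bmp_char))                  # 1
print(utf16_code_units(astral_char))               # 2
print(len(combined), utf16_code_units(combined))   # 2 2
```

The last line shows the combining-character case: what a reader sees as one accented letter occupies two code points, and therefore two code units, even when every code point fits in 16 bits.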
International Components for Unicode (ICU) has historically used only UTF-16, and still does for Java, while it now supports UTF-8 for C/C++ (and for other languages indirectly); Microsoft, for example, uses it that way. UTF-8 is supported as the "Default Charset", including the correct handling of "illegal UTF-8". The default string primitive in newer programming languages, such as Go, Julia, Rust, and Swift 5, assumes UTF-8 encoding. PyPy also uses UTF-8 for its strings, and Python is looking into storing all strings as UTF-8. Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.
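
Handling invalid ("illegal") UTF-8 input is one of the costs mentioned above. The sketch below shows three policies available in Python's standard codecs; it is not meant to describe how ICU or any of the named languages behave, and the sample bytes are made up.

```python
# Valid UTF-8 for "café", then a stray 0xFF byte that can never occur in UTF-8.
raw = b"caf\xc3\xa9 \xff ok"

# Policy 1: strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Policy 2: substitute U+FFFD (the replacement character) for the bad byte.
lenient = raw.decode("utf-8", errors="replace")

# Policy 3: smuggle the bad byte through so the original bytes can be restored.
escaped = raw.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok)
print(ascii(lenient))
print(round_trip == raw)
```

Which policy is appropriate depends on whether the program must preserve the original bytes (for example, file names) or only needs displayable text.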

