Garbage characters

Mojibake (Japanese: 文字化け, "character transformation") is the garbled text that results from text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system. This display may include the generic replacement character ("�") in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This happens either because of differing constant-length encodings (as in Asian 16-bit encodings versus European 8-bit encodings) or because of the use of variable-length encodings (notably UTF-8 and UTF-16).

Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake. Symptoms of this failed rendering include blocks with the code point displayed in hexadecimal or the generic replacement character. Importantly, these replacements are ''valid'' and are the result of correct error handling by the software.
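The mechanism can be reproduced in a few lines. The following is a minimal sketch in Python; the sample string and variable names are illustrative choices, not taken from any particular system. It shows that the stored bytes are unchanged and only the decoder's assumption differs.

```python
# Illustrative sample text; any string with non-ASCII letters would do.
original = "Zürich café"
data = original.encode("utf-8")

# Decoding with the intended encoding round-trips cleanly.
round_trip = data.decode("utf-8")

# Decoding the same bytes as Latin-1 turns every two-byte UTF-8 sequence
# into two unrelated characters: this is mojibake.
garbled = data.decode("latin-1")

# The opposite mistake: Latin-1 bytes are not valid UTF-8, so a decoder
# asked to "replace" errors emits U+FFFD, the replacement character.
with_replacement = original.encode("latin-1").decode("utf-8", errors="replace")
```

Here `garbled` is two characters longer than `original`, because "ü" and "é" each occupy two bytes in UTF-8 and each byte is shown as a separate Latin-1 character.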


Causes

To correctly reproduce the original text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved. As mojibake is a mismatch between the two, it can be produced either by manipulating the data itself or by merely relabeling it.

Mojibake is often seen with text data that have been tagged with a wrong encoding; the data may not even be tagged at all, but moved between computers with different default encodings. A major source of trouble is communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data.

The differing default settings between computers are due in part to differing deployments of Unicode among operating system families, and in part to the legacy encodings' specializations for the different writing systems of human languages. Whereas Linux distributions mostly switched to UTF-8 in 2004, Microsoft Windows generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages.

For some writing systems, Japanese being one example, several encodings have historically been employed, causing users to see mojibake relatively often. As a Japanese example, the word ''mojibake'' "文字化け" stored as EUC-JP might be incorrectly displayed as "ハクサ�ス、ア", "ハクサ嵂ス、ア" (MS-932), or "ハクサ郾ス、ア" (Shift JIS-2004). The same text stored as UTF-8 is displayed as "譁�蟄怜喧縺�" if interpreted as Shift JIS. This is further exacerbated if other locales are involved: the same UTF-8 text appears as "æ–‡å—åŒ–ã‘" in software that assumes text to be in the Windows-1252 or ISO-8859-1 encodings, usually labelled Western, or (for example) as "鏂囧瓧鍖栥亼" if interpreted as being in a GBK (Mainland China) locale.
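The Japanese examples above can be explored with a short sketch. The helper name `misread` is hypothetical, and the codec names are those of Python's standard codec registry. Exact output for invalid byte sequences depends on each decoder's error handling, so it may not match the strings quoted above character for character.

```python
text = "文字化け"

def misread(text: str, stored_as: str, read_as: str) -> str:
    """Encode with one codec, then decode with another.

    Bytes that are invalid in the second codec become U+FFFD.
    """
    return text.encode(stored_as).decode(read_as, errors="replace")

# Pairs of (encoding the text was stored in, encoding the reader assumed).
pairs = [
    ("euc_jp", "shift_jis"),
    ("euc_jp", "cp932"),
    ("utf-8", "shift_jis"),
    ("utf-8", "gbk"),
]
results = {pair: misread(text, *pair) for pair in pairs}
```

Printing `results` on a terminal that can display CJK text shows one differently garbled string per pair, all derived from the same four characters.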


Underspecification

If the encoding is not specified, it is up to the software to decide it by other means. Depending on the type of software, the typical solution is either configuration or charset detection heuristics. Both are prone to mis-prediction in not-so-uncommon scenarios.

The encoding of text files is affected by the locale setting, which depends on the user's language, the brand of operating system, and possibly other conditions. Therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting, or even from differently localized software within the same system. For Unicode, one solution is to use a byte order mark, but for source code and other machine-readable text, many parsers do not tolerate this. Another is storing the encoding as metadata in the file system. File systems that support extended file attributes can store this as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software.

While a few encodings are easy to detect, in particular UTF-8, there are many that are hard to distinguish (see charset detection). A web browser may not be able to distinguish a page coded in EUC-JP from another in Shift-JIS if the coding scheme is not assigned explicitly using HTTP headers sent along with the documents, or using the HTML document's meta tags, which substitute for missing HTTP headers if the server cannot be configured to send the proper ones; see character encodings in HTML.
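As a rough sketch of the byte order mark approach, the following Python checks the start of a byte string against the BOM constants in the standard `codecs` module. The function names and the `cp1252` fallback are assumptions made for illustration; real software would take the fallback from configuration or the locale.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data: bytes):
    """Return a codec name that consumes the BOM, or None if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if there is one, else the (assumed) fallback."""
    encoding = sniff_bom(data) or fallback
    return data.decode(encoding, errors="replace")
```

One known weakness of this ordering is that UTF-16-LE text whose first character is U+0000 would be mistaken for UTF-32-LE; such text is rare enough that the sketch ignores it.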


Mis-specification

Mojibake also occurs when the encoding is wrongly specified. This often happens between encodings that are similar. For example, the Eudora email client for Windows was known to send emails labelled as ISO-8859-1 that were in reality Windows-1252. Windows-1252 contains extra printable characters in the C1 range (the most frequently seen being curved quotation marks and extra dashes) that were not displayed properly in software complying with the ISO standard; this especially affected software running under other operating systems such as Unix.
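The C1-range difference between the two encodings can be seen directly. This is a small sketch with a made-up sample string; it relies on the `cp1252` and `latin-1` codecs of Python's standard library.

```python
# Curly quotes (U+201C, U+201D) and an en dash (U+2013), saved as Windows-1252.
data = "\u201cquoted\u201d \u2013 text".encode("cp1252")

# Correct label: the punctuation comes back intact.
as_cp1252 = data.decode("cp1252")

# Mislabelled as ISO-8859-1: bytes 0x93, 0x94 and 0x96 become invisible
# C1 control characters instead of printable punctuation.
as_latin1 = data.decode("latin-1")
c1_controls = [ch for ch in as_latin1 if 0x80 <= ord(ch) <= 0x9F]
```

All other characters in the sample decode identically under both labels, which is why this kind of mislabelling can go unnoticed until punctuation is involved.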


User oversight

Of the encodings still in common use, many originated from taking ASCII and extending it; as a result, these encodings are partially compatible with each other. Examples of this include Windows-1252 and ISO 8859-1. People may thus mistake the extended encoding they are using for plain ASCII.


Overspecification

When there are layers of protocols, each trying to specify the encoding based on different information, the least certain information may be misleading to the recipient. For example, consider a web server serving a static HTML file over HTTP. The character set may be communicated to the client in any of three ways:
* in the HTTP header. This information can be based on server configuration (for instance, when serving a file off disk) or controlled by the application running on the server (for dynamic websites).
* in the file, as an HTML meta tag (http-equiv or charset) or the encoding attribute of an XML declaration. This is the encoding that the author meant to save the particular file in.
* in the file, as a byte order mark. This is the encoding that the author's editor actually saved it in. Unless an accidental encoding conversion has happened (by opening it in one encoding and saving it in another), this will be correct. It is, however, only available in Unicode encodings such as UTF-8 or UTF-16.
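A tool that wants to flag overspecified documents could compare the three declarations listed above. The sketch below is only an illustration: the function names are hypothetical, and Python's codec registry is used purely as a convenient alias table, which does not match the label handling of web browsers in every case.

```python
import codecs

def normalise(label: str):
    """Map an encoding label to Python's canonical codec name, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

def declared_encodings(http_charset=None, meta_charset=None, bom_encoding=None):
    """Collect whichever of the three declarations are present, normalised."""
    sources = {"http": http_charset, "meta": meta_charset, "bom": bom_encoding}
    return {name: normalise(label) for name, label in sources.items() if label is not None}

def conflicting(declared: dict) -> bool:
    """True when the sources that are present do not all agree."""
    return len(set(declared.values())) > 1
```

For instance, a file saved as UTF-8 with a matching meta tag but served with an ISO-8859-1 header would be reported as conflicting, while "UTF-8" and "utf8" are treated as the same encoding.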


Lack of hardware or software support

Much older hardware is typically designed to support only one character set, which typically cannot be altered. The character table contained within the display firmware is localized to have characters for the country the device is to be sold in, and typically the table differs from country to country. As such, these systems will potentially display mojibake when loading text generated on a system from a different country. Likewise, many early operating systems do not support multiple encoding formats and thus end up displaying mojibake if made to display non-standard text. Early versions of Microsoft Windows and Palm OS, for example, are localized on a per-country basis and only support the encoding standards relevant to the country the localized version is sold in; they display mojibake when opening a file whose text is in an encoding other than the one the OS version was designed to support.


Resolutions

Applications using UTF-8 as a default encoding may achieve a greater degree of interoperability because of its widespread use and backward compatibility with US-ASCII. UTF-8 can also be recognised directly by a simple algorithm, so that well-written software should be able to avoid mixing UTF-8 up with other encodings.

The difficulty of resolving an instance of mojibake varies depending on the application within which it occurs and its causes. Two of the most common applications in which mojibake may occur are web browsers and word processors. Modern browsers and word processors often support a wide array of character encodings. Browsers often allow a user to change their rendering engine's encoding setting on the fly, while word processors allow the user to select the appropriate encoding when opening a file. It may take some trial and error for users to find the correct encoding.

The problem gets more complicated when it occurs in an application that normally does not support a wide range of character encodings, such as a non-Unicode computer game. In this case, the user must change the operating system's encoding settings to match that of the game. However, changing the system-wide encoding settings can also cause mojibake in pre-existing applications. In Windows XP or later, a user also has the option to use Microsoft AppLocale, an application that allows the changing of per-application locale settings. Even so, changing the operating system encoding settings is not possible on earlier operating systems such as Windows 98; to resolve this issue on earlier operating systems, a user would have to use third-party font rendering applications.
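The claim that UTF-8 is easy to recognise can be illustrated by letting a strict decoder do the validation. This is a minimal sketch; the function names and the `cp1252` legacy default are assumptions, and a strict decode stands in for the byte-pattern check a dedicated detector would perform.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form well-formed UTF-8 (pure ASCII also qualifies)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def decode_best_effort(data: bytes, legacy: str = "cp1252") -> str:
    """Prefer UTF-8 when the bytes are valid UTF-8, else use the legacy codec."""
    if is_valid_utf8(data):
        return data.decode("utf-8")
    return data.decode(legacy, errors="replace")
```

Text in a legacy 8-bit encoding that contains accented letters is very unlikely to pass the UTF-8 check by accident, which is why trying UTF-8 first is a common design choice. The reverse does not hold: the check cannot tell one legacy encoding from another.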


Problems in different writing systems


English

Mojibake in English texts generally occurs in punctuation, such as em dashes (—), en dashes (–), and curly quotes (“, ”, ‘, ’), but rarely in character text, since most encodings agree with ASCII on the encoding of the English alphabet. For example, the pound sign "£" will appear as "Â£" if it was encoded by the sender as UTF-8 but interpreted by the recipient as CP1252 or ISO 8859-1. If iterated using CP1252, this can lead to "Ã‚Â£", "Ãƒâ€šÃ‚Â£", "ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚Â£", etc.

Some computers did, in older eras, have vendor-specific encodings which caused mismatches for English text as well. Commodore brand 8-bit computers used PETSCII encoding, particularly notable for inverting the upper and lower case compared to standard ASCII. PETSCII printers worked fine on other computers of the era, but flipped the case of all letters. IBM mainframes use the EBCDIC encoding, which does not match ASCII at all.
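The iterated pound-sign corruption, and its reversal, can be sketched as follows. The function names are hypothetical. Python's `cp1252` codec rejects the five byte values that Windows-1252 leaves unassigned, so the sketch works for this example but is not a general-purpose repair tool. The last lines use `cp037`, one of several EBCDIC code pages available in Python, to show the mismatch with ASCII.

```python
def garble(text: str, rounds: int = 1) -> str:
    """Encode as UTF-8 and misread as Windows-1252, repeated `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text

def ungarble(text: str, rounds: int = 1) -> str:
    """Undo `garble` by reversing each round."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text

once = garble("£")       # 'Â£'
twice = garble("£", 2)   # 'Ã‚Â£'

# EBCDIC (code page 037 here) places letters at entirely different byte
# values from ASCII: "A" is 0xC1 rather than 0x41.
ebcdic_abc = "ABC".encode("cp037")
ascii_abc = "ABC".encode("ascii")
```

Each round more than doubles the length of the damaged portion, so repeated round trips through a misconfigured system quickly make the original character hard to recognise by eye, even though every round remains reversible.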


Other Western European languages

The alphabets of the North Germanic languages, Catalan, Finnish, German, French, Portuguese and Spanish are all extensions of the Latin alphabet. The additional characters are typically the ones that become corrupted, making texts only mildly unreadable with mojibake:
* å, ä, ö in Finnish and Swedish
* à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
* æ, ø, å in Norwegian and Danish
* á, é, ó, ij, è, ë, ï in Dutch
* ä, ö, ü, and ß in German
* á, ð, í, ó, ú, ý, æ, ø in Faroese
* á, ð, é, í, ó, ú, ý, þ, æ, ö in Icelandic
* à, â, ç, è, é, ë, ê, ï, î, ô, ù, û, ü, ÿ, æ, œ in French
* à, è, é, ì, ò, ù in Italian
* á, é, í, ñ, ó, ú, ü, ¡, ¿ in Spanish
* à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
* á, é, í, ó, ú in Irish
* à, è, ì, ò, ù in Scottish Gaelic
* £ in British English
… and their uppercase counterparts, if applicable. These are languages for which the ISO-8859-1 character set (also known as ''Latin 1'' or ''Western'') has been in use. However, ISO-8859-1 has been obsoleted by two competing standards, the backward compatible
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. I ...
, and the slightly altered
ISO-8859-15 ISO/IEC 8859-15:1999, ''Information technology — 8-bit single-byte coded graphic character sets — Part 15: Latin alphabet No. 9'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1999. ...
. Both add the Euro sign € and the French œ, but otherwise any confusion of these three character sets does not create mojibake in these languages. Furthermore, it is always safe to interpret ISO-8859-1 as Windows-1252, and fairly safe to interpret it as ISO-8859-15, in particular with respect to the Euro sign, which replaces the rarely used currency sign (¤). However, with the advent of
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of ...
, mojibake has become more common in certain scenarios, e.g. exchange of text files between
UNIX Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, an ...
and
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ser ...
computers, due to UTF-8's incompatibility with Latin-1 and Windows-1252. But UTF-8 has the ability to be directly recognised by a simple algorithm, so that well written software should be able to avoid mixing UTF-8 up with other encodings, so this was most common when many had software not supporting UTF-8. Most of these languages were supported by MS-DOS default CP437 and other machine default encodings, except ASCII, so problems when buying an operating system version were less common. Windows and MS-DOS are not compatible however. In Swedish, Norwegian, Danish and German, vowels are rarely repeated, and it is usually obvious when one character gets corrupted, e.g. the second letter in "kÃ⁠¤rlek" (', "love"). This way, even though the reader has to guess between å, ä and ö, almost all texts remain legible. Finnish text, on the other hand, does feature repeating vowels in words like ' ("wedding night") which can sometimes render text very hard to read (e.g. ' appears as "hÃ⁠¤Ã⁠¤yÃ⁠¶"). Icelandic and Faroese have ten and eight possibly confounding characters, respectively, which thus can make it more difficult to guess corrupted characters; Icelandic words like ' ("outstanding hospitality") become almost entirely unintelligible when rendered as "þjóðlöð". In German, ' ("letter salad") is a common term for this phenomenon, and in Spanish, ' (literally deformation). Some users transliterate their writing when using a computer, either by omitting the problematic diacritics, or by using digraph replacements (å → aa, ä/æ → ae, ö/ø → oe, ü → ue etc.). Thus, an author might write "ueber" instead of "über", which is standard practice in German when umlauts are not available. The latter practice seems to be better tolerated in the German language sphere than in the Nordic countries. For example, in Norwegian, digraphs are associated with archaic Danish, and may be used jokingly. However, digraphs are useful in communication with other parts of the world. 
As an example, the Norwegian football player Ole Gunnar Solskjær had his name spelled "SOLSKJAER" on his back when he played for Manchester United. An artifact of UTF-8 misinterpreted as ISO-8859-1, "Ring meg nÃ¥" (correctly "Ring meg nå"), was seen in an SMS scam that spread in Norway in June 2014.

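The corruption described above can be reproduced in a few lines. The following is a minimal sketch, not a definitive tool: the function names are made up for illustration, Windows-1252 is assumed as the wrong decoder, and the "simple algorithm" for recognising UTF-8 is approximated by checking whether the bytes decode as UTF-8 at all.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def repair_cp1252_mojibake(garbled: str) -> str:
    """Undo the misreading by reversing the two steps (illustrative helper)."""
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    for word in ("kärlek", "hääyö", "þjóðlöð", "Ring meg nå"):
        garbled = misread_utf8_as_cp1252(word)
        print(word, "->", garbled, "->", repair_cp1252_mojibake(garbled))
```

Text stored in a single-byte encoding such as Latin-1 usually fails the UTF-8 check, because a lone accented byte is not a valid UTF-8 sequence. The round trip is only lossless when every byte has a Windows-1252 meaning; Python's `cp1252` codec rejects the five bytes that Windows-1252 leaves undefined, so some inputs raise an error instead of producing mojibake.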

Central and Eastern European

Users of Central and Eastern European languages can also be affected. Because most computers were not connected to any network during the mid- to late-1980s, there were different character encodings for every language with diacritical characters (see ISO/IEC 8859 and KOI-8), often also varying by operating system.


Hungarian

Hungarian is another affected language, which uses the 26 basic English characters, plus the accented forms á, é, í, ó, ú, ö, ü (all present in the Latin-1 character set), plus the two characters ő and ű, which are not in Latin-1. These two characters can be correctly encoded in Latin-2, Windows-1250 and Unicode. Before Unicode became common in e-mail clients, e-mails containing Hungarian text often had the letters ő and ű corrupted, sometimes to the point of unrecognizability. It is common to respond to an e-mail rendered unreadable (see examples below) by character mangling (referred to as "betűszemét", meaning "letter garbage") with the phrase "Árvíztűrő tükörfúrógép", a nonsense phrase (literally "Flood-resistant mirror-drilling machine") containing all accented characters used in Hungarian.


Examples

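As an illustration of how ő and ű get corrupted, the sketch below encodes the test phrase in ISO 8859-2 (Latin-2) and decodes the same bytes as ISO 8859-1 (Latin-1). The function names are hypothetical and this pairing of encodings is only one of several that occurred in practice.

```python
PANGRAM = "Árvíztűrő tükörfúrógép"


def latin2_read_as_latin1(text: str) -> str:
    """Encode as ISO 8859-2, then decode the same bytes as ISO 8859-1."""
    return text.encode("iso8859_2").decode("iso8859_1")


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(latin2_read_as_latin1(PANGRAM))
    for enc in ("iso8859_1", "iso8859_2", "cp1250", "utf-8"):
        print(enc, encodable("őű", enc))
```

Only ő and ű change (to õ and û), because the other accented Hungarian letters occupy the same byte values in both encodings.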

Polish

Prior to the creation of ISO 8859-2 in 1987, users of various computing platforms used their own character encodings, such as AmigaPL on Amiga, Atari Club on Atari ST and Mazovia, IBM CP852 and Windows CP1250 on IBM PCs. Polish companies selling early DOS computers created their own mutually incompatible ways to encode Polish characters and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA, or Hercules) to provide hardware code pages with the needed glyphs for Polish, arbitrarily located without reference to where other computer sellers had placed them.

The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 succeeded as the "Internet standard" with limited support of the dominant vendors' software (today largely replaced by Unicode). Because of the numerous problems caused by the variety of encodings, even today some users tend to refer to Polish diacritical characters as krzaczki (lit. "little shrubs").


Russian and other Cyrillic alphabets

Mojibake may be colloquially called krakozyabry (кракозябры) in Russian, where the situation was and remains complicated by several systems for encoding Cyrillic. The Soviet Union and the early Russian Federation developed KOI encodings (Код Обмена Информацией, Kod Obmena Informatsiey, which translates to "Code for Information Exchange"). This began with the Cyrillic-only 7-bit KOI7, based on ASCII but with Latin and some other characters replaced with Cyrillic letters. Then came the 8-bit KOI8 encoding, an ASCII extension that encodes Cyrillic letters only with high-bit-set octets corresponding to the 7-bit codes from KOI7. It is for this reason that KOI8 text, even Russian, remains partially readable after stripping the eighth bit, which was considered a major advantage in the age of 8BITMIME-unaware email systems. For example, the words "школа русского языка" (shkola russkogo yazyka), encoded in KOI8 and then passed through the high-bit stripping process, end up rendered as "[KOLA RUSSKOGO QZYKA". Eventually KOI8 gained different flavors for Russian and Bulgarian (KOI8-R), Ukrainian (KOI8-U), Belarusian (KOI8-RU) and even Tajik (KOI8-T).

Meanwhile, in the West, Code page 866 supported Ukrainian and Belarusian as well as Russian and Bulgarian in MS-DOS. For Microsoft Windows, Code Page 1251 added support for Serbian and other Slavic variants of Cyrillic. Most recently, the Unicode encoding includes code points for practically all the characters of all the world's languages, including all Cyrillic characters.

Before Unicode, it was necessary to match text encoding with a font using the same encoding system. Failure to do this produced unreadable gibberish whose specific appearance varied depending on the exact combination of text encoding and font encoding. For example, attempting to view non-Unicode Cyrillic text using a font that is limited to the Latin alphabet, or using the default ("Western") encoding, typically results in text that consists almost entirely of vowels with diacritical marks. (KOI8 "Библиотека" (biblioteka, library) becomes "âÉÂÌÉÏÔÅËÁ".) Using Windows codepage 1251 to view text in KOI8 or vice versa results in garbled text that consists mostly of capital letters (KOI8 and codepage 1251 share the same ASCII region, but KOI8 has uppercase letters in the region where codepage 1251 has lowercase, and vice versa). In general, Cyrillic gibberish is symptomatic of using the wrong Cyrillic font.

During the early years of the Russian sector of the World Wide Web, both KOI8 and codepage 1251 were common. As of 2017, one can still encounter HTML pages in codepage 1251 and, rarely, KOI8 encodings, as well as Unicode. (An estimated 1.7% of all web pages worldwide, all languages included, are encoded in codepage 1251.) Though the HTML standard includes the ability to specify the encoding for any given web page in its source, this is sometimes neglected, forcing the user to switch encodings in the browser manually.

In Bulgarian, mojibake is often called majmunica (маймуница), meaning "monkey's alphabet". In Serbian, it is called đubre (ђубре), meaning "trash". Unlike the former USSR, South Slavs never used something like KOI8, and Code Page 1251 was the dominant Cyrillic encoding there before Unicode. Therefore, these languages experienced fewer encoding incompatibility troubles than Russian. In the 1980s, Bulgarian computers used their own MIK encoding, which is superficially similar to (although incompatible with) CP866.

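The two KOI8 effects described in this section, readability after high-bit stripping and the characteristic gibberish under the wrong code page, can be sketched as follows. This is an illustrative example only: it assumes the KOI8-R variant as implemented by Python's `koi8_r` codec, and the function names are invented for the sketch.

```python
def strip_high_bit(text: str) -> str:
    """Encode as KOI8-R and clear bit 7 of every byte, as a 7-bit mail relay would."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


def view_as(text: str, stored: str, assumed: str) -> str:
    """Encode text in one encoding and decode the bytes in another."""
    return text.encode(stored).decode(assumed)


if __name__ == "__main__":
    # Lowercase Cyrillic maps onto uppercase Latin look-alikes.
    print(strip_high_bit("школа русского языка"))
    # KOI8-R viewed as Latin-1: mostly accented vowels.
    print(view_as("Библиотека", "koi8_r", "latin-1"))
    # KOI8-R viewed as Windows-1251: letter case is swapped.
    print(view_as("Библиотека", "koi8_r", "cp1251"))
```

The case swap in the last line follows from the layout: KOI8-R places lowercase letters where Windows-1251 has uppercase ones, so ordinary lowercase text shows up as mostly capitals.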

Yugoslav languages

Croatian, Bosnian, Serbian (the seceding varieties of the Serbo-Croatian language) and Slovenian add to the basic Latin alphabet the letters š, đ, č, ć, ž, and their capital counterparts Š, Đ, Č, Ć, Ž (officially only č/Č, š/Š and ž/Ž in Slovenian, although the others are used when needed, mostly in foreign names). All of these letters are defined in Latin-2 and Windows-1250, while only some (š, Š, ž, Ž, Đ) exist in the usual OS-default Windows-1252, and are there because of some other languages.

Although mojibake can occur with any of these characters, the letters that are not included in Windows-1252 are much more prone to errors. Thus, even nowadays, "šđčćž ŠĐČĆŽ" is often displayed as "šðèæž ŠÐÈÆŽ", although ð, è, æ, È, Æ are never used in Slavic languages.

When confined to basic ASCII (most user names, for example), common replacements are: š→s, đ→dj, č→c, ć→c, ž→z (capital forms analogously, with Đ→Dj or Đ→DJ depending on word case). All of these replacements introduce ambiguities, so reconstructing the original from such a form is usually done manually if required.

The Windows-1252 encoding is important because the English versions of the Windows operating system are the most widespread, not the localized ones. The reasons for this include a relatively small and fragmented market, which increases the price of high-quality localization; a high degree of software piracy (in turn caused by the high price of software compared to income), which discourages localization efforts; and people preferring English versions of Windows and other software.

The drive to differentiate Croatian from Serbian, Bosnian from Croatian and Serbian, and now even Montenegrin from the other three creates many problems. There are many different localizations, using different standards and of different quality. There are no common translations for the vast amount of computer terminology originating in English. In the end, people use adopted English words ("kompjuter" for "computer", "kompajlirati" for "compile", etc.), and if they are unaccustomed to the translated terms, they may not understand what some option in a menu is supposed to do based on the translated phrase. Therefore, people who understand English, as well as those who are accustomed to English terminology (which is most people, because English terminology is also mostly taught in schools because of these problems), regularly choose the original English versions of non-specialist software.

When Cyrillic script is used (for Macedonian and partially Serbian), the problem is similar to that of other Cyrillic-based scripts.

Newer versions of English Windows allow the code page to be changed (older versions require special English versions with this support), but this setting can be and often was incorrectly set. For example, Windows 98 and Windows Me can be set to most non-right-to-left single-byte code pages including 1250, but only at install time.

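The Windows-1250 versus Windows-1252 confusion and the ASCII fallback described in this section can be sketched as follows. The mapping table and function names are illustrative assumptions rather than any standard API; the replacement pairs themselves are the ones listed in the text.

```python
# ASCII fallbacks as listed in the text; Đ is mapped to "Dj" (title case) here.
ASCII_FALLBACK = {
    "š": "s", "đ": "dj", "č": "c", "ć": "c", "ž": "z",
    "Š": "S", "Đ": "Dj", "Č": "C", "Ć": "C", "Ž": "Z",
}


def cp1250_read_as_cp1252(text: str) -> str:
    """Encode as Windows-1250, then decode the same bytes as Windows-1252."""
    return text.encode("cp1250").decode("cp1252")


def to_ascii(text: str) -> str:
    """Replace each special letter with its ASCII fallback (lossy)."""
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(cp1250_read_as_cp1252("šđčćž ŠĐČĆŽ"))
    print(to_ascii("Đorđe"), to_ascii("čć"))
```

Because č and ć both collapse to "c", the fallback cannot be inverted mechanically, which is why reconstruction is usually done by hand.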

Caucasian languages

The writing systems of certain languages of the Caucasus region, including the scripts of Georgian and Armenian, may produce mojibake. This problem is particularly acute in the case of ArmSCII or ARMSCII, a set of obsolete character encodings for the Armenian alphabet which have been superseded by Unicode standards. ArmSCII is not widely used because of a lack of support in the computer industry. For example, Microsoft Windows does not support it.


Asian encodings

Another type of mojibake occurs when text is erroneously parsed in a multi-byte encoding, such as one of the encodings for East Asian languages. With this kind of mojibake, more than one (typically two) characters are corrupted at once, e.g. "k舐lek" (kärlek) in Swedish, where "är" is parsed as "舐". Compared to the above mojibake, this is harder to read, since letters unrelated to the problematic å, ä or ö are missing, and it is especially problematic for short words starting with å, ä or ö such as "än" (which becomes "舅"). Since two letters are combined, the mojibake also seems more random (over 50 variants compared to the normal three, not counting the rarer capitals). In some rare cases, an entire text string which happens to include a pattern of particular word lengths, such as the sentence "Bush hid the facts", may be misinterpreted.

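Both effects can be sketched in a few lines. This is a hedged illustration: Shift-JIS is assumed as the multi-byte encoding, Latin-1 as the original one, and the function names are made up for the example.

```python
def latin1_read_as_shift_jis(text: str) -> str:
    """Encode as Latin-1, then decode the same bytes as Shift-JIS."""
    return text.encode("latin-1").decode("shift_jis", errors="replace")


def ascii_read_as_utf16le(text: str) -> str:
    """Decode ASCII bytes as UTF-16LE; every two letters fuse into one character."""
    return text.encode("ascii").decode("utf-16-le")


if __name__ == "__main__":
    # The byte for "ä" is taken as a lead byte and swallows the next letter.
    print(latin1_read_as_shift_jis("kärlek"))
    # 18 ASCII bytes become 9 unrelated characters.
    print(ascii_read_as_utf16le("Bush hid the facts"))
```

The second function only works on strings with an even number of bytes, which is part of why only particular sentences trigger the "Bush hid the facts" misdetection.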

Vietnamese

In Vietnamese, the phenomenon is called ''chữ ma'' or ''loạn mã''. It can occur when a computer tries to convert diacritic characters defined in Windows-1258, TCVN3 or VNI to UTF-8. ''Chữ ma'' was common in Vietnam when using Windows XP computers or cheap mobile phones.


Japanese

In Japanese, the same phenomenon is, as mentioned, called mojibake (文字化け). It is a particular problem in Japan due to the numerous different encodings that exist for Japanese text. Alongside Unicode encodings like UTF-8 and UTF-16, there are other standard encodings, such as Shift-JIS (Windows machines) and EUC-JP (UNIX systems). Mojibake, as well as being encountered by Japanese users, is also often encountered by non-Japanese users when attempting to run software written for the Japanese market.

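To see how the coexisting Japanese encodings interfere with one another, the sketch below stores the same word in each encoding and decodes it with every other one. It is only an illustration; the function name and the choice of the three encodings are assumptions, and undecodable bytes are shown as the replacement character.

```python
ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def mojibake_matrix(text: str, encodings=ENCODINGS) -> dict:
    """Map (stored, assumed) encoding pairs to the text a reader would see."""
    result = {}
    for stored in encodings:
        raw = text.encode(stored)
        for assumed in encodings:
            result[(stored, assumed)] = raw.decode(assumed, errors="replace")
    return result


if __name__ == "__main__":
    for (stored, assumed), seen in mojibake_matrix("文字化け").items():
        print(f"{stored:>9} read as {assumed:<9}: {seen}")
```

Only the pairs where the stored and assumed encodings agree reproduce the original word; every other combination yields a different kind of garbage.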

Chinese

In Chinese, the same phenomenon is called ''Luàn mǎ'' (Pinyin; Simplified Chinese: 乱码; Traditional Chinese: 亂碼; meaning 'chaotic code'), and can occur when computerised text is encoded in one Chinese character encoding but is displayed using the wrong encoding. When this occurs, it is often possible to fix the issue by switching the character encoding without loss of data. The situation is complicated because of the existence of several Chinese character encoding systems in use, the most common ones being Unicode, Big5, and Guobiao (with several backward compatible versions), and the possibility of Chinese characters being encoded using Japanese encoding. It is easy to identify the original encoding when ''luanma'' occurs in Guobiao encodings.

An additional problem is caused when encodings are missing characters, which is common with rare or antiquated characters that are still used in personal or place names. Examples of this are Taiwanese politicians Wang Chien-shien (王建煊)'s "煊", Yu Shyi-kun (游錫堃)'s "堃" and singer David Tao (陶喆)'s "喆" missing in Big5, ex-PRC Premier Zhu Rongji (朱镕基)'s "镕" missing in GB 2312, and the copyright symbol "©" missing in GBK. Newspapers have dealt with this problem in various ways, including using software to combine two existing, similar characters; using a picture of the personality; or simply substituting a homophone for the rare character in the hope that the reader would be able to make the correct inference.

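Whether a given character exists in an encoding can be checked by simply trying to encode it. The sketch below is illustrative only: the function name is made up, and the answers for Big5 depend on which Big5 variant a codec implements, so they may differ from the historical situation described in the text.

```python
def can_encode(char: str, encoding: str) -> bool:
    """Return True if the character has a representation in the encoding."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "镕": ("gb2312", "gbk", "gb18030"),
        "©": ("gbk", "gb18030"),
        "煊": ("big5", "big5hkscs"),
        "堃": ("big5", "big5hkscs"),
        "喆": ("big5", "big5hkscs"),
    }
    for char, encodings in samples.items():
        for enc in encodings:
            print(char, enc, can_encode(char, enc))
```

GB 18030 is a full Unicode transformation format, so it can represent every character in the table; the older Guobiao encodings and Big5 cover only fixed repertoires.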

Indic text

A similar effect can occur in Brahmic or Indic scripts of South Asia, used in such Indo-Aryan or Indic languages as Hindustani (Hindi-Urdu), Bengali, Punjabi, Marathi, and others, even if the character set employed is properly recognized by the application. This is because, in many Indic scripts, the rules by which individual letter symbols combine to create symbols for syllables may not be properly understood by a computer missing the appropriate software, even if the glyphs for the individual letter forms are available.

One example of this is the old Wikipedia logo, which attempts to show the character analogous to "wi" (the first syllable of "Wikipedia") on each of many puzzle pieces. The puzzle piece meant to bear the Devanagari character for "wi" instead used to display the "wa" character followed by an unpaired "i" modifier vowel, easily recognizable as mojibake generated by a computer not configured to display Indic text. The redesigned logo has fixed these errors.

The idea of plain text requires the operating system to provide a font to display Unicode codes. For Sinhala, this font differs from OS to OS, and it produces orthographically incorrect glyphs for some letters (syllables) across all operating systems. For instance, the 'reph', the short form for 'r', is a diacritic that normally goes on top of a plain letter. However, it is wrong to place it on top of some letters like 'ya' or 'la' in specific contexts. For Sanskritic words or names inherited by modern languages, such as कार्य, IAST: ''kārya'', or आर्या, IAST: ''āryā'', it is apt to put it on top of these letters. By contrast, for similar sounds in modern languages which result from their specific rules, it is not put on top, such as the word करणाऱ्या, IAST: ''karaṇāryā'', a stem form of the common word करणारा/री, IAST: ''karaṇārā/rī'', in the Marathi language. But it happens in most operating systems. This appears to be a fault of internal programming of the fonts. In Mac OS and iOS, the muurdhaja l (dark l) and 'u' combination and its long form both yield wrong shapes.

Some Indic and Indic-derived scripts, most notably Lao, were not officially supported by Windows XP until the release of Vista. However, various sites have made free-to-download fonts.


Burmese

Due to Western sanctions and the late arrival of Burmese language support in computers, much of the early Burmese localization was homegrown without international cooperation. The prevailing means of Burmese support is via the Zawgyi font, a font that was created as a Unicode font but was in fact only partially Unicode compliant. In the Zawgyi font, some codepoints for Burmese script were implemented as specified in Unicode, but others were not. The Unicode Consortium refers to this as ''ad hoc font encodings''. With the advent of mobile phones, mobile vendors such as Samsung and Huawei simply replaced the Unicode-compliant system fonts with Zawgyi versions.

Due to these ''ad hoc'' encodings, communications between users of Zawgyi and Unicode would render as garbled text. To get around this issue, content producers would make posts in both Zawgyi and Unicode. The Myanmar government designated 1 October 2019 as "U-Day" to officially switch to Unicode. The full transition was estimated to take two years.


African languages

In certain writing systems of Africa, unencoded text is unreadable. Texts that may produce mojibake include those from the Horn of Africa such as the Ge'ez script in Ethiopia and Eritrea, used for Amharic, Tigre, and other languages, and the Somali language, which employs the Osmanya alphabet. In Southern Africa, the Mwangwego alphabet is used to write languages of Malawi and the Mandombe alphabet was created for the Democratic Republic of the Congo, but these are not generally supported. Various other writing systems native to West Africa present similar problems, such as the N'Ko alphabet, used for Manding languages in Guinea, and the Vai syllabary, used in Liberia.


Arabic

Another affected language is Arabic (see below). The text becomes unreadable when the encodings do not match.
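
As a minimal sketch of such a mismatch, the snippet below encodes a short Arabic word as UTF-8 and then decodes the bytes as Latin-1 (ISO 8859-1). The sample word and the choice of Latin-1 as the wrong encoding are assumptions made for illustration. Each Arabic letter occupies two bytes in UTF-8, so it turns into two unrelated Latin characters.

```python
original = "مرحبا"  # sample Arabic word, chosen for illustration

# Encode correctly, then decode with the wrong (single-byte) encoding.
raw = original.encode("utf-8")
garbled = raw.decode("latin-1")

print(len(original), len(raw), len(garbled))
print(garbled)

# Because Latin-1 maps every byte to a character, the damage is reversible
# as long as the garbled text has not been altered further.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The repair step works only because no bytes were lost. Once a decoder has substituted replacement characters, the original text generally cannot be recovered.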


Examples

The examples in this article do not use UTF-8 as the browser setting. UTF-8 is easily recognisable, so a browser that supports UTF-8 should detect it automatically and should not try to interpret text in another encoding as UTF-8.
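
One reason UTF-8 is easy to recognise is that its multi-byte sequences follow a strict pattern, so non-ASCII text in a legacy 8-bit encoding rarely happens to be valid UTF-8. The sketch below illustrates this with a strict decode attempt. The helper name `is_valid_utf8` is hypothetical, and real browsers use more elaborate detection.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without any error."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")      # é becomes two bytes: C3 A9
latin1_bytes = "café".encode("latin-1")  # é becomes one byte: E9

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

Pure ASCII input passes the check under either encoding, so a successful decode is strong evidence of UTF-8 only when non-ASCII bytes are present.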


See also

* Code point
* Replacement character
* Substitute character
* Newline – The conventions for representing the line break differ between Windows and Unix systems. Though most software supports both conventions (which is trivial), software that must preserve or display the difference (e.g. version control systems and data comparison tools) can get substantially more difficult to use if not adhering to one convention.
* Byte order mark – The most in-band way to store the encoding together with the data: prepend it. This is by intention invisible to humans using compliant software, but will by design be perceived as "garbage characters" by non-compliant software (including many interpreters).
* HTML entities – An encoding of special characters in HTML, mostly optional, but required for certain characters to escape interpretation as markup. While failure to apply this transformation is a vulnerability (see cross-site scripting), applying it too many times results in garbling of these characters. For example, the quotation mark " becomes &quot;, &amp;quot;, &amp;amp;quot; and so on.
* Bush hid the facts
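
Three of the effects listed above can be reproduced with a short sketch using only the Python standard library. The sample strings are assumptions chosen for illustration: repeated HTML escaping, a UTF-8 byte order mark read as Latin-1, and ASCII bytes misread as UTF-16LE.

```python
import html

# Repeated HTML escaping: each pass escapes the "&" produced by the last one.
once = html.escape('"')
twice = html.escape(once)
thrice = html.escape(twice)
print(once, twice, thrice)

# A UTF-8 byte order mark shown by software that assumes Latin-1.
with_bom = "hi".encode("utf-8-sig")  # prepends the bytes EF BB BF
bom_as_latin1 = with_bom.decode("latin-1")
print(bom_as_latin1)

# ASCII text misread as UTF-16LE: every pair of bytes becomes one character.
misread = b"Bush hid the facts".decode("utf-16-le")
print(len(misread), misread)
```

In each case the underlying bytes are intact, and only the interpretation applied to them is wrong.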

