HOME

TheInfoList



OR:

Mojibake ( ja, 文字化け; , "character transformation") is the garbled text that is the result of text being decoded using an unintended
character encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values tha ...
. The result is a systematic replacement of symbols with completely unrelated ones, often from a different
writing system A writing system is a method of visually representing verbal communication, based on a script and a set of rules regulating its use. While both writing and speech are useful in conveying messages, writing differs in also being a reliable fo ...
. This display may include the generic replacement character ("�") in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This is either because of differing constant length encoding (as in Asian 16-bit encodings vs European 8-bit encodings), or the use of variable length encodings (notably
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
and UTF-16). Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake. Symptoms of this failed rendering include blocks with the
code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
displayed in
hexadecimal In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, he ...
or using the generic replacement character. Importantly, these replacements are ''valid'' and are the result of correct error handling by the software.


Causes

To correctly reproduce the original text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved. As mojibake is the instance of non-compliance between these, it can be achieved by manipulating the data itself, or just relabeling it. Mojibake is often seen with text data that have been tagged with a wrong encoding; it may not even be tagged at all, but moved between computers with different default encodings. A major source of trouble are
communication protocol A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any kind of variation of a physical quantity. The protocol defines the rules, syntax, semantics and synchroniza ...
s that rely on settings on each computer rather than sending or storing
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
together with the data. The differing default settings between computers are in part due to differing deployments of
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
among
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ef ...
families, and partly the legacy encodings' specializations for different
writing system A writing system is a method of visually representing verbal communication, based on a script and a set of rules regulating its use. While both writing and speech are useful in conveying messages, writing differs in also being a reliable fo ...
s of human languages. Whereas
Linux distribution A Linux distribution (often abbreviated as distro) is an operating system made from a software collection that includes the Linux kernel and, often, a package management system. Linux users usually obtain their operating system by downloading one ...
s mostly switched to
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
in 2004,
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ...
generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages. For some
writing system A writing system is a method of visually representing verbal communication, based on a script and a set of rules regulating its use. While both writing and speech are useful in conveying messages, writing differs in also being a reliable fo ...
s, an example being Japanese, several encodings have historically been employed, causing users to see mojibake relatively often. As a Japanese example, the word ''mojibake'' "文字化け" stored as EUC-JP might be incorrectly displayed as "ハクサ�ス、ア", "ハクサ嵂ス、ア" ( MS-932), or "ハクサ郾ス、ア" (
Shift JIS-2004 Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjun ...
). The same text stored as
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
is displayed as "譁�蟄怜喧縺�" if interpreted as Shift JIS. This is further exacerbated if other locales are involved: the same UTF-8 text appears as "文字化け" in software that assumes text to be in the
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
or
ISO-8859-1 ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
encodings, usually labelled Western, or (for example) as "鏂囧瓧鍖栥亼" if interpreted as being in a GBK (Mainland China) locale.


Underspecification

If the encoding is not specified, it is up to the software to decide it by other means. Depending on the type of software, the typical solution is either configuration or
charset detection Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when sp ...
heuristics. Both are prone to mis-prediction in not-so-uncommon scenarios. The encoding of
text file A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operat ...
s is affected by locale setting, which depends on the user's language, brand of
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ef ...
and possibly other conditions. Therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting, or even from a differently localized software within the same system. For Unicode, one solution is to use a
byte order mark The byte order mark (BOM) is a particular usage of the special Unicode character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or endianness, of t ...
, but for
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the ...
and other machine readable text, many parsers don't tolerate this. Another is storing the encoding as metadata in the file system. File systems that support extended file attributes can store this as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software. While a few encodings are easy to detect, in particular UTF-8, there are many that are hard to distinguish (see
charset detection Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when sp ...
). A
web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used o ...
may not be able to distinguish a page coded in EUC-JP and another in
Shift-JIS Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Oracle Solaris, Solaris contexts) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporati ...
if the coding scheme is not assigned explicitly using HTTP headers sent along with the documents, or using the
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaS ...
document's meta tags that are used to substitute for missing HTTP headers if the server cannot be configured to send the proper HTTP headers; see character encodings in HTML.


Mis-specification

Mojibake also occurs when the encoding is wrongly specified. This often happens between encodings that are similar. For example, the
Eudora Eudora may refer to: Places * Eudora, Arkansas, a city * Eudora, Kansas, a city * Eudora Township, Douglas County, Kansas * Eudora, Mississippi, an unincorporated community * Eudora, Missouri, an unincorporated community Other * 217 Eudora, an as ...
email client for
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for se ...
was known to send emails labelled as
ISO-8859-1 ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
that were in reality
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
. Windows-1252 contains extra printable characters in the C1 range (the most frequently seen being curved quotation marks and extra
dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen ...
es), that were not displayed properly in software complying with the ISO standard; this especially affected software running under other operating systems such as
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, ...
.


User oversight

Of the encodings still in common use, many originated from taking
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
and appending atop it; as a result, these encodings are partially compatible with each other. Examples of this include Windows-1252 and ISO 8859-1. People thus may mistake the expanded encoding set they are using with plain ASCII.


Overspecification

When there are layers of protocols, each trying to specify the encoding based on different information, the least certain information may be misleading to the recipient. For example, consider a
web server A web server is computer software and underlying hardware that accepts requests via HTTP (the network protocol created to distribute web content) or its secure variant HTTPS. A user agent, commonly a web browser or web crawler, initia ...
serving a static HTML file over HTTP. The character set may be communicated to the client in any number of 3 ways: * in the HTTP header. This information can be based on server configuration (for instance, when serving a file off disk) or controlled by the application running on the server (for dynamic websites). * in the file, as an HTML meta tag (http-equiv or charset) or the encoding attribute of an
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
declaration. This is the encoding that the author meant to save the particular file in. * in the file, as a
byte order mark The byte order mark (BOM) is a particular usage of the special Unicode character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or endianness, of t ...
. This is the encoding that the author's editor actually saved it in. Unless an accidental encoding conversion has happened (by opening it in one encoding and saving it in another), this will be correct. It is, however, only available in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
encodings such as UTF-8 or UTF-16.


Lack of hardware or software support

Much older hardware is typically designed to support only one character set and the character set typically cannot be altered. The character table contained within the display firmware will be localized to have characters for the country the device is to be sold in, and typically the table differs from country to country. As such, these systems will potentially display mojibake when loading text generated on a system from a different country. Likewise, many early operating systems do not support multiple encoding formats and thus will end up displaying mojibake if made to display non-standard text—early versions of
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ...
and Palm OS for example, are localized on a per-country basis and will only support encoding standards relevant to the country the localized version will be sold in, and will display mojibake if a file containing a text in a different encoding format from the version that the OS is designed to support is opened.


Resolutions

Applications using
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
as a default encoding may achieve a greater degree of interoperability because of its widespread use and backward compatibility with US-ASCII. UTF-8 also has the ability to be directly recognised by a simple algorithm, so that well written software should be able to avoid mixing UTF-8 up with other encodings. The difficulty of resolving an instance of mojibake varies depending on the application within which it occurs and the causes of it. Two of the most common applications in which mojibake may occur are
web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used o ...
s and
word processor A word processor (WP) is a device or computer program that provides for input, editing, formatting, and output of text, often with some additional features. Early word processors were stand-alone devices dedicated to the function, but current ...
s. Modern browsers and word processors often support a wide array of character encodings. Browsers often allow a user to change their rendering engine's encoding setting on the fly, while word processors allow the user to select the appropriate encoding when opening a file. It may take some trial and error for users to find the correct encoding. The problem gets more complicated when it occurs in an application that normally does not support a wide range of character encoding, such as in a non-Unicode computer game. In this case, the user must change the operating system's encoding settings to match that of the game. However, changing the system-wide encoding settings can also cause Mojibake in pre-existing applications. In
Windows XP Windows XP is a major release of Microsoft's Windows NT operating system. It was release to manufacturing, released to manufacturing on August 24, 2001, and later to retail on October 25, 2001. It is a direct upgrade to its predecessors, Wind ...
or later, a user also has the option to use
Microsoft AppLocale AppLocale is a tool for Windows XP and Windows Server 2003 by Microsoft. It is a launcher application that makes it possible to run non- Unicode (code page-based) applications in a locale of the user's choice. Since changing the locale normally re ...
, an application that allows the changing of per-application locale settings. Even so, changing the operating system encoding settings is not possible on earlier operating systems such as
Windows 98 Windows 98 is a consumer-oriented operating system developed by Microsoft as part of its Windows 9x family of Microsoft Windows operating systems. The second operating system in the 9x line, it is the successor to Windows 95, and was released to ...
; to resolve this issue on earlier operating systems, a user would have to use third party font rendering applications.


Problems in different writing systems


English

Mojibake in English texts generally occurs in punctuation, such as
em dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen b ...
es (—),
en dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen b ...
es (–), and curly quotes (“,”,‘,’), but rarely in character text, since most encodings agree with
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
on the encoding of the
English alphabet The alphabet for Modern English is a Latin-script alphabet consisting of 26 letters, each having an upper- and lower-case form. The word ''alphabet'' is a compound of the first two letters of the Greek alphabet, ''alpha'' and '' beta''. ...
. For example, the
pound sign The pound sign is the symbol for the pound unit of sterling – the currency of the United Kingdom and previously of Great Britain and of the Kingdom of England. The same symbol is used for other currencies called pound, such as the Gibralt ...
"£" will appear as "£" if it was encoded by the sender as
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
but interpreted by the recipient as CP1252 or
ISO 8859-1 ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
. If iterated using CP1252, this can lead to "£", "£", "£", etc. Some computers did, in older eras, have vendor-specific encodings which caused mismatch also for English text. Commodore brand
8-bit In computer architecture, 8-bit integers or other data units are those that are 8 bits wide (1 octet). Also, 8-bit central processing unit (CPU) and arithmetic logic unit (ALU) architectures are those that are based on registers or data buses ...
computers used PETSCII encoding, particularly notable for inverting the upper and lower case compared to standard
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
. PETSCII printers worked fine on other computers of the era, but flipped the case of all letters. IBM mainframes use the
EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding ...
encoding which does not match ASCII at all.


Other Western European languages

The alphabets of the
North Germanic languages The North Germanic languages make up one of the three branches of the Germanic languages—a sub-family of the Indo-European languages—along with the West Germanic languages and the extinct East Germanic languages. The language group is also ...
,
Catalan Catalan may refer to: Catalonia From, or related to Catalonia: * Catalan language, a Romance language * Catalans, an ethnic group formed by the people from, or with origins in, Northern or southern Catalonia Places * 13178 Catalan, asteroid #1 ...
, Finnish,
German German(s) may refer to: * Germany (of or related to) **Germania (historical use) * Germans, citizens of Germany, people of German ancestry, or native speakers of the German language ** For citizens of Germany, see also German nationality law **Ge ...
,
French French (french: français(e), link=no) may refer to: * Something of, from, or related to France ** French language, which originated in France, and its various dialects and accents ** French people, a nation and ethnic group identified with Franc ...
, Portuguese and Spanish are all extensions of the
Latin alphabet The Latin alphabet or Roman alphabet is the collection of letters originally used by the ancient Romans to write the Latin language. Largely unaltered with the exception of extensions (such as diacritics), it used to write English and the ...
. The additional characters are typically the ones that become corrupted, making texts only mildly unreadable with mojibake: * å, ä, ö in Finnish and Swedish *à, ç, è, é, ï, í, ò, ó, ú, ü in
Catalan Catalan may refer to: Catalonia From, or related to Catalonia: * Catalan language, a Romance language * Catalans, an ethnic group formed by the people from, or with origins in, Northern or southern Catalonia Places * 13178 Catalan, asteroid #1 ...
* æ, ø, å in Norwegian and Danish *á, é, ó, ij, è, ë, ï in Dutch *ä, ö, ü, and ß in
German German(s) may refer to: * Germany (of or related to) **Germania (historical use) * Germans, citizens of Germany, people of German ancestry, or native speakers of the German language ** For citizens of Germany, see also German nationality law **Ge ...
*á, ð, í, ó, ú, ý, æ, ø in Faroese *á, ð, é, í, ó, ú, ý, þ, æ, ö in Icelandic *à, â, ç, è, é, ë, ê, ï, î, ô, ù, û, ü, ÿ, æ, œ in
French French (french: français(e), link=no) may refer to: * Something of, from, or related to France ** French language, which originated in France, and its various dialects and accents ** French people, a nation and ethnic group identified with Franc ...
*à, è, é, ì, ò, ù in Italian *á, é, í, ñ, ó, ú, ü, ¡, ¿ in Spanish *à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese ( ü no longer used) *á, é, í, ó, ú in Irish *à, è, ì, ò, ù in
Scottish Gaelic Scottish Gaelic ( gd, Gàidhlig ), also known as Scots Gaelic and Gaelic, is a Goidelic language (in the Celtic branch of the Indo-European language family) native to the Gaels of Scotland. As a Goidelic language, Scottish Gaelic, as well as ...
* £ in
British English British English (BrE, en-GB, or BE) is, according to Oxford Dictionaries, "English as used in Great Britain, as distinct from that used elsewhere". More narrowly, it can refer specifically to the English language in England, or, more broadl ...
… and their uppercase counterparts, if applicable. These are languages for which the
ISO-8859-1 ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
character set (also known as ''Latin 1'' or ''Western'') has been in use. However, ISO-8859-1 has been obsoleted by two competing standards, the backward compatible
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
, and the slightly altered ISO-8859-15. Both add the
Euro sign The euro sign () is the currency sign used for the euro, the official currency of the eurozone and unilaterally adopted by Kosovo and Montenegro. The design was presented to the public by the European Commission on 12 December 1996. It consists o ...
€ and the French œ, but otherwise any confusion of these three character sets does not create mojibake in these languages. Furthermore, it is always safe to interpret ISO-8859-1 as Windows-1252, and fairly safe to interpret it as ISO-8859-15, in particular with respect to the Euro sign, which replaces the rarely used
currency sign A currency symbol or currency sign is a graphic symbol used to denote a currency unit. Usually it is defined by the monetary authority, like the national central bank for the currency concerned. In formatting, the symbol can use various format ...
(¤). However, with the advent of
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
, mojibake has become more common in certain scenarios, e.g. exchange of text files between
UNIX Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, ...
and
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for se ...
computers, due to UTF-8's incompatibility with Latin-1 and Windows-1252. But UTF-8 has the ability to be directly recognised by a simple algorithm, so that well written software should be able to avoid mixing UTF-8 up with other encodings, so this was most common when many had software not supporting UTF-8. Most of these languages were supported by MS-DOS default CP437 and other machine default encodings, except ASCII, so problems when buying an operating system version were less common. Windows and MS-DOS are not compatible however. In Swedish, Norwegian, Danish and German, vowels are rarely repeated, and it is usually obvious when one character gets corrupted, e.g. the second letter in "kÃ⁠¤rlek" (', "love"). This way, even though the reader has to guess between å, ä and ö, almost all texts remain legible. Finnish text, on the other hand, does feature repeating vowels in words like ' ("wedding night") which can sometimes render text very hard to read (e.g. ' appears as "hÃ⁠¤Ã⁠¤yÃ⁠¶"). Icelandic and Faroese have ten and eight possibly confounding characters, respectively, which thus can make it more difficult to guess corrupted characters; Icelandic words like ' ("outstanding hospitality") become almost entirely unintelligible when rendered as "þjóðlöð". In German, ' ("letter salad") is a common term for this phenomenon, and in Spanish, ' (literally deformation). Some users transliterate their writing when using a computer, either by omitting the problematic diacritics, or by using digraph replacements (å → aa, ä/æ → ae, ö/ø → oe, ü → ue etc.). Thus, an author might write "ueber" instead of "über", which is standard practice in German when umlauts are not available. The latter practice seems to be better tolerated in the German language sphere than in the
Nordic countries The Nordic countries (also known as the Nordics or ''Norden''; lit. 'the North') are a geographical and cultural region in Northern Europe and the North Atlantic. It includes the sovereign states of Denmark, Finland, Iceland, Norway and Sw ...
. For example, in Norwegian, digraphs are associated with archaic Danish, and may be used jokingly. However, digraphs are useful in communication with other parts of the world. As an example, the Norwegian football player Ole Gunnar Solskjær had his name spelled "SOLSKJAER" on his back when he played for
Manchester United Manchester () is a city in Greater Manchester, England. It had a population of 552,000 in 2021. It is bordered by the Cheshire Plain to the south, the Pennines to the north and east, and the neighbouring city of Salford to the west. The ...
. An artifact of
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of e ...
misinterpreted as
ISO-8859-1 ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
, "Ring meg nå" (""), was seen in an SMS scam raging in Norway in June 2014.


Central and Eastern European

Users of
Central Central is an adjective usually referring to being in the center of some place or (mathematical) object. Central may also refer to: Directions and generalised locations * Central Africa, a region in the centre of Africa continent, also known a ...
and
Eastern Europe Eastern Europe is a subregion of the European continent. As a largely ambiguous term, it has a wide range of geopolitical, geographical, ethnic, cultural, and socio-economic connotations. The vast majority of the region is covered by Russia, whi ...
an languages can also be affected. Because most computers were not connected to any network during the mid- to late-1980s, there were different character encodings for every language with
diacritic A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacrit ...
al characters (see ISO/IEC 8859 and KOI-8), often also varying by operating system.


Hungarian

Hungarian is another affected language, which uses the 26 basic English characters, plus the accented forms á, é, í, ó, ú, ö, ü (all present in the Latin-1 character set), plus the two characters ő and ű, which are not in Latin-1. These two characters can be correctly encoded in Latin-2, Windows-1250 and Unicode. Before Unicode became common in e-mail clients, e-mails containing Hungarian text often had the letters ő and ű corrupted, sometimes to the point of unrecognizability. It is common to respond to an e-mail rendered unreadable (see examples below) by character mangling (referred to as "betűszemét", meaning "letter garbage") with the phrase "Árvíztűrő tükörfúrógép", a nonsense phrase (literally "Flood-resistant mirror-drilling machine") containing all accented characters used in Hungarian.


=Examples

=


Polish

Prior to the creation of ISO 8859-2 in 1987, users of various computing platforms used their own
character encodings Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values tha ...
such as AmigaPL on Amiga, Atari Club on Atari ST and Masovia, IBM CP852, Mazovia and Windows CP1250 on IBM PCs. Polish companies selling early
DOS DOS is shorthand for the MS-DOS and IBM PC DOS family of operating systems. DOS may also refer to: Computing * Data over signalling (DoS), multiplexing data onto a signalling channel * Denial-of-service attack (DoS), an attack on a communicat ...
computers created their own mutually-incompatible ways to encode Polish characters and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA, or
Hercules Hercules (, ) is the Roman equivalent of the Greek divine hero Heracles, son of Jupiter and the mortal Alcmena. In classical mythology, Hercules is famous for his strength and for his numerous far-ranging adventures. The Romans adapted the ...
) to provide
hardware code page In computing, a hardware code page (HWCP) refers to a code page supported natively by a hardware device such as a display adapter or printer. The glyphs to present the characters are stored in the alphanumeric character generator's resident read ...
s with the needed glyphs for Polish—arbitrarily located without reference to where other computer sellers had placed them. The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 succeeded as the "Internet standard" with limited support of the dominant vendors' software (today largely replaced by Unicode). With the numerous problems caused by the variety of encodings, even today some users tend to refer to Polish diacritical characters as (, lit. "little shrubs").


Russian and other Cyrillic alphabets

Mojibake may be colloquially called ( ) in Russian, which was and remains complicated by several systems for encoding
Cyrillic The Cyrillic script ( ), Slavonic script or the Slavic script, is a writing system used for various languages across Eurasia. It is the designated national script in various Slavic, Turkic, Mongolic, Uralic, Caucasian and Iranic-speaking co ...
. The
Soviet Union The Soviet Union,. officially the Union of Soviet Socialist Republics. (USSR),. was a transcontinental country that spanned much of Eurasia from 1922 to 1991. A flagship communist state, it was nominally a federal union of fifteen nationa ...
and early
Russian Federation Russia (, , ), or the Russian Federation, is a transcontinental country spanning Eastern Europe and Northern Asia. It is the largest country in the world, with its internationally recognised territory covering , and encompassing one-eig ...
developed KOI encodings (, , which translates to "Code for Information Exchange"). This began with Cyrillic-only 7-bit KOI7, based on
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
but with Latin and some other characters replaced with Cyrillic letters. Then came 8-bit KOI8 encoding that is an ASCII extension which encodes Cyrillic letters only with high-bit set octets corresponding to 7-bit codes from KOI7. It is for this reason that KOI8 text, even Russian, remains partially readable after stripping the eighth bit, which was considered as a major advantage in the age of
8BITMIME The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typi ...
-unaware email systems. For example, words "" , encoded in KOI8 and then passed through the high bit stripping process, end up rendered as " OLA RUSSKOGO qZYKA". Eventually KOI8 gained different flavors for Russian and Bulgarian (KOI8-R), Ukrainian (KOI8-U), Belarusian alphabet, Belarusian (KOI8-RU) and even Tajik Cyrillic alphabet, Tajik (KOI8-T). Meanwhile, in the West, Code page 866 supported Ukrainian language, Ukrainian and Belarusian language, Belarusian as well as Russian/ Bulgarian language, Bulgarian in
MS-DOS MS-DOS ( ; acronym for Microsoft Disk Operating System, also known as Microsoft DOS) is an operating system for x86-based personal computers mostly developed by Microsoft. Collectively, MS-DOS, its rebranding as IBM PC DOS, and a few o ...
. For
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ...
, Code Page 1251 added support for
Serbian Serbian may refer to: * someone or something related to Serbia, a country in Southeastern Europe * someone or something related to the Serbs, a South Slavic people * Serbian language * Serbian names See also * * * Old Serbian (disambiguation ...
and other Slavic variants of Cyrillic. Most recently, the
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
encoding includes
code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
s for practically all the characters of all the world's languages, including all Cyrillic characters. Before Unicode, it was necessary to match text encoding with a font using the same encoding system. Failure to do this produced unreadable
gibberish Gibberish, also called jibber-jabber or gobbledygook, is speech that is (or appears to be) nonsense. It may include speech sounds that are not actual words, pseudowords, or language games and specialized jargon that seems nonsensical to outsi ...
whose specific appearance varied depending on the exact combination of text encoding and font encoding. For example, attempting to view non-Unicode Cyrillic text using a font that is limited to the Latin alphabet, or using the default ("Western") encoding, typically results in text that consists almost entirely of vowels with diacritical marks. (KOI8 "" (, library) becomes "âÉÂÌÉÏÔÅËÁ".) Using Windows codepage 1251 to view text in KOI8 or vice versa results in garbled text that consists mostly of capital letters (KOI8 and codepage 1251 share the same ASCII region, but KOI8 has uppercase letters in the region where codepage 1251 has lowercase, and vice versa). In general, Cyrillic gibberish is symptomatic of using the wrong Cyrillic font. During the early years of the Russian sector of the World Wide Web, both KOI8 and codepage 1251 were common. As of 2017, one can still encounter HTML pages in codepage 1251 and, rarely, KOI8 encodings, as well as Unicode. (An estimated 1.7% of all web pages worldwide – all languages included – are encoded in codepage 1251.) Though the HTML standard includes the ability to specify the encoding for any given web page in its source, this is sometimes neglected, forcing the user to switch encodings in the browser manually. In Bulgarian language, Bulgarian, mojibake is often called (), meaning "monkey's lphabet. In
Serbian Serbian may refer to: * someone or something related to Serbia, a country in Southeastern Europe * someone or something related to the Serbs, a South Slavic people * Serbian language * Serbian names See also * * * Old Serbian (disambiguation ...
, it is called (), meaning " trash". Unlike the former USSR, South Slavs never used something like KOI8, and Code Page 1251 was the dominant Cyrillic encoding there before Unicode. Therefore, these languages experienced fewer encoding incompatibility troubles than Russian. In the 1980s, Bulgarian computers used their own MIK encoding, which is superficially similar to (although incompatible with) CP866.


Yugoslav languages

Croatian, Bosnian,
Serbian Serbian may refer to: * someone or something related to Serbia, a country in Southeastern Europe * someone or something related to the Serbs, a South Slavic people * Serbian language * Serbian names See also * * * Old Serbian (disambiguation ...
(the seceding varieties of
Serbo-Croatian Serbo-Croatian () – also called Serbo-Croat (), Serbo-Croat-Bosnian (SCB), Bosnian-Croatian-Serbian (BCS), and Bosnian-Croatian-Montenegrin-Serbian (BCMS) – is a South Slavic language and the primary language of Serbia, Croatia, Bosnia an ...
language) and
Slovenian Slovene or Slovenian may refer to: * Something of, from, or related to Slovenia, a country in Central Europe * Slovene language, a South Slavic language mainly spoken in Slovenia * Slovenes, an ethno-linguistic group mainly living in Slovenia * Sl ...
add to the basic Latin alphabet the letters š, đ, č, ć, ž, and their capital counterparts Š, Đ, Č, Ć, Ž (only č/Č, š/Š and ž/Ž in Slovenian; officially, although others are used when needed, mostly in foreign names, as well). All of these letters are defined in
Latin-2 ISO/IEC 8859-2:1999, ''Information technology — 8-bit single-byte coded graphic character sets — Part 2: Latin alphabet No. 2'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. ...
and Windows-1250, while only some (š, Š, ž, Ž, Đ) exist in the usual OS-default
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
, and are there because of some other languages. Although Mojibake can occur with any of these characters, the letters that are not included in Windows-1252 are much more prone to errors. Thus, even nowadays, "šđčćž ŠĐČĆŽ" is often displayed as "šðèæž ŠÐÈÆŽ", although ð, è, æ, È, Æ are never used in Slavic languages. When confined to basic ASCII (most user names, for example), common replacements are: š→s, đ→dj, č→c, ć→c, ž→z (capital forms analogously, with Đ→Dj or Đ→DJ depending on word case). All of these replacements introduce ambiguities, so reconstructing the original from such a form is usually done manually if required. The
Windows-1252 Windows-1252 or CP-1252 ( code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. ...
encoding is important because the English versions of the Windows operating system are most widespread, not localized ones. The reasons for this include a relatively small and fragmented market, increasing the price of high quality localization, a high degree of software piracy (in turn caused by high price of software compared to income), which discourages localization efforts, and people preferring English versions of Windows and other software. The drive to differentiate Croatian from Serbian, Bosnian from Croatian and Serbian, and now even Montenegrin from the other three creates many problems. There are many different localizations, using different standards and of different quality. There are no common translations for the vast amount of computer terminology originating in English. In the end, people use adopted English words ("kompjuter" for "computer", "kompajlirati" for "compile," etc.), and if they are unaccustomed to the translated terms may not understand what some option in a menu is supposed to do based on the translated phrase. Therefore, people who understand English, as well as those who are accustomed to English terminology (who are most, because English terminology is also mostly taught in schools because of these problems) regularly choose the original English versions of non-specialist software. When Cyrillic script is used (for
Macedonian Macedonian most often refers to someone or something from or related to Macedonia. Macedonian(s) may specifically refer to: People Modern * Macedonians (ethnic group), a nation and a South Slavic ethnic group primarily associated with North Ma ...
and partially
Serbian Serbian may refer to: * someone or something related to Serbia, a country in Southeastern Europe * someone or something related to the Serbs, a South Slavic people * Serbian language * Serbian names See also * * * Old Serbian (disambiguation ...
), the problem is similar to other Cyrillic-based scripts. Newer versions of English Windows allow the
code page In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte. (In some c ...
to be changed (older versions require special English versions with this support), but this setting can be and often was incorrectly set. For example, Windows 98 and Windows Me can be set to most non-right-to-left single-byte code pages including 1250, but only at install time.


Caucasian languages

The writing systems of certain
languages of the Caucasus The Caucasian languages comprise a large and extremely varied array of languages spoken by more than ten million people in and around the Caucasus Mountains, which lie between the Black Sea and the Caspian Sea. Linguistic comparison allows t ...
region, including the scripts of Georgian and Armenian, may produce mojibake. This problem is particularly acute in the case of ArmSCII or ARMSCII, a set of obsolete character encodings for the Armenian alphabet which have been superseded by Unicode standards. ArmSCII is not widely used because of a lack of support in the computer industry. For example,
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for ...
does not support it.


Asian encodings

Another type of mojibake occurs when text is erroneously parsed in a multi-byte encoding, such as one of the encodings for East Asian languages. With this kind of mojibake more than one (typically two) characters are corrupted at once, e.g. "k舐lek" () in Swedish, where "" is parsed as "舐". Compared to the above mojibake, this is harder to read, since letters unrelated to the problematic å, ä or ö are missing, and is especially problematic for short words starting with å, ä or ö such as "än" (which becomes "舅"). Since two letters are combined, the mojibake also seems more random (over 50 variants compared to the normal three, not counting the rarer capitals). In some rare cases, an entire text string which happens to include a pattern of particular word lengths, such as the sentence "
Bush hid the facts Bush hid the facts is a common name for a bug present in some versions of Microsoft Windows, which causes text encoded in ASCII to be interpreted as if it were UTF-16LE, resulting in garbled text. When the string "Bush hid the facts", without q ...
", may be misinterpreted.


Vietnamese

In Vietnamese, the phenomenon is called or . It can occur when a computer tries to encode diacritic character defined in Windows-1258, TCVN3 or VNI to UTF-8. ''Chữ ma'' was common in Vietnam when using Windows XP computers or cheap mobile phones.


Japanese

In Japanese, the same phenomenon is, as mentioned, called . It is a particular problem in Japan due to the numerous different encodings that exist for Japanese text. Alongside Unicode encodings like UTF-8 and UTF-16, there are other standard encodings, such as
Shift-JIS Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Oracle Solaris, Solaris contexts) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporati ...
(Windows machines) and EUC-JP (UNIX systems). Mojibake, as well as being encountered by Japanese users, is also often encountered by non-Japanese when attempting to run software written for the Japanese market.


Chinese

In Chinese, the same phenomenon is called ''Luàn mǎ'' (
Pinyin Hanyu Pinyin (), often shortened to just pinyin, is the official romanization system for Standard Mandarin Chinese in China, and to some extent, in Singapore and Malaysia. It is often used to teach Mandarin, normally written in Chinese fo ...
,
Simplified Chinese Simplification, Simplify, or Simplified may refer to: Mathematics Simplification is the process of replacing a mathematical expression by an equivalent one, that is simpler (usually shorter), for example * Simplification of algebraic expressions ...
,
Traditional Chinese A tradition is a belief or behavior (folk custom) passed down within a group or society with symbolic meaning or special significance with origins in the past. A component of cultural expressions and folklore, common examples include holidays ...
, meaning 'chaotic code'), and can occur when computerised text is encoded in one
Chinese character encoding In computing, Chinese character encodings can be used to represent text written in the CJK languages— Chinese, Japanese, Korean—and (rarely) obsolete Vietnamese, all of which use Chinese characters. Several general-purpose characte ...
but is displayed using the wrong encoding. When this occurs, it is often possible to fix the issue by switching the character encoding without loss of data. The situation is complicated because of the existence of several Chinese character encoding systems in use, the most common ones being:
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
,
Big5 Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters. The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character se ...
, and Guobiao (with several backward compatible versions), and the possibility of Chinese characters being encoded using Japanese encoding. It is easy to identify the original encoding when ''luanma'' occurs in Guobiao encodings: An additional problem is caused when encodings are missing characters, which is common with rare or antiquated characters that are still used in personal or place names. Examples of this are
Taiwan Taiwan, officially the Republic of China (ROC), is a country in East Asia, at the junction of the East and South China Seas in the northwestern Pacific Ocean, with the People's Republic of China (PRC) to the northwest, Japan to the no ...
ese politicians
Wang Chien-shien Wang Chien-shien (; born 7 August 1938) is a Taiwanese politician who is the founder of the New Party. He was finance minister of the Republic of China from 1990 to 1992 and is the chairman of the Chinese Management Association (CMA) (since 199 ...
()'s "煊", Yu Shyi-kun ()'s "堃" and singer David Tao ()'s "喆" missing in
Big5 Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters. The People's Republic of China (PRC), which uses simplified Chinese characters, uses the GB 18030 character se ...
, ex-PRC Premier Zhu Rongji ()'s "镕" missing in GB 2312, copyright symbol "©" missing in GBK. Newspapers have dealt with this problem in various ways, including using software to combine two existing, similar characters; using a picture of the personality; or simply substituting a homophone for the rare character in the hope that the reader would be able to make the correct inference.


Indic text

A similar effect can occur in Brahmic or Indic scripts of
South Asia South Asia is the southern subregion of Asia, which is defined in both geographical and ethno-cultural terms. The region consists of the countries of Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, and Sri Lanka.;;;;; ...
, used in such Indo-Aryan or Indic languages as Hindustani (Hindi-Urdu),
Bengali Bengali or Bengalee, or Bengalese may refer to: *something of, from, or related to Bengal, a large region in South Asia * Bengalis, an ethnic and linguistic group of the region * Bengali language, the language they speak ** Bengali alphabet, the w ...
,
Punjabi Punjabi, or Panjabi, most often refers to: * Something of, from, or related to Punjab, a region in India and Pakistan * Punjabi language * Punjabi people * Punjabi dialects and languages Punjabi may also refer to: * Punjabi (horse), a British Th ...
, Marathi, and others, even if the character set employed is properly recognized by the application. This is because, in many Indic scripts, the rules by which individual letter symbols combine to create symbols for syllables may not be properly understood by a computer missing the appropriate software, even if the glyphs for the individual letter forms are available. One example of this is the old Wikipedia logo, which attempts to show the character analogous to "wi" (the first syllable of "Wikipedia") on each of many puzzle pieces. The puzzle piece meant to bear the
Devanagari Devanagari ( ; , , Sanskrit pronunciation: ), also called Nagari (),Kathleen Kuiper (2010), The Culture of India, New York: The Rosen Publishing Group, , page 83 is a left-to-right abugida (a type of segmental writing system), based on the ...
character for "wi" instead used to display the "wa" character followed by an unpaired "i" modifier vowel, easily recognizable as mojibake generated by a computer not configured to display Indic text. The logo as redesigned has fixed these errors. The idea of Plain Text requires the operating system to provide a font to display Unicode codes. This font is different from OS to OS for Singhala and it makes orthographically incorrect glyphs for some letters (syllables) across all operating systems. For instance, the 'reph', the short form for 'r' is a diacritic that normally goes on top of a plain letter. However, it is wrong to go on top of some letters like 'ya' or 'la' in specific contexts. For Sanskritic words or names inherited by modern languages, such as कार्य, IAST: ''kārya'', or आर्या, IAST: ''āryā'', it is apt to put it on top of these letters. By contrast, for similar sounds in modern languages which result from their specific rules, it is not put on top, such as the word करणाऱ्या, IAST: ''karaṇāryā'', a stem form of the common word करणारा/री, IAST: ''karaṇārā/rī'', in the
Marathi language Marathi (; ''Marāṭhī'', ) is an Indo-Aryan languages, Indo-Aryan language predominantly spoken by Marathi people in the Indian state of Maharashtra. It is the official language of Maharashtra, and additional official language in the state o ...
. But it happens in most operating systems. This appears to be a fault of internal programming of the fonts. In Mac OS and iOS, the muurdhaja l (dark l) and 'u' combination and its long form both yield wrong shapes. Some Indic and Indic-derived scripts, most notably Lao, were not officially supported by
Windows XP Windows XP is a major release of Microsoft's Windows NT operating system. It was release to manufacturing, released to manufacturing on August 24, 2001, and later to retail on October 25, 2001. It is a direct upgrade to its predecessors, Wind ...
until the release of Vista. However, various sites have made free-to-download fonts.


Burmese

Due to Western sanctions and the late arrival of Burmese language support in computers, much of the early Burmese localization was homegrown without international cooperation. The prevailing means of Burmese support is via the Zawgyi font, a font that was created as a Unicode font but was in fact only partially Unicode compliant. In the Zawgyi font, some
codepoint In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
s for Burmese script were implemented as specified in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, ...
, but others were not. The Unicode Consortium refers to this as ''ad hoc font encodings''. With the advent of mobile phones, mobile vendors such as Samsung and Huawei simply replaced the Unicode compliant system fonts with Zawgyi versions. Due to these ''ad hoc'' encodings, communications between users of Zawgyi and Unicode would render as garbled text. To get around this issue, content producers would make posts in both Zawgyi and Unicode. Myanmar government has designated 1 October 2019 as "U-Day" to officially switch to Unicode. The full transition is estimated to take two years.


African languages

In certain
writing systems of Africa The writing systems of Africa refer to the current and historical practice of writing systems on the African continent, both indigenous and those introduced. Today, the Latin script is commonly encountered across Africa, especially in the Western ...
, unencoded text is unreadable. Texts that may produce mojibake include those from the
Horn of Africa The Horn of Africa (HoA), also known as the Somali Peninsula, is a large peninsula and geopolitical region in East Africa.Robert Stock, ''Africa South of the Sahara, Second Edition: A Geographical Interpretation'', (The Guilford Press; 2004 ...
such as the Ge'ez script in
Ethiopia Ethiopia, , om, Itiyoophiyaa, so, Itoobiya, ti, ኢትዮጵያ, Ítiyop'iya, aa, Itiyoppiya officially the Federal Democratic Republic of Ethiopia, is a landlocked country in the Horn of Africa. It shares borders with Eritrea to the ...
and
Eritrea Eritrea ( ; ti, ኤርትራ, Ertra, ; ar, إرتريا, ʾIritriyā), officially the State of Eritrea, is a country in the Horn of Africa region of Eastern Africa, with its capital and largest city at Asmara. It is bordered by Ethiopi ...
, used for
Amharic Amharic ( or ; (Amharic: ), ', ) is an Ethiopian Semitic language, which is a subgrouping within the Semitic branch of the Afroasiatic languages. It is spoken as a first language by the Amharas, and also serves as a lingua franca for all oth ...
, Tigre, and other languages, and the
Somali language Somali (Latin script: ; Wadaad: ; Osmanya: 𐒖𐒍 𐒈𐒝𐒑𐒛𐒐𐒘 ) is an Afroasiatic language belonging to the Cushitic branch. It is spoken as a mother tongue by Somalis in Greater Somalia and the Somali diaspora. Somali is an ...
, which employs the Osmanya alphabet. In
Southern Africa Southern Africa is the southernmost subregion of the African continent, south of the Congo and Tanzania. The physical location is the large part of Africa to the south of the extensive Congo River basin. Southern Africa is home to a number o ...
, the Mwangwego alphabet is used to write languages of
Malawi Malawi (; or aláwi Tumbuka: ''Malaŵi''), officially the Republic of Malawi, is a landlocked country in Southeastern Africa that was formerly known as Nyasaland. It is bordered by Zambia to the west, Tanzania to the north and northe ...
and the Mandombe alphabet was created for the
Democratic Republic of the Congo The Democratic Republic of the Congo (french: République démocratique du Congo (RDC), colloquially "La RDC" ), informally Congo-Kinshasa, DR Congo, the DRC, the DROC, or the Congo, and formerly and also colloquially Zaire, is a country in ...
, but these are not generally supported. Various other writing systems native to
West Africa West Africa or Western Africa is the westernmost region of Africa. The United Nations defines Western Africa as the 16 countries of Benin, Burkina Faso, Cape Verde, The Gambia, Ghana, Guinea, Guinea-Bissau, Ivory Coast, Liberia, Mali ...
present similar problems, such as the N'Ko alphabet, used for Manding languages in
Guinea Guinea ( ),, fuf, 𞤘𞤭𞤲𞤫, italic=no, Gine, wo, Gine, nqo, ߖߌ߬ߣߍ߫, bm, Gine officially the Republic of Guinea (french: République de Guinée), is a coastal country in West Africa. It borders the Atlantic Ocean to the we ...
, and the Vai syllabary, used in
Liberia Liberia (), officially the Republic of Liberia, is a country on the West African coast. It is bordered by Sierra Leone to Liberia–Sierra Leone border, its northwest, Guinea to Guinea–Liberia border, its north, Ivory Coast to Ivory Coast� ...
.


Arabic

Another affected language is
Arabic Arabic (, ' ; , ' or ) is a Semitic language spoken primarily across the Arab world.Semitic languages: an international handbook / edited by Stefan Weninger; in collaboration with Geoffrey Khan, Michael P. Streck, Janet C. E.Watson; Walter ...
(see
below Below may refer to: *Earth * Ground (disambiguation) *Soil *Floor * Bottom (disambiguation) *Less than *Temperatures below freezing *Hell or underworld People with the surname *Ernst von Below (1863–1955), German World War I general *Fred Below ...
). The text becomes unreadable when the encodings do not match.


Examples

The examples in this article do not have UTF-8 as browser setting, because UTF-8 is easily recognisable, so if a browser supports UTF-8 it should recognise it automatically, and not try to interpret something else as UTF-8.


See also

*
Code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
* Replacement character * Substitute character *
Newline Newline (frequently called line ending, end of line (EOL), next line (NEL) or line break) is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or ...
– The conventions for representing the line break differ between Windows and Unix systems. Though most software supports both conventions (which is trivial), software that must preserve or display the difference (e.g.
version control system In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections ...
s and data comparison tools) can get substantially more difficult to use if not adhering to one convention. *
Byte order mark The byte order mark (BOM) is a particular usage of the special Unicode character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or endianness, of t ...
– The most
in-band In telecommunications, in-band signaling is the sending of control information within the same band or channel used for data such as voice or video. This is in contrast to out-of-band signaling which is sent over a different channel, or even ...
way to store the encoding together with the data – prepend it. This is by intention invisible to humans using compliant software, but will by design be perceived as "garbage characters" to incompliant software (including many
interpreters Interpreting is a translational activity in which one produces a first and final target-language output on the basis of a one-time exposure to an expression in a source language. The most common two modes of interpreting are simultaneous interp ...
). *
HTML entities In SGML, HTML and XML documents, the logical constructs known as ''character data'' and ''attribute values'' consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series ...
– An encoding of special characters in HTML, mostly optional, but required for certain characters to
escape Escape or Escaping may refer to: Computing * Escape character, in computing and telecommunication, a character which signifies that what follows takes an alternative interpretation ** Escape sequence, a series of characters used to trigger some s ...
interpretation as markup. While failure to apply this transformation is a vulnerability (see
cross-site scripting Cross-site scripting (XSS) is a type of security vulnerability that can be found in some web applications. XSS attacks enable attackers to inject client-side scripts into web pages viewed by other users. A cross-site scripting vulnerability m ...
), applying it too many times results in garbling of these characters. For example, the quotation mark " becomes ", ", " and so on. *
Bush hid the facts Bush hid the facts is a common name for a bug present in some versions of Microsoft Windows, which causes text encoded in ASCII to be interpreted as if it were UTF-16LE, resulting in garbled text. When the string "Bush hid the facts", without q ...


References


External links

* * {{Character encoding Character encoding Computer errors Nonsense