Mojibake

Mojibake (Japanese: 文字化け, "character transformation") is the garbled text that results from decoding text with an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.

This display may include the generic replacement character ("�") in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This is either because of differing constant-length encodings (as in Asian 16-bit encodings versus European 8-bit encodings), or the use of variable-length encodings (notably UTF-8 and UTF-16).

Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake. Symptoms of this failed rendering include blocks with the code point displayed in hexadecimal or using the generic replacement character. Importantly, these replacements are ''valid'' and are the result of correct error handling by the software.

Causes

To correctly reproduce the original text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved. Because mojibake is a mismatch between the two, it can be produced either by altering the data itself or merely by relabeling it.

Mojibake is often seen with text data that have been tagged with a wrong encoding; the data may not even be tagged at all, but moved between computers with different default encodings. A major source of trouble is communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data.

The differing default settings between computers are in part due to differing deployments of Unicode among operating system families, and partly due to the legacy encodings' specializations for different writing systems of human languages. Whereas Linux distributions mostly switched to UTF-8 in 2004, Microsoft Windows generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages.

For some writing systems, an example being Japanese, several encodings have historically been employed, causing users to see mojibake relatively often. As a Japanese example, the word ''mojibake'' "文字化け" stored as EUC-JP might be incorrectly displayed as "ﾊｸｻ�ｽ､ｱ", "ﾊｸｻ嵂ｽ､ｱ" (MS-932), or "ﾊｸｻ郾ｽ､ｱ" (Shift JIS-2004). The same text stored as UTF-8 is displayed as "譁�蟄怜喧縺�" if interpreted as Shift JIS. This is further exacerbated if other locales are involved: the same UTF-8 text appears as "æ–‡å—åŒ–ã‘" in software that assumes text to be in the Windows-1252 or ISO-8859-1 encodings, usually labelled Western, or (for example) as "鏂囧瓧鍖栥亼" if interpreted as being in a GBK (Mainland China) locale.
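The mismatches described above can be reproduced with a few lines of Python. This is only a minimal sketch: `misread` is a hypothetical helper name, the codec names are those of Python's standard library, and the exact garbled output depends on the codec tables and error handling of whatever software does the decoding, so it may not match the strings quoted above character for character.

```python
def misread(text: str, stored_as: str, read_as: str) -> str:
    """Encode text with one encoding, then decode the bytes with another."""
    raw = text.encode(stored_as)
    # errors="replace" substitutes U+FFFD for byte sequences that are
    # invalid in the assumed encoding instead of raising an exception.
    return raw.decode(read_as, errors="replace")


word = "文字化け"

# EUC-JP bytes read as Shift JIS, and UTF-8 bytes read under three
# other assumptions; each result is a different kind of mojibake.
samples = {
    "euc_jp as shift_jis": misread(word, "euc_jp", "shift_jis"),
    "utf-8 as shift_jis": misread(word, "utf-8", "shift_jis"),
    "utf-8 as cp1252": misread(word, "utf-8", "cp1252"),
    "utf-8 as gbk": misread(word, "utf-8", "gbk"),
}
```

Iterating over `samples` and printing each value shows the garbled forms side by side. Decoding with the same encoding that was used for storing returns the original word unchanged.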

Underspecification

If the encoding is not specified, it is up to the software to decide it by other means. Depending on the type of software, the typical solution is either configuration or charset detection heuristics. Both are prone to mis-prediction in not-so-uncommon scenarios.

The encoding of text files is affected by the locale setting, which depends on the user's language, brand of operating system, and possibly other conditions. Therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting, or even from differently localized software within the same system. For Unicode, one solution is to use a byte order mark, but for source code and other machine-readable text, many parsers do not tolerate this. Another is storing the encoding as metadata in the file system. File systems that support extended file attributes can store this as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software.

While a few encodings are easy to detect, in particular UTF-8, there are many that are hard to distinguish (see charset detection). A web browser may not be able to distinguish a page coded in EUC-JP from another in Shift-JIS if the coding scheme is not assigned explicitly using HTTP headers sent along with the documents, or using the HTML document's meta tags, which substitute for missing HTTP headers if the server cannot be configured to send the proper ones; see character encodings in HTML.
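As a rough illustration of the byte order mark approach mentioned above, the following Python sketch inspects the first bytes of a buffer and falls back to a configured default when no mark is present. The helper names `sniff_bom` and `decode_with_bom` are hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE mark begins with the same two bytes as the
# UTF-16 LE mark, so the longer marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if present, otherwise the configured fallback."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)
```

Without a mark the function can only guess, which is exactly the underspecified situation this section describes: the `fallback` argument plays the role of the locale or configuration setting.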

Mis-specification

Mojibake also occurs when the encoding is wrongly specified. This often happens between encodings that are similar. For example, the Eudora email client for Windows was known to send emails labelled as ISO-8859-1 that were in reality Windows-1252. Windows-1252 contains extra printable characters in the C1 range (the most frequently seen being curved quotation marks and extra dashes) that were not displayed properly in software complying with the ISO standard; this especially affected software running under other operating systems such as Unix.
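The effect of such a wrong label can be sketched in Python. The message text and the helper name `c1_controls` are made up for illustration; `cp1252` and `latin-1` are Python's codec names for Windows-1252 and ISO-8859-1.

```python
def c1_controls(data: bytes) -> list:
    """Return the characters that ISO-8859-1 decodes to C1 control codes."""
    return [ch for ch in data.decode("latin-1") if 0x80 <= ord(ch) <= 0x9F]


# Bytes as the sending software produced them (really Windows-1252).
message = "\u201cquoted\u201d text".encode("cp1252")

as_labelled = message.decode("latin-1")  # what a strict ISO-8859-1 reader sees
as_intended = message.decode("cp1252")   # what the sender meant

# The curly quotes occupy bytes 0x93 and 0x94, which ISO-8859-1 assigns to
# invisible control codes rather than printable characters.
hidden = c1_controls(message)
```

Everything outside the 0x80 to 0x9F range decodes identically under both encodings, which is why only the quotation marks and dashes were affected.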

User oversight

Of the encodings still in common use, many originated from taking ASCII and extending it; as a result, these encodings are partially compatible with each other. Examples of this include Windows-1252 and ISO 8859-1. People may thus mistake the extended encoding they are using for plain ASCII.
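The partial compatibility can be checked directly. The sketch below, with the hypothetical helper `decodings`, decodes the same bytes under several ASCII-derived encodings: pure ASCII data gives identical results everywhere, while a single accented letter is enough to make the encodings disagree.

```python
def decodings(data: bytes, encodings=("ascii", "latin-1", "cp1252", "utf-8")) -> dict:
    """Decode data under each encoding; None marks a decoding failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


plain = decodings(b"plain text")               # every encoding agrees
accented = decodings("café".encode("cp1252"))  # ASCII and UTF-8 both fail
is_ascii = b"plain text".isascii()             # cheap check before assuming ASCII
```

A file that has only ever contained unaccented English text gives no hint of which extended encoding its author's software was actually using.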

Overspecification

When there are layers of protocols, each trying to specify the encoding based on different information, the least certain information may be misleading to the recipient. For example, consider a web server serving a static HTML file over HTTP. The character set may be communicated to the client in any number of three ways:

* in the HTTP header. This information can be based on server configuration (for instance, when serving a file off disk) or controlled by the application running on the server (for dynamic websites).
* in the file, as an HTML meta tag (http-equiv or charset) or the encoding attribute of an XML declaration. This is the encoding that the author meant to save the particular file in.
* in the file, as a byte order mark. This is the encoding that the author's editor actually saved it in. Unless an accidental encoding conversion has happened (by opening it in one encoding and saving it in another), this will be correct. It is, however, only available in Unicode encodings such as UTF-8 or UTF-16.
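The three sources listed above can disagree, and a simple consistency check makes that visible. The following Python sketch is illustrative only: `declared_charsets` and `conflicting` are hypothetical names, the regular expressions are a simplification rather than the algorithm browsers actually use, and encoding names are compared literally without resolving aliases.

```python
import codecs
import re


def declared_charsets(content_type: str, body: bytes) -> dict:
    """Collect the encoding claimed by the HTTP header, a meta tag and a BOM."""
    found = {}
    header = re.search(r'charset="?([\w.:-]+)', content_type, re.I)
    if header:
        found["http"] = header.group(1).lower()
    # Only the start of the document is searched for a meta declaration.
    meta = re.search(rb'<meta[^>]+charset=["\']?([\w.:-]+)', body[:1024], re.I)
    if meta:
        found["meta"] = meta.group(1).decode("ascii").lower()
    if body.startswith(codecs.BOM_UTF8):
        found["bom"] = "utf-8"
    elif body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        found["bom"] = "utf-16"
    return found


def conflicting(found: dict) -> bool:
    """True if the collected declarations name more than one encoding."""
    return len(set(found.values())) > 1
```

A server configured with a site-wide default charset while serving a file saved as UTF-8 with a mark would be flagged by `conflicting`; which declaration should win is a policy decision this sketch does not make.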

Lack of hardware or software support

Much older hardware is typically designed to support only one character set, and the character set typically cannot be altered. The character table contained within the display firmware will be localized to have characters for the country the device is to be sold in, and typically the table differs from country to country. As such, these systems will potentially display mojibake when loading text generated on a system from a different country. Likewise, many early operating systems do not support multiple encoding formats and thus will end up displaying mojibake if made to display non-standard text. Early versions of Microsoft Windows and Palm OS, for example, are localized on a per-country basis and will only support encoding standards relevant to the country the localized version will be sold in; they will display mojibake if a file is opened that contains text in an encoding other than the one the OS version is designed to support.

Resolutions

Applications using UTF-8 as a default encoding may achieve a greater degree of interoperability because of its widespread use and backward compatibility with US-ASCII. UTF-8 can also be recognised directly by a simple algorithm, so that well-written software should be able to avoid mixing UTF-8 up with other encodings.

The difficulty of resolving an instance of mojibake varies depending on the application within which it occurs and the causes of it. Two of the most common applications in which mojibake may occur are web browsers and word processors. Modern browsers and word processors often support a wide array of character encodings. Browsers often allow a user to change their rendering engine's encoding setting on the fly, while word processors allow the user to select the appropriate encoding when opening a file. It may take some trial and error for users to find the correct encoding.

The problem gets more complicated when it occurs in an application that normally does not support a wide range of character encodings, such as a non-Unicode computer game. In this case, the user must change the operating system's encoding settings to match those of the game. However, changing the system-wide encoding settings can also cause mojibake in pre-existing applications. In Windows XP or later, a user also has the option to use Microsoft AppLocale, an application that allows the changing of per-application locale settings. Even so, changing the operating system encoding settings is not possible on earlier operating systems such as Windows 98; to resolve this issue on earlier operating systems, a user would have to use third-party font rendering applications.
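One way to realise the "simple algorithm" for recognising UTF-8 is to attempt a strict decode and treat any failure as evidence of a legacy encoding. The Python sketch below assumes Windows-1252 as the legacy fallback, which is an arbitrary choice for illustration; `looks_like_utf8` and `decode_guess` are hypothetical names.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_guess(data: bytes, fallback: str = "cp1252") -> str:
    """Prefer UTF-8; otherwise decode with the assumed legacy encoding."""
    if looks_like_utf8(data):
        return data.decode("utf-8")
    # Bytes undefined in the fallback encoding become U+FFFD.
    return data.decode(fallback, errors="replace")
```

The check works because text in a legacy 8-bit encoding that contains non-ASCII characters rarely happens to form valid UTF-8 multi-byte sequences. Pure ASCII input passes trivially, which is harmless since it decodes the same way either way.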

Problems in different writing systems

English

Mojibake in English texts generally occurs in punctuation, such as em dashes (—), en dashes (–), and curly quotes (“, ”, ‘, ’), but rarely in character text, since most encodings agree with ASCII on the encoding of the English alphabet. For example, the pound sign "£" will appear as "Â£" if it was encoded by the sender as UTF-8 but interpreted by the recipient as CP1252 or ISO 8859-1. If iterated using CP1252, this can lead to "Ã‚Â£", "Ãƒâ€šÃ‚Â£", "ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚Â£", etc.

In older eras, some computers had vendor-specific encodings which caused mismatches for English text as well. Commodore brand 8-bit computers used the PETSCII encoding, particularly notable for inverting the upper and lower case compared to standard ASCII. PETSCII printers worked fine on other computers of the era, but flipped the case of all letters. IBM mainframes use the EBCDIC encoding, which does not match ASCII at all.
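The iterated corruption of the pound sign can be reproduced, and undone, by repeating the same wrong conversion or its inverse. This Python sketch uses the hypothetical helpers `garble` and `repair`; the repair only works if no bytes were lost along the way and the number of rounds is known or found by trial.

```python
def garble(text: str, rounds: int = 1) -> str:
    """Encode as UTF-8 and misread as Windows-1252, repeatedly."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def repair(text: str, rounds: int = 1) -> str:
    """Reverse garble(): re-encode as Windows-1252 and decode as UTF-8."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


once = garble("£")       # "Â£"
twice = garble("£", 2)   # "Ã‚Â£"
restored = repair(garble("£", 3), 3)
```

Each round roughly doubles the length of the affected sequence, because every non-ASCII character is turned into two or more new non-ASCII characters that are themselves re-encoded in the next round.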

Other Western European languages

The alphabets of the North Germanic languages, Catalan, Finnish, German, French, Portuguese and Spanish are all extensions of the Latin alphabet. The additional characters are typically the ones that become corrupted, making texts only mildly unreadable with mojibake:

* å, ä, ö in Finnish and Swedish
* à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
* æ, ø, å in Norwegian and Danish
* á, é, ó, ĳ, è, ë, ï in Dutch
* ä, ö, ü, and ß in German
* á, ð, í, ó, ú, ý, æ, ø in Faroese
* á, ð, é, í, ó, ú, ý, þ, æ, ö in Icelandic
* à, â, ç, è, é, ë, ê, ï, î, ô, ù, û, ü, ÿ, æ, œ in French
* à, è, é, ì, ò, ù in Italian
* á, é, í, ñ, ó, ú, ü, ¡, ¿ in Spanish
* à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
* á, é, í, ó, ú in Irish
* à, è, ì, ò, ù in Scottish Gaelic
* £ in British English

… and their uppercase counterparts, if applicable.

These are languages for which the ISO-8859-1 character set (also known as ''Latin 1'' or ''Western'') has been in use. However, ISO-8859-1 has been obsoleted by two competing standards, the backward compatible Windows-1252, and the slightly altered ISO-8859-15. Both add the Euro sign € and the French œ, but otherwise any confusion of these three character sets does not create mojibake in these languages. Furthermore, it is always safe to interpret ISO-8859-1 as Windows-1252, and fairly safe to interpret it as ISO-8859-15, in particular with respect to the Euro sign, which replaces the rarely used currency sign (¤).

However, with the advent of UTF-8, mojibake has become more common in certain scenarios, e.g. exchange of text files between UNIX and Windows computers, due to UTF-8's incompatibility with Latin-1 and Windows-1252. Because UTF-8 can be recognised directly by a simple algorithm, well-written software should be able to avoid mixing it up with other encodings, so this problem was most common when much software did not yet support UTF-8. Most of these languages were supported by the MS-DOS default CP437 and other machine default encodings, except ASCII, so problems when buying an operating system version were less common. The Windows and MS-DOS encodings are, however, not compatible with each other.

In Swedish, Norwegian, Danish and German, vowels are rarely repeated, and it is usually obvious when one character gets corrupted, e.g. the second letter in "kÃ¤rlek" (''kärlek'', "love"). This way, even though the reader has to guess between å, ä and ö, almost all texts remain legible. Finnish text, on the other hand, does feature repeating vowels in words like ''hääyö'' ("wedding night"), which can sometimes render text very hard to read (e.g. ''hääyö'' appears as "hÃ¤Ã¤yÃ¶"). Icelandic and Faroese have ten and eight possibly confounding characters, respectively, which can make it more difficult to guess corrupted characters; Icelandic words like ''þjóðlöð'' ("outstanding hospitality") become almost entirely unintelligible when rendered as "Ã¾jÃ³Ã°lÃ¶Ã°".

In German, ''Buchstabensalat'' ("letter salad") is a common term for this phenomenon, and in Spanish, ''deformación'' (literally deformation).

Some users transliterate their writing when using a computer, either by omitting the problematic diacritics, or by using digraph replacements (å → aa, ä/æ → ae, ö/ø → oe, ü → ue etc.). Thus, an author might write "ueber" instead of "über", which is standard practice in German when umlauts are not available. The latter practice seems to be better tolerated in the German language sphere than in the Nordic countries. For example, in Norwegian, digraphs are associated with archaic Danish, and may be used jokingly. However, digraphs are useful in communication with other parts of the world. As an example, the Norwegian football player Ole Gunnar Solskjær had his name spelled "SOLSKJAER" on his back when he played for Manchester United.

An artifact of UTF-8 misinterpreted as ISO-8859-1, "Ring meg nÃ¥" ("Ring meg nå"), was seen in an SMS scam that spread in Norway in June 2014.
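The digraph replacements described above amount to a small translation table. The Python sketch below covers only the substitutions named in the text; `transliterate` is a hypothetical name, and mapping an uppercase letter to a capitalised digraph (Ü to Ue) is an assumption, so fully uppercase input would need separate handling.

```python
# Lowercase mappings taken from the text: å → aa, ä/æ → ae, ö/ø → oe, ü → ue.
_LOWER = {"å": "aa", "ä": "ae", "æ": "ae", "ö": "oe", "ø": "oe", "ü": "ue"}

# Assumption: an uppercase letter becomes a capitalised digraph.
_TABLE = str.maketrans(
    {**_LOWER, **{k.upper(): v.capitalize() for k, v in _LOWER.items()}}
)


def transliterate(text: str) -> str:
    """Replace the listed letters with their ASCII digraphs."""
    return text.translate(_TABLE)


german = transliterate("über")          # "ueber"
norwegian = transliterate("Solskjær")   # "Solskjaer"
```

The result is pure ASCII for text that uses only these letters, so it survives any of the encoding mismatches discussed in this section, at the cost of the readability concerns noted for the Nordic languages.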

Central and Eastern European

Users of Central and Eastern European languages can also be affected. Because most computers were not connected to any network during the mid- to late-1980s, there were different character encodings for every language with diacritical characters (see ISO/IEC 8859 and KOI-8), often also varying by operating system.

Hungarian

Hungarian is another affected language. It uses the 26 basic English characters, plus the accented forms á, é, í, ó, ú, ö, ü (all present in the Latin-1 character set), plus the two characters ő and ű, which are not in Latin-1. These two characters can be correctly encoded in Latin-2, Windows-1250 and Unicode. Before Unicode became common in e-mail clients, e-mails containing Hungarian text often had the letters ő and ű corrupted, sometimes to the point of unrecognizability. It is common to respond to an e-mail rendered unreadable (see examples below) by character mangling (referred to as "betűszemét", meaning "letter garbage") with the phrase "Árvíztűrő tükörfúrógép", a nonsense phrase (literally "Flood-resistant mirror-drilling machine") that contains all the accented characters used in Hungarian.
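As a small illustration of why only ő and ű were at risk, the following sketch checks which letters of the test phrase cannot be represented in a given encoding. The helper name `unencodable` is hypothetical; the codec names are the ones Python's standard library uses for Latin-1, Latin-2 and Windows-1250.

```python
PANGRAM = "Árvíztűrő tükörfúrógép"


def unencodable(text: str, encoding: str) -> set:
    """Return the characters of `text` that `encoding` has no byte for."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


for enc in ("latin-1", "iso8859_2", "cp1250", "utf-8"):
    # Only Latin-1 should report missing letters.
    print(enc, sorted(unencodable(PANGRAM, enc)))
```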

Examples

Polish

Prior to the creation of ISO 8859-2 in 1987, users of various computing platforms used their own character encodings such as AmigaPL on Amiga, Atari Club on Atari ST, and IBM CP852, Mazovia and Windows CP1250 on IBM PCs. Polish companies selling early DOS computers created their own mutually incompatible ways to encode Polish characters and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA, or Hercules) to provide hardware code pages with the needed glyphs for Polish, arbitrarily located without reference to where other computer sellers had placed them.

The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 succeeded as the "Internet standard" with limited support of the dominant vendors' software (today largely replaced by Unicode). With the numerous problems caused by the variety of encodings, even today some users tend to refer to Polish diacritical characters as krzaczki (lit. "little shrubs").
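To make the incompatibility concrete, the sketch below compares where three of the encodings named above place the same Polish letter, and what happens when bytes written under one convention are read under another. It is only an illustration: the helper names are made up, and Mazovia and the vendor-specific hardware code pages are left out because Python's standard library has no codec for them.

```python
ENCODINGS = ("cp852", "iso8859_2", "cp1250")


def byte_for(letter: str) -> dict:
    """Map each encoding name to the single byte it uses for `letter`."""
    return {enc: letter.encode(enc)[0] for enc in ENCODINGS}


def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode with one code page and decode with another."""
    return text.encode(written_as).decode(read_as)


for enc, value in byte_for("ą").items():
    print(f"{enc:10} 0x{value:02X}")

# Text saved as Windows-1250 and shown with the DOS code page:
print(misread("żółć", "cp1250", "cp852"))
```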

Russian and other Cyrillic alphabets

Mojibake may be colloquially called krakozyabry (кракозябры) in Russian, where the situation was and remains complicated by several systems for encoding Cyrillic. The Soviet Union and early Russian Federation developed KOI encodings (Код Обмена Информацией, Kod Obmena Informatsiey, which translates to "Code for Information Exchange"). This began with Cyrillic-only 7-bit KOI7, based on ASCII but with Latin and some other characters replaced with Cyrillic letters. Then came the 8-bit KOI8 encoding, an ASCII extension which encodes Cyrillic letters only with high-bit-set octets corresponding to the 7-bit codes from KOI7. It is for this reason that KOI8 text, even Russian, remains partially readable after stripping the eighth bit, which was considered a major advantage in the age of 8BITMIME-unaware email systems. For example, the words "школа русского языка" (shkola russkogo yazyka), encoded in KOI8 and then passed through the high-bit stripping process, end up rendered as "[KOLA RUSSKOGO QZYKA". Eventually KOI8 gained different flavors for Russian and Bulgarian (KOI8-R), Ukrainian (KOI8-U), Belarusian (KOI8-RU) and even Tajik (KOI8-T).

Meanwhile, in the West, Code page 866 supported Ukrainian and Belarusian as well as Russian and Bulgarian in MS-DOS. For Microsoft Windows, Code Page 1251 added support for Serbian and other Slavic variants of Cyrillic.

Most recently, the Unicode encoding includes code points for practically all the characters of all the world's languages, including all Cyrillic characters.

Before Unicode, it was necessary to match text encoding with a font using the same encoding system. Failure to do this produced unreadable gibberish whose specific appearance varied depending on the exact combination of text encoding and font encoding. For example, attempting to view non-Unicode Cyrillic text using a font that is limited to the Latin alphabet, or using the default ("Western") encoding, typically results in text that consists almost entirely of vowels with diacritical marks. (KOI8 "Библиотека" (biblioteka, library) becomes "âÉÂÌÉÏÔÅËÁ".) Using Windows codepage 1251 to view text in KOI8 or vice versa results in garbled text that consists mostly of capital letters (KOI8 and codepage 1251 share the same ASCII region, but KOI8 has uppercase letters in the region where codepage 1251 has lowercase, and vice versa). In general, Cyrillic gibberish is symptomatic of using the wrong Cyrillic font.

During the early years of the Russian sector of the World Wide Web, both KOI8 and codepage 1251 were common. As of 2017, one can still encounter HTML pages in codepage 1251 and, rarely, KOI8 encodings, as well as Unicode. (An estimated 1.7% of all web pages worldwide, all languages included, are encoded in codepage 1251.) Though the HTML standard includes the ability to specify the encoding for any given web page in its source, this is sometimes neglected, forcing the user to switch encodings in the browser manually.

In Bulgarian, mojibake is often called majmunica (маймуница), meaning "monkey's alphabet". In Serbian, it is called đubre (ђубре), meaning "trash". Unlike the former USSR, South Slavs never used something like KOI8, and Code Page 1251 was the dominant Cyrillic encoding there before Unicode. Therefore, these languages experienced fewer encoding incompatibility troubles than Russian. In the 1980s, Bulgarian computers used their own MIK encoding, which is superficially similar to (although incompatible with) CP866.

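The three Cyrillic effects described in this section (high-bit stripping of KOI8, KOI8 read as a Western encoding, and KOI8 read as codepage 1251) can be sketched as follows. This is an illustration under the assumption that the KOI8-R variant is used; the helper names are hypothetical.

```python
def strip_high_bit(text: str) -> str:
    """Encode as KOI8-R, clear the eighth bit of every byte, read the result as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode with one encoding and decode with another."""
    return text.encode(written_as).decode(read_as)


# Lowercase Cyrillic comes out as roughly transliterated uppercase Latin.
print(strip_high_bit("школа русского языка"))

# KOI8 viewed as Latin-1: mostly accented vowels.
print(misread("Библиотека", "koi8_r", "latin-1"))

# KOI8 viewed as codepage 1251: Cyrillic letters, but with the case flipped.
print(misread("библиотека", "koi8_r", "cp1251"))
```
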
Yugoslav languages

Croatian, Bosnian, Serbian (the seceding varieties of the Serbo-Croatian language) and Slovenian add to the basic Latin alphabet the letters š, đ, č, ć, ž, and their capital counterparts Š, Đ, Č, Ć, Ž (in Slovenian, officially only č/Č, š/Š and ž/Ž, although the others are used when needed, mostly in foreign names). All of these letters are defined in Latin-2 and Windows-1250, while only some (š, Š, ž, Ž, Đ) exist in the usual OS-default Windows-1252, and are there because of some other languages.

Although mojibake can occur with any of these characters, the letters that are not included in Windows-1252 are much more prone to errors. Thus, even nowadays, "šđčćž ŠĐČĆŽ" is often displayed as "šðèæž ŠÐÈÆŽ", although ð, è, æ, È, Æ are never used in Slavic languages.

When confined to basic ASCII (most user names, for example), common replacements are: š→s, đ→dj, č→c, ć→c, ž→z (capital forms analogously, with Đ→Dj or Đ→DJ depending on word case). All of these replacements introduce ambiguities, so reconstructing the original from such a form is usually done manually if required.

The Windows-1252 encoding is important because the English versions of the Windows operating system are the most widespread, not localized ones. The reasons for this include a relatively small and fragmented market, which increases the price of high-quality localization; a high degree of software piracy (in turn caused by the high price of software compared to income), which discourages localization efforts; and people preferring English versions of Windows and other software.

The drive to differentiate Croatian from Serbian, Bosnian from Croatian and Serbian, and now even Montenegrin from the other three creates many problems. There are many different localizations, using different standards and of different quality. There are no common translations for the vast amount of computer terminology originating in English. In the end, people use adopted English words ("kompjuter" for "computer", "kompajlirati" for "compile", etc.), and if they are unaccustomed to the translated terms, they may not understand what some option in a menu is supposed to do based on the translated phrase. Therefore, people who understand English, as well as those who are accustomed to English terminology (which is most people, because English terminology is also mostly taught in schools because of these problems), regularly choose the original English versions of non-specialist software.

When Cyrillic script is used (for Macedonian and partially Serbian), the problem is similar to other Cyrillic-based scripts.

Newer versions of English Windows allow the code page to be changed (older versions require special English versions with this support), but this setting can be and often was incorrectly set. For example, Windows 98 and Windows Me can be set to most non-right-to-left single-byte code pages including 1250, but only at install time.
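The "šðèæž ŠÐÈÆŽ" example corresponds to Windows-1250 bytes being read as Windows-1252, which a minimal sketch can reproduce. The ASCII fallback table below simply encodes the replacements listed in this section; the names `ASCII_FALLBACK` and `misread_1250_as_1252` are illustrative only.

```python
# Replacements listed above; note that č and ć both collapse to "c".
ASCII_FALLBACK = str.maketrans({
    "š": "s", "đ": "dj", "č": "c", "ć": "c", "ž": "z",
    "Š": "S", "Đ": "Dj", "Č": "C", "Ć": "C", "Ž": "Z",
})


def misread_1250_as_1252(text: str) -> str:
    """Encode as Windows-1250, then decode the same bytes as Windows-1252."""
    return text.encode("cp1250").decode("cp1252")


print(misread_1250_as_1252("šđčćž ŠĐČĆŽ"))
print("čćšžđ".translate(ASCII_FALLBACK))
```

Because two different letters map to the same ASCII replacement, the fallback cannot be inverted mechanically, which matches the remark that reconstruction is usually done by hand.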

Caucasian languages

The writing systems of certain languages of the Caucasus region, including the scripts of Georgian and Armenian, may produce mojibake. This problem is particularly acute in the case of ArmSCII or ARMSCII, a set of obsolete character encodings for the Armenian alphabet which have been superseded by Unicode standards. ArmSCII is not widely used because of a lack of support in the computer industry. For example, Microsoft Windows does not support it.

Asian encodings

Another type of mojibake occurs when text is erroneously parsed in a multi-byte encoding, such as one of the encodings for East Asian languages. With this kind of mojibake more than one (typically two) characters are corrupted at once, e.g. "k舐lek" (kärlek) in Swedish, where "är" is parsed as "舐". Compared to the above mojibake, this is harder to read, since letters unrelated to the problematic å, ä or ö are missing, and it is especially problematic for short words starting with å, ä or ö such as "än" (which becomes "舅"). Since two letters are combined, the mojibake also seems more random (over 50 variants compared to the normal three, not counting the rarer capitals). In some rare cases, an entire text string which happens to include a pattern of particular word lengths, such as the sentence "Bush hid the facts", may be misinterpreted.

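Both cases in this section can be illustrated with a short sketch. It assumes the Swedish word was stored as Latin-1 and parsed as Shift-JIS, and that the ASCII sentence was parsed as UTF-16LE; the `misread` helper is hypothetical.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


# The byte for "ä" is taken as a lead byte and swallows the following "r".
swedish = misread("kärlek", "latin-1", "shift_jis")
print(swedish, len(swedish))

# Every pair of ASCII bytes is read as one 16-bit code unit.
sentence = misread("Bush hid the facts", "ascii", "utf-16-le")
print(sentence, len(sentence))
```
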
Vietnamese

In Vietnamese, the phenomenon is called ''chữ ma'' or ''loạn mã''. It can occur when a computer tries to convert diacritic characters defined in Windows-1258, TCVN3 or VNI to UTF-8. ''Chữ ma'' was common in Vietnam when using Windows XP computers or cheap mobile phones.

Japanese

In Japanese, the same phenomenon is, as mentioned, called mojibake (文字化け). It is a particular problem in Japan due to the numerous different encodings that exist for Japanese text. Alongside Unicode encodings like UTF-8 and UTF-16, there are other standard encodings, such as Shift-JIS (Windows machines) and EUC-JP (UNIX systems). Mojibake, as well as being encountered by Japanese users, is also often encountered by non-Japanese users when attempting to run software written for the Japanese market.

Chinese

In Chinese, the same phenomenon is called ''Luàn mǎ'' (Pinyin; Simplified Chinese: 乱码; Traditional Chinese: 亂碼; meaning 'chaotic code'), and can occur when computerised text is encoded in one Chinese character encoding but is displayed using the wrong encoding. When this occurs, it is often possible to fix the issue by switching the character encoding without loss of data. The situation is complicated because of the existence of several Chinese character encoding systems in use, the most common ones being Unicode, Big5, and Guobiao (with several backward compatible versions), and the possibility of Chinese characters being encoded using Japanese encoding.

It is easy to identify the original encoding when ''luanma'' occurs in Guobiao encodings.

An additional problem is caused when encodings are missing characters, which is common with rare or antiquated characters that are still used in personal or place names. Examples of this are Taiwanese politicians Wang Chien-shien's "煊", Yu Shyi-kun's "堃" and singer David Tao's "喆" missing in Big5, ex-PRC Premier Zhu Rongji's "镕" missing in GB 2312, and the copyright symbol "©" missing in GBK.

Newspapers have dealt with this problem in various ways, including using software to combine two existing, similar characters; using a picture of the personality; or simply substituting a homophone for the rare character in the hope that the reader would be able to make the correct inference.
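The missing-character problem can be checked directly against the codecs shipped with Python, as in the sketch below. It only covers the Guobiao examples from the text, and it assumes that Python's `gb2312`, `gbk` and `gb18030` codecs follow the respective standards closely enough for this purpose; `can_encode` is a hypothetical helper.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if `encoding` has a representation for the character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for ch, enc in [("镕", "gb2312"), ("镕", "gbk"), ("©", "gbk"), ("©", "gb18030")]:
    status = "ok" if can_encode(ch, enc) else "missing"
    print(f"{ch} in {enc}: {status}")
```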

Indic text

A similar effect can occur in Brahmic or Indic scripts of South Asia, used in such Indo-Aryan or Indic languages as Hindustani (Hindi-Urdu), Bengali, Punjabi, Marathi, and others, even if the character set employed is properly recognized by the application. This is because, in many Indic scripts, the rules by which individual letter symbols combine to create symbols for syllables may not be properly understood by a computer missing the appropriate software, even if the glyphs for the individual letter forms are available.

One example of this is the old Wikipedia logo, which attempts to show the character analogous to "wi" (the first syllable of "Wikipedia") on each of many puzzle pieces. The puzzle piece meant to bear the Devanagari character for "wi" instead used to display the "wa" character followed by an unpaired "i" modifier vowel, easily recognizable as mojibake generated by a computer not configured to display Indic text. The logo as redesigned has fixed these errors.

The idea of plain text requires the operating system to provide a font to display Unicode codes. This font is different from OS to OS for Sinhala, and it makes orthographically incorrect glyphs for some letters (syllables) across all operating systems. For instance, the 'reph', the short form for 'r', is a diacritic that normally goes on top of a plain letter. However, it is wrong for it to go on top of some letters like 'ya' or 'la' in specific contexts. For Sanskritic words or names inherited by modern languages, such as कार्य, IAST: ''kārya'', or आर्या, IAST: ''āryā'', it is apt to put it on top of these letters. By contrast, for similar sounds in modern languages which result from their specific rules, it is not put on top, such as the word करणाऱ्या, IAST: ''karaṇāryā'', a stem form of the common word करणारा/री, IAST: ''karaṇārā/rī'', in the Marathi language. But it happens in most operating systems. This appears to be a fault of the internal programming of the fonts. In Mac OS and iOS, the muurdhaja l (dark l) and 'u' combination and its long form both yield wrong shapes.

Some Indic and Indic-derived scripts, most notably Lao, were not officially supported by Windows until the release of Vista. However, various sites have made free-to-download fonts.
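The "wi" example can be made more concrete by looking at how the syllable is stored. The sketch below is only an illustration of the underlying code points, assuming the Devanagari syllable is written as the letter "va" followed by the dependent vowel sign "i"; whether it is drawn correctly depends entirely on the rendering software, which this code does not exercise.

```python
import unicodedata

# Devanagari "wi": consonant first, dependent vowel sign second (logical order).
WI = "\u0935\u093F"

for ch in WI:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A renderer with Indic shaping support draws the vowel sign to the left of the
# consonant; one without it shows the two glyphs side by side in storage order.
```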

Burmese

Due to Western sanctions and the late arrival of Burmese language support in computers, much of the early Burmese localization was homegrown without international cooperation. The prevailing means of Burmese support is via the Zawgyi font, a font that was created as a Unicode font but was in fact only partially Unicode compliant. In the Zawgyi font, some codepoints for Burmese script were implemented as specified in Unicode, but others were not. The Unicode Consortium refers to this as ''ad hoc font encodings''. With the advent of mobile phones, mobile vendors such as Samsung and Huawei simply replaced the Unicode compliant system fonts with Zawgyi versions.

Due to these ''ad hoc'' encodings, communications between users of Zawgyi and Unicode would render as garbled text. To get around this issue, content producers would make posts in both Zawgyi and Unicode. The Myanmar government designated 1 October 2019 as "U-Day" to officially switch to Unicode. The full transition is estimated to take two years.

African languages

In certain writing systems of Africa, unencoded text is unreadable. Texts that may produce mojibake include those from the Horn of Africa such as the Ge'ez script in Ethiopia and Eritrea, used for Amharic, Tigre, and other languages, and the Somali language, which employs the Osmanya alphabet. In Southern Africa, the Mwangwego alphabet is used to write languages of Malawi and the Mandombe alphabet was created for the Democratic Republic of the Congo, but these are not generally supported. Various other writing systems native to West Africa present similar problems, such as the N'Ko alphabet, used for Manding languages in Guinea, and the Vai syllabary, used in Liberia.

Arabic

Another affected language is Arabic (see below). The text becomes unreadable when the encodings do not match.
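As a minimal sketch of such a mismatch, the following Python snippet encodes a short Arabic word as UTF-8 and then decodes the same bytes as Windows-1252. The sample word and the choice of Windows-1252 as the wrong encoding are assumptions made for illustration; other encoding pairs garble text in the same way but produce different symbols.

```python
# Illustrative sample: the Arabic word for "Arabic" (an assumption, any text works).
original = "العربية"

# The writer stores the text as UTF-8: two bytes per Arabic letter.
utf8_bytes = original.encode("utf-8")

# The reader wrongly assumes Windows-1252, so every byte becomes one symbol.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Ø§Ù„Ø¹Ø±Ø¨ÙŠØ©

# No bytes were lost, so reversing the wrong step recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

Each Arabic letter turns into two unrelated Latin symbols, which is why the garbled text is twice as long as the original. The repair only works while the bytes themselves are intact; once the garbled text has been saved or converted again, information may be lost.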

Examples

The examples in this article do not use UTF-8 as the browser setting. UTF-8 is easily recognisable, so a browser that supports it should detect it automatically and should not try to interpret text in another encoding as UTF-8.
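One way to see why UTF-8 is easy to recognise is that text in other encodings rarely forms a valid UTF-8 byte sequence. The sketch below uses a strict decode as the check; the function name `looks_like_utf8` is hypothetical, and this is not a description of how any particular browser detects encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict mode raises on any malformed sequence
        return True
    except UnicodeDecodeError:
        return False


sample = "文字化け"

print(looks_like_utf8(sample.encode("utf-8")))       # True
print(looks_like_utf8(sample.encode("shift_jis")))   # False
print(looks_like_utf8("café".encode("latin-1")))     # False
```

The check can only rule UTF-8 out, not prove it: pure ASCII bytes, for example, are valid in UTF-8 and in many legacy encodings at once.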
