HOME

TheInfoList



OR:

In
orthography An orthography is a set of conventions for writing a language, including norms of spelling, hyphenation, capitalization, word breaks, emphasis, and punctuation. Most transnational languages in the modern period have a writing system, and mos ...
and
typography Typography is the art and technique of arranging type to make written language legible, readable and appealing when displayed. The arrangement of type involves selecting typefaces, point sizes, line lengths, line-spacing ( leading), and ...
, a homoglyph is one of two or more
grapheme In linguistics, a grapheme is the smallest functional unit of a writing system. The word ''grapheme'' is derived and the suffix ''-eme'' by analogy with ''phoneme'' and other names of emic units. The study of graphemes is called ''graphemics' ...
s,
characters Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
, or
glyph A glyph () is any kind of purposeful mark. In typography, a glyph is "the specific shape, design, or representation of a character". It is a particular graphical representation, in a particular typeface, of an element of written language. A g ...
s with shapes that appear identical or very similar. The designation is also applied to sequences of characters sharing these properties. Synoglyphs are glyphs that look different but mean the same thing. Synoglyphs are also known informally as ''display variants''. The term
homograph A homograph (from the el, ὁμός, ''homós'', "same" and γράφω, ''gráphō'', "write") is a word that shares the same written form as another word but has a different meaning. However, some dictionaries insist that the words must also ...
is sometimes used
synonym A synonym is a word, morpheme, or phrase that means exactly or nearly the same as another word, morpheme, or phrase in a given language. For example, in the English language, the words ''begin'', ''start'', ''commence'', and ''initiate'' are all ...
ously with homoglyph, but in the usual linguistic sense, homographs are
word A word is a basic element of language that carries an semantics, objective or pragmatics, practical semantics, meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of w ...
s that are spelled the same but have different meanings, a property of words, not characters. In 2008, the
Unicode Consortium The Unicode Consortium (legally Unicode, Inc.) is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California. Its primary purpose is to maintain and publish the Unicode Standard which was developed with the intenti ...
published its Technical Report #36 on a range of issues deriving from the visual similarity of characters both in single scripts, and similarities between characters in different scripts. An example of homoglyphic confusion in a historical regard results from the use of a 'y' to represent a 'þ' when setting older English texts in typefaces that do not contain the latter character. It has led in modern times to such phenomena as '' Ye olde shoppe'', implying incorrectly that the word ''the'' was formerly written ''ye'' . For further discussion, see
thorn Thorn(s) or The Thorn(s) may refer to: Botany * Thorns, spines, and prickles, sharp structures on plants * ''Crataegus monogyna'', or common hawthorn, a plant species Comics and literature * Rose and Thorn, the two personalities of two DC Com ...
. Examples of homoglyphic symbols are (a) the diaeresis and umlaut (both a pair of dots, but with different meaning, although
encoded In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...
with the same
code point In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
s); and (b) the
hyphen The hyphen is a punctuation mark used to join words and to separate syllables of a single word. The use of hyphens is called hyphenation. ''Son-in-law'' is an example of a hyphenated word. The hyphen is sometimes confused with dashes (figure d ...
and
minus sign The plus and minus signs, and , are mathematical symbols used to represent the notions of positive and negative, respectively. In addition, represents the operation of addition, which results in a sum, while represents subtraction, resulti ...
(both a short horizontal stroke, but with different meaning, although often encoded with the same code point). Among digits and
letters Letter, letters, or literature may refer to: Characters typeface * Letter (alphabet), a character representing one or more of the sounds used in speech; any of the symbols of an alphabet. * Letterform, the graphic form of a letter of the alphabe ...
, digit 1 and lowercase l are always encoded separately but in many
font In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a "sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mod ...
s are given very similar glyphs, and digit 0 and capital O are always encoded separately but in many
font In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a "sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mod ...
s are given very similar glyphs. Virtually every example of a homoglyphic pair of characters can potentially be differentiated graphically with clearly distinguishable glyphs and separate code points, but this is not always done.
Typeface A typeface (or font family) is the design of lettering that can include variations in size, weight (e.g. bold), slope (e.g. italic), width (e.g. condensed), and so on. Each of these variations of the typeface is a font. There are list of type ...
s that do not emphatically distinguish the one/el and zero/oh homoglyphs are considered unsuitable for writing
formula In science, a formula is a concise way of expressing information symbolically, as in a mathematical formula or a ''chemical formula''. The informal use of the term ''formula'' in science refers to the general construct of a relationship betwee ...
s, URLs,
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the wo ...
, IDs and other text where characters cannot always be differentiated without
context Context may refer to: * Context (language use), the relevant constraints of the communicative situation that influence language use, language variation, and discourse summary Computing * Context (computing), the virtual environment required to su ...
. Fonts which distinguish glyphs by means of a
slashed zero The slashed zero is a representation of the Arabic digit " 0" (zero) with a slash through it. The slashed zero glyph is often used to distinguish the digit "zero" ("0") from the Latin script letter " O" anywhere that the distinction needs empha ...
, for example, are preferred for those uses.


Umlaut and diaresis

In the days of mechanical typewriters these were typed with the same key, which was also used for a double inverted comma. However the umlaut originated specifically as a pair of short vertical lines (not two dots) (see Sutterlin). Incidentally the two dots above the letter E in Albanian are described as a diaresis but do not fulfil the function of a diaresis.


0 and O; 1, l and I

Two common and important sets of homoglyphs in use today are the digit zero and the capital letter O (i.e. 0 and O); and the digit one, the lowercase letter L and the uppercase i (i.e. 1, l and I). In the early days of mechanical typewriters there was very little or no visual difference between these glyphs, and typists treated them interchangeably as keyboarding shortcuts. In fact, most keyboards did not even have a key for the digit "1", requiring users to type the letter "l" instead, and some also omitted 0. As these same typists transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them, and was an occasional source of confusion. Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and drawing the digit one with prominent
serifs In typography, a serif () is a small line or stroke regularly attached to the end of a larger stroke in a letter or symbol within a particular font or family of fonts. A typeface or "font family" making use of serifs is called a serif typeface ( ...
. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter " Ø" and the Greek letter Φ (
phi Phi (; uppercase Φ, lowercase φ or ϕ; grc, ϕεῖ ''pheî'' ; Modern Greek: ''fi'' ) is the 21st letter of the Greek alphabet. In Archaic and Classical Greek (c. 9th century BC to 4th century BC), it represented an aspirated voicele ...
). The redesigning of character types to differentiate these characters has meant less confusion. The degree to which two different characters appear the same to a given observer is called the "visual similarity".


Multi-letter homoglyphs

Some other combinations of letters look similar, for instance rn looks similar to m, cl looks similar to d, and vv looks similar to w. In certain narrow-spaced fonts (such as Tahoma), placing the letter c next to a letter such as j, l or i will create a homoglyph, such as cj cl ci (g d a). When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. A more precise way of saying this is that some
typographic ligatures Typography is the art and technique of typesetting, arranging type to make written language legibility, legible, readability, readable and beauty, appealing when displayed. The arrangement of type involves selecting typefaces, Point (typogra ...
can look similar to standalone glyphs. For example, the ligature (fi) can look similar to A in some typefaces or fonts. This potential for confusion is sometimes an argument made against the use of ligatures.


Unicode homoglyphs

The
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
contains many strongly homoglyphic characters, known as "confusables". These present security risks in a variety of situations (addressed in UTR#36) and have recently been called to particular attention in regard to
internationalized domain name An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in non-latin script or alphabet, such as Arabic, Bengali, Chinese (Mandarin, simplified ...
s. One might deliberately spoof a domain name by replacing one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in
phishing Phishing is a type of social engineering where an attacker sends a fraudulent (e.g., spoofed, fake, or otherwise deceptive) message designed to trick a person into revealing sensitive information to the attacker or to deploy malicious softwar ...
(''see main article
IDN homograph attack The internationalized domain name (IDN) homograph attack is a way a malicious party may deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike (i.e., they are ...
''). In many
fonts In metal typesetting, a font is a particular size, weight and style of a typeface. Each font is a matched set of type, with a piece (a " sort") for each glyph. A typeface consists of a range of such fonts that shared an overall design. In mod ...
the
Greek Greek may refer to: Greece Anything of, from, or related to Greece, a country in Southern Europe: *Greeks, an ethnic group. *Greek language, a branch of the Indo-European language family. **Proto-Greek language, the assumed last common ancestor ...
letter 'Α', the
Cyrillic , bg, кирилица , mk, кирилица , russian: кириллица , sr, ћирилица, uk, кирилиця , fam1 = Egyptian hieroglyphs , fam2 = Proto-Sinaitic , fam3 = Phoenician , fam4 = G ...
letter 'А' and the
Latin Latin (, or , ) is a classical language belonging to the Italic branch of the Indo-European languages. Latin was originally a dialect spoken in the lower Tiber area (then known as Latium) around present-day Rome, but through the power of the ...
letter 'A' are visually identical, as are the Latin letter 'a' and the Cyrillic letter 'а' (the same can be applied to the Latin letters "aBeHKopcTxy" and the Cyrillic letters ""). A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script such as 'í' (with an acute accent) and 'i', É (E-acute) and Ė (E dot above) and È (E-grave), Í (with an acute accent) and ĺ (Lowercase L with acute). When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of its potential to be taken as a 'homoglyph pair', or if the sequences clearly appear to be words, as 'pseudo-homographs' (noting again that these terms may themselves cause confusion in other contexts). In the
Chinese language Chinese (, especially when referring to written Chinese) is a group of languages spoken natively by the ethnic Han Chinese majority and many minority ethnic groups in Greater China. About 1.3 billion people (or approximately 16% of the wor ...
, many
simplified Chinese characters Simplified Chinese characters are standardized Chinese characters used in mainland China, Malaysia and Singapore, as prescribed by the ''Table of General Standard Chinese Characters''. Along with traditional Chinese characters, they are one o ...
are homoglyphs of the corresponding
traditional Chinese characters Traditional Chinese characters are one type of standard Chinese character sets of the contemporary written Chinese. The traditional characters had taken shapes since the clerical change and mostly remained in the same structure they took at ...
. Efforts by TLD registries and
Web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used on ...
designers are under way to minimize the risks of homoglyphic confusion. Commonly, this is achieved by prohibiting names which mix character sets from multiple languages ( toys-Я-us.org, using the Cyrillic letter Я, would be invalid, but wíkipedia.org and wikipedia.org still exist as different websites); Canada's
.ca .ca is the Internet country code top-level domain (ccTLD) for Canada. The domain name registry that operates it is the Canadian Internet Registration Authority (CIRA). Registrants can register domains at the second level (e.g., ''example.ca'') ...
registry goes one step further by requiring names which differ only in
diacritic A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacriti ...
s to have the same owner and same registrar. The handling of Chinese characters varies: in
.org The domain name .org is a generic top-level domain (gTLD) of the Domain Name System (DNS) used on the Internet. The name is truncated from ''organization''. It was one of the original domains established in 1985, and has been operated by th ...
and
.info The domain name info is a generic top-level domain (gTLD) in the Domain Name System (DNS) of the Internet. The name is derived from ''information'', although registration requirements do not prescribe any particular purpose. The TLD ''info'' wa ...
registration of one variant renders the other unavailable to anyone, while in
.biz .biz is a generic top-level domain (gTLD) in the Domain Name System of the Internet. It is intended for registration of domains to be used by businesses. The name is a phonetic spelling of the first syllable of ''business''. History The TLD ...
the traditional and simplified versions of the same name are delivered as a two-domain bundle which both point to the same
domain name server A name server refers to the server component of the Domain Name System (DNS), one of the two principal namespaces of the Internet. The most important function of DNS servers is the translation (resolution) of human-memorable domain names (example. ...
. Relevant documentation will be found both on the developers' Web sites, and on an IDN Forum provided by
ICANN The Internet Corporation for Assigned Names and Numbers (ICANN ) is an American multistakeholder group and nonprofit organization responsible for coordinating the maintenance and procedures of several databases related to the namespaces ...
.


Canonicalization

Homoglyphs of all kinds can be detected through a process called 'dual canonicalization'. The first step in this process is to identify homoglyph sets, namely characters appearing the same to a given observer. From here, a single token is specified to represent the homoglyph set. This token is called a canon. The next step is to convert each character in the text to the corresponding canon in a process called
canonicalization In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare diff ...
. If the canons of two runs of text are the same but the original text is different, then a homoglyph exists in the text.


See also

* *
Duplicate characters in Unicode Unicode has a certain amount of duplication of characters. These are pairs of single Unicode code points that are canonically equivalent. The reason for this are compatibility issues with legacy systems. Unless two characters are canonically equi ...
*
Serif In typography, a serif () is a small line or stroke regularly attached to the end of a larger stroke in a letter or symbol within a particular font or family of fonts. A typeface or "font family" making use of serifs is called a serif typeface ...
* *
Vehicle registration plates of Bosnia and Herzegovina Bosnia and Herzegovina vehicle registration plates have held their current form since 2 February 1998. Currently the Bosnia and Herzegovina (BiH) vehicle registration plate format consists of seven characters: five numbers and two letters arrang ...
use only numbers and letters that look the same in the Latin and Cyrillic alphabets. * , South Korean language game of intentionally substituting
Hangul The Korean alphabet, known as Hangul, . Hangul may also be written as following South Korea's standard Romanization. ( ) in South Korea and Chosŏn'gŭl in North Korea, is the modern official writing system for the Korean language. The let ...
characters for homoglyphs.


References


External links


https://www.unicode.org/Public/security/latest/confusables.txt
- recommended confusable mapping for IDN. {{Unicode navigation Typography Unicode