IDN homograph attack

The internationalized domain name (IDN) homograph attack is a way a malicious party may deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike (i.e., they are homographs, hence the term for the attack, although technically homoglyph is the more accurate term for different characters that look alike). For example, a regular user of example.com may be lured to click a link where the Latin character "a" is replaced with the Cyrillic character "а". This kind of spoofing attack is also known as script spoofing. Unicode incorporates numerous writing systems, and, for a number of reasons, similar-looking characters such as Greek Ο, Latin O, and Cyrillic О were not assigned the same code. Their incorrect or malicious usage is a possibility for security attacks ("Unicode Security Considerations", Technical Report #36, 2010-04-28).

The registration of homographic domain names is akin to typosquatting, in that both forms of attack use a name that looks similar to a more established domain to fool a user. The major difference is that in typosquatting the perpetrator attracts victims by relying on natural typographical errors commonly made when manually entering a URL, while in homograph spoofing the perpetrator deceives the victims by presenting visually indistinguishable hyperlinks. Indeed, it would be a rare accident for a web user to type, for example, a Cyrillic letter within an otherwise English word such as "citibаnk". There are cases in which a registration can be both typosquatting and homograph spoofing; the pairs l/I, i/j, and 0/O are both close together on keyboards and, depending on the typeface, may be difficult or impossible to distinguish.
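The "citibаnk" example can be made concrete with a minimal Python sketch. It assumes only the standard `unicodedata` module; the helper names `describe` and `scripts_used` are illustrative, and taking the first word of a character's Unicode name as its script is a rough heuristic, not a real script property lookup.

```python
import unicodedata

def describe(label: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every character in the label."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in label]

def scripts_used(label: str) -> set[str]:
    """Heuristic: treat the first word of each letter's Unicode name as its script."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}

genuine = "citibank"
spoofed = "citib\u0430nk"  # U+0430 CYRILLIC SMALL LETTER A replaces Latin "a"

print(genuine == spoofed)      # the strings render alike but compare unequal
print(describe(spoofed)[5])    # the sixth character is the Cyrillic one
print(scripts_used(spoofed))   # more than one script in a single label
```

A label that mixes scripts in this way is not necessarily malicious, so a check like this is best treated as a warning signal rather than a verdict.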


History

An early nuisance of this kind, pre-dating the Internet and even text terminals, was the confusion between "l" (lowercase letter "L") / "1" (the number "one") and "O" (capital letter for vowel "o") / "0" (the number "zero"). Some typewriters in the pre-computer era even combined the L and the one; users had to type a lowercase L when the number one was needed. The zero/o confusion gave rise to the tradition of crossing zeros, so that a computer operator would type them correctly. Unicode may add greatly to this kind of confusion with its combining characters, accents, several types of hyphen, and so on, often because of inadequate rendering support, especially with smaller font sizes and the wide variety of fonts.

Even earlier, handwriting provided rich opportunities for confusion. A notable example is the etymology of the word "zenith". The translation from the Arabic "samt" included the scribe's confusing of "m" into "ni". This was common in medieval blackletter, which did not connect the vertical columns on the letters i, m, n, or u, making them difficult to distinguish when several were in a row. The latter, as well as "rn"/"m"/"rri" ("RN"/"M"/"RRI") confusion, is still possible for a human eye even with modern advanced computer technology.

Intentional look-alike character substitution with different alphabets has also been known in various contexts. For example, Faux Cyrillic has been used as an amusement or attention-grabber, and "Volapuk encoding", in which Cyrillic script is represented by similar Latin characters, was used in the early days of the Internet as a way to overcome the lack of support for the Cyrillic alphabet. Another example is that vehicle registration plates can have both Cyrillic (for domestic usage in Cyrillic script countries) and Latin (for international driving) with the same letters. Registration plates that are issued in Greece are limited to using letters of the Greek alphabet that have homoglyphs in the Latin alphabet, as European Union regulations require the use of Latin letters.


Homographs in ASCII

ASCII has several characters or pairs of characters that look alike and are known as ''homographs'' (or ''homoglyphs''). Spoofing attacks based on these similarities are known as homograph spoofing attacks. Examples are 0 (the number) and O (the letter), and "l" (lowercase L) and "I" (uppercase "i"). In a typical example of a hypothetical attack, someone could register a domain name that appears almost identical to an existing domain but goes somewhere else. For example, the domain "rnicrosoft.com" begins with "r" and "n", not "m". Another example is ''G00GLE.COM'', which looks much like ''GOOGLE.COM'' in some fonts. Using a mix of uppercase and lowercase characters, ''googIe.com'' (capital ''i'', not small ''L'') looks much like ''google.com'' in some fonts. PayPal was a target of a phishing scam exploiting this, using the domain PayPaI.com. In certain narrow-spaced fonts such as Tahoma (the default in the address bar in Windows XP), placing a c in front of a j, l or i will produce homoglyphs such as cl cj ci (d g a).
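One way to catch the ASCII look-alikes described above is to fold each confusable sequence to a single representative and compare the results. The following Python sketch does this; the fold table `ASCII_FOLDS` and the function names are hypothetical and cover only the pairs mentioned in this section, not every font-dependent case.

```python
# Hypothetical fold table: each look-alike sequence maps to one representative.
ASCII_FOLDS = [
    ("rn", "m"),  # "rn" can read as "m"
    ("0", "o"),   # digit zero vs. letter o
    ("1", "l"),   # digit one vs. lowercase L
    ("I", "l"),   # uppercase i vs. lowercase L (applied before lowercasing)
]

def ascii_skeleton(domain: str) -> str:
    """Fold known ASCII look-alikes, then lowercase the result."""
    folded = domain
    for lookalike, representative in ASCII_FOLDS:
        folded = folded.replace(lookalike, representative)
    return folded.lower()

def looks_like(candidate: str, genuine: str) -> bool:
    """True if candidate is a different name that folds to the same skeleton."""
    if candidate.lower() == genuine.lower():
        return False  # same name; domain names are case-insensitive
    return ascii_skeleton(candidate) == ascii_skeleton(genuine)

for candidate, genuine in [
    ("rnicrosoft.com", "microsoft.com"),
    ("G00GLE.COM", "GOOGLE.COM"),
    ("googIe.com", "google.com"),
    ("PayPaI.com", "paypal.com"),
]:
    print(candidate, "imitates", genuine, "->", looks_like(candidate, genuine))
```

Because the comparison happens in skeleton space, a legitimate name containing "rn" is only flagged when its folded form collides with a protected name.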


Homographs in internationalized domain names

In multilingual computer systems, different logical characters may have identical appearances. For example, Unicode character U+0430, Cyrillic small letter a ("а"), can look identical to Unicode character U+0061, Latin small letter a ("a"), which is the lowercase "a" used in English. Hence wikipediа.org (xn--wikipedi-86g.org; the Cyrillic version) can pass for wikipedia.org (the Latin version). The problem arises from the different treatment of the characters in the user's mind and the computer's programming. From the viewpoint of the user, a Cyrillic "а" within a Latin string ''is'' a Latin "a"; there is no difference in the glyphs for these characters in most fonts. However, the computer treats them differently when processing the character string as an identifier. Thus, the user's assumption of a one-to-one correspondence between the visual appearance of a name and the named entity breaks down.
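The difference the computer sees becomes visible once the name is converted to its ASCII-compatible form. A minimal Python sketch, assuming the built-in `idna` codec (which implements the older IDNA 2003 rules, so results for some other names may differ from current registry practice); the function name `to_ascii` is illustrative:

```python
def to_ascii(domain: str) -> bytes:
    """Encode a domain name to its ASCII-compatible (Punycode) form."""
    return domain.encode("idna")

latin = "wikipedia.org"
spoofed = "wikipedi\u0430.org"  # final "a" replaced by U+0430 CYRILLIC SMALL LETTER A

print(to_ascii(latin))    # plain ASCII names pass through unchanged
print(to_ascii(spoofed))  # the Cyrillic label gains an "xn--" prefix

# Decoding the ASCII form quoted in the text gives back the Cyrillic spelling.
print(b"xn--wikipedi-86g.org".decode("idna") == spoofed)
```

Displaying the encoded form instead of the Unicode form is what makes the two names distinguishable to a reader.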
Internationalized domain names provide a backward-compatible way for domain names to use the full Unicode character set, and this standard is already widely supported. However, this system expanded the character repertoire from a few dozen characters in a single alphabet to many thousands of characters in many scripts, which greatly increased the scope for homograph attacks. This opens a rich vein of opportunities for phishing and other varieties of fraud. An attacker could register a domain name that ''looks'' just like that of a legitimate website, but in which some of the letters have been replaced by homographs in another alphabet. The attacker could then send e-mail messages purporting to come from the original site, but directing people to the bogus site. The spoof site could then record information such as passwords or account details, while passing traffic through to the real site. The victims may never notice the difference until suspicious or criminal activity occurs with their accounts.

In December 2001 Evgeniy Gabrilovich and Alex Gontmakher, both from Technion, Israel, published a paper titled "The Homograph Attack" (Communications of the ACM, 45(2):128, February 2002), which described an attack that used Unicode URLs to spoof a website URL. To prove the feasibility of this kind of attack, the researchers successfully registered a variant of the domain name ''microsoft.com'' which incorporated Cyrillic characters.

Problems of this kind were anticipated before IDN was introduced, and guidelines were issued to registries to try to avoid or reduce the problem. For example, it was advised that registries accept only characters from the Latin alphabet and that of their own country, not all Unicode characters, but this advice was neglected by major TLDs.

On February 7, 2005, Slashdot reported that this exploit was disclosed by 3ric Johanson at the hacker conference Shmoocon. Web browsers supporting IDNA appeared to direct the URL http://www.pаypal.com/, in which the first ''a'' character is replaced by a Cyrillic ''а'', to the site of the well-known payment site PayPal, but actually led to a spoofed web site with different content. Popular browsers continued to have problems properly displaying international domain names through April 2017.

The following alphabets have characters that can be used for spoofing attacks (these are only the most obvious and common, given artistic license and how much risk the spoofer will take of getting caught; the possibilities are far more numerous than can be listed here):


Cyrillic

Cyrillic is, by far, the most commonly used alphabet for homoglyphs, largely because it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts. The Cyrillic letters а, с, е, о, р, х and у have optical counterparts in the basic Latin alphabet and look close or identical to a, c, e, o, p, x and y. Cyrillic З, Ч and б resemble the numerals 3, 4 and 6. Italic type generates more homoglyphs: ''дтпи'' (д т п и in standard type), resembling d m n u (in some fonts д can be used, since its italic form resembles a lowercase g; however, in most mainstream fonts, д instead resembles a partial differential sign, ∂). If capital letters are counted, А В С Е Н І Ј К М О Р Ѕ Т Х can substitute A B C E H I J K M O P S T X, in addition to the capitals for the lowercase Cyrillic homoglyphs. Problematic non-Russian Cyrillic letters are і and i, ј and j, ԛ and q, ѕ and s, ԝ and w, Ү and Y, while Ғ and F, Ԍ and G bear some resemblance to each other. Cyrillic ӓ ё ї ӧ can also be used if an IDN itself is being spoofed, to fake ä ë ï ö. While Komi De (ԁ), shha (һ), palochka (Ӏ) and izhitsa (ѵ) bear strong resemblance to Latin d, h, l and v, these letters are either rare or archaic and are not widely supported in most standard fonts (they are not included in the WGL-4). Attempting to use them could cause a ransom note effect.
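As an illustration of how the lowercase pairs listed above could be checked, the following Python sketch folds the seven Cyrillic letters а, с, е, о, р, х and у to their Latin look-alikes and compares the result with a protected name. The table `CYRILLIC_TO_LATIN` and the function names are hypothetical, and the table deliberately covers only those seven letters rather than the capitals or the non-Russian letters.

```python
# Hypothetical fold table for the seven lowercase pairs named in the text.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
})

def fold_cyrillic(label: str) -> str:
    """Replace the listed Cyrillic letters with their Latin look-alikes."""
    return label.translate(CYRILLIC_TO_LATIN)

def imitates(label: str, target: str) -> bool:
    """True if label differs from target but folds to the same Latin string."""
    return label != target and fold_cyrillic(label) == target

partly_cyrillic = "p\u0430ypal"                     # only the first "a" is Cyrillic
mostly_cyrillic = "\u0440\u0430\u0443\u0440\u0430l"  # every letter except "l" is Cyrillic

print(imitates(partly_cyrillic, "paypal"))
print(imitates(mostly_cyrillic, "paypal"))
print(imitates("paypal", "paypal"))
```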


Greek

From the Greek alphabet, only omicron ο and sometimes nu ν appear identical to a Latin alphabet letter in the lowercase used for URLs. Fonts that are in italic type will feature Greek alpha ''α'' looking like a Latin ''a''. This list increases if close matches are also allowed (such as Greek εικηρτυωχγ for eiknptuwxy). Using capital letters, the list expands greatly. Greek ΑΒΕΗΙΚΜΝΟΡΤΧΥΖ looks identical to Latin ABEHIKMNOPTXYZ. Greek ΑΓΒΕΗΚΜΟΠΡΤΦΧ looks similar to Cyrillic АГВЕНКМОПРТФХ (as do Cyrillic Л and Greek Λ in certain geometric sans-serif fonts), and Greek letters κ and ο look similar to Cyrillic к and о. In addition, Greek τ and φ can resemble Cyrillic т and ф in some fonts, Greek δ resembles Cyrillic б in the Serbian alphabet, and the Cyrillic ''а'' also italicizes the same as its Latin counterpart, making it possible to substitute it for alpha or vice versa. The lunate form of sigma, Ϲϲ, resembles both Latin Cc and Cyrillic Сс. If an IDN itself is being spoofed, Greek beta β can be a substitute for German eszett ß in some fonts (and in fact, code page 437 treats them as equivalent), as can Greek end-of-word-variant sigma ς for ç; accented Greek substitutes ''όίά'' can usually be used for ''óíá'' in many fonts, with the last of these (alpha) again only resembling ''a'' in italic type.


Armenian

The Armenian alphabet can also contribute critical characters: several Armenian characters such as օ, ո, ս, as well as capital Տ and Լ, are often completely identical to Latin characters in modern fonts, and others are similar enough to pass, such as ցհոօզս, which look like ghnoqu, յ, which resembles j (albeit dotless), and ք, which can resemble either p or f depending on the font; ա can resemble Cyrillic ш. However, the use of Armenian is less reliable for an attacker: not all standard fonts feature Armenian glyphs (whereas most do include Greek and Cyrillic), and Windows prior to Windows 7 rendered Armenian in a distinct font, Sylfaen, so that Armenian mixed with Latin would appear obviously different if using a font other than Sylfaen or a Unicode typeface. (This is known as a ransom note effect.) The current version of Tahoma, used in Windows 7, supports Armenian (previous versions did not). Furthermore, this font differentiates Latin g from Armenian ց. Two letters in Armenian (Ձշ) also can resemble the number 2, Յ resembles 3, while another (վ) sometimes resembles the number 4.


Hebrew

Hebrew spoofing is generally rare. Only three letters from that alphabet can reliably be used: samekh (ס), which sometimes resembles o, vav with diacritic (וֹ), which resembles an i, and heth (ח), which resembles the letter n. Less accurate approximants for some other alphanumerics can also be found, but these are usually only accurate enough to use for the purposes of foreign branding and not for substitution. Furthermore, the Hebrew alphabet is written from right to left, and trying to mix it with left-to-right glyphs may cause problems.


Thai

Though the Thai script has historically had a distinct look with numerous loops and small flourishes, modern Thai typography, beginning with Manoptica in 1973 and continuing through IBM Plex in the modern era, has increasingly adopted a simplified style in which Thai characters are represented with glyphs strongly resembling Latin letters. ค (A), ท (n), น (u), บ (U), ป (J), พ (W), ร (S), and ล (a) are among the Thai glyphs that can closely resemble Latin.


Chinese

The Chinese language can be problematic for homographs as many characters exist as both traditional (regular script) and simplified Chinese characters. In the .org domain, registering one variant renders the other unavailable to anyone; in .biz a single Chinese-language IDN registration delivers both variants as active domains (which must have the same domain name server and the same registrant). .hk (.香港) also adopts this policy.


Other scripts

Other Unicode scripts in which homographs can be found include Number Forms (Roman numerals), CJK Compatibility and Enclosed CJK Letters and Months (certain abbreviations), Latin (certain digraphs), Currency Symbols, Mathematical Alphanumeric Symbols, and Alphabetic Presentation Forms (typographic ligatures).
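Many of the characters in these blocks are compatibility characters, so Unicode compatibility normalization (NFKC) maps them back to ordinary letters. The following Python sketch, assuming only the standard `unicodedata` module and illustrative function names, shows the effect for a Roman numeral, a ligature and a mathematical letter; it does not cover look-alikes that have no compatibility mapping.

```python
import unicodedata

def compat_fold(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility characters."""
    return unicodedata.normalize("NFKC", text)

def changed_by_nfkc(text: str) -> bool:
    """True if the text contains characters that NFKC would rewrite."""
    return compat_fold(text) != text

samples = [
    "\u2173",         # SMALL ROMAN NUMERAL FOUR (Number Forms)
    "o\ufb03ce",      # LATIN SMALL LIGATURE FFI (Alphabetic Presentation Forms)
    "\U0001d41a",     # MATHEMATICAL BOLD SMALL A (Mathematical Alphanumeric Symbols)
]

for sample in samples:
    print(repr(sample), "->", repr(compat_fold(sample)))
```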


Accented characters

Two names which differ only in an accent on one character may look very similar, particularly when the substitution involves the dotted letter i; the tittle (dot) on the i can be replaced with a diacritic (such as a grave accent or acute accent; both ì and í are included in most standard character sets and fonts) that can only be detected with close inspection. In most top-level domain registries, wíkipedia.tld (xn--wkipedia-c2a.tld) and wikipedia.tld are two different names which may be held by different registrants. One exception is .ca, where reserving the plain-ASCII version of the domain prevents another registrant from claiming an accented version of the same name.
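A rule of the kind described for .ca can be approximated by comparing names after their accents are removed. The Python sketch below is only an illustration of that idea using the standard `unicodedata` module; the function names are hypothetical and it is not a description of how any registry actually implements its policy.

```python
import unicodedata

def strip_marks(label: str) -> str:
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def same_bundle(name_a: str, name_b: str) -> bool:
    """True if two names differ at most by accents (hypothetical registry rule)."""
    return strip_marks(name_a) == strip_marks(name_b)

accented = "w\u00edkipedia"  # "i" with acute accent in place of the first "i"

print(strip_marks(accented))
print(same_bundle(accented, "wikipedia"))
print(same_bundle("wikimedia", "wikipedia"))
```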


Non-displayable characters

Unicode includes many characters which are not displayed by default, such as the zero-width space. In general, ICANN prohibits any domain with these characters from being registered, regardless of TLD.
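Characters of this kind can be found by looking at their Unicode general category. The following Python sketch flags characters in the "Cf" (format) category, which includes the zero-width space; the function name is illustrative, and the check is an assumption about what a validator might look for rather than ICANN's actual rule set.

```python
import unicodedata

def invisible_characters(label: str) -> list[str]:
    """List code points in the label whose general category is Cf (format)."""
    return [f"U+{ord(ch):04X}" for ch in label if unicodedata.category(ch) == "Cf"]

hidden = "pay\u200bpal"  # U+200B ZERO WIDTH SPACE between "pay" and "pal"

print(len("paypal"), len(hidden))      # the strings differ in length
print(invisible_characters(hidden))    # the hidden character is reported
print(invisible_characters("paypal"))  # nothing to report for the plain name
```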


Known homograph attacks

In 2011, an unknown source (registering under the name "Completely Anonymous") registered a domain name homographic to that of television station KBOI-TV to create a fake news website. The sole purpose of the site was to spread an April Fools' Day joke regarding the Governor of Idaho issuing a supposed ban on the sale of music by Justin Bieber. In September 2017, security researcher Ankit Anubhav discovered an IDN homograph attack where the attackers registered adoḅe.com to deliver the Betabot trojan.


Defending against the attack


Client-side mitigation

The simplest defense is for web browsers not to support IDNA or other similar mechanisms, or for users to turn off whatever support their browsers have. That could mean blocking access to IDNA sites, but generally browsers permit access and just display IDNs in Punycode. Either way, this amounts to abandoning non-ASCII domain names.

* Google Chrome versions 51 and later use an algorithm similar to the one used by Firefox. Previous versions display an IDN only if all of its characters belong to one (and only one) of the user's preferred languages. Chromium and Chromium-based browsers such as Microsoft Edge (since 2019) and Opera also use the same algorithm.
* Safari's approach is to render problematic character sets as Punycode. This can be changed by altering the settings in Mac OS X's system files.
* Mozilla Firefox versions 22 and later display IDNs if either the TLD prevents homograph attacks by restricting which characters can be used in domain names, or labels do not mix scripts for different languages. Otherwise IDNs are displayed in Punycode.
* Internet Explorer versions 7 and later allow IDNs except for labels that mix scripts for different languages. Labels that mix scripts are displayed in Punycode. There are exceptions for locales where ASCII characters are commonly mixed with localized scripts. Internet Explorer 7 can use IDNs, but it restricts the display of non-ASCII domain names based on a user-defined list of allowed languages and provides an anti-phishing filter that checks suspicious Web sites against a remote database of known phishing sites.
* The old (pre-Chromium) Microsoft Edge converts all Unicode into Punycode.

As an additional defense, Internet Explorer 7, Firefox 2.0 and above, and Opera 9.10 include phishing filters that attempt to alert users when they visit malicious websites.

As of April 2017, several browsers (including Chrome, Firefox and Opera) were displaying IDNs consisting purely of Cyrillic characters normally (not as Punycode), allowing spoofing attacks. Chrome tightened IDN restrictions in version 59 to prevent this attack.

Browser extensions like No Homo-Graphs are available for Google Chrome and Firefox that check whether the user is visiting a website which is a homograph of another domain from a user-defined list.

These defenses apply only within the browser. Homographic URLs that host malicious software can still be distributed through e-mail, social networking or other Web sites, where they are not displayed as Punycode and go undetected until the user actually clicks the link. While the fake link will show in Punycode when it is clicked, by this point the page has already begun loading into the browser and the malicious software may have already been downloaded onto the computer.
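
The "display mixed-script labels as Punycode" rule described for several browsers can be approximated as follows. This is a simplified sketch, not any browser's actual algorithm: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) is used as a rough substitute, and the function names are hypothetical.

```python
import unicodedata


def label_scripts(label: str) -> set:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Names look like "LATIN SMALL LETTER A" or
            # "CYRILLIC SMALL LETTER A"; keep only the first word.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(hostname: str) -> str:
    """Return the Unicode form unless some label mixes scripts."""
    for label in hostname.split("."):
        if len(label_scripts(label)) > 1:
            # Fall back to the ASCII-compatible (Punycode) form.
            return hostname.encode("idna").decode("ascii")
    return hostname


if __name__ == "__main__":
    print(display_form("example.com"))        # all Latin: shown as-is
    print(display_form("ex\u0430mple.com"))   # Cyrillic "а" mixed in: Punycode
```

A check of this kind does nothing against a label written entirely in one non-Latin script, which is the gap the April 2017 all-Cyrillic spoofing relied on.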


Server-side/registry operator mitigation

The IDN homographs database is a Python library that allows developers to defend against homograph attacks using machine learning-based character recognition.

ICANN has implemented a policy prohibiting any potential internationalized TLD from choosing letters that could resemble an existing Latin TLD and thus be used for homograph attacks. Proposed IDN TLDs .бг (Bulgaria), .укр (Ukraine) and .ελ (Greece) have been rejected or stalled because of their perceived resemblance to Latin letters. All three (and Serbian .срб and Mongolian .мон) have later been accepted. Three-letter TLDs are considered safer than two-letter TLDs, since they are harder to match to normal Latin ISO-3166 country domains. Although the potential to match new generic domains remains, such generic domains are far more expensive than registering a second- or third-level domain address, making it cost-prohibitive to try to register a homoglyphic TLD for the sole purpose of making fraudulent domains (which itself would draw ICANN scrutiny).

The Russian registry operator Coordination Center for TLD RU only accepts Cyrillic names for the top-level domain .рф, forbidding a mix with Latin or Greek characters. However, the problem in .com and other gTLDs remains open ("Emoji to Zero-Day: Latin Homoglyphs in Domains and Subdomains").
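
A registry-side single-script rule of the kind described for .рф could be sketched as below. The exact character repertoire the registry accepts is not given here, so this example assumes Cyrillic letters, ASCII digits and hyphens; the function name `is_cyrillic_only` is hypothetical, and the Unicode character name prefix is again used as an approximation of the Script property.

```python
import unicodedata


def is_cyrillic_only(label: str) -> bool:
    """Accept a label only if every letter in it is Cyrillic."""
    if not label:
        return False
    for ch in label:
        if ch in "0123456789-":
            # Assumed to be allowed alongside Cyrillic letters.
            continue
        if not unicodedata.name(ch, "").startswith("CYRILLIC "):
            # Latin, Greek or any other script is rejected.
            return False
    return True


if __name__ == "__main__":
    print(is_cyrillic_only("\u043f\u0440\u0438\u043c\u0435\u0440"))  # "пример"
    print(is_cyrillic_only("\u043f\u0440\u0438\u043c\u0065\u0440"))  # Latin "e" inside
```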


See also

* Security issues in Unicode
* Internationalized domain name
* Homoglyph
* Duplicate characters in Unicode
* Unicode equivalence
* Typosquatting


External links


Homograph attack generator

Phishing with Unicode Domains