
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, ''München'' (the German name for Munich) is encoded as ''Mnchen-3ya''.

While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name labels, the DNS standards recommend the use of the LDH subset of ASCII conventionally used for host names, and require that string comparisons between DNS domain names be case-insensitive. The Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for Comments 3492 (RFC 3492, ''Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)'', A. Costello, The Internet Society, March 2003).


Encoding procedure

As stated in RFC 3492, "Punycode is an instance of a more general algorithm called ''Bootstring'', which allows strings composed from a small set of 'basic' code points to uniquely represent any string of code points drawn from a larger set." Punycode defines parameters for the general Bootstring algorithm to match the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding, using the example of the string "bücher" (''Bücher'' is German for ''books''), which is translated into the label "bcher-kva".


Separation of ASCII characters

First, all ASCII characters in the string are copied from input to output, skipping over any other characters. For example, "bücher" is copied to "bcher". If any characters were copied, i.e. if there was at least one ASCII character in the input, an ASCII hyphen is appended to the output (e.g., "bücher" → "bcher-", but "ü" → ""). Note that hyphens are themselves ASCII characters. Thus, they can be present in the input and, if so, they will be copied to the output. This causes no ambiguity: if the output contains hyphens, the one that got added is always the last one. It marks the end of the ASCII characters.
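The copying step can be sketched in a few lines of Python. The function name `copy_basic` is hypothetical and chosen only for illustration; the behavior follows the description above.

```python
def copy_basic(text: str) -> str:
    """Copy the ASCII characters of `text` and append the delimiter hyphen.

    Illustrative helper only: the hyphen is added only when at least
    one ASCII character was copied.
    """
    basic = "".join(ch for ch in text if ord(ch) < 128)
    return basic + "-" if basic else basic


print(copy_basic("bücher"))  # bcher-
print(repr(copy_basic("ü")))  # ''
print(copy_basic("a-bü"))    # a-b-  (the added hyphen is the last one)
```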


Encoding the non-ASCII characters

For each non-ASCII character in the input, taken in order of increasing code point, the encoder works from two numbers:
* ''i'' = the 0-indexed position at which the character is inserted into the string built so far (i.e. "0" means that the character becomes the string's first character).
* ''n'' = the numeric code point, in Unicode, of the non-ASCII character, minus 128 (the first code point after the end of ASCII at 127).
From these the encoder calculates how many insertion possibilities the decoder must skip before making the insertion. For the first non-ASCII character this is ''n'' × (''L'' + 1) + ''i'', where ''L'' is the number of ASCII characters copied to the output; for "bücher" it is 124 × 6 + 1 = 745. For each later character, only the distance from the decoder's state after the previous insertion is counted. The encoder then encodes the resulting number into a sequence of base-36 digits, renders those in ASCII, and appends the result to the output string. The ASCII rendering is: 0 → 'a', ..., 25 → 'z', 26 → '0', ..., 35 → '9', with the number's digits arranged in little-endian order.

The encoding is more complex than plain base 36: it outputs variable-length integers. These have the property that each number's most significant digit (e.g. the digit "1" in the number "123"), which in little-endian order is written last, is recognizable without context. Thus, the digits from multiple numbers can be concatenated, with nothing separating them, yet the original numbers can still be recognized and extracted.
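The whole encoding step can be sketched in Python by following the encoder pseudocode of RFC 3492. This is a minimal sketch, not a reference implementation: the parameter values (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128) are the Punycode parameters from the RFC rather than from the text above, the function names are illustrative, and overflow checks are omitted because Python integers do not overflow.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta, num_points, first_time):
    """Bias adaptation as given in RFC 3492."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def threshold(k, bias):
    """Threshold for the digit at position k, clamped to [TMIN, TMAX]."""
    return min(max(k - bias, TMIN), TMAX)


def punycode_encode(text):
    cps = [ord(c) for c in text]
    basic = [c for c in cps if c < 0x80]
    out = [chr(c) for c in basic]
    if basic:
        out.append("-")  # delimiter marks the end of the ASCII part
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    h = b = len(basic)  # h = code points handled so far
    while h < len(cps):
        m = min(c for c in cps if c >= n)  # next code point to insert
        delta += (m - n) * (h + 1)         # skip whole rounds of positions
        n = m
        for c in cps:
            if c < n:
                delta += 1                 # skip one insertion position
            elif c == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)


print(punycode_encode("bücher"))   # bcher-kva
print(punycode_encode("München"))  # Mnchen-3ya
```

For "bücher" the only number emitted is 745, which the inner loop renders as the digits 10, 21, 0, i.e. "kva".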


ACE prefix for internationalized domain names

To prevent hyphens in non-international domain names from triggering a Punycode decoding, the string xn-- is prepended to Punycode sequences in internationalized domain names. This is called ACE (ASCII Compatible Encoding). Thus the domain name "bücher.tld" would be represented in ASCII as "xn--bcher-kva.tld".
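As an illustration, Python's standard library includes an "idna" codec that applies the ACE prefix to each label of a domain name. The snippet below is only a usage sketch; "tld" is the placeholder top-level domain from the example above, not a real one.

```python
# Encode a full domain name: each non-ASCII label gets the "xn--" prefix.
ace = "bücher.tld".encode("idna")
print(ace)  # b'xn--bcher-kva.tld'

# Decoding reverses the transformation.
print(ace.decode("idna"))  # bücher.tld

# A purely ASCII name is passed through without a prefix.
plain = "example.tld".encode("idna")
print(plain)  # b'example.tld'
```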


The decoder

The decoder is a finite-state machine with two state variables ''i'' and ''n''. ''i'' is an index into the string, ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end). ''i'' starts at zero. ''n'' starts at 128 (the first non-ASCII code point). The state progression is a monotonic function. A state transition either increments ''i'' or, if ''i'' is at its maximum, resets ''i'' to zero and increments ''n''. At the next state transition, we resume incrementing ''i''. At each state, the code point denoted by ''n'' either gets inserted or not. The numbers generated by the encoder represent how many possibilities to skip before an insertion is made.

There are six possible places to insert a character in the string "bcher" (including before the first character and after the last one). There are 124 code points between the last ASCII code point (127 = 0x7F, the end of ASCII) and "ü" (code point 252 = 0xFC, see Unicode's Latin-1 Supplement). There is one insertion position for the "ü" that must be skipped (position zero: before the 'b'). Thus, the decoder will skip a total of (6 × 124) + 1 = 745 possible insertions before reaching the required one. Once the character is inserted, there are now seven possible places to insert another character.
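The count of 745 can be checked by stepping through the states one at a time. The function below is a hypothetical illustration of the state progression described above, not part of any library; it deliberately walks every state instead of using the closed form.

```python
def count_skipped_states(length, target_code_point, target_index):
    """Count state transitions from (n=128, i=0) to the target state.

    `length` is the current length of the string, so i ranges over
    0..length (length + 1 possible insertion positions).
    """
    n, i, skipped = 128, 0, 0
    while (n, i) != (target_code_point, target_index):
        if i == length:
            # i is at its maximum: reset it and move to the next code point.
            i, n = 0, n + 1
        else:
            i += 1
        skipped += 1
    return skipped


# Insert "ü" (code point 252) at index 1 of "bcher".
print(count_skipped_states(len("bcher"), ord("ü"), 1))  # 745
print(6 * 124 + 1)                                       # 745
```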


Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits vary. In this case a number system with 36 symbols is used, with the case-insensitive 'a' through 'z' equal to the decimal numbers 0 through 25, and '0' through '9' equal to the decimal numbers 26 through 35. Thus "kva" corresponds to the decimal number string "10 21 0".

To decode this string of symbols, a sequence of thresholds is needed; in this case it is (1, 1, 26, 26, ...). The weight (or place value) of the least significant digit is always 1: 'k' (=10) with a weight of 1 equals 10. After this, the weight of the next digit depends on the first threshold: generally, for any ''n'', the weight of the (''n''+1)-th digit is the weight of the previous one times (36 − threshold of the ''n''-th digit). So the second symbol has a place value of 36 minus the previous threshold value, in this case 35. Therefore, the sum of the first two symbols 'k' (=10) and 'v' (=21) is 10 × 1 + 21 × 35. Since the second symbol is not less than its threshold value of 1, there is more to come. The third symbol, 'a' (=0), is less than its threshold of 26 and therefore ends the number; since its value is zero, we may skip calculating its weight. Therefore, "kva" represents the decimal number (10 × 1) + (21 × 35) = 745.

The thresholds themselves are determined for each successive encoded character by an algorithm keeping them between 1 and 26 inclusive. Because the digits are case-insensitive, the case of the letters can then be used to provide information about the original case of the string.

Because special characters are sorted by their code points by the encoding algorithm, for the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" come codes representing insertion of ý, the Unicode character following ü, starting with "ýbücher" with code "bcher-kvaf" (different from "übücher", coded "bcher-jvab"), etc.

To make the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; however, these should be checked for and detected during decoding. Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string.

Note that for DNS use, the domain name string is assumed to have been normalized using nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.
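The weighted sum for "kva" can be reproduced with a short Python sketch. It assumes the threshold rule of RFC 3492 (the digit position ''k'' minus a bias, clamped to the range 1 to 26) and the initial bias of 72 that applies to the first number in a string; both come from the RFC rather than from the text above, and the function names are illustrative.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"
BASE, TMIN, TMAX = 36, 1, 26


def threshold(k, bias):
    """Threshold for the digit at position k (36, 72, 108, ...)."""
    return min(max(k - bias, TMIN), TMAX)


def decode_number(symbols, bias=72):
    """Decode one variable-length integer from the start of `symbols`.

    Returns (value, number_of_symbols_consumed).
    """
    value, weight, k = 0, 1, BASE
    for consumed, ch in enumerate(symbols, start=1):
        digit = DIGITS.index(ch.lower())  # digits are case-insensitive
        value += digit * weight
        t = threshold(k, bias)
        if digit < t:
            return value, consumed        # most significant digit reached
        weight *= BASE - t                # weight of the next digit
        k += BASE
    raise ValueError("number is truncated")


print([threshold(k, 72) for k in (36, 72, 108, 144)])  # [1, 1, 26, 26]
print(decode_number("kva"))  # (745, 3)
print(decode_number("jva"))  # (744, 3)
```

With a bias of 72 the thresholds come out as (1, 1, 26, 26, ...), matching the sequence used in the worked example. Numbers after the first one in a string use an adapted bias, which this sketch leaves to the caller.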


Examples

The following table shows examples of Punycode encodings for different types of input. The Punycode in this table was created using the built-in "punycode" codec of the Python programming language, version 3.8 (s.encode("punycode")).
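The codec mentioned above can be exercised directly. The snippet below is a small usage sketch that encodes only strings already discussed in this article and then decodes one of them again; the variable names are illustrative.

```python
samples = ["bücher", "München", "ü"]

# str.encode("punycode") returns bytes; decode to ASCII text for display.
encoded = {s: s.encode("punycode").decode("ascii") for s in samples}
for original, puny in encoded.items():
    print(f"{original} -> {puny}")
# bücher -> bcher-kva
# München -> Mnchen-3ya
# ü -> tda

# The same codec decodes in the other direction.
decoded = b"bcher-kva".decode("punycode")
print(decoded)  # bücher
```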


See also

* Emoji domain
* UTF-5
* UTF-6
* Website spoofing




External links


* IETF Punycode standard
* ICU IDNA Demonstration: an online demonstration of how ICU performs IDN operations
* List of TLDs considered by the Mozilla developers to have an effective anti-spoofing policy for name registration
* IDN and Punycode in IE7
* Simple Punycode converter
* Online on-the-fly Punycode converter based on the Punycode.js JavaScript library
* Online modular converter offering Punycode and Bootstring