Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility.
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
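The two examples above can be checked with a short sketch. It assumes Python's standard `unicodedata` module; the helper name `canonically_equivalent` is made up for illustration and is not part of any standard API.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Treat two strings as canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

decomposed_n = "n\u0303"   # U+006E followed by U+0303 (combining tilde)
precomposed_n = "\u00f1"   # U+00F1, the single code point for "ñ"

hangul_block = "\ud55c"              # one precomposed Hangul syllable block
hangul_jamo = "\u1112\u1161\u11ab"   # leading, vowel and trailing conjoining jamo

print(decomposed_n == precomposed_n)                        # False: different code points
print(canonically_equivalent(decomposed_n, precomposed_n))  # True
print(canonically_equivalent(hangul_block, hangul_jamo))    # True
```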
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ﬀ") is defined to be compatible (but not canonically equivalent) to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
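A minimal sketch of the difference between the two notions, again assuming Python's `unicodedata` module. The helper names `compatible` and `canonical` are hypothetical.

```python
import unicodedata

def canonical(a: str, b: str) -> bool:
    """Canonical equivalence, tested by comparing NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatible(a: str, b: str) -> bool:
    """Compatibility equivalence, tested by comparing NFKD forms."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

ligature = "\ufb00"   # U+FB00, the "ff" ligature
two_f = "ff"          # U+0066 U+0066

print(compatible(ligature, two_f))        # True
print(canonical(ligature, two_f))         # False
# Canonically equivalent sequences are also compatible:
print(compatible("n\u0303", "\u00f1"))    # True
```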
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).
Sources of equivalence
Character duplication
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character "Å" can be encoded as U+00C5 (standard name "LATIN CAPITAL LETTER A WITH RING ABOVE", a letter of the alphabet in Swedish and several other languages) or as U+212B ("ANGSTROM SIGN"). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (like "V" for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
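The duplicated "Å" can be inspected with the following sketch, which assumes Python's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

letter = "\u00c5"   # the Swedish letter
sign = "\u212b"     # the angstrom sign

print(unicodedata.name(letter))   # LATIN CAPITAL LETTER A WITH RING ABOVE
print(unicodedata.name(sign))     # ANGSTROM SIGN

# The two code points differ, yet canonical normalization unifies them.
print(letter == sign)                                # False
print(unicodedata.normalize("NFC", sign) == letter)  # True
```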
Combining and precomposed characters
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ﬀ" or U+0132 for the Dutch letter "Ĳ").
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).
In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; and character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
Example
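As an illustration, the sketch below composes and decomposes the name "Amélie". It assumes Python's `unicodedata` module; the helper `code_points` is a made-up name used only to display the result.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Render each code point of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = unicodedata.normalize("NFC", "Ame\u0301lie")    # composition
decomposed = unicodedata.normalize("NFD", "Am\u00e9lie")   # decomposition

print(code_points(composed))     # "é" is the single code point U+00E9
print(code_points(decomposed))   # "é" is U+0065 followed by U+0301
print(len(composed), len(decomposed))   # 6 7
```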
Typographical non-interaction
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are in general canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
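A small sketch of two non-interacting marks stored in either order. It uses a dot above and a dot below on the letter "q", chosen as an assumed example because no precomposed form exists for that combination, and relies on Python's `unicodedata` module.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # combining dot above
DOT_BELOW = "\u0323"   # combining dot below

above_first = "q" + DOT_ABOVE + DOT_BELOW
below_first = "q" + DOT_BELOW + DOT_ABOVE

# The marks belong to different combining classes, so they do not interact.
print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(DOT_BELOW))  # 230 220

# Both storage orders normalize to the same sequence.
print(unicodedata.normalize("NFD", above_first)
      == unicodedata.normalize("NFD", below_first))   # True
```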
Typographic conventions
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
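The following sketch shows how such typographic variants behave under canonical and compatibility normalization. It assumes Python's `unicodedata` module; the sample characters are chosen for illustration.

```python
import unicodedata

samples = {
    "\uff21": "A",        # full-width Latin capital A
    "\uff76": "\u30ab",   # half-width katakana KA -> regular katakana KA
    "\u2460": "1",        # circled digit one
    "\u2082": "2",        # subscript two
}

for variant, plain in samples.items():
    # Canonical normalization keeps the variant; compatibility normalization folds it.
    assert unicodedata.normalize("NFC", variant) == variant
    assert unicodedata.normalize("NFKC", variant) == plain

print("all samples fold to their plain forms under NFKC only")
```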
Encoding errors
UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
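One lossy rule of this kind can be sketched with Python's built-in decoder, whose `errors="replace"` mode substitutes U+FFFD for bytes it cannot decode. The byte strings are invented inputs; other software may apply different rules.

```python
# 0xFF and 0xFE never occur in valid UTF-8, so both inputs are invalid.
bad_a = b"caf\xff"
bad_b = b"caf\xfe"

decoded_a = bad_a.decode("utf-8", errors="replace")
decoded_b = bad_b.decode("utf-8", errors="replace")

# Two different byte sequences collapse into the same text.
print(decoded_a == decoded_b)   # True
print(decoded_a.endswith("\ufffd"))   # True

# A strict decoder rejects the input instead of altering it.
try:
    bad_a.decode("utf-8")
    strict_accepted = True
except UnicodeDecodeError:
    strict_accepted = False
print(strict_accepted)   # False
```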
Normalization
Text processing software that implements Unicode string search and comparison must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
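A minimal sketch of a search that takes equivalence into account by normalizing both operands first. It assumes Python's `unicodedata` module; `contains_normalized` is a hypothetical helper, not a standard function, and a production search would also need to respect grapheme boundaries.

```python
import unicodedata

def contains_normalized(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring test performed after bringing both strings to the same normal form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

text = "El nin\u0303o"   # "ñ" stored as U+006E U+0303
query = "ni\u00f1o"      # "ñ" stored as U+00F1

print(query in text)                      # False: raw code points differ
print(contains_normalized(text, query))   # True
```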
Algorithms
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ﬃ), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. The same holds when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.
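The effect of the equivalence criterion on search results can be sketched as follows, assuming Python's `unicodedata` module. The helper `found` is a made-up name.

```python
import unicodedata

def found(needle: str, text: str, form: str) -> bool:
    """Search for needle in text after normalizing text to the given form."""
    return needle in unicodedata.normalize(form, text)

ffi_ligature = "\ufb03"   # U+FB03
roman_nine = "\u2168"     # U+2168
superscript_5 = "\u2075"  # U+2075

print(found("f", ffi_ligature, "NFC"))    # False
print(found("f", ffi_ligature, "NFKC"))   # True
print(found("I", roman_nine, "NFC"))      # False
print(found("I", roman_nine, "NFKC"))     # True
print(unicodedata.normalize("NFKC", superscript_5))   # 5
```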
Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation. In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.
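The compatibility formatting tags can be read from the character database, for example through Python's `unicodedata.decomposition`, which returns the raw decomposition field as a string. The helper `compatibility_tag` below is a hypothetical name for a small parser of that field.

```python
import unicodedata
from typing import Optional

def compatibility_tag(ch: str) -> Optional[str]:
    """Return the formatting tag of a compatibility mapping, or None if there is none."""
    field = unicodedata.decomposition(ch)
    if field.startswith("<"):
        return field[1:field.index(">")]
    return None

print(compatibility_tag("\ufb03"))   # compat  (ffi ligature)
print(compatibility_tag("\u2075"))   # super   (superscript five)
print(compatibility_tag("\u00f1"))   # None    (canonical mapping, no tag)
```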
Normal forms
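The four Unicode normalization forms and the algorithms (transformations) for obtaining them are:
* NFD (Normalization Form Canonical Decomposition): characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
* NFC (Normalization Form Canonical Composition): characters are decomposed and then recomposed by canonical equivalence.
* NFKD (Normalization Form Compatibility Decomposition): characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
* NFKC (Normalization Form Compatibility Composition): characters are decomposed by compatibility, then recomposed by canonical equivalence.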
All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
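Idempotence can be checked directly with a short sketch that assumes Python's `unicodedata` module. The sample string is arbitrary.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")
sample = "Am\u00e9lie \ufb03 \u2075"   # precomposed "é", ffi ligature, superscript five

for form in FORMS:
    once = unicodedata.normalize(form, sample)
    twice = unicodedata.normalize(form, once)
    # Applying the same normalization again leaves the text unchanged.
    assert once == twice

print(unicodedata.normalize("NFKC", sample))   # Amélie ffi 5
```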
The normal forms are not closed under string concatenation: joining two normalized strings does not always yield a normalized string. For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break composition.
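A sketch of how concatenation can leave a normal form, assuming Python's `unicodedata` module. The helper `is_nfc` is a made-up name that tests whether normalizing changes the string.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if text is already in NFC (normalizing does not change it)."""
    return unicodedata.normalize("NFC", text) == text

# A base letter and a string that starts with a combining mark.
left, right = "a", "\u0301"
print(is_nfc(left), is_nfc(right), is_nfc(left + right))   # True True False

# A leading jamo and a string that starts with a vowel jamo.
lead, vowel = "\u1100", "\u1161"
print(is_nfc(lead), is_nfc(vowel), is_nfc(lead + vowel))   # True True False
print(unicodedata.normalize("NFC", lead + vowel) == "\uac00")   # True
```

In both cases each piece is normalized on its own, yet the joined string has to be normalized again.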
The normalization transformations are also not injective (they map different original glyphs and sequences to the same normalized sequence) and thus not bijective (the original text cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "◌̊"), which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.
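The test described above can be sketched with Python's `unicodedata.decomposition`, which returns the character's decomposition field as a string (empty when there is none). The helper name `has_untagged_mapping` is hypothetical.

```python
import unicodedata

def has_untagged_mapping(ch: str) -> bool:
    """True if the decomposition field is non-empty and carries no <tag>."""
    field = unicodedata.decomposition(ch)
    return bool(field) and not field.startswith("<")

print(has_untagged_mapping("\u212b"))   # True: angstrom sign maps to U+00C5
print(has_untagged_mapping("\u2126"))   # True: ohm sign maps to U+03A9
print(has_untagged_mapping("\ufb00"))   # False: mapping is tagged <compat>
print(has_untagged_mapping("A"))        # False: no mapping at all

# Such characters do not survive normalization, which is why it is not injective.
print(unicodedata.normalize("NFC", "\u212b") == unicodedata.normalize("NFC", "\u00c5"))  # True
```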
Canonical ordering
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are ''not'' considered equivalent.
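The reordering step can be sketched as below. This is an illustrative implementation under the assumption that the input is already fully decomposed; it is not the reference algorithm from the standard. It relies on `unicodedata.combining` for the class values and on the fact that Python's `sorted` is stable. The function name `canonical_reorder` is made up.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of combining characters by combining class."""
    out: list[str] = []
    run: list[str] = []   # current run of characters with non-zero class
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Classes 230 then 220: the marks are swapped.
print(canonical_reorder("q\u0307\u0323") == "q\u0323\u0307")   # True
# Both marks have class 230: the stable sort keeps their order.
print(canonical_reorder("e\u0301\u0302") == "e\u0301\u0302")   # True
```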
For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
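The Vietnamese example can be reproduced with the following sketch, which assumes Python's `unicodedata` module.

```python
import unicodedata

precomposed = "\u1ebf"           # ế
circumflex_first = "e\u0302\u0301"
acute_first = "e\u0301\u0302"

# The canonical decomposition keeps the circumflex before the acute.
print(unicodedata.normalize("NFD", precomposed) == circumflex_first)   # True

# Both accents have class 230, so the other order is a different text.
print(unicodedata.normalize("NFD", acute_first) == acute_first)        # True
print(unicodedata.normalize("NFC", acute_first) == "\u00e9\u0302")     # True
print(unicodedata.normalize("NFC", acute_first) == precomposed)        # False
```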
Errors due to normalization differences
When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software. Samba did not recognize the altered filenames as equivalent to the original, leading to data loss. Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
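The general failure mode can be sketched with a plain dictionary standing in for a file index. This is a hypothetical illustration, not the actual OS X or Samba behavior; the file name, the `nfc` helper and the choice of NFC as the agreed form are all assumptions.

```python
import unicodedata

def nfc(name: str) -> str:
    """Bring a file name to one agreed normal form before using it as a key."""
    return unicodedata.normalize("NFC", name)

stored = {"caf\u00e9.txt": b"data"}   # name written with precomposed "é"
requested = "cafe\u0301.txt"          # same name, decomposed by another program

print(requested in stored)   # False: the lookup misses the existing file

# Normalizing on both sides makes the lookup succeed.
index = {nfc(name): content for name, content in stored.items()}
print(nfc(requested) in index)   # True
```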
See also
* Complex text layout
* Diacritic
* IDN homograph attack
* ISO/IEC 14651
* Ligature (typography)
* Precomposed character
* The uconv tool can convert to and from NFC and NFD Unicode normalization forms.
* Unicode
* Unicode compatibility characters
References
Unicode Standard Annex #15: Unicode Normalization Forms
External links
Charlint - a character normalization tool written in Perl