Unicode normalization

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility.
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible, but not canonically equivalent, to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones).
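The two notions of equivalence can be checked with a short sketch. It assumes Python and its standard `unicodedata` module, neither of which the article itself mentions; the helper names `canonically_equivalent` and `compatible` are hypothetical and exist only for this illustration.

```python
import unicodedata

# U+006E U+0303 (n + combining tilde) versus the precomposed U+00F1.
decomposed_n = "\u006E\u0303"
precomposed_n = "\u00F1"

# U+FB00 (ligature ff) versus two plain "f" letters.
ligature_ff = "\uFB00"
plain_ff = "\u0066\u0066"


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatible(a: str, b: str) -> bool:
    """Hypothetical helper: compare after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


print(decomposed_n == precomposed_n)                        # False: raw code points differ
print(canonically_equivalent(decomposed_n, precomposed_n))  # True
print(canonically_equivalent(ligature_ff, plain_ff))        # False
print(compatible(ligature_ff, plain_ff))                    # True
```

Comparing decomposed forms is only one possible design; comparing the composed forms (NFC, NFKC) would give the same answers, as long as both sides use the same form.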


Sources of equivalence


Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character "Å" can be encoded as U+00C5 (standard name "LATIN CAPITAL LETTER A WITH RING ABOVE", a letter of the alphabet in Swedish and several other languages) or as U+212B ("ANGSTROM SIGN"). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (like "V" for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
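The duplicated "Å" can be inspected directly in the character database. The following is a minimal sketch, assuming Python's standard `unicodedata` module (an assumption of this example, not something the article prescribes).

```python
import unicodedata

angstrom_sign = "\u212B"
a_with_ring = "\u00C5"

# The two code points carry different names...
print(unicodedata.name(angstrom_sign))  # ANGSTROM SIGN
print(unicodedata.name(a_with_ring))    # LATIN CAPITAL LETTER A WITH RING ABOVE

# ...but the angstrom sign has a one-character canonical decomposition
# that points at the letter, so the two are canonically equivalent.
print(unicodedata.decomposition(angstrom_sign))  # 00C5
print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)  # True
```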


Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the Dutch letter "IJ").

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ("◌゛", U+3099).

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; and character decomposition is the opposite process. In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
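Composition and decomposition can be observed with the two combining characters named above. This sketch assumes Python's standard `unicodedata` module; `code_points` is a hypothetical helper used only to print strings as U+XXXX lists.

```python
import unicodedata


def code_points(text: str) -> str:
    """Hypothetical helper: render a string as a list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Decomposition: the precomposed "ñ" splits into base letter + combining tilde.
decomposed = unicodedata.normalize("NFD", "\u00F1")
print(code_points(decomposed))  # U+006E U+0303

# Composition: the pair is merged back into the precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
print(code_points(recomposed))  # U+00F1

# The same applies to the dakuten: U+304B (hiragana "ka") followed by
# U+3099 composes to U+304C (hiragana "ga").
ga = unicodedata.normalize("NFC", "\u304B\u3099")
print(code_points(ga))  # U+304C
```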


Example


Typographical non-interaction

Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are in general canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
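As an illustration, a dot above (U+0307) and a dot below (U+0323) attach to opposite sides of a base letter and so do not interact. The choice of these two marks and of the base letter "q" is an assumption made for this sketch, which uses Python's standard `unicodedata` module.

```python
import unicodedata

above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

# The marks belong to different combining classes.
print(unicodedata.combining("\u0307"))  # 230
print(unicodedata.combining("\u0323"))  # 220

# The stored sequences differ, but both normalize to the same sequence,
# so they are canonically equivalent.
print(above_then_below == below_then_above)  # False
nfd_a = unicodedata.normalize("NFD", above_then_below)
nfd_b = unicodedata.normalize("NFD", below_then_above)
print(nfd_a == nfd_b)  # True
```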


Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the double-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
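The difference between the two kinds of equivalence shows up when such characters are normalized. The sketch below assumes Python's standard `unicodedata` module; the sample characters are an illustrative selection, not a list taken from the standard.

```python
import unicodedata

# (styled character, plain counterpart) pairs chosen for illustration.
samples = [
    ("\uFB00", "ff"),      # LATIN SMALL LIGATURE FF
    ("\uFF76", "\u30AB"),  # HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA
    ("\uFF21", "A"),       # FULLWIDTH LATIN CAPITAL LETTER A
    ("\u2082", "2"),       # SUBSCRIPT TWO
    ("\u2460", "1"),       # CIRCLED DIGIT ONE
]

for styled, plain in samples:
    nfc = unicodedata.normalize("NFC", styled)    # canonical: leaves the character alone
    nfkc = unicodedata.normalize("NFKC", styled)  # compatibility: maps to the plain form
    print(
        f"U+{ord(styled):04X}: "
        f"NFC unchanged={nfc == styled}, NFKC gives plain form={nfkc == plain}"
    )
```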


Encoding errors

UTF-8 and UTF-16 (and also some other Unicode encodings) do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
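Python's built-in decoder offers one concrete instance of such varying rules. The byte strings below are made-up inputs, and the two error handlers are only examples of a lossy and a lossless policy; other software may behave differently.

```python
# Two different invalid UTF-8 byte sequences (hypothetical inputs).
bad_a = b"caf\xff"  # 0xFF never occurs in valid UTF-8
bad_b = b"caf\xc3"  # lead byte of a two-byte sequence with no continuation

# Lossy rule: every invalid sequence becomes U+FFFD, so the inputs collapse.
lossy_a = bad_a.decode("utf-8", errors="replace")
lossy_b = bad_b.decode("utf-8", errors="replace")
print(lossy_a == lossy_b)  # True

# A different rule: "surrogateescape" maps each bad byte to its own lone
# surrogate, so the inputs stay distinct and can be encoded back.
kept_a = bad_a.decode("utf-8", errors="surrogateescape")
kept_b = bad_b.decode("utf-8", errors="surrogateescape")
print(kept_a == kept_b)  # False
print(kept_a.encode("utf-8", errors="surrogateescape") == bad_a)  # True
```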


Normalization

The implementation of Unicode string searches and comparisons in text processing software must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. The same holds when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation. In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.
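The effect on substring searches, and the compatibility tags themselves, can be seen in a short sketch. It assumes Python's standard `unicodedata` module; `found_in` and `compatibility_tag` are hypothetical helpers written for this illustration.

```python
import unicodedata
from typing import Optional

ffi_ligature = "\uFB03"
roman_nine = "\u2168"
superscript_five = "\u2075"


def found_in(needle: str, haystack: str, form: str) -> bool:
    """Hypothetical helper: substring search after normalizing the haystack."""
    return needle in unicodedata.normalize(form, haystack)


print(found_in("f", ffi_ligature, "NFC"))   # False
print(found_in("f", ffi_ligature, "NFKC"))  # True
print(found_in("I", roman_nine, "NFC"))     # False
print(found_in("I", roman_nine, "NFKC"))    # True
print(unicodedata.normalize("NFKC", superscript_five))  # 5


def compatibility_tag(ch: str) -> Optional[str]:
    """Hypothetical helper: return the formatting tag of a decomposition, if any."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0]
    return None


print(compatibility_tag(ffi_ligature))      # <compat>
print(compatibility_tag(superscript_five))  # <super>
print(compatibility_tag("\u00F1"))          # None (canonical decomposition, no tag)
```

A rich text converter could use the tag to decide between plain replacement (for `<compat>`) and replacement plus its own superscript markup (for `<super>`); that policy is a design choice, not something the tags enforce.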


Normal forms

The four Unicode normalization forms and the algorithms (transformations) for obtaining them are as follows:

NFD (Normalization Form Canonical Decomposition): characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC (Normalization Form Canonical Composition): characters are decomposed and then recomposed by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC (Normalization Form Compatibility Composition): characters are decomposed by compatibility, then recomposed by canonical equivalence.

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm. However, they are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus also not bijective (the original text cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

The normal forms are not closed under string concatenation. For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break composition.

A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables by having a non-empty compatibility field but lacking a compatibility tag.

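The three properties discussed in this section (idempotence, non-injectivity, and non-closure under concatenation) can each be checked in a few lines. This is a sketch assuming Python's standard `unicodedata` module; the wrappers `nfc` and `nfd` are hypothetical shorthand, and the Hangul jamo pair is an example chosen for this illustration.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


# Idempotence: normalizing an already normalized string changes nothing.
once = nfd("\u1EBF")
print(nfd(once) == once)  # True

# Not injective: two distinct inputs share one normalized result, so the
# original cannot be recovered from the output.
print(nfd("\u212B") == nfd("\u00C5") == "A\u030A")  # True
print(nfc("\u212B"))  # the letter U+00C5, not the angstrom sign

# Not closed under concatenation: each piece is already in NFC, but the
# joined string is not, because the leading and vowel jamo compose.
left, right = "\u1100", "\u1161"
print(nfc(left) == left and nfc(right) == right)  # True
print(nfc(left + right) == left + right)          # False
print(nfc(left + right) == "\uAC00")              # True
```

In practice this means a concatenation result generally needs to be normalized again rather than assumed to be normalized.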

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302. Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
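The reordering step described above can be sketched as follows. This is an illustrative implementation, not the reference algorithm from the standard: `canonical_reorder` is a hypothetical function, it relies on Python's `sorted` being stable, and it performs only the reordering (no decomposition or composition). The cedilla example is an assumption added for contrast.

```python
import unicodedata


def canonical_reorder(text: str) -> str:
    """Stable-sort every run of characters with non-zero combining class."""
    result = []
    run = []  # pending combining characters since the last class-0 character
    for ch in text:
        if unicodedata.combining(ch) == 0:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)


# Different classes: acute (230) is moved after cedilla (202).
mixed = "c\u0301\u0327"
print(canonical_reorder(mixed) == "c\u0327\u0301")                   # True
print(canonical_reorder(mixed) == unicodedata.normalize("NFD", mixed))  # True

# Same class (230 and 230): the stable sort keeps the original order,
# so acute-then-circumflex is not turned into circumflex-then-acute.
same_class = "e\u0301\u0302"
print(canonical_reorder(same_class) == same_class)  # True

# The examples from the text.
print(unicodedata.normalize("NFD", "\u1EBF") == "e\u0302\u0301")    # True
print(unicodedata.normalize("NFC", same_class) == "\u00E9\u0302")   # True
```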


Errors due to normalization differences

When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance,
OS X ...
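The general failure mode can be reproduced in miniature. The scenario below is entirely hypothetical (two imagined applications and a made-up file name) and is not a reconstruction of any specific incident; it assumes Python's standard `unicodedata` module.

```python
import unicodedata

# A made-up file name containing "e" + combining acute accent.
name = "Ame\u0301lie.txt"

# Hypothetical application A stores names in NFC.
stored = {unicodedata.normalize("NFC", name): b"file contents"}

# Hypothetical application B looks the name up in NFD.
lookup_key = unicodedata.normalize("NFD", name)
print(lookup_key in stored)  # False: same text to a reader, different code points

# Agreeing on one form on both sides avoids the mismatch.
print(unicodedata.normalize("NFC", lookup_key) in stored)  # True
```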