Unicode Compatibility Characters

In Unicode and the UCS, a compatibility character is a character that is encoded solely to maintain round-trip convertibility with other, often older, standards. As the Unicode Glossary says:
A character that would not have been encoded except for compatibility and round-trip convertibility with other standards
Although ''compatibility'' is used in character names, it is not marked as a character property. The definition is also more complicated than the glossary reveals. One of the properties the Unicode Consortium assigns to characters is the character's decomposition, or compatibility decomposition. Over five thousand characters have a compatibility decomposition that maps the compatibility character to one or more other UCS characters. By setting a character's decomposition property, Unicode establishes that character as a compatibility character. The reasons for these compatibility designations are varied and are discussed in further detail below. The term ''decomposition'' is sometimes confusing because a character's decomposition can, in some cases, be a singleton. In these cases the decomposition of one character is simply another approximately (but not canonically) equivalent character.
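As a minimal sketch of how the decomposition property separates the two kinds of mapping, Python's standard `unicodedata` module exposes the raw decomposition field. The helper name `classify` is hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def classify(ch: str) -> str:
    """Return 'compatibility', 'canonical' or 'none' for one character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a keyword in angle brackets;
    # canonical mappings are a bare list of code points.
    return "compatibility" if mapping.startswith("<") else "canonical"

print(classify("\ufb03"))  # LATIN SMALL LIGATURE FFI
print(classify("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(classify("\u212b"))  # ANGSTROM SIGN, a singleton decomposition
print(classify("a"))       # no decomposition at all
```

The Angstrom sign illustrates the singleton case: its decomposition is a single code point, yet the mapping carries no keyword and is therefore canonical.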


Compatibility character types and keywords

The compatibility decomposition property for the 5,402 Unicode compatibility characters includes a keyword that divides the compatibility characters into 17 logical groups. Characters with a decomposition but without a keyword are termed canonical decomposable characters, and those characters are not compatibility characters. Keywords for compatibility decomposable characters include: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, <font>, and <compat>. These keywords provide some indication of the relation between the compatibility character and its compatibility decomposition character sequence. Compatibility characters fall in three basic categories:
# Characters corresponding to multiple alternate glyph forms and precomposed diacritics, included to support software and font implementations that do not include complete Unicode text layout capabilities.
# Characters included from other character sets or otherwise added to the UCS that constitute rich text rather than the plain text that Unicode aims to encode.
# Some other characters that are semantically distinct, but visually similar.
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, users may be confused when they search a page for a capital Latin letter 'I' and their software fails to find the visually similar Roman numeral 'Ⅰ'.
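The keyword groups can be tallied directly from the character database. The following is a sketch using Python's `unicodedata` module; the function name `keyword_counts` is an assumption made for illustration, and the totals it prints vary with the Unicode version the interpreter ships, so they will not necessarily match the figures quoted in this article.

```python
import sys
import unicodedata
from collections import Counter

def keyword_counts() -> Counter:
    """Count decomposable characters by keyword ('canonical' if none)."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping:
            continue
        first = mapping.split()[0]
        key = first.strip("<>") if first.startswith("<") else "canonical"
        counts[key] += 1
    return counts

counts = keyword_counts()
for key, n in sorted(counts.items()):
    print(f"{key:10} {n}")

# The search problem from the text: compatibility folding lets a search
# for the Latin letter 'I' match ROMAN NUMERAL ONE (U+2160).
print(unicodedata.normalize("NFKC", "\u2160") == "I")
```

Grouping the keyword-less mappings under a single "canonical" label is a choice made here so that every decomposable character lands in exactly one group.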


Compatibility mapping types


Glyph substitution and composition

Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
; Ligatures: Ligatures such as 'ffi' in the Latin script were often encoded as a separate character in legacy character sets. Unicode's approach to ligatures is to treat them as rich text and, if turned on, handle them through glyph substitution.
; Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160). Precomposed characters are in the Number Forms block.
; Precomposed fractions: These decompositions have the keyword <fraction>. A fully conforming text handler should display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Most precomposed fractions are in the Number Forms block (¼ itself is in the Latin-1 Supplement block).
; Contextual glyphs or forms: These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as OpenType and TrueTypeGX, Unicode-conforming software can substitute the proper glyphs for the same character depending on whether that character appears at the beginning, end, or middle of a word or in isolation. Such glyph substitution is also necessary for vertical (top to bottom) text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software or software using other character sets instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.

The UCS, Unicode character properties and the Unicode algorithms provide software implementations with everything needed to properly display these characters from their decomposition equivalents. Therefore, these decomposable compatibility characters become redundant and unnecessary. Their existence in the character set requires extra text processing to ensure text is properly compared and collated (see Unicode normalization). Moreover, these compatibility characters provide no additional or distinct semantics. Nor do these characters provide any visually distinct rendering, provided the text layout and fonts are Unicode conforming. Also, none of these characters are required for round-trip convertibility to other character sets, since the transliteration can easily map decomposed characters to precomposed counterparts in another character set. Similarly, contextual forms, such as a final Arabic letter, can be mapped based on position within a word to the appropriate form character in the legacy character set.

In order to dispense with these compatibility characters, text software must conform to several Unicode protocols. The software must be able to:
# Compose diacritic-marked graphemes from letter characters and one or more separate combining diacritic marks.
# Substitute (at the author's or reader's discretion) ligatures and contextual glyph variants.
# Lay out CJKV text vertically (at the author's or reader's discretion), substituting glyphs for small, vertical, narrow, wide and square forms, either from font data or synthesized as needed.
# Combine fractions using the 'Fraction Slash' character (⁄ U+2044) and any other arbitrary characters.
# Combine a 'Combining Long Solidus Overlay' ( ̸ U+0338) with other symbols: for example, ∃ (U+2203) followed by U+0338 should render the same as the precomposed ∄ (U+2204).

Altogether, these compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. These include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical>, <square>. The total also includes nearly all of the canonical and most of the <compat> keyword compatibility characters (the exceptions include those <compat> keyword characters for enclosed alphanumerics, enclosed ideographs and those discussed in § Semantically distinct characters).
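The mappings described in this section can be inspected with Python's `unicodedata` module. This is only an illustrative sketch: the sample selection and the `samples` name are assumptions, and it shows the data-level mapping, not the glyph substitution a layout engine performs. Note that the database maps the precomposed Roman numeral to the Latin letters X, I, I, not to other Roman numeral characters.

```python
import unicodedata

samples = {
    "ligature ffi": "\ufb03",
    "Roman numeral twelve": "\u216b",
    "vulgar fraction one quarter": "\u00bc",
    "Arabic beh, final form": "\ufe90",
    "fullwidth Latin capital A": "\uff21",
}

for label, ch in samples.items():
    raw = unicodedata.decomposition(ch)          # keyword plus code points
    folded = unicodedata.normalize("NFKD", ch)   # the decomposed sequence
    codes = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"{label}: [{raw}] -> {codes}")

# The long solidus overlay case is a canonical (keyword-less) mapping,
# so canonical normalization converts in both directions.
there_exists_negated = "\u2203\u0338"
print(unicodedata.normalize("NFC", there_exists_negated) == "\u2204")
print(unicodedata.normalize("NFD", "\u2204") == there_exists_negated)
```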


Rich text compatibility characters

Many other compatibility characters constitute what Unicode considers rich text and therefore fall outside the goals of Unicode and the UCS. In some sense even the compatibility characters discussed in the previous section (those that aid legacy software in displaying ligatures and vertical text) constitute a form of rich text, since the rich text protocols determine whether text is displayed in one way or another. However, the choices to display text with or without ligatures, or vertically versus horizontally, are both non-semantic rich text. They are simply style differences. This is in contrast to other rich text such as italics, superscripts and subscripts, or list markers, where the styling implies certain semantics along with it.

For comparing, collating, handling and storing plain text, rich text variants are semantically redundant. For example, using a superscript character for the numeral 4 is likely indistinguishable from using the standard character for a numeral 4 and then using rich text protocols to make it superscript. Such alternate rich text characters therefore create ambiguity because they appear visually the same as their plain text counterpart characters with rich text formatting applied. These rich text compatibility characters include:
; Mathematical Alphanumeric Symbols: These symbols are simply clones of the Latin and Greek alphabets and Indic-Arabic decimal digits repeated in 15 different typefaces. They are intended as an arbitrary palette for mathematical notation. However, they tend to undermine the distinction between encoding characters and encoding visual glyphs, as well as Unicode's goal of supporting only plain text characters. Such alternate styling for a mathematical symbol palette could easily be created through rich text protocols instead.
; Enclosed Alphanumerics and ideographs (markers): These are characters included primarily for list markers. They do not constitute plain text characters. Moreover, the use of other rich text protocols is more appropriate, since the set of enclosed alphanumerics or ideographs provisioned in the UCS is limited.
; Circled alphanumerics and ideographs: The circled forms are also likely for use as markers. Again, using characters along with rich text protocols to encircle character strings is more flexible.
; Spaces and no-break spaces of varying widths: These characters are simply rich text variants of the core space (U+0020) and No-break Space (U+00A0). Other rich text protocols should be used instead, such as tracking, kerning or word-spacing attributes.
; Some subscript and superscript form characters: Many of the subscript and superscript characters are actually semantically distinct characters from the International Phonetic Alphabet and other writing systems and do not really fall in the category of rich text. However, others simply constitute rich text presentation forms of other Greek, Latin and numeral characters. These rich text superscript and subscript characters therefore properly belong to this category of rich text compatibility characters. Most of these are in the "Superscripts and Subscripts" or the "Latin-1 Supplement" blocks.

For all of these rich text compatibility characters the display of glyphs is typically distinct from their compatibility decomposition (related) characters. However, these are considered compatibility characters and their use is discouraged by the Unicode Consortium because they are not plain text characters, which is what Unicode seeks to support with its UCS and associated protocols. Rich text should be handled through non-Unicode protocols such as HTML, CSS, RTF and other such protocols.

The rich text compatibility characters comprise 1,451 of the 5,402 compatibility characters. These include all of the compatibility characters marked with keywords <circle> and <font> (except three listed under semantically distinct characters below); 11 space variants from the <compat> and canonical characters; and some of the keyword <super> and <sub> characters from the "Superscripts and Subscripts" block.
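The redundancy of the rich text variants shows up when text is folded with compatibility normalization. The sketch below uses Python's `unicodedata` module; the helper name `plain` and the sample string are assumptions chosen for illustration.

```python
import unicodedata

def plain(text: str) -> str:
    """Fold compatibility variants to their plain text equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# SUPERSCRIPT FOUR, MATHEMATICAL BOLD CAPITAL A, CIRCLED DIGIT ONE, EM SPACE
styled = "x\u2074 + \U0001d400 \u2460\u2003end"
print(plain(styled))

for ch in ("\u2074", "\U0001d400", "\u2460", "\u2003"):
    print(unicodedata.name(ch), "->", unicodedata.decomposition(ch))
```

Folding turns the superscript four into an ordinary digit, so the exponent is lost. That loss is the reason such styling is better carried by markup than by the characters themselves.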


Semantically distinct characters

Many compatibility characters are semantically distinct characters, though they may share representational glyphs with other characters. Some of these characters may have been included because most other character sets focused on one script or writing system. So, for example, the ISO and other Latin character sets likely included a character for π (pi) since, when focusing primarily on one writing system or script, those character sets would not otherwise have had characters for the common mathematical symbol π. However, with Unicode, mathematicians are free to use characters from any known script in the world to stand in for a mathematical set or mathematical constant. To date, Unicode has only added specific semantic support for a few such mathematical constants (for example the Planck constant, U+210E, and Euler constant, U+2107, both of which Unicode considers to be compatibility characters). Therefore, Unicode designates several mathematical symbols based on letters from Greek and Hebrew as compatibility characters. These include:
* Hebrew letter based symbols (4): alef (ℵ U+2135), bet (ℶ U+2136), gimel (ℷ U+2137) and dalet (ℸ U+2138)
* Greek letter based symbols (7): beta (ϐ U+03D0), theta (ϑ U+03D1), phi (ϕ U+03D5), pi (ϖ U+03D6), kappa (ϰ U+03F0), rho (ϱ U+03F1), capital theta (ϴ U+03F4)
While these compatibility characters are distinguished from their compatibility decomposition characters only by adding the word "symbol" to their name, they do represent long-standing distinct meanings in written mathematics. However, for all practical purposes they share the same semantics as their compatibility equivalent Greek or Hebrew letter. These may be considered borderline semantically distinguishable characters, so they are not included in the total.

Although it is not the intention of Unicode to encode measuring units, the repertoire includes six (6) such symbols that should not be used by authors: the characters' decompositions should be used instead.
* Unit symbols (6): Angstrom (Å U+212B: use U+00C5 instead), Ohm (Ω U+2126: use U+03A9 instead), Kelvin (K U+212A: use U+004B instead), Fahrenheit (℉ U+2109: use U+00B0 and U+0046 instead), Celsius (℃ U+2103: use U+00B0 and U+0043 instead), Micro Sign (µ U+00B5: use U+03BC instead)

Unicode also designates twenty-two (22) other letter-like symbols as compatibility characters.
* Other Greek letter-based symbols (4): lunate epsilon (ϵ U+03F5), lunate sigma (ϲ U+03F2), capital lunate sigma (Ϲ U+03F9), upsilon with hook (ϒ U+03D2)
* Mathematical constants (3): Euler constant (ℇ U+2107), Planck constant (ℎ U+210E), reduced Planck constant (ℏ U+210F)
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot leader (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), numero (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)

In addition, several scripts use glyph position such as superscripts and subscripts to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute a distinct character, similar to a hybrid between a diacritic and a letter, in the writing system (130 total).
* 112 characters representing abstract phonemes from phonetic alphabets such as the International Phonetic Alphabet, which use such positional glyphs to represent semantic differences (U+1D2C – U+1D6A, U+1D78, U+1D9B – U+1DBF, U+02B0 – U+02B8, U+02E0 – U+02E4)
* 14 characters from the Kanbun block (U+3192 – U+319F)
* 1 character from the Tifinagh script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the Georgian script: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine (º U+00BA) and feminine (ª U+00AA) ordinal indicators included in the Latin-1 Supplement block

Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.
* Capital Roman numerals (7): One (Ⅰ U+2160), Five (Ⅴ U+2164), Ten (Ⅹ U+2169), Fifty (Ⅼ U+216C), One Hundred (Ⅽ U+216D), Five Hundred (Ⅾ U+216E), One Thousand (Ⅿ U+216F)
* Lowercase variants (7): One (ⅰ U+2170), Five (ⅴ U+2174), Ten (ⅹ U+2179), Fifty (ⅼ U+217C), One Hundred (ⅽ U+217D), Five Hundred (ⅾ U+217E) and One Thousand (ⅿ U+217F)
* 18 precomposed Roman numerals in uppercase and lowercase variants (2–4, 6–9 and 11–12)
Roman numeral One Thousand actually has a third character representing a third form or glyph for the same semantic unit: One Thousand C D (ↀ U+2180). From this glyph, one can see where the practice of using a Latin M may have arisen. Strangely, though Unicode unifies the sign-value Roman numerals with the very different (though visually similar) Latin letters, the Indic Arabic place-value (positional) decimal digit numerals are repeated 24 times (a total of 240 code points for 10 numerals) throughout the UCS without any relational or decomposition mapping between them.

The presence of these 167 semantically distinct though visually similar characters (plus the borderline 11 Hebrew and Greek letter based symbols and the 6 measurement unit symbols) among the decomposable characters complicates the topic of compatibility characters. The Unicode standard discourages the use of compatibility characters by content authors. However, in certain specialized areas, these characters are important and quite similar to other characters that have not been included among the compatibility characters. For example, in certain academic circles the use of Roman numerals as distinct from Latin letters that share the same glyphs would be no different from the use of Cuneiform numerals or ancient Greek numerals. Collapsing the Roman numeral characters to Latin letter characters eliminates a semantic distinction. A similar situation exists for phonetic alphabet characters that use subscript or superscript positioned glyphs. In the specialized circles that use phonetic alphabets, authors should be able to do so without resorting to rich text protocols. As another example, the keyword <circle> compatibility characters are often used for describing the game Go. However, these uses of the compatibility characters constitute exceptions where the author has a special reason to use the otherwise discouraged characters.
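The replacement advice for the six unit symbols can be derived from the database instead of memorized. The following is a sketch with Python's `unicodedata` module; the names `UNIT_SIGNS` and `preferred` are hypothetical. It also shows that three of the six mappings are canonical and three are compatibility mappings, and that a decimal digit from another script has a numeric value but no decomposition.

```python
import unicodedata

# Angstrom, ohm, kelvin, degree Fahrenheit, degree Celsius, micro sign
UNIT_SIGNS = ["\u212b", "\u2126", "\u212a", "\u2109", "\u2103", "\u00b5"]

def preferred(ch: str) -> str:
    """Return the character sequence recommended in place of `ch`."""
    return unicodedata.normalize("NFKC", ch)

for ch in UNIT_SIGNS:
    target = preferred(ch)
    # If canonical normalization alone reaches the target, the mapping
    # is canonical; otherwise it needed the compatibility mapping.
    canonical = unicodedata.normalize("NFC", ch) == target
    kind = "canonical" if canonical else "compatibility"
    codes = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"{unicodedata.name(ch)}: use {codes} ({kind})")

# ARABIC-INDIC DIGIT THREE: a digit value, but no mapping to ASCII '3'.
arabic_indic_three = "\u0663"
print(unicodedata.decomposition(arabic_indic_three) == "")
print(unicodedata.digit(arabic_indic_three))
```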


Compatibility blocks

Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF except for the noncharacters). The compatibility blocks contain none of the semantically distinct compatibility characters, with only one exception: the rial currency symbol (﷼ U+FDFC). The compatibility decomposable characters in the compatibility blocks therefore fall unambiguously into the set of discouraged characters. Unicode recommends authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using, to give just one example, the finite set of circled or enclosed alphanumerics.

Unfortunately, there are a small number of characters even within the compatibility blocks that themselves are not compatibility characters and therefore may confuse authors. The "Enclosed CJK Letters and Months" block contains a single non-compatibility character: the 'Korean Standard Symbol' (㉿ U+327F). That symbol and 12 other characters have been included in the blocks for unknown reasons. The "CJK Compatibility Ideographs" block contains these non-compatibility unified Han ideographs:
# (U+FA0E): 﨎
# (U+FA0F): 﨏
# (U+FA11): 﨑
# (U+FA13): 﨓
# (U+FA14): 﨔
# (U+FA1F): 﨟
# (U+FA21): 﨡
# (U+FA23): 﨣
# (U+FA24): 﨤
# (U+FA27): 﨧
# (U+FA28): 﨨
# (U+FA29): 﨩
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B. In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.

Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:

Alphabetic Presentation Forms (1)
# Hebrew Point Judeo-Spanish Varika (U+FB1E): ﬞ. This is a glyph variant of Hebrew Point Rafe (U+05BF): ֿ, though Unicode provides no compatibility mapping.

Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+0029 ')'
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+0028 '('
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. Bismillah Ar-Rahman Ar-Raheem is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645). (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling

CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)
# Sesame Dot (U+FE45): ﹅
# White Sesame Dot (U+FE46): ﹆

Enclosed Alphanumerics (21 rich text variants)
# 11 Negative Circled Numbers (0 and 11 through 20) (U+24FF and U+24EB through U+24F4): ⓿, ⓫ – ⓴
# 10 Double Circled Numbers (1 through 10) (U+24F5 through U+24FE): ⓵ – ⓾
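The list of ideographs without a decomposition can be cross-checked against the character database. This is a sketch using Python's `unicodedata` module, with a hypothetical helper named `undecomposed`; the result reflects whichever Unicode version the interpreter bundles.

```python
import unicodedata

def undecomposed(start: int, end: int) -> list:
    """Assigned code points in [start, end] with no decomposition mapping."""
    found = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":  # unassigned or noncharacter
            continue
        if not unicodedata.decomposition(ch):
            found.append(f"U+{cp:04X}")
    return found

# CJK Compatibility Ideographs block
print(undecomposed(0xF900, 0xFAFF))

# Korean Standard Symbol in "Enclosed CJK Letters and Months"
print(unicodedata.decomposition("\u327f") == "")
```

Running the same helper over other ranges, for example U+FB00 to U+FB4F for Alphabetic Presentation Forms, is one way to find further characters that sit in compatibility blocks without a mapping.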


Normalization

Normalization is the process by which Unicode-conforming software first performs compatibility decomposition before making comparisons or collating text strings. This is similar to other operations needed when, for example, a user performs a case-insensitive or diacritic-insensitive search within some text. In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).
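A lossless comparison of this kind can be sketched with Python's `unicodedata` module. The names `compat_key` and `loosely_equal` are hypothetical, and combining compatibility decomposition with case folding in a single pass is a simplification of the full caseless matching procedure.

```python
import unicodedata

def compat_key(text: str) -> str:
    """Comparison key: compatibility decomposition plus case folding."""
    return unicodedata.normalize("NFKD", text).casefold()

def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring compatibility and case differences."""
    return compat_key(a) == compat_key(b)

# Contains ROMAN NUMERAL TWELVE and the fi and ffi ligatures.
stored = "Chapter \u216b: the \ufb01nal o\ufb03ce"
query = "chapter xii: the final office"

print(loosely_equal(stored, query))  # compared through the keys
print(stored == query)               # raw comparison of stored text
```

The keys exist only for the duration of the comparison, so the stored string keeps its compatibility characters. Writing the key back to storage would be the lossy variant described above.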


External links


Normalization (Chinese Text Project) - Unicode normalization issues in classical Chinese, with list of normalized CJK codepoints