Unicode has a certain amount of duplication of characters. These are pairs of single Unicode code points that are canonically equivalent. The reason for this is compatibility with legacy systems. Unless two characters are canonically equivalent, they are not "duplicate" in the narrow sense. There is, however, room for disagreement on whether two Unicode characters really encode the same grapheme, in cases such as the micro sign µ (U+00B5) versus the Greek letter μ (U+03BC). This should be clearly distinguished from Unicode characters that are rendered as identical or near-identical glyphs (homoglyphs), either because they are historically cognate (such as Greek Η vs. Latin H) or because of coincidental similarity (such as Greek Ρ vs. Latin P, Greek Η vs. Cyrillic Н, or the following homoglyph septuplet: the astronomical symbol for the Sun ☉, the "circled dot operator" ⊙, the Gothic letter 𐍈, the IPA symbol for a bilabial click ʘ, the Osage letter 𐓃, the Tifinagh letter ⵙ, and the archaic Cyrillic letter Ꙩ).
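The distinction between canonical duplicates, compatibility variants, and mere homoglyphs can be checked with Unicode normalization. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `relation` is hypothetical and chosen only for illustration.

```python
import unicodedata


def relation(a: str, b: str) -> str:
    """Classify two strings (hypothetical helper, not a Unicode API)."""
    # Canonical equivalence: identical after canonical decomposition.
    if unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b):
        return "canonical"
    # Compatibility equivalence: identical only after compatibility decomposition.
    if unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b):
        return "compatibility"
    # Anything else is a distinct character, however similar it looks.
    return "distinct"


print(relation("\u2126", "\u03a9"))  # OHM SIGN vs. GREEK CAPITAL LETTER OMEGA
print(relation("\u00b5", "\u03bc"))  # MICRO SIGN vs. GREEK SMALL LETTER MU
print(relation("\u0397", "H"))       # GREEK CAPITAL LETTER ETA vs. Latin H
print(relation("\u2609", "\u2299"))  # SUN vs. CIRCLED DOT OPERATOR
```

The four calls report canonical, compatibility, distinct, and distinct: only the first pair is a duplicate in the narrow sense, and homoglyphs are not related by any normalization form.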

Duplicate vs. derived character

Unicode aims at encoding graphemes, not individual "meanings" ("semantics") of graphemes, and not glyphs. It is a matter of case-by-case judgement whether a grapheme should receive a separate encoding when used in a technical context, e.g. Greek letters used as mathematical symbols. Thus, the choice to have a "micro sign" µ separate from Greek μ, but not a "mega sign" separate from Latin M, was a pragmatic decision by the Unicode Consortium for historical reasons (compatibility with Latin-1, which included a micro sign). Technically, µ and μ are not duplicate characters, in that the consortium viewed these symbols as distinct characters (while it regarded M for "mega" and Latin M as one and the same character).

Note that merely having different "meanings" is not sufficient grounds to split a grapheme into several characters. Thus, the acute accent may represent word accent in Welsh or Swedish, it may express vowel quality in French, and it may express vowel length in Hungarian, Icelandic or Irish. Since all these languages are written in the same script, namely Latin script, the acute accent in its various meanings is considered one and the same combining diacritic character (U+0301), just as the accented letter é is the same character in French and Hungarian. There is a separate "combining acute tone mark" at U+0341 for the romanization of tone languages; one important difference between the two is that in a language like French the acute accent can replace the dot over the lowercase i, whereas in a language like Vietnamese the acute tone mark is added above the dot. Diacritic signs for alphabets considered independent may be encoded separately, such as the acute ("tonos") for the Greek alphabet at U+0384, and for the Armenian alphabet at U+055B. Some Cyrillic-based alphabets (such as Russian) also use the acute accent, but there is no "Cyrillic acute" encoded separately, and U+0301 should be used for Cyrillic as well as Latin (see Cyrillic characters in Unicode). The point that the same grapheme can have many "meanings" is even more obvious considering e.g. the letter U, which has entirely different phonemic referents in the various languages that use it in their orthographies (English, French and German, for example), not to mention various uses of U as a symbol.
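The treatment of the acute accent described above can be inspected directly. This is a small sketch using Python's standard `unicodedata` module; the variable names are illustrative only. It also shows that normalization folds the tone mark U+0341 into U+0301, so the distinction between the two does not survive NFC or NFD.

```python
import unicodedata

# One combining acute accent (U+0301) serves every Latin-script language:
# "e" + U+0301 composes to the single character U+00E9.
e_acute = unicodedata.normalize("NFC", "e\u0301")
print(e_acute == "\u00e9", unicodedata.name(e_acute))

# The acute tone mark U+0341 decomposes canonically to U+0301.
tone_folded = unicodedata.normalize("NFD", "\u0341")
print(tone_folded == "\u0301")

# The marks mentioned in the text are separate characters with their own names.
for ch in ("\u0301", "\u0341", "\u0384", "\u055b"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Cyrillic reuses U+0301; there is no precomposed Cyrillic "a with acute",
# so NFC leaves the base letter and the accent as two code points.
cyr = unicodedata.normalize("NFC", "\u0430\u0301")
print(len(cyr))
```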

Compatibility issues

CJK fullwidth forms

In traditional Chinese character encodings, characters usually took either a single byte (known as halfwidth) or two bytes (known as fullwidth). Characters that took a single byte were generally displayed at half the width of those that took two bytes. Some characters, such as those of the Latin alphabet, were available in both halfwidth and fullwidth versions. As the halfwidth versions were more commonly used, they were generally the ones mapped to the standard code points for those characters. Therefore, a separate section was needed for the fullwidth forms to preserve the distinction.
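As an illustration of how the width distinction survives in Unicode, the sketch below uses Python's standard `unicodedata` module to print the East Asian Width property of a few characters and to fold width variants with NFKC. The helper name `fold_width` is hypothetical, and NFKC also folds other compatibility characters, so it is only an approximation of a pure width conversion.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold fullwidth and halfwidth variants to their standard code points.

    Hypothetical helper: NFKC folds every compatibility character,
    not only the width variants.
    """
    return unicodedata.normalize("NFKC", text)


samples = ["A", "\uff21", "\u30ab", "\uff76"]
for ch in samples:
    # east_asian_width returns codes such as Na (narrow), F (fullwidth),
    # W (wide) and H (halfwidth).
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.east_asian_width(ch))

print(fold_width("\uff21") == "A")       # fullwidth Latin A folds to ASCII A
print(fold_width("\uff76") == "\u30ab")  # halfwidth katakana KA folds to U+30AB
```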

Letterlike symbols

In some cases, specific graphemes have acquired a specialized symbolic or technical meaning separate from their original function. A prominent example is the Greek letter π, which is widely recognized as the symbol for the ratio of a circle's circumference to its diameter, even by people not literate in Greek. Several variants of the entire Greek and Latin alphabets, intended specifically for use as mathematical symbols, are encoded in the Mathematical Alphanumeric Symbols range. This range disambiguates characters that would usually be considered font variants but are encoded separately because such font variants (e.g. L vs. "script L" vs. "blackletter L" vs. "boldface blackletter L") are widely used as distinctive mathematical symbols. It is intended for use only in mathematical or technical notation, not in non-technical text (Draft Unicode Technical Report #25).
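The font-variant status of these symbols is visible in their decomposition mappings. The sketch below, using Python's standard `unicodedata` module, looks at the four variants of L named above; the code points chosen (U+2112, which sits in the Letterlike Symbols block, plus U+1D50F and U+1D577) are this example's assumption about which characters the text has in mind.

```python
import unicodedata

# Plain L, script L, Fraktur (blackletter) L, bold Fraktur L.
variants = ["L", "\u2112", "\U0001d50f", "\U0001d577"]

for ch in variants:
    # A "<font>" tag marks a compatibility decomposition that differs from
    # the target letter only in font style; plain L has no decomposition.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), repr(unicodedata.decomposition(ch)))

# Compatibility normalization discards the style and leaves the plain letter.
folded = [unicodedata.normalize("NFKC", ch) for ch in variants]
print(folded)
```

Because NFKC erases the distinction, text that uses these characters as mathematical symbols should not be normalized with the compatibility forms if the notation is to be preserved.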

List

Greek

Many Greek letters are used as technical symbols. All of the Greek letters are encoded in the Greek section of Unicode, but many are encoded a second time under the name of the technical symbol they represent. The "micro sign" (U+00B5, µ) is obviously inherited from ISO 8859-1, but the origin of the others is less clear. Other Greek glyph variants encoded as separate characters include the lunate sigma Ϲ ϲ contrasting with Σ σ, the final sigma ς (strictly speaking a contextual glyph variant) contrasting with σ, and the qoppa numeral symbol Ϟ ϟ contrasting with the archaic Ϙ ϙ. Greek letters assigned separate "symbol" code points include the letterlike symbols ϐ, ϵ, ϑ, ϖ, ϱ, ϒ, and ϕ (contrasting with β, ε, θ, π, ρ, Υ, φ); the ohm sign Ω (U+2126, contrasting with Ω, U+03A9); and the mathematical operators for the product ∏ and sum ∑ (contrasting with Π and Σ).
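The three groups in the last sentence behave differently under normalization, which the following sketch with Python's standard `unicodedata` module makes explicit. The dictionary name `symbol_to_letter` is illustrative; the pairs themselves are the ones listed in the text.

```python
import unicodedata

# Greek "symbol" code points and the ordinary letters they contrast with.
symbol_to_letter = {
    "\u03d0": "\u03b2",  # beta symbol -> beta
    "\u03f5": "\u03b5",  # lunate epsilon symbol -> epsilon
    "\u03d1": "\u03b8",  # theta symbol -> theta
    "\u03d6": "\u03c0",  # pi symbol -> pi
    "\u03f1": "\u03c1",  # rho symbol -> rho
    "\u03d2": "\u03a5",  # upsilon with hook symbol -> capital upsilon
    "\u03d5": "\u03c6",  # phi symbol -> phi
}

# The symbol forms are compatibility variants: only NFKC folds them.
for symbol, letter in symbol_to_letter.items():
    same_nfc = unicodedata.normalize("NFC", symbol) == letter
    same_nfkc = unicodedata.normalize("NFKC", symbol) == letter
    print(symbol, letter, same_nfc, same_nfkc)

# The ohm sign is a canonical duplicate: even NFC replaces it with omega.
ohm_nfc = unicodedata.normalize("NFC", "\u2126")
print(ohm_nfc == "\u03a9")

# The product and sum operators have no decomposition at all.
print(repr(unicodedata.decomposition("\u220f")), repr(unicodedata.decomposition("\u2211")))
```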

Roman numerals

Unicode has a number of characters specifically designated as Roman numerals, in the range U+2160 to U+2188 of the Number Forms block. For example, Roman 1988 (MCMLXXXVIII) could alternatively be written as ⅯⅭⅯⅬⅩⅩⅩⅧ. This range includes both upper- and lowercase numerals, as well as pre-combined glyphs for numbers up to 12 (Ⅻ for XII), mainly intended for clock faces. The pre-combined glyphs should only be used to represent the individual numbers where the use of individual glyphs is not wanted, and not to replace compounded numbers. For example, one can combine Ⅹ with Ⅰ to mean Roman numeral eleven (ⅩⅠ). Ⅺ (U+216A) and the sequence ⅩⅠ are compatibility equivalents rather than canonical equivalents: compatibility normalization maps both to the Latin letters XI. Such characters are also referred to as composite compatibility characters or decomposable compatibility characters. They would not normally have been included within the Unicode standard except for compatibility with other existing encodings (see Unicode compatibility characters). The goal was to accommodate simple translation from existing encodings into Unicode. This makes translations in the opposite direction complicated, because multiple Unicode characters may map to a single character in another encoding. Without the compatibility concerns, the only characters necessary would be: Ⅰ, Ⅴ, Ⅹ, Ⅼ, Ⅽ, Ⅾ, Ⅿ, ⅰ, ⅴ, ⅹ, ⅼ, ⅽ, ⅾ, ⅿ, ↀ, ↁ, ↂ, ↇ, ↈ, and Ↄ; all other Roman numerals can be composed from these.
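The relationship between the pre-combined numerals, the single-numeral sequences, and plain Latin letters can be checked with normalization. The sketch below uses Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u216b"           # ROMAN NUMERAL TWELVE
sequence = "\u2169\u2160\u2160"  # ROMAN NUMERAL TEN, ONE, ONE
year = "\u216f\u216d\u216f\u216c\u2169\u2169\u2169\u2167"  # 1988 in Roman numeral characters

# Canonical normalization leaves the numerals alone, so the two spellings
# of twelve stay different strings.
nfc_equal = unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", sequence)
print(nfc_equal)

# Compatibility normalization maps both spellings to Latin letters.
print(unicodedata.normalize("NFKC", precomposed))
print(unicodedata.normalize("NFKC", sequence))

# The same folding turns the year into ordinary Latin capitals.
year_latin = unicodedata.normalize("NFKC", year)
print(year_latin)
```

The folding is one-way: once a numeral has been mapped to Latin letters, nothing in the text records which Roman numeral character it came from, which is the round-trip complication noted above.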
