Many scripts in Unicode, such as Arabic, have special orthographic rules that require certain combinations of letterforms to be combined into special ligature forms. In English, the common ampersand (&) developed from a ligature in which the handwritten Latin letters ''e'' and ''t'' (spelling ''et'', Latin for ''and'') were combined. The rules governing ligature formation in Arabic can be quite complex, requiring special script-shaping technologies such as the Arabic Calligraphic Engine by Thomas Milo's DecoType (unicode.org, "Biography: Thomas Milo - DecoType").
In Unicode, the Arabic script is contained in the following blocks:
* Arabic (0600–06FF, 256 characters)
* Arabic Supplement (0750–077F, 48 characters)
* Arabic Extended-B (0870–089F, 42 characters)
* Arabic Extended-A (08A0–08FF, 96 characters)
* Arabic Presentation Forms-A (FB50–FDFF, 631 characters)
* Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
* Rumi Numeral Symbols (10E60–10E7F, 31 characters)
* Arabic Extended-C (10EC0–10EFF, 7 characters)
* Indic Siyaq Numbers (1EC70–1ECBF, 68 characters)
* Ottoman Siyaq Numbers (1ED00–1ED4F, 61 characters)
* Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)

The basic Arabic range encodes the standard letters, the most common diacritics, and the Arabic-Indic digits, but does not encode contextual forms (U+0621–U+0652 are directly based on ISO 8859-6). The Arabic Supplement range encodes letter variants mostly used for writing African (non-Arabic) languages. The Arabic Extended-B and Arabic Extended-A ranges encode additional Qur'anic annotations and letter variants used for various non-Arabic languages. The Arabic Presentation Forms-A range encodes contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The Arabic Presentation Forms-B range encodes spacing forms of Arabic diacritics, and more contextual letter forms. The presentation forms are present only for compatibility with older standards, and are not currently needed for coding text.

The Arabic Mathematical Alphabetic Symbols block encodes characters used in Arabic mathematical expressions. The Indic Siyaq Numbers block contains a specialized subset of Arabic script that was used for accounting in India under the Mughal Empire by the 17th century and through the middle of the 20th century. The Ottoman Siyaq Numbers block contains a specialized subset of Arabic script, also known as ''Siyakat'' numbers, used for accounting in Ottoman Turkish documents.
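The block ranges listed above can be turned into a simple lookup. The following is a minimal sketch in Python; the names `ARABIC_BLOCKS` and `arabic_block` are hypothetical and introduced only for illustration, and the ranges are copied from the list above rather than read from the Unicode Character Database.

```python
from typing import Optional

# (first, last, name) for each block listed in this article; ranges are inclusive.
# Hypothetical table: the values are transcribed from the list above.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10E60, 0x10E7F, "Rumi Numeral Symbols"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]


def arabic_block(ch: str) -> Optional[str]:
    """Return the name of the Arabic-script block containing `ch`, or None."""
    cp = ord(ch)
    for first, last, name in ARABIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

A block range covers unassigned code points as well, so a match here says only that the code point falls inside the block, not that a character is assigned there.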


Contextual forms

Below is a demonstration for the basic alphabet used in Modern Standard Arabic illustrating how Arabic letters are expected to appear in different contexts. Code points listed as contextual forms "should ''not'' be used in general interchange". Unicode has other methods of encoding the difference if necessary, such as the zero-width joiner.
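The relationship between a base letter and its contextual forms can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only: `contextual_forms` and `request_form` are hypothetical helper names, the scan is limited to the Arabic Presentation Forms-B range given earlier, and the zero-width joiner placement follows the usual convention (a joiner on the side where a connection is wanted). Whether a particular form is actually displayed depends on the font and shaping engine.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")


def contextual_forms(letter: str) -> dict:
    """Map form name -> compatibility code point in FE70-FEFF that
    decomposes to exactly `letter` (single-letter forms only)."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        ch = chr(cp)
        parts = unicodedata.decomposition(ch).split()
        # Keep entries such as "<initial> 0628"; skip ligatures and
        # code points with no decomposition.
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            if chr(int(parts[1], 16)) == letter:
                forms[parts[0].strip("<>")] = ch
    return forms


def request_form(letter: str, form: str) -> str:
    """Encode `letter` with joiners that ask a renderer for `form`,
    without using presentation-form code points."""
    if form == "isolated":
        return letter
    if form == "initial":
        return letter + ZWJ
    if form == "medial":
        return ZWJ + letter + ZWJ
    if form == "final":
        return ZWJ + letter
    raise ValueError(f"unknown form: {form}")


beh = "\u0628"  # ARABIC LETTER BEH
beh_forms = contextual_forms(beh)
# Compatibility normalization folds a presentation form back to the base letter.
folded = unicodedata.normalize("NFKC", beh_forms["initial"])
```

The design point is that interchange text keeps the base letter (here U+0628) and leaves shaping to the renderer; the presentation-form code points appear only as lookup results.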


Punctuation and ornaments

Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing, and the Arabic comma is often replaced by the Latin script comma ⟨,⟩, which is also used as the decimal separator when the Eastern Arabic numerals are used (e.g. ⟨100.6⟩ compared to ⟨١٠٠,٦⟩).
* U+066D ٭
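As a small illustration of the numeral and separator usage described above, the following Python sketch rewrites an ASCII decimal string with Eastern Arabic (Arabic-Indic) digits, U+0660–U+0669. The function name `to_eastern_arabic` is hypothetical. The default separator is the Latin comma, matching the practice described in this section; U+066B (ARABIC DECIMAL SEPARATOR) can be passed instead.

```python
# Translation table from ASCII digits to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
ASCII_TO_ARABIC_INDIC = str.maketrans("0123456789", ARABIC_INDIC_DIGITS)

ARABIC_DECIMAL_SEPARATOR = "\u066b"


def to_eastern_arabic(number_text: str, decimal_separator: str = ",") -> str:
    """Rewrite an ASCII decimal string such as "100.6" with Arabic-Indic
    digits, replacing "." with the chosen decimal separator."""
    return number_text.translate(ASCII_TO_ARABIC_INDIC).replace(".", decimal_separator)


with_latin_comma = to_eastern_arabic("100.6")
with_arabic_separator = to_eastern_arabic("100.6", ARABIC_DECIMAL_SEPARATOR)
```

The sketch handles only digits and a single decimal point; grouping separators and locale rules are out of scope.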


Word ligatures

Arabic Presentation Forms-A has a few characters defined as "word ligatures" for terms frequently used in formulaic expressions in Arabic. They are rarely used outside professional liturgical typing; likewise, the Rial grapheme is normally written out in full rather than with the ligature.
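The word ligatures can be listed, together with the letter sequences they stand for, using Python's `unicodedata` module. This is a sketch under assumptions: the range U+FDF0–U+FDFD is chosen here for illustration and is not stated in the text above, and `describe_word_ligatures` is a hypothetical helper name. Some of these code points have no compatibility decomposition, in which case normalization returns them unchanged.

```python
import unicodedata


def describe_word_ligatures(first: int = 0xFDF0, last: int = 0xFDFD) -> list:
    """Return (code point label, character name, NFKC expansion) for each
    code point in the inclusive range [first, last]."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # A default avoids ValueError if the local Unicode database
        # has no name for a code point.
        name = unicodedata.name(ch, "<unnamed>")
        expanded = unicodedata.normalize("NFKC", ch)
        rows.append((f"U+{cp:04X}", name, expanded))
    return rows


rows = describe_word_ligatures()
rial = next(row for row in rows if row[0] == "U+FDFC")
```

For the Rial sign, the expansion is the ordinary four-letter spelling, which is the fully written form the paragraph above says is normally used.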


Code blocks


Arabic


Character table


Compact table


Arabic Supplement


Arabic Extended-B


Arabic Extended-A


Arabic Presentation Forms-A

The characters in this block are mostly ligatures that can be created from the characters in the preceding charts, with the exception of the bracket-like graphemes. Some of them are ligatures of common liturgical phrases.


Arabic Presentation Forms-B

These can all be created from the basic chart's characters.


Rumi Numeral Symbols


Arabic Extended-C


Indic Siyaq Numbers


Ottoman Siyaq Numbers


Arabic Mathematical Alphabetic Symbols


References


External links

* Scheherazade (/software.sil.org/Scheherazade) or Scheherazade New (/fonts.google.com/specimen/Scheherazade+New?subset=arabic), an extended Arabic script font designed by SIL International, distributed under the SIL Open Font License (OFL)
* Harmattan (/fonts.google.com/specimen/Harmattan?subset=arabic), an extended Arabic script font designed by SIL International for West Africa, distributed under the SIL Open Font License (OFL)