Bi-directional text

A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, which is changing text direction in each row. Many computer programs fail to display bidirectional text correctly. For example, this page is mostly LTR English script, and here is the RTL Hebrew name Sarah: שרה, spelled sin (ש) on the right, resh (ר) in the middle, and heh (ה) on the left. Some so-called right-to-left scripts such as the Persian script (and Arabic) are mostly but not exclusively right-to-left; mathematical expressions, numeric dates and numbers bearing units are embedded from left to right. That also happens if e.g. English is embedded in them, or vice versa, if Arabic, Persian or Hebrew is embedded in a left-to-right script.


Bidirectional script support

Bidirectional script support is the capability of a computer system to correctly display bidirectional text. The term is often shortened to "BiDi" or "bidi". Early computer installations were designed to support only a single writing system, typically for left-to-right scripts based on the Latin alphabet. Adding new character sets and character encodings enabled a number of other left-to-right scripts to be supported, but did not easily support right-to-left scripts such as Arabic or Hebrew, and mixing the two was not practical. Right-to-left scripts were introduced through encodings like ISO/IEC 8859-6 and ISO/IEC 8859-8, storing the letters (usually) in writing and reading order. It is possible to simply flip the left-to-right display order to a right-to-left display order, but doing this sacrifices the ability to correctly display left-to-right scripts. With bidirectional script support, it is possible to mix characters from different scripts on the same page, regardless of writing direction. In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.


Unicode bidi support

The Unicode standard calls for characters to be ordered 'logically', i.e. in the sequence in which they are intended to be interpreted, as opposed to 'visually', the sequence in which they appear. This distinction is relevant for bidi support because at any bidi transition, the visual presentation ceases to be the 'logical' one. Thus, in order to offer bidi support, Unicode prescribes an algorithm for converting the logical sequence of characters into the correct visual presentation. For this purpose, the Unicode encoding standard classifies each of its characters as one of four types: 'strong', 'weak', 'neutral', and 'explicit formatting'.
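
As a small illustration of logical order, the following Python sketch lists the characters of a mixed string in the order in which they are stored. The sample string and the helper name `logical_order` are chosen here for illustration and do not come from the Unicode standard. The Hebrew letters of the name Sarah sit in memory in reading order, whatever a renderer later does with them on screen.

```python
import unicodedata

# "Sarah: " followed by the Hebrew name, written with escapes so that the
# stored (logical) order is unambiguous in any editor.
SAMPLE = "Sarah: \u05E9\u05E8\u05D4"


def logical_order(text):
    """Return (index, Unicode character name) pairs in storage order."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text)]


# Show only the three Hebrew letters at the end of the string.
for index, name in logical_order(SAMPLE)[-3:]:
    print(index, name)
```

The first Hebrew letter read by a person is also the first one stored, even though a bidi-aware display draws it rightmost.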


Strong characters

Strong characters are those with a definite direction. Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs, non-European or non-Arabic digits, and punctuation characters ''that are specific to only those scripts''.


Weak characters

Weak characters are those with vague direction. Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols.


Neutral characters

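Neutral characters have direction indeterminable without context. Examples include paragraph separators, tabs, and most other whitespace characters. Other punctuation that is common to many scripts, such as the exclamation mark and parentheses, also falls within this category. The colon, comma, full stop, and no-break space, by contrast, carry the class CS (common number separator), which Unicode Standard Annex #9 groups with the weak types.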
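
The character types described above can be inspected with Python's standard `unicodedata` module, whose `bidirectional()` function returns the bidi class abbreviation of a character. The sketch below maps those abbreviations onto the four broad types; the grouping follows Table 4 of Unicode Standard Annex #9 as best understood here, and the function name `broad_bidi_type` is an illustrative assumption.

```python
import unicodedata

# Broad grouping of bidi classes (assumed from UAX #9, Table 4).
BROAD_TYPE = {
    "L": "strong", "R": "strong", "AL": "strong",
    "EN": "weak", "ES": "weak", "ET": "weak", "AN": "weak",
    "CS": "weak", "NSM": "weak", "BN": "weak",
    "B": "neutral", "S": "neutral", "WS": "neutral", "ON": "neutral",
}
EXPLICIT = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def broad_bidi_type(ch):
    """Classify one character as strong, weak, neutral or explicit formatting."""
    cls = unicodedata.bidirectional(ch)
    if cls in EXPLICIT:
        return "explicit formatting"
    # Unassigned code points have an empty class string.
    return BROAD_TYPE.get(cls, "unknown")


for ch in ["A", "\u05E9", "7", "+", "$", "\t", " ", "!", "\u2067"]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), broad_bidi_type(ch))
```

The results depend on the Unicode version bundled with the Python interpreter, so classes for recently added characters may differ between installations.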


Explicit formatting

Explicit formatting characters, also referred to as "directional formatting characters", are special Unicode sequences that direct the algorithm to modify its default behavior. These characters are subdivided into "marks", "embeddings", "isolates", and "overrides". Their effects continue until the occurrence of either a paragraph separator, or a "pop" character.


Marks

If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. These errors are corrected or prevented with "pseudo-strong" characters. Such Unicode control characters are called ''marks''. The mark (the left-to-right mark, LRM, or the right-to-left mark, RLM) is to be inserted into a location to make an enclosed weak character inherit its writing direction. For example, to correctly display the trademark symbol (™) for an English name brand (LTR) in an Arabic (RTL) passage, an LRM mark is inserted after the trademark symbol if the symbol is not followed by LTR text. If the LRM mark is not added, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character. Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order.
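
The trademark case can be sketched in Python as follows. The code points U+200E (LRM) and U+200F (RLM) are the standard marks; the helper name `brand_with_lrm` and the placeholder right-to-left text are assumptions made only for this illustration.

```python
import unicodedata

LRM = "\u200E"         # LEFT-TO-RIGHT MARK
RLM = "\u200F"         # RIGHT-TO-LEFT MARK
TRADE_MARK = "\u2122"  # the trademark symbol


def brand_with_lrm(brand):
    """Append the trademark symbol and an LRM so the symbol stays with the LTR brand."""
    return brand + TRADE_MARK + LRM


# Placeholder RTL text (three Hebrew letters) standing in for an RTL passage.
rtl_text = "\u05E9\u05E8\u05D4"
passage = rtl_text + " " + brand_with_lrm("Wikipedia") + " " + rtl_text

# The marks are invisible but carry a strong direction of their own.
print([unicodedata.bidirectional(mark) for mark in (LRM, RLM)])
```

Because the LRM has the strong class L, the trademark symbol ends up between two left-to-right characters and is laid out with the brand name.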


Embeddings

The "embedding" directional formatting characters are the classical Unicode method of explicit formatting, and as of Unicode 6.3, are being discouraged in favor of "isolates". An "embedding" signals that a piece of text is to be treated as directionally distinct. The text within the scope of the embedding formatting characters is not independent of the surrounding text. Also, characters within an embedding can affect the ordering of characters outside. Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use.


Isolates

The "isolate" directional formatting characters signal that a piece of text is to be treated as directionally isolated from its surroundings. As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents – once target platforms are known to support them. These formatting characters were introduced after it became apparent that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use. Unlike the legacy 'embedding' directional formatting characters, 'isolate' characters have no effect on the ordering of the text outside their scope. Isolates can be nested, and may be placed within embeddings and overrides.
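
A minimal sketch of wrapping text in isolates is shown below. The code points are the isolate characters defined by Unicode (LRI, RLI, FSI and the closing PDI); the helper name `isolate` and its `direction` parameter are hypothetical.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE (direction taken from the first strong character)
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

_OPENERS = {"ltr": LRI, "rtl": RLI, None: FSI}


def isolate(text, direction=None):
    """Wrap text in an isolate; direction is 'ltr', 'rtl' or None for first-strong."""
    return _OPENERS[direction] + text + PDI


# Isolates can be nested: an LTR fragment inside an RTL fragment.
nested = isolate("\u05E9\u05E8\u05D4 " + isolate("v1.2", "ltr"), "rtl")
print(len(nested), nested.count(PDI))
```

Using FSI when the direction of inserted text is unknown, such as a user name interpolated into a message, avoids guessing the direction in advance.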


Overrides

The "override" directional formatting characters allow for special cases, such as part numbers (e.g. to force a part number made of mixed English, digits and Hebrew letters to be written from right to left), and should be avoided wherever possible. Like the other directional formatting characters, "overrides" can be nested one inside another, and placed within embeddings and isolates.


Pops

The "pop" directional formatting characters terminate the scope of the most recent "embedding", "override", or "isolate".
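
The pairing of openers and pops can be checked with a small stack, as in the following sketch. It is a simplified reading of the scoping rules for a single paragraph, not the full algorithm: it assumes that PDF (U+202C) closes only the most recent embedding or override, and that PDI (U+2069) closes the most recent isolate together with anything opened inside it. The function name `unclosed_scopes` is illustrative.

```python
# Each opener is mapped to the kind of pop character that closes it.
OPENERS = {
    "\u202A": "PDF",  # LRE, left-to-right embedding
    "\u202B": "PDF",  # RLE, right-to-left embedding
    "\u202D": "PDF",  # LRO, left-to-right override
    "\u202E": "PDF",  # RLO, right-to-left override
    "\u2066": "PDI",  # LRI, left-to-right isolate
    "\u2067": "PDI",  # RLI, right-to-left isolate
    "\u2068": "PDI",  # FSI, first strong isolate
}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def unclosed_scopes(paragraph):
    """Count formatting scopes still open at the end of one paragraph."""
    stack = []
    for ch in paragraph:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch == PDF:
            # Unmatched PDF characters are ignored in this sketch.
            if stack and stack[-1] == "PDF":
                stack.pop()
        elif ch == PDI and "PDI" in stack:
            # Close everything up to and including the innermost isolate.
            while stack.pop() != "PDI":
                pass
    return len(stack)
```

A nonzero result means a scope would run on to the end of the paragraph, which is often a sign of a missing pop character.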


Runs

In the algorithm, each sequence of concatenated strong characters is called a "run". A "weak" character that is located between two "strong" characters with the same orientation will inherit their orientation. A "weak" character that is located between two "strong" characters with a different writing direction will inherit the main context's writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL).
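
The inheritance rule described in this section can be sketched as follows. This is a deliberate simplification that implements only the rule as stated here; the real algorithm in Unicode Standard Annex #9 applies further rules, for example for numbers and explicit formatting characters. The function names and the convention of treating the paragraph edges as having the base direction are assumptions of this sketch.

```python
import unicodedata


def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strong characters, None for all others."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "LTR"
    if cls in ("R", "AL"):
        return "RTL"
    return None


def resolve_directions(text, base="LTR"):
    """Assign a direction to every character using the neighbour rule."""
    strong = [strong_direction(ch) for ch in text]
    resolved = []
    for i, direction in enumerate(strong):
        if direction is not None:
            resolved.append(direction)
            continue
        # Nearest strong neighbour on each side; the paragraph edges
        # count as having the base direction.
        left = next((d for d in reversed(strong[:i]) if d), base)
        right = next((d for d in strong[i + 1:] if d), base)
        resolved.append(left if left == right else base)
    return resolved


# A space between a Latin and a Hebrew letter follows the base direction.
print(resolve_directions("a \u05D0", base="LTR"))
print(resolve_directions("a \u05D0", base="RTL"))
```

Grouping consecutive characters with the same resolved direction would then yield the segments that a renderer lays out in one direction each.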


Table of possible BiDi character types


Security

Unicode bidirectional characters are used in the Trojan Source vulnerability. Visual Studio Code has highlighted BiDi control characters since version 1.62, released in October 2021. Visual Studio has highlighted BiDi control characters since version 17.0.3, released on December 14, 2021.
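
A check of the kind such editors perform can be approximated in a few lines of Python. The sketch below is not the implementation used by either editor; it simply reports every directional formatting character or mark found in a piece of source text. The function name `find_bidi_controls` and the sample snippet are illustrative assumptions.

```python
import unicodedata

# Marks, embeddings, overrides, isolates and their pop characters.
BIDI_CONTROLS = frozenset(
    "\u061C\u200E\u200F"        # ALM, LRM, RLM
    "\u202A\u202B\u202C"        # LRE, RLE, PDF
    "\u202D\u202E"              # LRO, RLO
    "\u2066\u2067\u2068\u2069"  # LRI, RLI, FSI, PDI
)


def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control, 1-based."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# Hypothetical snippet with an override hidden inside a string literal.
snippet = 'access = "user\u202E \u2066// check\u2069"\nprint(access)'
for hit in find_bidi_controls(snippet):
    print(hit)
```

A check like this could run in a pre-commit hook or code review tool so that invisible reordering characters in source files are surfaced to a human reviewer.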


Scripts using bidirectional text


Egyptian hieroglyphs

Egyptian hieroglyphs were written bidirectionally, where the signs that had a distinct "head" or "tail" faced the beginning of the line.


Chinese characters and other CJK scripts

Chinese characters can be written in either direction as well as vertically (top to bottom then right to left), especially in signs (such as plaques), but the orientation of the individual characters does not change. This can often be seen on tour buses in China, where the company name customarily runs from the front of the vehicle to its rear, that is, from right to left on the right side of the bus, and from left to right on the left side of the bus. English texts on the right side of the vehicle are also quite commonly written in reverse order. The same convention appears on aircraft: on the right side of a Hainan Airlines aircraft the text runs from right to left (空航南海), while the left side shows the text running from left to right (海南航空). Likewise, other CJK scripts made up of the same square characters, such as the Japanese writing system and Korean writing system, can also be written in any direction, although left-to-right, top-to-bottom, and right-to-left are most common.


Boustrophedon

Boustrophedon is a writing style found in ancient Greek inscriptions and in Hungarian runes. This method of writing alternates direction, and usually reverses the individual characters, on each successive line.


Moon type

Moon type is an embossed adaptation of the Latin alphabet invented as a tactile alphabet for the blind. Initially the text changed direction (but not character orientation) at the end of the lines. Special embossed lines connected the end of a line and the beginning of the next ("Moon Type for the Blind", Ramseyer Bible Collection, Kathryn A. Martin Library, University of Minnesota Duluth). Around 1990, it changed to a left-to-right orientation.


See also

* Internationalization and localization
* Horizontal and vertical writing in East Asian scripts
* Combining Cyrillic Millions
* Right-to-left mark
* Transformation of text
* Boustrophedon


External links


* Unicode Standard Annex #9: The Bidirectional Algorithm
* W3C guidelines on authoring techniques for bi-directional text - includes examples and good explanations
* ICU (International Components for Unicode) contains an implementation of the bi-directional algorithm, along with other internationalization services