Unicode control characters

Many Unicode characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000 NULL) is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string (as opposed to a starting address and a length), since the string ends once the program reads the null character.

In the narrowest sense, a ''control code'' is a character with the general category Cc, which comprises the C0 and C1 control codes, a concept defined in ISO/IEC 2022 and inherited by Unicode, with the most common set being defined in ISO/IEC 6429. Control codes are handled distinctly from ordinary Unicode characters, for example, by not being assigned character names (although they are assigned normative formal aliases). In a broader sense, other non-printing format characters, such as those used in bidirectional text, are also referred to as ''control characters'' by software; these are mostly assigned to the general category Cf (format), used for format effectors introduced and defined by Unicode itself.
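
The null-terminated string convention mentioned above can be illustrated with a minimal Python sketch. The function name read_c_string and the sample buffer are hypothetical and chosen only for illustration; the sketch simulates how a C program finds the end of a string from a starting address alone.

```python
def read_c_string(memory: bytes, start: int) -> bytes:
    """Return the bytes from `start` up to (not including) the next NUL byte.

    Only a starting offset is needed; the length is discovered by scanning
    for the byte value 0, as a C program would do.
    """
    end = memory.index(0, start)  # raises ValueError if no NUL byte follows
    return memory[start:end]


# Hypothetical memory region holding two consecutive NUL-terminated strings.
memory = b"hello\x00world\x00"
first = read_c_string(memory, 0)
second = read_c_string(memory, 6)
```

A consequence of this design is that such a string cannot itself contain U+0000, since the first null character always ends it.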


Category "Cc" control codes (C0 and C1)

The control code ranges 0x00–0x1F ("C0") and 0x7F originate from the 1967 edition of US-ASCII. The standard ISO/IEC 2022 (ECMA-35) defines extension methods for ASCII, including a secondary "C1" range of 8-bit control codes from 0x80 to 0x9F, equivalent to 7-bit sequences of ESC followed by a byte from 0x40 through 0x5F. Collectively, codes in these ranges are known as the C0 and C1 control codes. Although ISO/IEC 2022 allows for the existence of multiple control code sets specifying differing interpretations of these control codes, their most common interpretation is specified in ISO/IEC 6429 (ECMA-48).

The ISO/IEC 8859 series of encodings conforms to ISO/IEC 4873 (ECMA-43) level 1, a subset of ISO/IEC 2022 designed for 8-bit character encodings, and therefore reserves the range 0x80–0x9F for use as non-printing codes by C1 control code sets such as ISO/IEC 6429. Unicode inherits its first and second blocks (comprising U+0000 through U+00FF) from ASCII and ISO/IEC 8859-1, thus incorporating the C0 and C1 control code ranges (U+0000–U+001F, U+007F–U+009F) as general category "Cc". It does not assign normative names to these control codes, though it does assign them normative aliases.

Category "Cc" control codes can serve a variety of purposes, not limited to format effectors: for example, the default ASCII C0 set includes six format effectors (BS, HT, LF, VT, FF, and CR), ten transmission controls, four device controls, four information separators and eight other control codes. Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by terminal emulators. Certain characters are commonly used for formatting or sentinel purposes:
* U+0000 NULL (used in null-terminated strings)
* U+0009 CHARACTER TABULATION (inserted by the tab key)
* U+000A LINE FEED (used as a line break)
* U+000C FORM FEED (denotes a page break in a plain text file)
* U+000D CARRIAGE RETURN (used in some line-breaking conventions)
* U+0085 NEXT LINE (sometimes used as a line break in text transcoded from EBCDIC)

Unicode only specifies semantics for U+0009–U+000D, U+001C–U+001F, and U+0085 (the ASCII format effectors except for BS, plus the ASCII information separators and the C1 NEL). The rest of the "Cc" control codes are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default. Furthermore, certain specialised higher-level protocols, such as transcoded Teletext, may include a different interpretation of the entire C0 control code range.
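
The "Cc" ranges described above can be checked against a Unicode character database. The following is a minimal sketch using Python's standard unicodedata module; the helper name cc_code_points and the "<no name>" placeholder are hypothetical, and the result reflects whichever Unicode version the Python build ships with.

```python
import unicodedata


def cc_code_points() -> list[int]:
    """Collect every code point whose general category is Cc."""
    return [
        cp for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == "Cc"
    ]


controls = cc_code_points()

# The C0 range plus DELETE and the C1 range, as stated in the text.
expected = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))

# Control codes have no character name, so a default must be supplied;
# without one, unicodedata.name() raises ValueError for these characters.
line_feed_name = unicodedata.name("\n", "<no name>")
```

Under these assumptions, controls should equal expected (65 code points), and line_feed_name falls back to the placeholder because LINE FEED is only an alias, not a character name.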


Unicode-introduced separators

In an attempt to simplify the several newline characters used in legacy text, Unicode introduces its own newline characters to separate either lines or paragraphs: U+2028 LINE SEPARATOR (abbreviated LS or LSEP) and U+2029 PARAGRAPH SEPARATOR (abbreviated PS or PSEP). Like CR and LF, LS and PS are effectors for text formatting; unlike CR and LF, they are not treated as "control codes" for ECMA-35/ECMA-48 purposes (category Cc), rather having semantics defined entirely by Unicode itself. They are assigned to the ''sui generis'' Unicode categories Zl and Zp respectively, under the major category Z (separator) used for certain whitespace characters.
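
The category difference between the legacy newline controls and the Unicode separators can be observed directly. This is a small sketch using Python's standard library; the sample text and variable names are illustrative only.

```python
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# CR and LF are control codes; LS and PS have their own separator categories.
categories = {
    "CR": unicodedata.category("\r"),
    "LF": unicodedata.category("\n"),
    "LS": unicodedata.category(LS),
    "PS": unicodedata.category(PS),
}

# Python's str.splitlines() treats LS and PS as line boundaries,
# just as it does CR, LF and CR LF.
text = f"first line{LS}second line{PS}new paragraph\r\nlast line"
lines = text.splitlines()
```

Note that str.splitlines() does not distinguish a line break from a paragraph break; software that needs that distinction would have to split on U+2029 separately.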


Language tags

Unicode previously included 128 characters, now deprecated, for language tags. These characters essentially mirrored the 128 ASCII characters but were used to identify the subsequent text as belonging to a particular language according to BCP 47. For example, to indicate subsequent text as the variant of English as written in the United States, the character U+E0001 LANGUAGE TAG would have been used, followed by the tag characters that mirror the ASCII characters of the corresponding BCP 47 tag (such as "en-US"). These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, language tags might have influenced the display of the decimal digits 0 through 9 differently depending on the language they appeared in.

The tag characters U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG were deprecated in Unicode 5.1 (2008) and should not be used for language information. The remaining tag characters (U+E0020–U+E007E) were also deprecated, but were restored with the release of Unicode 8.0 (2015). The change was made "to clear the way for the potential future use of tag characters for a purpose other than to represent language tags". Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text."
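
The mirroring of ASCII by the tag characters can be sketched as a fixed offset of 0xE0000 from the ASCII value. The Python below is for illustration only: the function names are hypothetical, the "en-US" tag is an assumed example, and the mechanism it demonstrates is deprecated for conveying language information.

```python
LANGUAGE_TAG = "\U000E0001"  # deprecated introducer for a language tag
TAG_OFFSET = 0xE0000         # tag character = ASCII value + this offset


def encode_language_tag(tag: str) -> str:
    """Spell a BCP 47 tag with tag characters (deprecated mechanism)."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in tag):
        raise ValueError("tag must consist of printable ASCII characters")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in tag)


def decode_tag_characters(text: str) -> str:
    """Recover the ASCII characters mirrored by any tag characters in text."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


encoded = encode_language_tag("en-US")
code_points = [f"U+{ord(ch):05X}" for ch in encoded]
decoded = decode_tag_characters(encoded)
```

Here encoded is an invisible six-character sequence: the introducer followed by one tag character per ASCII character of the tag.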


Interlinear annotation

Three formatting characters provide support for interlinear annotation (U+FFF9 INTERLINEAR ANNOTATION ANCHOR, U+FFFA INTERLINEAR ANNOTATION SEPARATOR, U+FFFB INTERLINEAR ANNOTATION TERMINATOR). These may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for it. The W3C Ruby markup recommendation is an example of an alternative protocol supporting more advanced interlinear annotation.


Bidirectional text control

Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسم الله”) (translated into English as "Bismillah") right alongside English and the Arabic letters will flow from right to left and the Latin letters from left to right.

However, directionality may not be detected correctly if left-to-right text is quoted at the beginning of a right-to-left paragraph (or ''vice versa''), and the support for bidirectional text becomes even more complicated when text flowing in opposite directions is embedded hierarchically, for example if an English text quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right to left. While these situations are fairly rare, Unicode provides twelve characters to help control these embedded bidirectional text levels up to 125 levels deep:
* U+061C ARABIC LETTER MARK
* U+200E LEFT-TO-RIGHT MARK
* U+200F RIGHT-TO-LEFT MARK
* U+202A LEFT-TO-RIGHT EMBEDDING
* U+202B RIGHT-TO-LEFT EMBEDDING
* U+202C POP DIRECTIONAL FORMATTING
* U+202D LEFT-TO-RIGHT OVERRIDE
* U+202E RIGHT-TO-LEFT OVERRIDE
* U+2066 LEFT-TO-RIGHT ISOLATE
* U+2067 RIGHT-TO-LEFT ISOLATE
* U+2068 FIRST STRONG ISOLATE
* U+2069 POP DIRECTIONAL ISOLATE
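
As a rough illustration of how software might apply these characters, the Python sketch below wraps a quoted phrase in an isolate pair and strips the bidirectional controls again. The helper names isolate and strip_bidi_controls are hypothetical, and the code point set is taken from the Unicode code charts; no actual bidirectional reordering is performed here.

```python
# The twelve bidirectional control characters, by code point.
BIDI_CONTROLS = {
    0x061C,                   # ARABIC LETTER MARK
    0x200E, 0x200F,           # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    *range(0x202A, 0x202F),   # embeddings, overrides and their terminator
    *range(0x2066, 0x206A),   # isolates and their terminator
}

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str) -> str:
    """Wrap text so its direction is resolved independently of its context."""
    return FSI + text + PDI


def strip_bidi_controls(text: str) -> str:
    """Remove all bidirectional control characters from text."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)


quoted = "He said " + isolate("بسم الله") + " aloud."
plain = strip_bidi_controls(quoted)
```

Isolating an inserted phrase of unknown direction is one way to address the quoting problem described above; stripping the controls is sometimes done before comparing or validating text, since the characters are invisible.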


Variation selectors

Many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.

However, for other glyph substitutions, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with the characters or glyphs referred to as gaiji, where different glyphs are used for the same character either historically or for ideographs for family names. This is one of the gray areas in distinguishing between a glyph and a character: if a family name differs slightly from the ideograph character it derives from, is that a simple glyph variant or a character variant? Since Unicode 3.2 and 4.0, the character set has included 256 variation selectors, combining mark characters that select from up to 256 possible glyph variations for the preceding character.
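
The 256 variation selectors can be identified by code point. The Python sketch below assumes the two ranges given in the Unicode code charts (U+FE00–U+FE0F and U+E0100–U+E01EF), which are not stated in the text above; the function name variation_selector_number is hypothetical.

```python
from typing import Optional

# (first code point, last code point, number of the first selector in range)
VS_RANGES = [
    (0xFE00, 0xFE0F, 1),      # VS1 to VS16
    (0xE0100, 0xE01EF, 17),   # VS17 to VS256
]


def variation_selector_number(ch: str) -> Optional[int]:
    """Return n if ch is variation selector VSn, otherwise None."""
    cp = ord(ch)
    for first, last, base in VS_RANGES:
        if first <= cp <= last:
            return cp - first + base
    return None


total = sum(last - first + 1 for first, last, _ in VS_RANGES)

# A base character followed by a selector forms a variation sequence;
# whether a distinct glyph is shown depends on the font and renderer.
sequence = "\u8FBB" + "\U000E0100"
selector = variation_selector_number(sequence[1])
```

Because a selector always applies to the character immediately before it, a program that splits text into characters generally needs to keep the pair together.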


Control pictures

Unicode provides graphic characters for representing C0 control codes (and space and a generic newline) in the Control Pictures block. They are visual representations, not the actual control codes themselves. There are no equivalent characters for the C1 control codes.
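
One use of these characters is making control codes visible when debugging text. The following Python sketch assumes the layout of the Control Pictures block in the Unicode code charts, where the picture for C0 code n sits at U+2400 + n and the picture for DELETE at U+2421; the function name to_control_pictures is hypothetical.

```python
CONTROL_PICTURES_BASE = 0x2400  # start of the Control Pictures block
SYMBOL_FOR_DELETE = "\u2421"


def to_control_pictures(text: str) -> str:
    """Replace C0 controls and DELETE with their visible picture characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(CONTROL_PICTURES_BASE + cp))
        elif cp == 0x7F:
            out.append(SYMBOL_FOR_DELETE)
        else:
            # Everything else, including C1 controls, is left unchanged
            # because no picture characters exist for them.
            out.append(ch)
    return "".join(out)


visible = to_control_pictures("a\tb\r\n")
```

The result contains only graphic characters, so it can be printed or logged without triggering the behaviour of the original control codes.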


See also

* Specials (Unicode block)
* ISO 2047

