History
The basic character set of the C programming language is a subset of the ...
Implementations
Trigraphs are not commonly encountered outside ... (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).
Language support
Different systems define different sets of digraphs and trigraphs, as described below.
ALGOL
Early versions of ALGOL ... := for ← (assignment) and >= for ≥ (greater than or equal).
Pascal
The Pascal programming language supports the digraphs (., .), (* and *) for [, ], { and } respectively. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of commenting block rather than as actual digraphs; that is, a comment started with (* cannot be closed with } and vice versa.
J
The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, the . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols".
Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.
C
The C preprocessor (used for C and, with slight differences, for C++; see below) replaces all occurrences of the nine trigraph sequences ??=, ??/, ??', ??(, ??), ??!, ??<, ??> and ??- by their single-character equivalents #, \, ^, [, ], |, {, } and ~ before any other processing. Trigraphs were removed in C23.
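The replacement step can be sketched in C. This is only a minimal illustration under simple assumptions (a NUL-terminated input and a caller-supplied output buffer), not how any particular preprocessor is implemented; the names trigraph_replacement and replace_trigraphs are hypothetical. The two-question-mark prefix is written with an escape so that the sketch itself contains no trigraph sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the third character of a candidate trigraph to its replacement,
   or return 0 if that character does not complete a trigraph. */
static char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing every trigraph by its single character.
   dst must hold at least strlen(src) + 1 bytes.
   Returns the number of replacements made. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t count = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char r = 0;

        /* "?\?" spells two question marks without forming a trigraph. */
        if (strncmp(src, "?\?", 2) == 0)
            r = trigraph_replacement(src[2]);

        if (r != 0) {
            *dst++ = r;        /* emit the replacement character */
            src += 3;          /* and consume the whole trigraph */
            ++count;
        } else {
            *dst++ = *src++;   /* ordinary character: copy unchanged */
        }
    }
    *dst = '\0';
    return count;
}
```

Because the scan advances one character at a time when no trigraph matches, a run of three question marks followed by - comes out as a single ? followed by ~.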
A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. This is particularly a problem for the classic Mac OS, where the constant '????' may be used as a file type or creator. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".
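Both workarounds can be checked with a short C sketch. The helper names unknown_type_code and check_question_mark_forms are hypothetical and serve only as illustration; the point is that neither spelling places two ? characters next to each other in the source text, so the result is the same whether or not the compiler processes trigraphs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: a four-question-mark code spelled so that no two
   question marks are adjacent in the source. */
const char *unknown_type_code(void)
{
    return "?\?" "?\?";
}

/* Check that string concatenation and the \? escape give the same text. */
void check_question_mark_forms(void)
{
    const char *by_concat = "?" "?" "-";  /* adjacent literals are joined */
    const char *by_escape = "?\?-";       /* \? is an ordinary question mark */

    assert(strlen(by_concat) == 3);
    assert(strcmp(by_concat, by_escape) == 0);
}
```

String concatenation works because adjacent literals are joined only after trigraph replacement has already taken place.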
??? is not itself a trigraph sequence, but when followed by a character such as - it will be interpreted as ? + ??-, as in the example below, which has 16 ?s before the /.
The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example, in
// Will the next line be executed????????????????/
a++;
the ??/ at the end of the first line becomes a backslash that splices the two lines, so both form a single logical comment line (the // form is used in C++ and C99) and a++; is never executed. Similarly,
/??/
* A comment *??/
/
is a correctly formed block comment. The concept can be used to check for trigraphs in C99: if a // comment ending in ??/ is placed directly before the first of two consecutive return statements, only one return statement will be executed, and which one depends on whether trigraphs are processed.
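The interaction between ??/ and line splicing can be sketched in C. This is a simplified illustration, not an actual preprocessor: the hypothetical function splice_lines handles only backslash-newline pairs and the ??/ trigraph directly before a newline, and leaves every other trigraph untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src to dst, deleting every backslash-newline pair. A "??/"
   trigraph directly before a newline is treated as a backslash, as it
   would be once trigraph replacement has run.
   dst must hold at least strlen(src) + 1 bytes. */
void splice_lines(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;                  /* ordinary line splice */
        } else if (strncmp(src, "?\?/\n", 4) == 0) {
            src += 4;                  /* splice spelled with a trigraph */
        } else {
            *dst++ = *src++;           /* anything else is copied as-is */
        }
    }
    *dst = '\0';
}
```

Applied to a // comment line that ends in ??/, the function joins the comment with the line after it, which is why a statement on that following line silently becomes part of the comment.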
In 1994, a normative amendment to the C standard, included in C99, supplied digraphs as more readable alternatives to five of the trigraphs.
Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string or a character constant, it will not be replaced.
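A short C sketch can illustrate these points. It assumes a compiler that accepts the digraphs <:, :>, <%, %> and %:%: (they are part of standard C since the 1994 amendment); the function names and the PASTE macro are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* %:%: is the digraph spelling of the ## concatenation operator. */
#define PASTE(a, b) a %:%: b

/* Sum a three-element array, spelling [ ] { } with digraphs. */
int sum_three(void)
<%
    int values<:3:> = <% 1, 2, 3 %>;
    return values<:0:> + values<:1:> + values<:2:>;
%>

/* PASTE(to, tal) forms the single identifier total. */
int paste_demo(void)
<%
    int PASTE(to, tal) = 40;
    return total + 2;
%>

/* Inside a string literal the characters < and : are left alone, so
   the literal below is two characters long and differs from "[". */
int digraph_in_string_is_literal(void)
<%
    assert(strlen("<:") == 2);
    return strcmp("<:", "[") != 0;
%>
```

Here sum_three() evaluates to 6 and paste_demo() to 42, while digraph_in_string_is_literal() returns 1 because no replacement happens inside the quoted string.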
C++
C++ (through C++14, see below) behaves like C, including the C99 additions, but with additional alternative tokens such as and, bitor and not_eq.
Note that %:%: is treated as a single token, rather than two occurrences of %:.
In the sequence <::, if the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:. This is done so that certain uses of templates, such as a template argument that begins with the scope operator ::, are not broken by the substitution.
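The <:: rule can be expressed as a small lookahead check. The sketch below is written in C for consistency with the other examples and is only an illustration of the rule as stated above, not code from any real lexer; the function name begins_with_bracket_digraph is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the text at s begins with "<:" that should be read as the
   alternative token for '[', and 0 otherwise. */
int begins_with_bracket_digraph(const char *s)
{
    assert(s != NULL);
    if (s[0] != '<' || s[1] != ':')
        return 0;                      /* not "<:" at all */

    /* "<::" followed by neither ':' nor '>': the '<' stands alone,
       so the following "::" can be read as the scope operator. */
    if (s[2] == ':' && s[3] != ':' && s[3] != '>')
        return 0;

    return 1;
}
```

With this check, text beginning <:0:> starts with the digraph, whereas text beginning <::std does not, so the < can open a template argument list.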
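The C++ Standard notes that the term "digraph" is not perfectly descriptive, since one of the alternative tokens, %:%:, consists of four characters, and several primary tokens also contain two characters.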
Trigraphs were proposed for deprecation in C++0x, which was released as C++11. This was opposed by IBM, speaking on behalf of itself and other users of C++, and as a result trigraphs were retained in C++11. Trigraphs were then proposed again for removal (not only deprecation) in C++17. This passed a committee vote, and trigraphs (but not the additional tokens) were removed from C++17 despite the opposition from IBM. Existing code that uses trigraphs can be supported by translating the source files (parsing trigraphs) to the basic source character set, which does not include trigraphs.
RPL
Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe non-seven-bit ASCII characters of the calculators' extended character set on foreign platforms, and to ease keyboard input without using the application. The first character of all TIO codes is a \, followed by two other ASCII characters vaguely resembling the glyph to be substituted. All other characters can be entered using the special \nnn TIO code syntax, with nnn being a three-digit decimal number (with leading zeros if necessary) of the corresponding code point (thereby formally representing a tetragraph).
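The numeric \nnn form is simple enough to sketch in C. The function name format_numeric_tio is hypothetical, and the 0 to 255 range check is an assumption made for this illustration (an eight-bit extended character set), not something stated above.

```c
#include <assert.h>
#include <stdio.h>

/* Write the numeric code "\nnn" for code_point into buf, which must
   hold at least 5 bytes. Returns 0 on success, or -1 if code_point is
   outside the assumed 0..255 range. */
int format_numeric_tio(unsigned code_point, char buf[5])
{
    assert(buf != NULL);
    if (code_point > 255)
        return -1;

    /* %03u pads the decimal value with leading zeros to three digits. */
    snprintf(buf, 5, "\\%03u", code_point);
    return 0;
}
```

For code point 7, for instance, the buffer receives a backslash followed by 007.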
Application support
Vim
The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345. The entry of digraphs is bound to Ctrl+K by default. The list of all possible digraphs in Vim can be displayed by typing :digraphs.
GNU Screen
GNU Screen has a digraph command, bound to Ctrl+A Ctrl+V by default.
Lotus
Lotus 1-2-3 for DOS uses Alt+F1 as a compose key to allow easier input of many special characters of the Lotus International Character Set (LICS) and Lotus Multi-Byte Character Set (LMBCS).
See also
* Compose key
* List of XML and HTML character entity references
* Escape sequence
* Escape sequences in C
* C alternative tokens
External links
* RFC 1345