In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and, according to a programming language's specification, should be treated as if they were single characters. Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use, and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.


History

The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support one or more of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any national version of the ISO 646 character set. With the widespread adoption of ASCII and Unicode/UTF-8, trigraph use is limited today, and trigraph support has been removed from C as of C23.


Implementations

Trigraphs are not commonly encountered outside compiler test suites. Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).


Language support

Different systems define different sets of digraphs and trigraphs, as described below.


ALGOL

Early versions of ALGOL predated the standardized ASCII and EBCDIC character sets, and were typically implemented using a manufacturer-specific six-bit character code. A number of ALGOL operations either lacked codepoints in the available character set or were not supported by peripherals, leading to a number of substitutions including := for ← (assignment) and >= for ≥ (greater than or equal).


Pascal

The Pascal programming language supports the digraphs (., .), (* and *) for [, ], { and }, respectively. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of comment delimiter rather than as actual digraphs; that is, a comment started with (* cannot be closed with } and vice versa.


J

The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols". Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.


C

The C preprocessor (used for C and, with slight differences, for C++; see below) replaced all occurrences of the nine trigraph sequences ??=, ??/, ??', ??(, ??), ??!, ??<, ??> and ??- by their single-character equivalents #, \, ^, [, ], |, {, } and ~, respectively, before any other processing. This behavior was removed in C23.

A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are multi-character constants, string literals, and comments. This is particularly a problem for the classic Mac OS, where the constant '????' may be used as a file type or creator. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".

??? is not itself a trigraph sequence, but when followed by a character such as - it is interpreted as ? followed by ??-. The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example, a // comment (available in C++ and C99) whose line ends in sixteen question marks followed by / ends in the trigraph ??/, so the next physical line is spliced onto it and the two form a single logical comment line. Similarly, /??/ followed by a newline and * opens a correctly formed block comment, and *??/ followed by a newline and / closes it. The same effect can be used to check for trigraph support: in a C99 function whose body is a // comment ending in ??/ followed by two return statements on separate lines, only one return statement is ever executed. With trigraphs enabled the first is absorbed into the comment and the second runs; otherwise the first runs and the second is unreachable.

In 1994, a normative amendment to the C standard (C95), later included in C99, supplied digraphs as more readable alternatives to five of the trigraphs: <: for [, :> for ], <% for {, %> for } and %: for #. Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string or a character constant, it will not be replaced.

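The trigraph replacement step described in this section can be sketched as a small string-to-string translation. The following is only an illustration under stated assumptions, not how any particular compiler implements it: the function names replace_trigraphs and trigraph_replacement are hypothetical, the input is assumed to be a NUL-terminated string, and line splicing after ??/ is left to a later stage. The nine mappings are ??= to #, ??/ to \, ??' to ^, ??( to [, ??) to ], ??! to |, ??< to {, ??> to } and ??- to ~.

```c
#include <assert.h>
#include <string.h>

/* Return the single character that a trigraph ending in `third` stands for,
   or '\0' if `third` does not complete one of the nine standard trigraphs. */
static char trigraph_replacement(char third)
{
    static const char thirds[]  = "=/'()!<>-";
    static const char singles[] = "#\\^[]|{}~";
    const char *hit;

    if (third == '\0')
        return '\0';
    hit = strchr(thirds, third);
    return hit != NULL ? singles[hit - thirds] : '\0';
}

/* Hypothetical helper: copy src into dst, replacing every trigraph by its
   single-character equivalent. dst must have room for strlen(src) + 1
   bytes, since the output is never longer than the input. Returns dst. */
char *replace_trigraphs(char *dst, const char *src)
{
    char *out = dst;

    assert(dst != NULL && src != NULL);
    while (*src != '\0') {
        char r = '\0';

        /* Only a pair of question marks can introduce a trigraph. */
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_replacement(src[2]);

        if (r != '\0') {
            *out++ = r;        /* emit the replacement character */
            src += 3;          /* and skip the whole trigraph */
        } else {
            *out++ = *src++;   /* ordinary character: copy unchanged */
        }
    }
    *out = '\0';
    return dst;
}
```

Because the scan moves left to right one character at a time, the four characters ???- come out as ?~, matching the reading of ??? given above. When passing test strings to such a function from C source, writing the question marks as "?\?" keeps the compiler itself from translating them.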

C++

C++ (through C++14, see below) behaves like C, including the C99 additions. As a note, %:%: is treated as a single token, rather than two occurrences of %:. In the sequence <::, if the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:. This is done so certain uses of templates are not broken by the substitution. The C++ Standard itself notes that the term "digraph" is not perfectly descriptive, since one of the alternative tokens is %:%: and several primary tokens also contain two characters.

Trigraphs were proposed for deprecation in C++0x, which was released as C++11. This was opposed by IBM, speaking on behalf of itself and other users of C++, and as a result trigraphs were retained in C++11. Trigraphs were then proposed again for removal (not only deprecation) in C++17. This passed a committee vote, and trigraphs (but not the additional tokens) were removed from C++17 despite the opposition from IBM. Existing code that uses trigraphs can be supported by translating from the source files (parsing trigraphs) to the basic source character set that does not include trigraphs.
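The special handling of the sequence <:: described in this section can be sketched as a single lookahead check. This is a minimal illustration written in C, not code from any real C++ front end; the function name starts_with_left_bracket_digraph is hypothetical, and the input is assumed to be a NUL-terminated string positioned at the character under examination.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: report whether the text at s should be read as the
   alternative token "<:" (meaning '['), following the rule that "<::" not
   followed by ':' or '>' is split as '<' and "::" instead. */
bool starts_with_left_bracket_digraph(const char *s)
{
    assert(s != NULL);

    if (strncmp(s, "<:", 2) != 0)
        return false;          /* the text does not begin with "<:" */

    if (s[2] != ':')
        return true;           /* plain "<:" is always the digraph */

    /* The text begins with "<::". The digraph reading survives only when
       the following character is ':' or '>'. */
    return s[3] == ':' || s[3] == '>';
}
```

With this check, a template argument list that opens with a globally qualified name, written as <::std::string>, is tokenized as < followed by ::, while a<:0:> is still read as a[0].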


RPL

Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe characters of the calculators' extended character set that fall outside seven-bit ASCII on foreign platforms, and to ease keyboard input without using the CHARS application. The first character of all TIO codes is a \, followed by two other ASCII characters vaguely resembling the glyph to be substituted. All other characters can be entered using the special \nnn TIO code syntax, with nnn being a three-digit decimal number (with leading zeros if necessary) of the corresponding code point (thereby formally representing a tetragraph).
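Decoding the numeric \nnn form described above can be sketched as follows. This is only an illustration based on the description in this section: the function name decode_numeric_tio is hypothetical, the code handles only the numeric form (not the two-character mnemonic codes), and it does not check whether the resulting number is a valid code point on any particular calculator.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: decode a numeric TIO code of the form \nnn, that is,
   a backslash followed by exactly three decimal digits. Returns the decoded
   number, or -1 if s does not begin with such a code. */
int decode_numeric_tio(const char *s)
{
    int value = 0;

    assert(s != NULL);
    if (strlen(s) < 4 || s[0] != '\\')
        return -1;

    for (int i = 1; i <= 3; i++) {
        if (!isdigit((unsigned char)s[i]))
            return -1;                     /* mnemonic or malformed code */
        value = value * 10 + (s[i] - '0'); /* accumulate decimal digits */
    }
    return value;
}
```

Requiring exactly three digits mirrors the leading-zero convention: code point 7 is written \007, so a shorter form such as \7 is rejected rather than guessed at.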


Application support


Vim

The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345. The entry of digraphs is bound to Ctrl+K by default. The list of all possible digraphs in Vim can be displayed by typing :dig.


GNU Screen

GNU Screen has a digraph command, bound to Ctrl+A Ctrl+V by default.


Lotus

Lotus 1-2-3 for DOS uses Alt+F1 as compose key to allow easier input of many special characters of the Lotus International Character Set (LICS) and Lotus Multi-Byte Character Set (LMBCS).


See also

* Compose key
* List of XML and HTML character entity references
* Escape sequence
* Escape sequences in C
* C alternative tokens


External links

* RFC 1345