In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and, according to a programming language's specification, should be treated as if they were single characters. Trigraphs were removed from C++ in C++17 and from C in C23, and they are rarely used in practice in C or in any other mainstream language (the language J is an exception). With ASCII and Unicode/UTF-8 now widely available, there is little need for trigraphs in language design, where they were considered a burden, and digraphs likely have very few users as well, at least in those languages.
Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.
The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set. [1]
Trigraphs are not commonly encountered outside compiler test suites. [2] Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).
Different systems define different sets of digraphs and trigraphs, as described below.
Early versions of ALGOL predated the standardized ASCII and EBCDIC character sets, and were typically implemented using a manufacturer-specific six-bit character code. A number of ALGOL operations either lacked codepoints in the available character set or were not supported by peripherals, leading to a number of substitutions including := for ← (assignment) and >= for ≥ (greater than or equal).
The Pascal programming language supports the digraphs (. and .) for [ and ], and (* and *) for { and }, respectively. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of commenting block rather than as actual digraphs; that is, a comment started with (* cannot be closed with } and vice versa.
The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, the . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols". [3]
Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.
Trigraph | Equivalent |
---|---|
??= | # |
??/ | \ |
??' | ^ |
??( | [ |
??) | ] |
??! | | |
??< | { |
??> | } |
??- | ~ |
The C preprocessor (used for C and with slight differences in C++; see below) replaces all occurrences of the nine trigraph sequences in this table with their single-character equivalents before any other processing; C23 removes this step. [4] [5] [6]
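The replacement step can be modelled as an ordinary function on strings. The following C++ sketch is illustrative only and is not how any particular compiler implements it; the names trigraph_replacement, replace_trigraphs and trigraph_demo are hypothetical. The lookup is keyed on the third character so that the example source contains no trigraph sequences itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map the third character of a trigraph to its replacement, or return 0 if
// the character does not complete one of the nine trigraphs.
char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Replace every trigraph in `source`, scanning left to right.
std::string replace_trigraphs(const std::string& source) {
    std::string out;
    out.reserve(source.size());
    std::size_t i = 0;
    while (i < source.size()) {
        if (i + 2 < source.size() && source[i] == '?' && source[i + 1] == '?') {
            const char replacement = trigraph_replacement(source[i + 2]);
            if (replacement != 0) {
                out.push_back(replacement);
                i += 3;
                continue;
            }
        }
        out.push_back(source[i]);
        ++i;
    }
    return out;
}

// Usage: the question-mark pair is built at run time to keep it out of the source.
void trigraph_demo() {
    const std::string pair(2, '?');
    assert(replace_trigraphs(pair + "=define X") == "#define X");
    assert(replace_trigraphs(pair) == pair);  // a bare pair is left unchanged
}
```

Because the scan is strictly left to right, a pair of question marks that is followed by a third question mark is not consumed, which leaves the later characters free to form a trigraph of their own.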
A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. This is particularly a problem for the classic Mac OS, where the constant '????' may be used as a file type or creator. [7] To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".
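As a small illustration of the two workarounds, the following C++ sketch builds the same text both ways; the function names are hypothetical. Neither literal contains two adjacent question marks in the source, so the result does not depend on whether the compiler processes trigraphs.

```cpp
#include <cassert>
#include <string>

// Adjacent string literals are concatenated, so the two question marks never
// sit next to each other in the source text.
std::string via_concatenation() {
    return "what?" "?!";
}

// The escape sequence \? denotes a plain question mark and also breaks the pair.
std::string via_escape() {
    return "what?\?!";
}

// Usage: both spellings produce the same seven characters.
void question_mark_demo() {
    assert(via_concatenation() == via_escape());
    assert(via_concatenation().size() == 7);
}
```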
??? is not itself a trigraph sequence, but when followed by a character such as - it will be interpreted as ? + ??-, as in the example below which has 16 ?s before the /.
The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:
    // Will the next line be executed????????????????/
    a++;
which is a single logical comment line (used in C++ and C99), and
    /??/
    * A comment *??/
    /
which is a correctly formed block comment. The concept can be used to check for trigraphs as in the following C99 example, where only one return statement will be executed.
    int trigraphsavailable() // returns 0 or 1; language standard C99 or later
    {
        // are trigraphs available??/
        return 0;
        return 1;
    }
Digraph | Equivalent |
---|---|
<: | [ |
:> | ] |
<% | { |
%> | } |
%: | # |
In 1994, a normative amendment to the C standard, C95, [8] [9] included in C99, supplied digraphs as more readable alternatives to five of the trigraphs.
Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.
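The following C++ sketch uses all five digraphs from the table as tokens and then shows the same characters left untouched inside a string literal. The macro and function names are hypothetical, and the example assumes a compiler with digraph support enabled, which is the default in current GCC and Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// %: stands in for # at the start of a directive.
%:define ELEMENT_COUNT 3
// %:%: stands in for the token-pasting operator ##.
%:define PASTE(a, b) a %:%: b

int last_element() <%
    int values<:ELEMENT_COUNT:> = <% 10, 20, 30 %>;
    int PASTE(po, s) = ELEMENT_COUNT - 1;   // declares a variable named pos
    return values<:pos:>;
%>

// Inside a string literal the same characters are ordinary text.
std::size_t literal_length() <%
    return std::strlen("<::>");
%>

void digraph_demo() <%
    assert(last_element() == 30);
    assert(literal_length() == 4);
%>
```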
Token | Equivalent |
---|---|
compl | ~ |
not | ! |
bitand | & |
bitor | | |
and | && |
or | || |
xor | ^ |
and_eq | &= |
or_eq | |= |
xor_eq | ^= |
not_eq | != |
C++ (through C++14, see below) behaves like C, including the C99 additions, but with additional tokens listed in the table. [10]
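A brief C++ sketch of the alternative tokens in use follows; the function names are hypothetical. Each token behaves exactly like the operator it replaces, as noted in the comments.

```cpp
#include <cassert>

// Logical tokens: and, or, not.
bool in_range(int x, int lo, int hi) {
    return x >= lo and x <= hi;          // same as x >= lo && x <= hi
}

bool outside_range(int x, int lo, int hi) {
    return not in_range(x, lo, hi);      // same as !in_range(x, lo, hi)
}

// Bitwise tokens: compl, bitand, xor_eq.
unsigned low_byte_inverted(unsigned v) {
    return compl v bitand 0xFFu;         // same as (~v) & 0xFFu
}

unsigned toggle(unsigned flags, unsigned mask) {
    flags xor_eq mask;                   // same as flags ^= mask
    return flags;
}

void alternative_token_demo() {
    assert(in_range(5, 1, 10));
    assert(outside_range(11, 1, 10));
    assert(low_byte_inverted(0x0Fu) == 0xF0u);
    assert(toggle(0x5u, 0x1u) == 0x4u);
}
```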
As a note, %:%: is treated as a single token, rather than two occurrences of %:.
In the sequence <:: if the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:. This is done so certain uses of templates are not broken by the substitution.
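The kind of template use this rule protects can be sketched as follows, assuming a C++11 or later compiler; the function name make_names is hypothetical. Without the rule, the template argument list would begin with the digraph for [ and fail to parse.

```cpp
#include <cassert>
#include <string>
#include <vector>

// "<::" followed by 's' is read as "<" and "::", not as "<:" and ":".
std::vector<::std::string> make_names() {
    std::vector<::std::string> names;
    names.push_back("digraph");
    names.push_back("trigraph");
    return names;
}

void template_demo() {
    assert(make_names().size() == 2);
}
```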
The C++ Standard makes this comment with regards to the term "digraph": [11]
The term "digraph" (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as "digraphs".
Trigraphs were proposed for deprecation in C++0x, which was released as C++11. [12] This was opposed by IBM, speaking on behalf of itself and other users of C++, [13] and as a result trigraphs were retained in C++11. Trigraphs were then proposed again for removal (not only deprecation) in C++17. [14] This passed a committee vote, and trigraphs (but not the additional tokens) are removed from C++17 despite the opposition from IBM. [15] Existing code that uses trigraphs can be supported by translating from the source files (parsing trigraphs) to the basic source character set that does not include trigraphs. [14]
Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe non-seven-bit ASCII characters of the calculators' extended character set [16] [17] [18] on foreign platforms, and to ease keyboard input without using the CHARS application. [19] [20] [17] [18] The first character of all TIO codes is a \, followed by two other ASCII characters vaguely resembling the glyph to be substituted. [19] [20] [17] [18] [21] All other characters can be entered using the special \nnn TIO code syntax with nnn being a three-digit decimal number (with leading zeros if necessary) of the corresponding code point (thereby formally representing a tetragraph). [19] [17] [18]
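The numeric form can be decoded mechanically. The following C++ sketch handles only the \nnn syntax described above and copies everything else through unchanged, including the two-character mnemonic codes, whose table is not reproduced here. The function names are hypothetical, and the limit of 255 is an assumption that each code point fits in one byte.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Expand \nnn codes (backslash plus exactly three decimal digits) into the
// byte with that decimal value. All other text is copied through unchanged.
std::string expand_numeric_codes(const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const bool numeric =
            text[i] == '\\' && i + 3 < text.size() &&
            std::isdigit(static_cast<unsigned char>(text[i + 1])) &&
            std::isdigit(static_cast<unsigned char>(text[i + 2])) &&
            std::isdigit(static_cast<unsigned char>(text[i + 3]));
        if (numeric) {
            const int value = (text[i + 1] - '0') * 100 +
                              (text[i + 2] - '0') * 10 +
                              (text[i + 3] - '0');
            if (value <= 255) {   // assumption: code points fit in one byte
                out.push_back(static_cast<char>(value));
                i += 4;
                continue;
            }
        }
        out.push_back(text[i]);
        ++i;
    }
    return out;
}

void numeric_code_demo() {
    assert(expand_numeric_codes("x\\065y") == "xAy");   // 065 is the letter A
    assert(expand_numeric_codes("\\12") == "\\12");     // too few digits
}
```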
The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345. The entry of digraphs is bound to Ctrl+K by default. [22] The list of all possible digraphs in Vim can be displayed by typing :dig.
GNU Screen has a digraph command, bound to Ctrl+A Ctrl+V by default. [23]
Lotus 1-2-3 for DOS uses Alt+F1 as compose key to allow easier input of many special characters of the Lotus International Character Set (LICS) [24] and Lotus Multi-Byte Character Set (LMBCS).