Digraphs and trigraphs (programming)

In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and that, according to a programming language's specification, should be treated as if they were single characters. Trigraphs were removed from C++ in C++17 and from C in C23, so they are unlikely to see much use in C today, nor in any other mainstream language (the language J is an exception). With Unicode/UTF-8, and even with plain ASCII, language designers no longer need trigraphs, which were considered a burden, and have little need for digraphs, which likely have very few users in those languages.

Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.

History

The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set. [1]

Implementations

Trigraphs are not commonly encountered outside compiler test suites. [2] Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).

Language support

Different systems define different sets of digraphs and trigraphs, as described below.

ALGOL

Early versions of ALGOL predated the standardized ASCII and EBCDIC character sets, and were typically implemented using a manufacturer-specific six-bit character code. A number of ALGOL operations either lacked codepoints in the available character set or were not supported by peripherals, leading to a number of substitutions including := for ← (assignment) and >= for ≥ (greater than or equal).

Pascal

The Pascal programming language supports the digraphs (. and .) for [ and ], and (* and *) for { and }. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of comment delimiter rather than as actual digraphs; that is, a comment started with (* cannot be closed with } and vice versa.

J

The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols". [3]

Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.

C

Trigraph    Equivalent
??=         #
??/         \
??'         ^
??(         [
??)         ]
??!         |
??<         {
??>         }
??-         ~

The C preprocessor (used for C and, with slight differences, for C++; see below) replaces all occurrences of the nine trigraph sequences in this table with their single-character equivalents before any other processing (until C23 [4]). [5] [6]

A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. This is particularly a problem for the classic Mac OS, where the constant '????' may be used as a file type or creator. [7] To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".

??? is not itself a trigraph sequence, but when followed by a character such as - it will be interpreted as ? + ??-, as in the example below which has 16 ?s before the /.

The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:

    // Will the next line be executed????????????????/
    a++;

which is a single logical comment line (used in C++ and C99), and

    /??/
    * A comment *??/
    /

which is a correctly formed block comment. The concept can be used to check for trigraphs as in the following C99 example, where only one return statement will be executed.

    int trigraphsavailable() // returns 0 or 1; language standard C99 or later
    {
        // are trigraphs available??/
        return 0;
        return 1;
    }

Alternative digraphs introduced in the C standard in 1994

Digraph     Equivalent
<:          [
:>          ]
<%          {
%>          }
%:          #

In 1994, a normative amendment to the C standard, included in C99, supplied digraphs as more readable alternatives to five of the trigraphs.

Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.

C++

Token       Equivalent
compl       ~
not         !
bitand      &
bitor       |
and         &&
or          ||
xor         ^
and_eq      &=
or_eq       |=
xor_eq      ^=
not_eq      !=

C++ (through C++14, see below) behaves like C, including the C99 additions, but with additional tokens listed in the table. [8]

As a note, %:%: is treated as a single token, rather than two occurrences of %:.

If the sequence <:: is followed by a character that is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:. This is done so that certain uses of templates, such as a template argument that begins with the scope operator ::, are not broken by the substitution.

The C++ Standard makes this comment with regards to the term "digraph": [9]

The term "digraph" (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as "digraphs".

Trigraphs were proposed for deprecation in C++0x, which was released as C++11. [10] This was opposed by IBM, speaking on behalf of itself and other users of C++, [11] and as a result trigraphs were retained in C++11. Removal (not merely deprecation) was then proposed for C++17. [12] This passed a committee vote, and trigraphs (but not the additional tokens) were removed in C++17 despite the opposition from IBM. [13] Existing code that uses trigraphs can still be supported by having the implementation translate the physical source files, trigraphs included, to the basic source character set, which does not include trigraphs. [12]

RPL

Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe the characters of the calculators' extended character set that lie outside seven-bit ASCII [14] [15] [16] on foreign platforms, and to ease keyboard input without using the CHARS application. [17] [18] [15] [16] The first character of all TIO codes is a \, followed by two other ASCII characters vaguely resembling the glyph to be substituted. [17] [18] [15] [16] [19] All other characters can be entered using the special \nnn TIO code syntax, where nnn is the three-digit decimal number (with leading zeros if necessary) of the corresponding code point (thereby formally representing a tetragraph). [17] [15] [16]

Application support

Vim

The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345. The entry of digraphs is bound to Ctrl+K by default. [20] The list of all possible digraphs in Vim can be displayed by typing :dig.

GNU Screen

GNU Screen has a digraph command, bound to Ctrl+A Ctrl+V by default. [21]

Lotus

Lotus 1-2-3 for DOS uses Alt+F1 as a compose key to allow easier input of many special characters of the Lotus International Character Set (LICS) [22] and Lotus Multi-Byte Character Set (LMBCS).

References

  1. Rationale for International Standard—Programming Languages—C (PDF). Revision 5.10. pp. 20–21.
  2. Jones, Derek M. "Sentence 117". The New C Standard: An Economic and Cultural Commentary.
  3. Hui, Roger. "Vocabulary". jsoftware.com. Archived from the original on 2019-04-02. Retrieved 2015-04-16.
  4. "Removing trigraphs??!" (PDF).
  5. British Standards Institute (2003). The C Standard - Incorporating TC1 - BS ISO/IEC 9899:1999. John Wiley & Sons. ISBN 0-470-84573-2.
  6. "Rationale for International Standard - Programming Languages - C" (PDF). 5.10. April 2003. Archived (PDF) from the original on 2016-06-06. Retrieved 2010-10-17.
  7. "File Basics". whitefiles.org. Retrieved 2024-05-08.
  8. Stroustrup, Bjarne (1994-03-29). Design and Evolution of C++ (1 ed.). Addison-Wesley Publishing Company. ISBN 0-201-54330-3.
  9. Du Toit, Stefanus, ed. (2012-01-16). "Working Draft, Standard for Programming Language C++" (PDF). N3337. Archived (PDF) from the original on 2019-05-08. Retrieved 2019-05-08.
  10. "C++0X, CD 1, National Body Comments" (PDF). 2009-01-30. SC22/WG21 N2837 comment UK 11. Archived (PDF) from the original on 2017-08-01. Retrieved 2019-05-12.
  11. Wong, Michael; Tong, Hubert; Klarer, Robert; McIntosh, Ian; Mak, Raymond; Cambly, Christopher; LaBonté, Alain (2009-06-19). "Comment on Proposed Trigraph Deprecation" (PDF). N2910. Archived (PDF) from the original on 2017-08-01. Retrieved 2019-05-12.
  12. Smith, Richard (2014-05-06). "Removing trigraphs??!". N3981. Archived from the original on 2018-07-09. Retrieved 2019-05-12.
  13. Wong, Michael; Tong, Hubert; Bhakta, Rajan; Inglis, Derek (2014-10-10). "IBM comment on preparing for a Trigraph-adverse future in C++17" (PDF). IBM paper N4210. Archived (PDF) from the original on 2018-09-11. Retrieved 2019-05-12.
  14. HP 82240B Infrared Printer (1 ed.). Corvallis, OR, USA: Hewlett-Packard. August 1989. HP reorder number 82240-90014. Archived from the original on 2016-08-14. Retrieved 2016-08-01.
  15. HP 48G Series – User's Guide (UG) (8 ed.). Hewlett-Packard. December 1994 [1993]. pp. 2–5, 27–16. HP 00048-90126, (00048-90104). Archived from the original on 2016-08-06. Retrieved 2015-09-06.
  16. HP 50g / 49g+ / 48gII graphing calculator advanced user's reference manual (AUR) (2 ed.). Hewlett-Packard. 2009-07-14 [2005]. pp. J-1, J-2. HP F2228-90010. Archived from the original on 2018-07-08. Retrieved 2015-10-10.
  17. "HP RPL TIO Table". holyjoe.org. Archived from the original on 2016-05-23. Retrieved 2015-01-23.
  18. Heinz, Sr., Michael W. (2005). "HP-ASCII and Trigraphs". Archived from the original on 2016-08-02. Retrieved 2016-08-02.
  19. Finseth, Craig A. (2012-02-25). "chars". Archived from the original on 2017-12-21. Retrieved 2017-12-21.
  20. "Vim documentation: *digraphs-default*". 2011-01-15. Archived from the original on 2018-12-20. Retrieved 2019-05-12.
  21. "Digraph - Screen User's Manual". Archived from the original on 2018-12-31. Retrieved 2019-05-12.
  22. "Appendix F". HP 95LX User's Guide (PDF) (2 ed.). Corvallis, OR, USA: Hewlett-Packard Company, Corvallis Division. June 1991 [March 1991]. F0001-90003. Archived (PDF) from the original on 2016-11-28. Retrieved 2016-11-27.