Escape sequences in C

In the C programming language, an escape sequence is specially delimited text in a character or string literal that represents one or more other characters to the compiler. It allows a programmer to specify characters that are otherwise difficult or impossible to specify in a literal.

An escape sequence starts with a backslash (\), called the escape character, and the subsequent characters define the meaning of the escape sequence. For example, \n denotes a newline character.

The same or similar escape sequences are used in other, related languages such as C++, C#, Java and PHP.

Value

To demonstrate the value of escape sequences, consider outputting the text Foo on one line and Bar on the next line: the code must output a newline between the two words.

The following code achieves the goal via text formatting and a hard-coded ASCII character value for newline (0x0A). This behaves as desired with the words on sequential lines, but an escape sequence has advantages.

    #include <stdio.h>

    int main() {
        printf("Foo%cBar", 0x0A);
        return 0;
    }

The \n escape sequence allows for shorter code by specifying the newline in the string literal, and for faster runtime by eliminating the text formatting operation. Also, the compiler can map the escape sequence to a character encoding system other than ASCII and thus make the code more portable.

    #include <stdio.h>

    int main() {
        printf("Foo\nBar");
        return 0;
    }

How it works

An escape sequence changes how the compiler interprets character data in a literal. For example, \n does not represent a backslash followed by the letter n. The backslash escapes the compiler's normal, literal way of interpreting character data. After a backslash, the compiler expects subsequent characters to complete one of the defined escape sequences, and then translates the escape sequence into the characters it represents.

This syntax requires special handling to encode a literal backslash, because a single backslash acts as a metacharacter that changes how the following characters are interpreted rather than standing for itself. The issue is solved by using two backslashes (\\) to mean one.

Escape sequences

The following table includes escape sequences defined in standard C as well as some non-standard sequences. The C standard requires an escape sequence that does not match a defined sequence to be diagnosed; that is, the compiler must issue a diagnostic message. Regardless, some compilers define additional escape sequences.

The table shows the ASCII value each sequence maps to; a sequence may map to a different value under another encoding.

| Escape sequence | Hex value in ASCII | Character represented |
|---|---|---|
| \a | 07 | Alert (Beep, Bell) (added in C89) [1] |
| \b | 08 | Backspace |
| \e (see Escape below) | 1B | Escape character |
| \f | 0C | Form feed (page break) |
| \n | 0A | Newline (Line Feed); see Newline below |
| \r | 0D | Carriage Return |
| \t | 09 | Horizontal Tab |
| \v | 0B | Vertical Tab |
| \\ | 5C | Backslash |
| \' | 27 | Apostrophe or single quotation mark |
| \" | 22 | Double quotation mark |
| \? | 3F | Question mark (used to avoid trigraphs) |
| \nnn (see Octal below) | nnn (octal) | The byte whose numerical value is given by nnn interpreted as an octal number |
| \xhh… | hh… | The byte whose numerical value is given by hh… interpreted as a hexadecimal number |
| \uhhhh (see Universal character names below) | non-ASCII | Unicode code point below 10000 hexadecimal (added in C99) [1]:26 |
| \Uhhhhhhhh (see Universal character names below) | non-ASCII | Unicode code point where h is a hexadecimal digit |

Escape

The non-standard sequence \e represents the escape character in GCC, [2] Clang and tcc. It was not added to the C standard because it has no meaningful equivalent in some character sets (such as EBCDIC). [1]

Newline

The sequence \n maps to one byte, even though the platform may use more than one byte to denote a newline, such as the DOS/Windows CRLF sequence, 0x0D 0x0A. On DOS and Windows, the translation from 0x0A to 0x0D 0x0A occurs when the byte is written to a file or to the console in text mode, and the inverse translation is done when text files are read.

Hex

A hex escape sequence must have at least one hex digit following \x, with no upper bound; it continues for as many hex digits as there are. Thus, for example, \xABCDEFG denotes the byte with the numerical value ABCDEF (hexadecimal), followed by the letter G, which is not a hex digit. However, if the resulting integer value is too large to fit in a single byte, the actual numerical value assigned is implementation-defined. Most platforms have 8-bit char types, which limits a useful hex escape sequence to two hex digits. However, hex escape sequences longer than two hex digits might be useful inside a wide character or wide string literal (prefixed with L):

    #include <stddef.h>  // declares wchar_t

    // single char with value 0x12 (18 decimal)
    char s1[] = "\x12";
    // single char with implementation-defined value, unless char is long enough
    char s2[] = "\x1234";
    // single wchar_t with value 0x1234, provided wchar_t is long enough (16 bits suffices)
    wchar_t s3[] = L"\x1234";

Octal

An octal escape sequence consists of a backslash followed by one to three octal digits. The octal escape sequence ends when it either contains three octal digits, or the next character is not an octal digit. For example, \11 is an octal escape sequence denoting a byte with decimal value 9 (11 in octal). However, \1111 is the octal escape sequence \111 followed by the digit 1. In order to denote the byte with numerical value 1, followed by the digit 1, one could use "\1""1", since C concatenates adjacent string literals.

Some three-digit octal escape sequences are too large to fit in a single byte. This results in an implementation-defined value for the resulting byte.

The escape sequence \0 is a commonly used octal escape sequence, which denotes the null character, with value zero in ASCII and most encoding systems.

Universal character names

Since the C99 standard, C supports escape sequences that denote Unicode code points, called universal character names. They have the form \uhhhh or \Uhhhhhhhh, where h stands for a hex digit. Unlike other escape sequences, a universal character name may expand into more than one code unit.

The sequence \uhhhh denotes the code point hhhh, interpreted as a hexadecimal number. The sequence \Uhhhhhhhh denotes the code point hhhhhhhh, interpreted as a hexadecimal number. Code points located at U+10000 or higher must be denoted with the \U syntax, whereas lower code points may use \u or \U. The code point is converted into a sequence of code units in the encoding of the destination type on the target system. For example, where the encoding is UTF-8 for char and UTF-16 for wchar_t:

    // A single byte with the value 0xC0; not valid UTF-8
    char s1[] = "\xC0";
    // Two bytes with values 0xC3, 0x80; the UTF-8 encoding of U+00C0
    char s2[] = "\u00C0";
    // A single wchar_t with the value 0x00C0
    wchar_t s3[] = L"\xC0";
    // A single wchar_t with the value 0x00C0
    wchar_t s4[] = L"\u00C0";

A value greater than \U0000FFFF may be represented by a single wchar_t if the UTF-32 encoding is used, or two if UTF-16 is used.

Importantly, the universal character name \u00C0 always denotes the character "À", regardless of what kind of string literal it is used in, or the encoding in use. The octal and hex escape sequences always denote certain sequences of numerical values, regardless of encoding. Therefore, universal character names are complementary to octal and hex escape sequences; while octal and hex escape sequences represent code units, universal character names represent code points, which may be thought of as "logical" characters.

Alternatives

Some languages provide different mechanisms for the behavior that escape sequences provide. For example, the following Pascal code writes the two words on sequential lines:

    writeln('Foo');
    write('Bar');

writeln outputs a newline after the parameter text, while write does not.

References

  1. "Rationale for International Standard - Programming Languages - C" (PDF). 5.10. April 2003. Archived (PDF) from the original on 2016-06-06. Retrieved 2010-10-17.
  2. "6.35 The Character <ESC> in Constants". GCC 4.8.2 Manual. Archived from the original on 2019-05-12. Retrieved 2014-03-08.
