Null character

Last updated

The null character (also null terminator) is a control character with the value zero. [1] [2] [3] [4] It is present in many character sets, including those defined by the Baudot and ITA2 codes, ISO/IEC 646 (or ASCII), the C0 control code, the Universal Coded Character Set (or Unicode), and EBCDIC. It is available in nearly all mainstream programming languages. [5] It is often abbreviated as NUL (or NULL, though in some contexts that term is used for the null pointer). In 8-bit codes, it is known as a null byte.

Contents

The original meaning of this character was like NOP when sent to a printer or a terminal, it has no effect (some terminals, however, incorrectly display it as space). When electromechanical teleprinters were used as computer output devices, one or more null characters were sent at the end of each printed line to allow time for the mechanism to return to the first printing position on the next line.[ citation needed ] On punched tape, the character is represented with no holes at all, so a new unpunched tape is initially filled with null characters, and often text could be inserted at a reserved space of null characters by punching the new characters into the tape over the nulls.

Today the character has much more significance in the programming language C and its derivatives and in many data formats, where it serves as a reserved character used to signify the end of a string, [6] often called a null-terminated string. [7] This allows the string to be any length with only the overhead of one byte; the alternative of storing a count requires either a string length limit of 255 or an overhead of more than one byte (there are other advantages/disadvantages described in the null-terminated string article).

Representation

In source code, the null character is often represented as the escape sequence \0 in string literals (for example, "abc\0def") or in character constants ('\0'); the latter may also be written instead simply as 0 (without quotes nor slash). [8] In many languages (such as C, which introduced this notation), this is not a separate escape sequence, but an octal escape sequence with a single octal digit 0; as a consequence, \0 must not be followed by any of the digits 0 through 7; otherwise it is interpreted as the start of a longer octal escape sequence. [9] Other escape sequences that are found in use in various languages are \000, \x00, \z, or \u0000. A null character can be placed in a URL with the percent code %00.

The ability to represent a null character does not always mean the resulting string will be correctly interpreted, as many programs will consider the null to be the end of the string. Thus the ability to type it (in case of unchecked user input) creates a vulnerability known as null byte injection and can lead to security exploits. [10]

In caret notation the null character is ^@. On some keyboards, one can enter a null character by holding down Ctrl and pressing @ (on US layouts just Ctrl+2 will often work, there being no need for ⇧ Shift to get the @ sign).

The Hexadecimal notation for null is 00. Decoding the Base64 string AA== also yields the null character.

In documentation, the null character is sometimes represented as a single-em-width symbol containing the letters "NUL". In Unicode, there is a character for this: U+2400.

Encoding

In all modern character sets, the null character has a code point value of zero. In most encodings, this is translated to a single code unit with a zero value. For instance, in UTF-8 it is a single zero byte. However, in Modified UTF-8 the null character is encoded as two bytes: 0xC0,0x80. This allows the byte with the value of zero, which is now not used for any character, to be used as a string terminator.

Related Research Articles

<span class="mw-page-title-main">ASCII</span> American character encoding standard

ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope. Modern computer systems have evolved to use Unicode, which has millions of code points, but the first 128 of these are the same as the ASCII set.

<span class="mw-page-title-main">Character encoding</span> Using numbers to represent text characters

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".

In computing and telecommunication, a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly graphic characters, also known as printing characters, except perhaps for "space" characters. In the ASCII standard there are 33 control characters, such as code 7, BEL, which rings a terminal bell.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

<span class="mw-page-title-main">UTF-16</span> Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as "UCS-2" (for 2-byte Universal Character Set), once it became clear that more than 216 (65,536) code points were needed, including most emoji and important CJK characters such as for personal and place names.

<span class="mw-page-title-main">ANSI escape code</span> Method used for display options on video text terminals

ANSI escape sequences are a standard for in-band signaling to control cursor location, color, font styling, and other options on video text terminals and terminal emulators. Certain sequences of bytes, most starting with an ASCII escape character and a bracket character, are embedded into text. The terminal interprets these sequences as commands, rather than text to display verbatim.

The byte-order mark (BOM) is a particular usage of the special Unicode character code, U+FEFFZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character. Alternative names are C string, which refers to the C programming language and ASCIIZ.

ISO/IEC 2022Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

The Universal Coded Character Set is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS), which is the basis of many character encodings, improving as characters from previously unrepresented typing systems are added.

In the C programming language, an escape sequence is specially delimited text in a character or string literal that represents one or more other characters to the compiler. It allows a programmer to specify characters that are otherwise difficult or impossible to specify in a literal.

The C programming language has a set of functions implementing operations on strings in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.

References

  1. ASCII format for Network Interchange. IETF. sec. 5.2. doi: 10.17487/RFC0020 . RFC 20. NUL (Null): The all-zeros character which may serve to accomplish time fill and media fill.
  2. "The set of control characters of the ISO 646" (PDF). Secretariat ISO/TC 97/SC 2. 1975-12-01. p. 4.4. Archived from the original (PDF) on 2014-05-12. Position: 0/0, Name: Null, Abbreviation: Nul
  3. "Unicode Character 'NULL' (U+0000)" . Retrieved 2018-10-20.
  4. "C0 Controls and Basic Latin" (PDF). Unicode Consortium. 2018. Retrieved 2018-10-20.
  5. "A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string literal." — ANSI/ISO 9899:1990 (the ANSI C standard), section 5.2.1
  6. "A string is a contiguous sequence of characters terminated by and including the first null character" — ANSI/ISO 9899:1990 (the ANSI C standard), section 7.1.1
  7. Working Draft, Standard for Programming Language C++ (PDF) (ISO 14882 standard working draft), ISO/IEC, 28 February 2011, p. 427, N3242=11-0012, retrieved 27 February 2013, A null-terminated byte string, or NTBS, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); no other element in the sequence has the value zero.
  8. Kernighan and Ritchie, C, p. 38: "The character constant '\0' represents the character with value zero, the null character. '\0' is often written instead of 0 to emphasize the character nature of some expression, but the numeric value is just 0."}}
  9. In YAML this combination is a separate escape sequence.
  10. Null Byte Injection WASC Threat Classification Null Byte Attack section.