Null-terminated string

Last updated

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with an internal value of zero, called "NUL" in this article, not same as the glyph zero). Alternative names are C string , which refers to the C programming language and ASCIIZ [1] (although C can use encodings other than ASCII).

Contents

The length of a string is found by searching for the (first) NUL. This can be slow as it takes O(n) (linear time) with respect to the string length. It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not in the string).

History

Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "length-prefixed"), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time), but limited string length to 255 characters. C designer Dennis Ritchie chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator. [2] [3]

This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992 and the vector string instructions to the IBM z13 in 2015. [4]

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue , referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever. [5]

Limitations

While simple to implement, this representation has been prone to errors and performance problems.

Null-termination has historically created security problems. [6] A NUL inserted into the middle of a string will truncate it unexpectedly. [7] A common bug was to not allocate the additional space for the NUL, so it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because the block of memory already contained zeros. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long.

The inability to store a zero requires that text and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.

The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(n) anyway, such as in strlcpy . However, this does not always result in an intuitive API.

Character encodings

Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible ASCII or UTF-8 string. [8] [9] [10] However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "modified UTF-8" which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allow all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an overlong encoding, and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.

UTF-16 uses 2-byte integers and as either byte may be zero (and in fact every other byte is, when representing ASCII text), cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 characters, terminated by a 16-bit NUL (0x0000).

Improvements

Many attempts to make C string handling less error prone have been made. One strategy is to add safer functions such as strdup and strlcpy , whilst deprecating the use of unsafe functions such as gets . Another is to add an object-oriented wrapper around C strings so that only safe calls can be done. However, it is possible to call the unsafe functions anyway.

Most modern libraries replace C strings with a structure containing a 32-bit or larger length value (far more than were ever considered for length-prefixed strings), and often add another pointer, a reference count, and even a NUL to speed up conversion back to a C string. Memory is far larger now, such that if the addition of 3 (or 16, or more) bytes to each string is a real problem the software will have to be dealing with so many small strings that some other storage method will save even more memory (for instance there may be so many duplicates that a hash table will use less memory). Examples include the C++ Standard Template Library std::string , the Qt QString, the MFC CString, and the C-based implementation CFString from Core Foundation as well as its Objective-C sibling NSString from Foundation, both by Apple. More complex structures may also be used to store strings such as the rope.

See also

Related Research Articles

<span class="mw-page-title-main">Buffer overflow</span> Anomaly in computer security and programming

In programming and information security, a buffer overflow or buffer overrun is an anomaly whereby a program writes data to a buffer beyond the buffer's allocated memory, overwriting adjacent memory locations.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

<span class="mw-page-title-main">UTF-16</span> Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 216 (65,536) code points were needed.

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

<span class="mw-page-title-main">Shellcode</span> Small piece of code used as a payload to exploit a software vulnerability

In hacking, a shellcode is a small piece of code used as the payload in the exploitation of a software vulnerability. It is called "shellcode" because it typically starts a command shell from which the attacker can control the compromised machine, but any piece of code that performs a similar task can be called shellcode. Because the function of a payload is not limited to merely spawning a shell, some have suggested that the name shellcode is insufficient. However, attempts at replacing the term have not gained wide acceptance. Shellcode is commonly written in machine code.

The null character is a control character with the value zero. It is present in many character sets, including those defined by the Baudot and ITA2 codes, ISO/IEC 646, the C0 control code, the Universal Coded Character Set, and EBCDIC. It is available in nearly all mainstream programming languages. It is often abbreviated as NUL. In 8-bit codes, it is known as a null byte.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In formal language theory, the empty string, or empty word, is the unique string of length zero.

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

<span class="mw-page-title-main">CD-Text</span> CD-based format that allows for song information to be stored alongside audio data

CD-Text is an extension of the Red Book Compact Disc specifications standard for audio CDs. It allows storage of additional information on a standards-compliant audio CD.

The C++ programming language has support for string handling, mostly implemented in its standard library. The language standard specifies several string types, some inherited from C, some designed to make use of the language's features, such as classes and RAII. The most-used of these is std::string.

In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string.

bcrypt is a password-hashing function designed by Niels Provos and David Mazières, based on the Blowfish cipher and presented at USENIX in 1999. Besides incorporating a salt to protect against rainbow table attacks, bcrypt is an adaptive function: over time, the iteration count can be increased to make it slower, so it remains resistant to brute-force search attacks even with increasing computation power.

Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java, PHP, C#, etc. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly.

The C programming language has a set of functions implementing operations on strings in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.

Universal Binary JSON (UBJSON) is a computer data interchange format. It is a binary form directly imitating JSON, but requiring fewer bytes of data. It aims to achieve the generality of JSON, combined with being much easier to process than JSON.

Although C++ is one of the most widespread programming languages, many prominent software engineers criticize C++ for being overly complex and fundamentally flawed. Among the critics have been: Robert Pike, Joshua Bloch, Linus Torvalds, Donald Knuth, Richard Stallman, and Ken Thompson. C++ was widely adopted and implemented as a systems language through most of its existence, it has been used to build many pieces of very important software.

References

  1. "Chapter 15 - MIPS Assembly Language" (PDF). Carleton University. Retrieved 9 October 2023.
  2. Ritchie, Dennis M. (April 1993). The development of the C language. Second History of Programming Languages conference. Cambridge, MA.
  3. Ritchie, Dennis M. (1996). "The development of the C language". In Bergin, Jr., Thomas J.; Gibson, Jr., Richard G. (eds.). History of Programming Languages (2 ed.). New York: ACM Press. ISBN   0-201-89502-1 via Addison-Wesley (Reading, Mass).
  4. IBM z/Architecture Principles of Operation
  5. Kamp, Poul-Henning (25 July 2011), "The Most Expensive One-byte Mistake", ACM Queue, 9 (7): 40–43, doi: 10.1145/2001562.2010365 , ISSN   1542-7730, S2CID   30282393
  6. Rain Forest Puppy (9 September 1999). "Perl CGI problems". Phrack Magazine. artofhacking.com. 9 (55): 7. Retrieved 3 January 2016.
  7. "Null byte injection on PHP?".
  8. "UTF-8, a transformation format of ISO 10646" . Retrieved 19 September 2013.
  9. "Unicode/UTF-8-character table" . Retrieved 13 September 2013.
  10. Kuhn, Markus. "UTF-8 and Unicode FAQ" . Retrieved 13 September 2013.