Null pointer

Last updated

In computing, a null pointer or null reference is a value saved for indicating that the pointer or reference does not refer to a valid object. Programs routinely use null pointers to represent conditions such as the end of a list of unknown length or the failure to perform some action; this use of null pointers can be compared to nullable types and to the Nothing value in an option type.

Contents

A null pointer should not be confused with an uninitialized pointer: a null pointer is guaranteed to compare unequal to any pointer that points to a valid object. However, in general, most languages do not offer such guarantee for uninitialized pointers. It might compare equal to other, valid pointers; or it might compare equal to null pointers. It might do both at different times; or the comparison might be undefined behaviour. Also, in languages offering such support, the correct use depends on the individual experience of each developer and linter tools. Even when used properly, null pointers are semantically incomplete, since they do not offer the possibility to express the difference between a "Not Applicable" value versus a "Not known" value or versus a "Future" value.

Because a null pointer does not point to a meaningful object, an attempt to access the data stored at that (invalid) memory location may cause a run-time error or immediate program crash. This is the null pointer error. It is one of the most common types of software weaknesses, [1] and Tony Hoare, who introduced the concept, has referred to it as a "billion dollar mistake". [2]

C

In C, two null pointers of any type are guaranteed to compare equal. [3] Prior to C23, the preprocessor macro NULL was provided, defined as an implementation-defined null pointer constant in <stdlib.h>, [4] which in C99 can be portably expressed as ((void *)0), the integer value 0 converted to the type void* (see pointer to void type). [5] Since C23, a null pointer is represented with nullptr which is of type nullptr_t (something originally introduced to C++11), providing a type safe null pointer. [6]

The C standard does not say that the null pointer is the same as the pointer to memory address  0, though that may be the case in practice. Dereferencing a null pointer is undefined behavior in C, [7] and a conforming implementation is allowed to assume that any pointer that is dereferenced is not null.

In practice, dereferencing a null pointer may result in an attempted read or write from memory that is not mapped, triggering a segmentation fault or memory access violation. This may manifest itself as a program crash, or be transformed into a software exception that can be caught by program code. There are, however, certain circumstances where this is not the case. For example, in x86 real mode, the address 0000:0000 is readable and also usually writable, and dereferencing a pointer to that address is a perfectly valid but typically unwanted action that may lead to undefined but non-crashing behavior in the application; if a null pointer is represented as a pointer to that address, dereferencing it will lead to that behavior. There are occasions when dereferencing a pointer to address zero is intentional and well-defined; for example, BIOS code written in C for 16-bit real-mode x86 devices may write the interrupt descriptor table (IDT) at physical address 0 of the machine by dereferencing a pointer with the same value as a null pointer for writing. It is also possible for the compiler to optimize away the null pointer dereference, avoiding a segmentation fault but causing other undesired behavior. [8]

C++

In C++, while the NULL macro was inherited from C, the integer literal for zero has been traditionally preferred to represent a null pointer constant. [9] However, C++11 introduced the explicit null pointer constant nullptr and type nullptr_t to be used instead, providing a type safe null pointer. nullptr and type nullptr_t were later introduced to C in C23.

Other languages

In some programming language environments (at least one proprietary Lisp implementation, for example),[ citation needed ] the value used as the null pointer (called nil in Lisp) may actually be a pointer to a block of internal data useful to the implementation (but not explicitly reachable from user programs), thus allowing the same register to be used as a useful constant and a quick way of accessing implementation internals. This is known as the nil vector.

In languages with a tagged architecture, a possibly null pointer can be replaced with a tagged union which enforces explicit handling of the exceptional case; in fact, a possibly null pointer can be seen as a tagged pointer with a computed tag.

Programming languages use different literals for the null pointer. In Python, for example, a null value is called None. In Java and C#, the literal null is provided as a literal for reference types. In Pascal and Swift, a null pointer is called nil. In Eiffel, it is called a void reference.

Null dereferencing

Because a null pointer does not point to a meaningful object, an attempt to dereference (i.e., access the data stored at that memory location) a null pointer usually (but not always) causes a run-time error or immediate program crash. MITRE lists the null pointer error as one of the most commonly exploited software weaknesses. [10]

Mitigation

While we could have languages with no nulls, most do have the possibility of nulls so there are techniques to avoid or aid debugging null pointer dereferences. [13] Bond et al. [13] suggest modifying the Java Virtual Machine (JVM) to keep track of null propagation.

We have three levels of handling null references, in order of effectiveness:

  1. languages with no null;
  2. languages that can statically analyse code to avoid the possibility of null dereference at run time;
  3. if null dereference can occur at runtime, tools that aid debugging.

Pure functional languages are an example of level 1 since no direct access is provided to pointers and all code and data is immutable. User code running in interpreted or virtual-machine languages generally does not suffer the problem of null pointer dereferencing.[ dubious discuss ]

Where a language does provide or utilise pointers which could become void, it is possible to avoid runtime null dereferences by providing compilation-time checking via static analysis or other techniques, with syntactic assistance from language features such as those seen in the Eiffel programming language with Void safety [14] to avoid null derefences, D, [15] and Rust. [16]

In some languages analysis can be performed using external tools, but these are weak compared to direct language support with compiler checks since they are limited by the language definition itself.

The last resort of level 3 is when a null reference occurs at runtime, debugging aids can help.

Alternatives to null pointers

As a rule of thumb, for each type of struct or class, define some objects representing some state of the business logic replacing the undefined behaviour on null. For example, "future" to indicate a field inside a structure that will not be available right now (but for which we know in advance that in the future it will be defined), "not applicable" to indicate a field in a non-normalized structure, "error", "timeout" to indicate that the field could not be initialized (probably stopping normal execution of the full program, thread, request or command).

History

In 2009, Tony Hoare stated [17] [18] that he invented the null reference in 1965 as part of the ALGOL W language. In that 2009 reference Hoare describes his invention as a "billion-dollar mistake":

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

See also

Notes

  1. "CWE-476: NULL Pointer Dereference". MITRE.
  2. "Null References: The Billion Dollar Mistake". InfoQ. Retrieved 5 September 2024.
  3. ISO/IEC 9899, clause 6.3.2.3, paragraph 4.
  4. ISO/IEC 9899, clause 7.17, paragraph 3: NULL... which expands to an implementation-defined null pointer constant...
  5. ISO/IEC 9899, clause 6.3.2.3, paragraph 3.
  6. "WR14-N3042: Introduce the nullptr constant". open-std.org. 2022-07-22. Archived from the original on December 24, 2022.
  7. 1 2 ISO/IEC 9899, clause 6.5.3.2, paragraph 4, esp. footnote 87.
  8. Lattner, Chris (2011-05-13). "What Every C Programmer Should Know About Undefined Behavior #1/3". blog.llvm.org. Archived from the original on 2023-06-14. Retrieved 2023-06-14.
  9. Stroustrup, Bjarne (March 2001). "Chapter 5:
    The const qualifier (§5.4) prevents accidental redefinition of NULL and ensures that NULL can be used where a constant is required.". The C++ Programming Language (14th printing of 3rd ed.). United States and Canada: Addison–Wesley. p.  88. ISBN   0-201-88954-4.
  10. "CWE-476: NULL Pointer Dereference". MITRE.
  11. The Objective-C 2.0 Programming Language, section "Sending Messages to nil".
  12. "OS X exploitable kernel NULL pointer dereference in AppleGraphicsDeviceControl"
  13. 1 2 Bond, Michael D.; Nethercote, Nicholas; Kent, Stephen W.; Guyer, Samuel Z.; McKinley, Kathryn S. (2007). "Tracking bad apples". Proceedings of the 22nd annual ACM SIGPLAN conference on Object oriented programming systems and applications - OOPSLA '07. p. 405. doi:10.1145/1297027.1297057. ISBN   9781595937865. S2CID   2832749.
  14. "Void-safety: Background, definition, and tools" . Retrieved 2021-11-24.
  15. Bartosz Milewski. "SafeD – D Programming Language" . Retrieved 17 July 2014.
  16. "Fearless Security: Memory Safety". Archived from the original on 8 November 2020. Retrieved 4 November 2020.
  17. Tony Hoare (2009-08-25). "Null References: The Billion Dollar Mistake". InfoQ.com.
  18. Tony Hoare (2009-08-25). "Presentation: "Null References: The Billion Dollar Mistake"". InfoQ.com.

Related Research Articles

C is a general-purpose programming language. It was created in the 1970s by Dennis Ritchie and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems code, device drivers, and protocol stacks, but its use in application software has been decreasing. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

In computing, a segmentation fault or access violation is a fault, or failure condition, raised by hardware with memory protection, notifying an operating system (OS) the software has attempted to access a restricted area of memory. On standard x86 computers, this is a form of general protection fault. The operating system kernel will, in response, usually perform some corrective action, generally passing the fault on to the offending process by sending the process a signal. Processes can in some cases install a custom signal handler, allowing them to recover on their own, but otherwise the OS default signal handler is used, generally causing abnormal termination of the process, and sometimes a core dump.

In computer programming, a reference is a value that enables a program to indirectly access a particular datum, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable, in the language specification of the programming language in which the source code is written. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform.

In computer programming, run-time type information or run-time type identification (RTTI) is a feature of some programming languages that exposes information about an object's data type at runtime. Run-time type information may be available for all types or only to types that explicitly have it. Run-time type information is a specialization of a more general concept called type introspection.

Short-circuit evaluation, minimal evaluation, or McCarthy evaluation is the semantics of some Boolean operators in some programming languages in which the second argument is executed or evaluated only if the first argument does not suffice to determine the value of the expression: when the first argument of the AND function evaluates to false, the overall value must be false; and when the first argument of the OR function evaluates to true, the overall value must be true.

In the C++ programming language, a reference is a simple reference datatype that is less powerful but safer than the pointer type inherited from C. The name C++ reference may cause confusion, as in computer science a reference is a general concept datatype, with pointers and C++ references being specific reference datatype implementations. The definition of a reference in C++ is such that it does not need to exist. It can be implemented as a new name for an existing object.

<span class="mw-page-title-main">Dangling pointer</span> Pointer that does not point to a valid object

Dangling pointers and wild pointers in computer programming are pointers that do not point to a valid object of the appropriate type. These are special cases of memory safety violations. More generally, dangling references and wild references are references that do not resolve to a valid destination.

In computing, an uninitialized variable is a variable that is declared but is not set to a definite known value before it is used. It will have some value, but not a predictable one. As such, it is a programming error and a common source of bugs in software.

In the C programming language, data types constitute the semantics and characteristics of storage of data elements. They are expressed in the language syntax in form of declarations for memory locations or variables. Data types also determine the types of operations or methods of processing of data elements.

In the C++ programming language, new and delete are a pair of language constructs that perform dynamic memory allocation, object construction and object destruction.

assert.h is a header file in the C standard library. It defines the C preprocessor macro assert and implements runtime assertion in C.

C++11 is a version of a joint technical standard, ISO/IEC 14882, by the International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), for the C++ programming language. C++11 replaced the prior version of the C++ standard, named C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

Nullable types are a feature of some programming languages which allow a value to be set to the special value NULL instead of the usual possible values of the data type. In statically typed languages, a nullable type is an option type, while in dynamically typed languages, equivalent behavior is provided by having a single null value.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

In object-oriented computer programming, a null object is an object with no referenced value or with defined neutral (null) behavior. The null object design pattern, which describes the uses of such objects and their behavior, was first published as "Void Value" and later in the Pattern Languages of Program Design book series as "Null Object".

C's offsetof macro is an ANSI C library feature found in stddef.h. It evaluates to the offset of a given member within a struct or union type, an expression of type size_t. The offsetof macro takes two parameters, the first being a structure or union name, and the second being the name of a subobject of the structure/union that is not a bit field. It cannot be described as a C prototype.

Memory safety is the state of being protected from various software bugs and security vulnerabilities when dealing with memory access, such as buffer overflows and dangling pointers. For example, Java is said to be memory-safe because its runtime error detection checks array bounds and pointer dereferences. In contrast, C and C++ allow arbitrary pointer arithmetic with pointers implemented as direct memory addresses with no provision for bounds checking, and thus are potentially memory-unsafe.

References