Reference (computer science)

In computer programming, a reference is a value that enables a program to indirectly access a particular datum, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

A reference is an abstract data type and may be implemented in many ways. Typically, a reference refers to data stored in memory on a given system, and its internal value is the memory address of the data, i.e. a reference is implemented as a pointer. For this reason a reference is often said to "point to" the data. Other implementations include an offset (difference) between the datum's address and some fixed "base" address, an index or identifier used in a lookup operation into an array or table, an operating system handle, a physical address on a storage device, or a network address such as a URL.
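As a rough illustration of a reference that is not a memory address, the following Python sketch implements references as integer identifiers looked up in a table. The class name HandleTable and its methods are hypothetical and chosen only for this example; real handle systems differ in many details.

```python
class HandleTable:
    """Hypothetical table that hands out integer handles for stored objects."""

    def __init__(self):
        self._slots = {}   # maps handle -> stored object
        self._next = 1     # next handle to issue

    def register(self, obj):
        """Store obj and return an integer handle that refers to it."""
        handle = self._next
        self._next += 1
        self._slots[handle] = obj
        return handle

    def is_valid(self, handle):
        """Report whether the handle still refers to a stored object."""
        return handle in self._slots

    def dereference(self, handle):
        """Return the object the handle refers to (KeyError if released)."""
        return self._slots[handle]

    def release(self, handle):
        """Invalidate the handle so that it no longer refers to anything."""
        del self._slots[handle]


table = HandleTable()
h = table.register({"name": "config", "size": 3})
record = table.dereference(h)
```

The caller holds only the integer h; the table decides where the record actually lives, which is the flexibility that index-based and handle-based references provide.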

Formal representation

A reference R is a value that admits one operation, dereference(R), which yields a value. Usually the reference is typed so that it returns values of a specific type, e.g.: [1] [2]

interface Reference<T> { T value(); }

Often the reference also admits an assignment operation store(R, x), meaning it is an abstract variable. [1]
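The dereference and store operations described above can be sketched in Python as a small generic class. The name Ref and the method names value and store are assumptions made for this illustration, mirroring the notation used in this section rather than any particular library.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Ref(Generic[T]):
    """A typed reference supporting dereference (value) and assignment (store)."""

    def __init__(self, initial: T) -> None:
        self._contents = initial

    def value(self) -> T:
        """Dereference: yield the value currently referred to."""
        return self._contents

    def store(self, x: T) -> None:
        """Assignment: make the reference yield x from now on."""
        self._contents = x


r = Ref(10)
before = r.value()      # read through the reference
r.store(before + 5)     # write through the reference
after = r.value()
```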

Use

References are widely used in programming, especially to efficiently pass large or mutable data as arguments to procedures, or to share such data among various uses. In particular, a reference may point to a variable or record that contains references to other data. This idea is the basis of indirect addressing and of many linked data structures, such as linked lists. References increase flexibility in where objects can be stored, how they are allocated, and how they are passed between areas of code. As long as one can access a reference to the data, one can access the data through it, and the data itself need not be moved. They also make sharing of data between different code areas easier; each keeps a reference to it.
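A minimal Python sketch of sharing through references follows; the function and variable names are invented for the example. Passing the list to a function passes a reference, so the function's change is visible to every holder of that reference, whereas an explicit copy is an independent object.

```python
def append_total(rows):
    """Mutate the caller's list in place; no copy of the data is made."""
    rows.append(sum(rows))


readings = [3, 4, 5]
alias = readings            # a second reference to the same list object
append_total(readings)      # the function receives a reference, not a copy

same_object = alias is readings   # both names refer to one object

snapshot = list(readings)   # explicit copy: a distinct object, equal contents
snapshot.append(0)          # changes the copy only
```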

References can cause significant complexity in a program, partially due to the possibility of dangling and wild references and partially because the topology of data with references is a directed graph, whose analysis can be quite complicated. Nonetheless, references are still simpler to analyze than pointers due to the absence of pointer arithmetic.

The mechanism of references, though varying in implementation, is a fundamental programming language feature common to nearly all modern programming languages. Even some languages that support no direct use of references have some internal or implicit use. For example, the call by reference calling convention can be implemented with either explicit or implicit use of references.

Examples

Pointers are the most primitive type of reference. Due to their intimate relationship with the underlying hardware, they are one of the most powerful and efficient types of references. However, also due to this relationship, pointers require a strong understanding by the programmer of the details of memory architecture. Because pointers store a memory location's address, instead of a value directly, inappropriate use of pointers can lead to undefined behavior in a program, particularly due to dangling pointers or wild pointers. Smart pointers are opaque data structures that act like pointers but can only be accessed through particular methods.

A handle is an abstract reference, and may be represented in various ways. A common example is the file handle (the FILE data structure in the C standard I/O library), used to abstract file content. A file handle usually represents both the file itself, as when requesting a lock on the file, and a specific position within the file's content, as when reading a file.

In distributed computing, the reference may contain more than an address or identifier; it may also include an embedded specification of the network protocols used to locate and access the referenced object, and of the way information is encoded or serialized. Thus, for example, a WSDL description of a remote web service can be viewed as a form of reference; it includes a complete specification of how to locate and bind to a particular web service. A reference to a live distributed object is another example: it is a complete specification for how to construct a small software component called a proxy that will subsequently engage in a peer-to-peer interaction, and through which the local machine may gain access to data that is replicated or exists only as a weakly consistent message stream. In all these cases, the reference includes the full set of instructions, or a recipe, for how to access the data; in this sense, it serves the same purpose as an identifier or address in memory.

If we have a set of keys K and a set of data objects D, any well-defined (single-valued) function from K to D ∪ {null} defines a type of reference, where null is the image of a key not referring to anything meaningful.
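This view of a reference as a function from keys to data can be sketched with a Python dictionary, using None in the role of null. The dictionary contents and the function name resolve are made up for the illustration.

```python
from typing import Optional

# K is the set of strings; D is the set of stored values.
data_by_key = {"a": "apple", "b": "banana"}


def resolve(key: str) -> Optional[str]:
    """Map a key to its datum, or to None if the key refers to nothing."""
    return data_by_key.get(key)


found = resolve("a")
missing = resolve("z")
```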

An alternative representation of such a function is a directed graph called a reachability graph. Here, each datum is represented by a vertex and there is an edge from u to v if the datum in u refers to the datum in v. The maximum out-degree is one. These graphs are valuable in garbage collection, where they can be used to separate accessible from inaccessible objects.
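As a small sketch of how such a graph separates accessible from inaccessible data, the Python code below follows edges from a set of starting vertices and treats everything not visited as inaccessible. The vertex names and the choice of "root" as the starting point are assumptions for the example; because the out-degree is at most one, each vertex maps to a single successor or to None.

```python
# Each datum refers to at most one other datum (out-degree at most one).
refers_to = {
    "root": "a",
    "a": "b",
    "b": None,
    "c": "d",   # c and d refer to each other but nothing reachable refers to them
    "d": "c",
}


def reachable(roots, refers_to):
    """Return the set of vertices reachable by following edges from roots."""
    seen = set()
    for node in roots:
        # Follow the single outgoing edge until the chain ends or repeats.
        while node is not None and node not in seen:
            seen.add(node)
            node = refers_to.get(node)
    return seen


live = reachable(["root"], refers_to)
garbage = set(refers_to) - live
```

Note that c and d form a cycle yet are still classified as inaccessible, since reachability is measured from the starting vertices rather than from whether anything refers to a datum.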

External and internal storage

In many data structures, large, complex objects are composed of smaller objects. These objects are typically stored in one of two ways:

  1. With internal storage, the contents of the smaller object are stored inside the larger object.
  2. With external storage, the smaller objects are allocated in their own location, and the larger object only stores references to them.

Internal storage is usually more efficient, because there is a space cost for the references and dynamic allocation metadata, and a time cost associated with dereferencing a reference and with allocating the memory for the smaller objects. Internal storage also enhances locality of reference by keeping different parts of the same large object close together in memory. However, there are a variety of situations in which external storage is preferred.

Some languages, such as Java, Smalltalk, Python, and Scheme, do not support internal storage. In these languages, all objects are uniformly accessed through references.

Language support

Assembly

In assembly language, it is typical to express references using either raw memory addresses or indexes into tables. These work, but are somewhat tricky to use, because an address tells you nothing about the value it points to, not even how large it is or how to interpret it; such information is encoded in the program logic. The result is that misinterpretations can occur in incorrect programs, causing bewildering errors.

Lisp

One of the earliest opaque references was that of the Lisp language cons cell, which is simply a record containing two references to other Lisp objects, including possibly other cons cells. This simple structure is most commonly used to build singly linked lists, but can also be used to build simple binary trees and so-called "dotted lists", which terminate not with a null reference but a value.
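The cons cell structure can be imitated in Python as a record with two reference fields. This is only an analogy: the class name Cons, the helper to_items, and the use of None as the null reference are choices made for this sketch, not Lisp itself.

```python
class Cons:
    """A record holding two references, conventionally called car and cdr."""

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


def to_items(cell):
    """Walk the cdr chain; return the collected cars and the final terminator."""
    items = []
    while isinstance(cell, Cons):
        items.append(cell.car)
        cell = cell.cdr
    return items, cell


# A proper singly linked list ends with the null reference.
proper = Cons(1, Cons(2, Cons(3, None)))
# A "dotted list" ends with a value instead.
dotted = Cons(1, Cons(2, 3))

proper_items, proper_end = to_items(proper)
dotted_items, dotted_end = to_items(dotted)
```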

C/C++

The pointer is still one of the most popular types of references today. It is similar to the assembly representation of a raw address, except that it carries a static datatype which can be used at compile-time to ensure that the data it refers to is not misinterpreted. However, because C has a weak type system which can be violated using casts (explicit conversions between various pointer types and between pointer types and integers), misinterpretation is still possible, if more difficult. Its successor C++ tried to increase type safety of pointers with new cast operators, a reference type &, and smart pointers in its standard library, but still retained the ability to circumvent these safety mechanisms for compatibility.

Fortran

Fortran does not have an explicit representation of references, but does use them implicitly in its call-by-reference calling semantics. A Fortran reference is best thought of as an alias of another object, such as a scalar variable or a row or column of an array. There is no syntax to dereference the reference or manipulate the contents of the referent directly. Fortran references can be null. As in other languages, these references facilitate the processing of dynamic structures, such as linked lists, queues, and trees.

Object-oriented languages

A number of object-oriented languages such as Eiffel, Java, C#, and Visual Basic have adopted a much more opaque type of reference, usually referred to as simply a reference. These references have types like C pointers indicating how to interpret the data they reference, but they are typesafe in that they cannot be interpreted as a raw address and unsafe conversions are not permitted. References are extensively used to access and assign objects. References are also used in function/method calls or message passing, and reference counts are frequently used to perform garbage collection of unused objects.

Functional languages

In Standard ML, OCaml, and many other functional languages, most values are persistent: they cannot be modified by assignment. Assignable "reference cells" provide mutable variables, data that can be modified. Such reference cells can hold any value, and so are given the polymorphic type α ref, where α is to be replaced with the type of value pointed to. These mutable references can be pointed to different objects over their lifetime. For example, this permits building of circular data structures. The reference cell is functionally equivalent to a mutable array of length 1.
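The equivalence between a reference cell and a mutable array of length 1 can be illustrated in Python, where a one-element list plays the role of the cell. This is a sketch of the idea rather than ML code; the names make_counter and node are invented for the example.

```python
def make_counter():
    """Return a function that counts its own calls using a one-slot cell."""
    cell = [0]   # one-element list standing in for a reference cell

    def increment():
        cell[0] = cell[0] + 1   # comparable to `cell := !cell + 1` in ML
        return cell[0]

    return increment


counter = make_counter()
first = counter()
second = counter()

# A circular structure: create the node with an empty cell, then
# re-point the cell at the node itself.
node = {"label": "loop", "next": [None]}
node["next"][0] = node
```

The circular structure is only possible because the cell can be assigned after the node exists, which is the role mutable references play among otherwise persistent values.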

To preserve safety and efficient implementations, references cannot be type-cast in ML, nor can pointer arithmetic be performed. In the functional paradigm, many structures that would be represented using pointers in a language like C are represented using other facilities, such as the powerful algebraic datatype mechanism. The programmer is then able to enjoy certain properties (such as the guarantee of immutability) while programming, even though the compiler often uses machine pointers "under the hood".

Perl/PHP

Perl supports hard references, which function similarly to those in other languages, and symbolic references, which are just string values that contain the names of variables. When a value that is not a hard reference is dereferenced, Perl treats it as a symbolic reference and resolves it to the variable whose name is given by the value. [3] PHP has a similar feature in the form of its $$var syntax. [4]
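The difference between the two kinds of reference can be approximated in Python, with a dictionary standing in for the table of named variables. This is an analogy only and does not reproduce Perl or PHP semantics; the names namespace and deref are hypothetical.

```python
namespace = {"scores": [90, 85]}   # stands in for the table of named variables

hard_ref = namespace["scores"]     # direct reference to the list object
symbolic_ref = "scores"            # just a string holding a variable's name


def deref(ref):
    """Resolve a string by name lookup; return any other value unchanged."""
    if isinstance(ref, str):
        return namespace[ref]
    return ref


via_hard = deref(hard_ref)
via_symbolic = deref(symbolic_ref)
same_before = via_hard is via_symbolic

# Rebinding the name changes what the symbolic reference resolves to,
# while the hard reference still refers to the original object.
namespace["scores"] = [1]
after_hard = deref(hard_ref)
after_symbolic = deref(symbolic_ref)
```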

References

  1. Sherman, Mark S. (April 1985). Paragon: A Language Using Type Hierarchies for the Specification, Implementation, and Selection of Abstract Data Types. Springer Science & Business Media. p. 175. ISBN 978-3-540-15212-5.
  2. "Reference (Java Platform SE 7)". docs.oracle.com. Retrieved 10 May 2022.
  3. "perlref". perldoc.perl.org. Retrieved 2013-08-19.
  4. "Variable variables - Manual". PHP. Retrieved 2013-08-19.