Tagged pointer

Last updated

In computer science, a tagged pointer is a pointer (concretely a memory address) with additional data associated with it, such as an indirection bit or reference count. This additional data is often "folded" into the pointer, meaning stored inline in the data representing the address, taking advantage of certain properties of memory addressing. The name comes from "tagged architecture" systems, which reserved bits at the hardware level to indicate the significance of each word; the additional data is called a "tag" or "tags", though strictly speaking "tag" refers to data specifying a type, not other data; however, the usage "tagged pointer" is ubiquitous.

Contents

Folding tags into the pointer

There are various techniques for folding tags into a pointer. [1] [ unreliable source? ]

Most architectures are byte-addressable (the smallest addressable unit is a byte), but certain types of data will often be aligned to the size of the data, often a word or multiple thereof. This discrepancy leaves a few of the least significant bits of the pointer unused, which can be used for tags – most often as a bit field (each bit a separate tag) – as long as code that uses the pointer masks out these bits before accessing memory. E.g., on a 32-bit architecture (for both addresses and word size), a word is 32 bits = 4 bytes, so word-aligned addresses are always a multiple of 4, hence end in 00, leaving the last 2 bits available; while on a 64-bit architecture, a word is 64 bits = 8 bytes, so word-aligned addresses end in 000, leaving the last 3 bits available. In cases where data is aligned at a multiple of word size, further bits are available. In case of word-addressable architectures, word-aligned data does not leave any bits available, as there is no discrepancy between alignment and addressing, but data aligned at a multiple of word size does.

Conversely, in some operating systems, virtual addresses are narrower than the overall architecture width, which leaves the most significant bits available for tags; this can be combined with the previous technique in case of aligned addresses. This is particularly the case on 64-bit architectures, as 64 bits of address space are far above the data requirements of all but the largest applications, and thus many practical 64-bit processors have narrower addresses. Note that the virtual address width may be narrower than the physical address width, which in turn may be narrower than the architecture width; for tagging of pointers in user space, the virtual address space provided by the operating system (in turn provided by the memory management unit) is the relevant width. In fact, some processors specifically forbid use of such tagged pointers at the processor level, notably x86-64, which requires the use of canonical form addresses by the operating system, with most significant bits all 0s or all 1s.

Lastly, the virtual memory system in most modern operating systems reserves a block of logical memory around address 0 as unusable. This means that, for example, a pointer to 0 is never a valid pointer and can be used as a special null pointer value. Unlike the previously mentioned techniques, this only allows a single special pointer value, not extra data for pointers generally.

Examples

One of the earliest examples of hardware support for tagged pointers in a commercial platform was the IBM System/38. [2] IBM later added tagged pointer support to the PowerPC architecture to support the IBM i operating system, which is an evolution of the System/38 platform. [3]

A significant example of the use of tagged pointers is the Objective-C runtime on iOS 7 on ARM64, notably used on the iPhone 5S. In iOS 7, virtual addresses only contain 33 bits of address information but are 64-bits long leaving 31 bits for tags. Objective-C class pointers are 8-byte aligned freeing up an additional 3 bits of address space, and the tag fields are used for many purposes, such as storing a reference count and whether the object has a destructor. [4] [5]

Early versions of MacOS used tagged addresses called Handles to store references to data objects. The high bits of the address indicated whether the data object was locked, purgeable, and/or originated from a resource file, respectively. This caused compatibility problems when MacOS addressing advanced from 24 bits to 32 bits in System 7. [6]

Null versus aligned pointer

Use of zero to represent a null pointer is extremely common, with many programming languages (such as Ada) explicitly relying on this behavior. In theory, other values in an operating system-reserved block of logical memory could be used to tag conditions other than a null pointer, but these uses appear to be rare, perhaps because they are at best non-portable. It is generally accepted practice in software design that if a special pointer value distinct from null (such as a sentinel in certain data structures) is needed, the programmer should explicitly provide for it.

Taking advantage of the alignment of pointers provides more flexibility than null pointers/sentinels because it allows pointers to be tagged with information about the type of data pointed to, conditions under which it may be accessed, or other similar information about the pointer's use. This information can be provided along with every valid pointer. In contrast, null pointers/sentinels provide only a finite number of tagged values distinct from valid pointers.

In a tagged architecture, a number of bits in every word of memory are reserved to act as a tag. Tagged architectures, such as the Lisp machines, often have hardware support for interpreting and processing tagged pointers.

GNU libc malloc() provides 8-byte aligned memory addresses for 32-bit platforms, and 16-byte alignment for 64-bit platforms. [7] Larger alignment values can be obtained with posix_memalign(). [8]

Examples

Example 1

In the following C code, the value of zero is used to indicate a null pointer:

voidoptionally_return_a_value(int*optional_return_value_pointer){/* ... */intvalue_to_return=1;/* is it non-NULL? (note that NULL, logical false, and zero compare equally in C) */if(optional_return_value_pointer)/* if so, use it to pass a value to the calling function */*optional_return_value_pointer=value_to_return;/* otherwise, the pointer is never dereferenced */}

Example 2

Here, the programmer has provided a global variable, whose address is then used as a sentinel:

#define SENTINEL &sentinel_snode_tsentinel_s;voiddo_something_to_a_node(node_t*p){if(NULL==p)/* do something */elseif(SENTINEL==p)/* do something else */else/* treat p as a valid pointer to a node */}

Example 3

Assume we have a data structure table_entry that is always aligned to a 16 byte boundary. In other words, the least significant 4 bits of a table entry's address are always 0 (24 = 16). We could use these 4 bits to mark the table entry with extra information. For example, bit 0 might mean read only, bit 1 might mean dirty (the table entry needs to be updated), and so on.

If pointers are 16-bit values, then:

Advantages

The major advantage of tagged pointers is that they take up less space than a pointer along with a separate tag field. This can be especially important when a pointer is a return value from a function. It can also be important in large tables of pointers.

A more subtle advantage is that by storing a tag in the same place as the pointer, it is often possible to guarantee the atomicity of an operation that updates both the pointer and its tag without external synchronization mechanisms.[ further explanation needed ] This can be an extremely large performance gain, especially in operating systems.

Disadvantages

Tagged pointers have some of the same difficulties as xor linked lists, although to a lesser extent. For example, not all debuggers will be able to properly follow tagged pointers; however, this is not an issue for a debugger that is designed with tagged pointers in mind.

The use of zero to represent a null pointer does not suffer from these disadvantages: it is pervasive, most programming languages treat zero as a special null value, and it has thoroughly proven its robustness. An exception is the way that zero participates in overload resolution in C++, where zero is treated as an integer rather than a pointer; for this reason the special value nullptr is preferred over the integer zero. However, with tagged pointers zeros are usually not used to represent null pointers.

Related Research Articles

<span class="mw-page-title-main">IBM System/370</span> Family of mainframe computers 1970–1990

The IBM System/370 (S/370) is a model range of IBM mainframe computers announced on June 30, 1970, as the successors to the System/360 family. The series mostly maintains backward compatibility with the S/360, allowing an easy migration path for customers; this, plus improved performance, were the dominant themes of the product announcement. In September 1990, the System/370 line was replaced with the System/390.

In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit CPUs and ALUs are those that are based on processor registers, address buses, or data buses of that size. A computer that uses such a processor is a 64-bit computer.

In computing, protected mode, also called protected virtual address mode, is an operational mode of x86-compatible central processing units (CPUs). It allows system software to use features such as segmentation, virtual memory, paging and safe multi-tasking designed to increase an operating system's control over application software.

<span class="mw-page-title-main">Memory management unit</span> Hardware translating virtual addresses to physical address

A memory management unit (MMU), sometimes called paged memory management unit (PMMU), is a computer hardware unit that examines all memory references on the memory bus, translating these requests, known as virtual memory addresses, into physical addresses in main memory.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

In computing, a bus error is a fault raised by hardware, notifying an operating system (OS) that a process is trying to access memory that the CPU cannot physically address: an invalid address for the address bus, hence the name. In modern use on most architectures these are much rarer than segmentation faults, which occur primarily due to memory access violations: problems in the logical address or permissions.

C dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely malloc, realloc, calloc, aligned_alloc and free.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

<span class="mw-page-title-main">Memory address</span> Reference to a specific memory location

In computing, a memory address is a reference to a specific memory location used at various levels by software and hardware. Memory addresses are fixed-length sequences of digits conventionally displayed and manipulated as unsigned integers. Such numerical semantic bases itself upon features of CPU, as well upon use of the memory like an array endorsed by various programming languages.

Mach-O, short for Mach object file format, is a file format for executables, object code, shared libraries, dynamically loaded code, and core dumps. It was developed to replace the a.out format.

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

The zero page or base page is the block of memory at the very beginning of a computer's address space; that is, the page whose starting address is zero. The size of a page depends on the context, and the significance of zero page memory versus higher addressed memory is highly dependent on machine architecture. For example, the Motorola 6800 and MOS Technology 6502 processor families treat the first 256 bytes of memory specially, whereas many other processors do not.

Byte addressing in hardware architectures supports accessing individual bytes. Computers with byte addressing are sometimes called byte machines, in contrast to word-addressable architectures, word machines, that access data by word.

Descriptors are an architectural feature of Burroughs large systems, including the current Unisys Clearpath/MCP systems. Apart from being stack- and tag-based, a notable architectural feature of these systems is that it is descriptor-based. Descriptors are the means of having data that does not reside on the stack as for arrays and objects. Descriptors are also used for string data as in compilers and commercial applications.

In computing, a word is the natural unit of data used by a particular processor design. A word is a fixed-sized datum handled as a unit by the instruction set or the hardware of the processor. The number of bits or digits in a word is an important characteristic of any specific processor design or computer architecture.

z/Architecture, initially and briefly called ESA Modal Extensions (ESAME), is IBM's 64-bit complex instruction set computer (CISC) instruction set architecture, implemented by its mainframe computers. IBM introduced its first z/Architecture-based system, the z900, in late 2000. Later z/Architecture systems include the IBM z800, z990, z890, System z9, System z10, zEnterprise 196, zEnterprise 114, zEC12, zBC12, z13, z14, z15 and z16.

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.

<span class="mw-page-title-main">Word addressing</span> Support by a hardware architecture of accessing memory only in units of words larger than a byte

In computer architecture, word addressing means that addresses of memory on a computer uniquely identify words of memory. It is usually used in contrast with byte addressing, where addresses uniquely identify bytes. Almost all modern computer architectures use byte addressing, and word addressing is largely only of historical interest. A computer that uses word addressing is sometimes called a word machine.

In IBM System/360 through present day z/Architecture, an address constant or "adcon" is an assembly language data type which contains the address of a location in computer memory. An address constant can be one, two, three or four bytes long, although an adcon of less than four bytes is conventionally used to hold an expression for a small integer such as a length, a relative address, or an index value, and does not represent an address at all. Address constants are defined using an assembler language "DC" statement.

The IBM System/360 architecture is the model independent architecture for the entire S/360 line of mainframe computers, including but not limited to the instruction set architecture. The elements of the architecture are documented in the IBM System/360 Principles of Operation and the IBM System/360 I/O Interface Channel to Control Unit Original Equipment Manufacturers' Information manuals.

References

  1. Friday Q&A 2012-07-27: Let's Build Tagged Pointers, by Mike Ash
  2. Levy, Henry M. (1984). "The IBM System/38" (PDF). Capability-Based Computer Systems. Digital Press. ISBN   0-932376-22-3.
  3. Frank G. Soltis (1997). Inside the AS/400, Second Edition. Duke Press. ISBN   978-1882419661.
  4. Friday Q&A 2013-09-27: ARM64 and You, by Mike Ash
  5. [objc explain]: Non-pointer isa
  6. Bricknell, K. J. Macintosh C: A Hobbyist's Guide To Programming the Mac OS in C.
  7. "Malloc Examples". The GNU C Library. Retrieved 5 September 2018.
  8. posix_memalign(3)    Linux Programmer's Manual – Library Functions