Pointer swizzling

Last updated

In computer science, pointer swizzling is the conversion of references based on name or position into direct pointer references (memory addresses). It is typically performed during deserialization or loading of a relocatable object from a disk file, such as an executable file or pointer-based data structure.

Contents

The reverse operation, replacing memory pointers with position-independent symbols or positions, is sometimes referred to as unswizzling, and is performed during serialization (saving).

Example

It is easy to create a linked list data structure using elements like this:

structnode{intdata;structnode*next;};

But saving the list to a file and then reloading it will (on most operating systems) break every link and render the list useless because the nodes will almost never be loaded into the same memory locations. One way to usefully save and retrieve the list is to assign a unique id number to each node and then unswizzle the pointers by turning them into a field indicating the id number of the next node:

structnode_saved{intdata;intid_number;intid_number_of_next_node;};

Records like these can be saved to a file in any order and reloaded without breaking the list. Other options include saving the file offset of the next node or a number indicating its position in the sequence of saved records, or simply saving the nodes in-order to the file.

After loading such a list, finding a node based on its number is cumbersome and inefficient (serial search). Traversing the list was very fast with the original "next" pointers. To convert the list back to its original form, or swizzle the pointers, requires finding the address of each node and turning the id_number_of_next_node fields back into direct pointers to the right node.

Methods of unswizzling

There are a potentially unlimited number of forms into which a pointer can be unswizzled, but some of the most popular include:

Methods of swizzling

Swizzling in the general case can be complicated. The reference graph of pointers might contain an arbitrary number of cycles; this complicates maintaining a mapping from the old unswizzled values to the new addresses. Associative arrays are useful for maintaining the mapping, while algorithms such as breadth-first search help to traverse the graph, although both of these require extra storage. Various serialization libraries provide general swizzling systems. In many cases, however, swizzling can be performed with simplifying assumptions, such as a tree or list structure of references.

The different types of swizzling are:

Potential security weaknesses

For security, unswizzling and swizzling must be implemented with great caution. In particular, an attacker's presentation of a specially crafted file may allow access to addresses outside of the expected and proper bounds. In systems with weak memory protection this can lead to exposure of confidential data or modification of code likely to be executed. If the system does not implement guards against execution of data the system may be severely compromised by the installation of various kinds of malware.

Methods of protection include verifications prior to releasing the data to an application:

See also

Related Research Articles

<span class="mw-page-title-main">C (programming language)</span> General-purpose programming language

C is a general-purpose computer programming language. It was created in the 1970s by Dennis Ritchie, and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, protocol stacks, though decreasingly for application software. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

In computing, serialization or serialisation is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

In computer programming, a reference is a value that enables a program to indirectly access a particular data, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

In computing, position-independent code (PIC) or position-independent executable (PIE) is a body of machine code that, being placed somewhere in the primary memory, executes properly regardless of its absolute address. PIC is commonly used for shared libraries, so that the same library code can be loaded in a location in each program address space where it does not overlap with other memory in use. PIC was also used on older computer systems that lacked an MMU, so that the operating system could keep applications away from each other even within the single address space of an MMU-less system.

The inode is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attributes may include metadata, as well as owner and permission data.

A struct in the C programming language is a composite data type declaration that defines a physically grouped list of variables under one name in a block of memory, allowing the different variables to be accessed via a single pointer or by the struct declared name which returns the same address. The struct data type can contain other data types so is used for mixed-data-type records such as a hard-drive directory entry, or other mixed-type records.

In the C programming language, data types constitute the semantics and characteristics of storage of data elements. They are expressed in the language syntax in form of declarations for memory locations or variables. Data types also determine the types of operations or methods of processing of data elements.

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.

In computer programming, an opaque pointer is a special case of an opaque data type, a data type declared to be a pointer to a record or data structure of some unspecified type.

In computer programming, a forward declaration is a declaration of an identifier for which the programmer has not yet given a complete definition.

In computer programming, a sentinel node is a specifically designated node used with linked lists and trees as a traversal path terminator. This type of node does not hold or reference any data managed by the data structure.

A class in C++ is a user-defined type or data structure declared with keyword class that has data and functions as its members whose access is governed by the three access specifiers private, protected or public. By default access to members of a C++ class is private. The private members are not accessible outside the class; they can be accessed only through methods of the class. The public members form an interface to the class and are accessible outside the class.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

Platform Invocation Services, commonly referred to as P/Invoke, is a feature of Common Language Infrastructure implementations, like Microsoft's Common Language Runtime, that enables managed code to call native code.

sizeof is a unary operator in the programming languages C and C++. It generates the storage size of an expression or a data type, measured in the number of char-sized units. Consequently, the construct sizeof (char) is guaranteed to be 1. The actual number of bits of type char is specified by the preprocessor macro CHAR_BIT, defined in the standard include file limits.h. On most modern computing platforms this is eight bits. The result of sizeof has an unsigned integer type that is usually denoted by size_t.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

C's offsetof macro is an ANSI C library feature found in stddef.h. It evaluates to the offset of a given member within a struct or union type, an expression of type size_t. The offsetof macro takes two parameters, the first being a structure name, and the second being the name of a member within the structure. It cannot be described as a C prototype.

In computer science, a linked data structure is a data structure which consists of a set of data records (nodes) linked together and organized by references. The link between data can also be called a connector.

References

    Further reading