Pointer swizzling

Last updated

In computer science, pointer swizzling is the conversion of references based on name or position into direct pointer references (memory addresses). It is typically performed during deserialization or loading of a relocatable object from a disk file, such as an executable file or pointer-based data structure.

Contents

The reverse operation, replacing memory pointers with position-independent symbols or positions, is sometimes referred to as unswizzling, and is performed during serialization (saving). Alternatively, both operations can also be referred to as swizzling.

Example

It is easy to create a linked list data structure using elements like this:

structnode{intdata;structnode*next;};

But saving the list to a file and then reloading it will (on most operating systems) break every link and render the list useless because the nodes will almost never be loaded into the same memory locations. One way to usefully save and retrieve the list is to assign a unique id number to each node and then unswizzle the pointers by turning them into a field indicating the id number of the next node:

structnode_saved{intdata;intid_number;intid_number_of_next_node;};

Records like these can be saved to a file in any order and reloaded without breaking the list. Other options include saving the file offset of the next node or a number indicating its position in the sequence of saved records, or simply saving the nodes in-order to the file.

After loading such a list, finding a node based on its number is cumbersome and inefficient (serial search). Traversing the list was very fast with the original "next" pointers. To convert the list back to its original form, or swizzle the pointers, requires finding the address of each node and turning the id_number_of_next_node fields back into direct pointers to the right node.

Methods of unswizzling

There are a potentially unlimited number of forms into which a pointer can be unswizzled, but some of the most popular include:

Methods of swizzling

Swizzling in the general case can be complicated. The reference graph of pointers might contain an arbitrary number of cycles; this complicates maintaining a mapping from the old unswizzled values to the new addresses. Associative arrays are useful for maintaining the mapping, while algorithms such as breadth-first search help to traverse the graph, although both of these require extra storage. Various serialization libraries provide general swizzling systems. In many cases, however, swizzling can be performed with simplifying assumptions, such as a tree or list structure of references.

The different types of swizzling are:

Potential security weaknesses

For security, unswizzling and swizzling must be implemented with great caution. In particular, an attacker's presentation of a specially crafted file may allow access to addresses outside of the expected and proper bounds. In systems with weak memory protection this can lead to exposure of confidential data or modification of code likely to be executed. If the system does not implement guards against execution of data the system may be severely compromised by the installation of various kinds of malware.

Methods of protection include verifications prior to releasing the data to an application:

See also

Related Research Articles

C is a general-purpose programming language. It was created in the 1970s by Dennis Ritchie and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, and protocol stacks, but its use in application software has been decreasing. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of objects does not include any of their associated methods with which they were previously linked.

In computer programming, a reference is a value that enables a program to indirectly access a particular datum, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

In computing, position-independent code (PIC) or position-independent executable (PIE) is a body of machine code that, being placed somewhere in the primary memory, executes properly regardless of its absolute address. PIC is commonly used for shared libraries, so that the same library code can be loaded at a location in each program's address space where it does not overlap with other memory in use by, for example, other shared libraries. PIC was also used on older computer systems that lacked an MMU, so that the operating system could keep applications away from each other even within the single address space of an MMU-less system.

The inode is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attributes may include metadata, as well as owner and permission data.

A struct in the C programming language is a composite data type declaration that defines a physically grouped list of variables under one name in a block of memory, allowing the different variables to be accessed via a single pointer or by the struct declared name which returns the same address. The struct data type can contain other data types so is used for mixed-data-type records such as a hard-drive directory entry, or other mixed-type records.

Relocation is the process of assigning load addresses for position-dependent code and data of a program and adjusting the code and data to reflect the assigned addresses. Prior to the advent of multiprocess systems, and still in many embedded systems, the addresses for objects were absolute starting at a known location, often zero. Since multiprocessing systems dynamically link and switch between programs it became necessary to be able to relocate objects using position-independent code. A linker usually performs relocation in conjunction with symbol resolution, the process of searching files and libraries to replace symbolic references or names of libraries with actual usable addresses in memory before running a program.

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.

typedef is a reserved keyword in the programming languages C, C++, and Objective-C. It is used to create an additional name (alias) for another data type, but does not create a new type, except in the obscure case of a qualified typedef of an array type where the typedef qualifiers are transferred to the array element type. As such, it is often used to simplify the syntax of declaring complex data structures consisting of struct and union types, although it is also commonly used to provide specific descriptive type names for integer data types of varying sizes.

In computer programming, an opaque pointer is a special case of an opaque data type, a data type declared to be a pointer to a record or data structure of some unspecified type.

A class in C++ is a user-defined type or data structure declared with any of the keywords class, struct or union that has data and functions as its members whose access is governed by the three access specifiers private, protected or public. By default access to members of a C++ class declared with the keyword class is private. The private members are not accessible outside the class; they can be accessed only through member functions of the class. The public members form an interface to the class and are accessible outside the class.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

Platform Invocation Services, commonly referred to as P/Invoke, is a feature of Common Language Infrastructure implementations, like Microsoft's Common Language Runtime, that enables managed code to call native code.

sizeof is a unary operator in the programming languages C and C++. It generates the storage size of an expression or a data type, measured in the number of char-sized units. Consequently, the construct sizeof (char) is guaranteed to be 1. The actual number of bits of type char is specified by the preprocessor macro CHAR_BIT, defined in the standard include file limits.h. On most modern computing platforms this is eight bits. The result of sizeof has an unsigned integer type that is usually denoted by size_t.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

C's offsetof macro is an ANSI C library feature found in stddef.h. It evaluates to the offset of a given member within a struct or union type, an expression of type size_t. The offsetof macro takes two parameters, the first being a structure or union name, and the second being the name of a subobject of the structure/union that is not a bit field. It cannot be described as a C prototype.

In computer science, a linked data structure is a data structure which consists of a set of data records (nodes) linked together and organized by references. The link between data can also be called a connector.

<span class="mw-page-title-main">Zig (programming language)</span> A general-purpose programming language, toolchain to build Zig/C/C++ code

Zig is an imperative, general-purpose, statically typed, compiled system programming language designed by Andrew Kelley. It is intended as a successor to the language C, with the intent of being even smaller and simpler to program in, while offering more function. It is free and open-source software, released under an MIT License.

References

    Further reading