Type punning

In computer science, type punning is a common term for any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

In C and C++, constructs such as pointer type conversion and union (to which C++ adds reference type conversion and reinterpret_cast) are provided in order to permit many kinds of type punning, although some kinds are not actually supported by the standard language.

In the Pascal programming language, records with variants may be used to treat a particular data type in more than one manner, or in a manner not normally permitted.

Sockets example

One classic example of type punning is found in the Berkeley sockets interface. The function to bind an opened but uninitialized socket to an IP address is declared as follows:

    int bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen);

The bind function is usually called as follows:

    struct sockaddr_in sa = {0};
    int sockfd = ...;
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    bind(sockfd, (struct sockaddr *)&sa, sizeof sa);

The Berkeley sockets library fundamentally relies on the fact that in C, a pointer to struct sockaddr_in is freely convertible to a pointer to struct sockaddr, and on the two structure types sharing the same initial memory layout. Therefore, a reference to the structure field my_addr->sa_family (where my_addr is of type struct sockaddr*) will actually refer to the field sa.sin_family (where sa is of type struct sockaddr_in). In other words, the sockets library uses type punning to implement a rudimentary form of polymorphism or inheritance.
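The layout assumption described above can be checked on a given platform. The following is a minimal sketch, assuming a POSIX system; the helper names family_of and family_seen_through_generic_pointer are hypothetical and not part of the sockets API. It copies the family field out with memcpy rather than reading it through the generic pointer type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper (not part of the sockets API): returns the address
   family stored behind a generic struct sockaddr pointer. The bytes are
   copied out with memcpy, so no object is read through a mismatched type. */
sa_family_t family_of(const struct sockaddr *addr)
{
    sa_family_t family;
    memcpy(&family,
           (const unsigned char *)addr + offsetof(struct sockaddr, sa_family),
           sizeof family);
    return family;
}

/* Hypothetical helper: fills in an IPv4 address structure and reads the
   family back through the generic pointer type, as bind would see it. */
sa_family_t family_seen_through_generic_pointer(void)
{
    struct sockaddr_in sa = {0};
    sa.sin_family = AF_INET;

    /* The punning relies on both structures keeping the family field at
       the same offset. */
    assert(offsetof(struct sockaddr, sa_family) ==
           offsetof(struct sockaddr_in, sin_family));

    return family_of((const struct sockaddr *)&sa);
}
```

If the offsets differed, the assertion would fail and the generic pointer would not see the value stored in the IPv4-specific structure.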

A related and common technique is the use of "padded" data structures, which allow different kinds of values to be stored in what is effectively the same storage space. This is often seen when two structures are used in mutual exclusivity as an optimization.

Floating-point example

Not all examples of type punning involve structures, as the previous example did. Suppose we want to determine whether a floating-point number is negative. We could write:

    bool is_negative(float x) {
        return x < 0.0;
    }

However, supposing that floating-point comparisons are expensive, and also supposing that float is represented according to the IEEE floating-point standard, and integers are 32 bits wide, we could engage in type punning to extract the sign bit of the floating-point number using only integer operations:

    bool is_negative(float x) {
        unsigned int *ui = (unsigned int *)&x;
        return *ui & 0x80000000;
    }

Note that the behaviour will not be exactly the same: in the special case of x being negative zero, the first implementation yields false while the second yields true.

This kind of type punning is more dangerous than most. Whereas the former example relied only on guarantees made by the C programming language about structure layout and pointer convertibility, the latter example relies on assumptions about a particular system's hardware. Some situations, such as time-critical code that the compiler otherwise fails to optimize, may require dangerous code. In these cases, documenting all such assumptions in comments, and introducing static assertions to verify portability expectations, helps to keep the code maintainable.
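The static assertions mentioned above might look like the following minimal sketch, which assumes a C11 compiler. The function name float_looks_like_ieee754_binary32 is hypothetical. The compile-time checks cover the size assumptions, and the runtime probe compares two well-known IEEE 754 bit patterns.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Compile-time checks for the size assumptions named in the text (C11). */
static_assert(CHAR_BIT == 8, "bytes are assumed to be 8 bits wide");
static_assert(sizeof(float) == sizeof(uint32_t),
              "float is assumed to be 32 bits wide");

/* Hypothetical runtime probe: IEEE 754 binary32 encodes 1.0f as 0x3F800000
   and -1.0f as 0xBF800000. Returns 1 if this platform matches, else 0. */
int float_looks_like_ieee754_binary32(void)
{
    const float one = 1.0f;
    const float minus_one = -1.0f;
    uint32_t one_bits;
    uint32_t minus_one_bits;

    memcpy(&one_bits, &one, sizeof one_bits);
    memcpy(&minus_one_bits, &minus_one, sizeof minus_one_bits);

    return one_bits == UINT32_C(0x3F800000) &&
           minus_one_bits == UINT32_C(0xBF800000);
}
```

A build that violates the size assumptions stops at compile time, while a platform with a different floating-point encoding is detected by the probe.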

For a practical example popularized by Quake III, see fast inverse square root.

In addition to the assumption about the bit representation of floating-point numbers, the previous floating-point type-punning example also violates the C language's constraints on how objects are accessed: [1] the declared type of x is float, but it is read through an expression of type unsigned int. Pointer punning of this kind can also cause problems on platforms where the two types have different alignment requirements, since the converted pointer may be misaligned for the type it is read as. Furthermore, because the compiler may assume that pointers to incompatible types do not alias, accesses to the same memory through both pointers can be reordered or optimized away without any diagnostic.
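One way to sidestep the access constraint is to copy the bytes of the float into an integer of the same size with memcpy. The following is a minimal sketch of that approach, not the method of any particular library; the name is_negative_bits is hypothetical, and the code still assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of is_negative: copies the bytes of x into an
   integer of the same size instead of reading x through an unsigned int
   pointer. Assumes a 32-bit IEEE 754 float, as in the text. */
bool is_negative_bits(float x)
{
    uint32_t bits;
    static_assert(sizeof bits == sizeof x, "float must be 32 bits wide");

    memcpy(&bits, &x, sizeof bits);
    return (bits & UINT32_C(0x80000000)) != 0;
}
```

As with the pointer-based version, negative zero is reported as negative, because only the sign bit is examined.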

Use of union

A common attempt to fix type punning is the use of a union, but this does not make the code portable. (Additionally, this example still assumes the IEEE 754 bit representation of floating-point types.)

    bool is_negative(float x) {
        union {
            unsigned int ui;
            float d;
        } my_union = { .d = x };
        return my_union.ui & 0x80000000;
    }

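Accessing my_union.ui after initializing the other member, my_union.d, is still a form of type punning in C [2]: the stored bytes are reinterpreted as an unsigned int, which might be a trap representation, and the values of any bytes not belonging to the member last stored are unspecified [3]. In C++, the access is undefined behavior [4].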

The language of § 6.5/7 [1] can be misread to imply that reading alternative union members is always permissible. However, the text reads "An object shall have its stored value accessed only by…". It is a limiting expression, not a statement that all possible union members may be accessed regardless of which was last stored. So the union does not by itself remove the representation and portability issues of punning a pointer directly.

Some compilers like GCC support such non-standard constructs as a language extension. [5]

For another example of type punning, see Stride of an array.

Pascal

A variant record permits treating a data type as multiple kinds of data depending on which variant is being referenced. In the following example, integer is presumed to be 16 bits wide, longint and real are presumed to be 32 bits wide, and char is presumed to be 8 bits wide:

    type
        variant_record = record
            case rec_type: longint of
                1: (I: array[1..2] of integer);
                2: (L: longint);
                3: (R: real);
                4: (C: array[1..4] of char);
        end;

    var
        V: variant_record;
        K: integer;
        LA: longint;
        RA: real;
        Ch: char;
    ...
    V.I[1] := 1;
    Ch := V.C[1];   (* This would extract the first binary byte of V.I *)
    V.R := 8.3;
    LA := V.L;      (* This would store a real into an integer *)

In Pascal, converting a real to an integer (for example with trunc) yields the truncated value. The variant record instead reinterprets the binary representation of the floating-point number as a 32-bit long integer, which will not be the same value and may differ between systems.
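The contrast between converting a value and reinterpreting its bits can be sketched in C as follows. This is an illustration only, assuming a 32-bit IEEE 754 float; the function names convert_value and reinterpret_bits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value conversion: the fractional part is discarded, so 8.3f becomes 8. */
int32_t convert_value(float r)
{
    return (int32_t)r;
}

/* Reinterpretation: the four bytes of the float are copied unchanged into
   a 32-bit integer. Assumes float is 32 bits wide. */
int32_t reinterpret_bits(float r)
{
    int32_t l;
    static_assert(sizeof l == sizeof r, "float must be 32 bits wide");

    memcpy(&l, &r, sizeof l);
    return l;
}
```

For the value 8.3, the first function returns 8, while the second returns an integer built from the sign, exponent, and mantissa bits of the float and therefore unrelated to 8.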

These examples could be used to create strange conversions, although, in some cases, there may be legitimate uses for these types of constructs, such as for determining locations of particular pieces of data. In the following example a pointer and a longint are both presumed to be 32 bit:

    type
        PA = ^Arec;
        Arec = record
            case rt: longint of
                1: (P: PA);
                2: (L: longint);
        end;

    var
        PP: PA;
        K: longint;
    ...
    New(PP);
    PP^.P := PP;
    Writeln('Variable PP is located at address ', hex(PP^.L));

Here, "new" is the standard routine in Pascal for allocating memory for a pointer, and "hex" is presumably a routine that returns the hexadecimal string describing the value of an integer. This allows the address held in a pointer to be displayed, something which is not normally permitted. (Pointers cannot be read or written, only assigned.) Assigning a value to an integer variant of a pointer would allow examining or writing to any location in system memory:

    PP^.L := 0;
    PP := PP^.P;    (* PP now points to address 0 *)
    K := PP^.L;     (* K contains the value of word 0 *)
    Writeln('Word 0 of this machine contains ', K);

This construct may cause a program check or protection violation if address 0 is protected against reading on the machine or by the operating system the program is running under.
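For comparison, C offers a sanctioned way to view the address held in a pointer as an integer, so no variant record is needed for the display trick above. The following is a minimal sketch; the helper name format_address is hypothetical, and uintptr_t is an optional type that common hosted platforms provide.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes the address held in p into out as
   hexadecimal text and returns the number of characters that the full
   text needs, as snprintf does. */
int format_address(char *out, size_t out_size, const void *p)
{
    uintptr_t address = (uintptr_t)p;

    assert(out != NULL);
    return snprintf(out, out_size, "0x%" PRIxPTR, address);
}
```

The sketch only formats the address. Reading through an arbitrary integer converted back to a pointer has the same hazards as the Pascal example.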

The reinterpret-cast technique from C/C++ also works in Pascal. This can be useful when, e.g., reading dwords from a byte stream that should be treated as floating-point values. Here is an example that reinterpret-casts a dword to a real, assuming as above that real is 32 bits wide:

    type
        PReal = ^real;

    var
        dw: dword;
        f: real;
    ...
    f := PReal(@dw)^;
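The byte-stream use case mentioned above can be sketched in C as follows. This is an illustration under stated assumptions: the stream is little-endian, float is a 32-bit IEEE 754 type, and the platform stores floats and integers in the same byte order. The name read_float_le is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decodes four little-endian bytes from a stream into
   a float. The dword is assembled with shifts, so the result does not
   depend on the byte order of the host's integers. */
float read_float_le(const unsigned char bytes[4])
{
    uint32_t dw = (uint32_t)bytes[0]
                | ((uint32_t)bytes[1] << 8)
                | ((uint32_t)bytes[2] << 16)
                | ((uint32_t)bytes[3] << 24);
    float f;
    static_assert(sizeof f == sizeof dw, "float must be 32 bits wide");

    /* Copy the bit pattern instead of casting a pointer to the dword. */
    memcpy(&f, &dw, sizeof f);
    return f;
}
```

Copying with memcpy avoids both the aliasing and the alignment concerns of dereferencing a converted pointer.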

C#

In C# (and other .NET languages), type punning is a little harder to achieve because of the type system, but can be done nonetheless, using pointers or struct unions.

Pointers

C# only allows pointers to so-called native types (called unmanaged types in the C# specification), i.e. any primitive value type, enum, pointer, or struct that is composed only of other such types. Note that pointers are only allowed in code blocks marked 'unsafe'.

    float pi = 3.14159f;
    uint piAsRawData = *(uint*)&pi;

Struct unions

Struct unions are allowed without any notion of 'unsafe' code, but they do require the definition of a new type.

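    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Explicit)]
    struct FloatAndUIntUnion
    {
        [FieldOffset(0)]
        public float DataAsFloat;

        [FieldOffset(0)]
        public uint DataAsUInt;
    }

    // ...

    // Initialize the whole struct so that both fields count as assigned.
    FloatAndUIntUnion union = new FloatAndUIntUnion();
    union.DataAsFloat = 3.14159f;
    uint piAsRawData = union.DataAsUInt;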

Raw CIL code

Raw CIL can be used instead of C#, because it doesn't have most of the type limitations. This allows one to, for example, combine two enum values of a generic type:

    TEnum a = ...;
    TEnum b = ...;
    TEnum combined = a | b; // illegal

This can be circumvented by the following CIL code:

    .method public static hidebysig
        !!TEnum CombineEnums<valuetype .ctor ([mscorlib]System.ValueType) TEnum>(
            !!TEnum a,
            !!TEnum b
        ) cil managed
    {
        .maxstack 2

        ldarg.0
        ldarg.1
        or  // this will not cause an overflow, because a and b have the same type, and therefore the same size.
        ret
    }

The cpblk CIL opcode allows for some other tricks, such as converting a struct to a byte array:

    .method public static hidebysig
        uint8[] ToByteArray<valuetype .ctor ([mscorlib]System.ValueType) T>(
            !!T& v  // 'ref T' in C#
        ) cil managed
    {
        .locals init (
            [0] uint8[]
        )

        .maxstack 3

        // create a new byte array with length sizeof(T) and store it in local 0
        sizeof !!T
        newarr uint8
        dup           // keep a copy on the stack for later (1)
        stloc.0

        ldc.i4.0
        ldelema uint8

        // memcpy(local 0, &v, sizeof(T));
        // <the array is still on the stack, see (1)>
        ldarg.0 // this is the *address* of 'v', because its type is '!!T&'
        sizeof !!T
        cpblk

        ldloc.0
        ret
    }
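The comment in the CIL above describes the copy as a memcpy, and the same struct-to-bytes conversion can be sketched directly in C. The type struct sample and both function names are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical value type used only for this illustration. */
struct sample {
    uint32_t id;
    float    weight;
};

/* Copies the object representation of *v into out, which must have room
   for sizeof(struct sample) bytes. Padding bytes, if any, are copied too. */
void sample_to_bytes(const struct sample *v, unsigned char *out)
{
    assert(v != NULL && out != NULL);
    memcpy(out, v, sizeof *v);
}

/* Inverse operation: rebuilds a struct sample from bytes produced above. */
struct sample sample_from_bytes(const unsigned char *in)
{
    struct sample v;

    assert(in != NULL);
    memcpy(&v, in, sizeof v);
    return v;
}
```

The byte array reflects the in-memory layout of the host, so it is only meaningful to a reader that shares the same layout and byte order.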

References

  1. ISO/IEC 9899:1999, § 6.5/7
  2. "§ 6.5.2.3/3, footnote 97", ISO/IEC 9899:2018 (PDF), 2018, p. 59, If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called “type punning”). This might be a trap representation.
  3. "§ J.1/1, bullet 11", ISO/IEC 9899:2018 (PDF), 2018, p. 403, The following are unspecified: … The values of bytes that correspond to union members other than the one last stored into (6.2.6.1).
  4. ISO/IEC 14882:2011, § 9.5
  5. GCC: Non-Bugs