Aliasing (computing)

Last updated

In computing, aliasing describes a situation in which a data location in memory can be accessed through different symbolic names in the program. Thus, modifying the data through one name implicitly modifies the values associated with all aliased names, which may not be expected by the programmer. As a result, aliasing makes it particularly difficult to understand, analyze and optimize programs. Aliasing analysers intend to make and compute useful information for understanding aliasing in programs.

Contents

Examples

Buffer overflow

For example, most implementations of the C programming language do not perform array bounds checking. One can then exploit the implementation of the programming language by the compiler and the computer architecture's assembly language conventions, to achieve aliasing effects by writing outside of the array (a type of buffer overflow). This invokes undefined behaviour according to the C language specification; however many implementations of C will show the aliasing effects described here.

If an array is created on the stack, with a variable laid out in memory directly beside that array, one could index outside the array and directly change the variable by changing the relevant array element. For example, if there is an int array of size 2 (for this example's sake, calling it arr), next to another int variable (call it i), arr[2] (i.e., the 3rd element) would be aliased to i if they are adjacent in memory.

#include<stdio.h>intmain(){intarr[2]={1,2};inti=10;/* Write beyond the end of arr. Undefined behaviour in standard C, will write to i in some implementations. */arr[2]=20;printf("element 0: %d \t",arr[0]);// outputs 1printf("element 1: %d \t",arr[1]);// outputs 2printf("element 2: %d \t",arr[2]);// outputs 20, if aliasing occurredprintf("i: %d \t\t",i);// might also output 20, not 10, because of aliasing, but the compiler might have i stored in a register and print 10/* arr size is still 2. */printf("arr size: %d \n",sizeof(arr)/sizeof(int));}

This is possible in some implementations of C because an array is a block of contiguous memory, and array elements are merely referenced by offsets off the address of the beginning of that block multiplied by the size of a single element. Since C has no bounds checking, indexing and addressing outside of the array is possible. Note that the aforementioned aliasing behaviour is undefined behaviour. Some implementations may leave space between arrays and variables on the stack, for instance, to align variables to memory locations that are a multiple of the architecture's native word size. The C standard does not generally specify how data is to be laid out in memory. (ISO/IEC 9899:1999, section 6.2.6.1).

It is not erroneous for a compiler to omit aliasing effects for accesses that fall outside the bounds of an array.

Aliased pointers

Another variety of aliasing can occur in any language that can refer to one location in memory with more than one name (for example, with pointers). See the C example of the XOR swap algorithm that is a function; it assumes the two pointers passed to it are distinct, but if they are in fact equal (or aliases of each other), the function fails. This is a common problem with functions that accept pointer arguments, and their tolerance (or the lack thereof) for aliasing must be carefully documented, particularly for functions that perform complex manipulations on memory areas passed to them.

Specified aliasing

Controlled aliasing behaviour may be desirable in some cases (that is, aliasing behaviour that is specified, unlike that enabled by memory layout in C). It is common practice in Fortran. The Perl programming language specifies, in some constructs, aliasing behaviour, such as in foreach loops. This allows certain data structures to be modified directly with less code. For example,

my@array=(1,2,3);foreachmy$element(@array){# Increment $element, thus automatically# modifying @array, since $element is ''aliased''# to each of @array's elements in turn.$element++;}print"@array \n";

will print out "2 3 4" as a result. If one wanted to bypass aliasing effects, one could copy the contents of the index variable into another and change the copy.

Conflicts with optimization

Optimizers often have to make conservative assumptions about variables when aliasing is possible. For example, knowing the value of a variable (such as x is 5) normally allows certain optimizations (such as constant propagation). However, the compiler cannot use this information after an assignment to another variable (for example, in C, *y = 10) because it could be that *y is an alias of x. This could be the case after an assignment like y = &x. As an effect of this assignment to *y, the value of x would be changed as well, so propagating the information that x is 5 to the statements following *y = 10 would be potentially wrong (if *y is indeed an alias of x). However, if there is information about pointers, the constant propagation process could make a query like: can x be an alias of *y? Then, if the answer is no, x = 5 can be propagated safely.

Another optimization impacted by aliasing is code reordering. If the compiler decides that x is not aliased by *y, then code that uses or changes the value of x can be moved before the assignment *y = 10, if this would improve scheduling or enable more loop optimizations to be carried out.

To enable such optimizations in a predictable manner, the ISO standard for the C programming language (including its newer C99 edition, see section 6.5, paragraph 7) specifies that it is illegal (with some exceptions) to access the same memory location using pointers of different types. A compiler may therefore assume that such pointers do not alias. This rule, known as the strict aliasing rule, sometimes allows for impressive increases in performance, [1] but has been known to break some otherwise valid code. Several software projects intentionally violate this portion of the C99 standard. For example, Python 2.x did so to implement reference counting, [2] and required changes to the basic object structs in Python 3 to enable this optimization. The Linux kernel does this because strict aliasing causes problems with optimization of inlined code. [3] In such cases, when compiled with gcc, the option -fno-strict-aliasing is invoked to prevent unwanted optimizations that could yield unexpected code.

Hardware aliasing

The term aliasing is also used to describe the situation where, due to either a hardware design choice or a hardware failure, one or more of the available address bits is not used in the memory selection process. [4] This may be a design decision if there are more address bits available than are necessary to support the installed memory device(s). In a failure, one or more address bits may be shorted together, or may be forced to ground (logic 0) or the supply voltage (logic 1).

Example

For this example, assuming a memory design with 8 locations, requiring only 3 address lines (or bits) since 23 = 8). Address bits (named A2 through A0) are decoded to select unique memory locations as follows, in standard binary counter fashion:

A2A1A0Memory location
0000
0011
0102
0113
1004
1015
1106
1117

In the table above, each of the 8 unique combinations of address bits selects a different memory location. However, if one address bit (say A2) were to be shorted to ground, the table would be modified as follows:

A2A1A0Memory location
0000
0011
0102
0113
0000
0011
0102
0113

In this case, with A2 always being zero, the first four memory locations are duplicated and appear again as the second four. Memory locations 4 through 7 have become inaccessible.

If this change occurred to a different address bit, the decoding results would be different, but in general the effect would be the same: the loss of a single address bit cuts the available memory space in half, with resulting duplication (aliasing) of the remaining space.

See also

Related Research Articles

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

<span class="mw-page-title-main">C (programming language)</span> General-purpose programming language

C is a general-purpose computer programming language. It was created in the 1970s by Dennis Ritchie, and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, protocol stacks, though decreasingly for application software. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

In computing, a segmentation fault or access violation is a fault, or failure condition, raised by hardware with memory protection, notifying an operating system (OS) the software has attempted to access a restricted area of memory. On standard x86 computers, this is a form of general protection fault. The operating system kernel will, in response, usually perform some corrective action, generally passing the fault on to the offending process by sending the process a signal. Processes can in some cases install a custom signal handler, allowing them to recover on their own, but otherwise the OS default signal handler is used, generally causing abnormal termination of the process, and sometimes a core dump.

In computer programming, a type system is a logical system comprising a set of rules that assigns a property called a type to every "term". Usually the terms are various constructs of a computer program, such as variables, expressions, functions, or modules. A type system dictates the operations that can be performed on a term. For variables, the type system determines the allowed values of that term. Type systems formalize and enforce the otherwise implicit categories the programmer uses for algebraic data types, data structures, or other components.

In computer programming, the stride of an array is the number of locations in memory between beginnings of successive array elements, measured in bytes or in units of the size of the array's elements. The stride cannot be smaller than the element size but can be larger, indicating extra space between elements.

C dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely malloc, realloc, calloc and free.

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

Pointer (computer programming) Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable, in the language specification to which the computer code adheres. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform.

Foreach loop Control flow statement for traversing items in a collection

In computer programming, foreach loop is a control flow statement for traversing items in a collection. foreach is usually used in place of a standard for loop statement. Unlike other for loop constructs, however, foreach loops usually maintain no explicit counter: they essentially say "do this to everything in this set", rather than "do this x times". This avoids potential off-by-one errors and makes code simpler to read. In object-oriented languages, an iterator, even if implicit, is often used as the means of traversal.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

In some programming languages, const is a type qualifier that indicates that the data is read-only. While this can be used to declare constants, const in the C family of languages differs from similar constructs in other languages in being part of the type, and thus has complicated behavior when combined with pointers, references, composite data types, and type-checking. In other languages, the data is not in a single memory location, but copied at compile time on each use. Languages which utilize it include C, C++, D, JavaScript, Julia, and Rust.

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

In the C programming language, data types constitute the semantics and characteristics of storage of data elements. They are expressed in the language syntax in form of declarations for memory locations or variables. Data types also determine the types of operations or methods of processing of data elements.

typedef is a reserved keyword in the programming languages C and C++. It is used to create an additional name (alias) for another data type, but does not create a new type, except in the obscure case of a qualified typedef of an array type where the typedef qualifiers are transferred to the array element type. As such, it is often used to simplify the syntax of declaring complex data structures consisting of struct and union types, although it is also commonly used to provide specific descriptive type names for integer data types of varying sizes.

sizeof is a unary operator in the programming languages C and C++. It generates the storage size of an expression or a data type, measured in the number of char-sized units. Consequently, the construct sizeof (char) is guaranteed to be 1. The actual number of bits of type char is specified by the preprocessor macro CHAR_BIT, defined in the standard include file limits.h. On most modern computing platforms this is eight bits. The result of sizeof has an unsigned integer type that is usually denoted by size_t.

In computer programming, a branch table or jump table is a method of transferring program control (branching) to another part of a program using a table of branch or jump instructions. It is a form of multiway branch. The branch table construction is commonly used when programming in assembly language but may also be generated by compilers, especially when implementing optimized switch statements whose values are densely packed together.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

Increment and decrement operators are unary operators that add or subtract one, to or from their operand, respectively. They are commonly implemented in imperative programming languages. C-like languages feature two versions of each operator with slightly different semantics.

In computer programming, a subroutine is a sequence of program instructions that performs a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed.

References

  1. Mike Acton (2006-06-01). "Understanding Strict Aliasing".
  2. Neil Schemenauer (2003-07-17). "ANSI strict aliasing and Python".
  3. Linus Torvalds (2003-02-26). "Re: Invalid compilation without -fno-strict-aliasing".
  4. Michael Barr (2012-07-27). "Software Based Memory Testing".