Union type

In computer science, a union is a value that may have any of several representations or formats within the same position in memory, or a data structure consisting of a variable that may hold such a value. Some programming languages support special data types, called union types, to describe such values and variables. In other words, a union type definition will specify which of a number of permitted primitive types may be stored in its instances, e.g., "float or long integer". In contrast with a record (or structure), which could be defined to contain both a float and an integer, a union holds only one value at any given time.

A union can be pictured as a chunk of memory that is used to store variables of different data types. Once a new value is assigned to a field, the existing data is overwritten with the new data. The memory area storing the value has no intrinsic type (other than just bytes or words of memory), but the value can be treated as one of several abstract data types, having the type of the value that was last written to the memory area.
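
The overwriting behaviour described above can be sketched in C. This is a minimal illustration, not a canonical example: the union number and the function float_overwrote_long are hypothetical names, and the members mirror the "float or long integer" case from the introduction.

```c
#include <string.h>

/* Hypothetical union for illustration: holds a long or a float. */
union number {
    long  as_long;
    float as_float;
};

/* Stores a long, then a float, in the same union object.
   Returns 1 if the leading bytes of the object now hold the
   float's representation, i.e. the long was overwritten. */
int float_overwrote_long(void)
{
    union number n;
    float expected = 1.5f;

    n.as_long = 42L;        /* the object now holds a long */
    n.as_float = expected;  /* the same storage is reused for the float */

    return memcmp(&n, &expected, sizeof expected) == 0;
}
```

After the second assignment the long value 42 is no longer recoverable; only the member written last is meaningful.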

In type theory, a union has a sum type; this corresponds to disjoint union in mathematics.

Depending on the language and type, a union value may be used in some operations, such as assignment and comparison for equality, without knowing its specific type. Other operations may require that knowledge, either by some external information, or by the use of a tagged union.

Untagged unions

Because of the limitations of their use, untagged unions are generally only provided in untyped languages or in a type-unsafe way (as in C). They have the advantage over simple tagged unions of not requiring space to store a data type tag.

The name "union" stems from the type's formal definition. If a type is considered as the set of all values that the type can take on, a union type is simply the mathematical union of its constituent types, since it can take on any value that any of its fields can. Also, because a mathematical union discards duplicates, if more than one field of the union can take on a single common value, it is impossible to tell from the value alone which field was last written.

However, one useful programming function of unions is to map smaller data elements to larger ones for easier manipulation. A data structure consisting, for example, of 4 bytes and a 32-bit integer, can form a union with an unsigned 64-bit integer, and thus be more readily accessed for purposes of comparison etc.
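
A sketch of this mapping in C follows. The names fields, packed and packed_equal are hypothetical, and the code assumes that the four bytes and the 32-bit integer occupy exactly 8 bytes with no padding, which the static assertion checks at compile time.

```c
#include <stdint.h>

/* Hypothetical record: four single bytes followed by a 32-bit integer. */
struct fields {
    uint8_t  bytes[4];
    uint32_t number;
};

/* Assumption: no padding, so the record is exactly 64 bits wide. */
_Static_assert(sizeof(struct fields) == sizeof(uint64_t),
               "struct fields must be exactly 8 bytes");

/* The same 8 bytes viewed either as the record or as one integer. */
union packed {
    struct fields parts;
    uint64_t      whole;
};

/* Compares two records with a single 64-bit comparison. */
int packed_equal(const union packed *a, const union packed *b)
{
    return a->whole == b->whole;
}
```

Two records compare equal through whole exactly when all of their bytes match, so one integer comparison replaces five field comparisons.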

Unions in various programming languages

ALGOL 68

ALGOL 68 has tagged unions, and uses a case clause to distinguish and extract the constituent type at runtime. A union containing another union is treated as the set of all its constituent possibilities.

The syntax of the C/C++ union type and the notion of casts were derived from ALGOL 68, though in an untagged form. [1]

C/C++

In C and C++, untagged unions are expressed nearly exactly like structures (structs), except that each data member begins at the same location in memory. The data members, as in structures, need not be primitive values, and in fact may be structures or even other unions. C++ (since C++11) also allows for a data member to be any type that has a full-fledged constructor/destructor and/or copy constructor, or a non-trivial copy assignment operator. For example, it is possible to have the standard C++ string as a member of a union.

Like a structure, all of the members of a union are by default public. The keywords private, public, and protected may be used inside a structure or a union in exactly the same way they are used inside a class for defining private, public, and protected member access.

The primary use of a union is allowing access to a common location by different data types, for example hardware input/output access, bitfield and word sharing, or type punning. Unions can also provide low-level polymorphism. However, there is no checking of types, so it is up to the programmer to be sure that the proper fields are accessed in different contexts. The relevant field of a union variable is typically determined by the state of other variables, possibly in an enclosing struct.

One common C programming idiom uses unions to perform what C++ calls a reinterpret_cast, by assigning to one field of a union and reading from another, as is done in code which depends on the raw representation of the values. A practical example is the method of computing square roots using the IEEE representation. This is not, however, a safe use of unions in general.
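
The idiom can be sketched in C as follows. The function name float_bits is hypothetical, and the code assumes that float is a 32-bit IEEE 754 value. Reading a member other than the one last written is permitted in C but is undefined behaviour in C++, where copying the bytes with memcpy is the usual alternative.

```c
#include <stdint.h>

/* Returns the raw bit pattern of a float.
   Assumes float is a 32-bit IEEE 754 binary32 value. */
uint32_t float_bits(float value)
{
    union {
        float    f;
        uint32_t u;
    } pun;

    pun.f = value;  /* write one member ...           */
    return pun.u;   /* ... and read the other (C only) */
}
```

Under the stated assumption, float_bits(1.0f) yields 0x3F800000: a zero sign bit, a biased exponent of 127, and an all-zero fraction.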

Structure and union specifiers have the same form. [ . . . ] The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time. A pointer to a union object, suitably converted, points to each of its members (or if a member is a bit-field, then to the unit in which it resides), and vice versa.

ANSI/ISO 9899:1990 (the ANSI C standard) Section 6.5.2.1
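
The size and address guarantees in the quoted passage can be checked with a short C sketch. The union sample and both function names are hypothetical and chosen only for illustration.

```c
/* Hypothetical union with members of different sizes. */
union sample {
    char   c;
    int    i;
    double d;
};

/* Returns 1 if the union is at least as large as each member. */
int size_covers_members(void)
{
    return sizeof(union sample) >= sizeof(char)
        && sizeof(union sample) >= sizeof(int)
        && sizeof(union sample) >= sizeof(double);
}

/* Returns 1 if every member starts at the address of the union. */
int members_share_address(void)
{
    union sample s;

    return (void *)&s == (void *)&s.c
        && (void *)&s == (void *)&s.i
        && (void *)&s == (void *)&s.d;
}
```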

Anonymous union

In C++, C11, and as a non-standard extension in many compilers, unions can also be anonymous. Their data members do not need to be referenced through a union name and are instead accessed directly. They have some restrictions as opposed to traditional unions: in C11, they must be a member of another structure or union, [2] and in C++, they cannot have methods or access specifiers.

Simply omitting the class-name portion of the syntax does not make a union an anonymous union. For a union to qualify as an anonymous union, the declaration must not declare an object. Example:

    #include <iostream>
    #include <cstdint>

    int main() {
        union {
            float f;
            std::uint32_t d; // Assumes float is 32 bits wide
        };

        f = 3.14f;
        std::cout << "Hexadecimal representation of 3.14f:" << std::hex << d << '\n';
    }

Transparent union

In compilers for Unix-like systems such as GCC, Clang, and IBM XL C for AIX, a transparent_union attribute is available for union types. Types contained in the union can be converted transparently to the union type itself in a function call, provided that all types have the same size. It is mainly intended for functions with multiple parameter interfaces, a use necessitated by early Unix extensions and later re-standardisation. [3]
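
A minimal sketch of the attribute follows, assuming GCC or Clang compiling C (the attribute is a compiler extension and is not part of standard C). The names int_source and read_as_int are hypothetical.

```c
/* Hypothetical transparent union: a caller may pass either pointer
   type directly, without constructing the union or casting. */
typedef union __attribute__((__transparent_union__)) {
    int      *as_int;
    unsigned *as_unsigned;
} int_source;

/* Reads the pointed-to value as an int, whichever pointer type
   the caller supplied. */
int read_as_int(int_source src)
{
    return *src.as_int;
}
```

With this declaration, both read_as_int(&x) for an int x and read_as_int(&u) for an unsigned u are accepted without a cast, because either pointer type is converted to the union type at the call.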

COBOL

In COBOL, union data items are defined in two ways. The first uses the RENAMES (66 level) keyword, which effectively maps a second alphanumeric data item on top of the same memory location as a preceding data item. In the example code below, data item PERSON-REC is defined as a group containing another group and a numeric data item. PERSON-DATA is defined as an alphanumeric data item that renames PERSON-REC, treating the data bytes contained within it as character data.

  01  PERSON-REC.
      05  PERSON-NAME.
          10  PERSON-NAME-LAST    PIC X(12).
          10  PERSON-NAME-FIRST   PIC X(16).
          10  PERSON-NAME-MID     PIC X.
      05  PERSON-ID               PIC 9(9) PACKED-DECIMAL.

  01  PERSON-DATA                 RENAMES PERSON-REC.

The second way to define a union type is by using the REDEFINES keyword. In the example code below, data item VERS-NUM is defined as a 2-byte binary integer containing a version number. A second data item VERS-BYTES is defined as a two-character alphanumeric variable. Since the second item is redefined over the first item, the two items share the same address in memory, and therefore share the same underlying data bytes. The first item interprets the two data bytes as a binary value, while the second item interprets the bytes as character values.

  01  VERS-INFO.
      05  VERS-NUM        PIC S9(4) COMP.
      05  VERS-BYTES      PIC X(2)
                          REDEFINES VERS-NUM

Pascal

In Pascal, there are two ways to create unions. One is the standard way through a variant record. The second is a nonstandard means of declaring a variable as absolute, meaning it is placed at the same memory location as another variable or at an absolute address. While all Pascal compilers support variant records, only some support absolute variables.

For the purposes of this example, the following are all integer types: a byte is 8 bits, a word is 16 bits, and an integer is 32 bits.

The following example shows the non-standard absolute form:

    VAR
        A: Integer;
        B: Array[1..4] of Byte absolute A;
        C: Integer absolute 0;

In the first example, each of the elements of the array B maps to one of the specific bytes of the variable A. In the second example, the variable C is assigned to the exact machine address 0.

In the following example, a record has variants, some of which share the same location as others:

    TYPE
        TSystemTime = record
            Year, Month, DayOfWeek, Day: word;
            Hour, Minute, Second, MilliSecond: word;
        end;
        TGender = (Male, Female, TransFemale, TransMale, Other);
        TPerson = RECORD
            FirstName, Lastname: String;
            Birthdate: TSystemTime;
            Dependents: Integer;
            HourlyRate: Currency;
            Case Gender: TGender of
                Female, TransMale: (
                    isPregnant: Boolean;
                    DateDue: TSystemTime);
                Male, TransFemale: (
                    HasPartner, isPartnerExpecting: Boolean;
                    PartnerDate: TSystemTime);
        END;

In the above example, a TPerson record has the tag field Gender, and the tag divides people into two classes: female or trans male (a person with a gender identity of male who was born with a female body), and male or trans female (a person with a gender identity of female who was born with a male body). In this record, HasPartner and isPregnant occupy the same location, while DateDue and isPartnerExpecting share the same location. While the record has a tag field Gender, the compiler does not enforce access according to the tag's value: any of the variant fields may be accessed regardless of the tag, e.g., even if the tag field Gender holds the value Other.

PL/I

In PL/I the original term for a union was cell, [4] which is still accepted as a synonym for union by several compilers. The union declaration is similar to the structure definition, where elements at the same level within the union declaration occupy the same storage. Elements of the union can be any data type, including structures and arrays. [5] :pp. 192–193 Here vers_num and vers_bytes occupy the same storage locations.

    1  vers_info         union,
       5 vers_num        fixed binary,
       5 vers_bytes      pic '(2)A';

An alternative to a union declaration is the DEFINED attribute, which allows alternative declarations of storage; however, the data types of the base and defined variables must match. [5] :pp. 289–293

Syntax and example

C/C++

In C and C++, the syntax is:

    union <name>
    {
        <datatype>  <1st variable name>;
        <datatype>  <2nd variable name>;
        .
        .
        .
        <datatype>  <nth variable name>;
    } <union variable name>;

A structure can also be a member of a union, as the following example shows:

    union name1
    {
        struct name2
        {
            int     a;
            float   b;
            char    c;
        } svar;
        int     d;
    } uvar;

This example defines a variable uvar as a union (tagged as name1), which contains two members, a structure (tagged as name2) named svar (which in turn contains three members), and an integer variable named d.

Unions may occur within structures and arrays, and vice versa:

    struct
    {
        int flags;
        char *name;
        int utype;
        union
        {
            int ival;
            float fval;
            char *sval;
        } u;
    } symtab[NSYM];

The number ival is referred to as symtab[i].u.ival and the first character of string sval by either of *symtab[i].u.sval or symtab[i].u.sval[0].
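
The access paths just described can be exercised with a short C sketch built around the same symtab declaration. The value of NSYM, the UTYPE_ codes, and the helper functions store_int, store_string and first_char are all hypothetical additions for illustration.

```c
#define NSYM 8  /* hypothetical table size */

/* Hypothetical codes recording which member of u is in use. */
enum { UTYPE_INT, UTYPE_FLOAT, UTYPE_STRING };

struct {
    int flags;
    char *name;
    int utype;
    union {
        int ival;
        float fval;
        char *sval;
    } u;
} symtab[NSYM];

/* Stores an integer value in entry i. */
void store_int(int i, char *name, int value)
{
    symtab[i].name = name;
    symtab[i].utype = UTYPE_INT;
    symtab[i].u.ival = value;
}

/* Stores a string value in entry i. */
void store_string(int i, char *name, char *value)
{
    symtab[i].name = name;
    symtab[i].utype = UTYPE_STRING;
    symtab[i].u.sval = value;
}

/* Returns the first character of the string held in entry i. */
char first_char(int i)
{
    return symtab[i].u.sval[0];  /* equivalent to *symtab[i].u.sval */
}
```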

PHP

Union types were introduced in PHP 8.0. [6]

    class Example
    {
        private int|float $foo;

        public function squareAndAdd(float|int $bar): int|float
        {
            return $bar ** 2 + $this->foo;
        }
    }

TypeScript

Union types are supported in TypeScript. [7]

    function successor(n: number | bigint): number | bigint {
        return ++n
    }

Difference between union and structure

A union is a class all of whose data members are mapped to the same address within its object. The size of an object of a union is, therefore, at least the size of its largest data member; an implementation may add trailing padding for alignment.

In a structure, all of its data members are stored in contiguous memory locations. The size of an object of a struct is, therefore, at least the sum of the sizes of all its data members, and may be larger if padding is inserted between or after members.

This gain in space efficiency, while valuable in certain circumstances, comes at a great cost of safety: the program logic must ensure that it only reads the field most recently written along all possible execution paths. The exception is when unions are used for type conversion: in this case, a certain field is written and the subsequently read field is deliberately different.

As an example illustrating this point, the declaration

    struct foo { int a; float b; }

defines a data object with two members occupying consecutive memory locations:

                    ┌─────┬─────┐
               foo  │  a  │  b  │
                    └─────┴─────┘
                       ↑     ↑
    Memory address:  0150  0154

In contrast, the declaration

    union bar { int a; float b; }

defines a data object with two members occupying the same memory location:

                    ┌─────┐
               bar  │  a  │
                    │  b  │
                    └─────┘
                       ↑
    Memory address:  0150

Structures are used where an "object" is composed of other objects, like a point object consisting of two integers, those being the x and y coordinates:

    typedef struct {
        int x;           // x and y are separate
        int y;
    } tPoint;

Unions are typically used in situations where an object can be one of many things but only one at a time, such as a type-less storage system:

    typedef enum { STR, INT } tType;
    typedef struct {
        tType typ;          // typ is separate.
        union {
            int ival;       // ival and sval occupy same memory.
            char *sval;
        };
    } tVal;
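
One way such a tVal might be consumed is sketched below: the tag typ is inspected first, and only the member it designates is read. The function format_val is a hypothetical helper, and the anonymous union inside the struct assumes C11 or later.

```c
#include <stdio.h>

typedef enum { STR, INT } tType;

typedef struct {
    tType typ;          /* tag: says which member below is valid */
    union {
        int   ival;
        char *sval;
    };
} tVal;

/* Writes a textual form of v into buf, reading only the member
   selected by the tag. Returns the snprintf result, or -1 for an
   unknown tag. */
int format_val(const tVal *v, char *buf, size_t size)
{
    switch (v->typ) {
    case INT:
        return snprintf(buf, size, "%d", v->ival);
    case STR:
        return snprintf(buf, size, "%s", v->sval);
    }
    return -1;
}
```

Keeping every read behind a check of typ is what restores the safety that the bare union lacks; the compiler itself does not enforce it.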

References

  1. Ritchie, Dennis M. (March 1993). "The Development of the C Language". ACM SIGPLAN Notices. 28 (3): 201–208. doi:10.1145/155360.155580. The scheme of type composition adopted by C owes considerable debt to Algol 68, although it did not, perhaps, emerge in a form that Algol's adherents would approve of. The central notion I captured from Algol was a type structure based on atomic types (including structures), composed into arrays, pointers (references), and functions (procedures). Algol 68's concept of unions and casts also had an influence that appeared later.
  2. "6.63 Unnamed Structure and Union Fields". Retrieved 2016-12-29.
  3. "Common Type Attributes: transparent_union". Using the GNU Compiler Collection (GCC).
  4. IBM Corporation (March 1968). IBM System/360 PL/I Language Specifications (PDF). p. 52. Retrieved Jan 22, 2018.
  5. IBM Corporation (Dec 2017). Enterprise PL/I for z/OS PL/I for AIX IBM Developer for z Systems PL/I for windows Language Reference (PDF). Retrieved Jan 22, 2018.
  6. Karunaratne, Ayesh. "PHP 8.0: Union Types". PHP.Watch. Retrieved 30 November 2020.
  7. "Handbook - Unions and Intersection Types". www.typescriptlang.org. Retrieved 30 November 2020.