Tagged union

Last updated

In computer science, a tagged union, also called a variant , variant record, choice type, discriminated union, disjoint union , sum type or coproduct , is a data structure used to hold a value that could take on several different, but fixed, types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type that has several "cases", each of which should be handled correctly when that type is manipulated. This is critical in defining recursive datatypes, in which some component of a value may have the same type as the value itself, for example in defining a type for representing trees, where it is necessary to distinguish multi-node subtrees and leaves. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

Contents

Description

Tagged unions are most important in functional languages such as ML and Haskell, where they are called datatypes (see algebraic data type) and the compiler is able to verify that all cases of a tagged union are always handled, avoiding many types of errors. They can, however, be constructed in nearly any language, and are much safer than untagged unions, often simply called unions, which are similar but do not explicitly keep track of which member of the union is currently in use.

Tagged unions are often accompanied by the concept of a type constructor, which is similar but not the same as a constructor for a class. Type constructors produce a tagged union type, given the initial tag type and the corresponding type.

Mathematically, tagged unions correspond to disjoint or discriminated unions, usually written using +. Given an element of a disjoint union A + B, it is possible to determine whether it came from A or B. If an element lies in both, there will be two effectively distinct copies of the value in A + B, one from A and one from B.

In type theory, a tagged union is called a sum type. Sum types are the dual of product types. Notations vary, but usually the sum type A + B comes with two introduction forms (injections) inj1: AA + B and inj2: BA + B. The elimination form is case analysis, known as pattern matching in ML-style programming languages: if e has type A + B and e1 and e2 have type under the assumptions x: A and y: B respectively, then the term has type . The sum type corresponds to intuitionistic logical disjunction under the Curry–Howard correspondence.

An enumerated type can be seen as a degenerate case: a tagged union of unit types. It corresponds to a set of nullary constructors and may be implemented as a simple tag variable, since it holds no additional data besides the value of the tag.

Many programming techniques and data structures, including rope, lazy evaluation, class hierarchy (see below), arbitrary-precision arithmetic, CDR coding, the indirection bit and other kinds of tagged pointers, etc. are usually implemented using some sort of tagged union.

A tagged union can be seen as the simplest kind of self-describing data format. The tag of the tagged union can be seen as the simplest kind of metadata.

Advantages and disadvantages

The primary advantage of a tagged union over an untagged union is that all accesses are safe, and the compiler can even check that all cases are handled. Untagged unions depend on program logic to correctly identify the currently active field, which may result in strange behavior and hard-to-find bugs if that logic fails.

The primary advantage of a tagged union over a simple record containing a field for each type is that it saves storage by overlapping storage for all the types. Some implementations reserve enough storage for the largest type, while others dynamically adjust the size of a tagged union value as needed. When the value is immutable, it is simple to allocate just as much storage as is needed.

The main disadvantage of tagged unions is that the tag occupies space. Since there are usually a small number of alternatives, the tag can often be squeezed into 2 or 3 bits wherever space can be found, but sometimes even these bits are not available. In this case, a helpful alternative may be folded, computed or encoded tags, where the tag value is dynamically computed from the contents of the union field. Common examples of this are the use of reserved values, where, for example, a function returning a positive number may return -1 to indicate failure, and sentinel values, most often used in tagged pointers.

Sometimes, untagged unions are used to perform bit-level conversions between types, called reinterpret casts in C++. Tagged unions are not intended for this purpose; typically a new value is assigned whenever the tag is changed.

Many languages support, to some extent, a universal data type, which is a type that includes every value of every other type, and often a way is provided to test the actual type of a value of the universal type. These are sometimes referred to as variants. While universal data types are comparable to tagged unions in their formal definition, typical tagged unions include a relatively small number of cases, and these cases form different ways of expressing a single coherent concept, such as a data structure node or instruction. Also, there is an expectation that every possible case of a tagged union will be dealt with when it is used. The values of a universal data type are not related and there is no feasible way to deal with them all.

Like option types and exception handling, tagged unions are sometimes used to handle the occurrence of exceptional results. Often these tags are folded into the type as "reserved values", and their occurrence is not consistently checked: this is a fairly common source of programming errors. This use of tagged unions can be formalized as a monad with the following functions:

where "value" and "err" are the constructors of the union type, A and B are valid result types and E is the type of error conditions. Alternately, the same monad may be described by return and two additional functions, fmap and join:

Examples

Say we wanted to build a binary tree of integers. In ML, we would do this by creating a datatype like this:

datatypetree=Leaf|Nodeof(int*tree*tree)

This is a tagged union with two cases: one, the leaf, is used to terminate a path of the tree, and functions much like a null value would in imperative languages. The other branch holds a node, which contains an integer and a left and right subtree. Leaf and Node are the constructors, which enable us to actually produce a particular tree, such as:

Node(5,Node(1,Leaf,Leaf),Node(3,Leaf,Node(4,Leaf,Leaf)))

which corresponds to this tree:

The tree produced by the above constructors Tagged union tree.svg
The tree produced by the above constructors

Now we can easily write a typesafe function that, say, counts the number of nodes in the tree:

funcountNodes(Leaf)=0|countNodes(Node(int,left,right))=1+countNodes(left)+countNodes(right)

Timeline of language support

1960s

In ALGOL 68, tagged unions are called united modes, the tag is implicit, and the case construct is used to determine which field is tagged:

modenode = union (real, int, compl, string);

Usage example for unioncase of node:

node n := "1234";  case n in   (real r):   print(("real:", r)),   (int i):    print(("int:", i)),   (compl c):  print(("compl:", c)),   (string s): print(("string:", s))   out         print(("?:", n)) esac

1970s & 1980s

Although primarily only functional languages such as ML (from the 1070s) and Haskell (from 1990s) give a central role to tagged unions and have the power to check that all cases are handled, other languages have support for tagged unions as well. However, in practice they can be less efficient in non-functional languages due to optimizations enabled by functional language compilers that can eliminate explicit tag checks and avoid explicit storage of tags.[ citation needed ]

Pascal, Ada, and Modula-2 call them variant records (formally discriminated type in Ada), and require the tag field to be manually created and the tag values specified, as in this Pascal example:

typeshapeKind=(square,rectangle,circle);shape=recordcenterx:integer;centery:integer;casekind:shapeKindofsquare:(side:integer);rectangle:(length,height:integer);circle:(radius:integer);end;

and this Ada equivalent:

typeShape_Kindis(Square,Rectangle,Circle);typeShape(Kind: Shape_Kind)isrecordCenter_X:Integer;Center_Y:Integer;caseKindiswhenSquare=>Side:Integer;whenRectangle=>Length,Height:Integer;whenCircle=>Radius:Integer;endcase;end record;-- Any attempt to access a member whose existence depends-- on a particular value of the discriminant, while the-- discriminant is not the expected one, raises an error.

In C and C++, a tagged union can be created from untagged unions using a strict access discipline where the tag is always checked:

enumShapeKind{Square,Rectangle,Circle};structShape{intcenterx;intcentery;enumShapeKindkind;union{struct{intside;};/* Square */struct{intlength,height;};/* Rectangle */struct{intradius;};/* Circle */};};intgetSquareSide(structShape*s){assert(s->kind==Square);returns->side;}voidsetSquareSide(structShape*s,intside){s->kind=Square;s->side=side;}/* and so on */

As long as the union fields are only accessed through the functions, the accesses will be safe and correct. The same approach can be used for encoded tags; we simply decode the tag and then check it on each access. If the inefficiency of these tag checks is a concern, they may be automatically removed in the final version.

C and C++ also have language support for one particular tagged union: the possibly-null pointer. This may be compared to the option type in ML or the Maybe type in Haskell, and can be seen as a tagged pointer: a tagged union (with an encoded tag) of two types:

Unfortunately, C compilers do not verify that the null case is always handled, and this is a particularly prevalent source of errors in C code, since there is a tendency to ignore exceptional cases.

2000s

One advanced dialect of C called Cyclone has extensive built-in support for tagged unions. [1]

The enum types in the Rust, Haxe and Swift languages also work as tagged unions.

The variant library from Boost has demonstrated it was possible to implement a safe tagged union as a library in C++, visitable using function objects.

structdisplay:boost::static_visitor<void>{voidoperator()(inti){std::cout<<"It's an int, with value "<<i<<std::endl;}voidoperator()(conststd::string&s){std::cout<<"It's a string, with value "<<s<<std::endl;}};boost::variant<int,std::string>v=42;boost::apply_visitor(display(),v);boost::variant<int,std::string>v="hello world";boost::apply_visitor(display(),v);

Scala has case classes:

sealedabstractclassTreecaseobjectLeafextendsTreecaseclassNode(value:Int,left:Tree,right:Tree)extendsTreevaltree=Node(5,Node(1,Leaf,Leaf),Node(3,Leaf,Node(4,Leaf,Leaf)))

Because the class hierarchy is sealed, the compiler can check that all cases are handled in a pattern match:

treematch{caseNode(x,_,_)=>println("top level node value: "+x)caseLeaf=>println("top level node is a leaf")}

Scala's case classes also permit reuse through subtyping:

sealedabstractclassShape(centerX:Int,centerY:Int)caseclassSquare(side:Int,centerX:Int,centerY:Int)extendsShape(centerX,centerY)caseclassRectangle(length:Int,height:Int,centerX:Int,centerY:Int)extendsShape(centerX,centerY)caseclassCircle(radius:Int,centerX:Int,centerY:Int)extendsShape(centerX,centerY)

F# has discriminated unions:

typeTree=|Leaf|Nodeofvalue:int*left:Tree*right:Treelettree=Node(5,Node(1,Leaf,Leaf),Node(3,Leaf,Node(4,Leaf,Leaf)))

Because the defined cases are exhaustive, the compiler can check that all cases are handled in a pattern match:

matchtreewith|Node(x,_,_)->printfn"top level node value: %i"x|Leaf->printfn"top level node is a leaf"

Haxe's enums also work as tagged unions: [2]

enumColor{Red;Green;Blue;Rgb(r:Int,g:Int,b:Int);}

These can be matched using a switch expression:

switch(color){caseRed:trace("Color was red");caseGreen:trace("Color was green");caseBlue:trace("Color was blue");caseRgb(r,g,b):trace("Color had a red value of "+r);}

Nim has object variants [3] similar in declaration to those in Pascal and Ada:

typeShapeKind=enumskSquare,skRectangle,skCircleShape=objectcenterX,centerY:intcasekind:ShapeKindofskSquare:side:intofskRectangle:length,height:intofskCircle:radius:int

Macros can be used to emulate pattern matching or to create syntactic sugar for declaring object variants, seen here as implemented by the package patty:

importpattyproc `~`[A](a:A):refA=new(result)result[]=avariantList[A]:NilCons(x:A,xs:refList[A])proc listHelper[A](xs:seq[A]):List[A]=ifxs.len==0:Nil[A]()else:Cons(xs[0],~listHelper(xs[1..xs.high]))proc list[A](xs:varargs[A]):List[A]=listHelper(@xs)proc sum(xs:List[int]):int=(block:matchxs:Nil:0Cons(y,ys):y+sum(ys[]))echosum(list(1,2,3,4,5))

2010s

The Rust language has extensive support for tagged unions, called enums. [4] For example:

enumTree{Leaf,Node(i64,Box<Tree>,Box<Tree>)}

It also allows matching on unions:

lettree=Tree::Node(2,Box::new(Tree::Node(0,Box::new(Tree::Leaf),Box::new(Tree::Leaf))),Box::new(Tree::Node(3,Box::new(Tree::Leaf),Box::new(Tree::Node(4,Box::new(Tree::Leaf),Box::new(Tree::Leaf))))));fnadd_values(tree: Tree)-> i64{matchtree{Tree::Node(v,a,b)=>v+add_values(*a)+add_values(*b),Tree::Leaf=>0}}assert_eq!(add_values(tree),9);

Rust's error handling model relies extensively on these tagged unions, especially the Option<T> type, which is either None or Some(T), and the Result<T, E> type, which is either Ok(T) or Err(E). [5]

Swift also has substantial support for tagged unions via enumerations. [6] For example:

enumTree{caseleafindirectcasenode(Int,Tree,Tree)}lettree=Tree.node(2,.node(0,.leaf,.leaf),.node(3,.leaf,.node(4,.leaf,.leaf)))funcadd_values(_tree:Tree)->Int{switchtree{caselet.node(v,a,b):returnv+add_values(a)+add_values(b)case.leaf:return0}}assert(add_values(tree)==9)

With TypeScript it is possible to create tagged unions as well. For example:

interfaceLeaf{type:"leaf";value: string;}interfaceNode{type:"node";left: Tree;right: Tree;}typeTree=Leaf|Nodefunctionvisit(tree: Tree){switch(tree.type){case"leaf":console.log(tree.value)breakcase"node":visit(tree.left)visit(tree.right)break}}

Class hierarchies as tagged unions

In a typical class hierarchy in object-oriented programming, each subclass can encapsulate data unique to that class. The metadata used to perform virtual method lookup (for example, the object's vtable pointer in most C++ implementations) identifies the subclass and so effectively acts as a tag identifying the particular data stored by the instance (see RTTI). An object's constructor sets this tag, and it remains constant throughout the object's lifetime.

Nevertheless, a class hierarchy involves true subtype polymorphism; it can be extended by creating further subclasses of the same base type, which could not be handled correctly under a tag/dispatch model. Hence, it is usually not possible to do case analysis or dispatch on a subobject's 'tag' as one would for tagged unions. Some languages such as Scala allow base classes to be "sealed", and unify tagged unions with sealed base classes.

See also

Related Research Articles

Standard ML (SML) is a general-purpose, modular, functional programming language with compile-time type checking and type inference. It is popular among compiler writers and programming language researchers, as well as in the development of theorem provers.

In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created. In some cases, an object is considered immutable even if some internally used attributes change, but the object's state appears unchanging from an external point of view. For example, an object that uses memoization to cache the results of expensive computations could still be considered an immutable object.

Template metaprogramming (TMP) is a metaprogramming technique in which templates are used by a compiler to generate temporary source code, which is merged by the compiler with the rest of the source code and then compiled. The output of these templates include compile-time constants, data structures, and complete functions. The use of templates can be thought of as compile-time polymorphism. The technique is used by a number of languages, the best-known being C++, but also Curl, D, and XL.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

In computer programming, especially functional programming and type theory, an algebraic data type is a kind of composite type, i.e., a type formed by combining other types.

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In computer science, a union is a value that may have any of several representations or formats within the same position in memory; that consists of a variable that may hold such a data structure. Some programming languages support special data types, called union types, to describe such values and variables. In other words, a union type definition will specify which of a number of permitted primitive types may be stored in its instances, e.g., "float or long integer". In contrast with a record, which could be defined to contain a float and an integer; in a union, there is only one value at any given time.

In computer science, object composition is a way to combine objects or data types into more complex ones. Common kinds of compositions are objects used in object-oriented programming, tagged unions, sets, sequences, and various graph structures. Object compositions relate to, but are not the same as, data structures.

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

typedef is a reserved keyword in the programming languages C and C++. It is used to create an additional name (alias) for another data type, but does not create a new type, except in the obscure case of a qualified typedef of an array type where the typedef qualifiers are transferred to the array element type. As such, it is often used to simplify the syntax of declaring complex data structures consisting of struct and union types, but is just as common in providing specific descriptive type names for integer data types of varying lengths.

A class in C++ is a user-defined type or data structure declared with keyword class that has data and functions as its members whose access is governed by the three access specifiers private, protected or public. By default access to members of a C++ class is private. The private members are not accessible outside the class; they can be accessed only through methods of the class. The public members form an interface to the class and are accessible outside the class.

Recursion (computer science) Use of functions that call themselves

In computer science, recursion is a method of solving a problem where the solution depends on solutions to smaller instances of the same problem. Such problems can generally be solved by iteration, but this needs to identify and index the smaller instances at programming time. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

sizeof is a unary operator in the programming languages C and C++. It generates the storage size of an expression or a data type, measured in the number of char-sized units. Consequently, the construct sizeof (char) is guaranteed to be 1. The actual number of bits of type char is specified by the preprocessor macro CHAR_BIT, defined in the standard include file limits.h. On most modern computing platforms this is eight bits. The result of sizeof has an unsigned integer type that is usually denoted by size_t.

The C and C++ programming languages are closely related but have many significant differences. C++ began as a fork of an early, pre-standardized C, and was designed to be mostly source-and-link compatible with C compilers of the time. Due to this, development tools for the two languages are often integrated into a single product, with the programmer able to specify C or C++ as their source language.

In computer programming, an enumerated type is a data type consisting of a set of named values called elements, members, enumeral, or enumerators of the type. The enumerator names are usually identifiers that behave as constants in the language. An enumerated type can be seen as a degenerate tagged union of unit type. A variable that has been declared as having an enumerated type can be assigned any of the enumerators as a value. In other words, an enumerated type has values that are different from each other, and that can be compared and assigned, but are not specified by the programmer as having any particular concrete representation in the computer's memory; compilers and interpreters can represent them arbitrarily.

C++11 is a version of the standard for the programming language C++. It was approved by International Organization for Standardization (ISO) on 12 August 2011, replacing C++03, superseded by C++14 on 18 August 2014 and later, by C++17. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

In computer science, type punning is a common term for any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

C++ doesn't have:

This article compares a large number of programming languages by tabulating their data types, their expression, statement, and declaration syntax, and some common operating-system interfaces.

This article describes the features in Haskell.

References

  1. http://cyclone.thelanguage.org/wiki/Tagged%20Unions
  2. "Using Enums - Haxe - The Cross-platform Toolkit". Haxe Foundation.
  3. "Nim Manual". nim-lang.org. Retrieved 2020-01-23.
  4. "The Rust Programming Language". Mozilla.
  5. "Rust By Example". Mozilla.
  6. "Enumerations — The Swift Programming Language (Swift 5.4)". docs.swift.org. Retrieved 2021-04-28.