Associative array

Last updated

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Contents

Operations associated with this data type allow to: [1] [2]

Implementing associative arrays poses the dictionary problem, a classic computer science problem: the task of designing a data structure that maintains a set of data during 'search', 'delete', and 'insert' operations. [3] The two major solutions to the dictionary problem are a hash table and a search tree. [1] [2] [4] [5] In some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.

Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays.

Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern. [6]

The name does not come from the associative property known in mathematics. Rather, it arises from the fact that we associate values with keys.

Operations

In an associative array, the association between a key and a value is often known as a "mapping", and the same word mapping may also be used to refer to the process of creating a new association.

The operations that are usually defined for an associative array are: [1] [2]

Often, instead of add or reassign there is a single set operation that adds a new pair if one does not already exist, and otherwise reassigns it.

In addition, associative arrays may also include other operations such as determining the number of mappings or constructing an iterator to loop over all the mappings. Usually, for such an operation, the order in which the mappings are returned may be implementation-defined.

A multimap generalizes an associative array by allowing multiple values to be associated with a single key. [7] A bidirectional map is a related abstract data type in which the mappings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as an argument and looks up the key associated with that value.

Example

Suppose that the set of loans made by a library is represented in a data structure. Each book in a library may be checked out only by a single library patron at a time. However, a single patron may be able to check out multiple books. Therefore, the information about which books are checked out to which patrons may be represented by an associative array, in which the books are the keys and the patrons are the values. Using notation from Python or JSON, the data structure would be:

{"Pride and Prejudice":"Alice","Wuthering Heights":"Alice","Great Expectations":"John"}

A lookup operation on the key "Great Expectations" would return "John". If John returns his book, that would cause a deletion operation, and if Pat checks out a book, that would cause an insertion operation, leading to a different state:

{"Pride and Prejudice":"Alice","The Brothers Karamazov":"Pat","Wuthering Heights":"Alice"}

Implementation

For dictionaries with very small numbers of mappings, it may make sense to implement the dictionary using an association list, a linked list of mappings. With this implementation, the time to perform the basic dictionary operations is linear in the total number of mappings; however, it is easy to implement and the constant factors in its running time are small. [1] [8]

Another very simple implementation technique, usable when the keys are restricted to a narrow range, is direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no mapping for k then the cell stores a special sentinel value that indicates the absence of a mapping. As well as being simple, this technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is the size of the entire keyspace, making it impractical unless the keyspace is small. [4]

The two major approaches to implementing dictionaries are a hash table or a search tree. [1] [2] [4] [5]

Hash table implementations

This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically. Hash table average insertion time.png
This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically.

The most frequently used general purpose implementation of an associative array is with a hash table: an array combined with a hash function that separates each key into a separate "bucket" of the array. The basic idea behind a hash table is that accessing an element of an array via its index is a simple, constant-time operation. Therefore, the average overhead of an operation for a hash table is only the computation of the key's hash, combined with accessing the corresponding bucket within the array. As such, hash tables usually perform in O(1) time, and outperform alternatives in most situations.

Hash tables need to be able to handle collisions: when the hash function maps two different keys to the same bucket of the array. The two most widespread approaches to this problem are separate chaining and open addressing. [1] [2] [4] [9] In separate chaining, the array does not store the value itself but stores a pointer to another container, usually an association list, that stores all of the values matching the hash. On the other hand, in open addressing, if a hash collision is found, then the table seeks an empty spot in an array to store the value in a deterministic manner, usually by looking at the next immediate position in the array.

Open addressing has a lower cache miss ratio than separate chaining when the table is mostly empty. However, as the table becomes filled with more elements, open addressing's performance degrades exponentially. Additionally, separate chaining uses less memory in most cases, unless the entries are very small (less than four times the size of a pointer).

Tree implementations

Self-balancing binary search trees

Another common approach is to implement an associative array with a self-balancing binary search tree, such as an AVL tree or a red–black tree. [10]

Compared to hash tables, these structures have both advantages and weaknesses. The worst-case performance of self-balancing binary search trees is significantly better than that of a hash table, with a time complexity in big O notation of O(log n). This is in contrast to hash tables, whose worst-case performance involves all elements sharing a single bucket, resulting in O(n) time complexity. In addition, and like all binary search trees, self-balancing binary search trees keep their elements in order. Thus, traversing its elements follows a least-to-greatest pattern, whereas traversing a hash table can result in elements being in seemingly random order. However, hash tables have a much better average-case time complexity than self-balancing binary search trees of O(1), and their worst-case performance is highly unlikely when a good hash function is used.

It is worth noting that a self-balancing binary search tree can be used to implement the buckets for a hash table that uses separate chaining. This allows for average-case constant lookup, but assures a worst-case performance of O(log n). However, this introduces extra complexity into the implementation, and may cause even worse performance for smaller hash tables, where the time spent inserting into and balancing the tree is greater than the time needed to perform a linear search on all of the elements of a linked list or similar data structure. [11] [12]

Other trees

Associative arrays may also be stored in unbalanced binary search trees or in data structures specialized to a particular type of keys such as radix trees, tries, Judy arrays, or van Emde Boas trees, though the ability of these implementation methods within comparison to hash tables varies; for instance, Judy trees remain indicated to perform with a smaller quantity of efficiency than hash tables, while carefully selected hash tables generally perform with increased efficiency in comparison to adaptive radix trees, with potentially greater restrictions on the types of data that they can handle. [13] The advantages of these alternative structures come from their ability to handle operations beyond the basic ones of an associative array, such as finding the mapping whose key is the closest to a queried key, when the query is not itself present in the set of mappings.

Comparison

Underlying data structureLookupInsertionDeletionOrdered
averageworst caseaverageworst caseaverageworst case
Hash table O(1)O(n)O(1)O(n)O(1)O(n)No
Self-balancing binary search tree O(log n)O(log n)O(log n)O(log n)O(log n)O(log n)Yes
unbalanced binary search tree O(log n)O(n)O(log n)O(n)O(log n)O(n)Yes
Sequential container of key–value pairs
(e.g. association list)
O(n)O(n)O(1)O(1)O(n)O(n)No

Ordered dictionary

The basic definition of the dictionary does not mandate an order. To guarantee a fixed order of enumeration, ordered versions of the associative array are often used. There are two senses of an ordered dictionary:

The latter sense of ordered dictionaries are more commonly encountered. They can be implemented using an association list, or by overlaying a doubly linked list on top of a normal dictionary. The latter approach, as used by CPython before version 3.6, has the advantage of keeping the potentially better complexity of another implementation. [17] CPython 3.6+ makes dictionaries ordered by splitting the hash table into an insertion-ordered array of k-v pairs and a sparse array ("hash table") of indices. [18]

Language support

Associative arrays can be implemented in any programming language as a package and many language systems provide them as part of their standard library. In some languages, they are not only built into the standard system, but have special syntax, often using array-like subscripting.

Built-in syntactic support for associative arrays was introduced in 1969 by SNOBOL4, under the name "table". TMG offered tables with string keys and integer values. MUMPS made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Rexx, Perl, PHP, Tcl, JavaScript, Maple, Python, Ruby, Wolfram Language, Go, and Lua, support associative arrays as a primary container type. In many more languages, they are available as library functions without special syntax.

In Smalltalk, Objective-C, .NET, [19] Python, REALbasic, Swift, VBA and Delphi [20] they are called dictionaries; in Perl, Ruby and Seed7 they are called hashes; in C++, Java, Go, Clojure, Scala, OCaml, Haskell they are called maps (see map (C++), unordered_map (C++), and Map ); in Common Lisp and Windows PowerShell, they are called hash tables (since both typically use this implementation); in Maple and Lua, they are called tables. In PHP, all arrays can be associative, except that the keys are limited to integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays with string-valued keys, while the Map and WeakMap types take arbitrary objects as keys. In Lua, they are used as the primitive building block for all data structures. In Visual FoxPro, they are called Collections. The D language also has support for associative arrays. [21]

Permanent storage

Many programs using associative arrays will at some point need to store that data in a more permanent form, like in a computer file. A common solution to this problem is a generalized concept known as archiving or serialization , which produces a text or binary representation of the original objects that can be written directly to a file. This is most commonly implemented in the underlying object model, like .Net or Cocoa, which include standard functions that convert the internal data into text form. The program can create a complete text representation of any group of objects by calling these methods, which are almost always already implemented in the base associative array class. [22]

For programs that use very large data sets, this sort of individual file storage is not appropriate, and a database management system (DB) is required. Some DB systems natively store associative arrays by serializing the data and then storing that serialized data and the key. Individual arrays can then be loaded or saved from the database using the key to refer to them. These key–value stores have been used for many years and have a history as long as that as the more common relational database (RDBs), but a lack of standardization, among other reasons, limited their use to certain niche roles. RDBs were used for these roles in most cases, although saving objects to a RDB can be complicated, a problem known as object-relational impedance mismatch.

After c.2010, the need for high performance databases suitable for cloud computing and more closely matching the internal structure of the programs using them led to a renaissance in the key–value store market. These systems can store and retrieve associative arrays in a native fashion, which can greatly improve performance in common web-related workflows.

See also

Related Research Articles

Binary search algorithm Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

Binary search tree Data structure in tree form sorted for fast lookup

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree whose internal nodes each store a key greater than all the keys in the node's left subtree and less than those in its right subtree. A binary tree is a type of data structure for storing data such as numbers in an organized way. Binary search trees allow binary search for fast lookup, addition and removal of data items, and can be used to implement dynamic sets and lookup tables. The order of nodes in a BST means that each comparison skips about half of the remaining tree, so the whole lookup takes time proportional to the binary logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an (unsorted) array, but slower than the corresponding operations on hash tables. Several variants of the binary search tree have been studied.

Hash function Type of function that maps data of arbitrary size to data of fixed size

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

Hash table Associates data values with key values – a lookup table

In computing, a hash table is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

Heap (data structure) Computer science data structure

In computer science, a heap is a specialized tree-based data structure which is essentially an almost complete tree that satisfies the heap property: in a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

Trie A type of search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer programming, an iterator is an object that enables a programmer to traverse a container, particularly lists. Various types of iterators are often provided via a container's interface. Though the interface and semantics of a given iterator are fixed, iterators are often implemented in terms of the structures underlying a container implementation and are often tightly coupled to the container to enable the operational semantics of the iterator. An iterator performs traversal and also gives access to data elements in a container, but does not itself perform iteration. An iterator is behaviorally similar to a database cursor. Iterators date to the CLU programming language in 1974.

In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

Self-balancing binary search tree Any node-based binary search tree that automatically keeps its height small

In computer science, a self-balancingbinary search tree is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions.

In computing, a persistent data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article.

Radix tree

In computer science, a radix tree is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at most the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike regular trees, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets and for sets of strings that share long prefixes.

In computer science, a multimap is a generalization of a map or associative array abstract data type in which more than one value may be associated with and returned for a given key. Both map and multimap are particular cases of containers. Often the multimap is implemented as a map with lists or sets as the map values.

In computer science, a ternary search tree is a type of trie where nodes are arranged in a manner similar to a binary search tree, but with up to three children rather than the binary tree's limit of two. Like other prefix trees, a ternary search tree can be used as an associative map structure with the ability for incremental string search. However, ternary search trees are more space efficient compared to standard prefix trees, at the cost of speed. Common applications for ternary search trees include spell-checking and auto-completion.

Java collections framework

The Java collections framework is a set of classes and interfaces that implement commonly reusable collection data structures.

In computer science, a container is a class, a data structure, or an abstract data type (ADT) whose instances are collections of other objects. In other words, they store objects in an organized way that follows specific access rules. The size of the container depends on the number of objects (elements) it contains. Underlying (inherited) implementations of various container types may vary in size and complexity, and provide flexibility in choosing the right implementation for any given scenario.

In computer science, a collection or container is a grouping of some variable number of data items that have some shared significance to the problem being solved and need to be operated upon together in some controlled fashion. Generally, the data items will be of the same type or, in languages supporting inheritance, derived from some common ancestor type. A collection is a concept applicable to abstract data types, and does not prescribe a specific implementation as a concrete data structure, though often there is a conventional choice.

This Comparison of programming languages compares the features of associative array data structures or array-lookup processing for over 40 computer programming languages.

In computer science, an x-fast trie is a data structure for storing integers from a bounded domain. It supports exact and predecessor or successor queries in time O(log log M), using O(n log M) space, where n is the number of stored values and M is the maximum value in the domain. The structure was proposed by Dan Willard in 1982, along with the more complicated y-fast trie, as a way to improve the space usage of van Emde Boas trees, while retaining the O(log log M) query time.

In computing, associative containers refer to a group of class templates in the standard library of the C++ programming language that implement ordered associative arrays. Being templates, they can be used to store arbitrary elements, such as integers or custom classes. The following containers are defined in the current revision of the C++ standard: set, map, multiset, multimap. Each of these containers differ only on constraints placed on their elements.

References

  1. 1 2 3 4 5 6 Goodrich, Michael T.; Tamassia, Roberto (2006), "9.1 The Map Abstract Data Type", Data Structures & Algorithms in Java (4th ed.), Wiley, pp. 368–371
  2. 1 2 3 4 5 Mehlhorn, Kurt; Sanders, Peter (2008), "4 Hash Tables and Associative Arrays", Algorithms and Data Structures: The Basic Toolbox (PDF), Springer, pp. 81–98
  3. Andersson, Arne (1989). "Optimal Bounds on the Dictionary Problem". Proc. Symposium on Optimal Algorithms. Lecture Notes in Computer Science. Springer Verlag. 401: 106–114. doi:10.1007/3-540-51859-2_10. ISBN   978-3-540-51859-4.
  4. 1 2 3 4 Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "11 Hash Tables", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN   0-262-03293-7 .
  5. 1 2 Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. "Dynamic Perfect Hashing: Upper and Lower Bounds" Archived 2016-03-04 at the Wayback Machine . SIAM J. Comput. 23, 4 (Aug. 1994), 738-761. http://portal.acm.org/citation.cfm?id=182370 doi : 10.1137/S0097539791194094
  6. Goodrich & Tamassia (2006), pp. 597–599.
  7. Goodrich & Tamassia (2006), pp. 389–397.
  8. "When should I use a hash table instead of an association list?". lisp-faq/part2. 1996-02-20.
  9. Klammer, F.; Mazzolini, L. (2006), "Pathfinders for associative maps", Ext. Abstracts GIS-l 2006, GIS-I, pp. 71–74.
  10. Joel Adams and Larry Nyhoff. "Trees in STL". Quote: "The Standard Template library ... some of its containers -- the set<T>, map<T1, T2>, multiset<T>, and multimap<T1, T2> templates -- are generally built using a special kind of self-balancing binary search tree called a red–black tree."
  11. Knuth, Donald (1998). The Art of Computer Programming. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN   0-201-89685-0.
  12. Probst, Mark (2010-04-30). "Linear vs Binary Search" . Retrieved 2016-11-20.
  13. Alvarez, Victor; Richter, Stefan; Chen, Xiao; Dittrich, Jens (April 2015). "A comparison of adaptive radix trees and hash tables". 2015 IEEE 31st International Conference on Data Engineering. Seoul, South Korea: IEEE: 1227–1238. doi:10.1109/ICDE.2015.7113370. ISBN   978-1-4799-7964-6. S2CID   17170456.
  14. "std::map". en.cppreference.com.
  15. "OrderedDictionary Class (System.Collections.Specialized)". MS Docs.
  16. "collections — Container datatypes — Python 3.9.0a3 documentation". docs.python.org.
  17. Dimitris Fasarakis Hilliard. "dictionary - How does Python's OrderedDict remember elements inserted?". Stack Overflow.
  18. Dimitris Fasarakis Hilliard. "Are dictionaries ordered in Python 3.6+?". Stack Overflow.
  19. "Dictionary<TKey, TValue> Class". MSDN.
  20. "System.Generics.Collections.TDictionary - RAD Studio API Documentation". docwiki.embarcadero.com. Retrieved 2017-04-18.
  21. "Associative Arrays, the D programming language". Digital Mars.
  22. "Archives and Serializations Programming Guide", Apple Inc., 2012