Hash array mapped trie

Last updated

A hash array mapped trie [1] (HAMT) is an implementation of an associative array that combines the characteristics of a hash table and an array mapped trie. [1] It is a refined version of the more general notion of a hash tree.

Contents

Operation

A HAMT is an array mapped trie where the keys are first hashed to ensure an even distribution of keys and a constant key length.

In a typical implementation of HAMT's array mapped trie, each node contains a table with some fixed number N of slots with each slot containing either a nil pointer or a pointer to another node. N is commonly 32. As allocating space for N pointers for each node would be expensive, each node instead contains a bitmap which is N bits long where each bit indicates the presence of a non-nil pointer. This is followed by an array of pointers equal in length to the number of ones in the bitmap (its Hamming weight).

Advantages of HAMTs

The hash array mapped trie achieves almost hash table-like speed while using memory much more economically. Also, a hash table may have to be periodically resized, an expensive operation, whereas HAMTs grow dynamically. Generally, HAMT performance is improved by a larger root table with some multiple of N slots; some HAMT variants allow the root to grow lazily [1] with negligible impact on performance.

Implementation details

Implementation of a HAMT involves the use of the population count function, which counts the number of ones in the binary representation of a number. This operation is available in many instruction set architectures, but it is available in only some high-level languages. Although population count can be implemented in software in O(1) time using a series of shift and add instructions, doing so may perform the operation an order of magnitude slower.[ citation needed ]

Implementations

The programming languages Clojure, [2] Scala, and Frege [3] use a persistent variant of hash array mapped tries for their native hash map type. The Haskell library "unordered-containers" uses the same to implement persistent map and set data structures. [4] Another Haskell library "stm-containers" adapts the algorithm for use in the context of software transactional memory. [5] A Javascript HAMT library [6] based on the Clojure implementation is also available. The Rubinius [7] implementation of Ruby includes a HAMT, mostly written in Ruby but with 3 [8] primitives. Large maps in Erlang use a persistent HAMT representation internally since release 18.0. [9] The Pony programming language uses a HAMT for the hash map in its persistent collections package. [10] The im and im-rc crates, which provide persistent collection types for the Rust programming language, use a HAMT for their persistent hash tables and hash sets. [11]

The concurrent lock-free version [12] of the hash trie called Ctrie is a mutable thread-safe implementation which ensures progress. The data-structure has been proven to be correct [13] - Ctrie operations have been shown to have the atomicity, linearizability and lock-freedom properties.

See also

Related Research Articles

<span class="mw-page-title-main">Hash table</span> Associative array for storing key-value pairs

In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of pairs, such that each possible key appears at most once in the collection. In mathematical terms, an associative array is a function with finite domain. It supports 'lookup', 'remove', and 'insert' operations.

In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article.

<span class="mw-page-title-main">Radix tree</span> Data structure

In computer science, a radix tree is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at most the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike regular trees, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets and for sets of strings that share long prefixes.

In computer science, a multimap is a generalization of a map or associative array abstract data type in which more than one value may be associated with and returned for a given key. Both map and multimap are particular cases of containers. Often the multimap is implemented as a map with lists or sets as the map values.

C++ Technical Report 1 (TR1) is the common name for ISO/IEC TR 19768, C++ Library Extensions, which is a document that proposed additions to the C++ standard library for the C++03 language standard. The additions include regular expressions, smart pointers, hash tables, and random number generators. TR1 was not a standard itself, but rather a draft document. However, most of its proposals became part of the later official standard, C++11. Before C++11 was standardized, vendors used this document as a guide to create extensions. The report's goal was "to build more widespread existing practice for an expanded C++ standard library".

In computer science, hash trie can refer to:

<span class="mw-page-title-main">Java collections framework</span>

The Java collections framework is a set of classes and interfaces that implement commonly reusable collection data structures.

<span class="mw-page-title-main">Clojure</span> Dialect of the Lisp programming language on the Java platform

Clojure is a dynamic and functional dialect of the Lisp programming language on the Java platform.

In the programming language C++, unordered associative containers are a group of class templates in the C++ Standard Library that implement hash table variants. Being templates, they can be used to store arbitrary elements, such as integers or custom classes. The following containers are defined in the current revision of the C++ standard: unordered_set, unordered_map, unordered_multiset, unordered_multimap. Each of these containers differ only on constraints placed on their elements.

<span class="mw-page-title-main">Go (programming language)</span> Programming language

Go is a statically typed, compiled high-level programming language designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson. It is syntactically similar to C, but with memory safety, garbage collection, structural typing, and CSP-style concurrency. It is often referred to as Golang because of its former domain name, golang.org, but its proper name is Go.

Mustache is a web template system with implementations available for ActionScript, C++, Clojure, CoffeeScript, ColdFusion, Common Lisp, Crystal, D, Dart, Delphi, Elixir, Erlang, Fantom, Go, Haskell, Io, Java, JavaScript, Julia, Lua, .NET, Objective-C, OCaml, Perl, PHP, Pharo, Python, R, Racket, Raku, Ruby, Rust, Scala, Smalltalk, Swift, Tcl, CFEngine, and XQuery.

A concurrent hash-trie or Ctrie is a concurrent thread-safe lock-free implementation of a hash array mapped trie. It is used to implement the concurrent map abstraction. It has particularly scalable concurrent insert and remove operations and is memory-efficient. It is the first known concurrent data-structure that supports O(1), atomic, lock-free snapshots.

In computer science, a hash tree is a persistent data structure that can be used to implement sets and maps, intended to replace hash tables in purely functional programming. In its basic form, a hash tree stores the hashes of its keys, regarded as strings of bits, in a trie, with the actual keys and (optional) values stored at the trie's "final" nodes.

The HAT-trie is a type of radix trie that uses array nodes to collect individual key–value pairs under radix nodes and hash buckets into an associative array. Unlike a simple hash table, HAT-tries store key–value in an ordered collection. The original inventors are Nikolas Askitis and Ranjan Sinha. Askitis & Zobel showed that building and accessing the HAT-trie key/value collection is considerably faster than other sorted access methods and is comparable to the array hash which is an unsorted collection. This is due to the cache-friendly nature of the data structure which attempts to group access to data in time and space into the 64 byte cache line size of the modern CPU.

<span class="mw-page-title-main">Concurrent hash table</span>

A concurrent hash table or concurrent hash map is an implementation of hash tables allowing concurrent access by multiple threads using a hash function.

A bitwise trie is a special form of trie where each node with its child-branches represents a bit sequence of one or more bits of a key. A bitwise trie with bitmap uses a bitmap to denote valid child branches.

References

  1. 1 2 3 Phil Bagwell (2000). Ideal Hash Trees (PDF) (Report). Infoscience Department, École Polytechnique Fédérale de Lausanne.
  2. "clojure/clojure". GitHub. 8 December 2022.
  3. "Frege/frege". GitHub. 7 December 2022.
  4. Johan Tibell, A. Announcing unordered-containers 0.2
  5. Nikita Volkov, Announcing the "stm-containers" library, 2014
  6. "mattbierner/hamt". GitHub. 27 November 2022.
  7. "Ruby source file of Rubinius's HAMT". GitHub .
  8. https://github.com/rubinius/rubinius/blob/master/machine/builtin/system.cpp#L1724-L1802 [ dead link ]
  9. "Erlang Programming Language".
  10. "horse: Pony is an open-source, actor-model, capabilities-secure, high performance programming language: Ponylang/ponyc". GitHub . 2018-11-26.
  11. "API docs for Rust im-rc crate".
  12. Prokopec, A. Implementation of Concurrent Hash Tries on GitHub
  13. Prokopec, A. et al. (2011) Cache-Aware Lock-Free Concurrent Hash Tries. Technical Report, 2011.