Linear hashing

Last updated

Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980. [1] [2] It has been analyzed by Baeza-Yates and Soza-Pollman. [3] It is the first in a number of schemes known as dynamic hashing [3] [4] such as Larson's Linear Hashing with Partial Extensions, [5] Linear Hashing with Priority Splitting, [6] Linear Hashing with Partial Expansions and Priority Splitting, [7] or Recursive Linear Hashing. [8]

Contents

The file structure of a dynamic hashing data structure adapts itself to changes in the size of the file, so expensive periodic file reorganization is avoided. [4] A Linear Hashing file expands by splitting a predetermined bucket into two and shrinks by merging two predetermined buckets into one. The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or load factor (i.e., the number of records divided by the number of buckets) moving outside of a predetermined range. [1] In Linear Hashing there are two types of buckets, those that are to be split and those already split. While extendible hashing splits only overflowing buckets, spiral hashing (a.k.a. spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split. [5]

Linear Hashing has also been made into a scalable distributed data structure, LH*. In LH*, each bucket resides at a different server. [9] LH* itself has been expanded to provide data availability in the presence of failed buckets. [10] Key based operations (inserts, deletes, updates, reads) in LH and LH* take maximum constant time independent of the number of buckets and hence of records. [1] [10]

Algorithm details

Records in LH or LH* consists of a key and a content, the latter basically all the other attributes of the record. [1] [10] They are stored in buckets. For example, in Ellis' implementation, a bucket is a linked list of records. [2] The file allows the key based CRUD operations create or insert, read, update, and delete as well as a scan operations that scans all records, for example to do a database select operation on a non-key attribute. [10] Records are stored in buckets whose numbering starts with 0. [10]

The key distinction from schemes such as Fagin's extendible hashing is that as the file expands due to insertions, only one bucket is split at a time, and the order in which buckets are split is already predetermined. [11]

Hash functions

The hash function returns the 0-based index of the bucket that contains the record with key . When a bucket which uses the hash function is split into two new buckets, the hash function is replaced with for both of those new buckets. At any time, at most two hash functions and are used; such that corresponds to the current level. The family of hash functions is also referred to as the dynamic hash function.

Typically, the value of in corresponds to the number of rightmost binary digits of the key that are used to segregate the buckets. This dynamic hash function can be expressed arithmetically as . Note that when the total number of buckets is equal to one, .

Complete the calculations below to determine the correct hashing function for the given hashing key . [10]

# l represents the current level# s represents the split pointer indexa=h_l(c)if(a<s):a=h_{l+1}(c)

Split control

Linear hashing algorithms may use only controlled splits or both controlled and uncontrolled splits.

Controlled splitting occurs if a split is performed whenever the load factor, which is monitored by the file, exceeds a predetermined threshold. [10] If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the load factor surpasses a set threshold, the split pointer's designated bucket is split. Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets). [12]

An uncontrolled split occurs when a split is performed whenever a bucket overflows, in which case that bucket would be split into two separate buckets.

File contraction occurs in some LH algorithm implementations if a controlled split causes the load factor to sink below a threshold. In this case, a merge operation would be triggered which would undo the last split, and reset the file state. [10]

Split pointer

The index of the next bucket to be split is part of the file state and called the split pointer. The split pointer corresponds to the first bucket that uses the hash function instead of . [10]

For example, if numerical records are inserted into the hash index according to their farthest right binary digits, the bucket corresponding with the appended bucket will be split. Thus, if we have the buckets labelled as 000, 001, 10, 11, 100, 101, we would split the bucket 10 because we are appending and creating the next sequential bucket 110. This would give us the buckets 000, 001, 010, 11, 100, 101, 110. [12]

When a bucket is split, split pointer and possibly the level are updated according to the following, such that the level is 0 when the linear hashing index only has 1 bucket. [10]

# l represents the current level# s represents the split pointer indexs=s+1if(s>=2^l):l=l+1s=0

LH*

The main contribution of LH* is to allow a client of an LH* file to find the bucket where the record resides even if the client does not know the file state. Clients in fact store their version of the file state, which is initially just the knowledge of the first bucket, namely Bucket 0. Based on their file state, a client calculates the address of a key and sends a request to that bucket. At the bucket, the request is checked and if the record is not at the bucket, it is forwarded. In a reasonably stable system, that is, if there is only one split or merge going on while the request is processed, it can be shown that there are at most two forwards. After a forward, the final bucket sends an Image Adjustment Message to the client whose state is now closer to the state of the distributed file. [10] While forwards are reasonably rare for active clients, their number can be even further reduced by additional information exchange between servers and clients [13]

Other properties

File state calculation

The file state consists of split pointer and level . If the original file started with buckets, then the number of buckets and the file state are related via [13]

.

Adoption in language systems

Griswold and Townsend [14] discussed the adoption of linear hashing in the Icon language. They discussed the implementation alternatives of dynamic array algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.

Adoption in database systems

Linear hashing is used in the Berkeley database system (BDB), which in turn is used by many software systems, using a C implementation derived from the CACM article and first published on the Usenet in 1988 by Esmond Pitt.

Related Research Articles

<span class="mw-page-title-main">Binary search</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Hash table</span> Associative array for storing key-value pairs

In computing, a hash table is a data structure that implements an associative array, also called a dictionary or simply map, which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

Indexed Sequential Access Method (ISAM) is a method for creating, maintaining, and manipulating computer files of data so that records can be retrieved sequentially or randomly by one or more keys. Indexes of key fields are maintained to achieve fast retrieval of required file records in indexed files. IBM originally developed ISAM for mainframe computers, but implementations are available for most computer systems.

<span class="mw-page-title-main">Perfect hash function</span> Hash function without any collisions

In computer science, a perfect hash functionh for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.

In computing, Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed hash table stores key-value pairs by assigning keys to different computers ; a node will store the values for all the keys for which it is responsible. Chord specifies how keys are assigned to nodes, and how a node can discover the value for a given key by first locating the node responsible for that key.

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.

<span class="mw-page-title-main">R-tree</span> Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets, and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

Extendible hashing is a type of hash system which treats a hash as a bit string and uses a trie for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an incremental operation. This means that time-sensitive applications are less affected by table growth than by standard full-table rehashes.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

<span class="mw-page-title-main">Proxmap sort</span>

ProxmapSort, or Proxmap sort, is a sorting algorithm that works by partitioning an array of data items, or keys, into a number of "subarrays". The name is short for computing a "proximity map," which indicates for each key K the beginning of a subarray where K will reside in the final sorted order. Keys are placed into each subarray using insertion sort. If keys are "well distributed" among the subarrays, sorting occurs in linear time. The computational complexity estimates involve the number of subarrays and the proximity mapping function, the "map key," used. It is a form of bucket and radix sort.

Spiral hashing, also known as Spiral Storage is an extensible hashing algorithm. As in all hashing schemes, spiral hashing stores records in a varying number of buckets, using a record key for addressing. In an expanding Linear hashing file, buckets are split in a predefined order. This results in adding a new bucket at the end of the file. While this allows gradual reorganization of the file, the expected number of records in the newly created bucket and the bucket from what it splits falls to half the previous number. Several attempts were made to alleviate this sudden drop in space utilization. Martin's spiral storage uses a different approach. The file consists of a number of continuously numbered buckets. The lower-numbered (left) buckets have a higher expected number of records. When the file expands, the left-most bucket is replaced by two buckets on the right. Some variants of this idea exist.

References

  1. 1 2 3 4 Litwin, Witold (1980), "Linear hashing: A new tool for file and table addressing" (PDF), Proc. 6th Conference on Very Large Databases: 212–223
  2. 1 2 Ellis, Carla Schlatter (June 1987), "Concurrency in Linear Hashing", ACM Transactions on Database Systems, 12 (2): 195–217, doi: 10.1145/22952.22954 , S2CID   14260177
  3. 1 2 Baeza-Yates, Ricardo; Soza-Pollman, Hector (1998), "Analysis of Linear Hashing Revised" (PDF), Nordic Journal of Computing: 70–85, S2CID   7497598, archived from the original (PDF) on 2019-03-07
  4. 1 2 Enbody, Richard; Du, HC (June 1988), "Dynamic hashing schemes", ACM Computing Surveys, 20 (2): 85–113, doi: 10.1145/46157.330532 , S2CID   1437123
  5. 1 2 Larson, Per-Åke (April 1988), "Dynamic Hash Tables", Communications of the ACM, 31 (4): 446–457, doi: 10.1145/42404.42410 , S2CID   207548097
  6. Ruchte, Willard; Tharp, Alan (Feb 1987), "Linear hashing with Priority Splitting: A method for improving the retrieval performance of linear hashing", IEEE Third International Conference on Data Engineering: 2–9
  7. Manolopoulos, Yannis; Lorentzos, N. (1994), "Performance of linear hashing schemes for primary key retrieval", Information Systems, 19 (5): 433–446, doi:10.1016/0306-4379(94)90005-1
  8. Ramamohanarao, K.; Sacks-Davis, R. (Sep 1984), "Recursive linear hashing", ACM Transactions on Databases, 9 (3): 369–391, doi: 10.1145/1270.1285 , S2CID   18577730
  9. Litwin, Witold; Neimat, Marie-Anne; Schneider, Donavan A. (1993), "LH: Linear Hashing for distributed files", ACM SIGMOD Record, 22 (2): 327–336, doi:10.1145/170036.170084, S2CID   259938726
  10. 1 2 3 4 5 6 7 8 9 10 11 Litwin, Witold; Moussa, Rim; Schwarz, Thomas (Sep 2005), "LH*RS - a highly-available scalable distributed data structure", ACM Transactions on Database Systems, 30 (3): 769–811, doi:10.1145/1093382.1093386, S2CID   1802386
  11. Fagin, Ronald; Nievergelt, Jurg; Pippenger, Nicholas; Strong, Raymond (Sep 1979), "Extendible Hashing - A Fast Access Method for Dynamic Files", ACM Transactions on Database Systems, 4 (2): 315–344, doi:10.1145/320083.320092, S2CID   2723596
  12. 1 2 Silberschatz, Abraham; Korth, Henry F.; Sudarshan, S. (2020). Database system concepts (Seventh ed.). New York, NY: McGraw-Hill Education. ISBN   978-1-260-08450-4.
  13. 1 2 Chabkinian, Juan; Schwarz, Thomas (2016), "Fast LH*", International Journal of Parallel Programming, 44 (4): 709–734, doi:10.1007/s10766-015-0371-8, S2CID   7448240
  14. Griswold, William G.; Townsend, Gregg M. (April 1993), "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon", Software: Practice and Experience, 23 (4): 351–367, doi:10.1002/spe.4380230402, S2CID   11595927

See also