Hash join

The hash join is an example of a join algorithm and is used in the implementation of a relational database management system. All variants of hash join algorithms involve building hash tables from the tuples of one or both of the joined relations, and subsequently probing those tables so that only tuples with the same hash code need to be compared for equality in equijoins.

Hash joins are typically more efficient than nested loops joins, except when the probe side of the join is very small. They require an equijoin predicate (a predicate comparing records from one table with those from the other table using a conjunction of equality operators '=' on one or more columns).

Classic hash join

The classic hash join algorithm for an inner join of two relations proceeds as follows:

  1. First, prepare a hash table using the contents of one relation, ideally whichever one is smaller after applying local predicates. The hash table entries map the value of the join attribute to the remaining attributes of that row (whichever ones are needed for the output).
  2. Once the hash table is built, scan the other relation. For each of its rows, look up the hash table using that row's join attribute value and output a joined tuple for every match found.

The first phase is usually called the "build" phase, while the second is called the "probe" phase. Similarly, the join relation on which the hash table is built is called the "build" input, whereas the other input is called the "probe" input.
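
The following minimal Python sketch illustrates the two phases, assuming each relation is a list of dictionaries joined on a single attribute; the function and parameter names are illustrative only and not taken from any particular system.

  from collections import defaultdict

  def classic_hash_join(build_rows, probe_rows, build_key, probe_key):
      # Build phase: hash every tuple of the (smaller) build relation
      # on its join attribute.
      table = defaultdict(list)
      for row in build_rows:
          table[row[build_key]].append(row)
      # Probe phase: look up each probe tuple by its join attribute and
      # emit a combined tuple for every match.
      for row in probe_rows:
          for match in table.get(row[probe_key], ()):
              yield {**match, **row}

For example, calling classic_hash_join(departments, employees, "dept_id", "dept_id") on two such lists (hypothetical names) would stream out one combined row for every employee whose department appears in the build input.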

This algorithm is simple, but it requires that the smaller join relation fits into memory, which is sometimes not the case. A simple approach to handling this situation proceeds as follows:

  1. For each tuple in the build input
    1. Add the tuple to the in-memory hash table
    2. If the size of the hash table equals the maximum in-memory size:
      1. Scan the probe input, and add matching join tuples to the output relation
      2. Reset the hash table, and continue scanning the build input
  2. Do a final scan of the probe input and add the resulting join tuples to the output relation

This is essentially the same as the block nested loop join algorithm. Because the entire probe input is rescanned for each chunk of the build input, this algorithm may end up scanning the probe input more times than necessary.
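
Under the same illustrative assumptions as the sketch above, this approach can be written as follows; the probe input must be a re-scannable collection (such as a list), since it is read once per chunk of the build input, and the memory limit is approximated here by a simple row count.

  from collections import defaultdict

  def block_hash_join(build_rows, probe_rows, build_key, probe_key, max_rows):
      table = defaultdict(list)

      def probe_pass():
          # Scan the whole probe input against the current hash table.
          for row in probe_rows:
              for match in table.get(row[probe_key], ()):
                  yield {**match, **row}

      count = 0
      for row in build_rows:
          table[row[build_key]].append(row)
          count += 1
          if count == max_rows:        # hash table has reached its memory limit
              yield from probe_pass()
              table.clear()            # reset and continue with the build input
              count = 0
      if count:                        # final scan for the last partial chunk
          yield from probe_pass()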

Grace hash join

A better approach is known as the "grace hash join", after the GRACE database machine for which it was first implemented.

This algorithm avoids rescanning the entire probe relation by first partitioning both inputs via a hash function on the join key, and writing these partitions out to disk. The algorithm then loads pairs of corresponding partitions into memory, builds a hash table for the smaller partitioned relation, and probes the other relation's partition for matches with the current hash table. Because the partitions were formed by hashing on the join key, any tuples that join must belong to the same pair of partitions.

It is possible that one or more of the partitions still does not fit into the available memory, in which case the algorithm is recursively applied: an additional orthogonal hash function is chosen to hash the large partition into sub-partitions, which are then processed as before. Since this is expensive, the algorithm tries to reduce the chance that it will occur by forming the smallest partitions possible during the initial partitioning phase.
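
Continuing the same illustrative setup, a grace hash join can be sketched as below, with in-memory lists standing in for the partition files that would normally be written to disk, and with the recursive handling of oversized partitions omitted.

  from collections import defaultdict

  def grace_hash_join(build_rows, probe_rows, build_key, probe_key, num_partitions=8):
      def partition(rows, key):
          # Hash-partition a relation on its join attribute ("write to disk").
          parts = [[] for _ in range(num_partitions)]
          for row in rows:
              parts[hash(row[key]) % num_partitions].append(row)
          return parts

      build_parts = partition(build_rows, build_key)
      probe_parts = partition(probe_rows, probe_key)

      # Tuples that join always hash to the same partition number, so only
      # corresponding pairs of partitions need to be joined.
      for build_part, probe_part in zip(build_parts, probe_parts):
          table = defaultdict(list)
          for row in build_part:
              table[row[build_key]].append(row)
          for row in probe_part:
              for match in table.get(row[probe_key], ()):
                  yield {**match, **row}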

Hybrid hash join

The hybrid hash join algorithm [1] is a combination of the classical hash join and the grace hash join. It uses only a minimal amount of memory for partitioning, as in the grace hash join, and devotes the remaining memory to a classical hash join that runs during the partitioning phase. During the partitioning phase, the hybrid hash join therefore uses the available memory for two purposes:

  1. To partition both relations, and
  2. To hold an entire partition of the build relation in memory, known as "partition 0"

Because partition 0 is never written to disk, hybrid hash join typically performs fewer I/O operations than grace hash join. When the build relation fits almost entirely into memory, hybrid hash join behaves much like the classical hash join, which is more efficient; otherwise it behaves like the grace hash join.

Note that this algorithm is memory-sensitive, because there are two competing demands for memory (the hash table for partition 0, and the output buffers for the remaining partitions). Choosing too large a hash table for partition 0 might cause the algorithm to recurse because one of the non-zero partitions is too large to fit into memory.
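
A sketch of the hybrid variant under the same illustrative assumptions: partition 0 of the build input is kept as a live in-memory hash table and probed on the fly, while the remaining partitions are spooled (here into plain lists rather than disk files) and joined afterwards as in the grace hash join.

  from collections import defaultdict

  def hybrid_hash_join(build_rows, probe_rows, build_key, probe_key, num_partitions=8):
      def part(value):
          return hash(value) % num_partitions

      table0 = defaultdict(list)       # in-memory hash table for partition 0
      build_spill = defaultdict(list)  # remaining build partitions ("on disk")
      for row in build_rows:
          p = part(row[build_key])
          if p == 0:
              table0[row[build_key]].append(row)
          else:
              build_spill[p].append(row)

      probe_spill = defaultdict(list)  # remaining probe partitions ("on disk")
      for row in probe_rows:
          p = part(row[probe_key])
          if p == 0:                   # partition 0 is probed immediately, no I/O
              for match in table0.get(row[probe_key], ()):
                  yield {**match, **row}
          else:
              probe_spill[p].append(row)

      # The spilled partitions are then joined exactly as in the grace hash join.
      for p, rows in build_spill.items():
          table = defaultdict(list)
          for row in rows:
              table[row[build_key]].append(row)
          for row in probe_spill.get(p, ()):
              for match in table.get(row[probe_key], ()):
                  yield {**match, **row}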

Hash anti-join

Hash joins can also be evaluated for an anti-join predicate (a predicate selecting values from one table when no related values are found in the other). Depending on the sizes of the tables, different algorithms can be applied:

Hash left anti-join

The hash table is built on the NOT IN side of the join; the FROM table is then scanned, and rows whose join attribute has no entry in the hash table are returned. This is more efficient when the NOT IN table is smaller than the FROM table.
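
An illustrative sketch in the same style, hashing the NOT IN side and scanning the FROM side:

  def hash_left_anti_join(not_in_rows, from_rows, not_in_key, from_key):
      # Build a hash set of join-attribute values from the NOT IN table.
      keys = {row[not_in_key] for row in not_in_rows}
      # Return FROM rows whose join attribute finds no match.
      for row in from_rows:
          if row[from_key] not in keys:
              yield row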

Hash right anti-join

Here the hash table is built on the FROM side of the join; the NOT IN table is then scanned, and each hit removes the corresponding records from the hash table. Everything left in the hash table at the end is returned. This is more efficient when the NOT IN table is larger than the FROM table.
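
A corresponding sketch that builds the hash table on the FROM side instead:

  def hash_right_anti_join(not_in_rows, from_rows, not_in_key, from_key):
      # Build a hash table on the FROM table.
      table = {}
      for row in from_rows:
          table.setdefault(row[from_key], []).append(row)
      # Delete every entry that the NOT IN table hits ...
      for row in not_in_rows:
          table.pop(row[not_in_key], None)
      # ... and return whatever is left in the hash table.
      for rows in table.values():
          yield from rows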

Hash semi-join

Hash semi-join is used to return the records from one table for which matching records exist in the other table. Unlike the plain join, it returns each matching record from the leading table only once, regardless of how many matches there are in the IN table.

As with the anti-join, semi-join can also be left and right:

Hash left semi-join

The hash table is built on the IN side of the join. Records from the FROM table are returned as soon as they produce a hit; the actual records stored in the hash table are ignored.

This is more efficient when the IN table is smaller than the FROM table.
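
An illustrative sketch, hashing the IN side:

  def hash_left_semi_join(in_rows, from_rows, in_key, from_key):
      # Build a hash set of join-attribute values from the IN table;
      # the IN rows themselves are never returned.
      keys = {row[in_key] for row in in_rows}
      # Return each FROM row at most once, as soon as it produces a hit.
      for row in from_rows:
          if row[from_key] in keys:
              yield row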

Hash right semi-join

With this algorithm, each record from the hash table (that is, FROM table) can only be returned once, since it is removed after being returned.

This is more efficient when the IN table is larger than the FROM table.
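
An illustrative sketch that hashes the FROM side and removes each entry once it has been returned:

  def hash_right_semi_join(in_rows, from_rows, in_key, from_key):
      # Build a hash table on the FROM table.
      table = {}
      for row in from_rows:
          table.setdefault(row[from_key], []).append(row)
      # Probe with the IN table; pop() returns each FROM entry once and
      # removes it, so later duplicates in the IN table cannot match again.
      for row in in_rows:
          for match in table.pop(row[in_key], ()):
              yield match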

References

  1. DeWitt, D.J.; Katz, R.; Olken, F.; Shapiro, L.; Stonebraker, M.; Wood, D. (June 1984). "Implementation techniques for main memory database systems". Proc. ACM SIGMOD Conf. 14 (4): 1–8. doi:10.1145/971697.602261.