Sort-merge join

Last updated

The sort-merge join (also known as merge join) is a join algorithm and is used in the implementation of a relational database management system.

Contents

The basic problem of a join algorithm is to find, for each distinct value of the join attribute, the set of tuples in each relation which display that value. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time.

In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order. This can be achieved via an explicit sort operation (often an external sort), or by taking advantage of a pre-existing ordering in one or both of the join relations. [1] The latter condition, called interesting order, can occur because an input to the join might be produced by an index scan of a tree-based index, another merge join, or some other plan operator that happens to produce output sorted on an appropriate key. Interesting orders need not be serendipitous: the optimizer may seek out this possibility and choose a plan that is suboptimal for a specific preceding operation if it yields an interesting order that one or more downstream nodes can exploit.

Complexity

Let and be relations where . fits in pages memory and fits in pages memory. In the worst case, a sort-merge join will run in I/O operations. In the case that and are not ordered the worst case time cost will contain additional terms of sorting time: , which equals (as linearithmic terms outweigh the linear terms, see Big O notation – Orders of common functions).

Pseudocode

For simplicity, the algorithm is described in the case of an inner join of two relations left and right. Generalization to other join types is straightforward. The output of the algorithm will contain only rows contained in the left and right relation and duplicates form a Cartesian product.

functionSort-MergeJoin(left:Relation,right:Relation,comparator:Comparator){result=newRelation()// Ensure that at least one element is presentif(!left.hasNext()||!right.hasNext()){returnresult}// Sort left and right relation with comparatorleft.sort(comparator)right.sort(comparator)// Start Merge Join algorithmleftRow=left.next()rightRow=right.next()outerForeverLoop:while(true){while(comparator.compare(leftRow,rightRow)!=0){if(comparator.compare(leftRow,rightRow)<0){// Left row is less than right rowif(left.hasNext()){// Advance to next left rowleftRow=left.next()}else{breakouterForeverLoop}}else{// Left row is greater than right rowif(right.hasNext()){// Advance to next right rowrightRow=right.next()}else{breakouterForeverLoop}}}// Mark position of left row and keep copy of current left rowleft.mark()markedLeftRow=leftRowwhile(true){while(comparator.compare(leftRow,rightRow)==0){// Left row and right row are equal// Add rows to resultresult=add(leftRow,rightRow)// Advance to next left rowleftRow=left.next()// Check if left row existsif(!leftRow){// Continue with inner forever loopbreak}}if(right.hasNext()){// Advance to next right rowrightRow=right.next()}else{breakouterForeverLoop}if(comparator.compare(markedLeftRow,rightRow)==0){// Restore left to stored markleft.restoreMark()leftRow=markedLeftRow}else{// Check if left row existsif(!leftRow){breakouterForeverLoop}else{// Continue with outer forever loopbreak}}}}returnresult}

Since the comparison logic is not the central aspect of this algorithm, it is hidden behind a generic comparator and can also consist of several comparison criteria (e.g. multiple columns). The compare function should return if a row is less(-1), equal(0) or bigger(1) than another row:

functioncompare(leftRow:RelationRow,rightRow:RelationRow):number{// Return -1 if leftRow is less than rightRow// Return 0 if leftRow is equal to rightRow// Return 1 if leftRow is greater than rightRow}

Note that a relation in terms of this pseudocode supports some basic operations:

interfaceRelation{// Returns true if relation has a next row (otherwise false)hasNext():boolean// Returns the next row of the relation (if any)next():RelationRow// Sorts the relation with the given comparatorsort(comparator:Comparator):void// Marks the current row indexmark():void// Restores the current row index to the marked row indexrestoreMark():void}

Simple C# implementation

Note that this implementation assumes the join attributes are unique, i.e., there is no need to output multiple tuples for a given value of the key.

publicclassMergeJoin{// Assume that left and right are already sortedpublicstaticRelationMerge(Relationleft,Relationright){Relationoutput=newRelation();while(!left.IsPastEnd&&!right.IsPastEnd){if(left.Key==right.Key){output.Add(left.Key);left.Advance();right.Advance();}elseif(left.Key<right.Key)left.Advance();else// if (left.Key > right.Key)right.Advance();}returnoutput;}}publicclassRelation{privateconstintENDPOS=-1;privateList<int>list;privateintposition=0;publicRelation(){this.list=newList<int>();}publicRelation(List<int>list){this.list=list;}publicintPosition=>position;publicintKey=>list[position];publicboolIsPastEnd=>position==ENDPOS;publicboolAdvance(){if(position==list.Count-1||position==ENDPOS){position=ENDPOS;returnfalse;}position++;returntrue;}publicvoidAdd(intkey){list.Add(key);}publicvoidPrint(){foreach(intkeyinlist)Console.WriteLine(key);}}

See also

Related Research Articles

<span class="mw-page-title-main">Binary search</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Insertion sort</span> Sorting algorithm

Insertion sort is a simple sorting algorithm that builds the final sorted array (or list) one item at a time by comparisons. It is much less efficient on large lists than more advanced algorithms such as quicksort, heapsort, or merge sort. However, insertion sort provides several advantages:

<span class="mw-page-title-main">Merge sort</span> Divide and conquer sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the relative order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

Merge algorithms are a family of algorithms that take multiple sorted lists as input and produce a single list as output, containing all the elements of the inputs lists in sorted order. These algorithms are used as subroutines in various sorting algorithms, most famously merge sort.

In computer science, a red–black tree is a self-balancing binary search tree data structure noted for fast storage and retrieval of ordered information. The nodes in a red-black tree hold an extra "color" bit, often drawn as red and black, which help ensure that the tree is always approximately balanced.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

In computer science, selection sort is an in-place comparison sorting algorithm. It has a O(n2) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is noted for its simplicity and has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum key value and the minimum key value, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. It is often used as a subroutine in radix sort, another sorting algorithm, which can handle larger keys more efficiently.

<span class="mw-page-title-main">Treap</span> Random search tree data structure

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

<span class="mw-page-title-main">Join (SQL)</span> SQL clause

A join clause in the Structured Query Language (SQL) combines columns from one or more tables into a new table. The operation corresponds to a join operation in relational algebra. Informally, a join stitches two tables and puts on the same row records with matching fields : INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge (u,v) from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications, especially in ranking problems such as feedback arc set. Topological sorting is also possible when the DAG has disconnected components.

In number theory, the integer square root (isqrt) of a non-negative integer n is the non-negative integer m which is the greatest integer less than or equal to the square root of n,

<span class="mw-page-title-main">Rope (data structure)</span> Data structure for storing strings

In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate longer strings or entire texts. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.

The hash join is an example of a join algorithm and is used in the implementation of a relational database management system. All variants of hash join algorithms involve building hash tables from the tuples of one or both of the joined relations, and subsequently probing those tables so that only tuples with the same hash code need to be compared for equality in equijoins.

<span class="mw-page-title-main">Bitonic sorter</span> Parallel sorting algorithm

Bitonic mergesort is a parallel algorithm for sorting. It is also used as a construction method for building a sorting network. The algorithm was devised by Ken Batcher. The resulting sorting networks consist of comparators and have a delay of , where is the number of items to be sorted. This makes it a popular choice for sorting large numbers of elements on an architecture which itself contains a large number of parallel execution units running in lockstep, such as a typical GPU.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In computing, a procedural parameter is a parameter of a procedure that is itself a procedure.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computer science, join-based tree algorithms are a class of algorithms for self-balancing binary search trees. This framework aims at designing highly-parallelized algorithms for various balanced binary search trees. The algorithmic framework is based on a single operation join. Under this framework, the join operation captures all balancing criteria of different balancing schemes, and all other functions join have generic implementation across different balancing schemes. The join-based algorithms can be applied to at least four balancing schemes: AVL trees, red–black trees, weight-balanced trees and treaps.

References

  1. "Sort-Merge Joins". www.dcs.ed.ac.uk. Retrieved 2022-11-02.

C# Implementations of Various Join Algorithms