Sort-merge join

Last updated

The sort-merge join (also known as merge join) is a join algorithm and is used in the implementation of a relational database management system.

Contents

The basic problem of a join algorithm is to find, for each distinct value of the join attribute, the set of tuples in each relation which display that value. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time.

In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order. This can be achieved via an explicit sort operation (often an external sort), or by taking advantage of a pre-existing ordering in one or both of the join relations. [1] The latter condition, called interesting order, can occur because an input to the join might be produced by an index scan of a tree-based index, another merge join, or some other plan operator that happens to produce output sorted on an appropriate key. Interesting orders need not be serendipitous: the optimizer may seek out this possibility and choose a plan that is suboptimal for a specific preceding operation if it yields an interesting order that one or more downstream nodes can exploit.

Let's say that we have two relations and and . fits in pages memory and fits in pages memory. So, in the worst case sort-merge join will run in I/Os. In the case that and are not ordered the worst case time cost will contain additional terms of sorting time: , which equals (as linearithmic terms outweigh the linear terms, see Big O notation – Orders of common functions).

Pseudocode

For simplicity, the algorithm is described in the case of an inner join of two relations left and right. Generalization to other join types is straightforward. The output of the algorithm will contain only rows contained in the left and right relation and duplicates form a Cartesian product.

functionSort-MergeJoin(left: Relation,right: Relation,comparator: Comparator){result=newRelation()// Ensure that at least one element is presentif(!left.hasNext()||!right.hasNext()){returnresult}// Sort left and right relation with comparatorleft.sort(comparator)right.sort(comparator)// Start Merge Join algorithmleftRow=left.next()rightRow=right.next()outerForeverLoop:     while(true){while(comparator.compare(leftRow,rightRow)!=0){if(comparator.compare(leftRow,rightRow)<0){// Left row is less than right rowif(left.hasNext()){// Advance to next left rowleftRow=left.next()}else{breakouterForeverLoop}}else{// Left row is greater than right rowif(right.hasNext()){// Advance to next right rowrightRow=right.next()}else{breakouterForeverLoop}}}// Mark position of left row and keep copy of current left rowleft.mark()markedLeftRow=leftRowwhile(true){while(comparator.compare(leftRow,rightRow)==0){// Left row and right row are equal// Add rows to resultresult=add(leftRow,rightRow)// Advance to next left rowleftRow=left.next()// Check if left row existsif(!leftRow){// Continue with inner forever loopbreak}}if(right.hasNext()){// Advance to next right rowrightRow=right.next()}else{breakouterForeverLoop}if(comparator.compare(markedLeftRow,rightRow)==0){// Restore left to stored markleft.restoreMark()leftRow=markedLeftRow}else{// Check if left row existsif(!leftRow){breakouterForeverLoop}else{// Continue with outer forever loopbreak}}}}returnresult}

Since the comparison logic is not the central aspect of this algorithm, it is hidden behind a generic comparator and can also consist of several comparison criterias (e.g. multiple columns). The compare function should return if a row is less(-1), equal(0) or bigger(1) than another row:

functioncompare(leftRow: RelationRow,rightRow: RelationRow):number{// Return -1 if leftRow is less than rightRow// Return 0 if leftRow is equal to rightRow// Return 1 if leftRow is greater than rightRow}

Note that a relation in terms of this pseudocode supports some basic operations:

interfaceRelation{// Returns true if relation has a next row (otherwise false)hasNext():boolean// Returns the next row of the relation (if any)next():RelationRow// Sorts the relation with the given comparatorsort(comparator: Comparator):void// Marks the current row indexmark():void// Restores the current row index to the marked row indexrestoreMark():void}

Simple C# implementation

Note that this implementation assumes the join attributes are unique, i.e., there is no need to output multiple tuples for a given value of the key.

publicclassMergeJoin{// Assume that left and right are already sortedpublicstaticRelationMerge(Relationleft,Relationright){Relationoutput=newRelation();while(!left.IsPastEnd()&&!right.IsPastEnd()){if(left.Key==right.Key){output.Add(left.Key);left.Advance();right.Advance();}elseif(left.Key<right.Key)left.Advance();else// if (left.Key > right.Key)right.Advance();}returnoutput;}}publicclassRelation{privateList<int>list;publicconstintENDPOS=-1;publicintposition=0;publicintPosition=>position;publicintKey=>list[position];publicboolAdvance(){if(position==list.Count-1||position==ENDPOS){position=ENDPOS;returnfalse;}position++;returntrue;}publicvoidAdd(intkey){list.Add(key);}publicboolIsPastEnd(){returnposition==ENDPOS;}publicvoidPrint(){foreach(intkeyinlist)Console.WriteLine(key);}publicRelation(List<int>list){this.list=list;}publicRelation(){this.list=newList<int>();}}

See also

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Heap (data structure)</span> Computer science data structure

In computer science, a heap is a specialized tree-based data structure which is essentially an almost complete tree that satisfies the heap property: in a max heap, for any given node C, if P is a parent node of C, then the key of P is greater than or equal to the key of C. In a min heap, the key of P is less than or equal to the key of C. The node at the "top" of the heap is called the root node.

<span class="mw-page-title-main">Insertion sort</span> Sorting algorithm

Insertion sort is a simple sorting algorithm that builds the final sorted array (or list) one item at a time by comparisons. It is much less efficient on large lists than more advanced algorithms such as quicksort, heapsort, or merge sort. However, insertion sort provides several advantages:

<span class="mw-page-title-main">Merge sort</span> Divide and conquer-based sorting algorithm

In computer science, merge sort is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the order of equal elements is the same in the input and output. Merge sort is a divide-and-conquer algorithm that was invented by John von Neumann in 1945. A detailed description and analysis of bottom-up merge sort appeared in a report by Goldstine and von Neumann as early as 1948.

Merge algorithms are a family of algorithms that take multiple sorted lists as input and produce a single list as output, containing all the elements of the inputs lists in sorted order. These algorithms are used as subroutines in various sorting algorithms, most famously merge sort.

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which are used when re-organising the tree to ensure that it is always approximately balanced.

<span class="mw-page-title-main">Sorting algorithm</span> Algorithm that arranges lists in order

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is important for optimizing the efficiency of other algorithms that require input data to be in sorted lists. Sorting is also often useful for canonicalizing data and for producing human-readable output.

In computer science, selection sort is an in-place comparison sorting algorithm. It has an O(n2) time complexity, which makes it inefficient on large lists, and generally performs worse than the similar insertion sort. Selection sort is noted for its simplicity and has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum key value and the minimum key value, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. It is often used as a subroutine in radix sort, another sorting algorithm, which can handle larger keys more efficiently.

<span class="mw-page-title-main">Treap</span>

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

A join clause in SQL – corresponding to a join operation in relational algebra – combines columns from one or more tables into a new table. Informally, a join stitches two tables and puts on the same row records with matching fields : INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. Precisely, a topological sort is a graph traversal in which each node v is visited only after all its dependencies are visited. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. Topological sorting has many applications especially in ranking problems such as feedback arc set. Topological sorting is possible even when the DAG has disconnected components.

<span class="mw-page-title-main">Rope (data structure)</span> Data structure for storing strings

In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.

The hash join is an example of a join algorithm and is used in the implementation of a relational database management system. All variants of hash join algorithms involve building hash tables from the tuples of one or both of the joined relations, and subsequently probing those tables so that only tuples with the same hash code need to be compared for equality in equijoins.

<span class="mw-page-title-main">Bitonic sorter</span> Parallel sorting algorithm

Bitonic mergesort is a parallel algorithm for sorting. It is also used as a construction method for building a sorting network. The algorithm was devised by Ken Batcher. The resulting sorting networks consist of comparators and have a delay of , where is the number of items to be sorted.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

In computing, a procedural parameter is a parameter of a procedure that is itself a procedure.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p−1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.

In computer science, join-based tree algorithms are a class of algorithms for self-balancing binary search trees. This framework aims at designing highly-parallelized algorithms for various balanced binary search trees. The algorithmic framework is based on a single operation join. Under this framework, the join operation captures all balancing criteria of different balancing schemes, and all other functions join have generic implementation across different balancing schemes. The join-based algorithms can be applied to at least four balancing schemes: AVL trees, red–black trees, weight-balanced trees and treaps.

References

  1. "Sort-Merge Joins". www.dcs.ed.ac.uk. Retrieved 2022-11-02.

C# Implementations of Various Join Algorithms