Worst-case optimal join algorithm

Last updated
An illustration of properties of join algorithms. When performing a join between more than two relations on more than two attributes, binary join algorithms such as hash join operate over two relations at a time, and join them on all attributes in the join condition; worst-case optimal algorithms such as generic join operate on a single attribute at a time but join all the relations on this attribute. Comparison of join algorithms.png
An illustration of properties of join algorithms. When performing a join between more than two relations on more than two attributes, binary join algorithms such as hash join operate over two relations at a time, and join them on all attributes in the join condition; worst-case optimal algorithms such as generic join operate on a single attribute at a time but join all the relations on this attribute.

A worst-case optimal join algorithm is an algorithm for computing relational joins with a runtime that is bounded by the worst-case output size of the join. Traditional binary join algorithms such as hash join operate over two relations at a time; joins between more than two relations are implemented by repeatedly applying binary joins. Worst-case optimal join algorithms are asymptotically faster in worst case than any join algorithm based on such iterated binary joins.

Contents

The first worst-case optimal join algorithm, generic join, was published in 2012. [2] Worst-case optimal join algorithms have been implemented in commercial database systems, including the LogicBlox system. [3] [4] Worst-case optimal joins have been applied to build a worst-case optimal algorithm for e-matching. [5]

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Shellsort</span> Sorting algorithm which uses multiple comparison intervals

Shellsort, also known as Shell sort or Shell's method, is an in-place comparison sort. It can be seen as either a generalization of sorting by exchange or sorting by insertion. The method starts by sorting pairs of elements far apart from each other, then progressively reducing the gap between elements to be compared. By starting with far apart elements, it can move some out-of-place elements into position faster than a simple nearest neighbor exchange. Donald Shell published the first version of this sort in 1959. The running time of Shellsort is heavily dependent on the gap sequence it uses. For many practical variants, determining their time complexity remains an open problem.

The Gödel Prize is an annual prize for outstanding papers in the area of theoretical computer science, given jointly by the European Association for Theoretical Computer Science (EATCS) and the Association for Computing Machinery Special Interest Group on Algorithms and Computational Theory. The award is named in honor of Kurt Gödel. Gödel's connection to theoretical computer science is that he was the first to mention the "P versus NP" question, in a 1956 letter to John von Neumann in which Gödel asked whether a certain NP-complete problem could be solved in quadratic or linear time.

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases. Datalog has been applied to problems in data integration, networking, program analysis, and more.

The graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic.

The Christofides algorithm or Christofides–Serdyukov algorithm is an algorithm for finding approximate solutions to the travelling salesman problem, on instances where the distances form a metric space . It is an approximation algorithm that guarantees that its solutions will be within a factor of 3/2 of the optimal solution length, and is named after Nicos Christofides and Anatoliy I. Serdyukov, who discovered it independently in 1976.

In computer science, the Hopcroft–Karp algorithm is an algorithm that takes a bipartite graph as input and produces a maximum-cardinality matching as output — a set of as many edges as possible with the property that no two edges share an endpoint. It runs in time in the worst case, where is set of edges in the graph, is set of vertices of the graph, and it is assumed that . In the case of dense graphs the time bound becomes , and for sparse random graphs it runs in time with high probability.

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

<span class="mw-page-title-main">Smoothed analysis</span>

In theoretical computer science, smoothed analysis is a way of measuring the complexity of an algorithm. Since its introduction in 2001, smoothed analysis has been used as a basis for considerable research, for problems ranging from mathematical programming, numerical analysis, machine learning, and data mining. It can give a more realistic analysis of the practical performance of the algorithm compared to analysis that uses worst-case or average-case scenarios.

<span class="mw-page-title-main">Georg Gottlob</span> Austrian computer scientist

Georg Gottlob FRS is an Austrian-Italian computer scientist who works in the areas of database theory, logic, and artificial intelligence and is Professor of Informatics at the University of Calabria. He was Professor at the University of Oxford.

Envy-free (EF) item allocation is a fair item allocation problem, in which the fairness criterion is envy-freeness - each agent should receive a bundle that they believe to be at least as good as the bundle of any other agent.

In computer science, the predecessor problem involves maintaining a set of items to, given an element, efficiently query which element precedes or succeeds that element in an order. Data structures used to solve the problem include balanced binary search trees, van Emde Boas trees, and fusion trees. In the static predecessor problem, the set of elements does not change, but in the dynamic predecessor problem, insertions into and deletions from the set are allowed.

<span class="mw-page-title-main">Witold Lipski</span> Polish computer scientist

Witold Lipski Jr. was a Polish computer scientist, and an author of two books: Combinatorics for Programmers and (jointly with Wiktor Marek Combinatorial analysis. Lipski, jointly with his PhD student, Tomasz Imieliński, created foundations of the theory of incomplete information in relational databases.

Since the advent of differential privacy, a number of systems supporting differentially private data analyses have been implemented and deployed. This article tracks real-world deployments, production software packages, and research prototypes.

Truthful resource allocation is the problem of allocating resources among agents with different valuations over the resources, such that agents are incentivized to reveal their true valuations over the resources.

In computer science, an e-graph is a data structure that stores an equivalence relation over terms of some language.

The welfare maximization problem is an optimization problem studied in economics and computer science. Its goal is to partition a set of items among agents with different utility functions, such that the welfare – defined as the sum of the agents' utilities – is as high as possible. In other words, the goal is to find an item allocation satisfying the utilitarian rule.

The LogicBlox system is a commercial, declarative, incremental logic programming language and deductive database inspired by Datalog. The LogiQL programming language extends Datalog with several features, including stratified negation, aggregation, and a module system. LogicBlox has been used to build pointer analyses for Java.

References

Notes

  1. Wang, Yisu Remy; Willsey, Max; Suciu, Dan (2023-01-27). "Free Join: Unifying Worst-Case Optimal and Traditional Joins". arXiv: 2301.10841 [cs.DB].
  2. Ngo, Hung Q.; Porat, Ely; Ré, Christopher; Rudra, Atri (2012-03-08). "Worst-case Optimal Join Algorithms". arXiv: 1203.1952 [cs.DB].
  3. Veldhuizen, Todd L. (2013-12-20). "Leapfrog Triejoin: a worst-case optimal join algorithm". arXiv: 1210.0481 [cs.DB].
  4. Freitag, Michael; Bandle, Maximilian; Schmidt, Tobias; Kemper, Alfons; Neumann, Thomas (2020-07-01). "Adopting worst-case optimal joins in relational database systems". Proceedings of the VLDB Endowment. 13 (12): 1891–1904. doi:10.14778/3407790.3407797. ISSN   2150-8097. S2CID   221115321.
  5. Zhang, Yihong; Wang, Yisu Remy; Willsey, Max; Tatlock, Zachary (2022-01-12). "Relational e-matching". Proceedings of the ACM on Programming Languages. 6 (POPL): 35:1–35:22. doi: 10.1145/3498696 . S2CID   236924583.

Sources