Correlated subquery

Last updated

In a SQL database query, a correlated subquery (also known as a synchronized subquery) is a subquery (a query nested inside another query) that uses values from the outer query. Because the subquery may be evaluated once for each row processed by the outer query, it can be slow.

Contents

Examples

Correlated subqueries in the WHERE clause

Here is an example for a typical correlated subquery. In this example, the objective is to find all employees whose salary is above average for their department.

SELECTemployee_number,nameFROMemployeesempWHEREsalary>(SELECTAVG(salary)FROMemployeesWHEREdepartment=emp.department);

In the above query the outer query is

SELECTemployee_number,nameFROMemployeesempWHEREsalary>...

and the inner query (the correlated subquery) is

SELECTAVG(salary)FROMemployeesWHEREdepartment=emp.department

In the above nested query the inner query has to be re-executed for each employee. (A sufficiently smart implementation may cache the inner query's result on a department-by-department basis, but even in the best case the inner query must be executed once per department.)

Correlated subqueries in the SELECT clause

Correlated subqueries may appear elsewhere besides the WHERE clause; for example, this query uses a correlated subquery in the SELECT clause to print the entire list of employees alongside the average salary for each employee's department. Again, because the subquery is correlated with a column of the outer query, it must be re-executed for each row of the result.[ citation needed ]

SELECTemployee_number,name,(SELECTAVG(salary)FROMemployeesWHEREdepartment=emp.department)ASdepartment_averageFROMemployeesemp

Correlated subqueries in the FROM clause

It is generally meaningless to have a correlated subquery in the FROM clause because the table in the FROM clause is needed to evaluate the outer query, but the correlated subquery in the FROM clause can't be evaluated before the outer query is evaluated, causing a chicken-and-egg problem. Specifically, MariaDB lists this as a limitation in its documentation. [1]

However, in some database systems, it is allowed to use correlated subqueries while joining in the FROM clause, referencing the tables listed before the join using a specified keyword, producing a number of rows in the correlated subquery and joining it to the table on the left. For example, in PostgreSQL, adding the keyword LATERAL before the right-hand subquery, [2] or in Microsoft SQL Server, using the keyword CROSS APPLY or OUTER APPLY instead of JOIN [3] achieves the effect.

Computation of correlated subqueries

A commonly used computational method for a correlated subquery is to rewrite it into an equivalent flat query [4] (a process known as flattening [5] [6] [7] [8] ). The algorithm development in this direction has an advantage of low complexity. Because this is a customized approach, existing database systems cannot flatten arbitrary correlated subqueries by following certain general rules. In addition, this approach requires high engineering efforts to implement flattening algorithms into a database engine. A general computational approach is to directly execute the nested loop by iterating all tuples of the correlated columns from the outer query block and executing the subquery as many times as the number of outer-loop tuples. [9] This simple approach has an advantage of general-purpose because it is not affected by the type of correlated operators or subquery structures. However, it has a high computational complexity. A GPU acceleration approach is used to significantly improve the performance of the nested method of high algorithmic complexity by exploiting massive parallelism and device memory locality on GPU, [10] which accomplishes the goal for both general-purpose software design and implementation and high performance in subquery processing.

Related Research Articles

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

Transact-SQL (T-SQL) is Microsoft's and Sybase's proprietary extension to the SQL used to interact with relational databases. T-SQL expands on the SQL standard to include procedural programming, local variables, various support functions for string processing, date processing, mathematics, etc. and changes to the DELETE and UPDATE statements.

<span class="mw-page-title-main">Join (SQL)</span> SQL clause

A join clause in the Structured Query Language (SQL) combines columns from one or more tables into a new table. The operation corresponds to a join operation in relational algebra. Informally, a join stitches two tables and puts on the same row records with matching fields : INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

The SQL SELECT statement returns a result set of records, from one or more tables.

An SQL INSERT statement adds one or more records to any single table in a relational database.

A data control language (DCL) is a syntax similar to a computer programming language used to control access to data stored in a database (authorization). In particular, it is a component of Structured Query Language (SQL). Data Control Language is one of the logical group in SQL Commands. SQL is the standard language for relational database management systems. SQL statements are used to perform tasks such as insert data to a database, delete or update data in a database, or retrieve data from a database.

A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for maintaining the integrity of the information on the database. For example, when a new record is added to the employees table, new records should also be created in the tables of the taxes, vacations and salaries. Triggers can also be used to log historical data, for example to keep track of employees' previous salaries.

A query plan is a sequence of steps used to access data in a SQL relational database management system. This is a specific case of the relational model concept of access plans.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

A WHERE clause in SQL specifies that a SQL Data Manipulation Language (DML) statement should only affect rows that meet specified criteria. The criteria are expressed in the form of predicates. WHERE clauses are not mandatory clauses of SQL DML statements, but can be used to limit the number of rows affected by a SQL DML statement or returned by a query. In brief SQL WHERE clause is used to extract only those results from a SQL statement, such as: SELECT, INSERT, UPDATE, or DELETE statement.

The programming language XQuery defines FLWOR as an expression that supports iteration and binding of variables to intermediate results. FLWOR is an acronym: FOR, LET, WHERE, ORDER BY, RETURN. FLWOR is loosely analogous to SQL's SELECT-FROM-WHERE and can be used to provide join-like functionality to XML documents.

In relational databases, a condition in a query is said to be sargable if the DBMS engine can take advantage of an index to speed up the execution of the query. The term is derived from a contraction of Search ARGument ABLE. It was first used by IBM researchers as a contraction of Search ARGument, and has come to mean simply "can be looked up by an index."

A HAVING clause in SQL specifies that an SQL SELECT statement must only return rows where aggregate values meet the specified conditions.

In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row. Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function.

A hierarchical query is a type of SQL query that handles hierarchical model data. They are special cases of more general recursive fixpoint queries, which compute transitive closures.

Apache Empire-db is a Java library that provides a high level object-oriented API for accessing relational database management systems (RDBMS) through JDBC. Apache Empire-db is open source and provided under the Apache License 2.0 from the Apache Software Foundation.

QUEL is a relational database query language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the Ingres DBMS effort at University of California, Berkeley, based on Codd's earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably in an implementation called POSTQUEL supported by POSTGRES. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead. QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years.

A relational data stream management system (RDSMS) is a distributed, in-memory data stream management system (DSMS) that is designed to use standards-compliant SQL queries to process unstructured and structured data streams in real-time. Unlike SQL queries executed in a traditional RDBMS, which return a result and exit, SQL queries executed in a RDSMS do not exit, generating results continuously as new data become available. Continuous SQL queries in a RDSMS use the SQL Window function to analyze, join and aggregate data streams over fixed or sliding windows. Windows can be specified as time-based or row-based.

The syntax of the SQL programming language is defined and maintained by ISO/IEC SC 32 as part of ISO/IEC 9075. This standard is not freely available. Despite the existence of the standard, SQL code is not completely portable among different database systems without adjustments.

References

  1. "Subquery Limitations". MariaDB Knowledgebase. Retrieved 2020-12-24.
  2. "Table Expressions - LATERAL Subqueries". postgresql.org. Retrieved 2023-01-21.
  3. "FROM clause plus JOIN, APPLY, PIVOT (Transact-SQL)". docs.microsoft.com. 2019-06-01. Retrieved 2020-12-24.
  4. Kim, Won (September 1982). "On Optimizing an SQL-like Nested Query" (pdf). ACM Transactions on Database Systems. 7 (3): 443–469. doi: 10.1145/319732.319745 . S2CID   4374300.
  5. "The SQLite Query Optimizer Overview - 11. Subquery Flattening". SQLite. Retrieved 2023-01-21. When a subquery occurs in the FROM clause of a SELECT, the simplest behavior is to evaluate the subquery into a transient table, then run the outer SELECT against the transient table. Such a plan can be suboptimal since the transient table will not have any indexes and the outer query (which is likely a join) will be forced to do a full table scan on the transient table. To overcome this problem, SQLite attempts to flatten subqueries in the FROM clause of a SELECT. This involves inserting the FROM clause of the subquery into the FROM clause of the outer query and rewriting expressions in the outer query that refer to the result set of the subquery. ...
  6. "Flattening FROM Clause Subqueries". Vertica. Retrieved 2023-01-21. FROM clause subqueries are always evaluated before their containing query. In some cases, the optimizer flattens FROM clause subqueries so the query can execute more efficiently.
  7. "Flattening a subquery into a normal join". Apache Derby. Retrieved 2023-01-21. Statements that include such subqueries can be flattened into joins only if the subquery does not introduce any duplicates into the result set...
  8. "Chapter 15: Abstract Query Plan Guide - Flattened subqueries". Sybase. Retrieved 2023-01-21. Some subqueries can be flattened into joins. ...
  9. Selinger, P. Griffiths; Astrahan, M. M.; Chamberlin, D. D.; Lorie, R. A.; Price, T. G. (1979). Access Path Selection in a Relational Database Management System (pdf). Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts. SIGMOD '79. New York, NY, USA: Association for Computing Machinery. pp. 23–34. doi:10.1145/582095.582099. ISBN   089791001X. 10.1145/582095.582099.
  10. Floratos, Sofoklis; Xiao, Mengbai; Wang, Hao; Guo, Chengxin; Yuan, Yuan; Lee, Rubao; Zhang, Xiaodong (2021). NestGPU: Nested Query Processing on GPU. IEEE 37th International Conference on Data Engineering. pp. 1008–1019.