Conjunctive query

Last updated

In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational databases can be expressed in this way. Conjunctive queries also have a number of desirable theoretical properties that larger classes of queries (e.g., the relational algebra queries) do not share.

Contents

Definition

The conjunctive queries are the fragment of (domain independent) first-order logic given by the set of formulae that can be constructed from atomic formulae using conjunction and existential quantification ∃, but not using disjunction , negation ¬, or universal quantification ∀. Each such formula can be rewritten (efficiently) into an equivalent formula in prenex normal form, thus this form is usually simply assumed.

Thus conjunctive queries are of the following general form:

,

with the free variables being called distinguished variables, and the bound variables being called undistinguished variables. are atomic formulae.

As an example of why the restriction to domain independent first-order logic is important, consider , which is not domain independent; see Codd's theorem. This formula cannot be implemented in the select-project-join fragment of relational algebra, and hence should not be considered a conjunctive query.

Conjunctive queries can express a large proportion of queries that are frequently issued on relational databases. To give an example, imagine a relational database for storing information about students, their address, the courses they take and their gender. Finding all male students and their addresses who attend a course that is also attended by a female student is expressed by the following conjunctive query:

(student, address) . ∃ (student2, course) .    attends(student, course)  gender(student, 'male')      attends(student2, course)     gender(student2, 'female')  lives(student, address)

Note that since the only entity of interest is the male student and his address, these are the only distinguished variables, while the variables course, student2 are only existentially quantified, i.e. undistinguished.

Fragments

Conjunctive queries without distinguished variables are called boolean conjunctive queries. Conjunctive queries where all variables are distinguished (and no variables are bound) are called equi-join queries, [1] because they are the equivalent, in the relational calculus, of the equi-join queries in the relational algebra (when selecting all columns of the result).

Relationship to other query languages

Conjunctive queries also correspond to select-project-join queries in relational algebra (i.e., relational algebra queries that do not use the operations union or difference) and to select-from-where queries in SQL in which the where-condition uses exclusively conjunctions of atomic equality conditions, i.e. conditions constructed from column names and constants using no comparison operators other than "=", combined using "and". Notably, this excludes the use of aggregation and subqueries. For example, the above query can be written as an SQL query of the conjunctive query fragment as

selectl.student,l.addressfromattendsa1,genderg1,attendsa2,genderg2,liveslwherea1.student=g1.studentanda2.student=g2.studentandl.student=g1.studentanda1.course=a2.courseandg1.gender='male'andg2.gender='female';

Datalog

Besides their logical notation, conjunctive queries can also be written as Datalog rules. Many authors in fact prefer the following Datalog notation for the query above:

result(student,address):-attends(student,course),gender(student,male),attends(student2,course),gender(student2,female),lives(student,address).

Although there are no quantifiers in this notation, variables appearing in the head of the rule are still implicitly universally quantified, while variables only appearing in the body of the rule are still implicitly existentially quantified.

While any conjunctive query can be written as a Datalog rule, not every Datalog program can be written as a conjunctive query. In fact, only single rules over extensional predicate symbols can be easily rewritten as an equivalent conjunctive query. The problem of deciding whether for a given Datalog program there is an equivalent nonrecursive program (corresponding to a positive relational algebra query, or, equivalently, a formula of positive existential first-order logic, or, as a special case, a conjunctive query) is known as the Datalog boundedness problem and is undecidable. [2]

Extensions

Extensions of conjunctive queries capturing more expressive power include:

The formal study of all of these extensions is justified by their application in relational databases and is in the realm of database theory.

Complexity

For the study of the computational complexity of evaluating conjunctive queries, two problems have to be distinguished. The first is the problem of evaluating a conjunctive query on a relational database where both the query and the database are considered part of the input. The complexity of this problem is usually referred to as combined complexity, while the complexity of the problem of evaluating a query on a relational database, where the query is assumed fixed, is called data complexity. [3]

Conjunctive queries are NP-complete with respect to combined complexity, [4] while the data complexity of conjunctive queries is very low, in the parallel complexity class AC0, which is contained in LOGSPACE and thus in polynomial time. The NP-hardness of conjunctive queries may appear surprising, since relational algebra and SQL strictly subsume the conjunctive queries and are thus at least as hard (in fact, relational algebra is PSPACE-complete with respect to combined complexity and is therefore even harder under widely held complexity-theoretic assumptions). However, in the usual application scenario, databases are large, while queries are very small, and the data complexity model may be appropriate for studying and describing their difficulty.

The problem of listing all answers to a non-Boolean conjunctive query has been studied in the context of enumeration algorithms, with a characterization (under some computational hardness assumptions) of the queries for which enumeration can be performed with linear time preprocessing and constant delay between each solution. Specifically, these are the acyclic conjunctive queries which also satisfy a free-connex condition. [5]

Formal properties

Conjunctive queries are one of the great success stories of database theory in that many interesting problems that are computationally hard or undecidable for larger classes of queries are feasible for conjunctive queries. [6] For example, consider the query containment problem. We write for two database relations of the same schema if and only if each tuple occurring in also occurs in . Given a query and a relational database instance , we write the result relation of evaluating the query on the instance simply as . Given two queries and and a database schema, the query containment problem is the problem of deciding whether for all possible database instances over the input database schema, . The main application of query containment is in query optimization: Deciding whether two queries are equivalent is possible by simply checking mutual containment.

The query containment problem is undecidable for relational algebra and SQL but is decidable and NP-complete for conjunctive queries. In fact, it turns out that the query containment problem for conjunctive queries is exactly the same problem as the query evaluation problem. [6] Since queries tend to be small, NP-completeness here is usually considered acceptable. The query containment problem for conjunctive queries is also equivalent to the constraint satisfaction problem. [7]

An important class of conjunctive queries that have polynomial-time combined complexity are the acyclic conjunctive queries. [8] The query evaluation, and thus query containment, is LOGCFL-complete and thus in polynomial time. [9] Acyclicity of conjunctive queries is a structural property of queries that is defined with respect to the query's hypergraph: [6] a conjunctive query is acyclic if and only if it has hypertree-width 1. For the special case of conjunctive queries in which all relations used are binary, this notion corresponds to the treewidth of the dependency graph of the variables in the query (i.e., the graph having the variables of the query as nodes and an undirected edge between two variables if and only if there is an atomic formula or in the query) and the conjunctive query is acyclic if and only if its dependency graph is acyclic.

An important generalization of acyclicity is the notion of bounded hypertree-width, which is a measure of how close to acyclic a hypergraph is, analogous to bounded treewidth in graphs. Conjunctive queries of bounded tree-width have LOGCFL combined complexity. [10]

Unrestricted conjunctive queries over tree data (i.e., a relational database consisting of a binary child relation of a tree as well as unary relations for labeling the tree nodes) have polynomial time combined complexity. [11]

Related Research Articles

First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science. First-order logic uses quantified variables over non-logical objects, and allows the use of sentences that contain variables, so that rather than propositions such as "Socrates is a man", one can have expressions in the form "there exists x such that x is Socrates and x is a man", where "there exists" is a quantifier, while x is a variable. This distinguishes it from propositional logic, which does not use quantifiers or relations; in this sense, propositional logic is the foundation of first-order logic.

In Boolean logic, a formula is in conjunctive normal form (CNF) or clausal normal form if it is a conjunction of one or more clauses, where a clause is a disjunction of literals; otherwise put, it is a product of sums or an AND of ORs. As a canonical normal form, it is useful in automated theorem proving and circuit theory.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form that gives it useful properties for use in logic programming, formal specification, universal algebra and model theory. Horn clauses are named for the logician Alfred Horn, who first pointed out their significance in 1951.

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases. Datalog has been applied to problems in data integration, networking, program analysis, and more.

In mathematics, a real closed field is a field F that has the same first-order properties as the field of real numbers. Some examples are the field of real numbers, the field of real algebraic numbers, and the field of hyperreal numbers.

Descriptive complexity is a branch of computational complexity theory and of finite model theory that characterizes complexity classes by the type of logic needed to express the languages in them. For example, PH, the union of all complexity classes in the polynomial hierarchy, is precisely the class of languages expressible by statements of second-order logic. This connection between complexity and the logic of finite structures allows results to be transferred easily from one area to the other, facilitating new proof methods and providing additional evidence that the main complexity classes are somehow "natural" and not tied to the specific abstract machines used to define them.

In universal algebra and in model theory, a structure consists of a set along with a collection of finitary operations and relations that are defined on it.

The complexity of constraint satisfaction is the application of computational complexity theory to constraint satisfaction. It has mainly been studied for discriminating between tractable and intractable classes of constraint satisfaction problems on finite domains.

In the theory of relational databases, a Boolean conjunctive query is a conjunctive query without distinguished predicates, i.e., a query in the form , where each is a relation symbol and each is a tuple of variables and constants; the number of elements in is equal to the arity of . Such a query evaluates to either true or false depending on whether the relations in the database contain the appropriate tuples of values, i.e. the conjunction is valid according to the facts in the database.

<span class="mw-page-title-main">Georg Gottlob</span> Austrian computer scientist

Georg Gottlob FRS is an Austrian-Italian computer scientist who works in the areas of database theory, logic, and artificial intelligence and is Professor of Informatics at the University of Calabria. He was Professor at the University of Oxford.

In logic, a quantifier is an operator that specifies how many individuals in the domain of discourse satisfy an open formula. For instance, the universal quantifier in the first order formula expresses that everything in the domain satisfies the property denoted by . On the other hand, the existential quantifier in the formula expresses that there exists something in the domain which satisfies that property. A formula where a quantifier takes widest scope is called a quantified formula. A quantified formula must contain a bound variable and a subformula specifying a property of the referent of that variable.

In relational database theory, a tuple-generating dependency (TGD) is a certain kind of constraint on a relational database. It is a subclass of the class of embedded dependencies (EDs).

In relational database theory, an embedded dependency (ED) is a certain kind of constraint on a relational database. It is the most general type of constraint used in practice, including both tuple-generating dependencies and equality-generating dependencies. Embedded dependencies can express functional dependencies, join dependencies, multivalued dependencies, inclusion dependencies, foreign key dependencies, and many more besides.

In relational database theory, an equality-generating dependency (EGD) is a certain kind of constraint on data. It is a subclass of the class of embedded dependencies (ED).

In mathematical logic, fixed-point logics are extensions of classical predicate logic that have been introduced to express recursion. Their development has been motivated by descriptive complexity theory and their relationship to database query languages, in particular to Datalog.

<span class="mw-page-title-main">Nicola Leone</span> Italian computer scientist

Nicola Leone is an Italian computer scientist who works in the areas of artificial intelligence, knowledge representation and reasoning, and database theory. Leone is currently the rector of the University of Calabria and a professor of Computer Science. Previously, he was a professor of Database Systems at the TU Wien.

The Vadalog system is a Knowledge Graph Management System (KGMS) that offers a language for performing complex logic reasoning tasks over knowledge graphs. At the same time, Vadalog delivers a platform to support the entire spectrum of data science tasks: data integration, pre-processing, statistical analysis, machine learning, algorithmic modeling, probabilistic reasoning and temporal reasoning. Its language is based on an extension of the rule-based language Datalog, Warded Datalog±, a high-performance language using an aggressive termination control strategy. Vadalog can support the entire spectrum of data science activities and tools. The system can read from and connect to multiple sources, from relational databases, such as PostgreSQL and MySQL, to graph databases, such as Neo4j, as well as make use of machine learning tools, and a web data extraction tool, OXPath. Additional Python libraries and extensions can also be easily integrated into the system.

DatalogZ is an extension of Datalog with integer arithmetic and comparisons. The decision problem of whether or not a given ground atom (fact) is entailed by a DatalogZ program is RE-complete, which can be shown by a reduction to diophantine equations.

In database theory, the query evaluation problem is the problem of determining the answers to a query on a database. Research in database theory aims at determining the computational complexity of answering different kinds of queries over databases, in particular over relational databases.

References

  1. Dan Olteanu, Jakub Závodný, Size Bounds for Factorised Representations of Query Results, 2015, DOI 10.1145/2656335,
  2. Gerd G. Hillebrand, Paris C. Kanellakis, Harry G. Mairson, Moshe Y. Vardi: Undecidable Boundedness Problems for Datalog Programs. J. Log. Program. 25(2): 163-190 (1995)
  3. Vardi, Moshe Y. (1982), "The complexity of relational query languages (Extended Abstract)", Proceedings of the fourteenth annual ACM symposium on Theory of computing - STOC '82, pp. 137–146, CiteSeerX   10.1.1.331.6045 , doi:10.1145/800070.802186, ISBN   978-0897910705, S2CID   7869248, archived from the original on 2011-08-23, retrieved 2011-05-16
  4. Ashok K. Chandra and Philip M. Merlin, 1977. Optimal Implementation of Conjunctive Queries in Relational Data Bases. STOC '77: Proceedings of the ninth annual ACM symposium on Theory of computing
  5. Bagan, Guillaume; Durand, Arnaud; Grandjean, Etienne (2007). Duparc, Jacques; Henzinger, Thomas A. (eds.). "On Acyclic Conjunctive Queries and Constant Delay Enumeration". Computer Science Logic. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 4646: 208–222. doi:10.1007/978-3-540-74915-8_18. ISBN   9783540749158.
  6. 1 2 3 Serge Abiteboul, Richard B. Hull, Victor Vianu: Foundations of Databases. Addison-Wesley, 1995.
  7. Kolaitis, Phokion G.; Vardi, Moshe Y. (2000), "Conjunctive-Query Containment and Constraint Satisfaction", Journal of Computer and System Sciences, 61 (2): 302–332, doi: 10.1006/jcss.2000.1713
  8. Mihalis Yannakakis: Algorithms for Acyclic Database Schemes . Proc. VLDB 1981: 82-94.
  9. Georg Gottlob, Nicola Leone, and Francesco Scarcello (2001). "The complexity of acyclic conjunctive queries". Journal of the ACM 48 (3): 431–498. doi : 10.1145/382780.382783.
  10. Georg Gottlob, Nicola Leone, and Francesco Scarcello: Hypertree Decompositions and Tractable Queries. J. Comput. Syst. Sci. 64(3): 579-627 (2002)
  11. Georg Gottlob, Christoph Koch, Klaus U. Schulz: Conjunctive queries over trees. J. ACM 53(2): 238-272 (2006)