Chase (algorithm)

Last updated

The chase is a simple fixed-point algorithm testing and enforcing implication of data dependencies in database systems. It plays important roles in database theory as well as in practice. It is used, directly or indirectly, on an everyday basis by people who design databases, and it is used in commercial systems to reason about the consistency and correctness of a data design.[ citation needed ] New applications of the chase in meta-data management and data exchange are still being discovered.

Contents

The chase has its origins in two seminal papers of 1979, one by Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman [1] and the other by David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. [2]

In its simplest application the chase is used for testing whether the projection of a relation schema constrained by some functional dependencies onto a given decomposition can be recovered by rejoining the projections. Let t be a tuple in where R is a relation and F is a set of functional dependencies (FD). If tuples in R are represented as t1, ..., tk, the join of the projections of each ti should agree with t on where i = 1, 2, ..., k. If ti is not on , the value is unknown.

The chase can be done by drawing a tableau (which is the same formalism used in tableau query). Suppose R has attributes A, B, ... and components of t are a, b, .... For ti use the same letter as t in the components that are in Si but subscript the letter with i if the component is not in Si. Then, ti will agree with t if it is in Si and will have a unique value otherwise.

The chase process is confluent. There exist implementations of the chase algorithm, [3] some of them are also open-source. [4]

Example

Let R(A, B, C, D) be a relation schema known to obey the set of functional dependencies F = {AB, BC, CD→A}. Suppose R is decomposed into three relation schemas S1 = {A, D}, S2 = {A, C} and S3 = {B, C, D}. Determining whether this decomposition is lossless can be done by performing a chase as shown below.

The initial tableau for this decomposition is:

ABCD
ab1c1d
ab2cd2
a3bcd

The first row represents S1. The components for attributes A and D are unsubscripted and those for attributes B and C are subscripted with i = 1. The second and third rows are filled in the same manner with S2 and S3 respectively.

The goal for this test is to use the given F to prove that t = (a, b, c, d) is really in R. To do so, the tableau can be chased by applying the FDs in F to equate symbols in the tableau. A final tableau with a row that is the same as t implies that any tuple t in the join of the projections is actually a tuple of R.
To perform the chase test, first decompose all FDs in F so each FD has a single attribute on the right hand side of the "arrow". (In this example, F remains unchanged because all of its FDs already have a single attribute on the right hand side: F = {AB, BC, CDA}.)

When equating two symbols, if one of them is unsubscripted, make the other be the same so that the final tableau can have a row that is exactly the same as t = (a, b, c, d). If both have their own subscript, change either to be the other. However, to avoid confusion, all of the occurrences should be changed.
First, apply AB to the tableau. The first row is (a, b1, c1, d) where a is unsubscripted and b1 is subscripted with 1. Comparing the first row with the second one, change b2 to b1. Since the third row has a3, b in the third row stays the same. The resulting tableau is:

ABCD
ab1c1d
ab1cd2
a3bcd

Then consider BC. Both first and second rows have b1 and notice that the second row has an unsubscripted c. Therefore, the first row changes to (a, b1, c, d). Then the resulting tableau is:

ABCD
ab1cd
ab1cd2
a3bcd

Now consider CDA. The first row has an unsubscripted c and an unsubscripted d, which is the same as in third row. This means that the A value for row one and three must be the same as well. Hence, change a3 in the third row to a. The resulting tableau is:

ABCD
ab1cd
ab1cd2
abcd

At this point, notice that the third row is (a, b, c, d) which is the same as t. Therefore, this is the final tableau for the chase test with given R and F. Hence, whenever R is projected onto S1, S2 and S3 and rejoined, the result is in R. Particularly, the resulting tuple is the same as the tuple of R that is projected onto {B, C, D}.

Related Research Articles

Database normalization is the process of structuring a database, usually a relational database, in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model.

The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

In mathematics, a tuple is a finite ordered list (sequence) of elements. An n-tuple is a sequence of n elements, where n is a non-negative integer. There is only one 0-tuple, referred to as the empty tuple. An n-tuple is defined inductively using the construction of an ordered pair.

Ellipsoid Quadric surface that looks like a deformed sphere

An ellipsoid is a surface that may be obtained from a sphere by deforming it by means of directional scalings, or more generally, of an affine transformation.

In database theory, relational algebra is a theory that uses algebraic structures with a well-founded semantics for modeling data, and defining queries on it. The theory was introduced by Edgar F. Codd.

Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management.

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, a functional dependency is a constraint between two keys. Given a relation R and sets of attributes , X is said to functionally determineY if and only if each X value in R is associated with precisely one Y value in R; R is then said to satisfy the functional dependency XY. Equivalently, the projection is a Function, i.e. Y is a function of X. In simple words, if the values for the X attributes are known, then the values for the Y attributes corresponding to x can be determined by looking them up in any tuple of R containing x. Customarily X is called the determinant set and Y the dependent set. A functional dependency FD: XY is called trivial if Y is a subset of X.

A candidate key, or simply a key, of a relational database is a minimal superkey. In other words, it is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column would possibly produce duplicate rows.

In mathematics, particularly in functional analysis, a projection-valued measure (PVM) is a function defined on certain subsets of a fixed set and whose values are self-adjoint projections on a fixed Hilbert space. Projection-valued measures are formally similar to real-valued measures, except that their values are self-adjoint projections rather than real numbers. As in the case of ordinary measures, it is possible to integrate complex-valued functions with respect to a PVM; the result of such an integration is a linear operator on the given Hilbert space.

In relational algebra, a projection is a unary operation written as , where is a relation and are attribute names. Its result is defined as the set obtained when the components of the tuples in are restricted to the set – it discards the other attributes.

Boyce–Codd normal form is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF). BCNF was developed in 1974 by Raymond F. Boyce and Edgar F. Codd to address certain types of anomalies not dealt with by 3NF as originally defined.

In mathematics, a projection is a mapping of a set into a subset, which is equal to its square for mapping composition. The restriction to a subspace of a projection is also called a projection, even if the idempotence property is lost. An everyday example of a projection is the casting of shadows onto a plane. The projection of a point is its shadow on the paper sheet. The shadow of a point on the paper sheet is this point itself (idempotency). The shadow of a three-dimensional sphere is a closed disk. Originally, the notion of projection was introduced in Euclidean geometry to denote the projection of the Euclidean space of three dimensions onto a plane in it, like the shadow example. The two main projections of this kind are:

The term "information algebra" refers to mathematical techniques of information processing. Classical information theory goes back to Claude Shannon. It is a theory of information transmission, looking at communication and storage. However, it has not been considered so far that information comes from different sources and that it is therefore usually combined. It has furthermore been neglected in classical information theory that one wants to extract those parts out of a piece of information that are relevant to specific questions.

Seymour Ginsburg was an American pioneer of automata theory, formal language theory, and database theory, in particular; and computer science, in general. His work was influential in distinguishing theoretical Computer Science from the disciplines of Mathematics and Electrical Engineering.

According to database theory, a multivalued dependency is a full constraint between two sets of attributes in a relation.

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

In mathematics and computer science, a history monoid is a way of representing the histories of concurrently running computer processes as a collection of strings, each string representing the individual history of a process. The history monoid provides a set of synchronization primitives for providing rendezvous points between a set of independently executing processes or threads.

In the area of computer science known as dependency theory, a join dependency is a constraint on the set of legal relations over a database scheme. A table is subject to a join dependency if can always be recreated by joining multiple tables each having a subset of the attributes of . If one of the tables in the join has all the attributes of the table , the join dependency is called trivial.

Relation (database)

In relational database theory, a relation, as originally defined by E. F. Codd, is a set of tuples (d1, d2, ..., dn), where each element dj is a member of Dj, a data domain. Codd's original definition notwithstanding, and contrary to the usual definition in mathematics, there is no ordering to the elements of the tuples of a relation. Instead, each element is termed an attribute value. An attribute is a name paired with a domain. An attribute value is an attribute name paired with an element of that attribute's domain, and a tuple is a set of attribute values in which no two distinct elements have the same name. Thus, in some accounts, a tuple is described as a function, mapping names to values.

In database design, a lossless join decomposition is a decomposition of a relation into relations such that a natural join of the two smaller relations yields back the original relation. This is central in removing redundancy safely from databases while preserving the original data.

References

  1. Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman: "The Theory of Joins in Relational Databases", ACM Trans. Datab. Syst. 4(3):297-314, 1979.
  2. David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv: "Testing Implications of Data Dependencies". ACM Trans. Datab. Syst. 4(4):455-469, 1979.
  3. Michael Benedikt, George Konstantinidis, Giansalvatore Mecca, Boris Motik, Paolo Papotti, Donatello Santoro, Efthymia Tsamoura: Benchmarking the Chase. In Proc. of PODS, 2017.
  4. "The Llunatic Mapping and Cleaning Chase Engine". 6 April 2021.

Further reading