Database repair

Last updated January 10, 2023

Database repair is a problem about relational databases that has been studied in database theory, and it is a particular kind of data cleansing. The problem asks how an input relational database can be "repaired" so that it satisfies its integrity constraints. The goal is to be able to work with "dirty" data, i.e., data that violates the integrity constraints, by reasoning about all possible repairs of the data, i.e., all possible ways to change the data so that it satisfies the constraints, without committing to a specific choice.

Several variations of the problem exist, depending on:

what we intend to figure out about the dirty data: whether some database tuple is certain (i.e., is in every repaired database), or whether some query answer is certain (i.e., is returned when the query is evaluated on every repaired database)
which kinds of changes are allowed when repairing the database: inserting new facts, removing facts (so-called subset repairs), and so on
which repaired databases we study: those that change only a minimal subset of the database tuples (e.g., minimal subset repairs), or those that change only a minimal number of database tuples (e.g., minimal cardinality repairs)

The problem of database repair has been studied to understand the complexity of these problem variants, i.e., whether information about the repairs can be determined efficiently without explicitly materializing all of them.
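The notions of subset repair and certain answer can be illustrated with a small brute-force sketch. It is not an algorithm from the literature: it materializes every repair explicitly, which is exactly what efficient approaches try to avoid. The helper names (`violates_fd`, `subset_repairs`, `certain_answers`), the toy relation, and the choice of a single functional dependency as the integrity constraint are all assumptions made for illustration.

```python
from itertools import combinations

def violates_fd(facts, lhs, rhs):
    """True if two facts agree on the positions in `lhs` but differ on `rhs`."""
    seen = {}
    for fact in facts:
        key = tuple(fact[i] for i in lhs)
        val = tuple(fact[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return True
    return False

def subset_repairs(facts, lhs, rhs):
    """All consistent subsets that are maximal under set inclusion.

    Brute force over all subsets: exponential, for illustration only.
    """
    facts = list(facts)
    consistent = [
        frozenset(subset)
        for size in range(len(facts) + 1)
        for subset in combinations(facts, size)
        if not violates_fd(subset, lhs, rhs)
    ]
    return [s for s in consistent if not any(s < t for t in consistent)]

def certain_answers(repairs, query):
    """Answers that `query` returns on every repair."""
    results = [set(query(repair)) for repair in repairs]
    return set.intersection(*results) if results else set()

# Hypothetical dirty relation Lives(name, city) with the constraint name -> city.
db = {("alice", "Paris"), ("alice", "Rome"), ("bob", "Oslo")}
repairs = subset_repairs(db, lhs=(0,), rhs=(1,))

# Two repairs exist: one keeps alice in Paris, the other keeps alice in Rome.
names = certain_answers(repairs, lambda r: {(name,) for name, _ in r})
cities = certain_answers(repairs, lambda r: {(city,) for _, city in r})
print(names)   # both names are certain
print(cities)  # only Oslo is certain
```

In this example every name is a certain answer, because each repair keeps one fact per person, while only `Oslo` is a certain city, because the two repairs disagree on where alice lives.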

Related Research Articles

Database normalization or database normalisation is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by British computer scientist Edgar F. Codd as part of his relational model.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and maintaining the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

Object–relational mapping in computer science is a programming technique for converting data between a relational database and the heap of an object-oriented programming language. This creates, in effect, a virtual object database that can be used from within the programming language. There are both free and commercial packages available that perform object–relational mapping, as well as many custom solutions used within a single codebase.

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.

In database theory, relational algebra is a theory that uses algebraic structures with a well-founded semantics for modeling data, and defining queries on it. The theory was introduced by Edgar F. Codd.

Tuple calculus is a calculus that was created and introduced by Edgar F. Codd as part of the relational model, in order to provide a declarative database-query language for data manipulation in this data model. It formed the inspiration for the database-query languages QUEL and SQL, of which the latter, although far less faithful to the original relational model and calculus, is now the de facto standard database-query language; a dialect of SQL is used by nearly every relational-database-management system. Michel Lacroix and Alain Pirotte proposed domain calculus, which is closer to first-order logic and together with Codd showed that both of these calculi are equivalent in expressive power. Subsequently, query languages for the relational model were called relationally complete if they could express at least all of these queries.

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. Given a relation $R$ and sets of attributes $X$ and $Y$ of $R$, $X$ is said to functionally determine $Y$ if and only if each $X$ value in $R$ is associated with precisely one $Y$ value in $R$; $R$ is then said to satisfy the functional dependency $X \to Y$. Equivalently, the projection of $R$ onto the attributes of $X$ and $Y$ is a function, i.e. $Y$ is a function of $X$. In simple words, if the values $x$ for the $X$ attributes are known, then the values for the $Y$ attributes corresponding to $x$ can be determined by looking them up in any tuple of $R$ containing $x$. Customarily $X$ is called the determinant set and $Y$ the dependent set. A functional dependency $X \to Y$ is called trivial if $Y$ is a subset of $X$.
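The definition above translates directly into a check on one relation instance. The following is a minimal sketch, assuming rows are represented as Python dictionaries; the function names and the sample data are hypothetical.

```python
def satisfies_fd(rows, X, Y):
    """True if each X value in `rows` is associated with exactly one Y value."""
    mapping = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        # Remember the first Y value seen for x; any later mismatch violates X -> Y.
        if mapping.setdefault(x, y) != y:
            return False
    return True

def is_trivial(X, Y):
    """A dependency X -> Y is trivial when Y is a subset of X."""
    return set(Y) <= set(X)

rows = [
    {"emp": "ann", "dept": "db", "floor": 2},
    {"emp": "bob", "dept": "db", "floor": 2},
    {"emp": "cy", "dept": "ai", "floor": 3},
]
print(satisfies_fd(rows, ["dept"], ["floor"]))   # True
print(satisfies_fd(rows, ["floor"], ["emp"]))    # False: floor 2 has two employees
print(is_trivial(["dept", "floor"], ["floor"]))  # True
```

Note that this only tests whether one particular set of rows satisfies the dependency; it does not establish that the dependency is a constraint of the schema.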

A candidate key, or simply a key, of a relational database is a minimal superkey. In other words, it is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column would possibly produce duplicate rows.
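As a rough illustration of "minimal superkey", the sketch below enumerates the column sets that are unique on a given set of rows and keeps only the minimal ones. A candidate key is really a property of the schema, so what this computes is only the set of keys consistent with one instance; the function names and sample rows are assumptions, and the empty column set is deliberately not considered.

```python
from itertools import combinations

def is_superkey(rows, cols):
    """True if no two rows share the same combination of values on `cols`."""
    projected = [tuple(row[c] for c in cols) for row in rows]
    return len(set(projected)) == len(projected)

def candidate_keys(rows, columns):
    """Minimal unique column sets for this instance, smallest sets first."""
    keys = []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            # Skip any set that contains an already-found key: it is not minimal.
            if any(set(k) <= set(cols) for k in keys):
                continue
            if is_superkey(rows, cols):
                keys.append(cols)
    return keys

rows = [
    {"id": 1, "email": "a@x.org", "name": "Ann"},
    {"id": 2, "email": "b@x.org", "name": "Ann"},
    {"id": 3, "email": "c@x.org", "name": "Bob"},
]
print(candidate_keys(rows, ["id", "email", "name"]))  # [('id',), ('email',)]
```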

Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.

A sublanguage is a subset of a language. Sublanguages occur in natural language, computer programming language, and relational databases.

The object–relational impedance mismatch is a set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being served by an application program written in an object-oriented programming language or style, particularly because objects or class definitions must be mapped to database tables defined by a relational schema.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

The complexity of constraint satisfaction is the application of computational complexity theory on constraint satisfaction. It has mainly been studied for discriminating between tractable and intractable classes of constraint satisfaction problems on finite domains.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume of existing data and the need to share it grow. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business information.

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

In database theory, a conjunctive query is a restricted form of first-order queries using the logical conjunction operator. Many first-order queries can be written as conjunctive queries. In particular, a large part of queries issued on relational databases can be expressed in this way. Conjunctive queries also have a number of desirable theoretical properties that larger classes of queries do not share.

The following is provided as an overview of and topical guide to databases:

In relational database theory, a tuple-generating dependency (TGD) is a certain kind of constraint on a relational database. It is a subclass of the class of embedded dependencies (EDs).

In database theory, Imieliński–Lipski algebra is an extension of relational algebra onto tables with different types of null values. It is used to operate on relations with incomplete information.

References

Arenas, Marcelo; Bertossi, Leopoldo; Chomicki, Jan (1999). Consistent Query Answers in Inconsistent Databases (PDF). PODS.
