Uncertain database

Last updated March 25, 2024

An uncertain database^[1] is a kind of database studied in database theory. Its purpose is to manage data whose content is not known with certainty. Uncertain databases make it possible to represent and manage this uncertainty explicitly, usually in a succinct way.

Formal definition

Uncertain databases are built on the notion of a possible world. A possible world of an uncertain database is a (certain) database that is one of the possible realizations of the uncertain database. A given uncertain database typically has more than one possible world, and potentially infinitely many.

A formalism for uncertain databases then specifies how a set of possible worlds is succinctly represented as a single uncertain database.

Types of uncertain databases

Uncertain database models differ in how they represent and quantify these possible worlds:

Incomplete databases^[2]^[3] are a compact representation of the set of possible worlds. The use of NULL in SQL, arguably the most commonplace instantiation of uncertain databases, is an example of an incomplete database model.
Probabilistic databases^[4] are a compact representation of a probability distribution over the set of possible worlds.
Fuzzy databases^[5] are a compact representation of a fuzzy set of the possible worlds.
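
The difference between the three families above can be illustrated with a small sketch. It deliberately lists the possible worlds explicitly instead of using a compact representation, so it only shows what each model quantifies, not how a real system stores it. The relation contents, the weights and the helper names (`is_distribution`, `is_fuzzy_set`, `support`) are assumptions made for illustration.

```python
# Illustrative sketch only: possible worlds are enumerated explicitly,
# whereas real uncertain database models represent them compactly.

# A certain database is modelled here as a frozenset of (name, salary) tuples.
w1 = frozenset({("Alice", 10_000), ("Bob", 8_000)})
w2 = frozenset({("Alice", 10_000), ("Bob", 9_000)})

# Incomplete database: a plain set of possible worlds, with no weights.
incomplete = {w1, w2}

# Probabilistic database: a probability distribution over possible worlds.
probabilistic = {w1: 0.7, w2: 0.3}

# Fuzzy database: a membership degree in [0, 1] for each world;
# unlike probabilities, the degrees need not sum to 1.
fuzzy = {w1: 1.0, w2: 0.4}


def is_distribution(weights, tol=1e-9):
    """Check that the weights are non-negative and sum to 1."""
    return (all(p >= 0 for p in weights.values())
            and abs(sum(weights.values()) - 1.0) <= tol)


def is_fuzzy_set(degrees):
    """Check that every membership degree lies in [0, 1]."""
    return all(0.0 <= d <= 1.0 for d in degrees.values())


def support(weights):
    """Worlds with a non-zero weight: the underlying set of possible worlds."""
    return {world for world, value in weights.items() if value > 0}
```

Dropping the weights of the probabilistic or fuzzy version (via `support`) gives back the set of possible worlds of the incomplete version in this example.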

Though mostly studied in the relational setting, uncertain database models can also be defined for other data models, such as graph databases^[6] or XML databases.

Incomplete database

The most common database model is the relational model. Multiple incomplete database models have been defined over the relational model, forming extensions of the relational algebra. These have been called^[7] Imieliński–Lipski algebras:

Relations with NULL values, also called Codd tables
c-tables^[2]
v-tables^[2]

Example

The following table is a relation of an incomplete database, described in the formalism of NULL values:

id	Name	Salary
1	Alice	10,000
2	Bob	`NULL`
3	Charlie	`NULL`

There are infinitely many possible worlds for this incomplete database, obtained by replacing the "NULL" values with concrete values. For instance, the following relation is a possible world:

id	Name	Salary
1	Alice	10,000
2	Bob	8,000
3	Charlie	12,000
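
The construction of possible worlds in this example can be sketched as follows. Since the true set of worlds is infinite, the sketch restricts the unknown salaries to a small finite list of candidate values. That list, the use of Python's `None` for NULL, and the function name `possible_worlds` are assumptions made purely for illustration.

```python
from itertools import product

NULL = None  # stand-in for the SQL NULL marker

# The incomplete relation from the example: (id, name, salary).
relation = [
    (1, "Alice", 10_000),
    (2, "Bob", NULL),
    (3, "Charlie", NULL),
]


def possible_worlds(rows, domain):
    """Yield every world obtained by replacing each NULL with a domain value."""
    null_positions = [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value is NULL
    ]
    # Each NULL is filled independently of the others.
    for choice in product(domain, repeat=len(null_positions)):
        filled = [list(row) for row in rows]
        for (i, j), value in zip(null_positions, choice):
            filled[i][j] = value
        yield frozenset(tuple(row) for row in filled)


# Hypothetical finite domain; the real domain of salaries is unbounded.
candidate_salaries = [8_000, 10_000, 12_000]
worlds = set(possible_worlds(relation, candidate_salaries))

# The possible world shown in the second table of the example.
example_world = frozenset({
    (1, "Alice", 10_000),
    (2, "Bob", 8_000),
    (3, "Charlie", 12_000),
})
```

With three candidate values and two NULLs, the sketch produces nine worlds, one of which is the relation shown above.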


References

[1] Aggarwal, Charu C., ed. (2009). Managing and Mining Uncertain Data. Advances in Database Systems. Vol. 35. Bibcode:2009mmud.book.....A. doi:10.1007/978-0-387-09690-2. ISBN 978-0-387-09689-6. ISSN 1386-2944.
[2] Imieliński, Tomasz; Lipski, Witold (1984-09-20). "Incomplete Information in Relational Databases". Journal of the ACM. 31 (4): 761–791. doi:10.1145/1634.1886. ISSN 0004-5411.
[3] Abiteboul, Serge; Hull, Richard; Vianu, Victor (1995). "Incomplete information" (PDF). Foundations of Databases. Addison-Wesley. ISBN 0-201-53771-0.
[4] Suciu, Dan; Olteanu, Dan; Ré, Christopher; Koch, Christoph (2011). "Probabilistic Databases". Synthesis Lectures on Data Management. doi:10.1007/978-3-031-01879-4. ISBN 978-3-031-00751-4. ISSN 2153-5418. S2CID 264145434.
[5] Petry, Frederick E. (1996). "Fuzzy Databases". International Series in Intelligent Technologies. 5. doi:10.1007/978-1-4613-1319-9. ISBN 978-1-4612-8566-3. ISSN 1382-3434.
[6] Khan, Arijit; Ye, Yuan; Chen, Lei (2018). "On Uncertain Graphs". Synthesis Lectures on Data Management. doi:10.1007/978-3-031-01860-2. ISBN 978-3-031-00732-3. ISSN 2153-5418.
[7] Green, Todd J.; Karvounarakis, Grigoris; Tannen, Val (2007-06-11). "Provenance semirings". Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS '07. New York, NY, USA: Association for Computing Machinery. pp. 31–40. doi:10.1145/1265530.1265535. ISBN 978-1-59593-685-1.

