Probabilistic database

Most real databases contain data whose correctness is uncertain. Working with such data requires a way to quantify that uncertainty, which is what probabilistic databases provide.

A probabilistic database is an uncertain database in which the possible worlds have associated probabilities. Probabilistic database management systems are an active area of research. "While there are currently no commercial probabilistic database systems, several research prototypes exist..." [1]

Probabilistic databases distinguish between the logical data model and the physical representation of the data, much as relational databases do in the ANSI-SPARC Architecture. In probabilistic databases this separation is even more crucial, since such databases must succinctly represent very large numbers of possible worlds, often exponential in the size of one world (a classical database). [2] [3]

Terminology

In a probabilistic database, each tuple is associated with a probability between 0 and 1, where 0 means the tuple is certainly incorrect and 1 means it is certainly correct.
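
As a minimal sketch of this idea, a tuple-level probabilistic relation can be stored as ordinary tuples paired with probabilities. The relation and numbers below are hypothetical, and this representation is just one of many possibilities:

```python
# A hypothetical probabilistic relation: each tuple is paired with the
# probability that it is correct (i.e., that it belongs in the database).
saw = [
    (("Mary", "car"), 0.8),   # fairly confident this observation is correct
    (("John", "bike"), 0.4),  # less confident
]

for tup, p in saw:
    print(f"P({tup} is in the database) = {p}")
```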

Possible worlds

A probabilistic database could exist in multiple states. For example, if there is uncertainty about the existence of a tuple in the database, then the database could be in two different states with respect to that tuple—the first state contains the tuple, while the second one does not. Similarly, if an attribute can take one of the values x, y or z, then the database can be in three different states with respect to that attribute.

Each of these states is called a possible world.

Consider the following database:

An Incomplete Database

A    B
a1   b1
a2   b2
a3   {b3, b3′, b3′′}

(Here {b3, b3′, b3′′} denotes that the attribute can take any of the values b3, b3′ or b3′′)

Then the actual state of the database may or may not contain the first tuple, depending on whether that tuple is correct. Similarly, the value of attribute B in the third tuple may be b3, b3′ or b3′′.

Consequently, the possible worlds corresponding to the database are as follows:

World 1

A    B
a1   b1
a2   b2
a3   b3

World 2

A    B
a1   b1
a2   b2
a3   b3′

World 3

A    B
a1   b1
a2   b2
a3   b3′′

World 4

A    B
a2   b2
a3   b3

World 5

A    B
a2   b2
a3   b3′

World 6

A    B
a2   b2
a3   b3′′
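
The enumeration above can be reproduced mechanically. Below is a minimal sketch (the variable names are illustrative, not from any particular system) that derives the six worlds by combining the two choices for the uncertain first tuple with the three candidate values for the third tuple's B attribute:

```python
from itertools import product

# The incomplete database above: the first tuple may be absent
# (tuple-level uncertainty), and attribute B of the third tuple has
# three candidate values (attribute-level uncertainty).
certain = [("a2", "b2")]
maybe_tuple = ("a1", "b1")             # present in some worlds, absent in others
b3_candidates = ["b3", "b3'", "b3''"]  # exactly one is the true value

for i, (present, b3) in enumerate(product([True, False], b3_candidates), start=1):
    world = ([maybe_tuple] if present else []) + certain + [("a3", b3)]
    print(f"World {i}: {world}")
```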

Types of Uncertainties

There are essentially two kinds of uncertainty that can exist in a probabilistic database, as described below:

Tuple-level uncertainty: uncertainty about whether a tuple is correct, that is, whether it should exist in the database at all. Corresponding to each uncertain tuple there are two possible worlds: one that includes the tuple and one that does not. Tuple-level uncertainty can be seen as a Boolean random variable associated with each uncertain tuple.

Attribute-level uncertainty: uncertainty about the value that an attribute of a tuple takes, that is, the attribute could take one of several possible values. Corresponding to each uncertain attribute that can take one of the values a1, ..., an, there are n possible worlds. Attribute-level uncertainty can be seen as a random variable, taking values a1, ..., an, associated with each uncertain attribute.

By assigning values to random variables associated with the data items, different possible worlds can be represented.
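
For example, under the common assumption that the random variables are independent, the probability of a world is the product of the probabilities of the variable assignments that produce it. The sketch below uses hypothetical probabilities for the database above:

```python
from itertools import product

# A sketch of the possible-worlds distribution for the example above,
# assuming the two random variables are independent. The probabilities
# are hypothetical. X: does the first tuple exist? Y: the value of B in
# the third tuple.
p_tuple_present = 0.6                           # P(X = true), hypothetical
p_b3 = {"b3": 0.5, "b3'": 0.3, "b3''": 0.2}     # P(Y = value), hypothetical

total = 0.0
for present, (b3, p_y) in product([True, False], p_b3.items()):
    p_x = p_tuple_present if present else 1 - p_tuple_present
    p_world = p_x * p_y  # independence: multiply the two probabilities
    total += p_world
    print(f"first tuple present={present}, B of a3={b3}: P = {p_world:.2f}")

print(f"sum over all six worlds = {total:.2f}")  # 1.00, a valid distribution
```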

History

The first published use of the term "probabilistic database" was probably in the 1987 VLDB conference paper "The theory of probabilistic databases" by Cavallo and Pittarelli. [4] The title of the 11-page paper was intended as a bit of a joke, since David Maier's 600-page monograph, The Theory of Relational Databases, would have been familiar at the time to many of the conference participants and readers of the proceedings.

Related Research Articles

Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by British computer scientist Edgar F. Codd as part of his relational model.

A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

Object–relational mapping in computer science is a programming technique for converting data between a relational database and the heap of an object-oriented programming language. This creates, in effect, a virtual object database that can be used from within the programming language.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data and defining queries on it with a well-founded semantics. The theory was introduced by Edgar F. Codd.

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if no attribute domain has relations as elements; informally, no table column can have tables as values. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal form is a minimal requirement. SQL-92 does not support creating or using table-valued columns, which means that, using only the "traditional relational database features", most relational databases will be in first normal form by necessity. Database systems which do not require first normal form are often called NoSQL systems. Newer SQL standards like SQL:1999 have started to allow so-called non-atomic types, which include composite types. Even newer versions like SQL:2016 allow JSON.
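
As an illustrative sketch with hypothetical data, the difference can be seen in how a multi-valued attribute is stored:

```python
# Not 1NF: the "phones" attribute holds a nested collection (a relation
# inside an attribute). Data is hypothetical.
person_non_1nf = {"name": "Ann", "phones": ["555-1", "555-2"]}

# 1NF: flatten the nested collection into one row per atomic value.
person_phones_1nf = [
    {"name": "Ann", "phone": "555-1"},
    {"name": "Ann", "phone": "555-2"},
]
```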

In the relational model of databases, a primary key is a specific choice of a minimal set of attributes (columns) that uniquely specify a tuple (row) in a relation (table). Informally, a primary key is "which attributes identify a record", and in simple cases constitutes a single attribute: a unique ID. More formally, a primary key is a choice of candidate key; any other candidate key is an alternate key.

Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.
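
A minimal sketch of such a check, over hypothetical tables, might look like this:

```python
# A hypothetical referential-integrity check: every order must
# reference an existing customer.
customers = {10, 11}                        # known customer ids
orders = [{"id": 1, "customer_id": 10},
          {"id": 2, "customer_id": 99}]     # customer 99 does not exist

violations = [o for o in orders if o["customer_id"] not in customers]
print(violations)  # order 2 violates referential integrity
```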

In a relational database, a column is a set of data values of a particular type, one value for each row of a table. A column may contain text values, numbers, or even pointers to files in the operating system. Columns typically contain simple types, though some relational database systems allow columns to contain more complex data types, such as whole documents, images, or even video clips. A column can also be called an attribute.

Object–relational impedance mismatch is a set of difficulties that arise when moving between data in relational data stores and data in domain-driven object models. Relational database management systems (RDBMS) are the standard method for storing data in a dedicated database, while object-oriented (OO) programming is the default method for business-centric design in programming languages. The problem lies neither in relational databases nor in OO programming, but in the conceptual difficulty of mapping between the two logical models. Both logical models can be implemented differently using database servers, programming languages, design patterns, or other technologies. Issues range from application to enterprise scale, whenever stored relational data is used in domain-driven object models, and vice versa. Object-oriented data stores can trade this problem for other implementation difficulties.

In relational algebra, a projection is a unary operation written as Π_{a1, ..., an}(R), where R is a relation and a1, ..., an are attribute names. Its result is defined as the set obtained when the components of the tuples in R are restricted to the set {a1, ..., an} – it discards the other attributes.
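
As a small illustration (not tied to any particular system), projection can be sketched over tuples modelled as dictionaries:

```python
# A minimal sketch of relational projection: restrict each tuple
# (a dict from attribute names to values) to the chosen attributes,
# keeping the result a set so that duplicates collapse.
def project(relation, attributes):
    return {tuple((a, row[a]) for a in attributes) for row in relation}

R = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c2"},  # differs from the first row only in C
]
print(project(R, ["A", "B"]))  # one element: projection discards C
```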

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

Probabilistic logic involves the use of probability and logic to deal with uncertain situations. Probabilistic logic extends traditional logic truth tables with probabilistic expressions. A difficulty of probabilistic logics is their tendency to multiply the computational complexities of their probabilistic and logical components. Other difficulties include the possibility of counter-intuitive results, such as in the case of belief fusion in Dempster–Shafer theory. Source trust and epistemic uncertainty about the probabilities they provide, such as defined in subjective logic, are additional elements to consider. The need to deal with a broad variety of contexts and issues has led to many different proposals.

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

Subjective logic is a type of probabilistic logic that explicitly takes epistemic uncertainty and source trust into account. In general, subjective logic is suitable for modeling and analysing situations involving uncertainty and relatively unreliable sources. For example, it can be used for modeling and analysing trust networks and Bayesian networks.

In computer science, uncertain data is data that contains noise that makes it deviate from the correct, intended or original values. In the age of big data, uncertainty, or data veracity, is one of the defining characteristics of data. Data is constantly growing in volume, variety and velocity, and in uncertainty (the inverse of veracity). Uncertain data is found in abundance today on the web, in sensor networks, and within enterprises, in both their structured and unstructured sources. For example, there may be uncertainty regarding the address of a customer in an enterprise dataset, or the temperature readings captured by a sensor may be uncertain due to aging of the sensor. In 2012, IBM called out managing uncertain data at scale in its Global Technology Outlook report, a comprehensive analysis looking three to ten years into the future that seeks to identify significant, disruptive technologies that will change the world. To make confident business decisions based on real-world data, analyses must account for the many kinds of uncertainty present in very large amounts of data. Analyses based on uncertain data will affect the quality of subsequent decisions, so the degree and types of inaccuracies in the data cannot be ignored.

In database theory, a relation, as originally defined by E. F. Codd, is a set of tuples (d1,d2,...,dn), where each element dj is a member of Dj, a data domain. Codd's original definition notwithstanding, and contrary to the usual definition in mathematics, there is no ordering to the elements of the tuples of a relation. Instead, each element is termed an attribute value. An attribute is a name paired with a domain. An attribute value is an attribute name paired with an element of that attribute's domain, and a tuple is a set of attribute values in which no two distinct elements have the same name. Thus, in some accounts, a tuple is described as a function, mapping names to values.
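
A minimal illustration of this "tuple as a function" view, with hypothetical values:

```python
# A tuple viewed as a function from attribute names to values:
# a Python dict makes the mapping explicit. Values are hypothetical.
t = {"A": "a1", "B": "b1"}   # attribute name -> attribute value
print(t["B"])                # apply the "function" to the name B -> "b1"
```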

An uncertain database is a kind of database studied in database theory. The goal of uncertain databases is to manage information about which there is some uncertainty. Uncertain databases make it possible to explicitly represent and manage uncertainty in the data, usually in a succinct way.

In database theory, the Imieliński–Lipski algebra is an extension of relational algebra to tables with different types of null values. It is used to operate on relations with incomplete information.

References

  1. Vinod Muthusamy, Haifeng Liu, Hans-Arno Jacobsen: Predictive Publish/Subscribe Matching. University of Toronto.
  2. Nilesh N. Dalvi, Dan Suciu: Efficient query evaluation on probabilistic databases. VLDB J. 16(4): 523–544 (2007)
  3. Lyublena Antova, Christoph Koch, Dan Olteanu: 10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information. ICDE 2007: 606–615
  4. Roger Cavallo, Michael Pittarelli: The Theory of Probabilistic Databases. In VLDB'87, Proceedings of 13th International Conference on Very Large Data Bases, September 1–4, 1987, Brighton: 71–81 (1987)