Candidate key

A candidate key, or simply a key, of a relational database is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column could produce duplicate combinations of values.

A candidate key is a minimal superkey,[1] i.e., a superkey that does not contain a smaller one. Therefore, a relation can have multiple candidate keys, each with a different number of attributes.[2]

Specific candidate keys are sometimes called primary keys, secondary keys or alternate keys. The columns in a candidate key are called prime attributes,[3] and a column that does not occur in any candidate key is called a non-prime attribute.

Every relation without NULL values will have at least one candidate key: since there cannot be duplicate rows, the set of all columns is a superkey, and if that is not minimal, some subset of it will be.

There is a functional dependency from the candidate key to all the attributes in the relation.

The superkeys of a relation are all the possible ways we can identify a row. The candidate keys are the minimal ones among these superkeys, and as such they are an important concept for the design of a database schema.

Example

The definition of candidate keys can be illustrated with the following (abstract) example. Consider a relation variable (relvar) R with attributes (A, B, C, D) that has only the following two legal values r1 and r2:

r1

A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a2   b1   c2   d1

r2

A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a1   b1   c2   d2

Here r2 differs from r1 only in the A and D values of the last tuple.

For r1 the following sets have the uniqueness property, i.e., there are no two distinct tuples in the instance with the same attribute values in the set:

{A,B}, {A,C}, {B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

For r2 the uniqueness property holds for the following sets:

{B,C}, {B,D}, {C,D}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Since the superkeys of a relvar are those sets of attributes that have the uniqueness property for all legal values of that relvar, and since we assume that r1 and r2 are all the legal values that R can take, we can determine the set of superkeys of R by taking the intersection of the two lists:

{B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Finally we need to select those sets for which there is no proper subset in the list, which are in this case:

{B,C}, {A,B,D}, {A,C,D}

These are indeed the candidate keys of relvar R.
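
This computation can be carried out mechanically. The following Python sketch is a minimal illustration; the helper name unique_sets and the tuple encoding of the instances are our own choices, not taken from the cited sources. It enumerates the attribute sets with the uniqueness property in each instance, intersects the two collections, and keeps the minimal sets:

from itertools import combinations

# The two legal values r1 and r2 of relvar R from the example above.
r1 = [("a1", "b1", "c1", "d1"),
      ("a1", "b2", "c2", "d1"),
      ("a2", "b1", "c2", "d1")]
r2 = [("a1", "b1", "c1", "d1"),
      ("a1", "b2", "c2", "d1"),
      ("a1", "b1", "c2", "d2")]
attributes = "ABCD"

def unique_sets(rows):
    """Return every attribute set with the uniqueness property in rows."""
    result = set()
    for k in range(1, len(attributes) + 1):
        for cols in combinations(range(len(attributes)), k):
            projection = {tuple(row[c] for c in cols) for row in rows}
            if len(projection) == len(rows):   # no two rows agree on these columns
                result.add(frozenset(attributes[c] for c in cols))
    return result

# Superkeys must have the uniqueness property in every legal value,
# hence the intersection.
superkeys = unique_sets(r1) & unique_sets(r2)

# Candidate keys are the superkeys that contain no smaller superkey.
candidate_keys = [k for k in superkeys if not any(s < k for s in superkeys)]
print(sorted("".join(sorted(k)) for k in candidate_keys))   # ['ABD', 'ACD', 'BC']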

We have to consider all the relations that might be assigned to a relvar to determine whether a certain set of attributes is a candidate key. For example, if we had considered only r1, then we would have concluded that {A,B} is a candidate key, which is incorrect. However, we might be able to conclude from such a relation that a certain set is not a candidate key, because that set does not have the uniqueness property (for example, {A,D} for r1). Note that the existence of a proper subset of a set that has the uniqueness property cannot in general be used as evidence that the superset is not a candidate key. In particular, note that in the case of an empty relation, every subset of the heading has the uniqueness property, including the empty set.

Determining candidate keys

The set of all candidate keys can be computed, e.g., from the set of functional dependencies. To this end we need to define the attribute closure α⁺ for an attribute set α: the set α⁺ contains all attributes that are functionally implied by α.
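
As a rough illustration, the following Python sketch computes this closure by fixed-point iteration; representing functional dependencies as (lhs, rhs) pairs of frozensets is an assumption of the sketch, not prescribed by the sources:

def closure(alpha, fds):
    """Attribute closure: all attributes functionally implied by alpha.

    fds is a list of (lhs, rhs) pairs of frozensets of attribute names.
    """
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply every FD whose left-hand side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

# With A → B and B → C, the closure of {A} is {A, B, C}.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
assert closure({"A"}, fds) == frozenset("ABC")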

It is quite simple to find a single candidate key. We start with a set K of attributes and successively try to remove each attribute. If after removing an attribute the attribute closure stays the same, then this attribute is not necessary and we can remove it permanently. We call the result minimize(K). If K is the set of all attributes, then minimize(K) is a candidate key.
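
A minimal sketch of this minimization, reusing the closure function from the sketch above (the name minimize mirrors the pseudocode further down):

def minimize(k, fds):
    """Drop attributes whose removal leaves the attribute closure unchanged."""
    k = set(k)
    for a in sorted(k):   # a fixed removal order; other orders may find other keys
        if closure(k - {a}, fds) == closure(k, fds):
            k.discard(a)
    return frozenset(k)

# With attributes {A, B, C} and A → B, B → C, minimizing {A, B, C}
# removes B and C, leaving the candidate key {A}.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
assert minimize(frozenset("ABC"), fds) == frozenset("A")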

Actually we can detect every candidate key with this procedure by simply trying every possible order of removing attributes. However, there are many more permutations of attributes (n!) than subsets (2ⁿ); that is, many attribute orders will lead to the same candidate key.

There is a fundamental difficulty for efficient algorithms for candidate key computation: certain sets of functional dependencies lead to exponentially many candidate keys. Consider the 2n functional dependencies Aᵢ → Bᵢ and Bᵢ → Aᵢ for i = 1, …, n, which yield 2ⁿ candidate keys: the sets containing exactly one of Aᵢ and Bᵢ for each i. For n = 2, for example, the candidate keys are {A₁,A₂}, {A₁,B₂}, {B₁,A₂} and {B₁,B₂}. That is, the best we can expect is an algorithm that is efficient with respect to the number of candidate keys.

The following algorithm actually runs in polynomial time in the number of candidate keys and functional dependencies:[4]

function find_candidate_keys(A, F)
    /* A is the set of all attributes and F is the set of functional dependencies */
    K[0] := minimize(A);
    n := 1;    /* Number of keys known so far */
    i := 0;    /* Currently processed key */
    while i < n do
        for each α → β ∈ F do
            /* Build a new potential key from the previous known key and the current FD */
            S := α ∪ (K[i] − β);
            /* Search whether the new potential key is part of the already known keys */
            found := false;
            for j := 0 to n−1 do
                if K[j] ⊆ S then found := true;
            /* If not, add it */
            if not found then
                K[n] := minimize(S);
                n := n + 1;
        i := i + 1;
    return K

The idea behind the algorithm is that given a candidate key K and a functional dependency α → β, the reverse application of the functional dependency yields the set α ∪ (K − β), which is a key, too. It may however be covered by other already known candidate keys. (The algorithm checks this case using the 'found' variable.) If not, then minimizing the new key yields a new candidate key. The key insight is that all candidate keys can be created this way.
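
For illustration, here is a Python transcription of this algorithm, reusing the closure and minimize sketches from above; the list-of-frozensets representation is our own choice, not part of the cited paper:

def find_candidate_keys(attributes, fds):
    """Enumerate all candidate keys; uses closure() and minimize() from above."""
    keys = [minimize(attributes, fds)]
    i = 0
    while i < len(keys):
        for lhs, rhs in fds:
            # Reverse application of the FD to the currently processed key.
            s = lhs | (keys[i] - rhs)
            # Only minimize s if it is not covered by an already known key.
            if not any(k <= s for k in keys):
                keys.append(minimize(s, fds))
        i += 1
    return keys

Running it on the exponential family from the previous paragraph confirms the count:

# A_i → B_i and B_i → A_i for i = 1..n yield 2**n candidate keys.
n = 3
fds = []
for i in range(1, n + 1):
    fds.append((frozenset({f"A{i}"}), frozenset({f"B{i}"})))
    fds.append((frozenset({f"B{i}"}), frozenset({f"A{i}"})))
attrs = frozenset().union(*(lhs | rhs for lhs, rhs in fds))
print(len(find_candidate_keys(attrs, fds)))   # prints 8, i.e. 2**3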

References

  1. Date, Christopher (2015). "Codd's First Relational Papers: A Critical Analysis" (PDF). warwick.ac.uk. Retrieved 2020-01-04. Note that the extract allows a "relation" to have any number of primary keys, and moreover that such keys are allowed to be "redundant" (better: reducible). In other words, what the paper calls a primary key is what later (and better) became known as a superkey, and what the paper calls a nonredundant (better: irreducible) primary key is what later became known as a candidate key or (better) just a key.
  2. "database - Can a relation have Candidate Keys with different lengths?". Stack Overflow. Retrieved 2023-03-23.
  3. Saiedian, H. (1996-02-01). "An Efficient Algorithm to Compute the Candidate Keys of a Relational Database Schema". The Computer Journal. 39 (2): 124–132. doi:10.1093/comjnl/39.2.124. ISSN 0010-4620.
  4. Lucchesi, Cláudio L.; Osborn, Sylvia L. (October 1978). "Candidate keys for relations". Journal of Computer and System Sciences. 17 (2): 270–279. doi:10.1016/0022-0000(78)90009-0.