Candidate key

A candidate key, or simply a key, of a relational database is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column could produce duplicate combinations of values.

A candidate key is a minimal superkey,[1] i.e., a superkey that does not contain a smaller one. A relation can therefore have multiple candidate keys, and these may have different numbers of attributes.[2]

Specific candidate keys are sometimes called primary keys, secondary keys or alternate keys. The columns in a candidate key are called prime attributes,[3] and a column that does not occur in any candidate key is called a non-prime attribute.

Every relation without NULL values will have at least one candidate key: since there cannot be duplicate rows, the set of all columns is a superkey, and if that set is not minimal, some subset of it will be.

There is a functional dependency from the candidate key to all the attributes in the relation.

The superkeys of a relation are all the possible ways we can identify a row. The candidate keys are the minimal superkeys among them, and as such they are an important concept for the design of a database schema.

Example

The definition of candidate keys can be illustrated with the following (abstract) example. Consider a relation variable (relvar) R with attributes (A, B, C, D) that has only the following two legal values r1 and r2:

r1
A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a2   b1   c2   d1

r2
A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a1   b1   c2   d2

Here r2 differs from r1 only in the A and D values of the last tuple.

For r1 the following sets have the uniqueness property, i.e., there are no two distinct tuples in the instance with the same values for the attributes in the set:

{A,B}, {A,C}, {B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

For r2 the uniqueness property holds for the following sets:

{B,C}, {B,D}, {C,D}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Since superkeys of a relvar are those sets of attributes that have the uniqueness property for all legal values of that relvar and because we assume that r1 and r2 are all the legal values that R can take, we can determine the set of superkeys of R by taking the intersection of the two lists:

{B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Finally we need to select those sets for which there is no proper subset in the list, which are in this case:

{B,C}, {A,B,D}, {A,C,D}

These are indeed the candidate keys of relvar R.
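
The steps above are mechanical enough to check by machine. The following Python sketch (the names r1, r2, ATTRS, unique_sets and the tuple encoding of rows are our own modeling choices, not part of the example) enumerates the attribute sets with the uniqueness property for each instance, intersects them to obtain the superkeys, and keeps only the minimal ones; it reproduces {B,C}, {A,B,D} and {A,C,D}:

from itertools import combinations

# The two legal values of relvar R from the example above.
r1 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a2", "b1", "c2", "d1")]
r2 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a1", "b1", "c2", "d2")]
ATTRS = "ABCD"

def unique_sets(rows):
    """All non-empty attribute sets whose value combinations are unique in rows."""
    result = set()
    for k in range(1, len(ATTRS) + 1):
        for cols in combinations(range(len(ATTRS)), k):
            projected = [tuple(row[c] for c in cols) for row in rows]
            if len(set(projected)) == len(rows):
                result.add(frozenset(ATTRS[c] for c in cols))
    return result

# Superkeys: the uniqueness property must hold in every legal value of R.
superkeys = unique_sets(r1) & unique_sets(r2)

# Candidate keys: superkeys with no proper subset that is also a superkey.
candidate_keys = [s for s in superkeys if not any(t < s for t in superkeys)]

print(sorted("".join(sorted(s)) for s in candidate_keys))
# ['ABD', 'ACD', 'BC']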

We have to consider all the relations that might be assigned to a relvar to determine whether a certain set of attributes is a candidate key. For example, if we had considered only r1 then we would have concluded that {A,B} is a candidate key, which is incorrect. However, we might be able to conclude from such a relation that a certain set is not a candidate key, because that set does not have the uniqueness property (for example, {A,D} for r1). Note that the existence of a proper subset of a set that has the uniqueness property cannot in general be used as evidence that the superset is not a candidate key. In particular, note that in the case of an empty relation, every subset of the heading has the uniqueness property, including the empty set.

Determining candidate keys

The set of all candidate keys can be computed, e.g., from the set of functional dependencies. To this end we need to define the attribute closure α⁺ for an attribute set α. The set α⁺ contains all attributes that are functionally implied by α.
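
As a concrete illustration, here is a minimal Python sketch of the attribute closure; representing a functional dependency as a pair (lhs, rhs) of frozensets, and the name closure itself, are our own choices:

def closure(attrs, fds):
    """Return all attributes functionally implied by attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side lies inside the closure so far, the
            # right-hand side is functionally implied as well.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

# Example: under A → B and B → A, the closure of {B} is {A, B}.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A"))]
print(sorted(closure({"B"}, fds)))  # ['A', 'B']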

It is quite simple to find a single candidate key. We start with a set K of attributes and successively try to remove each attribute. If after removing an attribute the attribute closure stays the same, then this attribute is not necessary and we can remove it permanently. We call the result minimize(K). If K is the set of all attributes, then minimize(K) is a candidate key.
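
Continuing the sketch above (and reusing its closure helper), minimize can be written as follows; the iteration order over the attributes is arbitrary, and as noted below, different orders may yield different candidate keys:

def minimize(attrs, fds):
    """Drop every attribute whose removal leaves the attribute closure unchanged."""
    key = set(attrs)
    for a in sorted(attrs):
        if closure(key - {a}, fds) == closure(key, fds):
            key.discard(a)
    return frozenset(key)

# Example: with A → B and B → A, minimize({A, B}) removes one of the two.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A"))]
print(minimize({"A", "B"}, fds))  # frozenset({'B'}) with this iteration order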

Actually we can detect every candidate key with this procedure by simply trying every possible order of removing attributes. However, there are many more permutations of attributes (n!) than subsets (2^n). That is, many attribute orders will lead to the same candidate key.

There is a fundamental difficulty for efficient algorithms for candidate key computation: certain sets of functional dependencies lead to exponentially many candidate keys. Consider the 2n functional dependencies A1 → B1, B1 → A1, …, An → Bn, Bn → An, which yield 2^n candidate keys: each candidate key contains, for every i, exactly one of Ai and Bi. For example, with n = 2 the candidate keys are {A1,A2}, {A1,B2}, {B1,A2} and {B1,B2}. That is, the best we can expect is an algorithm that is efficient with respect to the number of candidate keys.

The following algorithm actually runs in polynomial time in the number of candidate keys and functional dependencies:[4]

function find_candidate_keys(A, F)
    /* A is the set of all attributes and F is the set of functional dependencies */
    K[0] := minimize(A);
    n := 1; /* Number of keys known so far */
    i := 0; /* Currently processed key */
    while i < n do
        for each α → β ∈ F do
            /* Build a new potential key from the previous known key and the current FD */
            S := α ∪ (K[i] − β);
            /* Search whether the new potential key is part of the already known keys */
            found := false;
            for j := 0 to n-1 do
                if K[j] ⊆ S then found := true;
            /* If not, add it */
            if not found then
                K[n] := minimize(S);
                n := n + 1;
        i := i + 1;
    return K

The idea behind the algorithm is that given a candidate key K and a functional dependency α → β, the reverse application of the functional dependency yields the set α ∪ (K − β), which is a key, too. It may however be covered by other already known candidate keys. (The algorithm checks this case using the 'found' variable.) If not, then minimizing the new key yields a new candidate key. The key insight is that all candidate keys can be created this way.
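
Translated into Python on top of the closure and minimize sketches above (the pair-of-frozensets modeling of FDs is again our own choice), the algorithm looks roughly like this; the demo at the end runs the n = 2 instance of the exponential example, written with attributes A, B, C, D:

def find_candidate_keys(all_attrs, fds):
    """Enumerate all candidate keys, polynomial in the number of keys and FDs."""
    keys = [minimize(all_attrs, fds)]        # K[0] := minimize(A)
    i = 0
    while i < len(keys):
        for lhs, rhs in fds:                 # for each α → β ∈ F
            s = lhs | (keys[i] - rhs)        # S := α ∪ (K[i] − β)
            # Only minimize and record S if no already known key covers it.
            if not any(k <= s for k in keys):
                keys.append(minimize(s, fds))
        i += 1
    return keys

# Demo: A → B, B → A and C → D, D → C give 2^2 = 4 candidate keys.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A")),
       (frozenset("C"), frozenset("D")), (frozenset("D"), frozenset("C"))]
for key in find_candidate_keys(frozenset("ABCD"), fds):
    print(sorted(key))  # ['B', 'D'], ['A', 'D'], ['B', 'C'], ['A', 'C']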

References

  1. Date, Christopher (2015). "Codd's First Relational Papers: A Critical Analysis" (PDF). warwick.ac.uk. Retrieved 2020-01-04. Note that the extract allows a "relation" to have any number of primary keys, and moreover that such keys are allowed to be "redundant" (better: reducible). In other words, what the paper calls a primary key is what later (and better) became known as a superkey, and what the paper calls a nonredundant (better: irreducible) primary key is what later became known as a candidate key or (better) just a key.
  2. "Can a relation have Candidate Keys with different lengths?". Stack Overflow. Retrieved 2023-03-23.
  3. Saiedian, H. (1996-02-01). "An Efficient Algorithm to Compute the Candidate Keys of a Relational Database Schema". The Computer Journal. 39 (2): 124–132. doi:10.1093/comjnl/39.2.124. ISSN 0010-4620.
  4. Lucchesi, Cláudio L.; Osborn, Sylvia L. (October 1978). "Candidate keys for relations". Journal of Computer and System Sciences. 17 (2): 270–279. doi:10.1016/0022-0000(78)90009-0.