Candidate key

A candidate key, or simply a key, of a relational database is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column could produce duplicate combinations of values.

A candidate key is a minimal superkey,[1] i.e., a superkey that does not contain a smaller one. A relation can therefore have multiple candidate keys, and these may have different numbers of attributes.[2]

Specific candidate keys are sometimes called primary keys, secondary keys or alternate keys. The columns in a candidate key are called prime attributes,[3] and a column that does not occur in any candidate key is called a non-prime attribute.

Every relation without NULL values will have at least one candidate key: since there cannot be duplicate rows, the set of all columns is a superkey, and if that set is not minimal, some subset of it will be.

There is a functional dependency from the candidate key to all the attributes in the relation.

The superkeys of a relation are all the possible ways we can identify a row. The candidate keys are the minimal superkeys among them, and as such they are an important concept for the design of a database schema.

Example

The definition of candidate keys can be illustrated with the following (abstract) example. Consider a relation variable (relvar) R with attributes (A, B, C, D) that has only the following two legal values r1 and r2:

r1
A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a2   b1   c2   d1

r2
A    B    C    D
a1   b1   c1   d1
a1   b2   c2   d1
a1   b1   c2   d2

Here r2 differs from r1 only in the A and D values of the last tuple.

For r1 the following sets have the uniqueness property, i.e., there are no two distinct tuples in the instance with the same values for the attributes in the set:

{A,B}, {A,C}, {B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

For r2 the uniqueness property holds for the following sets:

{B,C}, {B,D}, {C,D}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Since superkeys of a relvar are those sets of attributes that have the uniqueness property for all legal values of that relvar and because we assume that r1 and r2 are all the legal values that R can take, we can determine the set of superkeys of R by taking the intersection of the two lists:

{B,C}, {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {A,B,C,D}

Finally we need to select those sets for which there is no proper subset in the list, which are in this case:

{B,C}, {A,B,D}, {A,C,D}

These are indeed the candidate keys of relvar R.
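
The steps above are mechanical enough to check by machine. The following Python sketch (the names r1, r2, ATTRS, unique_sets and the tuple encoding of rows are our own modeling choices, not part of the example) enumerates the attribute sets with the uniqueness property for each instance, intersects them to obtain the superkeys, and keeps only the minimal ones; it reproduces {B,C}, {A,B,D} and {A,C,D}:

from itertools import combinations

# The two legal values of relvar R from the example above.
r1 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a2", "b1", "c2", "d1")]
r2 = [("a1", "b1", "c1", "d1"), ("a1", "b2", "c2", "d1"), ("a1", "b1", "c2", "d2")]
ATTRS = "ABCD"

def unique_sets(rows):
    """All non-empty attribute sets whose value combinations are unique in rows."""
    result = set()
    for k in range(1, len(ATTRS) + 1):
        for cols in combinations(range(len(ATTRS)), k):
            projected = [tuple(row[c] for c in cols) for row in rows]
            if len(set(projected)) == len(rows):
                result.add(frozenset(ATTRS[c] for c in cols))
    return result

# Superkeys: the uniqueness property must hold in every legal value of R.
superkeys = unique_sets(r1) & unique_sets(r2)

# Candidate keys: superkeys with no proper subset that is also a superkey.
candidate_keys = [s for s in superkeys if not any(t < s for t in superkeys)]

print(sorted("".join(sorted(s)) for s in candidate_keys))
# ['ABD', 'ACD', 'BC']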

We have to consider all the relations that might be assigned to a relvar to determine whether a certain set of attributes is a candidate key. For example, if we had considered only r1 then we would have concluded that {A,B} is a candidate key, which is incorrect. However, we might be able to conclude from such a relation that a certain set is not a candidate key, because that set does not have the uniqueness property (for example, {A,D} for r1). Note that the existence of a proper subset of a set that has the uniqueness property cannot in general be used as evidence that the superset is not a candidate key. In particular, note that in the case of an empty relation, every subset of the heading has the uniqueness property, including the empty set.

Determining candidate keys

The set of all candidate keys can be computed, e.g., from the set of functional dependencies. To this end we need to define the attribute closure α⁺ for an attribute set α. The set α⁺ contains all attributes that are functionally implied by α.
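
As a concrete illustration, here is a minimal Python sketch of the attribute closure; representing a functional dependency as a pair (lhs, rhs) of frozensets, and the name closure itself, are our own choices:

def closure(attrs, fds):
    """Return all attributes functionally implied by attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side lies inside the closure so far, the
            # right-hand side is functionally implied as well.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

# Example: under A → B and B → A, the closure of {B} is {A, B}.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A"))]
print(sorted(closure({"B"}, fds)))  # ['A', 'B']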

It is quite simple to find a single candidate key. We start with a set K of attributes and successively try to remove each attribute. If after removing an attribute the attribute closure stays the same, then this attribute is not necessary and we can remove it permanently. We call the result minimize(K). If K is the set of all attributes, then minimize(K) is a candidate key.
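
Continuing the sketch above (and reusing its closure helper), minimize can be written as follows; the iteration order over the attributes is arbitrary, and as noted below, different orders may yield different candidate keys:

def minimize(attrs, fds):
    """Drop every attribute whose removal leaves the attribute closure unchanged."""
    key = set(attrs)
    for a in sorted(attrs):
        if closure(key - {a}, fds) == closure(key, fds):
            key.discard(a)
    return frozenset(key)

# Example: with A → B and B → A, minimize({A, B}) removes one of the two.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A"))]
print(minimize({"A", "B"}, fds))  # frozenset({'B'}) with this iteration order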

Actually we can detect every candidate key with this procedure by simply trying every possible order of removing attributes. However, there are many more permutations of attributes (n!) than subsets (2^n). That is, many attribute orders will lead to the same candidate key.

There is a fundamental difficulty for efficient algorithms for candidate key computation: certain sets of functional dependencies lead to exponentially many candidate keys. Consider the 2n functional dependencies A1 → B1, B1 → A1, …, An → Bn, Bn → An, which yield 2^n candidate keys: each candidate key contains, for every i, exactly one of Ai and Bi. For example, with n = 2 the candidate keys are {A1,A2}, {A1,B2}, {B1,A2} and {B1,B2}. That is, the best we can expect is an algorithm that is efficient with respect to the number of candidate keys.

The following algorithm actually runs in polynomial time in the number of candidate keys and functional dependencies:[4]

function find_candidate_keys(A, F)
    /* A is the set of all attributes and F is the set of functional dependencies */
    K[0] := minimize(A);
    n := 1; /* Number of keys known so far */
    i := 0; /* Currently processed key */
    while i < n do
        for each α → β ∈ F do
            /* Build a new potential key from the previous known key and the current FD */
            S := α ∪ (K[i] − β);
            /* Search whether the new potential key is part of the already known keys */
            found := false;
            for j := 0 to n-1 do
                if K[j] ⊆ S then found := true;
            /* If not, add it */
            if not found then
                K[n] := minimize(S);
                n := n + 1;
        i := i + 1;
    return K

The idea behind the algorithm is that given a candidate key K and a functional dependency α → β, the reverse application of the functional dependency yields the set α ∪ (K − β), which is a key, too. It may however be covered by other already known candidate keys. (The algorithm checks this case using the 'found' variable.) If not, then minimizing the new key yields a new candidate key. The key insight is that all candidate keys can be created this way.
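
Translated into Python on top of the closure and minimize sketches above (the pair-of-frozensets modeling of FDs is again our own choice), the algorithm looks roughly like this; the demo at the end runs the n = 2 instance of the exponential example, written with attributes A, B, C, D:

def find_candidate_keys(all_attrs, fds):
    """Enumerate all candidate keys, polynomial in the number of keys and FDs."""
    keys = [minimize(all_attrs, fds)]        # K[0] := minimize(A)
    i = 0
    while i < len(keys):
        for lhs, rhs in fds:                 # for each α → β ∈ F
            s = lhs | (keys[i] - rhs)        # S := α ∪ (K[i] − β)
            # Only minimize and record S if no already known key covers it.
            if not any(k <= s for k in keys):
                keys.append(minimize(s, fds))
        i += 1
    return keys

# Demo: A → B, B → A and C → D, D → C give 2^2 = 4 candidate keys.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("A")),
       (frozenset("C"), frozenset("D")), (frozenset("D"), frozenset("C"))]
for key in find_candidate_keys(frozenset("ABCD"), fds):
    print(sorted(key))  # ['B', 'D'], ['A', 'D'], ['B', 'C'], ['A', 'C']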

References

  1. Date, Christopher (2015). "Codd's First Relational Papers: A Critical Analysis" (PDF). warwick.ac.uk. Retrieved 2020-01-04. Note that the extract allows a "relation" to have any number of primary keys, and moreover that such keys are allowed to be "redundant" (better: reducible). In other words, what the paper calls a primary key is what later (and better) became known as a superkey, and what the paper calls a nonredundant (better: irreducible) primary key is what later became known as a candidate key or (better) just a key.
  2. "Can a relation have Candidate Keys with different lengths?". Stack Overflow. Retrieved 2023-03-23.
  3. Saiedian, H. (1996-02-01). "An Efficient Algorithm to Compute the Candidate Keys of a Relational Database Schema". The Computer Journal. 39 (2): 124–132. doi:10.1093/comjnl/39.2.124. ISSN 0010-4620.
  4. Lucchesi, Cláudio L.; Osborn, Sylvia L. (October 1978). "Candidate keys for relations". Journal of Computer and System Sciences. 17 (2): 270–279. doi:10.1016/0022-0000(78)90009-0.