Third normal form

Last updated

Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management.

Contents

A database relation (e.g. a database table) is said to meet third normal form standards if all the attributes (e.g. database columns) are functionally dependent on solely a key, except the case of functional dependency whose right hand side is a prime attribute (an attribute which is strictly included into some key) . Codd defined this as a relation in second normal form where all non-prime attributes depend only on the candidate keys and do not have a transitive dependency on another key. [1]

A hypothetical example of a failure to meet third normal form would be a hospital database having a table of patients which included a column for the telephone number of their doctor. (The phone number is dependent on the doctor, rather than the patient, thus would be better stored in a table of doctors.) The negative outcome of such a design is that a doctor's number will be duplicated in the database if they have multiple patients, thus increasing both the chance of input error and the cost and risk of updating that number should it change (compared to a third normal form-compliant data model that only stores a doctor's number once on a doctor table).

Codd later realized that 3NF did not eliminate all undesirable data anomalies and developed a stronger version to address this in 1974, known as Boyce–Codd normal form.

Definition of third normal form

The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E. F. Codd in 1971. [2]

Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

A non-prime attribute of R is an attribute that does not belong to any candidate key of R. [3] A transitive dependency is a functional dependency in which XZ (X determines Z) indirectly, by virtue of XY and YZ (where it is not the case that YX). [4]

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if for each of its functional dependencies XY, at least one of the following conditions holds: [5] [6] [ need quotation to verify ]

To rephrase Zaniolo's definition more simply, the relation is in 3NF if and only if for every non-trivial functional dependency X → Y, X is a superkey or Y \ X consists of prime attributes. Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce–Codd normal form (BCNF). BCNF simply eliminates the third alternative ("Every element of Y \ X, the set difference between Y and X, is a prime attribute.").

"Nothing but the key"

An approximation of Codd's definition of 3NF, paralleling the traditional oath to give true evidence in a court of law, was given by Bill Kent: "[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key". [7] A common variation supplements this definition with the oath "so help me Codd". [8]

Requiring existence of "the key" ensures that the table is in 1NF; requiring that non-key attributes be dependent on "the whole key" ensures 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF. While this phrase is a useful mnemonic, the fact that it only mentions a single key means it defines some necessary but not sufficient conditions to satisfy the 2nd and 3rd normal forms. Both 2NF and 3NF are concerned equally with all candidate keys of a table and not just any one key.

Chris Date refers to Kent's summary as "an intuitively attractive characterization" of 3NF and notes that with slight adaptation it may serve as a definition of the slightly stronger Boyce–Codd normal form: "Each attribute must represent a fact about the key, the whole key, and nothing but the key." [9] The 3NF version of the definition is weaker than Date's BCNF variation, as the former is concerned only with ensuring that non-key attributes are dependent on keys. Prime attributes (which are keys or parts of keys) must not be functionally dependent at all; they each represent a fact about the key in the sense of providing part or all of the key itself. (This rule applies only to functionally dependent attributes, as applying it to all attributes would implicitly prohibit composite candidate keys, since each part of any such key would violate the "whole key" clause.)

An example of a table that fails to meet the requirements of 3NF is:

Tournament winners
TournamentYearWinnerWinner's date of birth
Indiana Invitational1998Al Fredrickson21 July 1975
Cleveland Open1999Bob Albertson28 September 1968
Des Moines Masters1999Al Fredrickson21 July 1975
Indiana Invitational1999Chip Masterson14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute (Winner's date of birth) is transitively dependent on the candidate key {Tournament, Year} through the non-prime attribute Winner. The fact that Winner's date of birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament winners
TournamentYearWinner
Indiana Invitational1998Al Fredrickson
Cleveland Open1999Bob Albertson
Des Moines Masters1999Al Fredrickson
Indiana Invitational1999Chip Masterson
Winner's dates of birth
WinnerDate of birth
Chip Masterson14 March 1977
Al Fredrickson21 July 1975
Bob Albertson28 September 1968

Update anomalies cannot occur in these tables, because unlike before, Winner is now a candidate key in the second table, thus allowing only one value for Date of birth for each Winner.

Computation

A relation can always be decomposed in third normal form, that is, the relation R is rewritten to projections R1, ..., Rn whose join is equal to the original relation. Further, this decomposition does not lose any functional dependency, in the sense that every functional dependency on R can be derived from the functional dependencies that hold on the projections R1, ..., Rn. What is more, such a decomposition can be computed in polynomial time. [10]

To decompose a relation into 3NF from 2NF, break the table into the canonical cover functional dependencies, then create a relation for every candidate key of the original relation which was not already a subset of a relation in the decomposition. [11]

Equivalence of the Codd and Zaniolo definitions of 3NF

The definition of 3NF offered by Carlo Zaniolo in 1982, and given above, can be shown to be equivalent to the Codd definition in the following way: Let X → A be a nontrivial FD (i.e. one where X does not contain A) and let A be a non-prime attribute. Also let Y be a candidate key of R. Then Y → X. Therefore, A is not transitively dependent on Y if there is a functional dependency X → Y iff X is a superkey of R.

Normalization beyond 3NF

Most 3NF tables are free of update, insertion, and deletion anomalies. Certain types of 3NF tables, rarely met with in practice, are affected by such anomalies; these are tables which either fall short of Boyce–Codd normal form (BCNF) or, if they meet BCNF, fall short of the higher normal forms 4NF or 5NF.

Considerations for use in reporting environments

While 3NF was ideal for machine processing, the segmented nature of the data model can be difficult to intuitively consume by a human user. Analytics via query, reporting, and dashboards were often facilitated by a different type of data model that provided pre-calculated analysis such as trend lines, period-to-date calculations (month-to-date, quarter-to-date, year-to-date), cumulative calculations, basic statistics (average, standard deviation, moving averages) and previous period comparisons (year ago, month ago, week ago) e.g. dimensional modeling and beyond dimensional modeling, flattening of stars via Hadoop and data science. [12] [13]

See also

Related Research Articles

Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by British computer scientist Edgar F. Codd as part of his relational model.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if no attribute domain has relations as elements. Or more informally, that no table column can have tables as values. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement. SQL-92 does not support creating or using table-valued columns, which means that using only the "traditional relational database features" most relational databases will be in first normal form by necessity. Database systems which do not require first normal form are often called NoSQL systems. Newer SQL standards like SQL:1999 have started to allow so called non-atomic types, which include composite types. Even newer versions like SQL:2016 allow JSON.

Second normal form (2NF) is a normal form used in database normalization. A relation is in the second normal form if it fulfills the following two requirements:

  1. It is in first normal form.
  2. It does not have any non-prime attribute that is functionally dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.

Fourth normal form (4NF) is a normal form used in database normalization. Introduced by Ronald Fagin in 1977, 4NF is the next level of normalization after Boyce–Codd normal form (BCNF). Whereas the second, third, and Boyce–Codd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multivalued dependency. A table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies XY, X is a superkey—that is, X is either a candidate key or a superset thereof.

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, a functional dependency is a constraint between two attributes in a relation. Given a relation R and sets of attributes , X is said to functionally determineY if and only if each X value in R is associated with precisely one Y value in R; R is then said to satisfy the functional dependency XY. Equivalently, the projection is a function, i.e. Y is a function of X. In simple words, if the values for the X attributes are known, then the values for the Y attributes corresponding to x can be determined by looking them up in any tuple of R containing x. Customarily X is called the determinant set and Y the dependent set. A functional dependency FD: XY is called trivial if Y is a subset of X.

A candidate key, or simply a key, of a relational database is any set of columns that have a unique combination of values in each row, with the additional constraint that removing any column could produce duplicate combinations of values.

<span class="mw-page-title-main">Referential integrity</span> Where all data references are valid

Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.

Database design is the organization of data according to a database model. The designer determines what data must be stored and how the data elements interrelate. With this information, they can begin to fit the data to the database model. A database management system manages the data accordingly.

In the relational data model a superkey is a set of attributes that uniquely identifies each tuple of a relation. Because superkey values are unique, tuples with the same superkey value must also have the same non-key attribute values. That is, non-key attributes are functionally dependent on the superkey.

Boyce–Codd normal form is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF). BCNF was developed in 1974 by Raymond F. Boyce and Edgar F. Codd to address certain types of anomalies not dealt with by 3NF as originally defined.

Fifth normal form (5NF), also known as projection–join normal form (PJ/NF), is a level of database normalization designed to remove redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in that table is implied by the candidate keys. It is the final normal form as far as removing redundancy is concerned.

Sixth normal form (6NF) is a term in relational database theory, used in two different ways.

A transitive dependency is an indirect dependency relationship between software components. This kind of dependency is held by virtue of a transitive relation from a component that the software depends on directly.

In database theory, a multivalued dependency is a full constraint between two sets of attributes in a relation.

Domain-key normal form is a normal form used in database normalization which requires that the database contains no constraints other than domain constraints and key constraints.

Elementary key normal form (EKNF) is a subtle enhancement on third normal form, thus EKNF tables are in 3NF by definition. This happens when there is more than one unique compound key and they overlap. Such cases can cause redundant information in the overlapping column(s).

In database normalization, unnormalized form (UNF or 0NF), also known as an unnormalized relation or non-first normal form (N1NF or NF2), is a database data model (organization of data in a database) which does not meet any of the conditions of database normalization defined by the relational model. Database systems which support unnormalized data are sometimes called non-relational or NoSQL databases. In the relational model, unnormalized relations can be considered the starting point for a process of normalization.

References

  1. Codd, E. F. "Further Normalization of the Data Base Relational Model", p. 34.
  2. Codd, E. F. "Further Normalization of the Data Base Relational Model". (Presented at Courant Computer Science Symposia Series 6, "Data Base Systems", New York City, May 24–25, 1971.) IBM Research Report RJ909 (August 31, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.
  3. Codd, p. 43.
  4. Codd, p. 45–46.
  5. Zaniolo, Carlo. "A New Normal Form for the Design of Relational Database Schemata". ACM Transactions on Database Systems 7(3), September 1982.
  6. Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts (5th edition), p. 276–277.
  7. Kent, William. "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26 (2), Feb. 1983, pp. 120–125.
  8. The author of a 1989 book on database management credits one of his students with coming up with the "so help me Codd" addendum. Diehr, George. Database Management (Scott, Foresman, 1989), p. 331.
  9. Date, C. J. An Introduction to Database Systems (7th ed.) (Addison Wesley, 2000), p. 379.
  10. Serge Abiteboul, Richard B. Hull, Victor Vianu: Foundations of Databases. Addison-Wesley, 1995. http://webdam.inria.fr/Alice/ ISBN   0201537710. Theorem 11.2.14.
  11. Hammo, Bassam. "Decomposition, 3NF, BCNF" (PDF). Archived (PDF) from the original on 2023-03-15.
  12. "Comparisons between Data Warehouse modelling techniques – Roelant Vos". Roelant Vos. 12 February 2013. Retrieved 5 March 2018.
  13. "Hadoop Data Modeling Lessons | EMC". InFocus Blog | Dell EMC Services. 23 September 2014. Retrieved 5 March 2018.

Further reading