First normal form

Last updated

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if no attribute domain has relations as elements. [1] Or more informally, that no table column can have tables as values. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement. SQL-92 does not support creating or using table-valued columns, which means that using only the "traditional relational database features" (excluding extensions even if they were later standardized) most relational databases will be in first normal form by necessity. Database systems which do not require first normal form are often called NoSQL systems. Newer SQL standards like SQL:1999 have started to allow so called non-atomic types, which include composite types. Even newer versions like SQL:2016 allow JSON.

Contents

Overview

In a hierarchical database, a record can contain sets of child records ― known as repeating groups or table-valued attributes. If such a data model is represented as relations, a repeating group would be an attribute where the value is itself a relation. First normal form eliminates nested relations by turning them into separate "top-level" relations associated with the parent row through foreign keys rather than through direct containment.

The purpose of this normalization is to increase flexibility and data independence, and to simplify the data language. It also opens the door to further normalization, which eliminates redundancy and anomalies.

Most relational database management systems do not support nested records, so tables are in first normal form by default. In particular, SQL does not have any facilities for creating or exploiting nested tables. Normalization to first normal form would therefore be a necessary step when moving data from a hierarchical database to a relational database.

Rationale

The rationale for normalizing to 1NF: [2]

Drawbacks and criticism

History

First normal form was introduced in 1970 by Edgar F. Codd in the paper A Relational Model of Data for Large Shared Data Banks, although it was initially just called "Normal Form". It was renamed to "First Normal Form" when additional normal forms were introduced in the paper Further Normalization of the Relational Model in 1971. [3]

Examples

The following scenarios first illustrate how a database design might violate first normal form, followed by examples that comply.

Designs that violate 1NF

This table over customers' credit card transactions does not conform to first normal form:

CustomerCustomer IDTransactions
Abraham1
Transaction IDDateAmount
1289014-Oct-200387
1290415-Oct-200350
Isaac2
Transaction IDDateAmount
1289814-Oct-200321
Jacob3
Transaction IDDateAmount
1290715-Oct-200318
1492020-Nov-200370
1500327-Nov-200360

To each customer corresponds a 'repeating group' of transactions. Such a design can be represented in a hierarchical database but not a SQL database, since SQL does not support nested tables.

The automated evaluation of any query relating to customers' transactions would broadly involve two stages:

  1. Unpacking one or more customers' groups of transactions allowing the individual transactions in a group to be examined, and
  2. Deriving a query result based on the results of the first stage

For example, in order to find out the monetary sum of all transactions that occurred in October 2003 for all customers, the system would have to know that it must first unpack the Transactions group of each customer, then sum the Amounts of all transactions thus obtained where the Date of the transaction falls in October 2003.

One of Codd's important insights was that structural complexity can be reduced. Reduced structural complexity gives users, applications, and DBMSs more power and flexibility to formulate and evaluate the queries. A more normalized equivalent of the structure above might look like this:

Designs that comply with 1NF

To bring the model into the first normal form, we can perform normalization. Normalization (to first normal form) is a process where attributes with non-simple domains are extracted to separate stand-alone relations. The extracted relations are amended with foreign keys referring to the primary key of the relation which contained it. The process can be applied recursively to non-simple domains nested in multiple levels. [4]

In this example, Customer ID is the primary key of the containing relations and will therefore be appended as foreign key to the new relation:

CustomerCustomer ID
Abraham1
Isaac2
Jacob3
Customer IDTransaction IDDateAmount
11289014-Oct-200387
11290415-Oct-200350
21289814-Oct-200321
31290715-Oct-200318
31492020-Nov-200370
31500327-Nov-200360

In the modified structure, the primary key is {Customer ID} in the first relation, {Customer ID, Transaction ID} in the second relation.

Now each row represents an individual credit card transaction, and the DBMS can obtain the answer of interest, simply by finding all rows with a Date falling in October, and summing their Amounts. The data structure places all of the values on an equal footing, exposing each to the DBMS directly, so each can potentially participate directly in queries; whereas in the previous situation some values were embedded in lower-level structures that had to be handled specially. Accordingly, the normalized design lends itself to general-purpose query processing, whereas the unnormalized design does not.

It is worth noting that this design meets the additional requirements for second and third normal form.

Atomicity

Edgar F. Codd's definition of 1NF makes reference to the concept of 'atomicity'. Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." [5] Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)" [6] meaning a column should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same column.

Hugh Darwen and Chris Date have suggested that Codd's concept of an "atomic value" is ambiguous, and that this ambiguity has led to widespread confusion about how 1NF should be understood. [7] [8] In particular, the notion of a "value that cannot be decomposed" is problematic, as it would seem to imply that few, if any, data types are atomic:

Date suggests that "the notion of atomicity has no absolute meaning": [9] [10] a value may be considered atomic for some purposes, but may be considered an assemblage of more basic elements for other purposes. If this position is accepted, 1NF cannot be defined with reference to atomicity. Columns of any conceivable data type (from string types and numeric types to array types and table types) are then acceptable in a 1NF table—although perhaps not always desirable; for example, it may be more desirable to separate a Customer Name column into two separate columns as First Name, Surname.

1NF tables as representations of relations

According to Date's definition, a table is in first normal form if and only if it is "isomorphic to some relation", which means, specifically, that it satisfies the following five conditions: [11]

  1. There's no top-to-bottom ordering to the rows.
  2. There's no left-to-right ordering to the columns.
  3. There are no duplicate rows.
  4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
  5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].

Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in first normal form.

Examples of tables (or views) that would not meet this definition of first normal form are:

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by British computer scientist Edgar F. Codd as part of his relational model.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

<span class="mw-page-title-main">Object–relational database</span> Database management system

An object–relational database (ORD), or object–relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with pure relational systems, it supports extension of the data model with custom data types and methods.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

Second normal form (2NF) is a normal form used in database normalization. 2NF was originally defined by E. F. Codd in 1971.

Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management.

A hierarchical database model is a data model in which the data are organized into a tree-like structure. The data are stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value. The type of a record defines which fields the record contains.

A foreign key is a set of attributes in a table that refers to the primary key of another table, linking these two tables. In the context of relational databases, a foreign key is subject to a inclusion dependency constraint that the tuples consisting of the foreign key attributes in one relation, R, must also exist in some other relation, S; furthermore that those attributes must also be a candidate key in S.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

Object–relational impedance mismatch creates difficulties going from data in relational data stores to usage in domain-driven object models. Object-orientation (OO) is the default method for business-centric design in programming languages. The problem lies in neither relational nor OO, but in the conceptual difficulty mapping between the two logic models. Both are logical models implementable differently on database servers, programming languages, design patterns, or other technologies. Issues range from application to enterprise scale, whenever stored relational data is used in domain-driven object models, and vice versa. Object-oriented data stores can trade this problem for other implementation difficulties.

Boyce–Codd normal form is a normal form used in database normalization. It is a slightly stronger version of the third normal form (3NF). BCNF was developed in 1974 by Raymond F. Boyce and Edgar F. Codd to address certain types of anomalies not dealt with by 3NF as originally defined.

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design. The use-case targets applications which offer a large or rich system of defined property types, which are in turn appropriate to a wide set of entities, but where typically only a small, specific selection of these are instantiated for a given entity. Therefore, this type of data model relates to the mathematical notion of a sparse matrix. EAV is also known as object–attribute–value model, vertical database model, and open schema.

Relational Model/Tasmania (RM/T) was published by Edgar F. Codd in 1979 and is the name given to a number of extensions to his original relational model (RM) published in 1970. The overall goal of the RM/T was to define some fundamental semantic units, at "atomic" and "molecular" levels, for data modelling. Codd writes: "the result is a model with a richer variety of objects than the original relational model, additional insert-update-delete rules and some additional operators that make the algebra more powerful."

<span class="mw-page-title-main">Database model</span> Type of data model

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

<span class="mw-page-title-main">Relation (database)</span> Set of tuples consisting of values indexed by attributes

In database theory, a relation, as originally defined by E. F. Codd, is a set of tuples (d1, d2, ..., dn), where each element dj is a member of Dj, a data domain. Codd's original definition notwithstanding, and contrary to the usual definition in mathematics, there is no ordering to the elements of the tuples of a relation. Instead, each element is termed an attribute value. An attribute is a name paired with a domain. An attribute value is an attribute name paired with an element of that attribute's domain, and a tuple is a set of attribute values in which no two distinct elements have the same name. Thus, in some accounts, a tuple is described as a function, mapping names to values.

QUEL is a relational database query language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the Ingres DBMS effort at University of California, Berkeley, based on Codd's earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably in an implementation called POSTQUEL supported by POSTGRES. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead. QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years.

The following is provided as an overview of and topical guide to databases:

In database normalization, unnormalized form (UNF or 0NF), also known as an unnormalized relation or non-first normal form (N1NF or NF2), is a database data model (organization of data in a database) which does not meet any of the conditions of database normalization defined by the relational model. Database systems which support unnormalized data are sometimes called non-relational or NoSQL databases. In the relational model, unnormalized relations can be considered the starting point for a process of normalization.

References

  1. Codd, E.F (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM. Classics. 13 (6): 377–87. p. 380-381
  2. Codd, E.F (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM. Classics. 13 (6): 377–87.
  3. Codd, E. F. (1971). Further Normalization of the Relational Model. Courant Computer Science Symposium 6 in Data Base Systems edited by Rustin, R.
  4. Codd, E.F (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM. Classics. 13 (6): 377–87. p. 381
  5. Codd, E. F. The Relational Model for Database Management Version 2 (Addison-Wesley, 1990).
  6. Codd, E. F. The Relational Model for Database Management Version 2 (Addison-Wesley, 1990), p. 6.
  7. Darwen, Hugh. "Relation-Valued Attributes; or, Will the Real First Normal Form Please Stand Up?", in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991 (Addison-Wesley, 1992).
  8. Date, C. J. (2007). What First Normal Form Really Means. Apress. p. 108. ISBN   978-1-4842-2029-0. '[F]or many years,' writes Date, 'I was as confused as anyone else. What's worse, I did my best (worst?) to spread that confusion through my writings, seminars, and other presentations.'{{cite book}}: |work= ignored (help)
  9. Date, C. J. (2007). What First Normal Form Really Means. Apress. p. 112. ISBN   978-1-4842-2029-0.{{cite book}}: |work= ignored (help)
  10. Date, C. J. (6 November 2015). SQL and Relational Theory: How to Write Accurate SQL Code. O'Reilly Media. pp. 50–. ISBN   978-1-4919-4115-7 . Retrieved 31 October 2018.
  11. Date, C. J. (2007). What First Normal Form Really Means. Apress. pp. 127–128. ISBN   978-1-4842-2029-0.{{cite book}}: |work= ignored (help)
  12. Date, C. J. (2009). "Appendix A.2". SQL and Relational Theory. O'Reilly. Codd first defined the relational model in 1969 and didn't introduce nulls until 1979
  13. Date, C. J. (October 14, 1985). "Is Your DBMS Really Relational?". Computerworld. Null values ... [must be] supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type. (the third of Codd's 12 rules)
  14. Date, C. J. (2007). What First Normal Form Really Means. Apress. pp. 121–126. ISBN   978-1-4842-2029-0.{{cite book}}: |work= ignored (help)

Further reading