Surrogate key

Last updated

A surrogate key (or synthetic key, pseudokey, entity identifier, factless key, or technical key[ citation needed ]) in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key. [1]

Contents

Definition

There are at least two definitions of a surrogate:

Surrogate (1) Hall, Owlett and Todd (1976)
A surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application. [2]
Surrogate (2) Wieringa and De Jonge (1991)
A surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.

The Surrogate (1) definition relates to a data model rather than a storage model and is used throughout this article. See Date (1998).

An important distinction between a surrogate and a primary key depends on whether the database is a current database or a temporal database. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.

Although Hall et al. (1976) say nothing about this, others[ specify ] have argued that a surrogate should have the following characteristics:

Surrogates in practice

In a current database, the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated UUID (for example, an HR number for each employee other than the UUID of each employee).

A surrogate key is frequently a sequential number (e.g. a Sybase or SQL Server "identity column", a PostgreSQL or Informix serial, an Oracle or SQL Server SEQUENCE or a column defined with AUTO_INCREMENT in MySQL). Some databases provide UUID/GUID as a possible data type for surrogate keys (e.g. PostgreSQL UUID or SQL Server UNIQUEIDENTIFIER).

Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile) and guarantees uniqueness.

In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key and a surrogate key. The surrogate key identifies one unique row in the database, the business key identifies one unique entity of the modeled world. One table row represents a slice of time holding all the entity's attributes for a defined timespan. Those slices depict the whole lifespan of one business entity. For example, a table EmployeeContracts may hold temporal information to keep track of contracted working hours. The business key for one contract will be identical (non-unique) in both rows however the surrogate key for each row is unique.

SurrogateKeyBusinessKeyEmployeeNameWorkingHoursPerWeekRowValidFromRowValidTo
1BOS0120John Smith402000-01-012000-12-31
56P0000123Bob Brown251999-01-012011-12-31
234BOS0120John Smith352001-01-012009-12-31

Some database designers use surrogate keys systematically regardless of the suitability of other candidate keys, while others will use a key already present in the data, if there is one.

Some of the alternate names ("system-generated key") describe the way of generating new surrogate values rather than the nature of the surrogate concept.

Approaches to generating surrogates include:

Advantages

Stability

Surrogate keys typically do not change while the row exists. This has the following advantages:

Requirement changes

Attributes that uniquely identify an entity might change, which might invalidate the suitability of natural keys. Consider the following example:

An employee's network user name is chosen as a natural key. Upon merging with another company, new employees must be inserted. Some of the new network user names create conflicts because their user names were generated independently (when the companies were separate).

In these cases, generally a new attribute must be added to the natural key (for example, an original_company column). With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.

Some problem domains do not clearly identify a suitable natural key. Surrogate keys avoid choosing a natural key that might be incorrect.

Performance

Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Furthermore, a non-redundant distribution of keys causes the resulting b-tree index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than compound key.

Compatibility

While using several database application development systems, drivers, and object–relational mapping systems, such as Ruby on Rails or Hibernate, it is much easier to use an integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.

Uniformity

When every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.

Validation

It is possible to design key-values that follow a well-known pattern or structure which can be automatically verified. For instance, the keys that are intended to be used in some column of some table might be designed to "look differently from" those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of Database normalization.

Disadvantages

Disassociation

The values of generated surrogate keys have no relationship to the real-world meaning of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. If appropriate database constraints have not been set, or data imported from a legacy system where referential integrity was not employed, it is possible to have a foreign-key value that does not correspond to a primary-key value and is therefore invalid. (In this regard, C.J. Date regards the meaninglessness of surrogate keys as an advantage. [6] )

To discover such errors, one must perform a query that uses a left outer join between the table with the foreign key and the table with the primary key, showing both key fields in addition to any fields required to distinguish the record; all invalid foreign-key values will have the primary-key column as NULL. The need to perform such a check is so common that Microsoft Access actually provides a "Find Unmatched Query" wizard that generates the appropriate SQL after walking the user through a dialog. (It is, however, not too difficult to compose such queries manually.) "Find Unmatched" queries are typically employed as part of a data cleansing process when inheriting legacy data.

Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by NOT exporting surrogate keys, except as transient data (most obviously, in executing applications that have a "live" connection to the database).

When surrogate keys supplant natural keys, then domain specific referential integrity will be compromised. For example, in a customer master table, the same customer may have multiple records under separate customer IDs, even though the natural key (a combination of customer name, date of birth, and e-mail address) would be unique. To prevent compromise, the natural key of the table must NOT be supplanted: it must be preserved as a unique constraint, which is implemented as a unique index on the combination of natural-key fields.

Query optimization

Relational databases assume a unique index is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes—the natural key—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a candidate key, the index can be a unique index.

These additional indexes, however, will take up disk space and slow down inserts and deletes.

Normalization

Surrogate keys can result in duplicate values in any natural keys. To prevent duplication, one must preserve the role of the natural keys as unique constraints when defining the table using either SQL's CREATE TABLE statement or ALTER TABLE ... ADD CONSTRAINT statement, if the constraints are added as an afterthought.

Business process modeling

Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.

Inadvertent disclosure

Proprietary information can be leaked if surrogate keys are generated sequentially. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. For example see German tank problem.

There are a few ways to overcome this problem:

Inadvertent assumptions

Sequentially generated surrogate keys can imply that events with a higher key value occurred after events with a lower value. This is not necessarily true, because such values do not guarantee time sequence as it is possible for inserts to fail and leave gaps which may be filled at a later time. If chronology is important then date and time must be separately recorded.

See also

Related Research Articles

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted, transformed and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on reoccurring schedules either as single jobs or aggregated into a batch of jobs.

In the relational model of databases, a primary key is a specific choice of a minimal set of attributes (columns) that uniquely specify a tuple (row) in a relation (table). Informally, a primary key is "which attributes identify a record," and in simple cases constitute a single attribute: a unique ID. More formally, a primary key is a choice of candidate key ; any other candidate key is an alternate key.

<span class="mw-page-title-main">Referential integrity</span> Where all data references are valid

Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.

In the context of SQL, data definition or data description language (DDL) is a syntax for creating and modifying database objects such as tables, indices, and users. DDL statements are similar to a computer programming language for defining data structures, especially database schemas. Common examples of DDL statements include CREATE, ALTER, and DROP.

An SQL INSERT statement adds one or more records to any single table in a relational database.

A query plan is a sequence of steps used to access data in a SQL relational database management system. This is a specific case of the relational model concept of access plans.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

<span class="mw-page-title-main">Dimension (data warehouse)</span> Structure that categorizes facts and measures in a data warehouse

A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.

A slowly changing dimension (SCD) in data management and data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule. Some examples of typical slowly changing dimensions are entities such as names of geographical locations, customers, or products.

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design. The use-case targets applications which offer a large or rich system of defined property types, which are in turn appropriate to a wide set of entities, but where typically only a small, specific selection of these are instantiated for a given entity. Therefore, this type of data model relates to the mathematical notion of a sparse matrix.

In relational database management systems, a unique key is a candidate key. All the candidate keys of a relation can uniquely identify the records of the relation, but only one of them is used as the primary key of the relation. The remaining candidate keys are called unique keys because they can uniquely identify a record in a relation. Unique keys can consist of multiple columns. Unique keys are also called alternate keys. Unique keys are an alternative to the primary key of the relation. In SQL, the unique keys have a UNIQUE constraint assigned to them in order to prevent duplicates. Alternate keys may be used like the primary key when doing a single-table select or when filtering in a where clause, but are not typically used to join multiple tables.

An identity column is a column in a database table that is made up of values generated by the database. This is much like an AutoNumber field in Microsoft Access or a sequence in Oracle. Because the concept is so important in database science, many RDBMS systems implement some type of generated key, although each has its own terminology. Today a popular technique for generating identity is to generate a random UUID.

<span class="mw-page-title-main">Database model</span> Type of data model

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

Microsoft SQL Server is a proprietary relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

The following is provided as an overview of and topical guide to databases:

A Block Range Index or BRIN is a database indexing technique. They are intended to improve performance with extremely large tables.

Hi/Lo is an algorithm and a key generation strategy used for generating unique keys for use in a database as a primary key. It uses a sequence-based hi-lo pattern to generate values. Hi/Lo is used in scenarios where an application needs its entities to have an identity prior to persistence. It is a value generation strategy. An alternative to Hi/Lo would be to generate keys as universally unique identifiers (UUID).

References

Citations

  1. "What is a Surrogate Key? - Definition from Techopedia". Techopedia.com. Retrieved 2020-02-21.
  2. P A V Hall, J Owlett, S J P Todd, "Relations and Entities", Modelling in Data Base Management Systems (ed GM Nijssen), North Holland 1976.
  3. "Database SQL Language Reference".
  4. "CREATE SEQUENCE (Transact-SQL) - SQL Server".
  5. "SQLite Autoincrement". SQLite. 2017-02-02. Retrieved 2022-12-02.
  6. C.J. Date. The primacy of primary keys. From "Relational Database Writings, 1991-1994. Addison-Wesley, Reading, MA.

Sources