Data integrity

Last updated

Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle. [1] It is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context even under the same general umbrella of computing. It is at times used as a proxy term for data quality, [2] while data validation is a prerequisite for data integrity. [3]

Contents

Definition

Data integrity is the opposite of data corruption. [4] The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities). Moreover, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.

Any unintended changes to data as the result of a storage, retrieval or processing operation, including malicious intent, unexpected hardware failure, and human error, is failure of data integrity. If the changes are the result of unauthorized access, it may also be a failure of data security. Depending on the data involved this could manifest itself as benign as a single pixel in an image appearing a different color than was originally recorded, to the loss of vacation pictures or a business-critical database, to even catastrophic loss of human life in a life-critical system.

Integrity types

Physical integrity

Physical integrity deals with challenges which are associated with correctly storing and fetching the data itself. Challenges with physical integrity may include electromechanical faults, design flaws, material fatigue, corrosion, power outages, natural disasters, and other special environmental hazards such as ionizing radiation, extreme temperatures, pressures and g-forces. Ensuring physical integrity includes methods such as redundant hardware, an uninterruptible power supply, certain types of RAID arrays, radiation hardened chips, error-correcting memory, use of a clustered file system, using file systems that employ block level checksums such as ZFS, storage arrays that compute parity calculations such as exclusive or or use a cryptographic hash function and even having a watchdog timer on critical subsystems.

Physical integrity often makes extensive use of error detecting algorithms known as error-correcting codes. Human-induced data integrity errors are often detected through the use of simpler checks and algorithms, such as the Damm algorithm or Luhn algorithm. These are used to maintain data integrity after manual transcription from one computer system to another by a human intermediary (e.g. credit card or bank routing numbers). Computer-induced transcription errors can be detected through hash functions.

In production systems, these techniques are used together to ensure various degrees of data integrity. For example, a computer file system may be configured on a fault-tolerant RAID array, but might not provide block-level checksums to detect and prevent silent data corruption. As another example, a database management system might be compliant with the ACID properties, but the RAID controller or hard disk drive's internal write cache might not be.

Logical integrity

This type of integrity is concerned with the correctness or rationality of a piece of data, given a particular context. This includes topics such as referential integrity and entity integrity in a relational database or correctly ignoring impossible sensor data in robotic systems. These concerns involve ensuring that the data "makes sense" given its environment. Challenges include software bugs, design flaws, and human errors. Common methods of ensuring logical integrity include things such as check constraints, foreign key constraints, program assertions, and other run-time sanity checks.

Physical and logical integrity often share many challenges such as human errors and design flaws, and both must appropriately deal with concurrent requests to record and retrieve data, the latter of which is entirely a subject on its own.

If a data sector only has a logical error, it can be reused by overwriting it with new data. In case of a physical error, the affected data sector is permanently unusable.

Databases

Data integrity contains guidelines for data retention, specifying or guaranteeing the length of time data can be retained in a particular database (typically a relational database). To achieve data integrity, these rules are consistently and routinely applied to all data entering the system, and any relaxation of enforcement could cause errors in the data. Implementing checks on the data as close as possible to the source of input (such as human data entry), causes less erroneous data to enter the system. Strict enforcement of data integrity rules results in lower error rates, and time saved troubleshooting and tracing erroneous data and the errors it causes to algorithms.

Data integrity also includes rules defining the relations a piece of data can have to other pieces of data, such as a Customer record being allowed to link to purchased Products, but not to unrelated data such as Corporate Assets. Data integrity often includes checks and correction for invalid data, based on a fixed schema or a predefined set of rules. An example being textual data entered where a date-time value is required. Rules for data derivation are also applicable, specifying how a data value is derived based on algorithm, contributors and conditions. It also specifies the conditions on how the data value could be re-derived.

Types of integrity constraints

Data integrity is normally enforced in a database system by a series of integrity constraints or rules. Three types of integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain integrity.

If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the consistency model for the data storage and retrieval. If a database does not support these features, it is the responsibility of the applications to ensure data integrity while the database supports the consistency model for the data storage and retrieval.

Having a single, well-controlled, and well-defined data-integrity system increases:

Modern databases support these features (see Comparison of relational database management systems), and it has become the de facto responsibility of the database to ensure data integrity. Companies, and indeed many database systems, offer products and services to migrate legacy systems to modern databases.

Examples

An example of a data-integrity mechanism is the parent-and-child relationship of related records. If a parent record owns one or more related child records all of the referential integrity processes are handled by the database itself, which automatically ensures the accuracy and integrity of the data so that no child record can exist without a parent (also called being orphaned) and that no parent loses their child records. It also ensures that no parent record can be deleted while the parent record owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.

File systems

Various research results show that neither widespread filesystems (including UFS, Ext, XFS, JFS and NTFS) nor hardware RAID solutions provide sufficient protection against data integrity problems. [5] [6] [7] [8] [9]

Some filesystems (including Btrfs and ZFS) provide internal data and metadata checksumming that is used for detecting silent data corruption and improving data integrity. If a corruption is detected that way and internal RAID mechanisms provided by those filesystems are also used, such filesystems can additionally reconstruct corrupted data in a transparent way. [10] This approach allows improved data integrity protection covering the entire data paths, which is usually known as end-to-end data protection. [11]

Data integrity as applied to various industries

See also

Related Research Articles

<span class="mw-page-title-main">Checksum</span> Data used to detect errors in other data

A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.

<span class="mw-page-title-main">Error detection and correction</span> Techniques that enable reliable delivery of digital data over unreliable communication channels

In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

A database transaction symbolizes a unit of work, performed within a database management system against a database, that is treated in a coherent and reliable way independent of other transactions. A transaction generally represents any change in a database. Transactions in a database environment have two main purposes:

  1. To provide reliable units of work that allow correct recovery from failures and keep a database consistent even in cases of system failure. For example: when execution prematurely and unexpectedly stops in which case many operations upon a database remain uncompleted, with unclear status.
  2. To provide isolation between programs accessing a database concurrently. If this isolation is not provided, the programs' outcomes are possibly erroneous.

A foreign key is a set of attributes in a table that refers to the primary key of another table, linking these two tables. In the context of relational databases, a foreign key is subject to a inclusion dependency constraint that the tuples consisting of the foreign key attributes in one relation, R, must also exist in some other relation, S; furthermore that those attributes must also be a candidate key in S.

Data degradation is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. The phenomenon is also known as data decay, data rot or bit rot. This process leads to the slow deterioration of data quality over time, even if the data is not actively being used or accessed.

<span class="mw-page-title-main">Referential integrity</span> Where all data references are valid

Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute, then the referenced value must exist.

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural key.

A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes.

Database design is the organization of data according to a database model. The designer determines what data must be stored and how the data elements interrelate. With this information, they can begin to fit the data to the database model. A database management system manages the data accordingly.

<span class="mw-page-title-main">Data corruption</span> Errors in computer data that introduce unintended changes to the original data

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.

In computing, data validation or input validation is the process of ensuring data has undergone data cleansing to confirm they have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation logic of the computer and its application.

Data cleansing or data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall.

Data scrubbing is an error correction technique that uses a background task to periodically inspect main memory or storage for errors, then corrects detected errors using redundant data in the form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.

In computer main memory, auxiliary storage and computer buses, data redundancy is the existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data. The additional data can simply be a complete copy of the actual data, or only select pieces of data that allow detection of errors and reconstruction of lost or damaged data up to a certain level.

<span class="mw-page-title-main">Disk sector</span> Logical or physical division of storage media

In computer disk storage, a sector is a subdivision of a track on a magnetic disk or optical disc. For most disks, each sector stores a fixed amount of user-accessible data, traditionally 512 bytes for hard disk drives (HDDs) and 2048 bytes for CD-ROMs and DVD-ROMs. Newer HDDs and SSDs use 4096-byte (4 KiB) sectors, which are known as the Advanced Format (AF).

<span class="mw-page-title-main">Database model</span> Type of data model

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

Resilient File System (ReFS), codenamed "Protogon", is a Microsoft proprietary file system introduced with Windows Server 2012 with the intent of becoming the "next generation" file system after NTFS.

The following is provided as an overview of and topical guide to databases:

ZFS is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open source license as OpenSolaris for around 5 years from 2005 before being placed under a closed source license when Oracle Corporation acquired Sun in 2009–2010. During 2005 to 2010, the open source version of ZFS was ported to Linux, Mac OS X and FreeBSD. In 2010, the illumos project forked a recent version of OpenSolaris, including ZFS, to continue its development as an open source project. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems.

References

  1. Boritz, J. "IS Practitioners' Views on Core Concepts of Information Integrity". International Journal of Accounting Information Systems. Elsevier. Archived from the original on 5 October 2011. Retrieved 12 August 2011.
  2. What is Data Integrity? Learn How to Ensure Database Data Integrity via Checks, Tests, & Best Practices
  3. What is Data Integrity? Data Protection 101
  4. From the book: Uberveillance and the Social Implications of Microchip Implants: Emerging Page 40
  5. Vijayan Prabhakaran (2006). "IRON FILE SYSTEMS" (PDF). Doctor of Philosophy in Computer Sciences. University of Wisconsin-Madison. Archived (PDF) from the original on 2022-10-09. Retrieved 9 June 2012.
  6. "Parity Lost and Parity Regained".
  7. "An Analysis of Data Corruption in the Storage Stack" (PDF). Archived (PDF) from the original on 2022-10-09.
  8. "Impact of Disk Corruption on Open-Source DBMS" (PDF). Archived (PDF) from the original on 2022-10-09.
  9. "Baarf.com". Baarf.com. Retrieved 4 November 2011.
  10. Bierman, Margaret; Grimmer, Lenz (August 2012). "How I Use the Advanced Capabilities of Btrfs" . Retrieved 2014-01-02.
  11. Yupu Zhang; Abhishek Rajimwale; Andrea Arpaci-Dusseau; Remzi H. Arpaci-Dusseau (2010). "End-to-end data integrity for file systems: a ZFS case study" (PDF). USENIX Conference on File and Storage Technologies. CiteSeerX   10.1.1.154.3979 . S2CID   5722163. Wikidata   Q111972797 . Retrieved 2014-01-02.
  12. "Data Integrity and Compliance with CGMP: Guidance for Industry" (PDF). U.S. Food and Drug Administration. April 2016. Archived (PDF) from the original on 2022-10-09. Retrieved 20 January 2018.
  13. Davidson, J. (18 July 2017). "Data Integrity Guidance Around the World". Contract Pharma. Rodman Media. Retrieved 20 January 2018.
  14. Scannel, P. (12 May 2015). "Data Integrity: A perspective from the medical device regulatory and standards framework" (PDF). Data Integrity Seminar. Parenteral Drug Association. pp. 10–57. Archived from the original (PDF) on 20 January 2018. Retrieved 20 January 2018.
  15. Cook, R. (4 January 2017). "2017 Regulatory and Examination Priorities Letter". Financial Industry Regulatory Authority. Retrieved 20 January 2018.
  16. Cook, R. (8 January 2018). "2018 Regulatory and Examination Priorities Letter". Financial Industry Regulatory Authority. Retrieved 20 January 2018.
  17. "Data Integrity: Enabling Effective Decisions in Mining Operations" (PDF). Accenture. 2016. Archived (PDF) from the original on 2022-10-09. Retrieved 20 January 2018.
  18. "Industry 4.0 and Cyber-Physical Systems Raise the Data Integrity Imperative". Nymi Blog. Nymi, Inc. 24 October 2017. Retrieved 20 January 2018.[ permanent dead link ]
  19. Priyadharshini, B.; Parvathi, P. (2012). "Data integrity in cloud storage". Proceedings from the 2012 International Conference on Advances in Engineering, Science and Management. ISBN   9788190904223.
  20. Zafar, F.; Khan, A.; Malik, S.U.R.; et al. (2017). "A survey of cloud computing data integrity schemes: Design challenges, taxonomy and future trends". Computers & Security. 65 (3): 29–49. doi:10.1016/j.cose.2016.10.006.
  21. Imran, M.; Hlavacs, H.; Haq, I.U.I.; et al. (2017). "Provenance based data integrity checking and verification in cloud environments". PLOS ONE. 12 (5): e0177576. Bibcode:2017PLoSO..1277576I. doi: 10.1371/journal.pone.0177576 . PMC   5435237 . PMID   28545151.