Data corruption

Photo data corruption; in this case, the result of a failed data recovery from a hard disk drive.

Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of measures to provide end-to-end data integrity, or lack of errors.


In general, when data corruption occurs, a file containing that data will produce unexpected results when accessed by the system or the related application. Results could range from a minor loss of data to a system crash. For example, if a document file is corrupted, a person trying to open it with a document editor may get an error message; the file might not open at all, or it might open with some of its data corrupted (or, in some cases, completely corrupted, leaving the document unintelligible). The adjacent image is a corrupted image file in which most of the information has been lost.

Some types of malware may intentionally corrupt files as part of their payloads, usually by overwriting them with inoperative or garbage code, while a non-malicious virus may also unintentionally corrupt files when it accesses them. If a virus or trojan with this payload method manages to alter files critical to the running of the computer's operating system software or physical hardware, the entire system may be rendered unusable.

Some programs can offer to repair the file automatically after the error occurs, while others cannot; whether repair is possible depends on the severity of the corruption and on the application's built-in ability to handle the error. Data corruption has a variety of causes.

Overview

Photo of an Atari 2600 with corrupted RAM.
A video that has been corrupted.

There are two types of data corruption associated with computer systems: undetected and detected. Undetected data corruption, also known as silent data corruption, results in the most dangerous errors as there is no indication that the data is incorrect. Detected data corruption may be permanent with the loss of data, or may be temporary when some part of the system is able to detect and correct the error; there is no data corruption in the latter case.

Data corruption can occur at any level in a system, from the host to the storage medium. Modern systems attempt to detect corruption at many layers and then recover or correct it; this is almost always successful, but in rare cases the information arriving in the system's memory is corrupted, which can cause unpredictable results.

Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.

Hardware and software failure are the two main causes for data loss. Background radiation, head crashes, and aging or wear of the storage device fall into the former category, while software failure typically occurs due to bugs in the code. Cosmic rays cause most soft errors in DRAM. [1]

Silent

Some errors go unnoticed, without being detected by the disk firmware or the host operating system; these errors are known as silent data corruption. [2]

There are many error sources beyond the disk storage subsystem itself. For instance, cables might be slightly loose, the power supply might be unreliable, [3] external vibrations such as a loud sound, [4] the network might introduce undetected corruption, [5] cosmic radiation and many other causes of soft memory errors, etc. In 39,000 storage systems that were analyzed, firmware bugs accounted for 5–10% of storage failures. [6] All in all, the error rates as observed by a CERN study on silent corruption are far higher than one in every 10^16 bits. [7] Webshop Amazon.com has acknowledged similar high data corruption rates in their systems. [8] In 2021, faulty processor cores were identified as an additional cause in publications by Google and Facebook; cores were found to be faulty at a rate of several in thousands of cores. [9] [10]

One problem is that hard disk drive capacities have increased substantially, while their error rates have remained unchanged. The data corruption rate per bit has stayed roughly constant over time, meaning that modern disks are not much safer than old disks. Old disks stored so little data that the probability of any corruption was very small; modern disks hold far more data at the same per-bit error rate, so the probability of corruption is much larger. Thus, silent data corruption was not a serious concern while storage devices remained relatively small and slow, but with the advent of larger drives and very fast RAID setups, users are capable of transferring 10^16 bits in a reasonably short time, easily reaching the data corruption thresholds. [11]

As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum, a database software company specializing in large-scale data warehousing and analytics, faces silent corruption every 15 minutes. [12] As another example, a real-life study performed by NetApp on more than 1.5 million HDDs over 41 months found more than 400,000 silent data corruptions, out of which more than 30,000 were not detected by the hardware RAID controller (only detected during scrubbing). [13] Another study, performed by CERN over six months and involving about 97 petabytes of data, found that about 128 megabytes of data became permanently corrupted silently somewhere in the pathway from network to disk. [14]

Silent data corruption may result in cascading failures, in which the system may run for a period of time with an undetected initial error that causes increasingly more problems until the corruption is ultimately detected. [15] For example, a failure affecting file system metadata can result in multiple files being partially damaged or made completely inaccessible as the file system is used in its corrupted state.

Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes (ECC).
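
The following sketch (Python, with made-up data) illustrates the detection side: a stored CRC32 checksum no longer matches when a single bit is flipped, which signals corruption without identifying or repairing the damaged bit.

    # Minimal sketch: detecting a single flipped bit with a CRC32 checksum.
    # The checksum reveals that corruption occurred, but by itself it cannot
    # say which bit changed or repair it.
    import zlib

    original = bytearray(b"account balance: 1024.00")   # hypothetical payload
    stored_checksum = zlib.crc32(original)

    # Simulate a single-bit error during storage or transmission.
    corrupted = bytearray(original)
    corrupted[3] ^= 0x04                                # flip one bit

    assert zlib.crc32(original) == stored_checksum      # intact data verifies
    assert zlib.crc32(corrupted) != stored_checksum     # corruption is detected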

If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of a single or multiple disks, depending on the level of RAID implemented. Some CPU architectures employ various transparent checks to detect and mitigate data corruption in CPU caches, CPU buffers and instruction pipelines; an example is Intel Instruction Replay technology, which is available on Intel Itanium processors. [16]
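
The parity arithmetic behind such reconstruction can be sketched as follows (Python, hypothetical four-disk RAID-5-style stripe): the parity block is the bitwise XOR of the data blocks, so any single missing block equals the XOR of everything that survives. Real arrays add striping, rotating parity, and extensive bookkeeping not shown here.

    # Minimal sketch: rebuilding a lost block from XOR parity.
    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    data = [b"AAAA", b"BBBB", b"CCCC"]      # data blocks on three disks
    parity = xor_blocks(data)               # parity block on the fourth disk

    # Disk 1 fails; rebuild its block from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]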

Many errors are detected and corrected by the hard disk drives using the ECC codes [17] which are stored on disk for each sector. If the disk drive detects multiple read errors on a sector it may make a copy of the failing sector on another part of the disk, by remapping the failed sector of the disk to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector). This "silent correction" can be monitored using S.M.A.R.T. and tools available for most operating systems to automatically check the disk drive for impending failures by watching for deteriorating SMART parameters.
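
The idea behind per-sector ECC can be illustrated with a toy Hamming(7,4) code (Python sketch below); real drives use far stronger codes such as Reed–Solomon or LDPC, but the principle of computing a syndrome that points at the flipped bit is similar.

    # Minimal sketch: Hamming(7,4) single-bit error correction.
    def hamming74_encode(d):
        """Encode 4 data bits (0/1 values) into a 7-bit codeword."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]      # codeword positions 1..7

    def hamming74_correct(code):
        """Detect and correct a single flipped bit, returning the 4 data bits."""
        c = list(code)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]           # checks positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]           # checks positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]           # checks positions 4,5,6,7
        error_pos = s1 + 2 * s2 + 4 * s3         # 0 means "no error detected"
        if error_pos:
            c[error_pos - 1] ^= 1                # flip the offending bit back
        return [c[2], c[4], c[5], c[6]]

    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1                                 # corrupt one bit "on disk"
    assert hamming74_correct(word) == [1, 0, 1, 1]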

Some file systems, such as Btrfs, HAMMER, ReFS, and ZFS, use internal data and metadata checksumming to detect silent data corruption. In addition, if a corruption is detected and the file system uses integrated RAID mechanisms that provide data redundancy, such file systems can also reconstruct corrupted data in a transparent way. [18] This approach allows improved data integrity protection covering the entire data paths, which is usually known as end-to-end data protection, compared with other data integrity approaches that do not span different layers in the storage stack and allow data corruption to occur while the data passes boundaries between the different layers. [19]
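
A toy model of this approach (Python, hypothetical structures, not the on-disk format of any real file system) is sketched below: each block's checksum is stored separately from the block, and a read that fails verification falls back to a redundant copy and transparently repairs the bad one.

    # Minimal sketch: checksum-verified read with repair from a mirror copy.
    import zlib

    def read_with_repair(copies, expected_checksum):
        """Return verified data, repairing any copy whose checksum mismatches."""
        good = next((c for c in copies if zlib.crc32(c) == expected_checksum), None)
        if good is None:
            raise IOError("all copies corrupted and unrecoverable")
        for i, c in enumerate(copies):
            if zlib.crc32(c) != expected_checksum:
                copies[i] = good                 # transparently repair the bad copy
        return good

    block = b"important payload"
    checksum = zlib.crc32(block)
    mirrors = [b"important pXyload", block]      # first mirror silently corrupted
    data = read_with_repair(mirrors, checksum)
    assert data == block and mirrors[0] == block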

Data scrubbing is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, it is checked during a regular scan of the disk, often run as a low-priority background process; the data scrubbing operation is what activates the parity check. If a user simply runs a normal program that reads data from the disk, the parity would not be checked unless parity-check-on-read is both supported and enabled on the disk subsystem.
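
The following sketch (Python, hypothetical stripes of three data blocks plus an XOR parity block) shows the core of such a background pass: every stripe is checked regardless of whether anything has recently read it, so latent errors surface before a second failure makes them uncorrectable. A real scrubber would also rebuild the damaged block from redundancy rather than merely reporting it.

    # Minimal sketch: a scrub pass that walks all stripes and verifies parity.
    def xor_parity(blocks):
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    def scrub(stripes):
        """Yield the indices of stripes whose stored parity no longer matches."""
        for index, (data_blocks, stored_parity) in enumerate(stripes):
            if xor_parity(data_blocks) != stored_parity:
                yield index

    stripes = [
        ([b"AAAA", b"BBBB", b"CCCC"], xor_parity([b"AAAA", b"BBBB", b"CCCC"])),
        ([b"AAAA", b"BxBB", b"CCCC"], xor_parity([b"AAAA", b"BBBB", b"CCCC"])),  # latent error
    ]
    assert list(scrub(stripes)) == [1]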

If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g. banking), where an undetected error could either corrupt a database index or change data to drastically affect an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable. [7]


Related Research Articles

Checksum

A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.
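
The distinction can be illustrated with a short sketch (Python, made-up messages): an accidental change breaks the stored checksum, but a deliberate attacker can recompute the checksum over altered data, so a matching checksum protects against random errors, not against forgery.

    # Minimal sketch: checksums verify integrity, not authenticity.
    import zlib

    message = b"transfer 100 to account 42"
    checksum = zlib.crc32(message)

    # Accidental corruption: the stored checksum no longer matches.
    garbled = b"transfer 100 to account 43"
    assert zlib.crc32(garbled) != checksum

    # Deliberate tampering: the attacker recomputes the checksum,
    # so the altered message still "verifies" against it.
    tampered = b"transfer 900 to account 66"
    forged_checksum = zlib.crc32(tampered)
    assert zlib.crc32(tampered) == forged_checksum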

Error detection and correction

In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases.

Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle. It is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context – even under the same general umbrella of computing. It is at times used as a proxy term for data quality, while data validation is a prerequisite for data integrity. Data integrity is the opposite of data corruption. The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended. Moreover, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

A parity bit, or check bit, is a bit added to a string of binary code. Parity bits are a simple form of error detecting code. Parity bits are generally applied to the smallest units of a communication protocol, typically 8-bit octets (bytes), although they can also be applied separately to an entire message string of bits.
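
A minimal sketch (Python) of an even parity bit over a single byte: the bit is chosen so that the total number of 1 bits is even, so any single flipped bit makes the count odd and is detected, although its position cannot be determined.

    # Minimal sketch: even parity over one byte.
    def even_parity_bit(byte: int) -> int:
        return bin(byte).count("1") % 2

    data = 0b0110_1001                 # four 1 bits
    parity = even_parity_bit(data)     # 0, so the total stays even
    assert (bin(data).count("1") + parity) % 2 == 0

    corrupted = data ^ 0b0000_1000     # one bit flips in transit
    assert (bin(corrupted).count("1") + parity) % 2 == 1   # parity check fails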

Data degradation is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. The phenomenon is also known as data decay, data rot or bit rot.

Data striping

In computer data storage, data striping is the technique of segmenting logically sequential data, such as a file, so that consecutive segments are stored on different physical storage devices.

File system

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names are called a "file system."

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

RAM parity checking is the storing of a redundant parity bit representing the parity of a small amount of computer data stored in random-access memory, and the subsequent comparison of the stored and the computed parity to detect whether a data error has occurred.

Data scrubbing is an error correction technique that uses a background task to periodically inspect main memory or storage for errors, then corrects detected errors using redundant data in the form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.

ECC memory

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.

In computer main memory, auxiliary storage and computer buses, data redundancy is the existence of data that is additional to the actual data and permits correction of errors in stored or transmitted data. The additional data can simply be a complete copy of the actual data, or only select pieces of data that allow detection of errors and reconstruction of lost or damaged data up to a certain level.

A bad sector in computing is a disk sector on a disk storage unit that is unreadable. Once the sector is damaged, all information stored on it is lost. When a bad sector is found and marked, an operating system such as Windows or Linux will skip it in the future. Bad sectors are a threat to information security in the sense of data remanence.

In computer storage, the standard RAID levels comprise a basic set of RAID configurations that employ the techniques of striping, mirroring, or parity to create large reliable data stores from multiple general-purpose computer hard disk drives (HDDs). The most common types are RAID 0 (striping), RAID 1 (mirroring) and its variants, RAID 5, and RAID 6. Multiple RAID levels can also be combined or nested, for instance RAID 10 or RAID 01. RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard. The numerical values only serve as identifiers and do not signify performance, reliability, generation, or any other metric.

Although all RAID implementations differ from the specification to some extent, some companies and open-source projects have developed non-standard RAID implementations that differ substantially from the standard. Additionally, there are non-RAID drive architectures, providing configurations of multiple hard drives not referred to by RAID acronyms.

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the goal of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted.
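
A highly simplified, in-memory sketch of the write-ahead idea (Python, hypothetical structures, not any real file system's journal format): the intent of an update is logged and committed before the main structures are modified, and recovery replays committed entries while discarding uncommitted ones.

    # Minimal sketch: write-ahead journaling and crash recovery.
    journal = []          # sequential log of {"op", "committed"} records
    main_store = {}       # stands in for the file system's main structures

    def begin(op):
        journal.append({"op": op, "committed": False})
        return journal[-1]

    def commit(entry):
        entry["committed"] = True       # the durability point

    def apply(entry):
        path, data = entry["op"]
        main_store[path] = data

    # Normal operation: journal first, commit, then apply to the main store.
    e = begin(("/etc/motd", b"hello"))
    commit(e)
    apply(e)

    # Crash recovery: replay committed entries, ignore uncommitted ones.
    for entry in journal:
        if entry["committed"]:
            apply(entry)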

Data Integrity Field (DIF) is an approach to protect data integrity in computer data storage from data corruption. It was proposed in 2003 by the T10 subcommittee of the International Committee for Information Technology Standards. A similar approach for data integrity was added in 2016 to the NVMe 1.2.1 specification.

Resilient File System (ReFS), codenamed "Protogon", is a Microsoft proprietary file system introduced with Windows Server 2012 with the intent of becoming the "next generation" file system after NTFS.

ZFS is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open source license as OpenSolaris for around 5 years from 2005 before being placed under a closed source license when Oracle Corporation acquired Sun in 2009–2010. During 2005 to 2010, the open source version of ZFS was ported to Linux, Mac OS X and FreeBSD. In 2010, the illumos project forked a recent version of OpenSolaris, including ZFS, to continue its development as an open source project. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems.

References

  1. Scientific American (2008-07-21). "Solar Storms: Fast Facts". Nature Publishing Group. Archived from the original on 2010-12-26. Retrieved 2009-12-08.
  2. "Silent Data Corruption". Google Inc. 2023. Retrieved January 30, 2023. Silent Data Corruption (SDC), sometimes referred to as Silent Data Error (SDE), is an industry-wide issue impacting not only long-protected memory, storage, and networking, but also computer CPUs.
  3. Eric Lowe (16 November 2005). "ZFS saves the day(-ta)!". Oracle – Core Dumps of a Kernel Hacker's Brain – Eric Lowe's Blog. Oracle. Archived from the original (Blog) on 5 February 2012. Retrieved 9 June 2012.
  4. bcantrill (31 December 2008). "Shouting in the Datacenter" (Video file). YouTube. Archived from the original on 3 July 2012. Retrieved 9 June 2012.
  5. jforonda (31 January 2007). "Faulty FC port meets ZFS" (Blog). Blogger – Outside the Box. Archived from the original on 26 April 2012. Retrieved 9 June 2012.
  6. "Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics" (PDF). USENIX. Archived (PDF) from the original on 2022-01-25. Retrieved 2014-01-18.
  7. Bernd Panzer-Steindel (8 April 2007). "Draft 1.3". Data integrity. CERN. Archived from the original on 27 October 2012. Retrieved 9 June 2012.
  8. "Observations on Errors, Corrections, & Trust of Dependent Systems". Archived from the original on 2013-10-29.
  9. Hochschild, Peter H.; Turner, Paul Jack; Mogul, Jeffrey C.; Govindaraju, Rama Krishna; Ranganathan, Parthasarathy; Culler, David E.; Vahdat, Amin (2021). "Cores that don't count" (PDF). Proceedings of the Workshop on Hot Topics in Operating Systems. pp. 9–16. doi:10.1145/3458336.3465297. ISBN 9781450384384. S2CID 235311320. Archived (PDF) from the original on 2021-06-03. Retrieved 2021-06-02.
  10. HotOS 2021: Cores That Don't Count (Fun Hardware), archived from the original on 2021-12-22, retrieved 2021-06-02
  11. "Silent data corruption in disk arrays: A solution". NEC. 2009. Archived from the original (PDF) on 29 October 2013. Retrieved 14 December 2020.
  12. "A Conversation with Jeff Bonwick and Bill Moore". Association for Computing Machinery. November 15, 2007. Archived from the original on 16 July 2011. Retrieved 14 December 2020.
  13. David S. H. Rosenthal (October 1, 2010). "Keeping Bits Safe: How Hard Can It Be?". ACM Queue. Archived from the original on December 17, 2013. Retrieved 2014-01-02.; Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H. 2008. An analysis of data corruption in the storage stack. In Proceedings of 6th Usenix Conference on File and Storage Technologies.
  14. Kelemen, P. Silent corruptions (PDF). 8th Annual Workshop on Linux Clusters for Super Computing.
  15. David Fiala; Frank Mueller; Christian Engelmann; Rolf Riesen; Kurt Ferreira; Ron Brightwell (November 2012). "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" (PDF). fiala.me. IEEE. Archived (PDF) from the original on 2014-11-07. Retrieved 2015-01-26.
  16. Steve Bostian (2012). "Rachet Up Reliability for Mission-Critical Applications: Intel Instruction Replay Technology" (PDF). Intel. Archived (PDF) from the original on 2016-02-02. Retrieved 2016-01-27.
  17. "Read Error Severities and Error Management Logic". Archived from the original on 7 April 2012. Retrieved 4 April 2012.
  18. Margaret Bierman; Lenz Grimmer (August 2012). "How I Use the Advanced Capabilities of Btrfs". Oracle Corporation. Archived from the original on 2014-01-02. Retrieved 2014-01-02.
  19. Yupu Zhang; Abhishek Rajimwale; Andrea Arpaci-Dusseau; Remzi H. Arpaci-Dusseau (2010). "End-to-end data integrity for file systems: a ZFS case study" (PDF). USENIX Conference on File and Storage Technologies. CiteSeerX 10.1.1.154.3979. S2CID 5722163. Wikidata Q111972797. Retrieved 2014-08-12.