Data deduplication

In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent.

The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns'), which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis and compared to other chunks within existing data. Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. [1] [2]

A related technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. While it is possible to combine single-instance storage with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication, which can operate at the segment or sub-block level.

Deduplication is different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encode this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and to replace them with a shared copy.

Functioning principle

For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage savings: deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks. [3]
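
As a minimal sketch of this pairing (the function names and in-memory store are illustrative, not any particular product's API), a content-addressed attachment store might look like:

```python
# A minimal sketch of deduplication paired with compression, assuming
# whole-attachment granularity; names and the in-memory store are illustrative.
import hashlib
import zlib

store = {}  # content hash -> compressed attachment bytes


def save_attachment(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key not in store:             # first instance: compress and store it
        store[key] = zlib.compress(data)
    return key                       # later instances cost only this reference


def load_attachment(key: str) -> bytes:
    return zlib.decompress(store[key])
```

Saving the same 1 MB attachment 100 times stores one compressed copy plus 100 small keys, which is the roughly 100-to-1 ratio described above.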

In computer code, deduplication is achieved by, for example, storing information in variables so that it doesn't have to be written out individually and can be changed all at once at a central, referenced location. Examples are CSS classes and named references in MediaWiki.

Benefits

Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which are routinely performed to protect against data loss, most data in a given backup remains unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard-linking) files that haven't changed or by storing differences between files. Neither approach captures all redundancies, however: hard-linking does not help with large files that have changed only in small ways, such as an email database, and differencing finds redundancies only in adjacent versions of a single file (consider a section that was deleted and later added back in, or a logo image included in many documents).

In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.

Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.

Classification

Post-process versus in-line deduplication

Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written.

With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity.
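
As an illustrative sketch (not any vendor's implementation), a post-process pass over already-written blocks might look like this:

```python
# A minimal sketch of post-process deduplication: data lands on disk first,
# and a later background pass folds duplicate blocks into references.
# The in-memory lists/dicts stand in for on-disk structures.
import hashlib


def background_dedupe(volume: list[bytes]) -> tuple[dict[bytes, bytes], list[bytes]]:
    """Scan blocks that were stored verbatim; keep one copy per fingerprint."""
    blocks: dict[bytes, bytes] = {}   # fingerprint -> unique block
    refs: list[bytes] = []            # logical block map, now just references
    for data in volume:
        fp = hashlib.sha256(data).digest()
        blocks.setdefault(fp, data)   # first copy wins; duplicates are dropped
        refs.append(fp)
    return blocks, refs
```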

Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block.
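
In contrast, an in-line version performs the hash calculation and lookup on the write path itself; a minimal sketch, assuming a fixed block size (the names and sizes are illustrative):

```python
# A minimal sketch of in-line deduplication: the block is hashed and looked
# up before anything is written, so a duplicate block is never stored at all.
import hashlib

blocks: dict[bytes, bytes] = {}   # fingerprint -> stored block
volume: list[bytes] = []          # logical block map: references only


def write_block(data: bytes) -> None:
    fp = hashlib.sha256(data).digest()
    if fp not in blocks:          # unseen block: store the data itself
        blocks[fp] = data
    volume.append(fp)             # already stored: keep only the reference
```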

The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates.

Post-process and in-line deduplication methods are often heavily debated. [4] [5]

Data formats

The SNIA Dictionary identifies two methods: [2]

  - content-agnostic data deduplication – a data deduplication method that does not require awareness of specific application data formats;
  - content-aware data deduplication – a data deduplication method that leverages knowledge of specific application data formats.

Source versus target deduplication

Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication".

Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system periodically scans new files, creating hashes, and compares them to the hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered separate entities: if one of the duplicated files is later modified, a mechanism called copy-on-write creates a copy of the changed file or block. The deduplication process is transparent to users and backup applications. Backing up a deduplicated file system will often cause duplication to occur, resulting in backups that are bigger than the source data. [6] [7]

Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of "linking" on file systems called the reflink (Linux) or clonefile (macOS), where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level. [8] The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. [9] Microsoft's ReFS also supports this operation. [10]
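
On Linux, the clone operation behind reflinks is exposed through the FICLONE ioctl (the mechanism used by cp --reflink). A minimal sketch, assuming a reflink-capable file system such as Btrfs or XFS:

```python
# A minimal sketch of creating a reflink on Linux; requires a file system
# with reflink support (e.g. Btrfs, XFS). FICLONE is defined in <linux/fs.h>.
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int)


def reflink(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Share the source's extents; later writes to either file trigger
        # copy-on-write, so the two inodes never see each other's changes.
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```

The same effect is available from the shell as cp --reflink=always source target.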

Target deduplication is the process of removing duplicates when the data was not generated at that location. An example of this would be a server connected to a SAN/NAS: the SAN/NAS would be a target for the server (target deduplication), while the server is unaware of any deduplication and is also the point of data generation. A second example is backup, where the target will generally be a backup store such as a data repository or a virtual tape library.

Deduplication methods

One of the most common forms of data deduplication implementation works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identification is identical. [11] In either case – whether the implementation assumes that a matching identification implies identical data, or verifies it outright – the duplicate chunk is replaced with a link.

Once the data has been deduplicated, when the file is read back, wherever a link is found the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications.
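
A minimal sketch of this write/read cycle, with the verification policy from above as a flag (the fixed 4 KiB chunk size and in-memory store are illustrative simplifications):

```python
# A minimal sketch of chunk-level deduplication with optional byte-for-byte
# verification of chunks whose identifiers match (the pigeonhole caveat above).
import hashlib

CHUNK_SIZE = 4096
store: dict[bytes, bytes] = {}   # identification (SHA-256 digest) -> chunk


def write_file(data: bytes, verify: bool = True) -> list[bytes]:
    """Store a file as deduplicated chunks; return its list of references."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        ident = hashlib.sha256(chunk).digest()
        if ident in store:
            if verify and store[ident] != chunk:   # don't trust the identifier blindly
                raise RuntimeError("hash collision detected")
        else:
            store[ident] = chunk
        refs.append(ident)                         # a duplicate chunk becomes a link
    return refs


def read_file(refs: list[bytes]) -> bytes:
    """On read-back, each link is transparently replaced by its chunk."""
    return b"".join(store[ident] for ident in refs)
```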

Commercial deduplication implementations differ by their chunking methods and architectures.
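
For example, fixed-size chunking is simple but shifts every subsequent boundary when bytes are inserted, so many chunkers instead place boundaries where a rolling hash of the recent bytes hits a chosen bit pattern (content-defined chunking). A minimal Gear-style sketch follows; the constants are illustrative, and real chunkers such as FastCDC add further refinements:

```python
# A minimal sketch of content-defined chunking with a Gear-style rolling hash:
# a boundary is cut where the hash's low bits are zero, so boundaries depend
# on content rather than position and survive insertions earlier in the stream.
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random table


def chunks(data: bytes, mask: int = 0x1FFF,
           min_size: int = 2048, max_size: int = 65536):
    """Yield content-defined chunks averaging roughly mask + 1 (~8 KiB) bytes."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]     # boundary hit (or forced at max_size)
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]              # trailing partial chunk
```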

To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are twofold: First, data deduplication requires overhead to discover and remove the duplicate data; in primary storage systems, this overhead may impact performance. Second, secondary data tends to contain more duplicate data; backup applications in particular commonly generate significant portions of duplicate data over time.

Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.

Single instance storage

Single-instance storage (SIS) is a system's ability to take multiple copies of content objects and replace them by a single shared copy. It is a means to eliminate data duplication and to increase efficiency. SIS is frequently implemented in file systems, email server software, data backup, and other storage-related computer software. Single-instance storage is a simple variant of data deduplication. While data deduplication may work at a segment or sub-block level, single instance storage works at the object level, eliminating redundant copies of objects such as entire files or email messages. [12]

Single-instance storage can be used alongside (or layered upon) other data deduplication or data compression methods to improve performance in exchange for an increase in complexity and for (in some cases) a minor increase in storage space requirements.

Drawbacks and concerns

One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends mainly on the hash length (see birthday attack). Thus, the concern arises that data corruption can occur if a hash collision occurs and no additional means of verification are used to check whether the data actually differs. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity. The hash functions used include standards such as SHA-1, SHA-256, and others.
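
As a rough worked example (assuming uniformly distributed b-bit hash values over n unique chunks), the birthday bound puts the collision probability near p ≈ n(n−1)/2^(b+1). For a store of n = 10^12 chunks hashed with SHA-256 (b = 256), this is about 10^24/2^257 ≈ 4×10^−54, many orders of magnitude below the undetected error rates of the underlying hardware; with a shorter hash such as SHA-1 (b = 160), the same store yields roughly 10^24/2^161 ≈ 3×10^−25, still small but a markedly thinner margin.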

The computational resource intensity of the process can be a drawback of data deduplication. To improve performance, some systems utilize both weak and strong hashes. Weak hashes are much faster to calculate, but carry a greater risk of hash collisions. Systems that utilize weak hashes will subsequently calculate a strong hash and use it as the determining factor in whether the data is actually the same. Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing, and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
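
A minimal sketch of such a two-tier scheme, using CRC-32 as the weak hash and SHA-256 as the strong one (the in-memory index and names are illustrative):

```python
# A minimal sketch of two-tier hashing: a cheap weak hash narrows the search,
# and the strong hash is computed only when a weak match must be confirmed.
import hashlib
import zlib

index: dict[int, list[int]] = {}    # weak hash -> candidate chunk ids
store: list[bytes] = []             # chunk id -> chunk bytes
strong_of: dict[int, bytes] = {}    # chunk id -> cached strong hash


def dedupe(chunk: bytes) -> int:
    """Return the id of the stored copy of chunk, storing it only if new."""
    weak = zlib.crc32(chunk)                  # computed for every chunk: cheap
    candidates = index.setdefault(weak, [])
    if candidates:                            # weak match: pay for the strong hash
        strong = hashlib.sha256(chunk).digest()
        for cid in candidates:
            if cid not in strong_of:          # lazily cache candidates' strong hashes
                strong_of[cid] = hashlib.sha256(store[cid]).digest()
            if strong_of[cid] == strong:      # strong hash is the determining factor
                return cid
    store.append(chunk)                       # no confirmed duplicate: store new chunk
    candidates.append(len(store) - 1)
    return len(store) - 1
```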

Another concern is the interaction of deduplication and encryption. The goal of encryption is to eliminate any discernible patterns in the data. Thus encrypted data cannot be deduplicated, even though the underlying data may be redundant.

Although not a shortcoming of data deduplication, there have been data breaches when insufficient security and access validation procedures were used with large repositories of deduplicated data. In some systems, as is typical with cloud storage,[citation needed] an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data. [13]

Implementations

Deduplication is implemented in some file systems, such as ZFS and Write Anywhere File Layout, and in various disk array models.[citation needed] It is a service available on both NTFS and ReFS on Windows servers.

See also

  - Backup
  - Btrfs
  - Capacity optimization
  - Chunking
  - Convergent encryption
  - Disk image
  - Disk-based backup
  - EMC NetWorker
  - Filesystem in Userspace
  - Hardware virtualization
  - NTFS reparse point
  - ONTAP
  - Proxmox Backup Server
  - Replication (computing)
  - Resilient File System (ReFS)
  - rsync
  - Single-instance storage
  - Storage efficiency
  - Veeam Backup & Replication
  - Venti

References

  1. "Understanding Data Deduplication". Druva. 2009-01-09. Archived from the original on 2019-08-06. Retrieved 2019-08-06.
  2. 1 2 "SNIA Dictionary » Dictionary D". Archived from the original on 2018-12-24. Retrieved 2023-12-06.
  3. Stephen Bigelow; Paul Crocetti. "Compression, deduplication and encryption: What's the difference?". Archived 2018-12-23 at the Wayback Machine.
  4. "In-line or post-process de-duplication? (updated 6-08)". Backup Central. Archived from the original on 2009-12-06. Retrieved 2023-12-06.
  5. "Inline vs. post-processing deduplication appliances". techtarget.com. Archived from the original on 2009-06-09. Retrieved 2023-12-06.
  6. "Windows Server 2008: Windows Storage Server 2008". Microsoft.com. Archived from the original on 2009-10-04. Retrieved 2009-10-16.
  7. "Products - Platform OS". NetApp. Archived from the original on 2010-02-06. Retrieved 2009-10-16.
  8. "The reflink(2) system call v5". lwn.net. Archived from the original on 2015-10-02. Retrieved 2019-10-04.
  9. "ioctl_ficlonerange(2)". Linux Manual Page. Archived from the original on 2019-10-07. Retrieved 2019-10-04.
  10. Kazuki MATSUDA. "Add clonefile on Windows over ReFS support". GitHub. Archived from the original on 2021-01-13. Retrieved 2020-02-23.
  11. An example of an implementation that checks for identity rather than assuming it is described in "US Patent application # 20090307251" Archived 2017-01-15 at the Wayback Machine .
  12. George Crump, Storage Switzerland. "Explaining deduplication rates and single-instance storage to clients". Archived 2018-12-23 at the Wayback Machine.
  13. Christian Cachin; Matthias Schunter (December 2011). "A Cloud You Can Trust". IEEE Spectrum. IEEE. Archived from the original on 2012-01-02. Retrieved 2011-12-21.