Content-addressable storage

Last updated

Content-addressable storage (CAS), also referred to as content-addressed storage or fixed-content storage, is a way to store information so it can be retrieved based on its content, not its name or location. It has been used for high-speed storage and retrieval of fixed content, such as documents stored for compliance with government regulations[ citation needed ]. Content-addressable storage is similar to content-addressable memory.

Contents

CAS systems work by passing the content of the file through a cryptographic hash function to generate a unique key, the "content address". The file system's directory stores these addresses and a pointer to the physical storage of the content. Because an attempt to store the same file will generate the same key, CAS systems ensure that the files within them are unique, and because changing the file will result in a new key, CAS systems provide assurance that the file is unchanged.

CAS became a significant market during the 2000s, especially after the introduction of the 2002 Sarbanes–Oxley Act in the United States which required the storage of enormous numbers of documents for long periods and retrieved only rarely. Ever-increasing performance of traditional file systems and new software systems have eroded the value of legacy CAS systems, which have become increasingly rare after roughly 2018[ citation needed ]. However, the principles of content addressability continue to be of great interest to computer scientists, and form the core of numerous emerging technologies, such as peer-to-peer file sharing, cryptocurrencies, and distributed computing.

Description

Location-based approaches

Traditional file systems generally track files based on their filename. On random-access media like a floppy disk, this is accomplished using a directory that consists of some sort of list of filenames and pointers to the data. The pointers refer to a physical location on the disk, normally using disk sectors. On more modern systems and larger formats like hard drives, the directory is itself split into many subdirectories, each tracking a subset of the overall collection of files. Subdirectories are themselves represented as files in a parent directory, producing a hierarchy or tree-like organization. The series of directories leading to a particular file is known as a "path". [1]

In the context of CAS, these traditional approaches are referred to as "location-addressed", as each file is represented by a list of one or more locations, the path and filename, on the physical storage. In these systems, the same file with two different names will be stored as two files on disk and thus have two addresses. The same is true if the same file, even with the same name, is stored in more than one location in the directory hierarchy. This makes them less than ideal for a digital archive, where any unique information should only be stored once. [2]

As the concept of the hierarchical directory became more common in operating systems especially during the late 1980s, this sort of access pattern began to be used by entirely unrelated systems. For instance, the World Wide Web uses a similar pathname/filename-like system known as the URL to point to documents. The same document on another web server has a different URL in spite of being identical content. Likewise, if an existing location changes in any way, if the filename changes or the server moves to a new domain name service name, the document is no longer accessible. This leads to the common problem of link rot. [2]

CAS and FCS

Although location-based storage is widely used in many fields, this was not always the case. Previously, the most common way to retrieve data from a large collection was to use some sort of identifier based on the content of the document. For instance, the ISBN system is used to generate a unique number for every book. If one performs a web search for "ISBN 0465048994", one will be provided with a list of locations for the book Why Information Grows on the topic of information storage. Although many locations will be returned, they all refer to the same work, and the user can then pick whichever location is most appropriate. Additionally, if any one of these locations changes or disappears, the content can be found at any of the other locations. [2]

CAS systems attempt to produce ISBN like results automatically and on any document. They do this by using a cryptographic hash function on the data of the document to produce what is sometimes known as a "key" or "fingerprint". This key is strongly tied to the exact content of the document, adding a single space at the end of the file, for instance, will produce a different key. In a CAS system, the directory does not map filenames onto locations, but uses the keys instead. [2]

This provides several benefits. For one, when a file is sent to the CAS for storage, the hash function will produce a key and then check to see if that key already exists in the directory. If it does, the file is not stored as the one already in storage is identical. This allows CAS systems to easily avoid duplicate data. Additionally, as the key is based on the content of the file, retrieving a document with a given key ensures that the stored file has not been changed. The downside to this approach is that any changes to the document produces a different key, which makes CAS systems unsuitable for files that are often edited. For all of these reasons, CAS systems are normally used for archives of largely static documents, [2] and are sometimes known as "fixed content storage" (FCS). [3]

Because the keys are not human-readable, CAS systems implement a second type of directory that stores metadata that will help users find a document. These almost always include a filename, allowing the classic name-based retrieval to be used. But the directory will also include fields for common identification systems like ISBN or ISSN codes, user-provided keywords, time and date stamps, and full-text search indexes. Users can search these directories and retrieve a key, which can then be used to retrieve the actual document. [2]

Using a CAS is very similar to using a web search engine. The primary difference is that a web search is generally performed on a topic basis using an internal algorithm that finds "related" content and then produces a list of locations. The results may be a list of the identical content in multiple locations. In a CAS, more than one document may be returned for a given search, but each of those documents will be unique and presented only once.

Another advantage to CAS is that the physical location in storage is not part of the lookup system. If, for instance, a library's card catalog stated a book could be found on "shelf 43, bin 10", if the library is re-arranged the entire catalog has to be updated. In contrast, the ISBN will not change and the book can be found by looking for the shelf with those numbers. In the computer setting, a file in the DOS filesystem at the path A:\myfiles\textfile.txt points to the physical storage of the file in the myfiles subdirectory. This file disappears if the floppy is moved to the B: drive, and even moving its location within the disk hierarchy requires the user-facing directories to be updated. In CAS, only the internal mapping from key to physical location changes, and this exists in only one place and can be designed for efficient updating. This allows files to be moved among storage devices, and even across media, without requiring any changes to the retrieval.

For data that changes frequently, CAS is not as efficient as location-based addressing. In these cases, the CAS device would need to continually recompute the address of data as it was changed. This would result in multiple copies of the entire almost-identical document being stored, the problem that CAS attempts to avoid. Additionally, the user-facing directories would have to be continually updated with these "new" files, which would become polluted by many similar documents that would make searching more difficult. In contrast, updating a file in a location-based system is highly optimized, only the internal list of sectors has to be changed and many years of tuning have been applied to this operation.

Because CAS is used primarily for archiving, file deletion is often tightly controlled or even impossible under user control. In contrast, automatic deletion is a common feature, removing all files older than some legally defined requirement, say ten years. [2]

In distributed computing

The simplest way to implement a CAS system is to store all of the files within a typical database to which clients connect to add, query, and retrieve files. However, the unique properties of content addressability mean that the paradigm is well suited for computer systems in which multiple hosts collaboratively manage files with no central authority, such as distributed file sharing systems, in which the physical location of a hosted file can change rapidly in response to changes in network topology, while the exact content of the files to be retrieved are of more importance to users than their current physical location. In a distributed system, content hashes are often used for quick network-wide searches for specific files, or to quickly see which data in a given file has been changed and must be propagated to other members of the network with minimal bandwidth usage. In these systems, content addressability allows highly variable network topology to be abstracted away from users who wish to access data, compared to systems like the World Wide Web, in which a consistent location of a file or service is key to easy use.

Content-addressable networks

The content-addressable network (CAN) is a distributed, decentralized P2P infrastructure that provides hash table functionality on an Internet-like scale. CAN was one of the original four distributed hash table proposals, introduced concurrently with Chord, Pastry, and Tapestry.

History

A hardware device called the Content Addressable File Store (CAFS) was developed by International Computers Limited (ICL) in the late 1960s and put into use by British Telecom in the early 1970s for telephone directory lookups. The user-accessible search functionality was maintained by the disk controller with a high-level application programming interface (API) so users could send queries into what appeared to be a black box that returned documents. The advantage was that no information had to be exchanged with the host computer while the disk performed the search.

Paul Carpentier and Jan van Riel coined the term CAS while working at a company called FilePool in the late 1990s. FilePool was purchased by EMC Corporation in 2001 and was released the next year as Centera. [4] The timing was perfect; the introduction of the Sarbanes–Oxley Act in 2002 required companies to store huge amounts of documentation for extended periods and required them to do so in a fashion that ensured they were not edited after-the-fact. [5]

A number of similar products soon appeared from other large-system vendors. In mid-2004, the industry group SNIA began working with a number of CAS providers to create standard behavior and interoperability guidelines for CAS systems. [6]

In addition to CAS, a number of similar products emerged that added CAS-like capabilities to existing products; notable among these was IBM Tivoli Storage Manager. The rise of cloud computing and the associated elastic cloud storage systems like Amazon S3 further diluted the value of dedicated CAS systems. Dell purchased EMC in 2016 and stopped sales of the original Centera in 2018 in favor of their elastic storage product. [7]

CAS was not associated with peer-to-peer applications until the 2000s, when rapidly proliferating Internet access in homes and businesses led to a large number of computer users who wanted to swap files, originally doing so on centrally managed services like Napster. However, an injunction against Napster prompted the independent development of file-sharing services such as BitTorrent, which could not be centrally shut down. In order to function without a central federating server, these services rely heavily on CAS to enforce the faithful copying and easy querying of unique files. At the same time, the growth of the open-source software movement in the 2000s led to the rapid proliferation of CAS-based services such as Git, a version control system that uses numerous cryptographic functions such as Merkle trees to enforce data integrity between users and allow for multiple versions of files with minimal disk and network usage. Around this time, individual users of public-key cryptography used CAS to store their public keys on systems such as key servers.

The rise of mobile computing and high capacity mobile broadband networks in the 2010s, coupled with increasing reliance on web applications for everyday computing tasks, strained the existing location-addressed client–server model commonplace among Internet services, leading to an accelerated pace of link rot and an increased reliance on centralized cloud hosting. Furthermore, growing concerns about the centralization of computing power in the hands of large technology companies, potential monopoly power abuses, and privacy concerns led to a number of projects created with the goal of creating more decentralized systems. Bitcoin uses CAS and public/private key pairs to manage wallet addresses, as do most other cryptocurrencies. IPFS uses CAS to identify and address communally hosted files on its network. Numerous other peer-to-peer systems designed to run on smartphones, which often access the Internet from varying locations, utilize CAS to store and access user data for both convenience and data privacy purposes, such as secure instant messaging.

Implementations

Proprietary

The Centera CAS system consists of a series of networked nodes (typically large servers running Linux), divided between storage nodes and access nodes. The access nodes maintain a synchronized directory of content addresses, and the corresponding storage node where each address can be found. When a new data element, or blob, is added, the device calculates a hash of the content and returns this hash as the blob's content address. [8] As mentioned above, the hash is searched to verify that identical content is not already present. If the content already exists, the device does not need to perform any additional steps; the content address already points to the proper content. Otherwise, the data is passed off to a storage node and written to the physical media.

When a content address is provided to the device, it first queries the directory for the physical location of the specified content address. The information is then retrieved from a storage node, and the actual hash of the data recomputed and verified. Once this is complete, the device can supply the requested data to the client. Within the Centera system, each content address actually represents a number of distinct data blobs, as well as optional metadata. Whenever a client adds an additional blob to an existing content block, the system recomputes the content address.

To provide additional data security, the Centera access nodes, when no read or write operation is in progress, constantly communicate with the storage nodes, checking the presence of at least two copies of each blob as well as their integrity. Additionally, they can be configured to exchange data with a different, e.g., off-site, Centera system, thereby strengthening the precautions against accidental data loss.

IBM has another flavor of CAS which can be software-based, Tivoli Storage manager 5.3, or hardware-based, the IBM DR550. The architecture is different in that it is based on hierarchical storage management (HSM) design which provides some additional flexibility such as being able to support not only WORM disk but WORM tape and the migration of data from WORM disk to WORM tape and vice versa. This provides for additional flexibility in disaster recovery situations as well as the ability to reduce storage costs by moving data off the disk to tape.

Another typical implementation is iCAS from iTernity. The concept of iCAS is based on containers. Each container is addressed by its hash value. A container holds different numbers of fixed content documents. The container is not changeable, and the hash value is fixed after the write process.

Open-source

See also

Related Research Articles

In computing, a computer file is a resource for recording data on a computer storage device, primarily identified by its filename. Just as words can be written on paper, so can data be written to a computer file. Files can be shared with and transferred between computers and mobile devices via removable media, networks, or the Internet.

<span class="mw-page-title-main">Hyphanet</span> Peer-to-peer Internet platform for censorship-resistant communication

Hyphanet is a peer-to-peer platform for censorship-resistant, anonymous communication. It uses a decentralized distributed data store to keep and deliver information, and has a suite of free software for publishing and communicating on the Web without fear of censorship. Both Freenet and some of its associated tools were originally designed by Ian Clarke, who defined Freenet's goal as providing freedom of speech on the Internet with strong anonymity protection.

<span class="mw-page-title-main">Peer-to-peer</span> Type of decentralized and distributed network architecture

Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network, forming a peer-to-peer network of nodes.

<span class="mw-page-title-main">Distributed hash table</span> Decentralized distributed system with lookup service

A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table. Key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The main advantage of a DHT is that nodes can be added or removed with minimum work around re-distributing keys. Keys are unique identifiers which map to particular values, which in turn can be anything from addresses, to documents, to arbitrary data. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

<span class="mw-page-title-main">Cryptographic hash function</span> Hash function that is suitable for use in cryptography

A cryptographic hash function (CHF) is a hash algorithm that has special properties desirable for a cryptographic application:

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values.

Files-11 is the file system used in the RSX-11 and OpenVMS operating systems from Digital Equipment Corporation. It supports record-oriented I/O, remote network access, and file versioning. The original ODS-1 layer is a flat file system; the ODS-2 version is a hierarchical file system, with support for access control lists,.

A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes.

<span class="mw-page-title-main">File system</span> Format or program for storing files and directories

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names is called a "file system."

In computer science, consistent hashing is a special kind of hashing technique such that when a hash table is resized, only keys need to be remapped on average where is the number of keys and is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

BitVault is a content-addressable distributed storage system, developed by Microsoft Research in China. BitVault uses peer-to-peer technology to distribute the tasks of storing and managing data. As such, there is no central authority responsible for management of the system. Rather, it is self-managing, provides high availability, reliability and scales up in a self-organizing manner, with low administrative overhead, which is almost constant irrespective of the size of the distributed overlay network.

Pastry is an overlay network and routing network for the implementation of a distributed hash table (DHT) similar to Chord. The key–value pairs are stored in a redundant peer-to-peer network of connected Internet hosts. The protocol is bootstrapped by supplying it with the IP address of a peer already in the network and from then on via the routing table which is dynamically built and repaired. It is claimed that because of its redundant and decentralized nature there is no single point of failure and any single node can leave the network at any time without warning and with little or no chance of data loss. The protocol is also capable of using a routing metric supplied by an outside program, such as ping or traceroute, to determine the best routes to store in its routing table.

Filesystem-level encryption, often called file-based encryption, FBE, or file/folder encryption, is a form of disk encryption where individual files or directories are encrypted by the file system itself.

Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager, developed together. It was founded by Chris Mason in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel.

<span class="mw-page-title-main">MagmaFS</span>

Magma is a distributed file system based on a distributed hash table, written in C, compatible with Linux and BSD kernels using FUSE.

In the BitTorrent file distribution system, a torrent file or meta-info file is a computer file that contains metadata about files and folders to be distributed, and usually also a list of the network locations of trackers, which are computers that help participants in the system find each other and form efficient distribution groups called swarms. Torrent files are normally named with the extension .torrent.

Object storage is a computer data storage approach that manages data as "blobs" or "objects", as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks. Each object is typically associated with a variable amount of metadata, and a globally unique identifier. Object storage can be implemented at multiple levels, including the device level, the system level, and the interface level. In each case, object storage seeks to enable capabilities not addressed by other storage architectures, like interfaces that are directly programmable by the application, a namespace that can span multiple instances of physical hardware, and data-management functions like data replication and data distribution at object-level granularity.

Elliptics is a distributed key–value data storage with open source code. By default it is a classic distributed hash table (DHT) with multiple replicas put in different groups. Elliptics was created to meet requirements of multi-datacenter and physically distributed storage locations when storing huge amount of medium and large files.

<span class="mw-page-title-main">InterPlanetary File System</span> Content-addressable, peer-to-peer hypermedia distribution protocol

The InterPlanetary File System (IPFS) is a protocol, hypermedia and file sharing peer-to-peer network for storing and sharing data in a distributed file system. IPFS uses content-addressing to uniquely identify each file in a global namespace connecting IPFS hosts.

References

  1. Cheung, Shun Yan. "Storing and Organizing Data Files on Disks". Emory University.
  2. 1 2 3 4 5 6 7 Zumwalt, Matt. "The Power of Content-addressing".
  3. Connor, Deni (29 May 2003). "Fixed content storage grabs users' attention". ComputerWorld.
  4. Content-addressable storage – Storage as I See it, by Mark Ferelli, Oct, 2002, BNET.com
  5. USENIX Annual Technical Conference 2003, General Track – Abstract
  6. CAS Industry standardization activities – XAM: http://www.snia.org/forums/xam
  7. Sheldon, Robert. "Content Addressed Storage".
  8. Making a hash of file content Content-addressable storage uses hash algorithms., By Chris Mellor, Published: 9 December 2003, Techworld Archived 28 September 2007 at the Wayback Machine Article moved to https://www.techworld.com/data/making-a-hash-of-file-content-235/
  9. "Venti: a new approach to archival storage". doc.cat-v.org. Retrieved 30 June 2019.
  10. "Twisted Storage". twistedstorage.sourceforge.net. Retrieved 30 June 2019.
  11. "HoneyComb Fixed Content Storage at OpenSolaris.org". Archived from the original on 12 October 2007. Retrieved 1 October 2007.
  12. "The XAM (eXtensible Access Method) Interface specification".
  13. A simple content-addressable storage system for .NET 4.5 and .NET Core: point-platform/cassette, Point Platform, 6 May 2019, retrieved 30 June 2019
  14. "Keep – Arvados". dev.arvados.org. Retrieved 30 June 2019.
  15. "Lennart Poettering Announces New Project: casync – Phoronix". Phoronix .