Append-only

Last updated

Append-only is a property of computer data storage such that new data can be appended to the storage, but where existing data is immutable.

Contents

Access control

Many file systems' Access Control Lists implement an "append-only" permission:

Many cloud storage providers provide the ability to limit access as append-only. [3] This feature is especially important to mitigate the risk of data loss for backup policies in the event that the computer being backed-up becomes infected with ransomware capable of deleting or encrypting the computer's backups. [4] [5]

Data structures

Many data structures and databases implement immutable objects, effectively making their data structures append-only. Implementing an append-only data structure has many benefits, such as ensuring data consistency, improving performance, [6] and permitting rollbacks. [7] [8]

The prototypical append-only data structure is the log file. Log-structured data structures found in Log-structured file systems and databases work in a similar way: every change (transaction) that happens to the data is logged by the program, and on retrieval the program must combine the pieces of data found in this log file. [9] Blockchains add cryptography to the logs so that every transaction is verifiable.

Append-only data structures may also be mandated by the hardware or software environment:

Append-only data structures grow over time, with more and more space dedicated to "stale" data found only in the history and more time wasted on parsing these data. A number of append-only systems implement rewriting (copying garbage collection), so that a new structure is created only containing the current version and optionally a few older ones. [7] [13]

See also

Related Research Articles

In computing, a computer file is a resource for recording data on a computer storage device, primarily identified by its filename. Just as words can be written on paper, so can data be written to a computer file. Files can be shared with and transferred between computers and mobile devices via removable media, networks, or the Internet.

In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence. In its most basic form, each node contains data, and a reference to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing more efficient insertion or removal of nodes at arbitrary positions. A drawback of linked lists is that data access time is a linear function of the number of nodes for each linked list because nodes are serially linked so a node needs to be accessed first to access the next node. Faster access, such as random access, is not feasible. Arrays have better cache locality compared to linked lists.

In computer programming, a reference is a value that enables a program to indirectly access a particular data, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

A log-structured filesystem is a file system in which data and metadata are written sequentially to a circular buffer, called a log. The design was first proposed in 1988 by John K. Ousterhout and Fred Douglis and first implemented in 1992 by Ousterhout and Mendel Rosenblum for the Unix-like Sprite distributed operating system.

In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", whereas the noun and adjective form is "backup". Backups can be used to recover data after its loss from data deletion or corruption, or to recover data from an earlier time. Backups provide a simple form of disaster recovery; however not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, active directory server, or database server.

<span class="mw-page-title-main">Disk editor</span> Computer software

A disk editor is a computer program that allows its user to read, edit, and write raw data on disk drives ; as such, they are sometimes called sector editors, since the read/write routines built into the electronics of most disk drives require to read/write data in chunks of sectors. Many disk editors can also be used to edit the contents of a running computer's memory or a disk image.

Utility software is a program specifically designed to help manage and tune system or application software. It is used to support the computer infrastructure - in contrast to application software, which is aimed at directly performing tasks that benefit ordinary users. However, utilities often form part of the application systems. For example, a batch job may run user-written code to update a database and may then include a step that runs a utility to back up the database, or a job may run a utility to compress a disk before copying files..

<span class="mw-page-title-main">File system</span> Format or program for storing files and directories

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names is called a "file system."

Filesystem in Userspace (FUSE) is a software interface for Unix and Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code. This is achieved by running file system code in user space while the FUSE module provides only a bridge to the actual kernel interfaces.

Data remanence is the residual representation of digital data that remains even after attempts have been made to remove or erase the data. This residue may result from data being left intact by a nominal file deletion operation, by reformatting of storage media that does not remove data previously written to the media, or through physical properties of the storage media that allow previously written data to be recovered. Data remanence may make inadvertent disclosure of sensitive information possible should the storage media be released into an uncontrolled environment.

A file-hosting service, also known as cloud-storage service, online file-storage provider, or cyberlocker is an internet hosting service specifically designed to host user files. These services allows users to upload files that can be accessed over the internet after providing a username and password or other authentication. Typically, file hosting services allow HTTP access, and in some cases, FTP access. Other related services include content-displaying hosting services, virtual storage, and remote backup solutions.

<span class="mw-page-title-main">Ransomware</span> Malicious software used in ransom demands

Ransomware is a type of malware from cryptovirology that threatens to publish the victim's personal data or permanently block access to it unless a ransom is paid off. While some simple ransomware may lock the system without damaging any files, more advanced malware uses a technique called cryptoviral extortion. It encrypts the victim's files, making them inaccessible, and demands a ransom payment to decrypt them. In a properly implemented cryptoviral extortion attack, recovering the files without the decryption key is an intractable problem – and difficult to trace digital currencies such as paysafecard or Bitcoin and other cryptocurrencies are used for the ransoms, making tracing and prosecuting the perpetrators difficult.

Venti is a network storage system that permanently stores data blocks. A 160-bit SHA-1 hash of the data acts as the address of the data. This enforces a write-once policy since no other data block can be found with the same address: the addresses of multiple writes of the same data are identical, so duplicate data is easily identified and the data block is stored only once. Data blocks cannot be removed, making it ideal for permanent or backup storage. Venti is typically used with Fossil to provide a file system with permanent snapshots.

<span class="mw-page-title-main">Cryptovirology</span> Study of securing and encrypting virology

Cryptovirology refers to the use of cryptography to devise particularly powerful malware, such as ransomware and asymmetric backdoors. Traditionally, cryptography and its applications are defensive in nature, and provide privacy, authentication, and security to users. Cryptovirology employs a twist on cryptography, showing that it can also be used offensively. It can be used to mount extortion based attacks that cause loss of access to information, loss of confidentiality, and information leakage, tasks which cryptography typically prevents.

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the goal of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted.

<span class="mw-page-title-main">Redis</span> Open-source in-memory key–value database

Redis is an open-source in-memory storage, used as a distributed, in-memory key–value database, cache and message broker, with optional durability. Because it holds all data in memory and because of its design, Redis offers low-latency reads and writes, making it particularly suitable for use cases that require a cache. Redis is the most popular NoSQL database, and one of the most popular databases overall. Redis is used in companies like Twitter, Airbnb, Tinder, Yahoo, Adobe, Hulu, and Amazon.

Higher performance in hard disk drives comes from devices which have better performance characteristics. These performance characteristics can be grouped into two categories: access time and data transfer time.

Shingled magnetic recording (SMR) is a magnetic storage data recording technology used in hard disk drives (HDDs) to increase storage density and overall per-drive storage capacity. Conventional hard disk drives record data by writing non-overlapping magnetic tracks parallel to each other, while shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track narrower and allowing higher track density. Thus, the tracks partially overlap similar to roof shingles. This approach was selected because, if the writing head is made too narrow, it cannot provide the very high fields required in the recording layer of the disk.

Object storage is a computer data storage that manages data as objects, as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier. Object storage can be implemented at multiple levels, including the device level, the system level, and the interface level. In each case, object storage seeks to enable capabilities not addressed by other storage architectures, like interfaces that are directly programmable by the application, a namespace that can span multiple instances of physical hardware, and data-management functions like data replication and data distribution at object-level granularity.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

References

  1. chattr(1)    Linux User Manual – User Commands
  2. "powershell - How to give "only append" access to user in windows , for logging purposes". Server Fault.
  3. Jim Donovan (September 11, 2018). "Why Use Immutable Storage?". Wasabi.
  4. Eugene Kolodenker; William Koch; Gianluca Stringhini; Manuel Egele (April 2017). "PayBreak: Defense Against Cryptographic Ransomware". Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. pp. 599–611. doi: 10.1145/3052973.3053035 . Due to the threat of ransomware targeting the key vault, our implementation stores the harvested key material into an append-only file protected with Administrator privileges.
  5. Pont, Jamie; Abu Oun, Osama; Brierley, Calvin; Arief, Budi; Hernandez-Castro, Julio (2019). "A Roadmap for Improving the Impact of Anti-ransomware Research". Secure IT Systems, Proceedings of 24th Nordic Conference, NordSec 2019. Springer International Publishing. pp. 137–154. ISBN   978-3-030-35055-0.
  6. 1 2 Magic Pocket Hardware Engineering Teams. "Extending Magic Pocket Innovation with the first petabyte scale SMR drive deployment". dropbox.tech.
  7. 1 2 "Redis Persistence". Redis.
  8. "Additional Notes". Borg Deduplicating Archiver 1.1.11 documentation.
  9. 1 2 Reid, Colin; Bernstein, Phil (1 January 2010). "Implementing an Append-Only Interface for Semiconductor Storage" (PDF). IEEE Data Eng. Bull. 33: 14–20.
  10. "Thirteen ways of looking at a turtle". F# for fun and profit. Retrieved 2018-11-13.
  11. "NVMe Zoned Namespace". ZonedStorage.io. Archived from the original on 2020-01-29. Retrieved 2020-04-25. The internals of Solid State Drives are such that they implement a log-structured data structure, where data is written sequentially to the media.
  12. Jake Edge (March 26, 2014). "Support for shingled magnetic recording devices". LWN.net . Retrieved December 14, 2014.
  13. Brewer, Eric; Ying, Lawrence; Greenfield, Lawrence; Cypher, Robert; T'so, Theodore (2016). "Disks for Data Centers". Proceedings of USENIX FAST 2016. Because of the write restrictions imposed by SMR, when data is deleted, that deleted capacity can not be reused until the system copies the remaining live data in that SMR zone to another part of the disk, a form of garbage collection (GC).