Erasure code

In coding theory, an erasure code is a forward error correction (FEC) code under the assumption of bit erasures (rather than bit errors), which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols. The fraction r = k/n is called the code rate. The fraction k′/k, where k′ denotes the number of symbols required for recovery, is called reception efficiency. Unlike general forward error correction codes, which must also locate the corrupted symbols, the recovery algorithm of an erasure code assumes it is known which of the n symbols are lost.

Optimal erasure codes

Optimal erasure codes have the property that any k out of the n code word symbols are sufficient to recover the original message (i.e., they have optimal reception efficiency). Optimal erasure codes are maximum distance separable codes (MDS codes).

Parity check

Parity check is the special case where n = k + 1. From a set of k values $\{v_i\}_{1 \le i \le k}$, a checksum is computed and appended to the k source values:

$$v_{k+1} = -\sum_{i=1}^{k} v_i.$$

The set of k + 1 values $\{v_i\}_{1 \le i \le k+1}$ is now consistent with regard to the checksum. If one of these values, $v_e$, is erased, it can be easily recovered by summing the remaining values:

$$v_e = -\sum_{i=1,\, i \ne e}^{k+1} v_i.$$
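
A minimal sketch of this scheme in Python (the function names are mine, and plain integer arithmetic stands in for whatever symbol field an implementation actually uses):

```python
# Parity check as an erasure code with n = k + 1: the appended checksum
# is the negated sum of the k source values, so all n values sum to zero.
def add_parity(values):
    """Append a checksum so that the k + 1 values sum to zero."""
    return values + [-sum(values)]

def recover_erased(values, erased_index):
    """Recover the single erased value by summing the remaining ones."""
    return -sum(v for i, v in enumerate(values) if i != erased_index)

code_word = add_parity([3, 1, 4, 1])       # [3, 1, 4, 1, -9]
assert recover_erased(code_word, 2) == 4   # the value at position 2 was erased
```

In practice the sum is usually taken over a finite field, e.g. XOR over GF(2) as in RAID parity, so the checksum occupies no more space than a source symbol.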

Polynomial oversampling

Example: Err-mail (k = 2)

In the simple case where k = 2, redundancy symbols may be created by sampling different points along the line between the two original symbols. This is pictured with a simple example, called err-mail:

Alice wants to send her telephone number (555629) to Bob using err-mail. Err-mail works just like e-mail, except

  1. About half of all the mail gets lost. [1]
  2. Messages longer than 5 characters are illegal.
  3. It is very expensive (similar to air-mail).

Instead of asking Bob to acknowledge the messages she sends, Alice devises the following scheme.

  1. She breaks her telephone number up into two parts a = 555, b = 629, and sends 2 messages – "A=555" and "B=629" – to Bob.
  2. She constructs a linear function, $f(i) = a + (b - a)(i - 1)$, in this case $f(i) = 555 + 74(i - 1)$, such that $f(1) = 555$ and $f(2) = 629$.

[Figure: Alice's line f(i) through the original points (1, 555) and (2, 629)]

  3. She computes the values f(3), f(4), and f(5), and then transmits three redundant messages: "C=703", "D=777" and "E=851".

Bob knows that the form of $f(i)$ is $f(i) = a + (b - a)(i - 1)$, where a and b are the two parts of the telephone number. Now suppose Bob receives "D=777" and "E=851".

[Figure: the line reconstructed from the two received points f(4) and f(5)]

Bob can reconstruct Alice's phone number by computing the values of a and b from the values (f(4) and f(5)) he has received. Bob can perform this procedure using any two err-mails, so the erasure code in this example has a rate of k/n = 2/5 = 40%.
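
Bob's side of the procedure fits in a few lines of Python (a sketch; the helper names are mine, but the arithmetic is exactly the f(i) defined above):

```python
def f(i, a, b):
    """Alice's line through (1, a) and (2, b)."""
    return a + (b - a) * (i - 1)

def recover(i1, y1, i2, y2):
    """Solve y = a + (b - a)(i - 1) for a and b, given any two samples."""
    slope = (y2 - y1) // (i2 - i1)   # equals b - a; exact for points on the line
    a = y1 - slope * (i1 - 1)
    return a, a + slope

assert recover(4, 777, 5, 851) == (555, 629)   # Bob received only D and E
```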

Note that Alice cannot encode her telephone number in just one err-mail, because it contains six characters and the maximum length of one err-mail message is five characters. If she sent her phone number in pieces, asking Bob to acknowledge receipt of each piece, at least four messages would have to be sent anyway (two from Alice, and two acknowledgments from Bob). So the erasure code in this example, which requires five messages, is quite economical.

This example is a little bit contrived. For truly generic erasure codes that work over any data set, we would need something other than the f(i) given.

General case

The linear construction above can be generalized to polynomial interpolation. Additionally, points are now computed over a finite field.

First we choose a finite field F of order at least n, usually a power of 2. The sender numbers the data symbols from 0 to k − 1 and sends them. He then constructs a (Lagrange) polynomial p(x) of degree less than k such that p(i) is equal to data symbol i. He then sends p(k), ..., p(n − 1). The receiver can now also use polynomial interpolation to recover the lost packets, provided he receives k symbols successfully. If the order of F is less than 2^b, where b is the number of bits in a symbol, then multiple polynomials can be used.
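
A sketch of this construction in Python, using a small prime field GF(257) rather than the power-of-two field a real implementation would pick (the parameters and helper names here are illustrative):

```python
P = 257  # field order; must be at least n, and prime for this sketch

def interpolate(points, x):
    """Evaluate at x the Lagrange polynomial through `points`, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # needs Python 3.8+
    return total

k, n = 3, 5
data = [10, 20, 30]            # data symbols 0 .. k-1, so p(i) = data[i]
source = list(enumerate(data))
encoded = data + [interpolate(source, x) for x in range(k, n)]  # appends p(3), p(4)

# Any k received symbols suffice; say only symbols 1, 3 and 4 arrive:
received = [(i, encoded[i]) for i in (1, 3, 4)]
assert [interpolate(received, i) for i in range(k)] == data
```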

The sender can construct symbols k to n − 1 'on the fly', i.e., distribute the workload evenly between transmission of the symbols. If the receiver wants to do his calculations 'on the fly', he can construct a new polynomial q, such that q(i) = p(i) if symbol i < k was received successfully and q(i) = 0 when symbol i < k was not received. Now let r(i) = p(i) − q(i). Firstly, we know that r(i) = 0 if symbol i < k has been received successfully. Secondly, if symbol i ≥ k has been received successfully, then r(i) = p(i) − q(i) can be calculated. So we have enough data points to construct r and evaluate it to find the lost packets: at a lost position i < k, q(i) = 0 and hence p(i) = r(i). So both the sender and the receiver require O(n(n − k)) operations and only O(n − k) space for operating 'on the fly'.
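
Continuing the previous sketch (same P, interpolate, k, data and encoded), the receiver-side trick looks like this; suppose data symbol 1 was lost but redundant symbol 3 arrived:

```python
lost = {1}
received_data = [(i, encoded[i]) for i in range(k) if i not in lost]

# q agrees with p at the received data positions and is 0 at the lost ones.
q_points = received_data + [(i, 0) for i in lost]

# r = p - q is zero at received data positions and computable at received
# redundant positions; k points pin it down, since its degree is below k.
r_points = [(i, 0) for i, _ in received_data]
r_points.append((3, (encoded[3] - interpolate(q_points, 3)) % P))

# At a lost position q is zero, so the missing symbol is just r there.
assert interpolate(r_points, 1) == data[1]
```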

Real-world implementation

This process is implemented by Reed–Solomon codes, with code words constructed over a finite field using a Vandermonde matrix.

Most practical erasure codes are systematic codes: each one of the original k symbols can be found copied, unencoded, as one of the n message symbols. [2] (Erasure codes that support secret sharing never use a systematic code.)

Near-optimal erasure codes

Near-optimal erasure codes require (1 + ε)k symbols to recover the message (where ε>0). Reducing ε can be done at the cost of CPU time. Near-optimal erasure codes trade correction capabilities for computational complexity: practical algorithms can encode and decode with linear time complexity.

Fountain codes (also known as rateless erasure codes) are notable examples of near-optimal erasure codes. They can transform a k symbol message into a practically infinite encoded form, i.e., they can generate an arbitrary amount of redundancy symbols that can all be used for error correction. Receivers can start decoding after they have received slightly more than k encoded symbols.
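
A toy illustration of the rateless idea in Python (the function names are mine; real fountain codes such as LT codes draw the symbol degree from a soliton distribution rather than uniformly, which this sketch does not attempt):

```python
import random

def encode_symbol(data, rng):
    """One rateless symbol: XOR of a random nonempty subset of the source symbols."""
    idx = set(rng.sample(range(len(data)), rng.randint(1, len(data))))
    value = 0
    for i in idx:
        value ^= data[i]
    return idx, value

def peel(symbols, k):
    """Peeling decoder: repeatedly resolve symbols with one unknown source."""
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for idx, value in symbols:
            unknown = idx - recovered.keys()
            if len(unknown) == 1:
                i = unknown.pop()
                for j in idx - {i}:
                    value ^= recovered[j]
                recovered[i] = value
                progress = True
    return recovered

# The receiver collects symbols until slightly more than k suffice to decode.
rng = random.Random(0)
data = [3, 1, 4, 1, 5]
collected, out = [], {}
while len(out) < len(data):
    collected.append(encode_symbol(data, rng))
    out = peel(collected, len(data))
print([out[i] for i in range(len(data))])  # [3, 1, 4, 1, 5]
```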

Regenerating codes address the issue of rebuilding (also called repairing) lost encoded fragments from existing encoded fragments. This issue occurs in distributed storage systems where communication to maintain encoded redundancy is a problem. [2]

Applications of erasure coding in storage systems

Erasure coding is now standard practice for reliable data storage. [3] [4] [5] In particular, various implementations of Reed–Solomon erasure coding are used by Apache Hadoop, the RAID-6 built into Linux, Microsoft Azure, Facebook cold storage, and Backblaze Vaults. [5] [2]

The classical way to recover from failures in storage systems was to use replication. However, replication incurs significant overhead in wasted bytes. Therefore, increasingly large storage systems, such as those used in data centers, use erasure-coded storage. The most common form of erasure coding used in storage systems is the Reed–Solomon (RS) code, which regenerates missing data from pieces of known data together with redundant pieces called parity blocks. In a (k, m) RS code, a given set of k data blocks, called "chunks", is encoded into (k + m) chunks. The total set of chunks comprises a stripe. The coding is done such that as long as at least k out of the (k + m) chunks are available, one can recover the entire data. This means a (k, m) RS-encoded storage can tolerate up to m failures.

Example: In the RS(10, 4) code, which is used by Facebook in their HDFS, [6] 10 MB of user data is divided into ten 1 MB blocks. Four additional 1 MB parity blocks are then created to provide redundancy, so the system can tolerate up to 4 concurrent failures. The storage overhead here is 14/10 = 1.4×.

In the case of a fully replicated system, the 10 MB of user data would have to be stored five times (the original plus four replicas) to tolerate up to 4 concurrent failures. The storage overhead in that case is 50/10 = 5×.
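
The arithmetic generalizes directly; a back-of-the-envelope check in Python (illustrative only):

```python
def rs_overhead(k, m):
    """Storage overhead of a (k, m) RS code: (k + m) raw blocks per k useful."""
    return (k + m) / k

def replication_overhead(failures):
    """Copies needed to survive `failures` losses: the original plus one replica each."""
    return failures + 1

print(rs_overhead(10, 4))        # 1.4  -> the RS(10, 4) figure above
print(replication_overhead(4))   # 5    -> the full-replication figure above
```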

This illustrates the much lower storage overhead of erasure-coded storage compared to full replication, and thus its attraction in today's storage systems.

Initially, erasure codes were used to reduce the cost of storing "cold" (rarely accessed) data efficiently, but erasure codes can also be used to improve performance when serving "hot" (more frequently accessed) data. [2]

Examples

Here are some examples of the various classes of codes:

Near optimal erasure codes

Tornado codes and low-density parity-check (LDPC) codes are among the best-known examples.

Near optimal fountain (rateless erasure) codes

LT codes, Online codes, and Raptor codes are among the best-known examples.

Optimal erasure codes

Simple parity (as used in RAID) and Reed–Solomon codes are among the best-known examples; both are maximum distance separable.

References

  1. Some versions of this story refer to the err-mail daemon.
  2. Rashmi Vinayak. "Erasure Coding for Big-data Systems: Theory and Practice". 2016. p. 2: section "Abstract"; p. 9: section "Systematic codes"; p. 12: section "Regenerating codes".
  3. "Erasure Encoding—Practice and Principles". 2016.
  4. Matt Sarrel. "Erasure Coding 101". 2022.
  5. Brian Beach. "Backblaze Open-sources Reed-Solomon Erasure Coding Source Code". 2015.
  6. Xia, Mingyuan; Saxena, Mohit; Blaum, Mario; Pease, David A. (2015). A Tale of Two Erasure Codes in HDFS. pp. 213–226. ISBN 978-1-931971-20-1.
  7. Dimakis, Alexandros G.; Godfrey, P. Brighten; Wu, Yunnan; Wainwright, Martin J.; Ramchandran, Kannan (September 2010). "Network Coding for Distributed Storage Systems". IEEE Transactions on Information Theory. 56 (9): 4539–4551. arXiv:cs/0702015. CiteSeerX 10.1.1.117.6892. doi:10.1109/TIT.2010.2054295. S2CID 260559901.
  8. "home [Erasure Coding for Distributed Storage Wiki]". 2017-07-31. Archived from the original on 2017-07-31. Retrieved 2023-08-20.