Locally decodable code

Last updated

A locally decodable code (LDC) is an error-correcting code that allows a single bit of the original message to be decoded with high probability by only examining (or querying) a small number of bits of a possibly corrupted codeword. [1] [2] [3] This property could be useful, say, in a context where information is being transmitted over a noisy channel, and only a small subset of the data is required at a particular time and there is no need to decode the entire message at once. Locally decodable codes are not a subset of locally testable codes, though there is some overlap between the two. [4]

Contents

Codewords are generated from the original message using an algorithm that introduces a certain amount of redundancy into the codeword; thus, the codeword is always longer than the original message. This redundancy is distributed across the codeword and allows the original message to be recovered with good probability even in the presence of errors. The more redundant the codeword, the more resilient it is against errors, and the fewer queries required to recover a bit of the original message.

Overview

More formally, a -locally decodable code encodes an -bit message to an -bit codeword such that any bit of the message can be recovered with probability by using a randomized decoding algorithm that queries only bits of the codeword , even if up to locations of the codeword have been corrupted.

Furthermore, a perfectly smooth local decoder is a decoder such that, in addition to always generating the correct output given access to an uncorrupted codeword, for every and the query to recover the bit is uniform over . [5] (The notation denotes the set ). Informally, this means that the set of queries required to decode any given bit are uniformly distributed over the codeword.

Local list decoders are another interesting subset of local decoders. List decoding is useful when a codeword is corrupted in more than places, where is the minimum Hamming distance between two codewords. In this case, it is no longer possible to identify exactly which original message has been encoded, since there could be multiple codewords within distance of the corrupted codeword. However, given a radius , it is possible to identify the set of messages that encode to codewords that are within of the corrupted codeword. An upper bound on the size of the set of messages can be determined by and . [6]

Locally decodable codes can also be concatenated, where a message is encoded first using one scheme, and the resulting codeword is encoded again using a different scheme. (Note that, in this context, concatenation is the term used by scholars to refer to what is usually called composition; see [5] ). This might be useful if, for example, the first code has some desirable properties with respect to rate, but it has some undesirable property, such as producing a codeword over a non-binary alphabet. The second code can then transform the result of the first encoding over a non-binary alphabet to a binary alphabet. The final encoding is still locally decodable, and requires additional steps to decode both layers of encoding. [7]

Length of codeword and query complexity

The rate of a code refers to the ratio between its message length and codeword length: , and the number of queries required to recover 1 bit of the message is called the query complexity of a code.

The rate of a code is inversely related to the query complexity, but the exact shape of this tradeoff is a major open problem. [8] [9] It is known that there are no LDCs that query the codeword in only one position, and that the optimal codeword size for query complexity 2 is exponential in the size of the original message. [8] However, there are no known tight lower bounds for codes with query complexity greater than 2. Approaching the tradeoff from the side of codeword length, the only known codes with codeword length proportional to message length have query complexity for [8] [ needs update ] There are also codes in between, that have codewords polynomial in the size of the original message and polylogarithmic query complexity. [8]

Applications

Locally decodable codes have applications to data transmission and storage, complexity theory, data structures, derandomization, theory of fault tolerant computation, and private information retrieval schemes. [9]

Data transmission and storage

Locally decodable codes are especially useful for data transmission over noisy channels. The Hadamard code (a special case of Reed Muller codes) was used in 1971 by Mariner 9 to transmit pictures of Mars back to Earth. It was chosen over a 5-repeat code (where each bit is repeated 5 times) because, for roughly the same number of bits transmitted per pixel, it had a higher capacity for error correction. (The Hadamard code falls under the general umbrella of forward error correction, and just happens to be locally decodable; the actual algorithm used to decode the transmission from Mars was a generic error-correction scheme.) [10]

LDCs are also useful for data storage, where the medium may become partially corrupted over time, or the reading device is subject to errors. In both cases, an LDC will allow for the recovery of information despite errors, provided that there are relatively few. In addition, LDCs do not require that the entire original message be decoded; a user can decode a specific portion of the original message without needing to decode the entire thing. [11]

Complexity theory

One of the applications of locally decodable codes in complexity theory is hardness amplification. Using LDCs with polynomial codeword length and polylogarithmic query complexity, one can take a function that is hard to solve on worst case inputs and design a function that is hard to compute on average case inputs.

Consider limited to only length inputs. Then we can see as a binary string of length , where each bit is for each . We can use a polynomial length locally decodable code with polylogarithmic query complexity that tolerates some constant fraction of errors to encode the string that represents to create a new string of length . We think of this new string as defining a new problem on length inputs. If is easy to solve on average, that is, we can solve correctly on a large fraction of inputs, then by the properties of the LDC used to encode it, we can use to probabilistically compute on all inputs. Thus, a solution to for most inputs would allow us to solve on all inputs, contradicting our assumption that is hard on worst case inputs. [5] [8] [12]

Private information retrieval schemes

A private information retrieval scheme allows a user to retrieve an item from a server in possession of a database without revealing which item is retrieved. One common way of ensuring privacy is to have separate, non-communicating servers, each with a copy of the database. Given an appropriate scheme, the user can make queries to each server that individually do not reveal which bit the user is looking for, but which together provide enough information that the user can determine the particular bit of interest in the database. [3] [11]

One can easily see that locally decodable codes have applications in this setting. A general procedure to produce a -server private information scheme from a perfectly smooth -query locally decodable code is as follows:

Let be a perfectly smooth LDC that encodes -bit messages to -bit codewords. As a preprocessing step, each of the servers encodes the -bit database with the code , so each server now stores the -bit codeword . A user interested in obtaining the bit of randomly generates a set of queries such that can be computed from using the local decoding algorithm for . The user sends each query to a different server, and each server responds with the bit requested. The user then uses to compute from the responses. [8] [11] Because the decoding algorithm is perfectly smooth, each query is uniformly distributed over the codeword; thus, no individual server can gain any information about the user's intentions, so the protocol is private as long as the servers do not communicate. [11]

Examples

The Hadamard code

The Hadamard (or Walsh-Hadamard) code is an example of a simple locally decodable code that maps a string of length to a codeword of length . The codeword for a string is constructed as follows: for every , the bit of the codeword is equal to , where (mod 2). It is easy to see that every codeword has a Hamming distance of from every other codeword.

The local decoding algorithm has query complexity 2, and the entire original message can be decoded with good probability if the codeword is corrupted in less than of its bits. For , if the codeword is corrupted in a fraction of places, a local decoding algorithm can recover the bit of the original message with probability .

Proof: Given a codeword and an index , the algorithm to recover the bit of the original message works as follows:

Let refer to the vector in that has 1 in the position and 0s elsewhere. For , denotes the single bit in that corresponds to . The algorithm chooses a random vector and the vector (where denotes bitwise XOR). The algorithm outputs (mod 2).

Correctness: By linearity,

But , so we just need to show that and with good probability.

Since and are uniformly distributed (even though they are dependent), the union bound implies that and with probability at least . Note: to amplify the probability of success, one can repeat the procedure with different random vectors and take the majority answer. [13]

The Reed–Muller code

The main idea behind local decoding of Reed-Muller codes is polynomial interpolation. The key concept behind a Reed-Muller code is a multivariate polynomial of degree on variables. The message is treated as the evaluation of a polynomial at a set of predefined points. To encode these values, a polynomial is extrapolated from them, and the codeword is the evaluation of that polynomial on all possible points. At a high level, to decode a point of this polynomial, the decoding algorithm chooses a set of points on a line that passes through the point of interest . It then queries the codeword for the evaluation of the polynomial on points in and interpolates that polynomial. Then it is simple to evaluate the polynomial at the point that will yield . This roundabout way of evaluating is useful because (a) the algorithm can be repeated using different lines through the same point to improve the probability of correctness, and (b) the queries are uniformly distributed over the codeword.

More formally, let be a finite field, and let be numbers with . The Reed-Muller code with parameters is the function RM : that maps every -variable polynomial over of total degree to the values of on all the inputs in . That is, the input is a polynomial of the form specified by the interpolation of the values of the predefined points and the output is the sequence for every . [14]

To recover the value of a degree polynomial at a point , the local decoder shoots a random affine line through . Then it picks points on that line, which it uses to interpolate the polynomial and then evaluate it at the point where the result is . To do so, the algorithm picks a vector uniformly at random and considers the line through . The algorithm picks an arbitrary subset of , where , and queries coordinates of the codeword that correspond to points for all and obtains values . Then it uses polynomial interpolation to recover the unique univariate polynomial with degree less than or equal to such that for all . Then, to get the value of , it just evaluates . To recover a single value of the original message, one chooses to be one of the points that defines the polynomial. [8] [14]

Each individual query is distributed uniformly at random over the codeword. Thus, if the codeword is corrupted in at most a fraction of locations, by the union bound, the probability that the algorithm samples only uncorrupted coordinates (and thus correctly recovers the bit) is at least . [8] For other decoding algorithms, see. [8]

See also

Related Research Articles

In computational complexity theory, the complexity class #P (pronounced "sharp P" or, sometimes "number P" or "hash P") is the set of the counting problems associated with the decision problems in the set NP. More formally, #P is the class of function problems of the form "compute f(x)", where f is the number of accepting paths of a nondeterministic Turing machine running in polynomial time. Unlike most well-known complexity classes, it is not a class of decision problems but a class of function problems. The most difficult, representative problems of this class are #P-complete.

In telecommunication, a convolutional code is a type of error-correcting code that generates parity symbols via the sliding application of a boolean polynomial function to a data stream. The sliding application represents the 'convolution' of the encoder over the data, which gives rise to the term 'convolutional coding'. The sliding nature of the convolutional codes facilitates trellis decoding using a time-invariant trellis. Time invariant trellis decoding allows convolutional codes to be maximum-likelihood soft-decision decoded with reasonable complexity.

In information theory and coding theory, Reed–Solomon codes are a group of error-correcting codes that were introduced by Irving S. Reed and Gustave Solomon in 1960. They have many applications, including consumer technologies such as MiniDiscs, CDs, DVDs, Blu-ray discs, QR codes, Data Matrix, data transmission technologies such as DSL and WiMAX, broadcast systems such as satellite communications, DVB and ATSC, and storage systems such as RAID 6.

<span class="mw-page-title-main">Arithmetic coding</span> Form of entropy encoding used in data compression

Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q, where 0.0 ≤ q < 1.0. It represents the current information as a range, defined by two numbers. A recent family of entropy coders called asymmetric numeral systems allows for faster implementations thanks to directly operating on a single natural number representing the current information.

A binary symmetric channel is a common communications channel model used in coding theory and information theory. In this model, a transmitter wishes to send a bit, and the receiver will receive a bit. The bit will be "flipped" with a "crossover probability" of p, and otherwise is received correctly. This model can be applied to varied communication channels such as telephone lines or disk drive storage.

In coding theory, block codes are a large and important family of error-correcting codes that encode data in blocks. There is a vast number of examples for block codes, many of which have a wide range of practical applications. The abstract definition of block codes is conceptually useful because it allows coding theorists, mathematicians, and computer scientists to study the limitations of all block codes in a unified way. Such limitations often take the form of bounds that relate different parameters of the block code to each other, such as its rate and its ability to detect and correct errors.

In coding theory, a linear code is an error-correcting code for which any linear combination of codewords is also a codeword. Linear codes are traditionally partitioned into block codes and convolutional codes, although turbo codes can be seen as a hybrid of these two types. Linear codes allow for more efficient encoding and decoding algorithms than other codes.

In coding theory, decoding is the process of translating received messages into codewords of a given code. There have been many common methods of mapping messages to codewords. These are often used to recover messages sent over a noisy channel, such as a binary symmetric channel.

Reed–Muller codes are error-correcting codes that are used in wireless communications applications, particularly in deep-space communication. Moreover, the proposed 5G standard relies on the closely related polar codes for error correction in the control channel. Due to their favorable theoretical and mathematical properties, Reed–Muller codes have also been extensively studied in theoretical computer science.

<span class="mw-page-title-main">Hadamard code</span> Error-correcting code

The Hadamard code is an error-correcting code named after the French mathematician Jacques Hadamard that is used for error detection and correction when transmitting messages over very noisy or unreliable channels. In 1971, the code was used to transmit photos of Mars back to Earth from the NASA space probe Mariner 9. Because of its unique mathematical properties, the Hadamard code is not only used by engineers, but also intensely studied in coding theory, mathematics, and theoretical computer science. The Hadamard code is also known under the names Walsh code, Walsh family, and Walsh–Hadamard code in recognition of the American mathematician Joseph Leonard Walsh.

<span class="mw-page-title-main">Circuit complexity</span> Model of computational complexity

In theoretical computer science, circuit complexity is a branch of computational complexity theory in which Boolean functions are classified according to the size or depth of the Boolean circuits that compute them. A related notion is the circuit complexity of a recursive language that is decided by a uniform family of circuits .

In coding theory, list decoding is an alternative to unique decoding of error-correcting codes for large error rates. The notion was proposed by Elias in the 1950s. The main idea behind list decoding is that the decoding algorithm instead of outputting a single possible message outputs a list of possibilities one of which is correct. This allows for handling a greater number of errors than that allowed by unique decoding.

A locally testable code is a type of error-correcting code for which it can be determined if a string is a word in that code by looking at a small number of bits of the string. In some situations, it is useful to know if the data is corrupted without decoding all of it so that appropriate action can be taken in response. For example, in communication, if the receiver encounters a corrupted code, it can request the data be re-sent, which could increase the accuracy of said data. Similarly, in data storage, these codes can allow for damaged data to be recovered and rewritten properly.

In theoretical computer science and coding theory, the long code is an error-correcting code that is locally decodable. Long codes have an extremely poor rate, but play a fundamental role in the theory of hardness of approximation.

<span class="mw-page-title-main">Decision tree model</span> Model of computational complexity

In computational complexity theory, the decision tree model is the model of computation in which an algorithm can be considered to be a decision tree, i.e. a sequence of queries or tests that are done adaptively, so the outcome of previous tests can influence the tests performed next.

In coding theory, generalized minimum-distance (GMD) decoding provides an efficient algorithm for decoding concatenated codes, which is based on using an errors-and-erasures decoder for the outer code.

In coding theory, folded Reed–Solomon codes are like Reed–Solomon codes, which are obtained by mapping Reed–Solomon codewords over a larger alphabet by careful bundling of codeword symbols.

In coding theory, list decoding is an alternative to unique decoding of error-correcting codes in the presence of many errors. If a code has relative distance , then it is possible in principle to recover an encoded message when up to fraction of the codeword symbols are corrupted. But when error rate is greater than , this will not in general be possible. List decoding overcomes that issue by allowing the decoder to output a short list of messages that might have been encoded. List decoding can correct more than fraction of errors.

In coding theory, burst error-correcting codes employ methods of correcting burst errors, which are errors that occur in many consecutive bits rather than occurring in bits independently of each other.

In coding theory, Zemor's algorithm, designed and developed by Gilles Zemor, is a recursive low-complexity approach to code construction. It is an improvement over the algorithm of Sipser and Spielman.

References

  1. Sergey Yekhanin. "Locally decodable codes: a brief survey" (PDF).
  2. Rafail Ostrovsky; Omkant Pandey; Amit Sahai. "Private Locally Decodable Codes" (PDF).
  3. 1 2 Sergey Yekhanin. New locally decodable codes and private information retrieval schemes, Technical Report ECCC TR06-127, 2006.
  4. Kaufman, Tali; Viderman, Michael. "Locally Testable vs. Locally Decodable Codes".
  5. 1 2 3 Luca Trevisan. "Some Applications of Coding Theory in Computational Complexity" (PDF).
  6. Arora, Sanjeev; Barak, Boaz (2009). "Section 19.5". Computational Complexity: A Modern Approach. Cambridge. ISBN   978-0-521-42426-4.
  7. Arora & Barak 2009 , Section 19.4.3
  8. 1 2 3 4 5 6 7 8 9 Sergey Yekhanin. "Locally Decodable Codes" (PDF).
  9. 1 2 Sergey Yekhanin. "Locally Decodable Codes" (PDF).
  10. "Combinatorics in Space The Mariner 9 Telemetry System" (PDF).
  11. 1 2 3 4 Sergey Yekhanin. "Private Information retrieval" (PDF).
  12. Arora & Barak 2009 , Section 19.4
  13. Arora & Barak 2009 , Section 11.5.2
  14. 1 2 Arora & Barak 2009 , Section 19.4.2