Levenshtein coding

Levenshtein coding is a universal code for the non-negative integers, developed by Vladimir Levenshtein. [1] [2]

Encoding

The code of zero is "0"; to code a positive number:

  1. Initialize the step count variable C to 1.
  2. Write the binary representation of the number without the leading "1" to the beginning of the code.
  3. Let M be the number of bits written in step 2.
  4. If M is not 0, increment C and repeat from step 2 with M as the new number.
  5. Write C "1" bits and a "0" to the beginning of the code.
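
The steps above can be sketched as a small self-contained function. This is only an illustration: the name levenshteinEncode and the choice to return the code as a string of '0' and '1' characters (one character per bit) are assumptions made for readability, not part of the original description.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the Levenshtein code of n as a string of
// '0' and '1' characters, one character per bit.
std::string levenshteinEncode(std::uint64_t n) {
    if (n == 0) return "0";              // the code of zero is "0"

    std::string code;                    // built from the back towards the front
    std::size_t c = 1;                   // step 1: step count C
    while (true) {
        // Step 2: binary representation of n without its leading "1",
        // written to the beginning of the code.
        std::string bits;
        for (std::uint64_t t = n; t > 1; t >>= 1)
            bits.insert(bits.begin(), (t & 1) ? '1' : '0');
        code = bits + code;

        // Step 3: M is the number of bits just written.
        std::uint64_t m = bits.size();
        if (m == 0) break;               // step 4: stop once M is 0
        ++c;
        n = m;                           // repeat with M as the new number
    }
    // Step 5: C "1" bits and a "0" go in front of everything else.
    return std::string(c, '1') + "0" + code;
}
```

For example, levenshteinEncode(4) yields "1110000", which is the table entry "1110 0 00" without the spaces.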

The code begins:

Number  Encoding           Implied probability
0       0                  1/2
1       10                 1/4
2       110 0              1/16
3       110 1              1/16
4       1110 0 00          1/128
5       1110 0 01          1/128
6       1110 0 10          1/128
7       1110 0 11          1/128
8       1110 1 000         1/256
9       1110 1 001         1/256
10      1110 1 010         1/256
11      1110 1 011         1/256
12      1110 1 100         1/256
13      1110 1 101         1/256
14      1110 1 110         1/256
15      1110 1 111         1/256
16      11110 0 00 0000    1/4096
17      11110 0 00 0001    1/4096
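
The implied probability of a number is 2 raised to minus the length of its code in bits. The relation can be written out as follows; the symbols \ell, n_k and P are notation introduced here for illustration and do not appear in the original text.

```latex
% Assumed notation: \ell(n) is the length in bits of the code of n,
% n_0 = n, n_k = \lfloor \log_2 n_{k-1} \rfloor, and C is the step count
% from the encoding procedure (the smallest C with n_C = 0).
\[
  P(n) = 2^{-\ell(n)}, \qquad
  \ell(0) = 1, \qquad
  \ell(n) = C + 1 + \sum_{k=1}^{C-1} n_k \quad (n \ge 1)
\]
% Example: n = 4 gives n_1 = 2, n_2 = 1, n_3 = 0, so C = 3,
% \ell(4) = 3 + 1 + (2 + 1) = 7 and P(4) = 2^{-7} = 1/128.
```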

To decode a Levenshtein-coded integer:

  1. Count the number of "1" bits until a "0" is encountered.
  2. If the count is zero, the value is zero; otherwise, continue with step 3.
  3. Discard the "1" bits just counted and the first "0" encountered.
  4. Set a variable N to 1 and repeat step 5 count minus 1 times.
  5. Read N bits (removing them from the encoded integer), prepend "1", and assign the resulting value to N. After the last repetition, N is the decoded value.
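
A minimal sketch of these decoding steps follows. It assumes the input is a string of '0' and '1' characters and that decoded values fit in 64 bits. The function name, the pos cursor parameter and the error handling are hypothetical choices for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes one Levenshtein codeword starting at `pos`
// in a string of '0'/'1' characters and advances `pos` past it.
std::uint64_t levenshteinDecode(const std::string& bits, std::size_t& pos) {
    // Reads one bit; throws instead of running past the end of the input.
    auto readBit = [&]() -> std::uint64_t {
        if (pos >= bits.size()) throw std::runtime_error("truncated code");
        char ch = bits[pos++];
        if (ch != '0' && ch != '1') throw std::runtime_error("not a bit");
        return ch == '1' ? 1u : 0u;
    };

    // Steps 1 and 3: count the leading "1" bits and consume the "0".
    std::size_t count = 0;
    while (readBit() == 1) ++count;
    if (count == 0) return 0;            // step 2: a lone "0" means zero

    // Steps 4 and 5: repeat count - 1 times, reading N bits each time.
    std::uint64_t n = 1;
    for (std::size_t i = 1; i < count; ++i) {
        if (n > 63) throw std::overflow_error("value exceeds 64 bits");
        std::uint64_t value = 1;         // the prepended "1"
        for (std::uint64_t j = 0; j < n; ++j)
            value = (value << 1) | readBit();
        n = value;
    }
    return n;
}
```

Because the code is a prefix code, calling the function repeatedly with the same pos variable decodes a concatenated stream one integer at a time.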

The Levenshtein code of a positive integer is always one bit longer than the Elias omega code of that integer. However, there is a Levenshtein code for zero, whereas Elias omega coding would require the numbers to be shifted so that a zero is represented by the code for one instead.
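
The one-bit difference can be checked numerically by comparing code lengths. The sketch below is not from the original text: all function names are hypothetical, and the Elias omega length is computed from the usual definition of that code (binary groups written recursively, followed by a terminating "0").

```cpp
#include <cstdint>

// floor(log2(n)) for n >= 1.
unsigned floorLog2(std::uint64_t n) {
    unsigned m = 0;
    while (n > 1) { n >>= 1; ++m; }
    return m;
}

// Length in bits of the Levenshtein code of n.
unsigned levenshteinLength(std::uint64_t n) {
    if (n == 0) return 1;
    unsigned length = 2;                 // one "1" for the last step plus the "0"
    for (unsigned m = floorLog2(n); m > 0; m = floorLog2(m))
        length += m + 1;                 // m payload bits and one more "1"
    return length;
}

// Length in bits of the Elias omega code of n (n >= 1).
unsigned eliasOmegaLength(std::uint64_t n) {
    unsigned length = 1;                 // terminating "0"
    while (n > 1) {
        unsigned m = floorLog2(n);
        length += m + 1;                 // n in binary takes m + 1 bits
        n = m;                           // next group encodes that bit count minus one
    }
    return length;
}

// Returns true if the one-bit difference holds for every n in [1, limit].
bool differsByOneUpTo(std::uint64_t limit) {
    for (std::uint64_t n = 1; n <= limit; ++n)
        if (levenshteinLength(n) != eliasOmegaLength(n) + 1) return false;
    return true;
}
```

For instance, 16 takes 11 bits in Elias omega coding and 12 bits in Levenshtein coding.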

Example code

Encoding

void levenshteinEncode(char* source, char* dest)
{
    IntReader intreader(source);
    BitWriter bitwriter(dest);
    while (intreader.hasLeft())
    {
        int num = intreader.getInt();
        if (num == 0)
            bitwriter.outputBit(0);
        else
        {
            int c = 0;
            BitStack bits;
            do {
                int m = 0;
                for (int temp = num; temp > 1; temp >>= 1)  // calculate floor(log2(num))
                    ++m;
                for (int i = 0; i < m; ++i)
                    bits.pushBit((num >> i) & 1);
                num = m;
                ++c;
            } while (num > 0);
            for (int i = 0; i < c; ++i)
                bitwriter.outputBit(1);
            bitwriter.outputBit(0);
            while (bits.length() > 0)
                bitwriter.outputBit(bits.popBit());
        }
    }
}

Decoding

void levenshteinDecode(char* source, char* dest)
{
    BitReader bitreader(source);
    IntWriter intwriter(dest);
    while (bitreader.hasLeft())
    {
        int n = 0;
        while (bitreader.inputBit())  // potentially dangerous with malformed files.
            ++n;
        int num;
        if (n == 0)
            num = 0;
        else
        {
            num = 1;
            for (int i = 0; i < n - 1; ++i)
            {
                int val = 1;
                for (int j = 0; j < num; ++j)
                    val = (val << 1) | bitreader.inputBit();
                num = val;
            }
        }
        intwriter.putInt(num);  // write out the value
    }
    bitreader.close();
    intwriter.close();
}

References

  1. "1968 paper by V. I. Levenshtein (in Russian)" (PDF).
  2. David Salomon (2007). Variable-length codes for data compression. Springer. p. 80. ISBN 978-1-84628-958-3.