Levenshtein coding

Levenshtein coding is a universal code for the non-negative integers, developed by Vladimir Levenshtein. [1] [2]

Encoding

The code of zero is "0"; to code a positive number:

  1. Initialize the step count variable C to 1.
  2. Write the binary representation of the number without the leading "1" to the beginning of the code.
  3. Let M be the number of bits written in step 2.
  4. If M is not 0, increment C and repeat from step 2 with M as the new number.
  5. Write C "1" bits and a "0" to the beginning of the code.
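The five steps above can be sketched directly on strings of '0' and '1' characters. This is a minimal illustration rather than a reference implementation: the names `tailBits` and `levenshteinEncode`, and the choice of a `std::string` as the output, are assumptions made for readability and do not come from the original description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: binary digits of n without the leading "1"
// (the empty string for n == 1).
std::string tailBits(std::uint64_t n) {
    assert(n >= 1 && "tailBits expects a positive number");
    std::string bits;
    for (; n > 1; n >>= 1)
        bits.insert(bits.begin(), static_cast<char>('0' + (n & 1)));
    return bits;
}

// Hypothetical encoder: returns the codeword as a string of '0'/'1'.
std::string levenshteinEncode(std::uint64_t n) {
    if (n == 0) return "0";                 // the code of zero is "0"
    std::string code;
    std::size_t c = 1;                      // step 1
    while (true) {
        std::string bits = tailBits(n);     // step 2: bits after the leading 1
        code = bits + code;                 //         written to the beginning
        std::size_t m = bits.size();        // step 3
        if (m == 0) break;                  // step 4: stop once M is 0
        ++c;
        n = m;
    }
    return std::string(c, '1') + "0" + code;  // step 5: C ones and a zero in front
}
```

For example, `levenshteinEncode(4)` yields `1110000`: the bits `00` come from 4, the bit `0` from M = 2, and the prefix `1110` records that C ended at 3.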

The code begins:

Number  Encoding          Implied probability
0       0                 1/2
1       10                1/4
2       110 0             1/16
3       110 1             1/16
4       1110 0 00         1/128
5       1110 0 01         1/128
6       1110 0 10         1/128
7       1110 0 11         1/128
8       1110 1 000        1/256
9       1110 1 001        1/256
10      1110 1 010        1/256
11      1110 1 011        1/256
12      1110 1 100        1/256
13      1110 1 101        1/256
14      1110 1 110        1/256
15      1110 1 111        1/256
16      11110 0 00 0000   1/4096
17      11110 0 00 0001   1/4096

To decode a Levenshtein-coded integer:

  1. Count the number of "1" bits until a "0" is encountered.
  2. If the count is zero, the value is zero; otherwise continue with step 3.
  3. Set a variable N to 1 and repeat step 4 (count minus 1) times.
  4. Read N bits, prepend "1", and assign the resulting value to N. After the last repetition, N is the decoded value.
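The decoding steps can be sketched in the same string-based style. This is an illustrative version under stated assumptions: the function name, the `pos` cursor parameter, and the choice to throw on truncated or oversized input are not part of the original description.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical decoder: reads one codeword from a string of '0'/'1'
// starting at bits[pos] and advances pos past it.
std::uint64_t levenshteinDecode(const std::string& bits, std::size_t& pos) {
    auto nextBit = [&]() -> int {
        if (pos >= bits.size())
            throw std::runtime_error("truncated Levenshtein code");
        return bits[pos++] == '1' ? 1 : 0;
    };

    std::size_t count = 0;
    while (nextBit() == 1) ++count;           // step 1: count the leading ones
    if (count == 0) return 0;                 // step 2: a lone "0" means zero

    std::uint64_t n = 1;                      // step 3
    for (std::size_t i = 0; i + 1 < count; ++i) {
        if (n > 63)                           // value would not fit in 64 bits
            throw std::runtime_error("Levenshtein code too large");
        std::uint64_t value = 1;              // step 4: the prepended "1"
        for (std::uint64_t j = 0; j < n; ++j)
            value = (value << 1) | static_cast<std::uint64_t>(nextBit());
        n = value;
    }
    return n;                                 // the final N is the decoded value
}
```

Because the function advances `pos`, calling it repeatedly on the same string decodes a concatenated stream of codewords one after another.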

The Levenshtein code of a positive integer is always one bit longer than the Elias omega code of that integer. However, there is a Levenshtein code for zero, whereas Elias omega coding would require the numbers to be shifted so that a zero is represented by the code for one instead.
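The one-bit difference can be checked numerically by computing both code lengths without building the codewords. The sketch below assumes the usual definition of Elias omega coding (repeatedly prepend the binary representation of the number, continue with its bit length minus one, and end with a single "0"); all function names are hypothetical.

```cpp
#include <cstdint>

// Number of binary digits of n (0 for n == 0).
int bitLength(std::uint64_t n) {
    int len = 0;
    for (; n > 0; n >>= 1) ++len;
    return len;
}

// Length in bits of the Levenshtein code of n.
int levenshteinLength(std::uint64_t n) {
    if (n == 0) return 1;                     // the code of zero is "0"
    int c = 1;                                // step count C
    int payload = 0;                          // bits written in step 2
    while (n > 1) {
        int m = bitLength(n) - 1;             // bits after the leading 1
        payload += m;
        ++c;
        n = static_cast<std::uint64_t>(m);
    }
    return c + 1 + payload;                   // C ones, one zero, payload
}

// Length in bits of the Elias omega code of n, for n >= 1.
int eliasOmegaLength(std::uint64_t n) {
    int len = 1;                              // terminating "0"
    while (n > 1) {
        int bits = bitLength(n);              // full binary representation
        len += bits;
        n = static_cast<std::uint64_t>(bits - 1);
    }
    return len;
}
```

For every positive `n`, `levenshteinLength(n)` should equal `eliasOmegaLength(n) + 1`; for instance, 4 takes 7 bits in Levenshtein coding and 6 bits in Elias omega coding.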

Example code

Encoding

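// IntReader, BitWriter and BitStack are helper classes that are not shown here.
void levenshteinEncode(char* source, char* dest)
{
    IntReader intreader(source);
    BitWriter bitwriter(dest);
    while (intreader.hasLeft())
    {
        int num = intreader.getInt();
        if (num == 0)
            bitwriter.outputBit(0);
        else
        {
            int c = 0;
            BitStack bits;
            do {
                int m = 0;
                for (int temp = num; temp > 1; temp >>= 1)  // calculate floor(log2(num))
                    ++m;
                for (int i = 0; i < m; ++i)                 // push the m low bits, lowest first
                    bits.pushBit((num >> i) & 1);
                num = m;
                ++c;
            } while (num > 0);
            for (int i = 0; i < c; ++i)                     // prefix: c ones and a zero
                bitwriter.outputBit(1);
            bitwriter.outputBit(0);
            while (bits.length() > 0)                       // popping restores the code order
                bitwriter.outputBit(bits.popBit());
        }
    }
}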

Decoding

// BitReader and IntWriter are helper classes that are not shown here.
void levenshteinDecode(char* source, char* dest)
{
    BitReader bitreader(source);
    IntWriter intwriter(dest);
    while (bitreader.hasLeft())
    {
        int n = 0;
        while (bitreader.inputBit())  // potentially dangerous with malformed files.
            ++n;
        int num;
        if (n == 0)
            num = 0;
        else
        {
            num = 1;
            for (int i = 0; i < n - 1; ++i)
            {
                int val = 1;                     // the implicit leading "1"
                for (int j = 0; j < num; ++j)    // read num bits
                    val = (val << 1) | bitreader.inputBit();
                num = val;
            }
        }
        intwriter.putInt(num);  // write out the value
    }
    bitreader.close();
    intwriter.close();
}

References

  1. "1968 paper by V. I. Levenshtein (in Russian)" (PDF).
  2. David Salomon (2007). Variable-length codes for data compression. Springer. p. 80. ISBN 978-1-84628-958-3.