Comma code

A comma code is a type of prefix-free code in which a comma, a particular symbol or sequence of symbols, occurs at the end of a code word and never occurs otherwise.[1] This is an intuitive way to express arrays.

For example, Fibonacci coding is a comma code in which the comma is 11. 11 and 1011 are valid Fibonacci code words, but 101, 0111, and 11011 are not.
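The defining property can be sketched in a few lines of Python (a hedged illustration; the helper name `is_comma_code_word` is hypothetical): a string is a valid codeword exactly when its only occurrence of the comma, counting overlaps, is at the very end.

```python
def is_comma_code_word(word: str, comma: str = "11") -> bool:
    """True iff the comma occurs at the end of the word and nowhere
    else, counting overlapping occurrences (so '111' is rejected:
    '11' also starts at position 0)."""
    starts = [i for i in range(len(word) - len(comma) + 1)
              if word[i:i + len(comma)] == comma]
    return starts == [len(word) - len(comma)]

# The examples from the text:
for w in ("11", "1011", "101", "0111", "11011"):
    print(w, is_comma_code_word(w))
```

This accepts 11 and 1011 and rejects 101, 0111, and 11011, matching the Fibonacci-coding examples above.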

Examples

Symbol   Code      Comma code
Comma    - (N/A)   0
0        00        100
1        01        101
2        10        110
3        11        111

Here a word is defined as a sequence of symbols terminated by the comma, which plays the same role as a space character.
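A decoder for the code in the table above can be sketched as follows (a hedged illustration assuming a well-formed stream; the function name `decode_words` is hypothetical). The codeword set {0, 100, 101, 110, 111} is prefix-free, with '0' acting as the comma:

```python
def decode_words(bits: str):
    """Split a bit stream into words using the table's comma code:
    '0' is the comma; '1xy' encodes the 2-bit symbol xy (values 0-3)."""
    words, current, i = [], [], 0
    while i < len(bits):
        if bits[i] == "0":            # comma: terminate the current word
            words.append(current)
            current = []
            i += 1
        else:                         # '1' introduces two symbol bits
            current.append(int(bits[i + 1:i + 3], 2))
            i += 3
    return words

print(decode_words("1001110"))   # the word [0, 3] followed by a comma
```

Because every data codeword starts with '1' and has fixed length 3, a single leading bit decides between "comma" and "read two more bits", so decoding never needs lookahead.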

All scrambled data, or suitably curated same-length data, exhibits a so-called implied probability, in which a codeword of length l occurs with probability 2^−l.

Such data, which can be termed 'generic data', can be analysed using any interleaving unary code as a header, where additional bijective bits (equal in number to the length of the unary code just read) are read as data while the unary code serves as an introduction, or header, for the data. This header serves as the comma. The data can be read in an interleaved fashion, one data bit after each bit of the header, or in a post-read fashion, where the data is read only after the entire unary header has been read, as in Chen-Ho encoding.

It can be seen by random-walk techniques and by statistical summation that all generic data has a header, or comma, averaging 2 bits and data averaging an additional 2 bits (minimum 1).
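The 2-bit average can be checked by statistical summation: in a uniformly random bit stream a unary header of length k occurs with probability 2^−k, so its mean length is Σ k·2^−k = 2, and this scheme attaches one data bit per header bit. A small Monte Carlo sketch (the helper name `read_header` is hypothetical):

```python
import random

def read_header(next_bit):
    """Read a unary header: ones terminated by a zero.
    Returns its total length, terminating zero included."""
    length = 1
    while next_bit() == 1:
        length += 1
    return length

random.seed(0)
n = 200_000
lengths = [read_header(lambda: random.getrandbits(1)) for _ in range(n)]
avg = sum(lengths) / n
print(round(avg, 2))   # close to 2.0; the data field averages the same
```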

This also allows for an inexpensive base-increase algorithm before transmission over non-binary communication channels, such as base-3 or base-5 channels.

n    RL code                 Next code               Bijective data (non-NULL)     Commas
1    1?                      0?                      ? (1=1, 2=2)                  ,
2    1?1?                    0?0?                    ?? (3,4,5,6 = 11,12,21,22)    ,,
3    1?1?1?                  0?0?0?                  ???                           ,,,
4    1?1?1?1?                0?0?0?0?                ????                          ,,,,
5    1?1?1?1?1?              0?0?0?0?0?              ?????                         ,,,,,
6    1?1?1?1?1?1?            0?0?0?0?0?0?            ??????                        ,,,,,,
7    1?1?1?1?1?1?1?          0?0?0?0?0?0?0?          ???????                       ,,,,,,,
8    1?1?1?1?1?1?1?1?        0?0?0?0?0?0?0?0?        ????????                      ,,,,,,,,
9    1?1?1?1?1?1?1?1?1?      0?0?0?0?0?0?0?0?0?      ?????????                     ,,,,,,,,,
10   1?1?1?1?1?1?1?1?1?1?    0?0?0?0?0?0?0?0?0?0?    ??????????                    ,,,,,,,,,,
...

Here '?' stands for '1' or '2', the value of a bijective digit that requires no further processing.

A single comma separates each field of data, so on average half of the transmitted digits are commas. This is visible from the implied probability of 50% for the 0 code in the Huffman base-3 code {0, 10, 11} (net 2/3, or 66.66%, commas), and from the base-5 comma code shown above. For higher-base communication to remain cost-effective, the cost-per-character quotient must stay near logarithmic for the data and below 2 bits for the comma character.

This method guarantees a '1' or '2' after every '0' (comma), a property that can be useful when designing around timing concerns in transmission. Converting a known binary value to ternary can be somewhat expensive unless ternary bit costs are reduced to near binary bit costs, so this bit can be multiplexed onto a separate binary channel if costs allow. This may require reading an additional 'tail', a trailing portion of 2 bits of pure data, on the binary channel (starting after the first bit of the first change, since this is not an instantly decodable code; it can simply be read if an instantly decodable unary code is used) to match the average of 2 ternary bits remaining on the primary channel, before cost comparisons are factored in.

Not considering multiplexing, this method has a read efficiency of 3 ternary digits for a read of 4 binary bits, or 4/3 ≈ 1.33 binary bits per ternary digit.

n    RL code                Next code              Bijective data (has NULL)     Commas
1    1                      0                      NULL (or 0)                   ,
2    1?1                    0?0                    ? (1=1, 2=2)                  ,,
3    1?1?1                  0?0?0                  ?? (3,4,5,6 = 11,12,21,22)    ,,,
4    1?1?1?1                0?0?0?0                ???                           ,,,,
5    1?1?1?1?1              0?0?0?0?0              ????                          ,,,,,
6    1?1?1?1?1?1            0?0?0?0?0?0            ?????                         ,,,,,,
7    1?1?1?1?1?1?1          0?0?0?0?0?0?0          ??????                        ,,,,,,,
8    1?1?1?1?1?1?1?1        0?0?0?0?0?0?0?0        ???????                       ,,,,,,,,
9    1?1?1?1?1?1?1?1?1      0?0?0?0?0?0?0?0?0      ????????                      ,,,,,,,,,
10   1?1?1?1?1?1?1?1?1?1    0?0?0?0?0?0?0?0?0?0    ?????????                     ,,,,,,,,,,
...

Here '?' again stands for '1' or '2', the value of a bijective digit that requires no further processing. This method is statistically similar to a simple 'implied read' of the Huffman base-3 code {0, 10, 11} (net 2/3, or 66.66%, commas).

It can be seen by random-walk techniques and by statistical summation that all generic data has a header, or comma, averaging 2 bits and data averaging an additional 1 bit (minimum 0).

This method gives no assurance of a '1' or '2' after every '0' (comma), a property that would be useful when designing around timing concerns in transmission.

This method has a read efficiency of 2 ternary digits for a read of 3 binary bits, or 3/2 = 1.5 binary bits per ternary digit.

The main advantage of this technique, apart from higher efficiency, is that no base conversion is required, which would otherwise require the entire stream to be read first and then converted. The disadvantage is that the average number length becomes higher, similar to random number generation, and the timing concerns that govern ternary transmission come to the fore. With m = 2 and n = 2, and remembering that a value of '(2)' is essentially 0 bits, we get:

READS - Code space (128 states)   Binary encoding (b3b2b1b0)   Ternary digits (values encoded)   Description                         WRITES - Occurrences (100 states)
50% (64 states)                   0ab                          (0–1) (0–1)                       Two lower digits                    44.44% (45 states)
25% (32 states)                   10a                          (2) (0–1)                         One lower digit, one higher digit   22.22% (22 states)
12.5% (16 states)                 110b                         (0–1) (2)                         One lower digit, one higher digit   22.22% (22 states)
12.5% (16 states)                 111                          (2) (2)                           Two higher digits                   11.11% (11 states)

This method therefore has a read efficiency of 2 ternary digits for a read of 3.125 binary bits, or 25/16 = 1.5625 binary bits per ternary digit.

It has a write efficiency of 2 ternary digits for a write of 3.22 binary bits, or 29/18 ≈ 1.61 binary bits per ternary digit.
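Both efficiency figures can be recomputed from the table above (a sketch under the table's stated weights: reads weight each codeword length by its implied probability in the binary code space, writes by the occurrence of '2's among the 3^2 = 9 equiprobable two-digit ternary values):

```python
from fractions import Fraction as F

# (probability, codeword length in bits); each row covers 2 ternary digits
reads = [(F(1, 2), 3),   # 0ab : two lower digits
         (F(1, 4), 3),   # 10a : one lower, one higher
         (F(1, 8), 4),   # 110b: one lower, one higher
         (F(1, 8), 3)]   # 111 : two higher digits
read_bits = sum(p * l for p, l in reads)       # 25/8 = 3.125 bits
print(read_bits / 2)                           # 25/16 = 1.5625 bits/trit

# (occurrence among the 9 equiprobable digit pairs, codeword length)
writes = [(F(4, 9), 3), (F(2, 9), 3), (F(2, 9), 4), (F(1, 9), 3)]
write_bits = sum(p * l for p, l in writes)     # 29/9 ≈ 3.22 bits
print(write_bits / 2)                          # 29/18 ≈ 1.61 bits/trit
```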

References

  1. Wade, Graham (8 September 1994). Signal Coding and Processing. Cambridge University Press. p. 56. ISBN 978-0-521-42336-6.