Normalized compression distance

Last updated

Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Such a measurement should not be application dependent or arbitrary. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other.

Contents

It can be used in information retrieval and data mining for cluster analysis.

Information distance

We assume that the objects one talks about are finite strings of 0s and 1s. Thus we mean string similarity. Every computer file is of this form, that is, if an object is a file in a computer it is of this form. One can define the information distance between strings and as the length of the shortest program that computes from and vice versa. This shortest program is in a fixed programming language. For technical reasons one uses the theoretical notion of Turing machines. Moreover, to express the length of one uses the notion of Kolmogorov complexity. Then, it has been shown [1]

up to logarithmic additive terms which can be ignored. This information distance is shown to be a metric (it satisfies the metric inequalities up to a logarithmic additive term), is universal (it minorizes every computable distance as computed for example from features up to a constant additive term). [1]

Normalized information distance (similarity metric)

The information distance is absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length 1,000,000 differ by 1000 bits, then we consider that those strings are relatively more similar than two strings of 1000 bits that differ by 1000 bits. Hence we need to normalize to obtain a similarity metric. This way one obtains the normalized information distance (NID),

where is algorithmic information of given as input. The NID is called `the similarity metric.' since the function has been shown to satisfy the basic requirements for a metric distance measure. [2] [3] However, it is not computable or even semicomputable. [4]

Normalized compression distance

While the NID metric is not computable, it has an abundance of applications. Simply approximating by real-world compressors, with is the binary length of the file compressed with compressor Z (for example "gzip", "bzip2", "PPMZ") in order to make NID easy to apply. [2] Vitanyi and Cilibrasi rewrote the NID to obtain the Normalized Compression Distance (NCD)

[3]

The NCD is actually a family of distances parametrized with the compressor Z. The better Z is, the closer the NCD approaches the NID, and the better the results are. [3]

Applications

The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees. [2] [3] It can also be used for new applications of general clustering and classification of natural data in arbitrary domains, [3] for clustering of heterogeneous data, [3] and for anomaly detection across domains. [5] The NID and NCD have been applied to numerous subjects, including music classification, [3] to analyze network traffic and cluster computer worms and viruses, [6] authorship attribution, [7] gene expression dynamics, [8] predicting useful versus useless stem cells, [9] critical networks, [10] image registration, [11] question-answer systems. [12]

Performance

Researchers from the datamining community use NCD and variants as "parameter-free, feature-free" data-mining tools. [5] One group have experimentally tested a closely related metric on a large variety of sequence benchmarks. Comparing their compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established superiority of the compression method for clustering heterogeneous data, and for anomaly detection, and competitiveness in clustering domain data.

NCD has an advantage of being robust to noise. [13] However, although NCD appears "parameter-free", practical questions include which compressor to use in computing the NCD and other possible problems. [14]

Comparison with the Normalized Relative Compression (NRC)

In order to measure the information of a string relative to another there is the need to rely on relative semi-distances (NRC). [15] These are measures that do not need to respect symmetry and triangle inequality distance properties. Although the NCD and the NRC seem very similar, they address different questions. The NCD measures how similar both strings are, mostly using the information content, while the NRC indicates the fraction of a target string that cannot be constructed using information from another string. For a comparison, with application to the evolution of primate genomes, see. [16]

Normalized Google distance

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of `War and Peace' by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red." We are interested in semantic similarity. Using code-word lengths obtained from the page-hit counts returned by Google from the web, we obtain a semantic distance using the NCD formula and viewing Google as a compressor useful for data mining, text comprehension, classification, and translation. The associated NCD, called the normalized Google distance (NGD) can be rewritten as

where denotes the number of pages containing the search term , and denotes the number of pages containing both and ,) as returned by Google or any search engine capable of returning an aggregate page count. The number can be set to the number of pages indexed although it is more proper to count each page according to the number of search terms or phrases it contains. As rule of the thumb one can multiply the number of pages by, say, a thousand... [17]

See also

Related Research Articles

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field, in applied mathematics, is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.

<span class="mw-page-title-main">Metric space</span> Mathematical space with a notion of distance

In mathematics, a metric space is a set together with a notion of distance between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general setting for studying many of the concepts of mathematical analysis and geometry.

<span class="mw-page-title-main">Quantization (signal processing)</span> Process of mapping a continuous set to a countable set

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.

<span class="mw-page-title-main">Dynamic time warping</span> An algorithm for measuring similarity between two temporal sequences, which may vary in speed

In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

<span class="mw-page-title-main">Jaccard index</span> Measure of similarity and diversity between sets

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.

The structural similarityindex measure (SSIM) is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It is also used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference.

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable differences, including that it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

The normalized Google distance (NGD) is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of normalized Google distance, while words with dissimilar meanings tend to be farther apart.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

Information distance is the distance between two finite objects expressed as the number of bits in the shortest program which transforms one object into the other one or vice versa on a universal computer. This is an extension of Kolmogorov complexity. The Kolmogorov complexity of a single finite object is the information in that object; the information distance between a pair of finite objects is the minimum information required to go from one object to the other or vice versa. Information distance was first defined and investigated in based on thermodynamic principles, see also. Subsequently, it achieved final form in. It is applied in the normalized compression distance and the normalized Google distance.

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.

References

  1. 1 2 C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitányi, and W. Zurek, Information Distance, IEEE Trans. Inform. Theory, IT-44:4(1998) 1407–1423
  2. 1 2 3 Li, Ming; Chen, Xin; Li, Xin; Ma, Bin; Vitanyi, P. M. B. (2011-09-27). "M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi, The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250–3264". IEEE Transactions on Information Theory. 50 (12): 3250–3264. doi:10.1109/TIT.2004.838101. S2CID   221927.
  3. 1 2 3 4 5 6 7 Cilibrasi, R.; Vitanyi, P. M. B. (2011-09-27). "R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression". IEEE Transactions on Information Theory. 51 (4): 1523–1545. arXiv: cs/0312044 . doi:10.1109/TIT.2005.844059. S2CID   911.
  4. Terwijn, Sebastiaan A.; Torenvliet, Leen; Vitányi, Paul M.B. (2011). "Nonapproximability of the normalized information distance". Journal of Computer and System Sciences. 77 (4): 738–742. doi: 10.1016/j.jcss.2010.06.018 . S2CID   10831035.
  5. 1 2 Keogh, Eamonn; Lonardi, Stefano; Ratanamahatana, Chotirat Ann (2004-08-22). "Towards parameter-free data mining". E. Keogh, S. Lonardi, and C.A. Ratanamahatana. "Towards parameter-free data mining." In Conference on Knowledge Discovery in Data: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, vol. 22, no. 25, pp. 206–215. 2004. Dl.acm.org. p. 206. doi:10.1145/1014052.1014077. ISBN   978-1581138887. S2CID   1616057.
  6. "S. Wehner, Analyzing worms and network traffic using compression, Journal of Computer Security, 15:3(2007), 303–320". Iospress.metapress.com. Retrieved 2012-11-03.{{cite journal}}: Cite journal requires |journal= (help)
  7. Stamatatos, Efstathios (2009). "A survey of modern authorship attribution methods". Journal of the American Society for Information Science and Technology. 60 (3): 538–556. CiteSeerX   10.1.1.207.3310 . doi:10.1002/asi.21001.
  8. Nykter, M. (2008). "Gene expression dynamics in the macrophage exhibit criticality". Proceedings of the National Academy of Sciences. 105 (6): 1897–1900. Bibcode:2008PNAS..105.1897N. doi: 10.1073/pnas.0711525105 . PMC   2538855 . PMID   18250330.
  9. Cohen, Andrew R (2010). "Computational prediction of neural progenitor cell fates". Nature Methods. 7 (3): 213–218. doi:10.1038/nmeth.1424. hdl: 1866/4484 . PMID   20139969. S2CID   18652440.
  10. Nykter, Matti; Price, Nathan D.; Larjo, Antti; Aho, Tommi; Kauffman, Stuart A.; Yli-Harja, Olli; Shmulevich, Ilya (2008). "M. Nykter, N.D. Price, A. Larjo, T. Aho, S.A. Kauffman, O. Yli-Harja1, and I. Shmulevich, Critical networks exhibit maximal information diversity in structure-dynamics relationships, Phys. Rev. Lett. 100, 058702 (2008)". Physical Review Letters. 100 (5): 058702. arXiv: 0801.3699 . doi:10.1103/PhysRevLett.100.058702. PMID   18352443. S2CID   5760862.
  11. Bardera, Anton; Feixas, Miquel; Boada, Imma; Sbert, Mateu (July 2006). "Compression-based Image Registration". M. Feixas, I. Boada, M. Sbert, Compression-based Image Registration. Proc. IEEE International Symposium on Information Theory, 2006. 436–440. Ieeexplore.ieee.org. pp. 436–440. doi:10.1109/ISIT.2006.261706. hdl:10256/3052. ISBN   978-1-4244-0505-3. S2CID   12549455.
  12. Zhang, Xian; Hao, Yu; Zhu, Xiaoyan; Li, Ming; Cheriton, David R. (2007). "Information distance from a question to an answer". X Zhang, Y Hao, X Zhu, M Li, Information distance from a question to an answer, Proc. 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, 874–883. Dl.acm.org. p. 874. doi:10.1145/1281192.1281285. ISBN   9781595936097. S2CID   3040254.
  13. Cebrian, M.; Alfonseca, M.; Ortega, A. (2011-09-27). "The normalized compression distance is resistant to noise". IEEE Transactions on Information Theory. 53 (5): 1895–1900. CiteSeerX   10.1.1.158.2463 . doi:10.1109/TIT.2007.894669. S2CID   15691266.
  14. "M. Cebrián, M. Alfonseca, and A. Ortega, Common pitfalls using the normalized compression distance: what to watch out for in a compressor, Commun. Inf. Syst. 5:4(2005), 367–384". Projecteuclid.org. Retrieved 2012-11-03.
  15. Ziv, J.; Merhav, N. (1993). "A measure of relative entropy between individual sequences with application to universal classification". IEEE Transactions on Information Theory. 39 (4): 1270–1279. doi:10.1109/18.243444.
  16. Pratas, Diogo; Silva, Raquel M.; Pinho, Armando J. (2018). "Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes". Entropy. 20 (6): 393. Bibcode:2018Entrp..20..393P. doi: 10.3390/e20060393 . PMC   7512912 . PMID   33265483. CC-BY icon.svg Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
  17. Cilibrasi, R. L.; Vitanyi, P. M. B. (2011-09-27). "R.L. Cilibrasi, P.M.B. Vitanyi, The Google Similarity Distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370-383". IEEE Transactions on Knowledge and Data Engineering. 19 (3): 370–383. arXiv: cs/0412098 . doi:10.1109/TKDE.2007.48. S2CID   59777.