Count-distinct problem

In computer science, the count-distinct problem [1] (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

Formal definition

Instance: Consider a stream of elements x_1, x_2, ..., x_s with repetitions. Let n denote the number of distinct elements in the stream, with the set of distinct elements represented as {e_1, e_2, ..., e_n}.
Objective: Find an estimate n̂ of n using only m storage units, where m ≪ n.

An example of an instance for the cardinality estimation problem is the stream: a, b, a, c, d, b, d. For this instance, n = |{a, b, c, d}| = 4.

Naive solution

The naive solution to the problem is as follows:

Initialize a counter, c, to zero (c ← 0).
Initialize an efficient dictionary data structure, D, such as a hash table or search tree, in which insertion and membership queries can be performed quickly.
For each element x_i, issue a membership query:
    If x_i is not a member of D (x_i ∉ D), add x_i to D and increase c by one (c ← c + 1).
    Otherwise (x_i ∈ D), do nothing.
Output n = c.
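
For concreteness, the naive method can be written as a short Python sketch; the function name and the example stream are illustrative only:

def count_distinct_naive(stream):
    # D: a hash-based set supporting fast insertion and membership queries
    D = set()
    c = 0  # number of distinct elements seen so far
    for x in stream:
        if x not in D:   # membership query
            D.add(x)     # insertion
            c += 1
    return c

# Example: the stream a, b, a, c, d, b, d contains 4 distinct elements
# count_distinct_naive(['a', 'b', 'a', 'c', 'd', 'b', 'd']) == 4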

As long as the number of distinct elements is not too large, D fits in main memory and an exact answer can be retrieved. However, this approach does not scale when storage is bounded or when the computation performed for each element must be minimized. In such cases, several streaming algorithms have been proposed that use a fixed number of storage units.

Streaming algorithms

To handle the bounded storage constraint, streaming algorithms use randomization to produce a non-exact estimate n̂ of the number of distinct elements. State-of-the-art estimators hash every element e_j into a low-dimensional data sketch using a hash function h(e_j). The different techniques can be classified according to the data sketches they store.

Min/max sketches

Min/max sketches [2] [3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. [4] present a max sketch, which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator [5] is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm. [6]

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element e_j is associated with a uniform random variable h(e_j) ~ U(0,1), the expected minimum value of h(e_1), h(e_2), ..., h(e_n) is 1/(n+1). The hash function guarantees that h(e_j) is identical for all the appearances of e_j. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
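
To make this intuition concrete, the following minimal Python sketch keeps only the minimum hashed value and inverts the identity E[min] = 1/(n+1); the function names and the choice of SHA-256 as the hash are illustrative assumptions, not taken from the cited papers. A single-value sketch has very high variance; practical estimators combine many such observations.

import hashlib

def _uniform_hash(x):
    # Map an element to a pseudo-uniform value in (0, 1); every appearance
    # of the same element maps to the same value.
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16)
    return ((h % (2**53)) + 1) / float(2**53 + 1)

def min_sketch_estimate(stream):
    m = 1.0  # minimum hashed value observed so far (the whole sketch)
    for x in stream:
        m = min(m, _uniform_hash(x))
    # Method-of-moments step: since E[min] = 1/(n+1), estimate n ≈ 1/min − 1.
    return 1.0 / m - 1.0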

There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation [7] describes the Flajolet–Martin algorithm, a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff. [8]
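
A simplified illustration of the bit-pattern idea (not the full Flajolet–Martin estimator, which averages over many sketches; hash choice and function names are our own) is sketched below. Each element contributes the position of the least-significant 1 bit of its hash, the sketch is the logical OR of these bit patterns, and the lowest unset bit R yields an estimate of order 2^R (0.77351 is the Flajolet–Martin correction factor):

import hashlib

def _int_hash(x):
    # Hash an element to a large integer (illustrative choice of hash).
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

def fm_bit_pattern_estimate(stream, width=64):
    sketch = 0  # logical OR of all hashed bit patterns
    for x in stream:
        h = _int_hash(x) | (1 << width)   # guard bit bounds the bit position
        r = (h & -h).bit_length() - 1     # position of the least-significant 1 bit
        sketch |= 1 << r
    R = 0                                 # lowest bit position still zero in the sketch
    while sketch & (1 << R):
        R += 1
    return (2 ** R) / 0.77351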

Bottom-m sketches

Bottom-m sketches [9] are a generalization of min sketches, which maintain the m minimal hashed values, where m > 1. See Cosma et al. [2] for a theoretical overview of count-distinct estimation algorithms, and Metwally et al. [10] for a practical overview with comparative simulation results.
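
A minimal bottom-m sketch in Python might look as follows; the helper names are illustrative, and the final formula (m − 1)/v_m, where v_m is the m-th smallest hashed value, is one standard estimator of this type:

import hashlib
import heapq

def _uniform_hash(x):
    # Pseudo-uniform value in (0, 1); identical for all appearances of x.
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16)
    return ((h % (2**53)) + 1) / float(2**53 + 1)

def bottom_m_estimate(stream, m=64):
    heap = []        # max-heap (negated values) holding the m smallest hashes
    kept = set()     # hashed values currently kept, used to skip duplicates
    for x in stream:
        v = _uniform_hash(x)
        if v in kept:
            continue
        if len(heap) < m:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            kept.discard(-heapq.heappushpop(heap, -v))
            kept.add(v)
    if len(heap) < m:
        return float(len(heap))   # fewer than m distinct hashes: count is exact
    v_m = -heap[0]                # m-th smallest hashed value
    return (m - 1) / v_m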

CVM algorithm

Compared to other approximation algorithms for the count-distinct problem, the CVM algorithm [11] (named by Donald Knuth after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM algorithm provides an unbiased estimator for the number of distinct elements in a stream, [12] in addition to the standard (ε, δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth that adds the while loop to ensure B is actually reduced. [12]

Initialize p ← 1
Initialize an empty buffer, B, with maximum size s
For each element a_t in data stream A of size n do:
    If a_t is in B, then delete a_t from B
    u ← random number in [0, 1)
    If u < p, then insert a_t into B
    While |B| = s do:
        Remove every element of B independently with probability 1/2
        p ← p/2
    End While
End For
Return |B| / p
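
A direct Python transcription of the pseudocode above (the function name and the default buffer size are our own choices):

import random

def cvm_estimate(stream, s=1000):
    # s: maximum buffer size; larger s gives a more accurate estimate
    p = 1.0
    B = set()
    for a in stream:
        B.discard(a)                  # keep at most one copy of each element
        if random.random() < p:       # u < p
            B.add(a)
        while len(B) >= s:            # buffer full: thin it out and halve p
            B = {x for x in B if random.random() < 0.5}
            p /= 2.0
    return len(B) / p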

Weighted count-distinct problem

In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights w = w_1 + w_2 + ⋯ + w_n. Formally,

Instance: A stream of weighted elements x_1, x_2, ..., x_s with repetitions, and an integer m. Let n be the number of distinct elements, namely n = |{x_1, x_2, ..., x_s}|, and let these elements be {e_1, e_2, ..., e_n}. Finally, let w_j be the weight of e_j.
Objective: Find an estimate ŵ of w = w_1 + w_2 + ⋯ + w_n using only m storage units, where m ≪ n.

An example of an instance for the weighted problem is the stream: a(3), b(4), a(3), c(2), d(3), b(4), d(3), where each element is listed with its weight in parentheses. For this instance, e_1 = a, e_2 = b, e_3 = c, e_4 = d, the weights are w_1 = 3, w_2 = 4, w_3 = 2, w_4 = 3, and w = 12.

As an application example, x_1, x_2, ..., x_s could be IP packets received by a server. Each packet belongs to one of n IP flows e_1, e_2, ..., e_n. The weight w_j can be the load imposed by flow e_j on the server. Thus, w = w_1 + w_2 + ⋯ + w_n represents the total load imposed on the server by all the flows to which the packets belong.

Solving the weighted count-distinct problem

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem. [13] For example, the weighted estimator proposed by Cohen et al. [5] can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm [6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all other known algorithms for the weighted problem.
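
To illustrate how an extreme order statistics estimator generalizes, the following Python sketch (function names and hash choice are our own; it is a simplified illustration of the framework in [5] and [13], not the extended HyperLogLog algorithm) maps each distinct element e_j to an exponential random variable with rate w_j. The minimum of these variables over the distinct elements is exponential with rate w = w_1 + w_2 + ⋯ + w_n, so its reciprocal gives a rough estimate of the total weight; practical schemes reduce the variance by combining many such sketches.

import hashlib
import math

def _uniform_hash(x):
    # Pseudo-uniform value in (0, 1); identical for all appearances of x.
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16)
    return ((h % (2**53)) + 1) / float(2**53 + 1)

def weighted_min_sketch_estimate(stream, weight):
    # stream: iterable of elements; weight: function returning an element's weight.
    m = float('inf')   # minimum exponential value observed so far
    for x in stream:
        u = _uniform_hash(x)              # determined by the element, so duplicates repeat
        t = -math.log(u) / weight(x)      # exponential with rate w_j
        m = min(m, t)
    return 1.0 / m if m < float('inf') else 0.0   # rough estimate of w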

See also

Count–min sketch
Count sketch
HyperLogLog
MinHash
Streaming algorithm
Maximum likelihood estimation

References

  1. Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF).
  2. Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics. arXiv:0801.3552.
  3. Giroire, Frederic; Fusy, Eric (2007). 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). pp. 223–231. CiteSeerX 10.1.1.214.270. doi:10.1137/1.9781611972979.9. ISBN 978-1-61197-297-9.
  4. Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science. arXiv:math/0701347. Bibcode:2007math......1347C.
  5. Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci. 55 (3): 441–453. doi:10.1006/jcss.1997.1534.
  6. Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). Analysis of Algorithms.
  7. Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). J. Comput. Syst. Sci. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.
  8. Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An Optimal Algorithm for the Distinct Elements Problem". Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS).
  9. Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB.
  10. Metwally, Ahmed; Agrawal, Divyakant; El Abbadi, Amr (2008). "Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic". Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology. pp. 618–629. CiteSeerX 10.1.1.377.4771.
  11. Chakraborty, Sourav; Vinodchandran, N. V.; Meel, Kuldeep S. (2022). "Distinct Elements in Streams: An Algorithm for the (Text) Book". 30th Annual European Symposium on Algorithms (ESA 2022). doi:10.4230/LIPIcs.ESA.2022.34. ISSN 1868-8969.
  12. Knuth, Donald (May 2023). "The CVM Algorithm for Estimating Distinct Elements in Streams" (PDF).
  13. Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters. 115 (2): 336–342. doi:10.1016/j.ipl.2014.10.009.