HyperLogLog

Last updated

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. [1] Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. [1] HyperLogLog is an extension of the earlier LogLog algorithm, [2] itself deriving from the 1984 Flajolet–Martin algorithm. [3]

Contents

Terminology

In the original paper by Flajolet et al. [1] and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements. However in the theory of multisets the term refers to the sum of multiplicities of each member of a multiset. This article chooses to use Flajolet's definition for consistency with the sources.

Algorithm

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2n. [1]

In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set. [4]

Operations

The HyperLogLog has three main operations: add to add a new element to the set, count to obtain the cardinality of the set and merge to obtain the union of two sets. Some derived operations can be computed using the inclusion–exclusion principle like the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs combining the merge and count operations.

The data of the HyperLogLog is stored in an array M of m counters (or "registers") that are initialized to 0. Array M initialized from a multiset S is called HyperLogLogsketch of S.

Add

The add operation consists of computing the hash of the input data v with a hash function h, getting the first b bits (where b is ), and adding 1 to them to obtain the address of the register to modify. With the remaining bits compute which returns the position of the leftmost 1, where leftmost position is 1 (in other words: number of leading zeros plus 1). The new value of the register will be the maximum between the current value of the register and .

Count

The count algorithm consists in computing the harmonic mean of the m registers, and using a constant to derive an estimate of the count:

The intuition is that n being the unknown cardinality of M, each subset will have elements. Then should be close to . The harmonic mean of 2 to these quantities is which should be near . Thus, should be n approximately.

Finally, the constant is introduced to correct a systematic multiplicative bias present in due to hash collisions.

Practical considerations

The constant is not simple to calculate, and can be approximated with the formula [1]

The HyperLogLog technique, though, is biased for small cardinalities below a threshold of . The original paper proposes using a different algorithm for small cardinalities known as Linear Counting. [5] In the case where the estimate provided above is less than the threshold , the alternative calculation can be used:

  1. Let be the count of registers equal to 0.
  2. If , use the standard HyperLogLog estimator above.
  3. Otherwise, use Linear Counting:

Additionally, for very large cardinalities approaching the limit of the size of the registers ( for 32-bit registers), the cardinality can be estimated with:

With the above corrections for lower and upper bounds, the error can be estimated as .

Merge

The merge operation for two HLLs () consists in obtaining the maximum for each pair of registers

Complexity

To analyze the complexity, the data streaming model [6] is used, which analyzes the space necessary to get a approximation with a fixed success probability . The relative error of HLL is and it needs space, where n is the set cardinality and m is the number of registers (usually less than one byte size).

The add operation depends on the size of the output of the hash function. As this size is fixed, we can consider the running time for the add operation to be .

The count and merge operations depend on the number of registers m and have a theoretical cost of . In some implementations (Redis) [7] the number of registers is fixed and the cost is considered to be in the documentation.

HLL++

The HyperLogLog++ algorithm proposes several improvements in the HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities: [6]

Streaming HLL

When the data arrives in a single stream, the Historic Inverse Probability or martingale estimator [8] [9] significantly improves the accuracy of the HLL sketch and uses 36% less memory to achieve a given error level. This estimator is provably optimal for any duplicate insensitive approximate distinct counting sketch on a single stream.

The single stream scenario also leads to variants in the HLL sketch construction. HLL-TailCut+ uses 45% less memory than the original HLL sketch but at the cost of being dependent on the data insertion order and not being able to merge sketches. [10]

Further reading

Related Research Articles

<span class="mw-page-title-main">Binary search algorithm</span> Search algorithm finding the position of a target value within a sorted array

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

<span class="mw-page-title-main">Binomial coefficient</span> Number of subsets of a given size

In mathematics, the binomial coefficients are the positive integers that occur as coefficients in the binomial theorem. Commonly, a binomial coefficient is indexed by a pair of integers nk ≥ 0 and is written It is the coefficient of the xk term in the polynomial expansion of the binomial power (1 + x)n; this coefficient can be computed by the multiplicative formula

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Variance</span> Statistical measure of how far values spread from their average

In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. It is the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by , , , , or .

Shor's algorithm is a quantum algorithm for finding the prime factors of an integer. It was developed in 1994 by the American mathematician Peter Shor. It is one of the few known quantum algorithms with compelling potential applications and strong evidence of superpolynomial speedup compared to best known classical algorithms. On the other hand, factoring numbers of practical significance requires far more qubits than available in the near future. Another concern is that noise in quantum circuits may undermine results, requiring additional qubits for quantum error correction.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

A birthday attack is a bruteforce collision attack that exploits the mathematics behind the birthday problem in probability theory. This attack can be used to abuse communication between two or more parties. The attack depends on the higher likelihood of collisions found between random attack attempts and a fixed degree of permutations (pigeonholes). With a birthday attack, it is possible to find a collision of a hash function with chance in , with being the classical preimage resistance security with the same probability. There is a general result that quantum computers can perform birthday attacks, thus breaking collision resistance, in .

<span class="mw-page-title-main">Spearman's rank correlation coefficient</span> Nonparametric measure of rank correlation

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter (rho) or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

In mathematics, Machin-like formulae are a popular technique for computing π to a large number of digits. They are generalizations of John Machin's formula from 1706:

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897.

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth count data, eliminating issues caused by certain values having 0 occurrences. Given a set of observation counts from a -dimensional multinomial distribution with trials, a "smoothed" version of the counts gives the estimator:

The approximate counting algorithm allows the counting of a large number of events using a small amount of memory. Invented in 1977 by Robert Morris of Bell Labs, it uses probabilistic techniques to increment the counter. It was fully analyzed in the early 1980s by Philippe Flajolet of INRIA Rocquencourt, who coined the name approximate counting, and strongly contributed to its recognition among the research community. When focused on high quality of approximation and low probability of failure, Nelson and Yu showed that a very slight modification to the Morris Counter is asymptotically optimal amongst all algorithms for the problem. The algorithm is considered one of the precursors of streaming algorithms, and the more general problem of determining the frequency moments of a data stream has been central to the field.

Although the term well-behaved statistic often seems to be used in the scientific literature in somewhat the same way as is well-behaved in mathematics it can also be assigned precise mathematical meaning, and in more than one way. In the former case, the meaning of this term will vary from context to context. In the latter case, the mathematical conditions can be used to derive classes of combinations of distributions with statistics which are well-behaved in each sense.

The generalized normal distribution or generalized Gaussian distribution (GGD) is either of two families of parametric continuous probability distributions on the real line. Both families add a shape parameter to the normal distribution. To distinguish the two families, they are referred to below as "symmetric" and "asymmetric"; however, this is not a standard nomenclature.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

In computing, the count–min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.

In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.

References

  1. 1 2 3 4 5 Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" (PDF). Discrete Mathematics and Theoretical Computer Science Proceedings. Nancy, France. AH: 137–156. CiteSeerX   10.1.1.76.4286 . Retrieved 2016-12-11.
  2. Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick (ed.). Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). Vol. 2832. Springer. pp. 605–617.
  3. Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.
  4. S Heule; M Nunkesser; A Hall (2013). "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (PDF). sec 4.
  5. Whang, Kyu-Young; Vander-Zanden, Brad T; Taylor, Howard M (1990). "A linear-time probabilistic counting algorithm for database applications". ACM Transactions on Database Systems. 15 (2): 208–229. doi: 10.1145/78922.78925 . S2CID   2939101.
  6. 1 2 "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" . Retrieved 2014-04-19.
  7. "PFCOUNT – Redis".
  8. Cohen, E. (March 2015). "All-distances sketches, revisited: HIP estimators for massive graphs analysis". IEEE Transactions on Knowledge and Data Engineering. 27 (9): 2320–2334. arXiv: 1306.3284 . doi:10.1109/TKDE.2015.2411606.
  9. Ting, D. (August 2014). "Streamed approximate counting of distinct elements". Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 442–451. doi:10.1145/2623330.2623669. ISBN   978-1-4503-2956-9. S2CID   13179875.
  10. Xiao, Q.; Zhou, Y.; Chen, S. (May 2017). "Better with fewer bits: Improving the performance of cardinality estimation of large data streams". IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. pp. 1–9. doi:10.1109/INFOCOM.2017.8057088. ISBN   978-1-5090-5336-0. S2CID   27159273.