Misra–Gries heavy hitters algorithm

Misra and Gries defined the heavy-hitters problem (though they did not introduce the term heavy hitters [4]) and described the first algorithm for it in the paper "Finding repeated elements". [1] Their algorithm extends the Boyer–Moore majority vote algorithm in a significant way.

One version of the heavy-hitters problem is as follows: given a bag b of n elements and an integer k ≥ 2, find the values that occur more than n ÷ k times in b. The Misra–Gries algorithm solves the problem by making two passes over the values in b, while storing at most k values from b and their number of occurrences during the course of the algorithm.
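
For concreteness, a brute-force solution simply counts every distinct value and keeps those whose count exceeds n ÷ k. The Python sketch below (an illustration of the problem, not the Misra–Gries algorithm; the function name heavy_hitters_naive is hypothetical) shows the expected output on a small bag, but it needs space proportional to the number of distinct values rather than to k.

    from collections import Counter

    def heavy_hitters_naive(b, k):
        # Return the values occurring more than len(b) / k times in bag b.
        n = len(b)
        counts = Counter(b)
        return {v for v, c in counts.items() if c > n / k}

    # In a bag of 10 elements with k = 3, a heavy hitter must occur
    # more than 10/3 ≈ 3.33 times, i.e. at least 4 times.
    print(heavy_hitters_naive([1, 2, 1, 3, 1, 4, 1, 2, 2, 5], 3))   # prints {1}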

The Misra–Gries algorithm [1] is one of the earliest streaming algorithms; it is described below in those terms in the section Summaries.

Misra–Gries algorithm

A bag is like a set in which the same value may occur multiple times. Assume that a bag is available as an array b[0:n – 1] of n elements. In the abstract description of the algorithm, we treat b and its segments also as bags. Henceforth, a heavy hitter of bag b is a value that occurs more than n ÷ k times in it, for some integer k, k≥2.

A k-reduced bag for bag b is derived from b by repeating the following operation until no longer possible: Delete k distinct elements from b. From its definition, a k-reduced bag contains fewer than k different values. For example, with b = ⟨1, 2, 1, 3, 1⟩ and k = 2, deleting one occurrence each of the distinct values 2 and 3 yields the 2-reduced bag ⟨1, 1, 1⟩. The following theorem is easy to prove:

Theorem 1. Each heavy hitter of b is an element of a k-reduced bag for b. (Each deletion removes one occurrence of each of k distinct values, so at most n ÷ k deletions are possible and each removes at most one occurrence of any particular value; a value occurring more than n ÷ k times therefore cannot be removed entirely.)

The first pass of the heavy-hitters computation constructs a k-reduced bag t. The second pass declares an element of t to be a heavy-hitter if it occurs more than n ÷ k times in b. According to Theorem 1, this procedure determines all and only the heavy-hitters. The second pass is easy to program, so we describe only the first pass.

In order to construct t, scan the values in b in arbitrary order; for specificity, the following algorithm scans them in order of increasing index. Invariant P of the algorithm is that t is a k-reduced bag for the scanned values and d is the number of distinct values in t. Initially, no value has been scanned, t is the empty bag, and d is zero.

P: 0 ≤ i ≤ n
  t is a k-reduced bag for b[0:i – 1]
  d is the number of distinct values in t
  0 ≤ d < k

Whenever element b[i] is scanned, in order to preserve the invariant: (1) if b[i] is not in t, add it to t and increase d by 1, (2) if b[i] is in t, add it to t but don't modify d, and (3) if d becomes equal to k, reduce t by deleting k distinct values from it and update d appropriately.

algorithm Misra–Gries is
    t, d := { }, 0
    for i from 0 to n-1 do
        if b[i] ∉ t then
            t, d := t ∪ {b[i]}, d+1
        else
            t, d := t ∪ {b[i]}, d
        endif
        if d = k then
            Delete k distinct values from t; update d
        endif
    endfor

A possible implementation of t is as a set of pairs of the form (vi, ci) where each vi is a distinct value in t and ci is the number of occurrences of vi in t. Then d is the size of this set. The step "Delete k distinct values from t" amounts to reducing each ci by 1 and then removing any pair (vi, ci) from the set if ci becomes 0.
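
A minimal sketch of this implementation in Python is given below, under the assumption that a hash table (a Python dict mapping each value to its count) stands in for the set of pairs; the function names k_reduced_bag and heavy_hitters are illustrative, not from the original paper. The first function is the first pass described above, and the second adds the easy second pass.

    def k_reduced_bag(b, k):
        # First pass: return a k-reduced bag for b as a dict {value: count}.
        # The dict plays the role of the set of (vi, ci) pairs; its size is d.
        t = {}
        for x in b:
            t[x] = t.get(x, 0) + 1                      # add b[i] to t
            if len(t) == k:                             # d = k: reduce t
                # Delete k distinct values: decrement every count and
                # drop the pairs whose count reaches 0.
                t = {v: c - 1 for v, c in t.items() if c > 1}
        return t

    def heavy_hitters(b, k):
        # Second pass: keep only candidates occurring more than n / k times.
        n = len(b)
        return {v for v in k_reduced_bag(b, k)
                if sum(1 for x in b if x == v) > n / k}

    # Example usage:
    print(heavy_hitters([1, 2, 1, 3, 1, 4, 1, 2, 2, 5], 3))   # prints {1}

With a hash table, each reduction step touches at most k entries and can occur at most n ÷ k times, so the first pass runs in expected linear time; the AVL tree implementation discussed next gives a deterministic O(n log k) bound instead.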

Using an AVL tree implementation of t, the algorithm has a running time of O(n log k). In order to assess the space requirement, assume that the elements of b can have m possible values, so the storage of a value vi needs O(log m) bits. Since each counter ci may have a value as high as n, its storage needs O(log n) bits. Therefore, for O(k) value-counter pairs, the space requirement is O(k (log n + log m)).

Summaries

In the field of streaming algorithms, the output of the first pass of the Misra–Gries algorithm may be called a summary, and such summaries are used to solve the frequent elements problem in the data stream model. A streaming algorithm makes a small, bounded number of passes over a list of data items called a stream. It processes the elements using an amount of extra space that is at most logarithmic in the size of the list to produce an answer.
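
As a sketch of how the first pass can be packaged as such a summary (a common formulation keeps at most k − 1 counters; the class name and interface below are illustrative, not standardized): each stored counter is a lower bound on the true number of occurrences, and the standard analysis shows it undercounts by at most n ÷ k after n stream items.

    class MisraGriesSummary:
        # One-pass summary keeping at most k - 1 value/counter pairs.
        def __init__(self, k):
            self.k = k
            self.counters = {}

        def update(self, x):
            # Process one stream item.
            if x in self.counters:
                self.counters[x] += 1
            elif len(self.counters) < self.k - 1:
                self.counters[x] = 1
            else:
                # Decrementing every counter (and discarding x) corresponds
                # to deleting k distinct values from the bag seen so far.
                for v in list(self.counters):
                    self.counters[v] -= 1
                    if self.counters[v] == 0:
                        del self.counters[v]

        def estimate(self, x):
            # Lower bound on the number of occurrences of x seen so far.
            return self.counters.get(x, 0)

    # Example: summarize a stream one item at a time.
    s = MisraGriesSummary(k=3)
    for item in [1, 2, 1, 3, 1, 4, 1, 2, 2, 5]:
        s.update(item)
    print(s.counters)   # candidate heavy hitters with lower-bound counts

When a second pass over the data is possible, the candidates in the summary can be verified exactly, as in the two-pass algorithm above.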

The term Misra–Gries summary appears to have been coined by Graham Cormode. [5] [6]

References

  1. Misra, J.; Gries, David (1982). "Finding repeated elements". Science of Computer Programming. 2 (2): 143–152. doi:10.1016/0167-6423(82)90012-0. hdl:1813/6345.
  2. Woodruff, David P. (2016). "New algorithms for Heavy Hitters in data streams". In Wim Martens; Thomas Zeume (eds.). 19th International Conference on Database Theory (ICDT 2016). Vol. 48. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICDT.2016.4.
  3. Pandey, Prashant; et al. (June 2020). "Timely reporting of heavy hitters using external memory". SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Portland, Oregon: ACM. doi:10.1145/3318464.
  4. A number of articles cite Misra–Gries [1] as one of the first to provide a heavy-hitters algorithm. For example, see David P. Woodruff's article on heavy-hitter algorithms in data streams [2] and the article by Pandey et al. on heavy hitters using external memory. [3]
  5. Cormode, Graham (2014). "Misra-Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
  6. "Misra-Gries Summaries" (PDF). UT News. Retrieved 2022-09-19.