In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.
As a result of these constraints, streaming algorithms often produce approximate answers based on a summary or "sketch" of the data stream.
Though streaming algorithms had already been studied by Munro and Paterson [1] as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83, [2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. [3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.
Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs, [4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.
In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream". [5] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one. [6]
Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector a = (a1, ..., an) (initialized to the zero vector 0) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of a using considerably less space than it would take to represent a precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models. [7]
In the cash register model, each update is of the form ⟨i, c⟩, so that ai is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).
In the turnstile model, each update is of the form ⟨i, c⟩, so that ai is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no ai at any time may be less than zero.
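As an illustrative sketch (not from the source), the two update models can be simulated on an explicit frequency vector; a real streaming algorithm keeps only a small-space summary of a rather than the vector itself. The function name and parameters below are assumptions for the example.

def apply_updates(n, updates, model="turnstile"):
    """Apply a stream of (i, c) updates to a zero-initialized vector a of length n.

    model = "cash register": every increment c must be positive.
    model = "strict turnstile": c may be negative, but no a_i may drop below zero.
    """
    a = [0] * n
    for i, c in updates:
        if model == "cash register" and c <= 0:
            raise ValueError("cash register model allows only positive increments")
        a[i] += c
        if model == "strict turnstile" and a[i] < 0:
            raise ValueError("strict turnstile model forbids negative counts")
    return a

# Example: three unit insertions and one deletion in the turnstile model.
print(apply_updates(4, [(2, 1), (2, 1), (0, 1), (2, -1)]))  # [1, 0, 1, 0]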
Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.
The performance of an algorithm that operates on data streams is measured by three basic factors: the number of passes the algorithm must make over the stream, the memory used, and the processing time per item.
These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.
If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an (ε, δ) approximation, meaning that the algorithm achieves an error of less than ε with probability 1 − δ.
Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on. [8] They also have applications in databases, such as estimating the size of a join.
The kth frequency moment of a set of frequencies a = (a1, ..., an) is defined as Fk(a) = a1^k + a2^k + ... + an^k.
The first moment F1 is simply the sum of the frequencies (i.e., the total count). The second moment F2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F∞ is defined as the frequency of the most frequent items.
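For reference, the moments can be computed exactly (though not in streaming fashion) from the full frequency table; this brute-force sketch is only for illustration and is not one of the streaming algorithms discussed below.

from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i a_i^k (requires storing all counts)."""
    counts = Counter(stream)
    if k == float("inf"):
        return max(counts.values())            # F_inf: frequency of the most frequent item
    return sum(c ** k for c in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
print(frequency_moment(stream, 0))             # 3 distinct elements
print(frequency_moment(stream, 1))             # 6, the length of the stream
print(frequency_moment(stream, 2))             # 3^2 + 2^2 + 1^2 = 14
print(frequency_moment(stream, float("inf")))  # 3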
The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.
A direct approach to finding the frequency moments requires maintaining a register mi for all distinct elements ai ∈ (1,2,3,4,...,N), which requires memory of order at least Ω(N). [3] But we have space limitations and require an algorithm that computes Fk in much lower memory. This can be achieved by using approximations instead of exact values: an algorithm that computes an (ε,δ)-approximation of Fk outputs a value F'k such that |F'k − Fk| ≤ εFk with probability at least 1 − δ, [9] where ε is the approximation parameter and δ is the confidence parameter. [10]
Flajolet et al. in [2] introduced a probabilistic method of counting which was inspired by a paper of Robert Morris. [11] Morris showed that if the requirement of accuracy is dropped, a counter n can be replaced by a counter log n, which can be stored in log log n bits. [12] Flajolet et al. in [2] improved this method by using a hash function h which is assumed to distribute the elements uniformly in the hash space (a binary string of length L).
Let bit(y,k) represent the kth bit in the binary representation of y.
Let ρ(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0).
Let A be the data stream of length M whose cardinality needs to be determined. Let BITMAP[0...L − 1] be the hash space where the ρ(hashed values) are recorded. The algorithm below then determines the approximate cardinality of A.
Procedure FM-Sketch:
    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of the leftmost 0 bit of BITMAP[]
    return 2^B
If there are N distinct elements in a data stream, then for i much larger than log N, BITMAP[i] is almost certainly 0; for i much smaller than log N, BITMAP[i] is almost certainly 1; and for i close to log N, BITMAP[i] is a fringe of 0's and 1's, so the position B of the leftmost 0 bit gives an estimate of log2 N.
The previous algorithm describes the first attempt to approximate F0 in the data stream by Flajolet and Martin. Their algorithm picks a random hash function which they assume to uniformly distribute the hash values in hash space.
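A minimal Python rendering of the FM-Sketch procedure above is shown below; the choice of hash function, the bit length L, and the correction constant φ ≈ 0.77351 (by which Flajolet and Martin divide the raw estimate 2^B) are illustrative assumptions rather than part of the pseudocode itself.

import hashlib

L = 32  # length of the binary hash values

def h(x):
    """Hash an item to an L-bit integer (illustrative choice of hash function)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << L)

def rho(y):
    """Position of the least significant 1-bit of y; convention rho(0) = L - 1."""
    if y == 0:
        return L - 1
    return (y & -y).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin style estimate of the number of distinct elements."""
    bitmap = [0] * L
    for x in stream:
        bitmap[rho(h(x))] = 1
    b = bitmap.index(0) if 0 in bitmap else L   # position of the leftmost 0 bit
    return 2 ** b / 0.77351                      # divide by the FM correction constant

print(fm_estimate(["a", "b", "a", "c", "a", "b", "d"]))  # roughly 4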
Bar-Yossef et al. in [10] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, which can be normalized to [0,1] as h : [m] → [0,1], but fixed a limit t on the number of hash values kept. The value of t is assumed to be of the order O(1/ε^2) (i.e., a smaller approximation error ε requires a larger t). The KMV algorithm keeps only the t smallest hash values seen. After all m values of the stream have arrived, υ = Max(KMV) is used to calculate F'0 = t/υ. That is, in a close-to-uniform hash space, they expect at least t elements to be less than O(t/F'0).
Procedure 2 K-Minimum Value:
    Initialize the first t values of KMV
    for a in a1 to an do
        if h(a) < Max(KMV) then
            Remove Max(KMV) from the KMV set
            Insert h(a) into KMV
        end if
    end for
    return t/Max(KMV)
The KMV algorithm can be implemented in O((1/ε^2) · log m) bits of memory. Each hash value requires space of order O(log m) memory bits, and there are O(1/ε^2) hash values. The access time can be reduced if we store the t hash values in a binary tree; the time complexity is then reduced to O(log(1/ε) · log m).
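The following Python sketch captures the KMV idea, using a sorted list in place of the binary tree mentioned above; the hash function normalized to [0, 1) and the default t are illustrative assumptions.

import bisect
import hashlib

def h01(x):
    """Hash an item to a value in [0, 1) (illustrative hash, assumed close to uniform)."""
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / float(1 << 64)

def kmv_estimate(stream, t=64):
    """k-minimum-values estimate of the number of distinct elements, keeping the t smallest hashes."""
    kmv = []  # sorted list of (up to t) smallest distinct hash values seen so far
    for a in stream:
        v = h01(a)
        pos = bisect.bisect_left(kmv, v)
        if pos < len(kmv) and kmv[pos] == v:
            continue                           # this hash value is already recorded
        if len(kmv) < t:
            kmv.insert(pos, v)
        elif v < kmv[-1]:
            kmv.insert(pos, v)
            kmv.pop()                          # discard the largest of the t + 1 values
    if len(kmv) < t:
        return len(kmv)                        # fewer than t distinct hashes: count is exact
    return t / kmv[-1]                         # estimate F_0 as t / Max(KMV)

print(kmv_estimate([i % 1000 for i in range(100000)], t=64))  # roughly 1000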
Alon et al. estimate Fk by defining random variables that can be computed within the given space and time. [3] The expected value of these random variables gives the approximate value of Fk.
Assume the length of the sequence m is known in advance. Then construct a random variable X as follows: pick a position p in the stream uniformly at random, let ap be the element at that position, let r = |{q : q ≥ p, aq = ap}| be the number of occurrences of ap from position p onward, and set X = m(r^k − (r − 1)^k).
Assume S1 is of the order O(n^(1 − 1/k)/λ^2) and S2 is of the order O(log(1/ε)). The algorithm takes S2 random variables Y1, Y2, ..., YS2 and outputs the median Y, where Yi is the average of Xij for 1 ≤ j ≤ S1.
Now calculate the expectation of the random variable, E(X); a short calculation shows that E(X) = Fk.
From the algorithm to calculate Fk discussed above, we can see that each random variable X stores the values of ap and r. So, to compute X we need to maintain only log(n) bits for storing ap and log(n) bits for storing r. The total number of random variables X will be S1 · S2.
Hence the total space complexity the algorithm takes is of the order of O((k · log(1/ε) / λ^2) · n^(1 − 1/k) · (log n + log m)).
The previous algorithm calculates F2 in order of O(√n · (log m + log n)) memory bits. Alon et al. in [3] simplified this algorithm using a four-wise independent random variable with values mapped to {−1, 1}.
This further reduces the complexity of calculating F2 to O((log(1/ε) / λ^2) · (log n + log m)).
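A compact sketch of an AMS-style F2 estimator in this spirit is given below. The four-wise independent ±1 variables are approximated by a random degree-3 polynomial over a prime field; the specific prime, the parameters s1 and s2, and the hash construction are illustrative assumptions, not the exact scheme of the paper.

import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1, used for the polynomial hash

def make_sign_hash():
    """A random degree-3 polynomial over GF(P), mapped to {-1, +1} by parity."""
    coeffs = [random.randrange(P) for _ in range(4)]
    def s(i):
        v = 0
        for c in coeffs:
            v = (v * i + c) % P
        return 1 if v & 1 else -1
    return s

def ams_f2(stream, s1=16, s2=9):
    """AMS-style estimate of F_2: median over s2 groups of the mean of s1 basic estimators."""
    hashes = [[make_sign_hash() for _ in range(s1)] for _ in range(s2)]
    sums = [[0] * s1 for _ in range(s2)]
    for x in stream:                     # one pass over the stream
        i = hash(x)                      # map the item to an integer key
        for g in range(s2):
            for j in range(s1):
                sums[g][j] += hashes[g][j](i)
    means = [sum(z * z for z in row) / s1 for row in sums]   # E[Z^2] = F_2
    means.sort()
    return means[len(means) // 2]        # median of means

stream = ["a", "b", "a", "c", "a", "b"]
print(ams_f2(stream))                    # a randomized estimate of F_2 = 14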
In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.
More formally, fix some positive constant c > 1, let the length of the stream be m, and let fi denote the frequency of value i in the stream. The frequent elements problem is to output the set { i | fi > m/c }. [13]
Some notable algorithms for this problem include the Boyer–Moore majority vote algorithm, the Misra–Gries summary, lossy counting, and the count–min sketch (described below).
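As a rough illustration of one such approach, the sketch below implements a minimal Misra–Gries summary: it keeps at most c − 1 counters and is guaranteed to retain every value whose frequency exceeds m/c. The parameter names and example stream are assumptions for the illustration.

def misra_gries(stream, c):
    """Misra-Gries summary: returns a superset of the items with frequency > m/c,
    using at most c - 1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters   # candidate frequent elements (a second pass can verify exact counts)

stream = list("abacaadacab")
print(misra_gries(stream, c=3))   # 'a' (frequency 6 > 11/3) is guaranteed to survive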
Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequencies are determined using one of these algorithms, then the largest increase over the previous time point is reported as the trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization. [14]
Counting the number of distinct elements in a stream (sometimes called the F0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem. [15] It uses O(ε^−2 + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).
The (empirical) entropy of a set of frequencies a is defined as −Σi (ai/m) log(ai/m), where m = Σi ai.
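For reference, the empirical entropy of a small stream can be computed exactly as follows (a streaming algorithm would instead approximate it in sublinear space); the use of base-2 logarithms here is an illustrative choice.

from collections import Counter
from math import log2

def empirical_entropy(stream):
    """Exact empirical entropy -sum_i (a_i/m) log(a_i/m) of a stream, in bits."""
    counts = Counter(stream)
    m = sum(counts.values())
    return -sum((a / m) * log2(a / m) for a in counts.values())

print(empirical_entropy(["a", "b", "a", "c"]))  # 1.5 bits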
Learn a model (e.g. a classifier) by a single pass over a training set.
Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.
A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff
utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.
In computer science, a perfect hash function h for a set S is a hash function that maps distinct elements in S to a set of m integers, with no collisions. In mathematical terms, it is an injective function.
Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.
In computational complexity theory, the 3SUM problem asks if a given set of real numbers contains three elements that sum to zero. A generalized version, k-SUM, asks the same question on k elements, rather than simply 3. 3SUM can be easily solved in O(n^2) time, and matching lower bounds are known in some specialized models of computation.
In computer science, a selection algorithm is an algorithm for finding the kth smallest value in a collection of ordered values, such as numbers. The value that it finds is called the kth order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect and the median of medians algorithm. When applied to a collection of n values, these algorithms take linear time, O(n) as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes O(1) time.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.
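A minimal Bloom filter sketch in Python is shown below; the bit-array size, the number of hash functions, and the double-hashing construction are illustrative choices rather than part of any particular implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash positions derived by double hashing."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force h2 to be odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print("alice" in bf)   # True
print("bob" in bf)     # False with high probability (false positives remain possible)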
In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.
Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partition the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing p − 1 < s elements from the result. These elements then divide the array into p approximately equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W. D. Frazer and A. C. McKellar.
In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
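The sketch below estimates Jaccard similarity with MinHash signatures; the number of hash functions and the salted-hash construction are assumptions for illustration, not the original AltaVista scheme.

import hashlib

def minhash_signature(items, num_hashes=128):
    """MinHash signature: for each salted hash function, keep the minimum hash over the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of coordinates on which the signatures agree estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"the", "quick", "brown", "fox"}
b = {"the", "quick", "brown", "dog"}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))  # close to 3/5 = 0.6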
In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables. Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.
In computing, the count–min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.
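A minimal count–min sketch in Python (cash-register model) is given below; the width, depth, and salted hash functions are illustrative choices, not the parameters of the original paper.

import hashlib

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters, one salted hash function per row."""

    def __init__(self, w=272, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Minimum over rows: never underestimates, may overestimate due to collisions.
        return min(self.table[row][self._index(row, item)] for row in range(self.d))

cms = CountMinSketch()
for x in "abracadabra":
    cms.add(x)
print(cms.estimate("a"))  # at least 5, the true count of 'a'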
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.
In computer science, the count-distinct problem (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.
The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.
The distributional learning theory or learning of probability distributions is a framework in computational learning theory. It was proposed by Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and was inspired by the PAC framework introduced by Leslie Valiant.
A cuckoo filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set, like a Bloom filter does. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters.
In computer science, a retrieval data structure, also known as a static function, is a space-efficient dictionary-like data type composed of a collection of (key, value) pairs that allows the following operations:
Approximate Membership Query Filters comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of ε.