In computer science, the count-distinct problem [1] (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.
An example of an instance for the cardinality estimation problem is the stream: . For this instance, .
The naive solution to the problem is as follows:
Initialize a counter, c, to zero, . Initialize an efficient dictionary data structure, D, such as hash table or search tree in which insertion and membership can be performed quickly. For each element , a membership query is issued. If is not a member of D()Add to D Increase c by one, Otherwise () do nothing. Output .
As long as the number of distinct elements is not too big, D fits in main memory and an exact answer can be retrieved. However, this approach does not scale for bounded storage, or if the computation performed for each element should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.
To handle the bounded storage constraint, streaming algorithms use a randomization to produce a non-exact estimation of the distinct number of elements, . State-of-the-art estimators hash every element into a low-dimensional data sketch using a hash function, . The different techniques can be classified according to the data sketches they store.
Min/max sketches [2] [3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. [4] presents max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator [5] is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm. [6]
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element is associated with a uniform RV, , the expected minimum value of is . The hash function guarantees that is identical for all the appearances of . Thus, the existence of duplicates does not affect the value of the extreme order statistics.
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation [7] describes the Flajolet–Martin algorithm, a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff. [8]
Bottom-m sketches [9] are a generalization of min sketches, which maintain the minimal values, where . See Cosma et al. [2] for a theoretical overview of count-distinct estimation algorithms, and Metwally [10] for a practical overview with comparative simulation results.
defalgorithm_d(stream,s):m=len(stream)# We assume that this is given to us in advance.t=-1# Note that Knuth indexes the stream from 1. p=1a=0buffer=[]whilet<(m-1):t+=1a=stream[t]u=uniform(0,1)buffer=list(filter(lambdax:x[1]!=a,buffer))ifu<p:if(len(buffer)<s):buffer.append([u,a])else:buffer=sorted(buffer)p=max(buffer[-1][0],u)buffer.pop()buffer.append([u,a])returnlen(buffer)/p
Compared to other approximation algorithms for the count-distinct problem the CVM Algorithm [11] (named by Donald Knuth after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM Algorithm provides an unbiased estimator for the number of distinct elements in a stream, [12] in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth. [12]
Initialize Initialize max buffer size , where Initialize an empty buffer, BFor each element in data stream of size do: If is in B thenDelete from B random number in If thenIf then insert in B else such that /* whose is maximum in B */ If then else Replace with End Forreturn .
The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, that adds the while loop to ensure B is reduced. [12]
Initialize Initialize max buffer size , where Initialize an empty buffer, BFor each element in data stream of size do: If is in B thenDelete from B random number in If then Insert into BWhile then Remove every element of of B with End While If then Insert into BEnd Forreturn .
In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights. Formally,
An example of an instance for the weighted problem is: . For this instance, , the weights are and .
As an application example, could be IP packets received by a server. Each packet belongs to one of IP flows . The weight can be the load imposed by flow on the server. Thus, represents the total load imposed on the server by all the flows to which packets belong.
Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem . [13] For example, the weighted estimator proposed by Cohen et al. [5] can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm [6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all the other known algorithms for the weighted problem.
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
In statistics, the kth order statistic of a statistical sample is equal to its kth-smallest value. Together with rank statistics, order statistics are among the most fundamental tools in non-parametric statistics and inference.
FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration.
In statistics, the method of moments is a method of estimation of population parameters. The same principle is used to derive higher moments like skewness and kurtosis.
MUSIC is an algorithm used for frequency estimation and radio direction finding.
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
In statistical signal processing, the goal of spectral density estimation (SDE) or simply spectral estimation is to estimate the spectral density of a signal from a sequence of time samples of the signal. Intuitively speaking, the spectral density characterizes the frequency content of the signal. One purpose of estimating the spectral density is to detect any periodicities in the data, by observing peaks at the frequencies corresponding to these periodicities.
In statistical theory, a U-statistic is a class of statistics defined as the average over the application of a given function applied to all tuples of a fixed size. The letter "U" stands for unbiased. In elementary statistics, U-statistics arise naturally in producing minimum-variance unbiased estimators.
An important aspect in the study of elliptic curves is devising effective ways of counting points on the curve. There have been several approaches to do so, and the algorithms devised have proved to be useful tools in the study of various fields such as number theory, and more recently in cryptography and Digital Signature Authentication. While in number theory they have important consequences in the solving of Diophantine equations, with respect to cryptography, they enable us to make effective use of the difficulty of the discrete logarithm problem (DLP) for the group , of elliptic curves over a finite field , where q = pk and p is a prime. The DLP, as it has come to be known, is a widely used approach to public key cryptography, and the difficulty in solving this problem determines the level of security of the cryptosystem. This article covers algorithms to count points on elliptic curves over fields of large characteristic, in particular p > 3. For curves over fields of small characteristic more efficient algorithms based on p-adic methods exist.
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.
In discrete mathematics, ideal lattices are a special class of lattices and a generalization of cyclic lattices. Ideal lattices naturally occur in many parts of number theory, but also in other areas. In particular, they have a significant place in cryptography. Micciancio defined a generalization of cyclic lattices as ideal lattices. They can be used in cryptosystems to decrease by a square root the number of parameters necessary to describe a lattice, making them more efficient. Ideal lattices are a new concept, but similar lattice classes have been used for a long time. For example, cyclic lattices, a special case of ideal lattices, are used in NTRUEncrypt and NTRUSign.
In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
ACE is the collection of units, implementing both a public key encryption scheme and a digital signature scheme. Corresponding names for these schemes — «ACE Encrypt» and «ACE Sign». Schemes are based on Cramer-Shoup public key encryption scheme and Cramer-Shoup signature scheme. Introduced variants of these schemes are intended to achieve a good balance between performance and security of the whole encryption system.
Network coding has been shown to optimally use bandwidth in a network, maximizing information flow but the scheme is very inherently vulnerable to pollution attacks by malicious nodes in the network. A node injecting garbage can quickly affect many receivers. The pollution of network packets spreads quickly since the output of honest node is corrupted if at least one of the incoming packets is corrupted.
In computing, the count–min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.
Non-commutative cryptography is the area of cryptology where the cryptographic primitives, methods and systems are based on algebraic structures like semigroups, groups and rings which are non-commutative. One of the earliest applications of a non-commutative algebraic structure for cryptographic purposes was the use of braid groups to develop cryptographic protocols. Later several other non-commutative structures like Thompson groups, polycyclic groups, Grigorchuk groups, and matrix groups have been identified as potential candidates for cryptographic applications. In contrast to non-commutative cryptography, the currently widely used public-key cryptosystems like RSA cryptosystem, Diffie–Hellman key exchange and elliptic curve cryptography are based on number theory and hence depend on commutative algebraic structures.
The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al.
LSH is a cryptographic hash function designed in 2014 by South Korea to provide integrity in general-purpose software environments such as PCs and smart devices. LSH is one of the cryptographic algorithms approved by the Korean Cryptographic Module Validation Program (KCMVP). And it is the national standard of South Korea.
Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms. It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams.
{{cite journal}}
: Cite journal requires |journal=
(help){{cite journal}}
: Cite journal requires |journal=
(help){{cite journal}}
: Cite journal requires |journal=
(help)