The balls into bins (or balanced allocations) problem is a classic problem in probability theory that has many applications in computer science. The problem involves m balls and n boxes (or "bins"). At each step, a single ball is placed into one of the bins. After all balls have been placed, we look at the number of balls in each bin; this number is called the load of the bin. The problem can be modelled using a multinomial distribution, and may involve asking a question such as: What is the expected number of bins with a ball in them? [1]
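For instance, since each bin remains empty with probability $(1 - 1/n)^m$, linearity of expectation gives $n\left(1 - (1 - 1/n)^m\right)$ for the expected number of non-empty bins. The following Python sketch (the function names are illustrative, not from the cited source) compares this closed form with a small Monte Carlo simulation:

```python
import random

def expected_nonempty_bins(m, n):
    """Exact expectation of the number of non-empty bins: each bin is empty
    with probability (1 - 1/n)**m, so by linearity of expectation the
    expected number of non-empty bins is n * (1 - (1 - 1/n)**m)."""
    return n * (1 - (1 - 1 / n) ** m)

def simulate_nonempty_bins(m, n, trials=10_000):
    """Monte Carlo estimate of the same quantity."""
    total = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(m):
            loads[random.randrange(n)] += 1
        total += sum(1 for load in loads if load > 0)
    return total / trials

if __name__ == "__main__":
    m, n = 100, 50
    print(expected_nonempty_bins(m, n))   # exact: about 43.4 for these values
    print(simulate_nonempty_bins(m, n))   # close to the exact value
```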
Obviously, it is possible to make the maximum load as small as $\lceil m/n \rceil$ by putting each ball into the least loaded bin. The interesting case is when the bin is selected at random, or at least partially at random. A powerful balls-into-bins paradigm is the "power of two random choices", [2] where each ball chooses two (or more) random bins and is placed in the least loaded of them. This paradigm has found wide practical application in shared-memory emulations, efficient hashing schemes, randomized load balancing of tasks on servers, and routing of packets within parallel networks and data centers. [2]
When the bin for each ball is selected at random, independently of other choices, the maximum load might be as large as $m$. However, it is possible to calculate a tighter bound that holds with high probability. A "high probability" is a probability $1 - o(1)$, i.e. the probability tends to $1$ when $n$ grows to infinity.
For the case $m = n$, with probability $1 - o(1)$ the maximum load is: [3] [4]
$\frac{\ln n}{\ln \ln n}(1 + o(1))$.
Gonnet [5] gave a tight bound for the expected value of the maximum load, which for $m = n$ is $\Gamma^{-1}(n) - \frac{3}{2} + o(1)$, where $\Gamma^{-1}$ is the inverse of the gamma function, and it is known [6] that $\Gamma^{-1}(n) = \frac{\ln n}{\ln \ln n}(1 + o(1))$.
The maximum load can also be calculated for $m \neq n$: for example, for $m \gg n \ln n$ it is $\frac{m}{n} + \Theta\!\left(\sqrt{\frac{m \ln n}{n}}\right)$, and for $m \ll n \ln n$ it is $\Theta\!\left(\frac{\ln n}{\ln \frac{n \ln n}{m}}\right)$, with high probability. [7]
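As a sanity check on the $m = n$ case, the following sketch (illustrative only, not from the cited papers) throws $n$ balls into $n$ bins uniformly at random and prints the observed maximum load next to the leading term $\ln n / \ln\ln n$ of the bound; for moderate $n$ the observed value is typically somewhat larger because of the $(1 + o(1))$ factor:

```python
import math
import random

def max_load_one_choice(n):
    """Throw n balls into n bins, each bin chosen uniformly at random,
    and return the resulting maximum load."""
    loads = [0] * n
    for _ in range(n):
        loads[random.randrange(n)] += 1
    return max(loads)

if __name__ == "__main__":
    n = 1_000_000
    observed = max_load_one_choice(n)
    predicted = math.log(n) / math.log(math.log(n))  # leading term only
    print(f"observed max load: {observed}")
    print(f"ln n / ln ln n:    {predicted:.2f}")
```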
Exact probabilities of the maximum load for small values of $m$ and $n$ can be computed from the sequence OEIS A208250.
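For very small instances, the exact distribution of the maximum load can also be obtained by brute force. This sketch (it enumerates allocations directly rather than using the OEIS data) lists, for each possible maximum load, the exact probability of observing it:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def max_load_distribution(m, n):
    """Exact distribution of the maximum load when m balls are thrown
    independently and uniformly into n bins, obtained by enumerating all
    n**m equally likely allocations (feasible only for tiny m and n)."""
    counts = Counter()
    for allocation in product(range(n), repeat=m):
        loads = Counter(allocation)
        counts[max(loads.values())] += 1
    total = n ** m
    return {k: Fraction(v, total) for k, v in sorted(counts.items())}

if __name__ == "__main__":
    for load, prob in max_load_distribution(4, 4).items():
        print(load, prob)   # e.g. maximum load 1 has probability 4!/4**4 = 3/32
```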
Instead of just selecting a random bin for each ball, it is possible to select two or more bins for each ball and then put the ball into the least loaded of them. This is a compromise between a deterministic allocation, in which all bins are checked and the least loaded bin is selected, and a totally random allocation, in which a single bin is selected without checking other bins. This paradigm, often called the "power of two random choices", has been studied in a number of settings, described below. [2]
In the simplest case, if one allocates $m$ balls into $n$ bins (with $m = n$) sequentially one by one, and for each ball one chooses $d \ge 2$ random bins at each step and then allocates the ball into the least loaded of the selected bins (ties broken arbitrarily), then with high probability the maximum load is: [8]
$\frac{\ln \ln n}{\ln d} + \Theta(1),$
which is almost exponentially smaller than with a totally random allocation.
This result can be generalized to the heavily loaded case (with $m \gg n$), when with high probability the maximum load is: [9]
$\frac{m}{n} + \frac{\ln \ln n}{\ln d} + \Theta(1),$
which is tight up to an additive constant. (All the bounds hold with probability at least $1 - \frac{1}{n^{c}}$ for any constant $c > 0$.) Note that for $m \gg n \ln n$, the single-choice random allocation process gives only a maximum load of $\frac{m}{n} + \Theta\!\left(\sqrt{\frac{m \ln n}{n}}\right)$ with high probability, so the improvement between these two processes is especially visible for large values of $m$.
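A short simulation illustrates the gap between the two processes for $m = n$. The sketch below (illustrative, not taken from the cited papers) allocates with $d$ random choices per ball and reports the resulting maximum load:

```python
import random

def max_load(m, n, d):
    """Allocate m balls into n bins; each ball picks d bins uniformly at
    random (with replacement) and goes into the least loaded of them."""
    loads = [0] * n
    for _ in range(m):
        candidates = [random.randrange(n) for _ in range(d)]
        best = min(candidates, key=lambda b: loads[b])
        loads[best] += 1
    return max(loads)

if __name__ == "__main__":
    n = 100_000
    print("one choice  (d=1):", max_load(n, n, 1))  # Θ(ln n / ln ln n): usually 6-8 here
    print("two choices (d=2):", max_load(n, n, 2))  # ln ln n / ln 2 + O(1): usually about 4 here
```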
Other key variants of the paradigm are "parallel balls-into-bins" where balls choose random bins in parallel, [10] "weighted balls-into-bins" where balls have non-unit weights, [11] [12] [13] and "balls-into-bins with deletions" where balls can be added as well as deleted. [2] [14]
Instead of placing just $m$ balls, it is possible to consider an infinite process in which, at each time step, a single ball is added and a single ball is removed, so that the number of balls remains constant. For $m = n$, after a sufficiently long time, with high probability the maximum load is similar to that of the finite version, both with random allocation and with partially random allocation. [8]
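A minimal sketch of this infinite process, assuming two choices per insertion and assuming (since the text does not specify the deletion rule) that the removed ball is chosen uniformly at random among the current balls:

```python
import random

def infinite_process_max_load(n, steps, d=2):
    """Start with n balls placed by the d-choice rule, then repeatedly remove
    one uniformly random ball and insert a new ball with the d-choice rule,
    keeping the total number of balls fixed at n."""
    loads = [0] * n
    balls = []  # bin index of each ball, so a random ball can be deleted

    def insert():
        candidates = [random.randrange(n) for _ in range(d)]
        best = min(candidates, key=lambda b: loads[b])
        loads[best] += 1
        balls.append(best)

    for _ in range(n):
        insert()
    for _ in range(steps):
        victim = balls.pop(random.randrange(len(balls)))  # delete a random ball
        loads[victim] -= 1
        insert()
    return max(loads)

if __name__ == "__main__":
    print(infinite_process_max_load(n=10_000, steps=100_000))
```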
In a repeated variant of the process, $m$ balls are initially distributed among $n$ bins in an arbitrary way and then, in every subsequent step of a discrete-time process, one ball is chosen from each non-empty bin and re-assigned to one of the $n$ bins uniformly at random. When $m = n$, it has been shown that with high probability the process converges to a configuration with maximum load $O(\log n)$ after $O(n)$ steps. [15]
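The repeated process can likewise be simulated directly. The sketch below (illustrative; the all-balls-in-one-bin start is an arbitrary worst-case choice) runs the synchronous re-assignment rule and reports the maximum load after a given number of rounds:

```python
import random

def repeated_process(n, steps):
    """Repeated balls-into-bins: start from an arbitrary configuration of n
    balls in n bins; in each round, take one ball out of every non-empty bin
    and re-assign each of those balls to a uniformly random bin."""
    loads = [0] * n
    loads[0] = n  # deliberately bad starting configuration
    for _ in range(steps):
        moving = sum(1 for load in loads if load > 0)  # one ball leaves each non-empty bin
        for i in range(n):
            if loads[i] > 0:
                loads[i] -= 1
        for _ in range(moving):
            loads[random.randrange(n)] += 1
    return max(loads)

if __name__ == "__main__":
    print(repeated_process(n=1_000, steps=5_000))  # maximum load after the process has mixed
```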
Online load balancing: [16] consider a set of $n$ identical computers and $n$ users who need computing services. The users are not coordinated: each user arrives on their own and selects which computer to use. Each user would of course like to select the least loaded computer, but this requires checking the load on every computer, which might take a long time. Another option is to select a computer at random; this leads, with high probability, to a maximum load of $\Theta\!\left(\frac{\log n}{\log\log n}\right)$. A possible compromise is for the user to check only two computers, and use the less loaded of the two. This leads, with high probability, to a much smaller maximum load of $\frac{\ln\ln n}{\ln 2} + O(1)$.
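A toy two-choice dispatcher makes the compromise concrete; the class and method names below are illustrative, not part of the cited work:

```python
import random

class TwoChoiceDispatcher:
    """Each arriving user samples two servers uniformly at random and is
    assigned to the less loaded of the two (ties go to the first sample)."""

    def __init__(self, n_servers):
        self.loads = [0] * n_servers

    def assign(self):
        a = random.randrange(len(self.loads))
        b = random.randrange(len(self.loads))
        chosen = a if self.loads[a] <= self.loads[b] else b
        self.loads[chosen] += 1
        return chosen

if __name__ == "__main__":
    dispatcher = TwoChoiceDispatcher(n_servers=10_000)
    for _ in range(10_000):        # n users, n servers
        dispatcher.assign()
    print(max(dispatcher.loads))   # typically 3-4 here, i.e. Θ(log log n)
```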
Hashing: consider a hash table in which all keys mapped to the same location are stored in a linked list. The efficiency of accessing a key depends on the length of its list. If we use a single hash function which selects locations with uniform probability, with high probability the longest chain has $\Theta\!\left(\frac{\log n}{\log\log n}\right)$ keys. A possible improvement is to use two hash functions, and put each new key in the shorter of its two lists. In this case, with high probability the longest chain has only $\Theta(\log\log n)$ elements. [17]
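A minimal 2-choice chaining sketch, assuming the two hash functions are simulated by salting Python's built-in hash (a simplification for illustration, not a production design):

```python
import random
from collections import defaultdict

class TwoChoiceHashTable:
    """Chained hash table with two hash functions: each key is appended to
    the shorter of its two candidate chains and looked up in both."""

    def __init__(self, n_buckets):
        self.n = n_buckets
        self.buckets = defaultdict(list)
        self.salts = (random.random(), random.random())  # simulate two hash functions

    def _positions(self, key):
        return tuple(hash((salt, key)) % self.n for salt in self.salts)

    def insert(self, key):
        p1, p2 = self._positions(key)
        target = p1 if len(self.buckets[p1]) <= len(self.buckets[p2]) else p2
        self.buckets[target].append(key)

    def lookup(self, key):
        p1, p2 = self._positions(key)
        return key in self.buckets[p1] or key in self.buckets[p2]

if __name__ == "__main__":
    table = TwoChoiceHashTable(n_buckets=10_000)
    for i in range(10_000):
        table.insert(f"key-{i}")
    print(max(len(chain) for chain in table.buckets.values()))  # longest chain, Θ(log log n) w.h.p.
    print(table.lookup("key-42"))  # True
```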
Fair cake-cutting: consider the problem of creating a partially proportional division of a heterogeneous resource among $n$ people, such that each person receives a part of the resource which that person values as at least $\frac{1}{an}$ of the total, where $a$ is some sufficiently large constant. The Edmonds–Pruhs protocol is a randomized algorithm whose analysis makes use of balls-into-bins arguments. [18]