Yao's principle

Last updated

In computational complexity theory, Yao's principle (also called Yao's minimax principle or Yao's lemma) states that the expected cost of a randomized algorithm on the worst-case input is no better than the expected cost for a worst-case probability distribution on the inputs of the deterministic algorithm that performs best against that distribution. Thus, to establish a lower bound on the performance of randomized algorithms, it suffices to find an appropriate distribution of difficult inputs, and to prove that no deterministic algorithm can perform well against that distribution. This principle is named after Andrew Yao, who first proposed it.

Contents

Yao's principle may be interpreted in game theoretic terms, via a two-player zero-sum game in which one player, Alice, selects a deterministic algorithm, the other player, Bob, selects an input, and the payoff is the cost of the selected algorithm on the selected input. Any randomized algorithm R may be interpreted as a randomized choice among deterministic algorithms, and thus as a strategy for Alice. By von Neumann's minimax theorem, Bob has a randomized strategy that performs at least as well against R as it does against the best pure strategy Alice might choose; that is, Bob's strategy defines a distribution on the inputs such that the expected cost of R on that distribution (and therefore also the worst case expected cost of R) is no better than the expected cost of any single deterministic algorithm against the same distribution.

Statement

The formulation below states the principle for Las Vegas randomized algorithms, i.e., distributions over deterministic algorithms that are correct on every input but have varying costs. It is straightforward to adapt the principle to Monte Carlo algorithms, i.e., distributions over deterministic algorithms that have bounded costs but can be incorrect on some inputs.

Consider a problem over the inputs , and let be the set of all possible deterministic algorithms that correctly solve the problem. For any algorithm and input , let be the cost of algorithm run on input .

Let be a probability distribution over the algorithms , and let denote a random algorithm chosen according to . Let be a probability distribution over the inputs , and let denote a random input chosen according to . Then,

That is, the worst-case expected cost of the randomized algorithm is at least the expected cost of the best deterministic algorithm against input distribution .

Proof

Let and . We have

As mentioned above, this theorem can also be seen as a very special case of the Minimax theorem.

Related Research Articles

In probability theory, the expected value of a random variable , often denoted , , or , is a generalization of the weighted average, and is intuitively the arithmetic mean of a large number of independent realizations of . The expectation operator is also commonly stylized as or . The expected value is also known as the expectation, mathematical expectation, mean, average, or first moment. Expected value is a key concept in economics, finance, and many other subjects.

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output are random variables.

In computational complexity theory, DSPACE or SPACE is the computational resource describing the resource of memory space for a deterministic Turing machine. It represents the total amount of memory space that a "normal" physical computer would need to solve a given computational problem with a given algorithm.

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided die rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

In cryptography and the theory of computation, the next-bit test is a test against pseudo-random number generators. We say that a sequence of bits passes the next bit test for at any position in the sequence, if any attacker who knows the first bits cannot predict the st with reasonable computational power.

Empirical risk minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data.

Kernel method

In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs of data points in raw representation.

In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares estimates the conditional mean of the response variable across values of the predictor variables, quantile regression estimates the conditional median of the response variable. Quantile regression is an extension of linear regression used when the conditions of linear regression are not met.

Learning with errors (LWE) is the computational problem of inferring a linear -ary function over a finite ring from given samples some of which may be erroneous. The LWE problem is conjectured to be hard to solve, and thus be useful in cryptography.

Uncertainty theory is a branch of mathematics based on normality, monotonicity, self-duality, countable subadditivity, and product measure axioms.

Randomized algorithms are algorithms that employ a degree of randomness as part of their logic. These algorithms can be used to give good average-case results (complexity-wise) to problems which are hard to solve deterministically, or display poor worst-case complexity. An algorithmic game theoretic approach can help explain why in the average case randomized algorithms may work better than deterministic algorithms.

Gradient boosting Machine learning technique

Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.

Within computer science and operations research, many combinatorial optimization problems are computationally intractable to solve exactly . Many such problems do admit fast approximation algorithms—that is, algorithms that are guaranteed to return an approximately optimal solution given any input.

In coding theory, generalized minimum-distance (GMD) decoding provides an efficient algorithm for decoding concatenated codes, which is based on using an errors-and-erasures decoder for the outer code.

Regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other machine-learning algorithms. SVM algorithms categorize multidimensional data, with the goal of fitting the training set data well, but also avoiding overfitting, so that the solution generalizes to new data points. Regularization algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set, but also is not too complicated, where complicated functions are functions with high norms in some function space. Specifically, Tikhonov regularization algorithms choose a function that minimizes the sum of training-set error plus the function's norm. The training-set error can be calculated with different loss functions. For example, regularized least squares is a special case of Tikhonov regularization using the squared error loss as the loss function.

Sample complexity

The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function.

Occam learning

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

Manifold regularization

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

The multiplicative weights update method is an algorithmic technique most commonly used for decision making and prediction, and also widely deployed in game theory and algorithm design. The simplest use case is the problem of prediction from expert advice, in which a decision maker needs to iteratively decide on an expert whose advice to follow. The method assigns initial weights to the experts, and updates these weights multiplicatively and iteratively according to the feedback of how well an expert performed: reducing it in case of poor performance, and increasing it otherwise. It was discovered repeatedly in very diverse fields such as machine learning, optimization, theoretical computer science, and game theory.

References