The multiplicative weights update method is an algorithmic technique most commonly used for decision making and prediction, and also widely deployed in game theory and algorithm design. The simplest use case is the problem of prediction from expert advice, in which a decision maker needs to iteratively decide on an expert whose advice to follow. The method assigns initial weights to the experts (usually identical initial weights), and updates these weights multiplicatively and iteratively according to the feedback of how well an expert performed: reducing it in case of poor performance, and increasing it otherwise. [1] It was discovered repeatedly in very diverse fields such as machine learning (AdaBoost, Winnow, Hedge), optimization (solving linear programs), theoretical computer science (devising fast algorithm for LPs and SDPs), and game theory.
"Multiplicative weights" implies the iterative rule used in algorithms derived from the multiplicative weight update method. [2] It is given with different names in the different fields where it was discovered or rediscovered.
The earliest known version of this technique was in an algorithm named "fictitious play" which was proposed in game theory in the early 1950s. Grigoriadis and Khachiyan [3] applied a randomized variant of "fictitious play" to solve two-player zero-sum games efficiently using the multiplicative weights algorithm. In this case, player allocates higher weight to the actions that had a better outcome and choose his strategy relying on these weights. In machine learning, Littlestone applied the earliest form of the multiplicative weights update rule in his famous winnow algorithm, which is similar to Minsky and Papert's earlier perceptron learning algorithm. Later, he generalized the winnow algorithm to weighted majority algorithm. Freund and Schapire followed his steps and generalized the winnow algorithm in the form of hedge algorithm.
The multiplicative weights algorithm is also widely applied in computational geometry such as Kenneth Clarkson's algorithm for linear programming (LP) with a bounded number of variables in linear time. [4] [5] Later, Bronnimann and Goodrich employed analogous methods to find set covers for hypergraphs with small VC dimension. [6]
In operations research and on-line statistical decision making problem field, the weighted majority algorithm and its more complicated versions have been found independently.
In computer science field, some researchers have previously observed the close relationships between multiplicative update algorithms used in different contexts. Young discovered the similarities between fast LP algorithms and Raghavan's method of pessimistic estimators for derandomization of randomized rounding algorithms; Klivans and Servedio linked boosting algorithms in learning theory to proofs of Yao's XOR Lemma; Garg and Khandekar defined a common framework for convex optimization problems that contains Garg-Konemann and Plotkin-Shmoys-Tardos as subcases. [1]
The Hedge algorithm is a special case of mirror descent.
A binary decision needs to be made based on n experts’ opinions to attain an associated payoff. In the first round, all experts’ opinions have the same weight. The decision maker will make the first decision based on the majority of the experts' prediction. Then, in each successive round, the decision maker will repeatedly update the weight of each expert's opinion depending on the correctness of his prior predictions. Real life examples includes predicting if it is rainy tomorrow or if the stock market will go up or go down.
Given a sequential game played between an adversary and an aggregator who is advised by N experts, the goal is for the aggregator to make as few mistakes as possible. Assume there is an expert among the N experts who always gives the correct prediction. In the halving algorithm, only the consistent experts are retained. Experts who make mistakes will be dismissed. For every decision, the aggregator decides by taking a majority vote among the remaining experts. Therefore, every time the aggregator makes a mistake, at least half of the remaining experts are dismissed. The aggregator makes at most log2(N) mistakes. [2]
Unlike halving algorithm which dismisses experts who have made mistakes, weighted majority algorithm discounts their advice. Given the same "expert advice" setup, suppose we have n decisions, and we need to select one decision for each loop. In each loop, every decision incurs a cost. All costs will be revealed after making the choice. The cost is 0 if the expert is correct, and 1 otherwise. this algorithm's goal is to limit its cumulative losses to roughly the same as the best of experts. The very first algorithm that makes choice based on majority vote every iteration does not work since the majority of the experts can be wrong consistently every time. The weighted majority algorithm corrects above trivial algorithm by keeping a weight of experts instead of fixing the cost at either 1 or 0. [1] This would make fewer mistakes compared to halving algorithm.
Initialization: Fix an . For each expert, associate the weight ≔1. For = , ,...,1. Make the prediction given by the weighted majority of the experts' predictions based on their weights. That is, choose 0 or 1 depending on which prediction has a higher total weight of experts advising it (breaking ties arbitrarily). 2. For every expert i that predicted wrongly, decrease his weight for the next round by multiplying it by a factor of (1-η): = (update rule)
If , the weight of the expert's advice will remain the same. When increases, the weight of the expert's advice will decrease. Note that some researchers fix in weighted majority algorithm.
After steps, let be the number of mistakes of expert i and be the number of mistakes our algorithm has made. Then we have the following bound for every :
.
In particular, this holds for i which is the best expert. Since the best expert will have the least , it will give the best bound on the number of mistakes made by the algorithm as a whole.
This algorithm can be understood as follows: [2] [8]
Given the same setup with N experts. Consider the special situation where the proportions of experts predicting positive and negative, counting the weights, are both close to 50%. Then, there might be a tie. Following the weight update rule in weighted majority algorithm, the predictions made by the algorithm would be randomized. The algorithm calculates the probabilities of experts predicting positive or negatives, and then makes a random decision based on the computed fraction:[ further explanation needed ]
predict
where
.
The number of mistakes made by the randomized weighted majority algorithm is bounded as:
where and .
Note that only the learning algorithm is randomized. The underlying assumption is that the examples and experts' predictions are not random. The only randomness is the randomness where the learner makes his own prediction. In this randomized algorithm, if . Compared to weighted algorithm, this randomness halved the number of mistakes the algorithm is going to make. [9] However, it is important to note that in some research, people define in weighted majority algorithm and allow in randomized weighted majority algorithm. [2]
The multiplicative weights method is usually used to solve a constrained optimization problem. Let each expert be the constraint in the problem, and the events represent the points in the area of interest. The punishment of the expert corresponds to how well its corresponding constraint is satisfied on the point represented by an event. [1]
Suppose we were given the distribution on experts. Let = payoff matrix of a finite two-player zero-sum game, with rows.
When the row player uses plan and the column player uses plan , the payoff of player is ≔, assuming .
If player chooses action from a distribution over the rows, then the expected result for player selecting action is .
To maximize , player should choose plan . Similarly, the expected payoff for player is . Choosing plan would minimize this payoff. By John Von Neumann's Min-Max Theorem, we obtain:
where P and i changes over the distributions over rows, Q and j changes over the columns.
Then, let denote the common value of above quantities, also named as the "value of the game". Let be an error parameter. To solve the zero-sum game bounded by additive error of ,
So there is an algorithm solving zero-sum game up to an additive factor of δ using O(log2(n)/) calls to ORACLE, with an additional processing time of O(n) per call [9]
Bailey and Piliouras showed that although the time average behavior of multiplicative weights update converges to Nash equilibria in zero-sum games the day-to-day (last iterate) behavior diverges away from it. [10]
In machine learning, Littlestone and Warmuth generalized the winnow algorithm to the weighted majority algorithm. [11] Later, Freund and Schapire generalized it in the form of hedge algorithm. [12] AdaBoost Algorithm formulated by Yoav Freund and Robert Schapire also employed the Multiplicative Weight Update Method. [1]
Based on current knowledge in algorithms, the multiplicative weight update method was first used in Littlestone's winnow algorithm. [1] It is used in machine learning to solve a linear program.
Given labeled examples where are feature vectors, and are their labels.
The aim is to find non-negative weights such that for all examples, the sign of the weighted combination of the features matches its labels. That is, require that for all . Without loss of generality, assume the total weight is 1 so that they form a distribution. Thus, for notational convenience, redefine to be , the problem reduces to finding a solution to the following LP:
, , .
This is general form of LP.
The hedge algorithm is similar to the weighted majority algorithm. However, their exponential update rules are different. [2] It is generally used to solve the problem of binary allocation in which we need to allocate different portion of resources into N different options. The loss with every option is available at the end of every iteration. The goal is to reduce the total loss suffered for a particular allocation. The allocation for the following iteration is then revised, based on the total loss suffered in the current iteration using multiplicative update. [13]
Assume the learning rate and for , is picked by Hedge. Then for all experts ,
Initialization: Fix an . For each expert, associate the weight ≔1 For t=1,2,...,T:
1. Pick the distribution where . 2. Observe the cost of the decision . 3. Set ).
This algorithm [12] maintains a set of weights over the training examples. On every iteration , a distribution is computed by normalizing these weights. This distribution is fed to the weak learner WeakLearn which generates a hypothesis that (hopefully) has small error with respect to the distribution. Using the new hypothesis , AdaBoost generates the next weight vector . The process repeats. After T such iterations, the final hypothesis is the output. The hypothesis combines the outputs of the T weak hypotheses using a weighted majority vote. [12]
Input: Sequence of labeled examples (,),...,(, ) Distribution over the examples Weak learning algorithm "'WeakLearn"' Integer specifying number of iterations Initialize the weight vector: for . Do for1. Set . 2. Call WeakLearn, providing it with the distribution ; get back a hypothesis [0,1]. 3. Calculate the error of . 4. Set . 5. Set the new weight vector to be . Output the hypothesis:
Given a matrix and , is there a such that ?
(1)
Using the oracle algorithm in solving zero-sum problem, with an error parameter , the output would either be a point such that or a proof that does not exist, i.e., there is no solution to this linear system of inequalities.
Given vector , solves the following relaxed problem
(2)
If there exists a x satisfying (1), then x satisfies (2) for all . The contrapositive of this statement is also true. Suppose if oracle returns a feasible solution for a , the solution it returns has bounded width . So if there is a solution to (1), then there is an algorithm that its output x satisfies the system (2) up to an additive error of . The algorithm makes at most calls to a width-bounded oracle for the problem (2). The contrapositive stands true as well. The multiplicative updates is applied in the algorithm in this case.
In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model that tries to find the line of best fit for a two-dimensional data set. It differs from the simple linear regression in that it accounts for errors in observations on both the x- and the y- axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.
The Ising model, named after the physicists Ernst Ising and Wilhelm Lenz, is a mathematical model of ferromagnetism in statistical mechanics. The model consists of discrete variables that represent magnetic dipole moments of atomic "spins" that can be in one of two states. The spins are arranged in a graph, usually a lattice, allowing each spin to interact with its neighbors. Neighboring spins that agree have a lower energy than those that disagree; the system tends to the lowest energy but heat disturbs this tendency, thus creating the possibility of different structural phases. The model allows the identification of phase transitions as a simplified model of reality. The two-dimensional square-lattice Ising model is one of the simplest statistical models to show a phase transition.
In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations (iterations).
In probability and statistics, the Dirichlet distribution, often denoted , is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. It is a multivariate generalization of the beta distribution, hence its alternative name of multivariate beta distribution (MBD). Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact, the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.
Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.
In the field of mathematics, norms are defined for elements within a vector space. Specifically, when the vector space comprises matrices, such norms are referred to as matrix norms. Matrix norms differ from vector norms in that they must also interact with matrix multiplication.
Exponential smoothing or exponential moving average (EMA) is a rule of thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.
Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the unequal variance of observations (heteroscedasticity) is incorporated into the regression. WLS is also a specialization of generalized least squares, when all the off-diagonal entries of the covariance matrix of the errors, are null.
In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.
In many-body theory, the term Green's function is sometimes used interchangeably with correlation function, but refers specifically to correlators of field operators or creation and annihilation operators.
In mathematics and mathematical physics, raising and lowering indices are operations on tensors which change their type. Raising and lowering indices are a form of index manipulation in tensor expressions.
In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.
A kernel smoother is a statistical technique to estimate a real valued function as the weighted average of neighboring observed data. The weight is defined by the kernel, such that closer points are given higher weights. The estimated function is smooth, and the level of smoothness is set by a single parameter. Kernel smoothing is a type of weighted moving average.
In statistics and machine learning, lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. The lasso method assumes that the coefficients of the linear model are sparse, meaning that few of them are non-zero. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.
The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert.
In coding theory, list decoding is an alternative to unique decoding of error-correcting codes in the presence of many errors. If a code has relative distance , then it is possible in principle to recover an encoded message when up to fraction of the codeword symbols are corrupted. But when error rate is greater than , this will not in general be possible. List decoding overcomes that issue by allowing the decoder to output a short list of messages that might have been encoded. List decoding can correct more than fraction of errors.
In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.
The Bianconi–Barabási model is a model in network science that explains the growth of complex evolving networks. This model can explain that nodes with different characteristics acquire links at different rates. It predicts that a node's growth depends on its fitness and can calculate the degree distribution. The Bianconi–Barabási model is named after its inventors Ginestra Bianconi and Albert-László Barabási. This model is a variant of the Barabási–Albert model. The model can be mapped to a Bose gas and this mapping can predict a topological phase transition between a "rich-get-richer" phase and a "winner-takes-all" phase.
In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.
The GHK algorithm is an importance sampling method for simulating choice probabilities in the multivariate probit model. These simulated probabilities can be used to recover parameter estimates from the maximized likelihood equation using any one of the usual well known maximization methods. Train has well documented steps for implementing this algorithm for a multinomial probit model. What follows here will apply to the binary multivariate probit model.