Randomized weighted majority algorithm

The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions over a series of decision problems. [1] It is a simple and effective method based on weighted voting that improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert.

Example

Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions.

Motivation

In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows:

initialize all experts to weight 1
for each round:
    add each expert's weight to the option they predicted
    predict the option with the largest weighted sum
    multiply the weights of all experts who predicted wrongly by 1/2
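The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name and the example data are our own, and the classical penalty factor of 1/2 is used.

```python
from collections import defaultdict

def weighted_majority(predictions, outcomes, beta=0.5):
    """Deterministic WMA. predictions[t][i] is expert i's guess in round t."""
    n = len(predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(predictions, outcomes):
        # Add each expert's weight to the option it predicted.
        totals = defaultdict(float)
        for w, p in zip(weights, preds):
            totals[p] += w
        # Predict the option with the largest weighted sum.
        guess = max(totals, key=totals.get)
        mistakes += (guess != outcome)
        # Multiply the weights of all wrong experts by beta (classically 1/2).
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes

# Illustrative data: expert 0 always says "up" and is always right,
# so the weighted vote quickly starts following it.
print(weighted_majority([["up", "down", "down"]] * 10, ["up"] * 10))
```

Because the prediction is a deterministic argmax over weighted vote totals, an adversary who knows the weights can force a mistake whenever the vote is nearly split; this is the weakness the randomized variant addresses.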


Suppose there are n experts and the best expert makes m mistakes. Then, the weighted majority algorithm (WMA) makes at most 2.4(log_2 n + m) mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in T = 100 rounds using n = 10 experts, the best expert makes m = 20 mistakes. Then, the weighted majority algorithm only guarantees an upper bound of 2.4(log_2 10 + 20) ≈ 56 mistakes.

As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on m. In particular, we can do better by introducing randomization.

Drawing inspiration from the multiplicative weights update method (MWUM), we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will shrink its weight multiplicatively. Mirroring the MWUM, we will then use the weights to define a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). [2]

Randomized weighted majority algorithm (RWMA)

The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on the number of mistakes m made by the best expert. Instead of predicting based on majority vote, the weights w_i are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority).

Precisely, if w_i is the weight of expert i, let W = Σ_i w_i. We will follow expert i with probability w_i / W. This results in the following algorithm:

initialize all experts to weight 1
for each round:
    add all experts' weights together to obtain the total weight W
    choose expert i randomly with probability w_i / W
    predict as the chosen expert predicts
    multiply the weights of all experts who predicted wrongly by β
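A minimal Python sketch of this randomized variant follows; the function name, fixed seed, and example data are our own illustrative choices.

```python
import random

def randomized_weighted_majority(predictions, outcomes, beta=0.5, seed=0):
    """RWMA: follow expert i with probability w_i / W in each round."""
    rng = random.Random(seed)
    n = len(predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(predictions, outcomes):
        # Choose an expert with probability proportional to its current weight.
        i = rng.choices(range(n), weights=weights)[0]
        # Predict as the chosen expert predicts.
        mistakes += (preds[i] != outcome)
        # Multiply the weights of all experts who predicted wrongly by beta.
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes

# Illustrative run: expert 0 is always right, so the weight of the other
# experts decays geometrically and the algorithm rarely strays from expert 0.
print(randomized_weighted_majority([["up", "down", "down"]] * 50, ["up"] * 50))
```

Note that the weight updates depend only on which experts were wrong, not on the coin flip, so the randomness affects only which expert's advice is followed in each round.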

The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it were decided before the experts gave their recommendations for the day.

The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by β instead of strictly 1/2 allows us to trade off between the dependence on m and on n. This trade-off will be quantified in the analysis section.

Analysis

Let W_t denote the total weight of all experts at round t. Also let F_t denote the fraction of weight placed on experts which predict the wrong answer at round t. Finally, let T be the total number of rounds in the process.

By definition, F_t is the probability that the algorithm makes a mistake on round t. It follows from the linearity of expectation that if M denotes the total number of mistakes made during the entire process, E[M] = Σ_{t=1}^{T} F_t.

After round t, the total weight is decreased by (1 − β) F_t W_t, since all weights corresponding to a wrong answer are multiplied by β. It then follows that W_{t+1} = W_t (1 − (1 − β) F_t). By telescoping, since W_1 = n, it follows that the total weight after the process concludes is

W_{T+1} = n Π_{t=1}^{T} (1 − (1 − β) F_t).

On the other hand, suppose that m is the number of mistakes made by the best-performing expert. At the end, this expert has weight β^m. It follows, then, that the total weight is at least this much; in other words, W_{T+1} ≥ β^m. This inequality and the above result imply

β^m ≤ n Π_{t=1}^{T} (1 − (1 − β) F_t).

Taking the natural logarithm of both sides yields

m ln β ≤ ln n + Σ_{t=1}^{T} ln(1 − (1 − β) F_t).

Now, the Taylor series of the natural logarithm is

ln(1 − x) = −x − x²/2 − x³/3 − ⋯

In particular, substituting x = (1 − β) F_t, it follows that ln(1 − (1 − β) F_t) ≤ −(1 − β) F_t. Thus,

m ln β ≤ ln n − (1 − β) Σ_{t=1}^{T} F_t.

Recalling that E[M] = Σ_{t=1}^{T} F_t and rearranging, it follows that

E[M] ≤ (m ln(1/β)) / (1 − β) + (ln n) / (1 − β).
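Because the weight updates depend only on which experts were right in each round, and not on the algorithm's coin flips, the sequence F_t, and hence E[M] = Σ_t F_t, can be computed exactly for any fixed data. The following sketch, using made-up expert data, checks E[M] against the bound (m ln(1/β) + ln n) / (1 − β) just derived:

```python
import math

def expected_mistakes_and_bound(correct, beta):
    """correct[t][i] is True iff expert i is right in round t.

    Returns (E[M] = sum of F_t, the bound (m ln(1/beta) + ln n)/(1 - beta)).
    """
    n = len(correct[0])
    weights = [1.0] * n
    expected = 0.0
    for row in correct:
        total = sum(weights)
        # F_t: fraction of weight on experts that are wrong this round.
        expected += sum(w for w, c in zip(weights, row) if not c) / total
        weights = [w * beta if not c else w for w, c in zip(weights, row)]
    # m: mistakes of the best expert in hindsight.
    m = min(sum(not row[i] for row in correct) for i in range(n))
    bound = (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
    return expected, bound

# Expert 0 errs every 5th round (m = 20 over T = 100);
# experts 1 and 2 err on alternating rounds (50 mistakes each).
correct = [[t % 5 != 0, t % 2 == 0, t % 2 == 1] for t in range(100)]
exp_m, bound = expected_mistakes_and_bound(correct, beta=0.5)
assert exp_m <= bound
```

The assertion holds on any input sequence; the theorem guarantees it, and the example merely illustrates the slack between E[M] and the bound on one concrete dataset.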

Now, as β → 1 from below, the first constant, ln(1/β)/(1 − β), tends to 1; however, the second constant, 1/(1 − β), tends to +∞. To quantify this tradeoff, define ε = 1 − β to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm,

ln(1/β) = −ln(1 − ε) = ε + ε²/2 + ε³/3 + ⋯

It then follows that the mistake bound, for small ε, can be written in the form (1 + ε/2) m + (ln n)/ε.
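A quick numerical sanity check, with illustrative values of our own choosing (m = 100, n = 10, ε = 0.1), shows how closely the small-ε form (1 + ε/2) m + (ln n)/ε tracks the exact bound (m ln(1/β) + ln n) / (1 − β):

```python
import math

# Illustrative parameters (our own choices): beta = 1 - eps.
m, n, eps = 100, 10, 0.1
beta = 1 - eps

exact = (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
approx = (1 + eps / 2) * m + math.log(n) / eps

# At eps = 0.1 the two expressions already agree to within about 1%.
assert abs(exact - approx) / exact < 0.01
```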

In English, the less we penalize experts for their mistakes, the more initial mistakes the additional experts cost us, but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of ε and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert.

In particular, as long as m is sufficiently large compared to ln n (so that their ratio is sufficiently small), we can assign

ε = √(ln(n) / m)

and obtain an upper bound on the number of mistakes equal to

m + O(√(m ln n)).

This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at O(√(T ln n)).

Revisiting the motivation

Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in T = 100 rounds, with n = 10 experts, where the best expert makes m = 20 mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of 2.4(log_2 10 + 20) ≈ 56 mistakes. By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the function

f(β) = (20 ln(1/β) + ln 10) / (1 − β).

Computational methods show that the optimal value is roughly β ≈ 0.641, which results in a minimal worst-case number of expected mistakes of about 31.19. When the number of rounds is increased (say, to T = 1,000,000, so that m = 200,000) while the accuracy rate of the best expert is kept the same, the improvement can be even more dramatic; the weighted majority algorithm guarantees only a worst-case mistake rate of 48.0%, but the randomized weighted majority algorithm, when properly tuned to the optimal value of β, achieves a worst-case mistake rate of 20.2%.
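The "computational methods" here can be as simple as a grid search over β; a sketch for the T = 100, n = 10, m = 20 example (the helper name is our own):

```python
import math

def worst_case_bound(beta, m, n):
    """Worst-case expected mistakes: (m ln(1/beta) + ln n) / (1 - beta)."""
    return (m * math.log(1 / beta) + math.log(n)) / (1 - beta)

# Grid search over beta in (0, 1) for m = 20 mistakes among n = 10 experts.
betas = [i / 1000 for i in range(1, 1000)]
best_beta = min(betas, key=lambda b: worst_case_bound(b, 20, 10))

print(round(best_beta, 2), round(worst_case_bound(best_beta, 20, 10), 1))
# → 0.64 31.2
```

The minimum is quite flat near the optimum, so even a coarse grid recovers the β ≈ 0.641 and ≈ 31.19-mistake figures quoted above.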

Uses of the randomized weighted majority algorithm (RWMA)

The randomized weighted majority algorithm can be used to combine multiple algorithms, in which case RWMA can be expected to perform nearly as well as the best of the original algorithms in hindsight. Note that the RWMA can be generalized to solve problems which do not have binary mistake variables, which makes it useful for a wide class of problems.

Furthermore, one can apply the randomized weighted majority algorithm in situations where experts are making choices that cannot be combined (or cannot be combined easily). For example, RWMA can be applied to repeated game playing or the online shortest path problem. In the online shortest path problem, each expert is telling you a different way to drive to work. You pick one path using RWMA. Later you find out how well you would have done using all of the suggested paths, and you penalize each expert appropriately. The goal is to have an expected loss not much larger than the loss of the best expert.
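A hedged sketch of this generalization: instead of a binary right/wrong outcome, each expert's suggestion incurs an observed loss in [0, 1] each round, and each weight is scaled by β raised to that expert's loss. The function name, seed, and route data below are purely illustrative.

```python
import random

def rwma_with_losses(loss_rows, beta=0.9, seed=42):
    """Pick an expert (e.g., a route) each round with probability w_i / W;
    then observe every expert's loss in [0, 1] and update all weights."""
    rng = random.Random(seed)
    n = len(loss_rows[0])
    weights = [1.0] * n
    total_loss = 0.0
    for losses in loss_rows:
        chosen = rng.choices(range(n), weights=weights)[0]
        total_loss += losses[chosen]
        # Penalize each expert in proportion to how badly its advice did.
        weights = [w * beta ** loss for w, loss in zip(weights, losses)]
    return total_loss

# Three hypothetical routes to work; route 0 is reliably fast (loss 0.1),
# the other two have unpredictable delays.
rng_data = random.Random(7)
rows = [[0.1, rng_data.random(), rng_data.random()] for _ in range(200)]
print(rwma_with_losses(rows))
```

Since the weights concentrate on route 0 as its losses stay low, the realized total loss stays close to that of the best route in hindsight, which is the regret guarantee described above.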

Applications in software

The randomized weighted majority algorithm has been proposed as a new method for several practical software applications, particularly in the domains of bug detection and cyber-security. [3] [4] For instance, Varsha and Madhavu (2021) describe how the randomized weighted majority algorithm can be used to replace conventional voting within a random forest classification approach to detect insider threats. Using experimental results, they show that this approach obtained a higher level of accuracy and recall compared to the standard random forest algorithm. Moustafa et al. (2018) have studied how an ensemble classifier based on the randomized weighted majority algorithm could be used to detect bugs earlier in the software development process, after being trained on existing software repositories.

References

  1. Littlestone, N.; Warmuth, M. (1994). "The Weighted Majority Algorithm". Information and Computation. 108 (2): 212–261. doi:10.1006/inco.1994.1009.
  2. "COS 511: Foundations of Machine Learning" (PDF). 20 March 2006.
  3. Suresh, P. Varsha; Madhavu, Minu Lalitha (2021). "Insider Attack: Internal Cyber Attack Detection Using Machine Learning". 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT): 1–7. doi:10.1109/ICCCNT51525.2021.9579549.
  4. Moustafa, Sammar; El Nainay, Mustafa Y.; El Makky, Nagwa; Abougabal, Mohamed S. (2018). "Software bug prediction using weighted majority voting techniques". Alexandria Engineering Journal. 57 (4): 2763–2774. doi:10.1016/j.aej.2018.01.003.
