In probability theory and machine learning, the multi-armed bandit problem (sometimes called the K- [1] or N-armed bandit problem [2]) is a problem in which a decision maker iteratively selects one of multiple fixed choices (i.e., arms or actions) when the properties of each choice are only partially known at the time of allocation, and may become better understood as time passes. A fundamental aspect of bandit problems is that choosing an arm does not affect the properties of the arm or other arms. [3]
Instances of the multi-armed bandit problem include the task of iteratively allocating a fixed, limited set of resources between competing (alternative) choices in a way that minimizes the regret. [4] [5] A notable alternative setup for the multi-armed bandit problem is the "best arm identification (BAI)" problem, where the goal is instead to identify the best choice by the end of a finite number of rounds. [6]
The multi-armed bandit problem is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. In contrast to general RL, the selected actions in bandit problems do not affect the reward distribution of the arms. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. [7] The multi-armed bandit problem also falls into the broad category of stochastic scheduling.
In the problem, each machine provides a random reward from a probability distribution specific to that machine, that is not known a priori. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. [4] [5] The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in machine learning. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, like a science foundation or a pharmaceutical company. [4] [5] In early versions of the problem, the gambler begins with no initial knowledge about the machines.
Herbert Robbins, realizing the importance of the problem, constructed convergent population selection strategies in 1952 in "Some Aspects of the Sequential Design of Experiments". [8] A theorem, the Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward. [9]
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize its decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. There are many practical applications of the bandit model; examples include clinical trials, adaptive routing, recommendation systems, and A/B testing.
In these practical examples, the problem requires balancing reward maximization based on the knowledge already acquired with attempting new actions to further increase knowledge. This is known as the exploitation vs. exploration tradeoff in machine learning.
The model has also been used to control dynamic allocation of resources to different projects, answering the question of which project to work on, given uncertainty about the difficulty and payoff of each possibility. [14]
Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it. [15]
The version of the problem now commonly analyzed was formulated by Herbert Robbins in 1952.
The multi-armed bandit (short: bandit or MAB) can be seen as a set of real distributions $B = \{R_1, \dots, R_K\}$, each distribution being associated with the rewards delivered by one of the $K$ levers. Let $\mu_1, \dots, \mu_K$ be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon $H$ is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret $\rho$ after $T$ rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:

$$\rho = T\mu^* - \sum_{t=1}^{T} \widehat{r}_t ,$$

where $\mu^*$ is the maximal reward mean, $\mu^* = \max_k \{\mu_k\}$, and $\widehat{r}_t$ is the reward in round $t$.
A zero-regret strategy is a strategy whose average regret per round tends to zero with probability 1 when the number of played rounds tends to infinity. [16] Intuitively, zero-regret strategies are guaranteed to converge to a (not necessarily unique) optimal strategy if enough rounds are played.
A common formulation is the Binary multi-armed bandit or Bernoulli multi-armed bandit, which issues a reward of one with probability $p$, and otherwise a reward of zero.
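As a concrete illustration of these definitions, the following sketch (in Python; the arm probabilities and the uniformly random strategy are arbitrary choices for illustration, not part of the formal model) simulates a Bernoulli bandit for $T$ rounds and computes the empirical regret of the strategy played.

```python
import random

def simulate_random_policy(probs, T, seed=0):
    """Simulate a Bernoulli bandit and return the empirical regret of a
    uniformly random strategy after T rounds (illustrative sketch)."""
    rng = random.Random(seed)
    mu_star = max(probs)                   # best mean reward mu*
    total_reward = 0
    for _ in range(T):
        arm = rng.randrange(len(probs))                  # pick an arm uniformly at random
        reward = 1 if rng.random() < probs[arm] else 0   # Bernoulli reward
        total_reward += reward
    # Regret: expected reward of always playing the best arm minus collected reward
    return T * mu_star - total_reward

# Example: three arms with success probabilities 0.3, 0.5 and 0.7
print(simulate_random_policy([0.3, 0.5, 0.7], T=10_000))
```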
Another formulation of the multi-armed bandit has each arm representing an independent Markov machine. Each time a particular arm is played, the state of that machine advances to a new one, chosen according to the Markov state evolution probabilities. There is a reward depending on the current state of the machine. In a generalization called the "restless bandit problem", the states of non-played arms can also evolve over time. [17] There has also been discussion of systems where the number of choices (about which arm to play) increases over time. [18]
Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite (asymptotic) time horizons for both stochastic [1] and non-stochastic [19] arm payoffs.
An important variation of the classical regret minimization problem in multi-armed bandits is the one of Best Arm Identification (BAI), [20] also known as pure exploration. This problem is crucial in various applications, including clinical trials, adaptive routing, recommendation systems, and A/B testing.
In BAI, the objective is to identify the arm having the highest expected reward. An algorithm in this setting is characterized by a sampling rule, a decision rule, and a stopping rule, described as follows:
There are two predominant settings in BAI:
Fixed budget setting: Given a time horizon $T$, the objective is to identify the arm with the highest expected reward while minimizing the probability of error $\delta$.
Fixed confidence setting: Given a confidence level $\delta$, the objective is to identify the arm with the highest expected reward using as few trials as possible, with probability of error at most $\delta$.
For example, a simple decision rule can be built from per-machine statistics: for each machine $i$, record the number of times its lever has been pulled, $n_i$, and the outcome of each attempt, whose sum is the cumulative payoff $s_i$. From these quantities a ratio, sum, or mean (such as the empirical mean $s_i / n_i$) can be used as a quantitative estimate of each slot machine's payoff, and the decision rule recommends the machine with the highest estimate.
Budget constraints can be folded into the same bookkeeping: starting from an initial bankroll (say 100$), each gain or loss is added to a running total, and play stops once the allotted number of rounds, the bankroll, or the maximum amount one is willing to spend is exhausted. These constructions can be combined and expressed with multiple algebraic formulations, for instance by limiting the number of pulls, the elapsed time, or the money spent; a minimal sketch of this bookkeeping is given below.
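The following sketch (Python; the function name, the round-robin sampling rule, and the specific budget parameters are illustrative assumptions rather than standard BAI notation) records per-arm pull counts and payoffs under a fixed round budget and a spending limit, and recommends the arm with the highest empirical mean.

```python
import random

def fixed_budget_bai(probs, budget_rounds, bankroll=100.0, cost=1.0, seed=0):
    """Play each Bernoulli arm round-robin until either the round budget
    or the bankroll is exhausted, then recommend the empirically best arm.
    A sketch of a fixed-budget best-arm-identification procedure."""
    rng = random.Random(seed)
    k = len(probs)
    pulls = [0] * k        # n_i: how many times arm i was pulled
    payoff = [0.0] * k     # s_i: cumulative reward of arm i
    for t in range(budget_rounds):
        if bankroll < cost:
            break                        # spending limit reached
        arm = t % k                      # uniform (round-robin) sampling rule
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        pulls[arm] += 1
        payoff[arm] += reward
        bankroll += reward - cost        # update bankroll with the round's gain or loss
    # Decision rule: recommend the arm with the highest empirical mean s_i / n_i
    means = [payoff[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    return max(range(k), key=lambda i: means[i])

print(fixed_budget_bai([0.3, 0.5, 0.7], budget_rounds=3000))
```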
A major breakthrough was the construction of optimal population selection strategies, or policies (that possess uniformly maximum convergence rate to the population with highest mean) in the work described below.
In the paper "Asymptotically efficient adaptive allocation rules", Lai and Robbins [21] (following papers of Robbins and his co-workers going back to Robbins in the year 1952) constructed convergent population selection policies that possess the fastest rate of convergence (to the population with highest mean) for the case that the population reward distributions are the one-parameter exponential family. Then, in Katehakis and Robbins [22] simplifications of the policy and the main proof were given for the case of normal populations with known variances. The next notable progress was obtained by Burnetas and Katehakis in the paper "Optimal adaptive policies for sequential allocation problems", [23] where index based policies with uniformly maximum convergence rate were constructed, under more general conditions that include the case in which the distributions of outcomes from each population depend on a vector of unknown parameters. Burnetas and Katehakis (1996) also provided an explicit solution for the important case in which the distributions of outcomes follow arbitrary (i.e., non-parametric) discrete, univariate distributions.
Later in "Optimal adaptive policies for Markov decision processes" [24] Burnetas and Katehakis studied the much larger model of Markov Decision Processes under partial information, where the transition law and/or the expected one period rewards may depend on unknown parameters. In this work, the authors constructed an explicit form for a class of adaptive policies with uniformly maximum convergence rate properties for the total expected finite horizon reward under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations. These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett, [25] Ortner [26] Filippi, Cappé, and Garivier, [27] and Honda and Takemura. [28]
For Bernoulli multi-armed bandits, Pilarski et al. [29] studied computation methods of deriving fully optimal solutions (not just asymptotically) using dynamic programming in the paper "Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge." [29] Via indexing schemes, lookup tables, and other techniques, this work provided practically applicable optimal solutions for Bernoulli bandits provided that time horizons and numbers of arms did not become excessively large. Pilarski et al. [30] later extended this work in "Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI" [30] to create a method of determining the optimal policy for Bernoulli bandits when rewards may not be immediately revealed following a decision and may be delayed. This method relies upon calculating expected values of reward outcomes which have not yet been revealed and updating posterior probabilities when rewards are revealed.
When optimal solutions to multi-arm bandit tasks [31] are used to derive the value of animals' choices, the activity of neurons in the amygdala and ventral striatum encodes the values derived from these policies, and can be used to decode when the animals make exploratory versus exploitative choices. Moreover, optimal policies better predict animals' choice behavior than alternative strategies (described below). This suggests that the optimal solutions to multi-arm bandit problems are biologically plausible, despite being computationally demanding. [32]
Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.
Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a greedy behavior where the best lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.
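A canonical example of a semi-uniform strategy is the epsilon-greedy strategy. The sketch below (Python; the arm probabilities and epsilon = 0.1 are illustrative choices) always pulls the empirically best lever, except that with probability epsilon it pulls a lever chosen uniformly at random.

```python
import random

def epsilon_greedy(probs, T, epsilon=0.1, seed=0):
    """Semi-uniform (epsilon-greedy) strategy on a Bernoulli bandit:
    pull the empirically best lever, except that with probability
    epsilon a lever is chosen uniformly at random (exploration)."""
    rng = random.Random(seed)
    k = len(probs)
    pulls = [0] * k
    means = [0.0] * k      # running empirical mean reward of each arm
    total = 0.0
    for _ in range(T):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                       # explore
        else:
            arm = max(range(k), key=lambda i: means[i])  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean update
        total += reward
    return total

print(epsilon_greedy([0.3, 0.5, 0.7], T=10_000))
```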
Probability matching strategies reflect the idea that the number of pulls for a given lever should match its actual probability of being the optimal lever. Probability matching strategies are also known as Thompson sampling or Bayesian Bandits, [37] [38] and are surprisingly easy to implement if one can sample from the posterior for the mean value of each alternative.
Probability matching strategies also admit solutions to so-called contextual bandit problems. [37]
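For Bernoulli rewards with independent Beta(1, 1) priors, the posterior of each arm's mean is again a Beta distribution, so sampling from it is direct. The sketch below (Python; the arm probabilities are illustrative) draws one sample per arm from its posterior each round and pulls the arm with the largest sample.

```python
import random

def thompson_sampling(probs, T, seed=0):
    """Probability matching (Thompson sampling) for a Bernoulli bandit
    with independent Beta(1, 1) priors on each arm's success probability."""
    rng = random.Random(seed)
    k = len(probs)
    successes = [0] * k
    failures = [0] * k
    total = 0.0
    for _ in range(T):
        # Sample one plausible mean per arm from its Beta posterior
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total

print(thompson_sampling([0.3, 0.5, 0.7], T=10_000))
```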
Pricing strategies establish a price for each lever. For example, as illustrated with the POKER algorithm, [16] the price can be the sum of the expected reward plus an estimate of the extra future reward that will be gained through the additional knowledge. The lever of highest price is always pulled.
A useful generalization of the multi-armed bandit is the contextual multi-armed bandit. At each iteration the agent still has to choose between arms, but it also observes a d-dimensional feature vector (the context vector), which it can use, together with the rewards of the arms played in the past, to choose which arm to play. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other that it can predict the next best arm to play by looking at the feature vectors. [39]
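To give a flavour of how the context vector can be exploited, the sketch below follows a LinUCB-style approach (a well-known contextual bandit algorithm not described in this article; the exploration parameter alpha and the toy reward signal are illustrative assumptions): each arm keeps a ridge-regression estimate of the mapping from context to reward and is scored by the predicted reward plus a confidence bonus.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state for a LinUCB-style contextual bandit."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)          # d x d design matrix (ridge regularized)
        self.b = np.zeros(d)        # accumulated context * reward
        self.alpha = alpha          # exploration parameter (illustrative value)

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge estimate of reward weights
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # optimism (confidence) bonus
        return float(theta @ x + bonus)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose_arm(arms, x):
    """Pick the arm with the highest upper-confidence score for context x."""
    return max(range(len(arms)), key=lambda i: arms[i].score(x))

# Usage sketch: 3 arms, 5-dimensional contexts drawn at random
rng = np.random.default_rng(0)
arms = [LinUCBArm(d=5) for _ in range(3)]
for t in range(1000):
    x = rng.normal(size=5)
    a = choose_arm(arms, x)
    reward = float(x[a] > 0)        # toy reward: arm a pays off when feature a is positive
    arms[a].update(x, reward)
```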
Many strategies exist that provide an approximate solution to the contextual bandit problem, and can be put into two broad categories detailed below.
In practice, there is usually a cost associated with the resource consumed by each action, and the total cost is limited by a budget in many applications, such as crowdsourcing and clinical trials. Constrained contextual bandit (CCB) is a model that considers both the time and budget constraints in a multi-armed bandit setting. A. Badanidiyuru et al. [54] first studied contextual bandits with budget constraints, also referred to as Resourceful Contextual Bandits, and showed that a sublinear regret is achievable. However, their work focuses on a finite set of policies, and the algorithm is computationally inefficient.
A simple algorithm with logarithmic regret has been proposed for this setting. [55]
Another variant of the multi-armed bandit problem is the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration, an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest generalizations of the bandit problem, [56] as it removes all distributional assumptions; a solution to the adversarial bandit problem is therefore a generalized solution to the more specific bandit problems.
An example often considered for adversarial bandits is the iterated prisoner's dilemma. In this example, each adversary has two arms to pull: they can either deny or confess. Standard stochastic bandit algorithms do not work very well in such settings. For example, if the opponent cooperates in the first 100 rounds, defects for the next 200, then cooperates in the following 300, and so on, algorithms such as UCB cannot react quickly to these changes. This is because, after a certain point, sub-optimal arms are rarely pulled, in order to limit exploration and focus on exploitation. When the environment changes, the algorithm is unable to adapt, or may not even detect the change.
EXP3 is a popular algorithm for adversarial multi-armed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. More recently there has been increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multi-armed bandits with side information [Seldin et al., 2011] and to multi-armed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. Subsequent work presented an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving "logarithmic" regret in stochastic environments.
Parameters: real $\gamma \in (0, 1]$
Initialisation: $w_i(1) = 1$ for $i = 1, \dots, K$
For each $t = 1, 2, \dots, T$:
1. Set $p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$ for $i = 1, \dots, K$.
2. Draw $i_t$ randomly according to the probabilities $p_1(t), \dots, p_K(t)$.
3. Receive reward $x_{i_t}(t) \in [0, 1]$.
4. For $j = 1, \dots, K$ set:
$\widehat{x}_j(t) = x_j(t)/p_j(t)$ if $j = i_t$, and $\widehat{x}_j(t) = 0$ otherwise;
$w_j(t+1) = w_j(t)\exp\left(\gamma \widehat{x}_j(t)/K\right)$.
Exp3 chooses an arm at random: with probability $(1-\gamma)$ it prefers arms with higher weights (exploitation), and with probability $\gamma$ it chooses an arm uniformly at random (exploration). After receiving the reward, the weights are updated; the exponential growth significantly increases the weight of good arms.
The (external) regret of the Exp3 algorithm is at most $O(\sqrt{KT \log K})$.
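A direct translation of the above steps into code might look as follows (Python sketch; the toy adversarial reward schedule in the example is an illustrative assumption, and rewards are assumed to lie in [0, 1]).

```python
import math
import random

def exp3(reward_fn, K, T, gamma=0.1, seed=0):
    """EXP3 for adversarial bandits: maintain one weight per arm, mix the
    weight-proportional distribution with uniform exploration, and feed back
    an importance-weighted estimate of the observed reward."""
    rng = random.Random(seed)
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        s = sum(weights)
        probs = [(1 - gamma) * w / s + gamma / K for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]   # draw i_t according to p(t)
        x = reward_fn(t, arm)                           # observed reward in [0, 1]
        total += x
        x_hat = x / probs[arm]                          # importance-weighted estimate
        weights[arm] *= math.exp(gamma * x_hat / K)     # exponential weight update
    return total

# Example: arm 2 pays off in the first 500 rounds, the other arms afterwards
print(exp3(lambda t, a: 1.0 if (a == 2) == (t < 500) else 0.0, K=3, T=1000))
```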
The Follow the Perturbed Leader (FPL) algorithm proceeds as follows:
Parameters: real $\eta$
Initialisation: $\widehat{R}_i(1) = 0$ for $i = 1, \dots, K$
For each $t = 1, 2, \dots, T$:
1. For each arm $i$, generate a random noise $Z_i(t)$ from an exponential distribution with rate $\eta$.
2. Pull arm $I(t) = \arg\max_i \{\widehat{R}_i(t) + Z_i(t)\}$: add the noise to each arm's estimated cumulative reward and pull the one with the highest value.
3. Update value: $\widehat{R}_{I(t)}(t+1) = \widehat{R}_{I(t)}(t) + x_{I(t)}(t)$; the rest remain the same.
We follow the arm that we think has had the best performance so far, adding exponential noise to it to provide exploration. [58]
Exp3 | FPL
---|---
Maintains weights for each arm to calculate pulling probabilities | Does not need to know the pulling probability per arm
Has efficient theoretical guarantees | The standard FPL does not have good theoretical guarantees
Might be computationally expensive (calculating the exponential terms) | Computationally quite efficient
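A minimal sketch of FPL under bandit feedback (Python; the noise rate eta and the toy reward function are illustrative assumptions): fresh exponential noise is added to each arm's cumulative observed reward every round, and the arm with the highest perturbed value is pulled.

```python
import random

def fpl(reward_fn, K, T, eta=0.1, seed=0):
    """Follow the Perturbed Leader: add fresh exponential noise to each arm's
    cumulative reward estimate and pull the arm with the highest perturbed value."""
    rng = random.Random(seed)
    totals = [0.0] * K          # cumulative observed reward per arm
    collected = 0.0
    for t in range(T):
        noise = [rng.expovariate(eta) for _ in range(K)]   # exponential perturbation
        arm = max(range(K), key=lambda i: totals[i] + noise[i])
        x = reward_fn(t, arm)
        totals[arm] += x        # only the pulled arm's estimate is updated
        collected += x
    return collected

# Example: arm 1 always pays 1.0, the other arms pay 0.3
print(fpl(lambda t, a: 1.0 if a == 1 else 0.3, K=3, T=1000))
```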
In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable $K$. In the infinite armed case, introduced by Agrawal (1995), [59] the "arms" are a continuous variable in $K$ dimensions.
This framework refers to the multi-armed bandit problem in a non-stationary setting (i.e., in the presence of concept drift). In the non-stationary setting, it is assumed that the expected reward for an arm $k$ can change at every time step $t$: $\mu_t^k \neq \mu_{t+1}^k$. Thus, $\mu^k$ no longer represents the whole sequence of expected (stationary) rewards for arm $k$. Instead, $\mu^k$ denotes the sequence of expected rewards for arm $k$, defined as $\mu^k = \{\mu_t^k\}_{t=1}^{T}$. [60]

A dynamic oracle represents the optimal policy to be compared with other policies in the non-stationary setting. The dynamic oracle optimises the expected reward at each step $t$ by always selecting the best arm, with expected reward $\mu_t^* = \max_k \mu_t^k$. Thus, the cumulative expected reward $\mathcal{D}(T)$ for the dynamic oracle at final time step $T$ is defined as:

$$\mathcal{D}(T) = \sum_{t=1}^{T} \mu_t^*$$

Hence, the regret $\rho^{\pi}(T)$ for policy $\pi$ is computed as the difference between $\mathcal{D}(T)$ and the cumulative expected reward at step $T$ for policy $\pi$:

$$\rho^{\pi}(T) = \mathcal{D}(T) - \mathbb{E}\left[\sum_{t=1}^{T} r_t^{\pi}\right]$$
Garivier and Moulines derive some of the first results with respect to bandit problems where the underlying model can change during play. A number of algorithms were presented to deal with this case, including Discounted UCB [61] and Sliding-Window UCB. [62] A similar approach based on Thompson Sampling algorithm is the f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS) [63] proposed by Cavenaghi et al. The f-dsw TS algorithm exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. Another work by Burtini et al. introduces a weighted least squares Thompson sampling approach (WLS-TS), which proves beneficial in both the known and unknown non-stationary cases. [64]
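To give a flavour of how such algorithms contrast concept drift, the sketch below (Python; the class name and the discount factor are illustrative assumptions) maintains a discounted empirical mean per arm, a simplified ingredient of Discounted-UCB-style methods, so that old observations gradually lose influence.

```python
class DiscountedMean:
    """Discounted empirical mean of an arm's rewards: older observations are
    down-weighted geometrically so the estimate can track a drifting mean."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma              # discount factor in (0, 1), illustrative value
        self.weighted_sum = 0.0
        self.weighted_count = 0.0

    def update(self, reward):
        self.weighted_sum = self.gamma * self.weighted_sum + reward
        self.weighted_count = self.gamma * self.weighted_count + 1.0

    def value(self):
        return self.weighted_sum / self.weighted_count if self.weighted_count else 0.0

# Usage sketch: a reward stream whose mean drops halfway through
est = DiscountedMean(gamma=0.99)
for r in [1, 1, 1, 0, 0, 0]:
    est.update(r)
print(est.value())   # slightly below the plain mean of 0.5, since recent (low) rewards weigh more
```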
Many variants of the problem have been proposed in recent years.
The dueling bandit variant was introduced by Yue et al. (2012) [65] to model the exploration-versus-exploitation tradeoff for relative feedback. In this variant the gambler is allowed to pull two levers at the same time, but they only receive binary feedback telling which lever provided the better reward. The difficulty of this problem stems from the fact that the gambler has no way of directly observing the reward of their actions. The earliest algorithms for this problem were InterleaveFiltering [65] and Beat-The-Mean. [66] The relative feedback of dueling bandits can also lead to voting paradoxes. A solution is to take the Condorcet winner as a reference. [67]
More recently, researchers have generalized algorithms from traditional MAB to dueling bandits: Relative Upper Confidence Bounds (RUCB), [68] Relative EXponential weighing (REX3), [69] Copeland Confidence Bounds (CCB), [70] Relative Minimum Empirical Divergence (RMED), [71] and Double Thompson Sampling (DTS). [72]
Approaches using multiple bandits that cooperate by sharing knowledge in order to better optimize their performance started in 2013 with "A Gang of Bandits", [73] an algorithm relying on a similarity graph between the different bandit problems to share knowledge. The need for a similarity graph was removed in 2014 by the work on the CLUB algorithm. [74] Following this work, several other researchers created algorithms to learn multiple models at the same time under bandit feedback. For example, COFIBA was introduced by Li, Karatzoglou, and Gentile (SIGIR 2016), [75] where classical collaborative filtering and content-based filtering methods try to learn a static recommendation model given training data.
The Combinatorial Multiarmed Bandit (CMAB) problem [76] [77] [78] arises when instead of a single discrete variable to choose from, an agent needs to choose values for a set of variables. Assuming each variable is discrete, the number of possible choices per iteration is exponential in the number of variables. Several CMAB settings have been studied in the literature, from settings where the variables are binary [77] to more general setting where each variable can take an arbitrary set of values. [78]
Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Evolution strategy (ES) from computer science is a subclass of evolutionary algorithms, which serves as an optimization technique. It uses the major genetic operators mutation, recombination and selection of parents.
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.
Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation of the current parental individuals, usually in a stochastic way. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value $f(x)$. Like this, individuals with better and better $f$-values are generated over the generation sequence.
The Gittins index is a measure of the reward that can be achieved through a given stochastic process with certain properties, namely: the process has an ultimate termination state and evolves with an option, at each intermediate state, of terminating. Upon terminating at a given state, the reward achieved is the sum of the probabilistic expected rewards associated with every state from the actual terminating state to the ultimate terminal state, inclusive. The index is a real scalar.
In graph theory, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space and connecting two nodes by a link if and only if their distance is in a given range, e.g. smaller than a certain neighborhood radius, r.
In applied mathematics, the numerical sign problem is the problem of numerically evaluating the integral of a highly oscillatory function of a large number of variables. Numerical methods fail because of the near-cancellation of the positive and negative contributions to the integral. Each has to be integrated to very high precision in order for their difference to be obtained with useful accuracy.
Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user, to label new data points with the desired outputs. The human user must possess knowledge/expertise in the problem domain, including the ability to consult/research authoritative sources when necessary. In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.
AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory. AIXI was first proposed by Marcus Hutter in 2000 and several results regarding AIXI are proved in Hutter's 2005 book Universal Artificial Intelligence.
In queueing theory, a discipline within the mathematical theory of probability, a fluid queue is a mathematical model used to describe the fluid level in a reservoir subject to randomly determined periods of filling and emptying. The term dam theory was used in earlier literature for these models. The model has been used to approximate discrete models, model the spread of wildfires, in ruin theory and to model high speed data networks. The model applies the leaky bucket algorithm to a stochastic source.
Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that address the exploration–exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: given a ratings matrix in which each entry $(i, j)$ represents the rating of movie $j$ by customer $i$ if customer $i$ has watched movie $j$ and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: the frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.
In computer vision, pattern recognition, and robotics, point-set registration, also known as point-cloud registration or scan matching, is the process of finding a spatial transformation that aligns two point clouds. The purpose of finding such a transformation includes merging multiple data sets into a globally consistent model, and mapping a new measurement to a known data set to identify features or to estimate its pose. Raw 3D point cloud data are typically obtained from Lidars and RGB-D cameras. 3D point clouds can also be generated from computer vision algorithms such as triangulation, bundle adjustment, and more recently, monocular image depth estimation using deep learning. For 2D point set registration used in image processing and feature-based image registration, a point set may be 2D pixel coordinates obtained by feature extraction from an image, for example corner detection. Point cloud registration has extensive applications in autonomous driving, motion estimation and 3D reconstruction, object detection and pose estimation, robotic manipulation, simultaneous localization and mapping (SLAM), panorama stitching, virtual and augmented reality, and medical imaging.
A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.
A Tsetlin machine is an artificial intelligence algorithm based on propositional logic.
Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.
In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.
(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting.
The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches".
The exploration–exploitation dilemma, also known as the explore–exploit tradeoff, is a fundamental concept in decision-making that arises in many domains. It is depicted as the balancing act between two opposing strategies. Exploitation involves choosing the best option based on current knowledge of the system, while exploration involves trying out new options that may lead to better outcomes in the future at the expense of an exploitation opportunity. Finding the optimal balance between these two strategies is a crucial challenge in many decision-making problems whose goal is to maximize long-term benefits.