Thompson sampling

Last updated

Thompson sampling, [1] [2] named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief.



Consider a set of contexts , a set of actions , and rewards in . In each round, the player obtains a context , plays an action and receives a reward following a distribution that depends on the context and the issued action. The aim of the player is to play actions such as to maximize the cumulative rewards.

The elements of Thompson sampling are as follows:

  1. a likelihood function ;
  2. a set of parameters of the distribution of ;
  3. a prior distribution on these parameters;
  4. past observations triplets ;
  5. a posterior distribution , where is the likelihood function.

Thompson sampling consists in playing the action according to the probability that it maximizes the expected reward, i.e.

where is the indicator function.

In practice, the rule is implemented by sampling, in each round, parameters from the posterior , and choosing the action that maximizes , i.e. the expected reward given the sampled parameters, the action and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round, and then acts optimally according to them. In most practical applications, it is computationally onerous to maintain and sample from a posterior distribution over models. As such, Thompson sampling is often used in conjunction with approximate sampling techniques. [2]


Thompson sampling was originally described by Thompson in 1933 [1] . It was subsequently rediscovered numerous times independently in the context of multi-armed bandit problems. [3] [4] [5] [6] [7] [8] A first proof of convergence for the bandit case has been shown in 1997. [3] The first application to Markov decision processes was in 2000. [5] A related approach (see Bayesian control rule) was published in 2010. [4] In 2010 it was also shown that Thompson sampling is instantaneously self-correcting. [8] Asymptotic convergence results for contextual bandits were published in 2011. [6] Nowadays, Thompson Sampling has been widely used in many online learning problems: Thompson sampling has also been applied to A/B testing in website design and online advertising; [9] Thompson sampling has formed the basis for accelerated learning in decentralized decision making; [10] a Double Thompson Sampling (D-TS) [11] algorithm has been proposed for dueling bandits, a variant of traditional MAB, where feedbacks come in the format of pairwise comparison.

Relationship to other approaches

Probability matching

Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.

Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations. [4] In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let be the actions issued by an agent up to time , and let be the observations gathered by the agent up to time . Then, the agent issues the action with probability: [4]

where the "hat"-notation denotes the fact that is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs over its behaviors, then the Bayesian control rule becomes


where is the posterior distribution over the parameter given actions and observations .

In practice, the Bayesian control amounts to sampling, in each time step, a parameter from the posterior distribution , where the posterior distribution is computed using Bayes' rule by only considering the (causal) likelihoods of the observations and ignoring the (causal) likelihoods of the actions , and then by sampling the action from the action distribution .

Upper-Confidence-Bound (UCB) algorithms

Thompson sampling and upper-confidence bound algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic." Leveraging this property, one can translate regret bounds established for UCB algorithms to Bayesian regret bounds for Thompson sampling [12] or unify regret analysis across both these algorithms and many classes of problems. [13]

Related Research Articles

In statistics, the likelihood function is formed from the joint probability of a sample of data given a set of model parameter values; it is viewed and used as a function of the parameters given the data sample.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Bayesian network statistical model

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined. For instance, there is a ("non-posterior") probability of a person finding buried treasure if they dig in a random spot, and a posterior probability of finding buried treasure if they dig in a spot where their metal detector rings.

In statistics, the score is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular point of the parameter vector, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuous over the parameter space, the score will vanish at a local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information. In Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior. The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher. The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics.

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior for the likelihood that is also Gaussian. The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory. A similar concept had been discovered independently by George Alfred Barnard.

In statistics, a marginal likelihood function, or integrated likelihood, is a likelihood function in which some parameter variables have been marginalized. In the context of Bayesian statistics, it may also be referred to as the evidence or model evidence.

In statistics, the Wald test assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the precision of the estimate. Intuitively, the larger this weighted distance, the less likely it is that the constraint is true. While the finite sample distributions of Wald tests are generally unknown, it has an asymptotic χ2-distribution under the null hypothesis, a fact that can be used to determine statistical significance.

In statistical decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always "better" than it, in the precise sense of "better" defined below. This concept is analogous to Pareto efficiency.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.

In statistical decision theory, a randomised decision rule or mixed decision rule is a decision rule that associates probabilities with deterministic decision rules. In finite decision problems, randomised decision rules define a risk set which is the convex hull of the risk points of the nonrandomised decision rules.

In statistics, suppose that we have been given some data, and we are constructing a statistical model of that data. The relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of a single model.

In computational statistics, the pseudo-marginal Metropolis–Hastings algorithm is a Monte Carlo method to sample from a probability distribution. It is an instance of the popular Metropolis–Hastings algorithm that extends its use to cases where the target density is not available analytically. It relies on the fact that the Metropolis–Hastings algorithm can still sample from the correct target distribution if the target density in the acceptance ratio is replaced by an estimate. It is especially popular in Bayesian statistics, where it is applied if the likelihood function is not tractable.

Stochastic gradient Langevin dynamics

Stochastic gradient Langevin dynamics, is an optimization technique composed of characteristics from Stochastic gradient descent, a Robbins-Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which introduces additional noise to the stochastic gradient estimator used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning, since the method produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.


  1. 1 2 Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika , 25(3–4):285–294, 1933.
  2. 1 2 Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96.
  3. 1 2 J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
  4. 1 2 3 4 P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010.
  5. 1 2 M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000,
  6. 1 2 B. C. May, B. C., N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
  7. Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011.
  8. 1 2 O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3 (2), 2010, 207-234.
  9. Ian Clarke. "Proportionate A/B testing", September 22nd, 2011,
  10. Granmo, O. C.; Glimsdal, S. (2012). "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence. doi:10.1007/s10489-012-0346-z.
  11. Wu, Huasen; Liu, Xin; Srikant, R (2016), Double Thompson Sampling for Dueling Bandits, arXiv: 1604.07101 , Bibcode:2016arXiv160407101W
  12. Daniel J. Russo and Benjamin Van Roy (2014), "Learning to Optimize Via Posterior Sampling", Mathematics of Operations Research, Vol. 39, No. 4, pp. 1221-1243, 2014.
  13. Daniel J. Russo and Benjamin Van Roy (2013), "Eluder Dimension and the Sample Complexity of Optimistic Exploration", Advances in Neural Information Processing Systems 26, pp. 2256-2264.