# Thompson sampling

Last updated

Thompson sampling,   named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief. In probability theory, the multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration-exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines, who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.

## Description

Consider a set of contexts ${\mathcal {X}}$ , a set of actions ${\mathcal {A}}$ , and rewards in $\mathbb {R}$ . In each round, the player obtains a context $x\in {\mathcal {X}}$ , plays an action $a\in {\mathcal {A}}$ and receives a reward $r\in \mathbb {R}$ following a distribution that depends on the context and the issued action. The aim of the player is to play actions such as to maximize the cumulative rewards.

The elements of Thompson sampling are as follows:

1. a likelihood function $P(r|\theta ,a,x)$ ;
2. a set $\Theta$ of parameters $\theta$ of the distribution of $r$ ;
3. a prior distribution $P(\theta )$ on these parameters;
4. past observations triplets ${\mathcal {D}}=\{(x;a;r)\}$ ;
5. a posterior distribution $P(\theta |{\mathcal {D}})\propto P({\mathcal {D}}|\theta )P(\theta )$ , where $P({\mathcal {D}}|\theta )$ is the likelihood function.

Thompson sampling consists in playing the action $a^{\ast }\in {\mathcal {A}}$ according to the probability that it maximizes the expected reward, i.e.

$\int \mathbb {I} \left[\mathbb {E} (r|a^{\ast },x,\theta )=\max _{a'}\mathbb {E} (r|a',x,\theta )\right]P(\theta |{\mathcal {D}})d\theta ,$ where $\mathbb {I}$ is the indicator function. In mathematics, an indicator function or a characteristic function is a function defined on a set X that indicates membership of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for all elements of X not in A. It is usually denoted by a symbol 1 or I, sometimes in boldface or blackboard boldface, with a subscript specifying the subset.

In practice, the rule is implemented by sampling, in each round, parameters $\theta ^{\ast }$ from the posterior $P(\theta |{\mathcal {D}})$ , and choosing the action $a^{\ast }$ that maximizes $\mathbb {E} [r|\theta ^{\ast },a^{\ast },x]$ , i.e. the expected reward given the sampled parameters, the action and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round, and then acts optimally according to them. In most practical applications, it is computationally onerous to maintain and sample from a posterior distribution over models. As such, Thompson sampling is often used in conjunction with approximate sampling techniques. 

## History

Thompson sampling was originally described by Thompson in 1933  . It was subsequently rediscovered numerous times independently in the context of multi-armed bandit problems.       A first proof of convergence for the bandit case has been shown in 1997.  The first application to Markov decision processes was in 2000.  A related approach (see Bayesian control rule) was published in 2010.  In 2010 it was also shown that Thompson sampling is instantaneously self-correcting.  Asymptotic convergence results for contextual bandits were published in 2011.  Nowadays, Thompson Sampling has been widely used in many online learning problems: Thompson sampling has also been applied to A/B testing in website design and online advertising;  Thompson sampling has formed the basis for accelerated learning in decentralized decision making;  a Double Thompson Sampling (D-TS)  algorithm has been proposed for dueling bandits, a variant of traditional MAB, where feedbacks come in the format of pairwise comparison.

A Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov.

## Relationship to other approaches

### Probability matching

Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.

### Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations.  In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let $a_{1},a_{2},\ldots ,a_{T}$ be the actions issued by an agent up to time $T$ , and let $o_{1},o_{2},\ldots ,o_{T}$ be the observations gathered by the agent up to time $T$ . Then, the agent issues the action $a_{T+1}$ with probability: 

$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T}),$ where the "hat"-notation ${\hat {a}}_{t}$ denotes the fact that $a_{t}$ is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs $\theta \in \Theta$ over its behaviors, then the Bayesian control rule becomes

$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T})=\int _{\Theta }P(a_{T+1}|\theta ,{\hat {a}}_{1:T},o_{1:T})P(\theta |{\hat {a}}_{1:T},o_{1:T})\,d\theta$ ,

where $P(\theta |{\hat {a}}_{1:T},o_{1:T})$ is the posterior distribution over the parameter $\theta$ given actions $a_{1:T}$ and observations $o_{1:T}$ .

In practice, the Bayesian control amounts to sampling, in each time step, a parameter $\theta ^{\ast }$ from the posterior distribution $P(\theta |{\hat {a}}_{1:T},o_{1:T})$ , where the posterior distribution is computed using Bayes' rule by only considering the (causal) likelihoods of the observations $o_{1},o_{2},\ldots ,o_{T}$ and ignoring the (causal) likelihoods of the actions $a_{1},a_{2},\ldots ,a_{T}$ , and then by sampling the action $a_{T+1}^{\ast }$ from the action distribution $P(a_{T+1}|\theta ^{\ast },{\hat {a}}_{1:T},o_{1:T})$ .

### Upper-Confidence-Bound (UCB) Algorithms

Thompson sampling and upper-confidence bound algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic." Leveraging this property, one can translate regret bounds established for UCB algorithms to Bayesian regret bounds for Thompson sampling  or unify regret analysis across both these algorithms and many classes of problems. 

## Related Research Articles

In statistics, the likelihood function expresses how likely particular values of statistical parameters are for a given set of observations. It is equal to the joint probability distribution of the random sample evaluated at the given observations, and it is, thus, solely a function of parameters that index the family of those probability distributions.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference. A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined. For instance, there is a ("non-posterior") probability of a person finding buried treasure if they dig in a random spot, and a posterior probability of finding buried treasure if they dig in a spot where their metal detector rings.

In statistics, the score is the gradient of the log-likelihood function with respect to the parameter vector. Evaluated at a particular point, the score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values. If the log-likelihood function is continuous over the parameter space, the score will vanish at a local maximum or minimum; this fact is used in maximum likelihood estimation to find the parameter values that maximize the likelihood function.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information. In Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior. The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher. The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics.

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior for the likelihood that is also Gaussian. The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory. A similar concept had been discovered independently by George Alfred Barnard.

In statistics, a marginal likelihood function, or integrated likelihood, is a likelihood function in which some parameter variables have been marginalized. In the context of Bayesian statistics, it may also be referred to as the evidence or model evidence.

In statistics, the Wald test assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under the null hypothesis, where the weight is the precision of the estimate. Intuitively, the larger this weighted distance, the less likely it is that the constraint is true. While the finite sample distributions of Wald tests are generally unknown, it has an asymptotic χ2-distribution under the null hypothesis, a fact that can be used to determine statistical significance.

In statistical decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always "better" than it, in the precise sense of "better" defined below. This concept is analogous to Pareto efficiency.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of samples/images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training samples/images.

In statistical decision theory, a randomised decision rule or mixed decision rule is a decision rule that associates probabilities with deterministic decision rules. In finite decision problems, randomised decision rules define a risk set which is the convex hull of the risk points of the nonrandomised decision rules.

In computational statistics, the pseudo-marginal Metropolis–Hastings algorithm is a Monte Carlo method to sample from a probability distribution. It is an instance of the popular Metropolis–Hastings algorithm that extends its use to cases where the target density is not available analytically. It relies on the fact that the Metropolis–Hastings algorithm can still sample from the correct target distribution if the target density in the acceptance ratio is replaced by an estimate. It is especially popular in Bayesian statistics, where it is applied if the likelihood function is not tractable. Stochastic gradient Langevin dynamics, is an optimization technique composed of characteristics from Stochastic gradient descent, a Robbins-Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent, SGLD is an iterative optimization algorithm which introduces additional noise to the stochastic gradient estimator used in SGD to optimize a differentiable objective function. Unlike traditional SGD, SGLD can be used for Bayesian learning, since the method produces samples from a posterior distribution of parameters based on available data. First described by Welling and Teh in 2011, the method has applications in many contexts which require optimization, and is most notably applied in machine learning problems.

1. Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika , 25(3–4):285–294, 1933.
2. Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96. https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
3. J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
4. P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010.
5. M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1701
6. B. C. May, B. C., N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
7. Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling
8. O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3 (2), 2010, 207-234.
9. Ian Clarke. "Proportionate A/B testing", September 22nd, 2011, http://blog.locut.us/2011/09/22/proportionate-ab-testing/
10. Granmo, O. C.; Glimsdal, S. (2012). "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence. doi:10.1007/s10489-012-0346-z.
11. Wu, Huasen; Liu, Xin; Srikant, R (2016), Double Thompson Sampling for Dueling Bandits, arXiv:, Bibcode:2016arXiv160407101W
12. Daniel J. Russo and Benjamin Van Roy (2014), "Learning to Optimize Via Posterior Sampling", Mathematics of Operations Research, Vol. 39, No. 4, pp. 1221-1243, 2014. https://pubsonline.informs.org/doi/abs/10.1287/moor.2014.0650
13. Daniel J. Russo and Benjamin Van Roy (2013), "Eluder Dimension and the Sample Complexity of Optimistic Exploration", Advances in Neural Information Processing Systems 26, pp. 2256-2264. http://papers.nips.cc/paper/4909-eluder-dimension-and-the-sample-complexity-of-optimistic-exploration.pdf