# Thompson sampling

Thompson sampling, [1] [2] named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

## Description

Consider a set of contexts ${\displaystyle {\mathcal {X}}}$, a set of actions ${\displaystyle {\mathcal {A}}}$, and rewards in ${\displaystyle \mathbb {R} }$. In each round, the player obtains a context ${\displaystyle x\in {\mathcal {X}}}$, plays an action ${\displaystyle a\in {\mathcal {A}}}$ and receives a reward ${\displaystyle r\in \mathbb {R} }$ drawn from a distribution that depends on the context and the chosen action. The aim of the player is to play actions so as to maximize the cumulative reward.

The elements of Thompson sampling are as follows:

1. a likelihood function ${\displaystyle P(r|\theta ,a,x)}$;
2. a set ${\displaystyle \Theta }$ of parameters ${\displaystyle \theta }$ of the distribution of ${\displaystyle r}$;
3. a prior distribution ${\displaystyle P(\theta )}$ on these parameters;
4. past observation triplets ${\displaystyle {\mathcal {D}}=\{(x;a;r)\}}$;
5. a posterior distribution ${\displaystyle P(\theta |{\mathcal {D}})\propto P({\mathcal {D}}|\theta )P(\theta )}$, where ${\displaystyle P({\mathcal {D}}|\theta )}$ is the likelihood function.
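As a concrete illustration of these elements, consider a context-free Bernoulli bandit with rewards ${\displaystyle r\in \{0,1\}}$. This special case and the symbols ${\displaystyle \alpha _{a}}$, ${\displaystyle \beta _{a}}$, ${\displaystyle s_{a}}$ and ${\displaystyle f_{a}}$ are assumptions introduced here for the example. With one success probability ${\displaystyle \theta _{a}}$ per action, the likelihood is

${\displaystyle P(r|\theta ,a)=\theta _{a}^{r}(1-\theta _{a})^{1-r}.}$

If each ${\displaystyle \theta _{a}}$ has an independent ${\displaystyle \mathrm {Beta} (\alpha _{a},\beta _{a})}$ prior, and ${\displaystyle {\mathcal {D}}}$ contains ${\displaystyle s_{a}}$ rewards equal to 1 and ${\displaystyle f_{a}}$ rewards equal to 0 for action ${\displaystyle a}$, then the posterior satisfies

${\displaystyle P(\theta _{a}|{\mathcal {D}})\propto \theta _{a}^{\alpha _{a}+s_{a}-1}(1-\theta _{a})^{\beta _{a}+f_{a}-1},}$

which is a ${\displaystyle \mathrm {Beta} (\alpha _{a}+s_{a},\beta _{a}+f_{a})}$ distribution. In this conjugate case, updating the posterior only requires incrementing two counters per action.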

Thompson sampling consists in playing the action ${\displaystyle a^{\ast }\in {\mathcal {A}}}$ according to the probability that it maximizes the expected reward, i.e. action ${\displaystyle a^{\ast }}$ is chosen with probability

${\displaystyle \int \mathbb {I} \left[\mathbb {E} (r|a^{\ast },x,\theta )=\max _{a'}\mathbb {E} (r|a',x,\theta )\right]P(\theta |{\mathcal {D}})d\theta ,}$

where ${\displaystyle \mathbb {I} }$ is the indicator function.
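The integral above can be approximated by Monte Carlo: draw many parameter samples from the posterior and count how often each action has the highest expected reward. The sketch below assumes a context-free Bernoulli bandit with an independent Beta posterior per action, so that the expected reward of an action is simply its sampled success probability. The function name `selection_probabilities` and the example posteriors are hypothetical.

```python
import random


def selection_probabilities(posteriors, num_samples=20000, seed=0):
    """Estimate the probability that each action maximizes the expected reward.

    `posteriors` is a list of (alpha, beta) pairs, one Beta posterior per
    action (an assumption of this sketch, not a general requirement).
    """
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(num_samples):
        # One joint draw of theta from the posterior.
        theta = [rng.betavariate(a, b) for a, b in posteriors]
        # For Bernoulli rewards, E[r | a, theta] = theta[a], so the indicator
        # in the integral is 1 for the action with the largest sampled value.
        best = max(range(len(theta)), key=theta.__getitem__)
        wins[best] += 1
    return [w / num_samples for w in wins]


if __name__ == "__main__":
    # Action 0 has posterior Beta(3, 2); action 1 has posterior Beta(2, 3).
    print(selection_probabilities([(3, 2), (2, 3)]))
```

The returned fractions estimate the probabilities with which each action would be played in the next round under this model.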

In practice, the rule is implemented by sampling, in each round, parameters ${\displaystyle \theta ^{\ast }}$ from the posterior ${\displaystyle P(\theta |{\mathcal {D}})}$, and choosing the action ${\displaystyle a^{\ast }}$ that maximizes ${\displaystyle \mathbb {E} [r|\theta ^{\ast },a^{\ast },x]}$, i.e. the expected reward given the sampled parameters, the action and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round according to the posterior distribution, and then acts optimally according to them. In most practical applications, it is computationally onerous to maintain and sample from a posterior distribution over models. As such, Thompson sampling is often used in conjunction with approximate sampling techniques. [2]
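The sample-then-maximize procedure can be sketched for the simplest case, a context-free Bernoulli bandit with a uniform Beta(1, 1) prior on each action's success probability. This model, the simulated environment given by `true_means`, and the function name `thompson_bernoulli` are assumptions made for illustration. The conjugate prior makes exact posterior sampling cheap here, which is not true in general.

```python
import random


def thompson_bernoulli(true_means, num_rounds=2000, seed=0):
    """Run Thompson sampling on a simulated Bernoulli bandit.

    `true_means` holds the success probability of each action; it is used
    only to simulate rewards and is never shown to the player.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: 1 + number of observed successes
    beta = [1.0] * k   # 1 + number of observed failures
    pulls = [0] * k
    total_reward = 0
    for _ in range(num_rounds):
        # Sample theta* from the current posterior.
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act optimally with respect to the sampled parameters.
        action = max(range(k), key=theta.__getitem__)
        # Observe a reward from the (simulated) environment.
        reward = 1 if rng.random() < true_means[action] else 0
        # Conjugate posterior update for the played action only.
        alpha[action] += reward
        beta[action] += 1 - reward
        pulls[action] += 1
        total_reward += reward
    return pulls, total_reward


if __name__ == "__main__":
    pulls, total_reward = thompson_bernoulli([0.2, 0.5, 0.8])
    print(pulls, total_reward)
```

As the posteriors concentrate, samples for clearly inferior actions rarely come out on top, so exploration of those actions tapers off without any explicit schedule.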

## History

Thompson sampling was originally described by Thompson in 1933. [1] It was subsequently rediscovered numerous times independently in the context of multi-armed bandit problems. [3] [4] [5] [6] [7] [8] A first proof of convergence for the bandit case was given in 1997. [3] The first application to Markov decision processes was in 2000. [5] A related approach (see Bayesian control rule) was published in 2010. [4] In 2010 it was also shown that Thompson sampling is instantaneously self-correcting. [8] Asymptotic convergence results for contextual bandits were published in 2011. [6]

Thompson sampling is now widely used in online learning problems. It has been applied to A/B testing in website design and online advertising, [9] and it has formed the basis for accelerated learning in decentralized decision making. [10] A Double Thompson Sampling (D-TS) algorithm [11] has been proposed for dueling bandits, a variant of the traditional multi-armed bandit (MAB) problem in which feedback comes in the form of pairwise comparisons.

## Relationship to other approaches

### Probability matching

Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.
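A probability-matching predictor of this kind can be sketched in a few lines. The function name `probability_matching_predict` and the dictionary of base rates are hypothetical, and the predictor ignores the features of each instance entirely, since only the class base rates drive its output.

```python
import random


def probability_matching_predict(base_rates, num_predictions, seed=0):
    """Draw class labels with probabilities equal to the class base rates."""
    rng = random.Random(seed)
    labels = list(base_rates)
    weights = [base_rates[label] for label in labels]
    # Each prediction is an independent draw proportional to the base rates.
    return rng.choices(labels, weights=weights, k=num_predictions)


if __name__ == "__main__":
    predictions = probability_matching_predict(
        {"positive": 0.6, "negative": 0.4}, num_predictions=10000
    )
    print(predictions.count("positive") / len(predictions))
```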

### Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations. [4] In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let ${\displaystyle a_{1},a_{2},\ldots ,a_{T}}$ be the actions issued by an agent up to time ${\displaystyle T}$, and let ${\displaystyle o_{1},o_{2},\ldots ,o_{T}}$ be the observations gathered by the agent up to time ${\displaystyle T}$. Then, the agent issues the action ${\displaystyle a_{T+1}}$ with probability: [4]

${\displaystyle P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T}),}$

where the "hat"-notation ${\displaystyle {\hat {a}}_{t}}$ denotes the fact that ${\displaystyle a_{t}}$ is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs ${\displaystyle \theta \in \Theta }$ over its behaviors, then the Bayesian control rule becomes

${\displaystyle P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T})=\int _{\Theta }P(a_{T+1}|\theta ,{\hat {a}}_{1:T},o_{1:T})P(\theta |{\hat {a}}_{1:T},o_{1:T})\,d\theta }$,

where ${\displaystyle P(\theta |{\hat {a}}_{1:T},o_{1:T})}$ is the posterior distribution over the parameter ${\displaystyle \theta }$ given actions ${\displaystyle a_{1:T}}$ and observations ${\displaystyle o_{1:T}}$.

In practice, the Bayesian control rule amounts to sampling, in each time step, a parameter ${\displaystyle \theta ^{\ast }}$ from the posterior distribution ${\displaystyle P(\theta |{\hat {a}}_{1:T},o_{1:T})}$, and then sampling the action ${\displaystyle a_{T+1}^{\ast }}$ from the action distribution ${\displaystyle P(a_{T+1}|\theta ^{\ast },{\hat {a}}_{1:T},o_{1:T})}$. The posterior distribution is computed using Bayes' rule by considering only the (causal) likelihoods of the observations ${\displaystyle o_{1},o_{2},\ldots ,o_{T}}$ and ignoring the (causal) likelihoods of the actions ${\displaystyle a_{1},a_{2},\ldots ,a_{T}}$.
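The sampling procedure can be sketched for a finite parameter set. The two-hypothesis environment below (`REWARD_PROB`), the deterministic behaviour attached to each hypothesis (`policy`), and all function names are hypothetical choices made for illustration. The key point shown is that the posterior update multiplies in only the likelihood of the observation, because the agent's own actions are treated as interventions.

```python
import random

# Hypothetical model: under hypothesis theta, action a yields observation
# o = 1 with probability REWARD_PROB[theta][a].
REWARD_PROB = {0: (0.8, 0.2), 1: (0.2, 0.8)}


def policy(theta):
    """Behaviour associated with hypothesis theta: play the arm it favours."""
    return theta


def obs_likelihood(o, theta, a):
    """P(o | theta, a) for binary observations."""
    p = REWARD_PROB[theta][a]
    return p if o == 1 else 1.0 - p


def bayesian_control_rule(true_theta, num_steps=200, seed=0):
    rng = random.Random(seed)
    posterior = {0: 0.5, 1: 0.5}  # uniform prior over the two hypotheses
    actions = []
    for _ in range(num_steps):
        # Sample theta* from the posterior, then act according to theta*.
        thetas = list(posterior)
        theta_star = rng.choices(thetas, weights=[posterior[t] for t in thetas])[0]
        a = policy(theta_star)
        # The (simulated) environment responds according to the true hypothesis.
        o = 1 if rng.random() < REWARD_PROB[true_theta][a] else 0
        # Update with the observation likelihood only; P(a | theta) is not
        # multiplied in, since the action is an intervention.
        unnorm = {t: posterior[t] * obs_likelihood(o, t, a) for t in posterior}
        z = sum(unnorm.values())
        posterior = {t: w / z for t, w in unnorm.items()}
        actions.append(a)
    return posterior, actions


if __name__ == "__main__":
    posterior, actions = bayesian_control_rule(true_theta=1)
    print(posterior, sum(actions[-50:]))
```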

### Upper-Confidence-Bound (UCB) algorithms

Thompson sampling and upper-confidence bound algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic." Leveraging this property, one can translate regret bounds established for UCB algorithms to Bayesian regret bounds for Thompson sampling [12] or unify regret analysis across both these algorithms and many classes of problems. [13]
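A small numerical comparison can make the shared "optimistic" behaviour concrete. The sketch below uses the standard UCB1 index (empirical mean plus a bonus of ${\displaystyle {\sqrt {2\ln t/n}}}$) and a Beta-Bernoulli Thompson sampler with a uniform prior. These specific choices, the function names and the example statistics are assumptions for illustration and are not taken from the cited analyses.

```python
import math
import random


def ucb1_index(successes, pulls, t):
    """UCB1 score: empirical mean plus a bonus that shrinks as pulls grow."""
    return successes / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def thompson_index(successes, pulls, rng):
    """Thompson score: one draw from the Beta(1 + successes, 1 + failures) posterior."""
    return rng.betavariate(1 + successes, 1 + pulls - successes)


def compare_rules(stats, num_trials=5000, seed=0):
    """Return the UCB1 choice and the Thompson selection frequencies.

    `stats` is a list of (successes, pulls) pairs, one per arm.
    """
    rng = random.Random(seed)
    t = sum(pulls for _, pulls in stats)
    ucb_scores = [ucb1_index(s, n, t) for s, n in stats]
    ucb_choice = max(range(len(stats)), key=ucb_scores.__getitem__)
    counts = [0] * len(stats)
    for _ in range(num_trials):
        scores = [thompson_index(s, n, rng) for s, n in stats]
        counts[max(range(len(stats)), key=scores.__getitem__)] += 1
    return ucb_choice, [c / num_trials for c in counts]


if __name__ == "__main__":
    # Arm 0: 2 successes in 5 pulls; arm 1: 550 successes in 1000 pulls.
    print(compare_rules([(2, 5), (550, 1000)]))
```

In this example the rarely tried arm has the lower empirical mean but could still be the better one. UCB1 selects it deterministically because of its large bonus, while Thompson sampling selects it in a sizeable fraction of rounds because its posterior is wide.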

## References

1. Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933.
2. Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96. https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
3. J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
4. P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010.
5. M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1701
6. B. C. May, N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
7. Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling
8. O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3 (2), 2010, 207-234.
9. Ian Clarke. "Proportionate A/B testing", September 22nd, 2011, http://blog.locut.us/2011/09/22/proportionate-ab-testing/
10. Granmo, O. C.; Glimsdal, S. (2012). "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence. 38 (4): 479–488. doi:10.1007/s10489-012-0346-z.
11. Wu, Huasen; Liu, Xin; Srikant, R. (2016). "Double Thompson Sampling for Dueling Bandits". Bibcode:2016arXiv160407101W.
12. Daniel J. Russo and Benjamin Van Roy (2014), "Learning to Optimize Via Posterior Sampling", Mathematics of Operations Research, Vol. 39, No. 4, pp. 1221-1243, 2014. https://pubsonline.informs.org/doi/abs/10.1287/moor.2014.0650
13. Daniel J. Russo and Benjamin Van Roy (2013), "Eluder Dimension and the Sample Complexity of Optimistic Exploration", Advances in Neural Information Processing Systems 26, pp. 2256-2264. http://papers.nips.cc/paper/4909-eluder-dimension-and-the-sample-complexity-of-optimistic-exploration.pdf