Thompson sampling

Last updated November 19, 2024

Thompson sampling,^[1]^[2]^[3] named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

Description

Consider a set of contexts ${\mathcal {X}}$ , a set of actions ${\mathcal {A}}$ , and rewards in $\mathbb {R}$ . The aim of the player is to play actions under the various contexts so as to maximize the cumulative reward. Specifically, in each round, the player obtains a context $x\in {\mathcal {X}}$ , plays an action $a\in {\mathcal {A}}$ and receives a reward $r\in \mathbb {R}$ following a distribution that depends on the context and the issued action.

The elements of Thompson sampling are as follows:

a likelihood function $P(r|\theta ,a,x)$ ;
a set $\Theta$ of parameters $\theta$ of the distribution of $r$ ;
a prior distribution $P(\theta )$ on these parameters;
past observation triplets ${\mathcal {D}}=\{(x;a;r)\}$ ;
a posterior distribution $P(\theta |{\mathcal {D}})\propto P({\mathcal {D}}|\theta )P(\theta )$ , where $P({\mathcal {D}}|\theta )$ is the likelihood function.
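As one concrete instance of these elements, consider a context-free bandit with Bernoulli rewards and an independent Beta prior on each action's success probability. This choice of model is an assumption made here for illustration; the symbols $\theta _{a}$ , $\alpha _{a}$ , $\beta _{a}$ , $s_{a}$ and $f_{a}$ are not part of the general definition. Here $s_{a}$ and $f_{a}$ denote the number of rewards equal to 1 and 0 observed for action $a$ in ${\mathcal {D}}$ .

```latex
\begin{aligned}
P(r \mid \theta, a) &= \theta_a^{\,r}\,(1-\theta_a)^{1-r}, \qquad r \in \{0, 1\},\\
P(\theta) &= \prod_{a \in \mathcal{A}} \operatorname{Beta}(\theta_a;\ \alpha_a, \beta_a),\\
P(\theta \mid \mathcal{D}) &= \prod_{a \in \mathcal{A}} \operatorname{Beta}(\theta_a;\ \alpha_a + s_a,\ \beta_a + f_a).
\end{aligned}
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior stays in the same family and updating it only requires counting successes and failures per action.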

Thompson sampling consists of playing the action $a^{\ast }\in {\mathcal {A}}$ according to the probability that it maximizes the expected reward; action $a^{\ast }$ is chosen with probability

$$\int \mathbb {I} \left[\mathbb {E} (r|a^{\ast },x,\theta )=\max _{a'}\mathbb {E} (r|a',x,\theta )\right]P(\theta |{\mathcal {D}})\,d\theta ,$$

where $\mathbb {I}$ is the indicator function.
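The integral above rarely has a closed form, but it can be approximated by averaging the indicator over posterior draws. The following is a minimal sketch under stated assumptions: rewards are Bernoulli, there is no context, and each action has a Beta posterior, so that the expected reward of action $a$ given $\theta$ is simply $\theta _{a}$ . The function name `prob_optimal` and the example posteriors are hypothetical.

```python
import random


def prob_optimal(posteriors, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that each action is optimal.

    posteriors: list of (alpha, beta) Beta parameters, one pair per action
    (an assumed Beta-Bernoulli model, not the general case).
    """
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(n_samples):
        # Draw theta from the posterior; for Bernoulli rewards E[r | a, theta] = theta[a].
        theta = [rng.betavariate(a, b) for a, b in posteriors]
        # The indicator is 1 for the action with the largest expected reward.
        best = max(range(len(theta)), key=theta.__getitem__)
        wins[best] += 1
    return [w / n_samples for w in wins]


# Two actions with hypothetical posteriors Beta(3, 2) and Beta(2, 3).
probs = prob_optimal([(3, 2), (2, 3)])
```

The entries of `probs` sum to one, and each entry estimates how often Thompson sampling would play the corresponding action given these posteriors.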

In practice, the rule is implemented by sampling. In each round, parameters $\theta ^{\ast }$ are sampled from the posterior $P(\theta |{\mathcal {D}})$ , and an action $a^{\ast }$ is chosen that maximizes $\mathbb {E} [r|\theta ^{\ast },a^{\ast },x]$ , i.e. the expected reward given the sampled parameters, the action, and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round according to the posterior distribution, and then acts optimally according to them. In most practical applications, it is computationally onerous to maintain and sample from a posterior distribution over models. As such, Thompson sampling is often used in conjunction with approximate sampling techniques.^[3]
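The sample-then-maximize loop can be sketched for the simplest case. The code below assumes a context-free bandit with Bernoulli rewards and a uniform Beta(1, 1) prior on each action, where the posterior can be sampled exactly; the function name `thompson_bernoulli` and the reward probabilities in `true_means` are hypothetical and serve only to simulate an environment.

```python
import random


def thompson_bernoulli(true_means, n_rounds=2000, seed=0):
    """Run Thompson sampling on a simulated Bernoulli bandit.

    true_means: success probability of each action (unknown to the player,
    used here only to simulate rewards).
    Returns the number of times each action was played.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: an assumption of this sketch
    beta = [1.0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # 1. Sample parameters theta* from the current posterior.
        theta = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # 2. Play the action that is optimal under the sampled parameters.
        a_star = max(range(k), key=theta.__getitem__)
        # 3. Observe a reward from the simulated environment.
        reward = 1 if rng.random() < true_means[a_star] else 0
        # 4. Conjugate posterior update for the played action.
        alpha[a_star] += reward
        beta[a_star] += 1 - reward
        pulls[a_star] += 1
    return pulls


pulls = thompson_bernoulli([0.2, 0.5, 0.8])
```

As the posteriors concentrate, draws for the inferior actions rarely come out on top, so most rounds go to the action with the highest success probability. With richer models the exact `betavariate` draw would be replaced by an approximate posterior sampler.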

History

Thompson sampling was originally described by Thompson in 1933.^[1] It was subsequently rediscovered numerous times independently in the context of multi-armed bandit problems.^[4]^[5]^[6]^[7]^[8]^[9] A first proof of convergence for the bandit case was given in 1997.^[4] The first application to Markov decision processes was in 2000.^[6] A related approach (see Bayesian control rule) was published in 2010.^[5] In 2010 it was also shown that Thompson sampling is instantaneously self-correcting.^[9] Asymptotic convergence results for contextual bandits were published in 2011.^[7]

Thompson sampling has been widely used in online learning problems, including A/B testing in website design and online advertising,^[10] and accelerated learning in decentralized decision making.^[11] A Double Thompson Sampling (D-TS) algorithm^[12] has been proposed for dueling bandits, a variant of the traditional multi-armed bandit (MAB) in which feedback comes in the form of pairwise comparisons.

Relationship to other approaches

Probability matching

Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.
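The 60%/40% example can be simulated directly. The sketch below draws predictions at the class base rates and, as an added illustration, compares the expected accuracy of probability matching with that of always predicting the majority class. It assumes predictions are made independently of the true labels; the function name is hypothetical.

```python
import random


def probability_matching_predictions(base_rate_positive, n, seed=0):
    """Predict 'positive' with probability equal to the observed base rate."""
    rng = random.Random(seed)
    return [
        "positive" if rng.random() < base_rate_positive else "negative"
        for _ in range(n)
    ]


preds = probability_matching_predictions(0.6, 10000)
share_positive = preds.count("positive") / len(preds)

# Expected accuracy when predictions are independent of the true labels.
p = 0.6
matching_accuracy = p * p + (1 - p) * (1 - p)  # both "positive" or both "negative"
majority_accuracy = max(p, 1 - p)              # always predict the majority class
```

Under this independence assumption, matching is correct with probability 0.6 × 0.6 + 0.4 × 0.4 = 0.52, compared with 0.6 for always predicting "positive".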

Bayesian control rule

A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations.^[5] In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.

The setup is as follows. Let $a_{1},a_{2},\ldots ,a_{T}$ be the actions issued by an agent up to time $T$ , and let $o_{1},o_{2},\ldots ,o_{T}$ be the observations gathered by the agent up to time $T$ . Then, the agent issues the action $a_{T+1}$ with probability:^[5]

$$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T}),$$

where the "hat"-notation ${\hat {a}}_{t}$ denotes the fact that $a_{t}$ is a causal intervention (see Causality), and not an ordinary observation. If the agent holds beliefs $\theta \in \Theta$ over its behaviors, then the Bayesian control rule becomes

$$P(a_{T+1}|{\hat {a}}_{1:T},o_{1:T})=\int _{\Theta }P(a_{T+1}|\theta ,{\hat {a}}_{1:T},o_{1:T})P(\theta |{\hat {a}}_{1:T},o_{1:T})\,d\theta ,$$

where $P(\theta |{\hat {a}}_{1:T},o_{1:T})$ is the posterior distribution over the parameter $\theta$ given actions $a_{1:T}$ and observations $o_{1:T}$ .

In practice, the Bayesian control rule amounts to two sampling steps at each time step. First, a parameter $\theta ^{\ast }$ is sampled from the posterior distribution $P(\theta |{\hat {a}}_{1:T},o_{1:T})$ , where the posterior is computed using Bayes' rule by considering only the (causal) likelihoods of the observations $o_{1},o_{2},\ldots ,o_{T}$ and ignoring the (causal) likelihoods of the actions $a_{1},a_{2},\ldots ,a_{T}$ . Second, the action $a_{T+1}^{\ast }$ is sampled from the action distribution $P(a_{T+1}|\theta ^{\ast },{\hat {a}}_{1:T},o_{1:T})$ .
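The two sampling steps can be sketched for a finite parameter set. Everything concrete in the code below is an assumption made for illustration: the hypotheses (under $\theta =i$ , arm $i$ pays with probability 0.8 and every other arm with probability 0.2), the slightly randomized behaviour attached to each hypothesis, and all function names. The point it shows is that the posterior is updated with the observation likelihood only, while the action is treated as an intervention and contributes no likelihood term.

```python
import random


def obs_likelihood(o, theta, a):
    """P(o | theta, a): under hypothesis theta, arm theta pays w.p. 0.8, others 0.2."""
    p = 0.8 if a == theta else 0.2
    return p if o == 1 else 1.0 - p


def sample_action(theta, n_arms, rng):
    """P(a | theta): mostly play the arm that theta considers best."""
    return theta if rng.random() < 0.9 else rng.randrange(n_arms)


def bayesian_control_rule(true_theta, n_arms=2, n_steps=200, seed=0):
    rng = random.Random(seed)
    posterior = [1.0 / n_arms] * n_arms  # uniform prior over hypotheses
    actions = []
    for _ in range(n_steps):
        # Step 1: sample a parameter from the current posterior.
        theta_star = rng.choices(range(n_arms), weights=posterior)[0]
        # Step 2: sample the action from the behaviour indexed by theta_star.
        a = sample_action(theta_star, n_arms, rng)
        # The simulated environment produces an observation.
        o = 1 if rng.random() < (0.8 if a == true_theta else 0.2) else 0
        # Bayes' rule with the observation likelihood only; the action is an
        # intervention, so P(a | theta) does not enter the update.
        unnorm = [posterior[t] * obs_likelihood(o, t, a) for t in range(n_arms)]
        z = sum(unnorm)
        posterior = [u / z for u in unnorm]
        actions.append(a)
    return posterior, actions


posterior, actions = bayesian_control_rule(true_theta=1)
```

In this toy setting the posterior concentrates on the hypothesis that matches the simulated environment, and the agent's actions then follow the behaviour attached to that hypothesis.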

Upper-Confidence-Bound (UCB) algorithms

Thompson sampling and upper-confidence-bound algorithms share a fundamental property that underlies many of their theoretical guarantees. Roughly speaking, both algorithms allocate exploratory effort to actions that might be optimal and are in this sense "optimistic". Leveraging this property, one can translate regret bounds established for UCB algorithms to Bayesian regret bounds for Thompson sampling^[13] or unify regret analysis across both these algorithms and many classes of problems.^[14]
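The shared "optimism" can be seen on a small example. The sketch below uses the UCB1 index (empirical mean plus a bonus of $\sqrt {2\ln t/n}$ ) as one representative UCB algorithm and a Beta(1, 1) prior for Thompson sampling; both choices, the function names and the arm statistics are assumptions made for illustration.

```python
import math
import random


def ucb1_index(successes, pulls, t):
    """UCB1 index: empirical mean plus an exploration bonus."""
    return successes / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def thompson_draw(successes, pulls, rng):
    """One posterior draw for a Bernoulli arm under an assumed Beta(1, 1) prior."""
    return rng.betavariate(1 + successes, 1 + pulls - successes)


# Hypothetical statistics: arm 0 is barely explored, arm 1 is well explored.
stats = [(1, 2), (60, 100)]  # (successes, pulls) per arm
t = sum(pulls for _, pulls in stats)

# UCB: the large bonus of the under-explored arm gives it the higher index.
ucb = [ucb1_index(s, n, t) for s, n in stats]

# Thompson sampling: the wide posterior of arm 0 often produces the larger draw.
rng = random.Random(0)
n_draws = 10000
arm0_wins = sum(
    thompson_draw(*stats[0], rng) > thompson_draw(*stats[1], rng)
    for _ in range(n_draws)
)
frac_arm0 = arm0_wins / n_draws
```

Although arm 0 has the lower empirical mean, UCB1 selects it because of its exploration bonus, and Thompson sampling selects it in a substantial fraction of rounds because its posterior is still wide. Both rules direct effort toward an arm that might yet turn out to be optimal.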

References

[1] Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933.
[2] Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450-456.
[3] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends in Machine Learning: Vol. 11: No. 1, pp 1-96. https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
[4] J. Wyatt. Exploration and Inference in Learning from Reinforcement. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh. March 1997.
[5] P. A. Ortega and D. A. Braun. "A Minimum Relative Entropy Principle for Learning and Acting", Journal of Artificial Intelligence Research, 38, pages 475–511, 2010, http://arxiv.org/abs/0810.3605
[6] M. J. A. Strens. "A Bayesian Framework for Reinforcement Learning", Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, June 29–July 2, 2000, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1701
[7] B. C. May, N. Korda, A. Lee, and D. S. Leslie. "Optimistic Bayesian sampling in contextual-bandit problems". Technical report, Statistics Group, Department of Mathematics, University of Bristol, 2011.
[8] Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." Advances in neural information processing systems. 2011. http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling
[9] O.-C. Granmo. "Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton", International Journal of Intelligent Computing and Cybernetics, 3 (2), 2010, 207-234.
[10] Ian Clarke. "Proportionate A/B testing", September 22nd, 2011, http://blog.locut.us/2011/09/22/proportionate-ab-testing/
[11] Granmo, O. C.; Glimsdal, S. (2012). "Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game". Applied Intelligence. 38 (4): 479–488. doi:10.1007/s10489-012-0346-z. hdl:11250/137969. S2CID 8746483.
[12] Wu, Huasen; Liu, Xin; Srikant, R (2016), Double Thompson Sampling for Dueling Bandits, arXiv:1604.07101, Bibcode:2016arXiv160407101W
[13] Russo, Daniel J.; Van Roy, Benjamin (2014). "Learning to Optimize Via Posterior Sampling". Mathematics of Operations Research. 39 (4): 1221–1243. arXiv:1301.2609. doi:10.1287/moor.2014.0650.
[14] Daniel J. Russo and Benjamin Van Roy (2013), "Eluder Dimension and the Sample Complexity of Optimistic Exploration", Advances in Neural Information Processing Systems 26, pp. 2256-2264. https://proceedings.neurips.cc/paper/2013/file/41bfd20a38bb1b0bec75acf0845530a7-Paper.pdf

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
