Temporal difference learning

Last updated

Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods. [1]

Contents

While Monte Carlo methods only adjust their estimates once the final outcome is known, TD methods adjust predictions to match later, more accurate, predictions about the future before the final outcome is known. [2] This is a form of bootstrapping, as illustrated with the following example:

Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday's weather, given the weather of each day in the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should have a pretty good idea of what the weather would be on Saturday – and thus be able to change, say, Saturday's model before Saturday arrives. [2]

Temporal difference methods are related to the temporal difference model of animal learning. [3] [4] [5] [6] [7]

Mathematical formulation

The tabular TD(0) method is one of the simplest TD methods. It is a special case of more general stochastic approximation methods. It estimates the state value function of a finite-state Markov decision process (MDP) under a policy . Let denote the state value function of the MDP with states , rewards and discount rate [8] under the policy : [9]

We drop the action from the notation for convenience. satisfies the Hamilton-Jacobi-Bellman Equation:

so is an unbiased estimate for . This observation motivates the following algorithm for estimating .

The algorithm starts by initializing a table arbitrarily, with one value for each state of the MDP. A positive learning rate is chosen.

We then repeatedly evaluate the policy , obtain a reward and update the value function for the current state using the rule: [10]

where and are the current and next states, respectively. The value is known as the TD target, and is known as the TD error.

TD-Lambda

TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. [11] This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players. [12]

The lambda () parameter refers to the trace decay parameter, with . Higher settings lead to longer lasting traces; that is, a larger proportion of credit from a reward can be given to more distant states and actions when is higher, with producing parallel learning to Monte Carlo RL algorithms. [13]

In neuroscience

The TD algorithm has also received attention in the field of neuroscience. Researchers discovered that the firing rate of dopamine neurons in the ventral tegmental area (VTA) and substantia nigra (SNc) appear to mimic the error function in the algorithm. [3] [4] [5] [6] [7] The error function reports back the difference between the estimated reward at any given state or time step and the actual reward received. The larger the error function, the larger the difference between the expected and actual reward. When this is paired with a stimulus that accurately reflects a future reward, the error can be used to associate the stimulus with the future reward.

Dopamine cells appear to behave in a similar manner. In one experiment measurements of dopamine cells were made while training a monkey to associate a stimulus with the reward of juice. [14] Initially the dopamine cells increased firing rates when the monkey received juice, indicating a difference in expected and actual rewards. Over time this increase in firing back propagated to the earliest reliable stimulus for the reward. Once the monkey was fully trained, there was no increase in firing rate upon presentation of the predicted reward. Subsequently, the firing rate for the dopamine cells decreased below normal activation when the expected reward was not produced. This mimics closely how the error function in TD is used for reinforcement learning.

The relationship between the model and potential neurological function has produced research attempting to use TD to explain many aspects of behavioral research. [15] [16] It has also been used to study conditions such as schizophrenia or the consequences of pharmacological manipulations of dopamine on learning. [17]

See also

Notes

  1. Sutton & Barto (2018), p. 133.
  2. 1 2 Sutton, Richard S. (1 August 1988). "Learning to predict by the methods of temporal differences". Machine Learning. 3 (1): 9–44. doi: 10.1007/BF00115009 . ISSN   1573-0565. S2CID   207771194 . Retrieved 4 April 2023.
  3. 1 2 Schultz, W, Dayan, P & Montague, PR. (1997). "A neural substrate of prediction and reward". Science. 275 (5306): 1593–1599. CiteSeerX   10.1.1.133.6176 . doi:10.1126/science.275.5306.1593. PMID   9054347. S2CID   220093382.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  4. 1 2 Montague, P. R.; Dayan, P.; Sejnowski, T. J. (1996-03-01). "A framework for mesencephalic dopamine systems based on predictive Hebbian learning" (PDF). The Journal of Neuroscience. 16 (5): 1936–1947. doi:10.1523/JNEUROSCI.16-05-01936.1996. ISSN   0270-6474. PMC   6578666 . PMID   8774460.
  5. 1 2 Montague, P.R.; Dayan, P.; Nowlan, S.J.; Pouget, A.; Sejnowski, T.J. (1993). "Using aperiodic reinforcement for directed self-organization" (PDF). Advances in Neural Information Processing Systems. 5: 969–976.
  6. 1 2 Montague, P. R.; Sejnowski, T. J. (1994). "The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms". Learning & Memory. 1 (1): 1–33. doi: 10.1101/lm.1.1.1 . ISSN   1072-0502. PMID   10467583. S2CID   44560099.
  7. 1 2 Sejnowski, T.J.; Dayan, P.; Montague, P.R. (1995). "Predictive Hebbian learning". Proceedings of the eighth annual conference on Computational learning theory - COLT '95. pp. 15–18. doi: 10.1145/225298.225300 . ISBN   0897917235. S2CID   1709691.
  8. Discount rate parameter allows for a time preference toward more immediate rewards, and away from distant future rewards
  9. Sutton & Barto (2018), p. 134.
  10. Sutton & Barto (2018), p. 135.
  11. Sutton & Barto (2018), p. 130?.
  12. Tesauro (1995).
  13. Sutton & Barto (2018), p. 175.
  14. Schultz, W. (1998). "Predictive reward signal of dopamine neurons". Journal of Neurophysiology. 80 (1): 1–27. CiteSeerX   10.1.1.408.5994 . doi:10.1152/jn.1998.80.1.1. PMID   9658025. S2CID   52857162.
  15. Dayan, P. (2001). "Motivated reinforcement learning" (PDF). Advances in Neural Information Processing Systems. MIT Press. 14: 11–18.
  16. Tobia, M. J., etc. (2016). "Altered behavioral and neural responsiveness to counterfactual gains in the elderly". Cognitive, Affective, & Behavioral Neuroscience. 16 (3): 457–472. doi: 10.3758/s13415-016-0406-7 . PMID   26864879. S2CID   11299945.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  17. Smith, A., Li, M., Becker, S. and Kapur, S. (2006). "Dopamine, prediction error, and associative learning: a model-based account". Network: Computation in Neural Systems. 17 (1): 61–84. doi:10.1080/09548980500361624. PMID   16613795. S2CID   991839.{{cite journal}}: CS1 maint: multiple names: authors list (link)

Works cited

Further reading

Related Research Articles

<span class="mw-page-title-main">Reinforcement learning</span> Field of machine learning

Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

In physics, a wave vector is a vector used in describing a wave, with a typical unit being cycle per metre. It has a magnitude and direction. Its magnitude is the wavenumber of the wave, and its direction is perpendicular to the wavefront. In isotropic media, this is also the direction of wave propagation.

<span class="mw-page-title-main">Lambda cube</span>

In mathematical logic and type theory, the λ-cube is a framework introduced by Henk Barendregt to investigate the different dimensions in which the calculus of constructions is a generalization of the simply typed λ-calculus. Each dimension of the cube corresponds to a new kind of dependency between terms and types. Here, "dependency" refers to the capacity of a term or type to bind a term or type. The respective dimensions of the λ-cube correspond to:

In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov as they are an extension of Markov chains.

<span class="mw-page-title-main">Q-learning</span> Model-free reinforcement learning algorithm

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.

<span class="mw-page-title-main">Multi-armed bandit</span> Resource problem in machine learning

In probability theory and machine learning, the multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines, who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.

<span class="mw-page-title-main">Estimation of distribution algorithm</span>

Estimation of distribution algorithms (EDAs), sometimes called probabilistic model-building genetic algorithms (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabilistic models of promising candidate solutions. Optimization is viewed as a series of incremental updates of a probabilistic model, starting with the model encoding an uninformative prior over admissible solutions and ending with the model that generates only the global optima.

A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a sensor model and the underlying MDP. Unlike the policy function in MDP which maps the underlying states to the actions, POMDP's policy is a mapping from the history of observations to the actions.

In the mathematical discipline of graph theory, the expander walk sampling theorem intuitively states that sampling vertices in an expander graph by doing relatively short random walk can simulate sampling the vertices independently from a uniform distribution. The earliest version of this theorem is due to Ajtai, Komlós & Szemerédi (1987), and the more general version is typically attributed to Gillman (1998).

<span class="mw-page-title-main">State–action–reward–state–action</span>

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote.

Pendleton Read Montague, Jr. is an American neuroscientist and popular science author. He is the director of the Human Neuroimaging Lab and Computational Psychiatry Unit at the Fralin Biomedical Research Institute at VTC in Roanoke, Virginia, where he also holds the title of the inaugural Virginia Tech Carilion Vernon Mountcastle Research Professor. Montague is also a professor in the department of physics at Virginia Tech in Blacksburg, Virginia and professor of Psychiatry and Behavioral Medicine at Virginia Tech Carilion School of Medicine.

<span class="mw-page-title-main">Online machine learning</span> Method of machine learning

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

TD-Gammon is a computer backgammon program developed in 1992 by Gerald Tesauro at IBM's Thomas J. Watson Research Center. Its name comes from the fact that it is an artificial neural net trained by a form of temporal-difference learning, specifically TD-Lambda.

Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010.

<span class="mw-page-title-main">Mountain car problem</span>

Mountain Car, a standard testing domain in Reinforcement learning, is a problem in which an under-powered car must drive up a steep hill. Since gravity is stronger than the car's engine, even at full throttle, the car cannot simply accelerate up the steep slope. The car is situated in a valley and must learn to leverage potential energy by driving up the opposite hill before the car is able to make it to the goal at the top of the rightmost hill. The domain has been used as a test bed in various Reinforcement Learning papers.

In machine learning, automatic basis function construction is the mathematical method of looking for a set of task-independent basis functions that map the state space to a lower-dimensional embedding, while still representing the value function accurately. Automatic basis construction is independent of prior knowledge of the domain, which allows it to perform well where expert-constructed basis functions are difficult or impossible to create.

<span class="mw-page-title-main">Ramanujan's master theorem</span> Mathematical theorem

In mathematics, Ramanujan's Master Theorem, named after Srinivasa Ramanujan, is a technique that provides an analytic expression for the Mellin transform of an analytic function.

<span class="mw-page-title-main">Multiple kernel learning</span> Set of machine learning methods

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

<span class="mw-page-title-main">Model-free (reinforcement learning)</span> Type of machine learning algorithm

In reinforcement learning (RL), a model-free algorithm is an algorithm which does not estimate the transition probability distribution associated with the Markov decision process (MDP), which, in RL, represents the problem to be solved. The transition probability distribution and the reward function are often collectively called the "model" of the environment, hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. Typical examples of model-free algorithms include Monte Carlo RL, Sarsa, and Q-learning.

<span class="mw-page-title-main">Deep reinforcement learning</span> Machine learning that combines deep learning and reinforcement learning

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.