Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning.[1]

An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy, while others are off-policy. Some apply only to continuous action spaces, some only to discrete action spaces, and some to both.

Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods, such as REINFORCE, obtained by introducing a learned baseline that reduces the variance of the gradient estimate.

Actor

The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value Q-function $Q(s, a)$, the advantage function $A(s, a)$, or any combination thereof.

The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$.

If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int \pi_\theta(a \mid s) \, \mathrm{d}a = 1$.
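The two cases are commonly realized as a categorical distribution over discrete actions and a Gaussian distribution over continuous actions. The following is a minimal sketch, assuming PyTorch as the framework; the names obs_dim, n_actions, act_dim, and the hidden size of 64 are illustrative choices, not prescribed by the article.

```python
import torch
import torch.nn as nn

class DiscreteActor(nn.Module):
    """pi_theta(. | s) as a categorical distribution; probabilities sum to 1."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Softmax over logits yields a normalized distribution over actions.
        return torch.distributions.Categorical(logits=self.net(state))

class GaussianActor(nn.Module):
    """pi_theta(. | s) as a diagonal Gaussian; the density integrates to 1."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std (an assumption)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(state), self.log_std.exp())
```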

The goal of policy optimization is to improve the actor. That is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right],$$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time-horizon (which can be infinite).
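As a small illustration, the inner sum can be estimated empirically by averaging discounted returns over sampled episodes. This is only a sketch; rewards_per_episode is an assumed input, a list of per-episode reward sequences.

```python
def discounted_return(rewards, gamma=0.99):
    # Compute sum_t gamma^t * r_t for one episode, iterating backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_J(rewards_per_episode, gamma=0.99):
    # Monte Carlo estimate of J(theta) from sampled trajectories.
    returns = [discounted_return(rs, gamma) for rs in rewards_per_episode]
    return sum(returns) / len(returns)
```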

The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_\theta J(\theta)$.

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, \Psi_t\right],$$
where $\Psi_t$ is a linear sum of the following (a code sketch of the resulting estimator is given after the list):

- $\sum_{i=0}^{T} \gamma^i r_i$: the total episodic reward.
- $\sum_{i=t}^{T} \gamma^i r_i$: the reward following the action ("reward-to-go").
- $\sum_{i=t}^{T} \gamma^i r_i - b(s_t)$: the reward-to-go minus a baseline $b$.
- $\gamma^t Q^{\pi_\theta}(s_t, a_t)$: the Q-function.
- $\gamma^t A^{\pi_\theta}(s_t, a_t)$: the advantage function.
- $\gamma^t \left(r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right)$: the temporal difference (TD) residual.
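A minimal sketch of a single-trajectory gradient estimate, again assuming PyTorch: the actor supplies log-probabilities $\ln \pi_\theta(a_t \mid s_t)$ with gradients attached, and psi is whichever per-step signal $\Psi_t$ (e.g. an advantage estimate from the critic) is chosen; it is treated as a constant weight.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
    # log_probs: shape (T,), log pi_theta(a_t | s_t), differentiable w.r.t. theta
    # psi:       shape (T,), per-step weight Psi_t, detached so it acts as a constant
    return -(log_probs * psi.detach()).sum()

# Usage: calling .backward() on this loss accumulates an unbiased estimate of
# -grad J(theta) in the actor's parameters, so an optimizer step performs
# gradient ascent on J(theta).
```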

Critic

In the unbiased estimators given above, certain functions such as $V^{\pi_\theta}$, $Q^{\pi_\theta}$, and $A^{\pi_\theta}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function $V^{\pi_\theta}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_\phi(s)$ with parameters $\phi$.

The simplest example is one-step temporal difference (TD) learning, which trains the critic to minimize the TD error
$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).$$
The critic parameters $\phi$ are updated by gradient descent on the squared TD error:
$$\phi \leftarrow \phi + \alpha\, \delta_t\, \nabla_\phi V_\phi(s_t),$$
where $\alpha$ is the learning rate. Note that the gradient is taken with respect to $\phi$ in $V_\phi(s_t)$ only, since the $\phi$ in $\gamma V_\phi(s_{t+1})$ constitutes a moving target, and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
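The stop-gradient point is where autodiff implementations most often go wrong. A minimal sketch of one such update, assuming PyTorch with an arbitrary critic module and optimizer (illustrative names, not from the article):

```python
import torch

def td0_critic_update(critic, optimizer, s, r, s_next, gamma=0.99):
    # Bootstrap target is computed without gradient tracking ("stop gradient"):
    # it is a moving target and must not be differentiated through.
    with torch.no_grad():
        target = r + gamma * critic(s_next)
    td_error = target - critic(s)        # delta_t; gradient flows through V_phi(s) only
    loss = td_error.pow(2).mean()        # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return td_error.detach()             # often reused as an advantage signal for the actor
```

Using critic(s_next).detach() instead of the torch.no_grad() block achieves the same effect; omitting both silently changes the objective.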

Similarly, if the critic is estimating the action-value function $Q^{\pi_\theta}(s, a)$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_\phi(s, a)$. The temporal difference error is then calculated as $\delta_t = r_t + \gamma Q_\phi(s_{t+1}, a_{t+1}) - Q_\phi(s_t, a_t)$. The critic is then updated by
$$\phi \leftarrow \phi + \alpha\, \delta_t\, \nabla_\phi Q_\phi(s_t, a_t).$$

The advantage critic can be trained by learning both a Q-function $Q_\phi$ and a state-value function $V_\phi$, then letting $A(s, a) = Q_\phi(s, a) - V_\phi(s)$. It is more common, however, to train just a state-value function $V_\phi$ and estimate the advantage by an $n$-step return:[3]
$$A(s_t, a_t) \approx \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t).$$
Here, $n$ is a positive integer. The higher $n$ is, the lower the bias in the advantage estimation, but at the price of higher variance.
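A minimal sketch of the $n$-step advantage estimate at a single time step $t$; rewards and values are assumed to be precomputed lists with enough entries ($r_t, \dots, r_{t+n-1}$ and $V_\phi(s_t), V_\phi(s_{t+n})$).

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    # Discounted n-step return from step t ...
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    g += gamma**n * values[t + n]       # ... bootstrapped with V_phi(s_{t+n})
    return g - values[t]                # subtract the baseline V_phi(s_t)
```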

Generalized Advantage Estimation (GAE) introduces a hyperparameter $\lambda \in [0, 1]$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$: high variance, no bias) and one-step TD learning ($\lambda = 0$: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of $n$-step returns, with $\lambda$ being the decay strength.[4]
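The exponentially decaying average can be computed in a single backward pass over the trajectory, since the GAE advantage satisfies a simple recursion in the one-step TD errors. A minimal sketch, assuming values has length $T+1$ so that $V_\phi(s_T)$ is available for bootstrapping:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: r_0..r_{T-1}; values: V_phi(s_0)..V_phi(s_T) (length T+1, an assumption)
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # exponentially decayed sum
        advantages[t] = running
    return advantages
```

Setting lam=0 recovers the one-step TD error as the advantage, while lam=1 recovers the Monte Carlo return minus the value baseline.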

Variants

- Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) use the advantage function as the critic's signal; A3C additionally runs several actor-learners in parallel.[3]
- Soft Actor-Critic (SAC) is an off-policy actor-critic method for continuous action spaces that augments the reward with an entropy term.[5]
- Deep Deterministic Policy Gradient (DDPG) is an off-policy actor-critic method for continuous action spaces that uses a deterministic actor together with a learned Q-function critic.[6]

References

  1. Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. arXiv:1708.05866. Bibcode:2017ISPM...34...26A. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888.
  2. Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press.
  3. Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (2016). "Asynchronous Methods for Deep Reinforcement Learning". arXiv:1602.01783.
  4. Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018). "High-Dimensional Continuous Control Using Generalized Advantage Estimation". arXiv:1506.02438.
  5. Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (2019). "Soft Actor-Critic Algorithms and Applications". arXiv:1812.05905.
  6. Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2019). "Continuous control with deep reinforcement learning". arXiv:1509.02971.