Policy gradient methods are a class of reinforcement learning algorithms.
Policy gradient methods are a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function and derive a policy from it, policy optimization methods directly learn a policy function that selects actions without consulting a value function. For policy gradient methods to apply, the policy function $\pi_\theta$ must be parameterized by a differentiable parameter $\theta$.
In policy-based RL, the actor is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$ over actions.
If the action space $\mathcal{A}$ is discrete, then $\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_{\mathcal{A}} \pi_\theta(a \mid s)\, \mathrm{d}a = 1$.
The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t R_t \,\middle|\, S_0 = s_0\right],$$
where $\gamma$ is the discount factor, $R_t$ is the reward at step $t$, $s_0$ is the starting state, and $T$ is the time-horizon (which can be infinite).
The policy gradient is defined as $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize $J(\theta)$ by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, such methods are also studied under the title of "Monte Carlo gradient estimation".[1]
The REINFORCE algorithm was the first policy gradient method.[2] It is based on the following identity for the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{t'=0}^{T} \gamma^{t'} R_{t'} \,\middle|\, S_0 = s_0\right],$$
which can be improved via the "causality trick"[3]
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{t'=t}^{T} \gamma^{t'} R_{t'} \,\middle|\, S_0 = s_0\right].$$
Lemma — The expectation of the score function is zero, conditional on any present or past state. That is, for any $0 \le i \le j \le T$ and any state $s$, we have
$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\middle|\, S_i = s\right] = 0.$$
Further, if $\Psi$ is a random variable that is independent of $A_j$ conditional on $S_j$ (for example, any function of the trajectory up to and including $S_j$), then
$$\mathbb{E}_{\pi_\theta}\!\left[\Psi\, \nabla_\theta \ln \pi_\theta(A_j \mid S_j)\right] = 0.$$
Proof: Use the log-derivative trick:
$$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\middle|\, S_j = s\right] = \int_{\mathcal{A}} \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s)\, \mathrm{d}a = \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \mathrm{d}a = \nabla_\theta \int_{\mathcal{A}} \pi_\theta(a \mid s)\, \mathrm{d}a.$$
Since the policy is a probability distribution over actions for a given state, $\int_{\mathcal{A}} \pi_\theta(a \mid s)\, \mathrm{d}a = 1$ (a sum in the discrete case), so the right-hand side is $\nabla_\theta 1 = 0$.
By the tower law and the display above, the statement also holds conditional on any earlier state $S_i = s$; and multiplying the score by any $\Psi$ that is independent of $A_j$ given $S_j$ still gives zero expectation, since
$$\mathbb{E}_{\pi_\theta}\!\left[\Psi\, \nabla_\theta \ln \pi_\theta(A_j \mid S_j)\right] = \mathbb{E}_{\pi_\theta}\!\left[\mathbb{E}_{\pi_\theta}[\Psi \mid S_j]\; \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(A_j \mid S_j) \,\middle|\, S_j\right]\right] = 0.$$
Proof of the two identities: Applying the log-derivative trick to the trajectory distribution, and noting that the initial state distribution and the transition probabilities do not depend on $\theta$, we get
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\left(\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\right) \sum_{t'=0}^{T} \gamma^{t'} R_{t'} \,\middle|\, S_0 = s_0\right],$$
which is the first equation.
By the lemma, $\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(A_t \mid S_t)\, \gamma^{t'} R_{t'} \,\middle|\, S_0 = s_0\right] = 0$ for any $t' < t$, since the reward $R_{t'}$ is determined before action $A_t$ is taken and is therefore independent of $A_t$ given $S_t$. Plugging this into the previous formula, we zero out a whole triangle of terms, to get
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{t'=t}^{T} \gamma^{t'} R_{t'} \,\middle|\, S_0 = s_0\right],$$
which is the second equation.
Thus, we have an unbiased estimator of the policy gradient:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_{n,t} \mid S_{n,t}) \sum_{t'=t}^{T} \gamma^{t'} R_{n,t'},$$
where the index $n$ ranges over $N$ rollout trajectories sampled using the policy $\pi_\theta$.
The score function $\nabla_\theta \ln \pi_\theta(a \mid s)$ can be interpreted as the direction in parameter space that increases the probability of taking action $a$ in state $s$. The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, weighted by reward signals, so that if taking a certain action in a certain state is associated with high reward, then that direction is strongly reinforced, and vice versa.
The REINFORCE algorithm is a loop:
1. Roll out $N$ trajectories using the current policy $\pi_{\theta_i}$.
2. Compute the policy gradient estimate
$$g_i = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \ln \pi_{\theta_i}(A_{n,t} \mid S_{n,t}) \sum_{t'=t}^{T} \gamma^{t'} R_{n,t'}.$$
3. Update the policy by gradient ascent: $\theta_{i+1} \leftarrow \theta_i + \alpha_i\, g_i$.
Here, $\alpha_i$ is the learning rate at update step $i$.
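As a concrete illustration, a minimal Python sketch of this loop is given below, using a tabular softmax policy on a made-up two-state environment. The environment, the hyperparameters, and the helper names are assumptions of the example, not part of REINFORCE itself; the gradient estimate is the reward-to-go estimator above.

```python
# Minimal REINFORCE sketch on a toy 2-state MDP with a tabular softmax policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, T = 2, 2, 0.9, 20

def step(s, a):
    # Toy dynamics (an assumption of the example): action 1 in state 1 gives reward 1.
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return rng.integers(n_states), r          # transitions are uniformly random

def policy(theta, s):                         # softmax policy pi_theta(. | s)
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(theta, s, a):                 # score function of the tabular softmax
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

theta, alpha, N = np.zeros((n_states, n_actions)), 0.1, 16
for update in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                        # N rollout trajectories
        s, traj = rng.integers(n_states), []
        for t in range(T):
            a = rng.choice(n_actions, p=policy(theta, s))
            s_next, r = step(s, a)
            traj.append((s, a, r))
            s = s_next
        rewards = np.array([r for _, _, r in traj])
        for t, (s_t, a_t, _) in enumerate(traj):
            # reward-to-go with gamma^{t'} discounting from the episode start (causality trick)
            G_t = np.sum(gamma ** np.arange(t, T) * rewards[t:])
            grad += grad_log_pi(theta, s_t, a_t) * G_t
    theta += alpha * grad / N                 # gradient ascent step
```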
REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$. This can lead to high variance in the updates, as the returns can vary significantly between trajectories. Many variants of REINFORCE have been introduced under the title of variance reduction.
A common way of reducing variance is the REINFORCE with baseline algorithm, based on the following identity:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{t'=t}^{T} \gamma^{t'} R_{t'} - b_t(S_t)\right) \,\middle|\, S_0 = s_0\right]$$
for any function $b_t : \text{States} \to \mathbb{R}$. This can be proven by applying the previous lemma with $\Psi = b_t(S_t)$.
The algorithm uses the modified gradient estimator
$$g_i = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \ln \pi_{\theta_i}(A_{n,t} \mid S_{n,t}) \left(\sum_{t'=t}^{T} \gamma^{t'} R_{n,t'} - b_t(S_{n,t})\right),$$
and the original REINFORCE algorithm is the special case where $b_t \equiv 0$.
If $b_t$ is chosen well, such that $b_t(S_t) \approx \sum_{t'=t}^{T} \gamma^{t'}\, \mathbb{E}_{\pi_\theta}[R_{t'} \mid S_t]$, this could significantly decrease the variance of the gradient estimate. That is, the baseline should be as close to the value function as possible, approaching the ideal of
$$b_t(s) = \gamma^t\, V^{\pi_\theta}(s), \qquad V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t'=t}^{T} \gamma^{t'-t} R_{t'} \,\middle|\, S_t = s\right].$$
Note that, as the policy updates, the value function updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic.
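A minimal sketch of this baseline idea, continuing the tabular setting of the earlier example: the critic is a per-state value table moved toward observed returns, and the reward-to-go minus the baseline replaces the raw reward-to-go in the gradient estimate. The function names and the running-average update rule are illustrative assumptions.

```python
# REINFORCE-with-baseline helpers for a tabular critic.  Returns here are
# discounted from step t, so the matching baseline is V(s) rather than gamma**t * V(s).
import numpy as np

def update_critic(V, visited, lr=0.1):
    """V: array of shape (n_states,); visited: list of (state, observed return) pairs."""
    for s, G in visited:
        V[s] += lr * (G - V[s])            # move V(s) toward the observed return
    return V

def advantages_with_baseline(trajectory, V, gamma):
    """trajectory: list of (state, action, reward); yields (state, action, return-to-go minus V(state))."""
    rewards = np.array([r for _, _, r in trajectory])
    out = []
    for t, (s, a, _) in enumerate(trajectory):
        G_t = np.sum(gamma ** np.arange(len(rewards) - t) * rewards[t:])   # return discounted from step t
        out.append((s, a, G_t - V[s]))     # subtract the learned baseline
    return out
```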
The Q-function can also be used as the critic, since
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\, \gamma^t\, Q^{\pi_\theta}(S_t, A_t) \,\middle|\, S_0 = s_0\right]$$
by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the advantage function $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ can be used as the critic as well:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\, \gamma^t\, A^{\pi_\theta}(S_t, A_t) \,\middle|\, S_0 = s_0\right].$$
In summary, there are many unbiased estimators of $\nabla_\theta J(\theta)$, all of the form
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t)\, \Psi_t \,\middle|\, S_0 = s_0\right],$$
where $\Psi_t$ is any linear sum of the following terms:
- $\sum_{t'=0}^{T} \gamma^{t'} R_{t'}$: the total discounted reward of the trajectory;
- $\sum_{t'=t}^{T} \gamma^{t'} R_{t'}$: the reward-to-go after action $A_t$;
- $\sum_{t'=t}^{T} \gamma^{t'} R_{t'} - b_t(S_t)$: the reward-to-go minus a baseline;
- $\gamma^t\, Q^{\pi_\theta}(S_t, A_t)$: the Q-function;
- $\gamma^t\, A^{\pi_\theta}(S_t, A_t)$: the advantage function;
- $\gamma^t\left(R_t + \gamma V^{\pi_\theta}(S_{t+1}) - V^{\pi_\theta}(S_t)\right)$: the temporal-difference (TD) residual.
Some more possible $\Psi_t$ are as follows, with very similar proofs:
- the $n$-step TD residual $\gamma^t\left(R_t + \gamma R_{t+1} + \cdots + \gamma^{n-1} R_{t+n-1} + \gamma^{n} V^{\pi_\theta}(S_{t+n}) - V^{\pi_\theta}(S_t)\right)$;
- the Generalized Advantage Estimate (GAE) $\gamma^t \sum_{k \ge 0} (\gamma\lambda)^k \delta_{t+k}$, an exponentially weighted average of the $n$-step residuals with weight $\lambda \in [0, 1]$, where $\delta_t = R_t + \gamma V^{\pi_\theta}(S_{t+1}) - V^{\pi_\theta}(S_t)$.
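The GAE term is usually evaluated by a backward recursion over the trajectory. Below is a minimal sketch of that computation, assuming the per-step rewards and the critic's value estimates are already available; the array names and the zero terminal value are assumptions of the example.

```python
# Generalized Advantage Estimation for one trajectory via the backward recursion
# A_t = delta_t + gamma*lambda*A_{t+1}.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: shape (T,); values: shape (T+1,), with values[T] the terminal value (0 if the episode ends)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages
```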
The natural policy gradient method is a variant of the policy gradient method, proposed by Sham Kakade in 2001.[5] Unlike standard policy gradient methods, which depend on the choice of parameters (making updates coordinate-dependent), the natural policy gradient aims to provide a coordinate-free update, which is geometrically "natural".
Standard policy gradient updates solve a constrained optimization problem:
$$\max_{\theta}\; (\theta - \theta_t)^{T}\, \nabla_\theta J(\theta_t) \quad \text{subject to} \quad \tfrac{1}{2}\|\theta - \theta_t\|^2 \le \epsilon.$$
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint introduces coordinate dependence. To address this, the natural policy gradient replaces the Euclidean constraint with a Kullback–Leibler divergence (KL) constraint:
$$\max_{\theta}\; (\theta - \theta_t)^{T}\, \nabla_\theta J(\theta_t) \quad \text{subject to} \quad \overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t}) \le \epsilon,$$
where the KL divergence between two policies is averaged over the state distribution under policy $\pi_{\theta_t}$. That is,
$$\overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t}) = \mathbb{E}_{s \sim \pi_{\theta_t}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid s) \,\|\, \pi_{\theta_t}(\cdot \mid s)\right)\right].$$
This ensures updates are invariant to invertible affine parameter transformations.
For small $\epsilon$, the KL divergence is approximated by the Fisher information metric:
$$\overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t}) \approx \tfrac{1}{2}(\theta - \theta_t)^{T} F(\theta_t)\, (\theta - \theta_t),$$
where $F(\theta)$ is the Fisher information matrix of the policy, defined as:
$$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\, \nabla_\theta \ln \pi_\theta(a \mid s)^{T}\right].$$
This turns the update into a quadratically constrained problem, yielding the natural policy gradient update:
$$\theta_{t+1} = \theta_t + \alpha\, F(\theta_t)^{-1} \nabla_\theta J(\theta_t).$$
The step size $\alpha$ is typically adjusted to maintain the KL constraint, with
$$\alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta_t)^{T}\, F(\theta_t)^{-1}\, \nabla_\theta J(\theta_t)}}.$$
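For a low-dimensional parameter vector, this update can be sketched directly, forming and solving the Fisher system explicitly. In the sketch below, `F` and `grad_J` are assumed to have been estimated from rollouts, and the damping term is an added numerical safeguard, not part of the formula above.

```python
# Natural policy gradient step by direct solve of the Fisher system.
import numpy as np

def natural_gradient_step(theta, grad_J, F, epsilon=0.01, damping=1e-3):
    F_damped = F + damping * np.eye(len(theta))       # damping for numerical stability (an added safeguard)
    direction = np.linalg.solve(F_damped, grad_J)     # F^{-1} grad_J
    alpha = np.sqrt(2 * epsilon / (grad_J @ direction))
    return theta + alpha * direction
```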
Inverting $F(\theta)$ is computationally intensive, especially for high-dimensional parameters (e.g., neural networks). Practical implementations therefore use approximations that avoid forming or inverting the Fisher matrix explicitly.
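A common such approximation is to solve $F x = \nabla_\theta J$ by conjugate gradient using only Fisher-vector products, never materializing $F$. A minimal sketch follows; the `fisher_vector_product` callable is a placeholder for however such products are obtained in a given implementation (for example, via automatic differentiation of the KL divergence).

```python
# Conjugate gradient for F x = g using only matrix-vector products with F.
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()             # residual g - F x (x = 0 initially)
    p = g.copy()             # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                 # approximately F^{-1} g, the natural gradient direction
```

The resulting $x \approx F^{-1}\nabla_\theta J$ is then used in place of the exact natural gradient direction.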
Trust Region Policy Optimization (TRPO) is a policy gradient method that extends the natural policy gradient approach by enforcing a trust region constraint on policy updates.[6] Developed by Schulman et al. in 2015, TRPO ensures stable policy improvements by limiting the KL divergence between successive policies, addressing key challenges in natural policy gradient methods.
TRPO builds on the natural policy gradient by incorporating a trust region constraint. While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from Taylor approximations, ensuring monotonic policy improvement. This makes TRPO more robust in practice, particularly for high-dimensional policies.
Like the natural policy gradient, TRPO iteratively updates the policy parameters by solving a constrained optimization problem specified coordinate-free:
$$\max_{\theta}\; L(\theta, \theta_t) \quad \text{subject to} \quad \overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t}) \le \epsilon,$$
where the surrogate advantage is
$$L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a)\right].$$
Note that, in general, other surrogate advantages are possible:
$$L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, \Psi(s, a)\right],$$
where $\Psi$ is any linear sum of the previously mentioned terms. Indeed, OpenAI recommended using the Generalized Advantage Estimate instead of the plain advantage $A^{\pi_{\theta_t}}$.
The surrogate advantage is designed to align with the policy gradient $\nabla_\theta J(\theta)$. Specifically, when $\theta = \theta_t$, the gradient of $L$ equals the policy gradient derived from the advantage function:
$$\nabla_\theta L(\theta, \theta_t)\big|_{\theta = \theta_t} = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\big|_{\theta = \theta_t}\, A^{\pi_{\theta_t}}(s, a)\right].$$
However, when $\theta \neq \theta_t$, this is not necessarily true. Thus it is a "surrogate" of the real objective.
As with the natural policy gradient, for small policy updates, TRPO approximates the surrogate advantage and the KL divergence using Taylor expansions around $\theta_t$:
$$L(\theta, \theta_t) \approx g^{T}(\theta - \theta_t), \qquad \overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t}) \approx \tfrac{1}{2}(\theta - \theta_t)^{T} H\, (\theta - \theta_t),$$
where
$$g = \nabla_\theta L(\theta, \theta_t)\big|_{\theta = \theta_t}, \qquad H = \nabla_\theta^{2}\, \overline{D}_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\theta_t})\big|_{\theta = \theta_t}.$$
This reduces the problem to a quadratic optimization, yielding the natural policy gradient update:
$$\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^{T} H^{-1} g}}\; H^{-1} g.$$
So far, this is essentially the same as the natural gradient method. However, TRPO improves upon it by two modifications:
- it computes $H^{-1} g$ approximately by conjugate gradient, using only Hessian-vector products, instead of inverting $H$;
- it performs a backtracking line search along the direction $H^{-1} g$ (sketched below), shrinking the step until the unapproximated surrogate advantage actually improves and the KL constraint is actually satisfied.
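The backtracking line search can be sketched as follows; `surrogate` and `kl` are placeholder callables evaluating the surrogate advantage $L(\theta, \theta_t)$ and the average KL divergence on the collected batch, and the shrink factor is an illustrative choice.

```python
# Backtracking line search along the (approximate) natural-gradient direction.
def line_search(theta_t, direction, surrogate, kl, epsilon, max_backtracks=10, shrink=0.5):
    for k in range(max_backtracks):
        step = shrink ** k
        theta_new = theta_t + step * direction       # direction = sqrt(2*eps/(g^T H^-1 g)) * H^-1 g
        if surrogate(theta_new) > surrogate(theta_t) and kl(theta_new) <= epsilon:
            return theta_new                         # first step that improves L and respects the KL bound
    return theta_t                                   # no acceptable step found; keep the old parameters
```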
A further improvement is proximal policy optimization (PPO), which avoids even computing $g$, $H$, and $H^{-1} g$, via a first-order approximation using clipped probability ratios.[7]
Specifically, instead of maximizing the surrogate advantage
$$L(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a)\right]$$
under a KL divergence constraint, PPO inserts the constraint directly into the surrogate advantage:
$$L^{\mathrm{CLIP}}(\theta, \theta_t) = \mathbb{E}_{s, a \sim \pi_{\theta_t}}\!\left[\min\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a),\ \operatorname{clip}\!\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_t}}(s, a)\right)\right],$$
and PPO maximizes this surrogate advantage by stochastic gradient ascent, as usual.
In words, gradient-ascending the new surrogate advantage means that, at some state $s$, if the advantage $A^{\pi_{\theta_t}}(s, a)$ is positive, then the gradient points in the direction that increases the probability of performing action $a$ in state $s$. However, as soon as $\pi_\theta$ has changed so much that $\pi_\theta(a \mid s) > (1 + \epsilon)\, \pi_{\theta_t}(a \mid s)$, the gradient stops pointing in that direction. The case of negative advantage is similar, with the lower bound $(1 - \epsilon)\, \pi_{\theta_t}(a \mid s)$. Thus, PPO avoids pushing the parameter update too hard and avoids changing the policy too much.
To be more precise, updating $\theta_t$ to $\theta_{t+1}$ requires multiple gradient steps on the same batch of data. The algorithm initializes $\theta = \theta_t$, then repeatedly applies gradient ascent (using, for example, the Adam optimizer) to update $\theta$ until the surrogate advantage has stabilized. It then assigns $\theta_{t+1} \leftarrow \theta$ and repeats.
During this inner loop, the first update to $\theta$ does not hit the clipping bounds, but as $\theta$ moves further and further away from $\theta_t$, it eventually starts hitting them. For each state-action pair whose probability ratio hits a bound, the corresponding gradient term becomes zero, and thus PPO avoids updating $\theta$ too far away from $\theta_t$.
This is important, because the surrogate loss assumes that the state-action pairs are sampled from what the agent would see if it ran the policy $\pi_{\theta_t}$, while the policy gradient should be estimated on-policy. So, as $\theta$ changes, the surrogate loss becomes more and more off-policy. This is why keeping $\theta$ proximal to $\theta_t$ is necessary.
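A minimal sketch of the clipped objective on one batch is given below; the per-sample log-probabilities and advantages are assumed to have been computed beforehand (for example, by a neural policy and a GAE estimator), and in practice the negative of this quantity is minimized with an automatic-differentiation framework such as PyTorch or JAX.

```python
# PPO clipped surrogate objective for one batch of (state, action) samples.
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """All arguments are arrays of shape (batch,); returns the scalar L^CLIP estimate."""
    ratio = np.exp(logp_new - logp_old)                 # pi_theta(a|s) / pi_theta_t(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # terms at the bound contribute zero gradient
```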
If there is a reference policy $\pi_{\mathrm{ref}}$ that the trained policy should not diverge too far from, then an additional KL divergence penalty can be added:
$$L^{\mathrm{KL}}(\theta) = L^{\mathrm{CLIP}}(\theta, \theta_t) - \beta\, \mathbb{E}_{s \sim \pi_{\theta_t}}\!\left[D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\right)\right],$$
where $\beta$ adjusts the strength of the penalty. This has been used in training reasoning language models with reinforcement learning from human feedback.[8] The KL divergence penalty term can be estimated with lower variance using the equivalent form (see f-divergence for details):[9]
$$D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\right) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\frac{\pi_{\mathrm{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - \ln \frac{\pi_{\mathrm{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - 1\right].$$
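The estimator above can be sketched in a few lines, assuming per-action log-probabilities under the trained and reference policies are available for actions sampled from $\pi_\theta$:

```python
# Low-variance, non-negative per-sample estimate of KL(pi_theta || pi_ref).
import numpy as np

def kl_penalty_estimate(logp_theta, logp_ref):
    """logp_theta, logp_ref: arrays of shape (batch,) for actions sampled from pi_theta."""
    log_ratio = logp_ref - logp_theta                # ln(pi_ref / pi_theta)
    return np.mean(np.exp(log_ratio) - log_ratio - 1.0)
```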
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator. Instead, for each state $s$, it samples a group of actions $a_1, \dots, a_G$ from the policy $\pi_{\theta_t}$, then computes the group-relative advantage[9]
$$A(s, a_i) = \frac{r(s, a_i) - \mu}{\sigma},$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards $r(s, a_1), \dots, r(s, a_G)$. That is, it is the standard score of the rewards.
Then, it maximizes the PPO objective, averaged over all actions:
$$L^{\mathrm{GRPO}}(\theta) = \mathbb{E}_{s,\, a_{1:G} \sim \pi_{\theta_t}}\!\left[\frac{1}{G} \sum_{i=1}^{G} \min\!\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)}\, A(s, a_i),\ \operatorname{clip}\!\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A(s, a_i)\right)\right].$$
Intuitively, each policy update step in GRPO makes the policy more likely to respond to a state with an action that performed relatively better than the other actions tried at that state, and less likely to respond with one that performed relatively worse.
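A minimal sketch of the group-relative advantage computation follows; the small constant added to the standard deviation is a safeguard against division by zero and is an assumption of the example, not part of the definition.

```python
# Group-relative advantages: standardize the rewards of the G actions sampled for one state.
import numpy as np

def group_relative_advantages(rewards):
    """rewards: array of shape (G,), the rewards of G actions sampled for one state."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-8)           # standard score of each reward
```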
As before, the KL penalty term can be applied to encourage the trained policy to stay close to a reference policy. GRPO was first proposed in the context of training reasoning language models by researchers at DeepSeek.[9]