Proximal policy optimization

Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent's decision function to accomplish difficult tasks. PPO was developed by John Schulman in 2017, [1] and became the default RL algorithm at the US artificial intelligence company OpenAI. [2] Since 2018, PPO has seen success in a wide variety of applications, such as controlling a robotic arm, beating professional players at Dota 2, and excelling in Atari games. [3] Many experts have called PPO the state of the art, owing to its balance between performance and comprehensibility. Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency. [4]

PPO is classified as a policy gradient method for training an agent's policy network. This network approximates a policy function, which the agent uses to make decisions. Essentially, to train the policy function, PPO takes small policy update steps, so that the agent can reliably converge to an optimal solution. A step size that is too large may push the policy in a suboptimal direction with little possibility of recovery; a step size that is too small lowers overall training efficiency. Consequently, PPO implements a clip function that keeps each policy update from being too large, so that larger step sizes can be used without destabilizing the gradient ascent process. [4]

Development

In 2015, John Schulman introduced Trust Region Policy Optimization (TRPO), a predecessor of PPO. TRPO addressed the instability of another algorithm, the Deep Q-Network (DQN), by using a trust region constraint to regulate the KL divergence between the old and new policies. However, TRPO is computationally complex and inefficient because it relies on second-order optimization, which makes it difficult and expensive to apply to large-scale problems. [5] [6]

In 2017, John Schulman resolved the complexity issue of TRPO by adopting first-order optimization in PPO. Schulman and his team designed a clipping mechanism that stops the new policy from deviating significantly from the old one whenever the likelihood ratio between them exceeds a chosen clip threshold. [1] [6] In other words, PPO modifies the objective function of TRPO by penalizing policy updates that are too large. PPO also dispenses with the complicated trust region constraint, using the clipping function instead. As a result, PPO retains the framework of TRPO while improving performance and simplifying implementation.

Theory

This section first explores the key components of PPO, then delves deep into the main objective function.

Basic concepts

To begin the PPO training process, the agent is placed in an environment and performs actions based on its current observations. In the early phase of training, the agent is free to explore and record the outcomes of its actions. Later, once a certain number of transition samples and policy updates have accumulated, the agent selects an action by sampling from the probability distribution produced by the policy network. [7] Actions that are most likely to be beneficial have the highest probability of being sampled. After the agent arrives at a different scenario (a new state) by acting, it receives a positive or negative reward. The agent's objective is to maximize the cumulative reward across sequences of states, known as episodes.
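
As an illustration, the sketch below shows how an action could be sampled from the probability distribution produced by a policy network, using PyTorch. The network architecture, observation dimension, and action count are illustrative assumptions, not part of the original description.

```python
import torch

# Assumed toy dimensions: 4-dimensional observations, 2 discrete actions.
obs_dim, n_actions = 4, 2

# A small policy network mapping an observation to action logits (assumed architecture).
policy_net = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, n_actions),
)

state = torch.randn(obs_dim)              # placeholder for an observation from the environment
dist = torch.distributions.Categorical(logits=policy_net(state))

action = dist.sample()                    # actions with higher probability are sampled more often
log_prob = dist.log_prob(action)          # stored for the later ratio computation
```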

Policy gradient laws: the advantage function

The advantage function (denoted as $A$) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value of a selected action. If the output of this function is positive, the action in question is better than the average return, so the probability of selecting that specific action is increased. The opposite holds for a negative advantage output. [6]

The advantage function can be defined as $A = Q - V$, where $Q$ is the discounted sum of rewards (the total weighted reward for the completion of an episode) and $V$ is the baseline estimate. [8] [6] Since the advantage can only be calculated after the completion of an episode, the program records the episode's outcome; fitting the baseline to these recorded returns is therefore essentially a supervised learning problem. The baseline estimate comes from the value function, which outputs the expected discounted return of an episode starting from the current state. In the PPO algorithm, the baseline estimate is noisy (it has some variance), because it is also produced by a neural network, like the policy function itself. With $Q$ and $V$ computed, the advantage is obtained by subtracting the baseline estimate from the actual discounted return. [9] If $A > 0$, the actual return of the action is better than the expected return from experience; if $A < 0$, the actual return is worse.
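
A brief sketch of this computation for a single finished episode, in PyTorch. The discount factor, rewards, and value estimates are placeholder assumptions; the PPO paper itself uses generalized advantage estimation, a more elaborate variant of the same idea.

```python
import torch

gamma = 0.99                                   # assumed discount factor

def discounted_returns(rewards, gamma):
    """Discounted sum of rewards (Q) for every step of one recorded episode."""
    returns = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Placeholder data recorded after the episode finished.
rewards = torch.tensor([1.0, 0.0, -1.0, 2.0])
values  = torch.tensor([0.8, 0.5, 0.3, 1.0])   # baseline estimates V from the value network

advantages = discounted_returns(rewards, gamma) - values   # A > 0: better than expected; A < 0: worse
```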

Ratio function

In PPO, the ratio function $r_t(\theta)$ is the probability of selecting action $a_t$ in state $s_t$ under the current policy network, divided by the probability of the same action under the previous (old) policy. In other words:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Hence, this ratio function can easily estimate the divergence between the old and current policies. [10] [4]
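
In practice the ratio is typically computed from log-probabilities for numerical stability. A minimal sketch, assuming the old log-probabilities were stored when the transitions were collected:

```python
import torch

# Assumed stored quantities for a small batch of transitions.
old_log_probs = torch.tensor([-0.7, -1.2, -0.3])   # log pi_old(a_t | s_t), recorded at collection time
new_log_probs = torch.tensor([-0.6, -1.4, -0.2])   # log pi_theta(a_t | s_t), from the current policy

ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(theta) = pi_theta(a_t | s_t) / pi_old(a_t | s_t)
```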

PPO objective function

The objective function of PPO takes the expectation operator (denoted as $\mathbb{E}$), which means that the objective is averaged over batches of sampled trajectories. The expectation operator takes the minimum of two terms:

  1. $r_t(\theta)\,\hat{A}_t$: the product of the ratio function and the advantage function introduced in TRPO, also known as the normal policy gradient objective. [11]
  2. $\operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t$: the policy ratio is first clipped to the range $[1-\epsilon,\,1+\epsilon]$; generally, $\epsilon$ is defined to be 0.2. Then, as before, it is multiplied by the advantage.

Putting the two terms together yields the clipped surrogate objective, where $\hat{A}_t$ denotes the estimated advantage at timestep $t$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

The fundamental intuition behind PPO is the same as that of TRPO: conservatism. Clipping results in a conservative advantage estimate for the new policy. The reasoning is that if an agent makes large changes on the basis of high advantage estimates, the policy update will be large and unstable, and the policy may diverge from the optimal policy with little possibility of recovery. [12] The clipping function acts in two common situations: when an action under the new policy happens to be a good choice according to the advantage function, clipping limits how much credit the new policy can receive for up-weighting that good action; conversely, when an action under the old policy is judged to be bad, clipping limits how much the objective rewards the new policy for down-weighting that bad action. [13] Consequently, the clipping mechanism removes the incentive to push the probability ratio beyond the defined range, in both directions. The advantage of this method is that the objective can be optimized directly with first-order gradient methods, as opposed to enforcing the strict KL divergence constraint of TRPO, making the implementation faster and more intuitive.

When evaluating the clipped surrogate objective, the agent therefore works with two versions of the probability ratio: one non-clipped and one clipped. By taking the minimum of the two resulting terms, the final objective becomes a lower (pessimistic) bound on what the agent estimates is achievable. [13] In other words, the minimum ensures that the agent makes the safest possible update.
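
The clipped surrogate objective can be written compactly in code. The sketch below is one possible PyTorch rendering, assuming the advantages and log-probabilities come from a collected batch; the objective is negated so that a standard gradient-descent optimizer effectively performs gradient ascent on it.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages                                        # normal policy gradient term
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the element-wise minimum gives the pessimistic (lower-bound) estimate.
    return -torch.min(unclipped, clipped).mean()                          # negated for gradient descent
```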

Advantages

Simplicity

PPO approximates what TRPO does, with considerably less computation. It uses first-order optimization (the clip function) to constrain the policy update, whereas TRPO uses a KL divergence constraint that requires second-order optimization. Compared to TRPO, PPO is relatively easy to implement and requires fewer computational resources and less time. Therefore, it is cheaper and more efficient to use PPO on large-scale problems. [14]

Stability

While other RL algorithms require extensive hyperparameter tuning, PPO comparatively requires little (for example, epsilon = 0.2 works in most cases). [15] PPO also does not require sophisticated optimization techniques; it can be implemented easily with standard deep learning frameworks and generalizes to a broad range of tasks.

Sample efficiency

Sample efficiency indicates how much data an algorithm needs to train a good policy. On-policy algorithms such as PPO and TRPO can only learn from data collected under the current policy, which typically makes them less sample efficient than off-policy alternatives. However, PPO recovers much of this efficiency through its surrogate objective: because the clipped surrogate keeps the new policy from moving too far from the old one, the same batch of training data can safely be reused for several update epochs. [16] Sample efficiency is especially useful for complicated and high-dimensional tasks, where data collection and computation can be costly. [17]
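
As a rough, self-contained sketch of this data reuse (batch contents, network size, number of epochs, and optimizer settings are all assumptions for illustration), the same collected batch is optimized for several epochs under the clipped objective:

```python
import torch

# Placeholder batch of 8 transitions: 4-dimensional observations, 2 discrete actions.
states        = torch.randn(8, 4)
actions       = torch.randint(0, 2, (8,))
old_log_probs = torch.randn(8)
advantages    = torch.randn(8)

policy_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

def ppo_clip_loss(new_lp, old_lp, adv, epsilon=0.2):
    # Clipped surrogate objective, negated for gradient descent.
    ratio = torch.exp(new_lp - old_lp)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * adv).mean()

for epoch in range(4):   # the same rollout data is reused for several update epochs
    dist = torch.distributions.Categorical(logits=policy_net(states))
    loss = ppo_clip_loss(dist.log_prob(actions), old_log_probs, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```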

See also

  - Reinforcement learning
  - Deep reinforcement learning
  - Model-free reinforcement learning
  - Reinforcement learning from human feedback
  - Markov decision process
  - Loss function
  - Cross-entropy method
  - Covariance matrix adaptation evolution strategy
  - Natural evolution strategy
  - Stochastic approximation
  - Stochastic variance reduction
  - Stochastic gradient Langevin dynamics
  - Online machine learning
  - Thompson sampling
  - Proximal gradient method
  - Mirror descent
  - Simulation-based optimization
  - Probabilistic numerics
  - Learning curve (machine learning)
  - Reparameterization trick

References

  1. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347 [cs.LG], 2017. Available: https://arxiv.org/abs/1707.06347
  2. OpenAI, "Proximal Policy Optimization." Available: https://openai.com/research/openai-baselines-ppo
  3. Arxiv Insights, "An introduction to Policy Gradient methods," YouTube, Oct. 1, 2018 [Video file]. Available: https://www.youtube.com/watch?v=5P7I-xPq8u8
  4. T. Simonini, "Proximal Policy Optimization (PPO)," Hugging Face Blog. Available: https://huggingface.co/blog/deep-rl-ppo
  5. Y. Wang, H. He, C. Wen, and X. Tan, "Truly Proximal Policy Optimization," 2019, arXiv:1903.07940.
  6. J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, "Trust Region Policy Optimization," 2015, arXiv:1502.05477.
  7. "A Beginner's Guide to Deep Reinforcement Learning," Pathmind. Available: https://wiki.pathmind.com/deep-reinforcement-learning#reward
  8. OpenAI, "Part 1: Key Concepts in RL," Spinning Up documentation. Available: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
  9. Rohitkumar, "PPO (Proximal Policy Optimization) explained with code examples in PyTorch and TensorFlow," PlainSwipe. Available: https://plainswipe.com/ppo-proximal-policy-optimization-explained-with-code-examples-in-pytorch-and-tensorflow/
  10. W. Heeswijk, "Proximal Policy Optimization (PPO) explained," Towards Data Science (Medium). Available: https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b
  11. Edan Meyer, "Proximal Policy Optimization Explained," YouTube, May 20, 2021 [Video file]. Available: https://www.youtube.com/watch?v=HrapVFNBN64
  12. C. M. Wild, "The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods," towardsdatascience.com, Jul. 9, 2018.
  13. R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, et al., "Secrets of RLHF in Large Language Models Part I: PPO," 2023, arXiv:2307.04964.
  14. J. Nocedal and Y. Nesterov, "Natural, trust region and proximal policy optimization," TransferLab. Available: https://transferlab.ai/blog/trpo-and-ppo/
  15. J. Hui, "RL - reinforcement learning algorithms comparison," Medium. Available: https://jonathan-hui.medium.com/rl-reinforcement-learning-algorithms-comparison-76df90f180cf/
  16. Huang, Shengyi, and Dossa, "The 37 implementation details of proximal policy optimization," The ICLR Blog Track, Mar. 25, 2022. Available: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
  17. XiaoYang-ElegantRL, "ElegantRL: Mastering PPO Algorithms," Towards Data Science (Medium), Nov. 23, 2022. Available: https://towardsdatascience.com/elegantrl-mastering-the-ppo-algorithm-part-i-9f36bc47b791