Proximal Policy Optimization

Last updated

Proximal Policy Optimization (PPO) is an algorithm in the field of reinforcement learning that trains a computer agent's decision function to accomplish difficult tasks. PPO was developed by John Schulman in 2017, [1] and had become the default reinforcement learning algorithm at American artificial intelligence company OpenAI. [2] In 2018 PPO had received a wide variety of successes, such as controlling a robotic arm, beating professional players at Dota 2, and excelling in Atari games. [3] Many experts called PPO the state of the art because it seems to strike a balance between performance and comprehension.[ citation needed ] Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency. [4]

Contents

PPO is classified as a policy gradient method for training an agent’s policy network. The policy network is the function that the agent uses to make decisions. Essentially, to train the right policy network, PPO takes a small policy update (step size), so the agent can reliably reach the optimal solution. A too-big step may direct policy in the false direction, thus having little possibility of recovery; a too-small step lowers overall efficiency. Consequently, PPO implements a clip function that constrains the policy update of an agent from being too large or too small. [4]

Development

Reinforcement learning (RL), to which PPO belongs, has roots in psychology and neuroscience. Compared with other fields of machine learning, Reinforcement learning closely mimics the kind of learning that humans and other animals do. Many of the core algorithms, including PPO, were originally inspired by biological learning systems, like psychologist Edward Thorndike's learning by trial and error (1913). [5] [6]

In 2015, John Schulman introduced Trust Region Policy Optimization (TRPO) as an earlier version of PPO. TRPO addressed the instability issue found in the previous algorithm, deep q-network (DQN), by using the trust region constraint to regulate the KL divergence between the old and new policy. However, TRPO is computationally complicated and inefficient due to its second-order optimization, leading to expensive and difficult implementation for large-scale problems. [7] [8]

In 2017, John Schulman solved the complexity issue of TRPO by adopting first-order optimization in PPO. Schulman and his teams designed a clipping mechanism that forbids the new policy from deviating significantly from the old one when the likelihood ratio between them is out of clipping range. [1] [8] In other words, PPO modifies TRPO’s objective function with a punishment of too-large policy updates. Also, PPO deletes the complicated trust region constraints and utilizes the clipping function instead. As a result, PPO improves performance and implementation based on the framework of TRPO.

Theory

This section will first explore key components of the core algorithm in PPO, and then delve deep into the Main objective function in PPO.

Basic Concepts

To begin the PPO's training process, the agent is placed in an environment to perform actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the result. Later, with a certain amount of datasets and policy updates, the agent will select an action to take by randomly sampling from the probability distribution P(A|S) generated by the policy network. [9] The actions that are most likely to be beneficial will have the highest probability of being selected from the random sample. After an agent arrives at a different scenario known as a State by acting, it is rewarded with a positive reward or a negative reward. The objective of an agent is to maximize its total rewards across a series of States, also referred as episodes. Scientists reinforce the agent to learn to perform the best actions by experience, and this decision function is called Policy. [10]

Policy Gradient Laws: Advantage Function A

As PPO's essential part, the advantage function tries to answer the question of whether a specific action of the agent is better than the other possible action in a given state or worse than the other action. By definition, the advantage function is an estimate of the relative value for a selected action. The positive output of the advantage function means that the chosen action is better than the average return, so the possibilities of that specific action will increase, and vice versa. [8]

Advantage function calculation: A = discounted sum (Q) - baseline estimate (V). The first part, the discounted sum, is the total weighted reward for the completion of a current episode. More weight will be given to a specific action that brings easy and quick rewards. On the other hand, less weight will be credited to actions that need significant effort but offer disproportionate rewards. [11] [8] Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an unsupervised learning problem. The second part, the baseline estimate, is the value function that outputs the expected discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variances) because it utilizes a neural network. With the two parts computed, the advantage function is calculated by subtracting the baseline estimate from the actual return (discounted sum). [12] A > 0 signifies how much better the actual return of the action is based on the expected return from experience; A < 0 implies how bad the actual return is based on the expected return.

Ratio Function

In PPO, the ratio function calculates the probability of taking action a at state s in the current policy network divided by the previous old version of policy.

In this function, rt(θ) denotes the probability ratio between the current and old policy:

This ratio function can easily estimate the divergence between old and current policies. [13] [4]

PPO's Objective Function

The central objective function of PPO takes the expectation operator (denoted as E) which means that this function will be computed over quantities of trajectories. The expectation operator takes the minimum of two terms:

1, R-theta * Advantage Function: this is the product of the ratio function and the advantage function that was introduced in TRPO, also known as normal policy gradient objective. [14]

2. Clipped (R-theta) * Advantage Function: The policy ratio is first clipped between 1- epsilon and 1 + epsilon; generally, epsilon is defined to be 0.20. Then, multiply the clipped version by the advantage.

The fundamental intuition behind PPO is the same as TRPO: conservatism. Clipping applies to make the "advantage estimate" of the new policy conservative. The reasoning behind conservatism is that if agents make significant changes due to high advantage estimates, the policy update will be large and unstable, and may "fall off the cliff" (little possibility of recovery). [15] There are two common applications of the clipping function. When an action under a new policy happens to be a really good action based on the advantage function, the clipping function limits how much credit can be given to a new policy for up-weighted good actions. On the other hand, when an action under the old policy is judged to be a bad action, the clipping function constrains how much the agent can cut the new policy slack for down-weighted bad actions. [16] Consequently, the clipping mechanism is designed to discourage the incentive of moving beyond the defined range by clipping both directions. The advantage of this method is that it can be optimized directly with gradient descent, as opposed to TRPO's strict KL divergence constraint, which makes the implementation faster and cleaner.

After computing the clipped surrogate objective function, the program has two probability ratios: one non-clipped and one clipped; then, by taking the minimum of the two objectives, the final objective becomes a lower bound (pessimistic bound) of what an agent knows is possible. [16] In other words, the minimum method makes sure that the agent is doing the safest possible update.

Advantages

Simplicity

PPO approximates what TRPO did without doing too much computation. It uses first-order optimization (clip function) to constrain the policy update, while TRPO uses KL divergence constraints outside the objective function (second-order optimization). Compared with the TRPO, the PPO method is relatively easy to implement and takes less computation time. Therefore, it is cheaper and more efficient to use PPO in large-scale problems. [17]

Stability

While other reinforcement learning algorithms require hyperparameter tuning, PPO does not necessarily require hyperparameter tuning (0.2 for epsilon can be used in most cases). [18] Also, PPO does not require sophisticated optimization techniques. It can be easily practiced with standard deep learning frameworks and generalized to a broad range of tasks.

Sample efficiency

Sample efficiency indicates whether the algorithms need more or less amount of data to train a good policy. On-policy algorithms, including PPO and TRPO, generally have a low level of sample efficiency. [19] However, PPO achieved sample efficiency because of its usage of surrogate objectives. The surrogate objectives enable PPO to avoid the new policy changing too far from the old policy; the clip function regularizes the policy update and reuses training data. Sample efficiency is especially useful for complicated and high-dimensional tasks, where data collection and computation can be costly. [20]

See also

Related Research Articles

Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov as they are an extension of Markov chains.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate.

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.

<span class="mw-page-title-main">Intelligent agent</span> Software agent which acts autonomously

In intelligence and artificial intelligence, an intelligent agent (IA) is an agent acting in an intelligent manner; It perceives its environment, takes actions autonomously in order to achieve goals, and may improve its performance with learning or acquiring knowledge. An intelligent agent may be simple or complex: A thermostat or other control system is considered an example of an intelligent agent, as is a human being, as is any system that meets the definition, such as a firm, a state, or a biome.

Meta learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments. As of 2017, the term had not found a standard interpretation, however the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term learning to learn.

Rprop, short for resilient backpropagation, is a learning heuristic for supervised learning in feedforward artificial neural networks. This is a first-order optimization algorithm. This algorithm was created by Martin Riedmiller and Heinrich Braun in 1992.

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote.

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.

In machine learning, a hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.

<span class="mw-page-title-main">Proximal gradient method</span>

Proximal gradient methods are a generalized form of projection used to solve non-differentiable convex optimization problems.

Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards humans' intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances its intended objectives. A misaligned AI system pursues some objectives, but not the intended ones.

The following outline is provided as an overview of and topical guide to machine learning:

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

In reinforcement learning (RL), a model-free algorithm is an algorithm which does not estimate the transition probability distribution associated with the Markov decision process (MDP), which, in RL, represents the problem to be solved. The transition probability distribution and the reward function are often collectively called the "model" of the environment, hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. Typical examples of model-free algorithms include Monte Carlo RL, Sarsa, and Q-learning.

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

Self-play is a technique for improving the performance of reinforcement learning agents. Intuitively, agents learn to improve their performance by playing "against themselves".

In machine learning, reinforcement learning from human feedback (RLHF), also called reinforcement learning from human preferences, is a technique to align an AI agent to human preferences. In classical reinforcement learning, such an agent learns a policy that maximizes a reward function that measures how well it performed its task. However, it is difficult to explicitly define such a reward function that approximates human preferences. Therefore, RLHF seeks to train a "reward model" directly from human feedback. This model can then function as a reward function to optimize an agent's policy through an optimization algorithm like Proximal Policy Optimization. The reward model is trained in advance to the policy being optimized to predict if a given output is good or bad. RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.

References

  1. 1 2 J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv.org, https://arxiv.org/abs/1707.06347 , arXiv:1707.06347 [cs.LG].
  2. OpenAI, "Proximal Policy Optimization" Available at: https://openai.com/research/openai-baselines-ppo (Nov.1 2023 retrieved).
  3. Arxiv Insights. "An introduction to Policy Gradient methods," YouTube, Oct 1st, 2018 [Video file]. Available: https://www.youtube.com/watch?v=5P7I-xPq8u8.
  4. 1 2 3 T. Simonini, “Proximal Policy Optimization (PPO),” Hugging Face – The AI community building the future., https://huggingface.co/blog/deep-rl-ppo .
  5. R. Sutton and A. Barto, Reinforcement learning: An introduction, https://beiyulincs.github.io/teach/spring_21/behavior_modeling/reading/rl_reading.pdf (accessed Nov. 6, 2023).
  6. C. Mahoney, “Reinforcement Learning: A review of historic, modern, and historic developments ... | towards Data Science,” Medium, Mar. 30, 2022. [Online]. Available: https://towardsdatascience.com/reinforcement-learning-fda8ff535bb6#5554
  7. Wang, Y., He, H., Wen, C., & Tan, X. (2019). Truly Proximal Policy Optimization. ArXiv. /abs/1903.07940
  8. 1 2 3 4 Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ArXiv. /abs/1502.05477
  9. “A Beginner’s Guide to deep Reinforcement learning,” Pathmind. https://wiki.pathmind.com/deep-reinforcement-learning#reward
  10. Q. T. Luu, “Q-learning vs. deep Q-learning vs. Deep Q-Network,” Baeldung on Computer Science, https://www.baeldung.com/cs/q-learning-vs-deep-q-learning-vs-deep-q-network (accessed Nov. 1, 2023).
  11. OpenAI, “Part 1: Key concepts in RL¶,” Part 1: Key Concepts in RL - Spinning Up documentation, https://spinningup.openai.com/en/latest/spinningup/rl_intro.html (accessed Nov. 4, 2023).
  12. Rohitkumar, “PPO (Proximal Policy Optimization) explained with code examples in Pytorch and tensorflow,” PlainSwipe, https://plainswipe.com/ppo-proximal-policy-optimization-explained-with-code-examples-in-pytorch-and-tensorflow/ (accessed Nov. 5, 2023).
  13. W. Heeswijk, “Proximal Policy Optimization (PPO) explained,” Medium, https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b (accessed Nov. 4, 2023).
  14. Edan Meyer. "Proximal Policy Optimization Explained," YouTube, May 20th, 2021 [Video file]. Available: https://www.youtube.com/watch?v=HrapVFNBN64 (Accessed Nov 4, 2023).
  15. C. M. Wild. The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods. 2018. URL: https://towardsdatascience.com/the-pursuit-of-robotic-happiness-how-trpoand-ppo-stabilize-policy-gradient-methods-545784094e3b (visited on 05/11/2023).
  16. 1 2 Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B.,(2023). Secrets of RLHF in Large Language Models Part I: PPO. ArXiv. /abs/2307.04964
  17. J. Nocedal and Y. Nesterov., “Natural, trust region and proximal policy optimization,” TransferLab, https://transferlab.ai/blog/trpo-and-ppo/ (accessed Nov. 5, 2023).
  18. J. Hui, “RL - reinforcement learning algorithms comparison,” Medium, https://jonathan-hui.medium.com/rl-reinforcement-learning-algorithms-comparison-76df90f180cf (accessed Nov. 4, 2023).
  19. Huang, Shengyi, and Dossa, “The 37 implementation details of proximal policy optimization,” The 37 Implementation Details of Proximal Policy Optimization · The ICLR Blog Track, https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ (accessed Nov. 5, 2023).
  20. XiaoYang-ElegantRL, “ElegantRL: Mastering PPO Algorithms - towards Data Science,” Medium, Nov. 23, 2022. [Online]. Available: https://towardsdatascience.com/elegantrl-mastering-the-ppo-algorithm-part-i-9f36bc47b791