Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning.

In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a policy. This function is iteratively updated to maximize rewards based on the agent's task performance. [1] However, explicitly defining a reward function that accurately approximates human preferences is challenging. Therefore, RLHF seeks to train a "reward model" directly from human feedback. [2] The reward model is first trained in a supervised manner to predict if a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model then serves as a reward function to improve an agent's policy through an optimization algorithm like proximal policy optimization. [3] [4] [5]

RLHF has applications in various domains in machine learning, including natural language processing tasks such as text summarization and conversational agents, computer vision tasks like text-to-image models, and the development of video game bots. While RLHF is an effective method of training models to act better in accordance with human preferences, it also faces challenges due to the way the human preference data is collected. Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.

[Figure: High-level overview of reinforcement learning from human feedback]

Background and motivation

Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge. [6] For example, one may want to train a model to generate safe text that is both helpful and harmless (such as lacking bias, toxicity, or otherwise harmful content). Asking humans to manually create examples of harmless and harmful text would be difficult and time-consuming. However, humans are adept at swiftly assessing and comparing the harmfulness of different AI-generated text. Therefore, a more practical objective would be to allow the model to use this type of human feedback to improve its text generation. [7]

Despite the clear benefits of incorporating human feedback in training models, prior efforts—including some that leverage reinforcement learning—have encountered significant challenges. Most attempts were either narrow and difficult to generalize, breaking down on more complex tasks, [8] [9] [10] [11] or they faced difficulties learning from sparse (lacking specific information and relating to large amounts of text at a time) or noisy (inconsistently rewarding similar outputs) reward functions. [12] [13]

RLHF was not the first successful method of using human feedback for reinforcement learning, but it is one of the most widely used. The foundation for RLHF was introduced as an attempt to create a general algorithm for learning from a practical amount of human feedback. [6] [3] The algorithm as used today was introduced by OpenAI in a paper on enhancing text continuation or summarization based on human feedback, and it began to gain popularity when the same method was reused in their paper on InstructGPT. [2] [14] [15] RLHF has also been shown to improve the robustness of RL agents and their capacity for exploration, which results in an optimization process more adept at handling uncertainty and efficiently exploring its environment in search of the highest reward. [16]

Collecting human feedback

Human feedback is commonly collected by prompting humans to rank instances of the agent's behavior. [15] [17] [18] These rankings can then be used to score outputs, for example, using the Elo rating system, which is an algorithm for calculating the relative skill levels of players in a game based only on the outcome of each game. [3] While ranking outputs is the most widely adopted form of feedback, recent research has explored other forms, such as numerical feedback, natural language feedback, and prompting for direct edits to the model's output. [19]
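
As an illustration of how ranked comparisons can be converted into scores, the following sketch applies a standard Elo update to pairwise preference judgments. It is a toy example, not drawn from any cited implementation; the output labels and the K-factor are arbitrary choices.

# Toy sketch: score model outputs from pairwise human preferences using Elo ratings.
from collections import defaultdict

def elo_update(ratings, winner, loser, k=32.0):
    """Update Elo ratings after a human prefers `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected_win)   # preferred output gains rating
    ratings[loser] -= k * (1.0 - expected_win)    # rejected output loses the same amount

# Each tuple means "the annotator preferred the first output over the second".
preferences = [("output_a", "output_b"), ("output_a", "output_c"), ("output_b", "output_c")]

ratings = defaultdict(lambda: 1000.0)             # every output starts at the same rating
for preferred, rejected in preferences:
    elo_update(ratings, preferred, rejected)

print(dict(ratings))                              # relative scores usable as reward targets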

One initial motivation of RLHF was that it requires relatively small amounts of comparison data to be effective. [6] It has been shown that a small amount of data can lead to comparable results to a larger amount. In addition, increasing the amount of data tends to be less effective than proportionally increasing the size of the reward model. [14] Nevertheless, a larger and more diverse amount of data can be crucial for tasks where it is important to avoid bias from a partially representative group of annotators. [15]

When learning from human feedback through pairwise comparison under the Bradley–Terry–Luce model (or the Plackett–Luce model for K-wise comparisons among more than two options), the maximum likelihood estimator (MLE) for linear reward functions has been shown to converge if the comparison data is generated under a well-specified linear model. This implies that, under certain conditions, if a model is trained to decide which choices people would prefer between pairs (or groups) of choices, it will necessarily improve at predicting future preferences. This improvement is expected as long as the comparisons it learns from are based on a consistent and simple rule. [20] [21]
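
To make the Bradley–Terry–Luce setup concrete, the following toy sketch fits a linear reward $r(x) = w \cdot \phi(x)$ by maximum likelihood from simulated pairwise comparisons; the features, weights, and comparison data are fabricated for illustration and do not come from the cited analyses.

# Toy sketch: maximum likelihood estimation of a linear Bradley-Terry reward model.
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(6, 3))           # feature vectors phi(x) for 6 candidate outputs
w_true = np.array([1.0, -2.0, 0.5])     # "true" preference weights (unknown in practice)

# Simulate pairwise comparisons (winner, loser) under the true Bradley-Terry model.
comparisons = []
for i in range(6):
    for j in range(6):
        if i == j:
            continue
        p_i_wins = 1.0 / (1.0 + np.exp(-(phi[i] - phi[j]) @ w_true))
        comparisons.append((i, j) if rng.random() < p_i_wins else (j, i))

def nll_gradient(w):
    """Gradient of the Bradley-Terry negative log-likelihood for a linear reward."""
    grad = np.zeros_like(w)
    for winner, loser in comparisons:
        diff = phi[winner] - phi[loser]
        p_win = 1.0 / (1.0 + np.exp(-diff @ w))
        grad -= (1.0 - p_win) * diff     # derivative of -log sigma(w . diff)
    return grad

w = np.zeros(3)
for _ in range(500):                     # plain gradient descent on the negative log-likelihood
    w -= 0.01 * nll_gradient(w)

print(w)                                 # approximately recovers w_true given enough comparisons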

Both offline data collection models, where the model learns by interacting with a static dataset and updates its policy in batches, and online data collection models, where the model directly interacts with a dynamic environment and updates its policy immediately, have been studied mathematically, with sample complexity bounds proven for RLHF under different feedback models. [20] [22]

In the offline data collection model, when the objective is policy training, a pessimistic MLE that incorporates a lower confidence bound as the reward estimate is most effective. Moreover, when applicable, it has been shown that considering K-wise comparisons directly is asymptotically more efficient than converting them into pairwise comparisons for prediction purposes. [22] [23] [15]

In the online scenario, when human feedback is collected through pairwise comparisons under the Bradley–Terry–Luce model and the objective is to minimize the algorithm's regret (the difference in performance compared to an optimal agent), it has been shown that an optimistic MLE that incorporates an upper confidence bound as the reward estimate can be used to design sample efficient algorithms (meaning that they require relatively little training data). A key challenge in RLHF when learning from pairwise (or dueling) comparisons is associated with the non-Markovian nature of its optimal policies. Unlike simpler scenarios where the optimal strategy does not require memory of past actions, in RLHF, the best course of action often depends on previous events and decisions, making the strategy inherently memory-dependent. [21]

Applications

RLHF has been applied to various domains of natural language processing (NLP), such as conversational agents, text summarization, and natural language understanding. [24] [14] Ordinary reinforcement learning, in which agents learn from their actions based on a predefined "reward function", is difficult to apply to NLP tasks because the rewards tend to be difficult to define or measure, especially when dealing with complex tasks that involve human values or preferences. [6] RLHF can steer NLP models, in particular language models, to provide answers that align with human preferences with regard to such tasks by capturing their preferences beforehand in the reward model. This results in a model capable of generating more relevant responses and rejecting inappropriate or irrelevant queries. [15] [25] Some notable examples of RLHF-trained language models are OpenAI's ChatGPT (and its predecessor InstructGPT), [17] [26] [27] DeepMind's Sparrow, [28] [29] [30] Google's Gemini, [31] and Anthropic's Claude. [32]

In computer vision, RLHF has also been used to align text-to-image models. Studies that successfully used RLHF for this goal have noted that the use of KL regularization in RLHF, which aims to prevent the learned policy from straying too far from the unaligned model, helped to stabilize the training process by reducing overfitting to the reward model. The final image outputs from models trained with KL regularization were noted to be of significantly higher quality than those trained without. [33] [34] Other methods tried to incorporate the feedback through more direct training—based on maximizing the reward without the use of reinforcement learning—but conceded that an RLHF-based approach would likely perform better due to the online sample generation used in RLHF during updates as well as the aforementioned KL regularization over the prior model, which mitigates overfitting to the reward function. [35]

RLHF was initially applied to other areas, such as the development of video game bots and tasks in simulated robotics. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences. In classical RL-based training of such bots, the reward function is simply correlated to how well the agent is performing in the game, usually using metrics like the in-game score. In comparison, in RLHF, a human is periodically presented with two clips of the agent's behavior in the game and must decide which one looks better. This approach can teach agents to perform at a competitive level without ever having access to their score. In fact, it was shown that RLHF can sometimes lead to superior performance over RL with score metrics because the human's preferences can contain more useful information than performance-based metrics. [6] [36] The agents achieved strong performance in many of the environments tested, often surpassing human performance. [37]

Training

In RLHF, two different models are trained: a reward model and a reinforcement learning (RL) policy. The reward model learns to determine what behavior is desirable based on human feedback, while the policy is guided by the reward model to determine the agent's actions. Both models are commonly initialized using a pre-trained autoregressive language model. This model is then customarily trained in a supervised manner on a relatively small dataset of pairs of prompts to an assistant and their accompanying responses, written by human annotators.

Reward model

The reward model is usually initialized with a pre-trained model, as this initializes it with an understanding of language and focuses training explicitly on learning human preferences. In addition to being used to initialize the reward model and the RL policy, the model is then also used to sample data to be compared by annotators. [15] [14]

The reward model is then trained by replacing the final layer of the previous model with a randomly initialized regression head. This change shifts the model from its original classification task over its vocabulary to simply outputting a number corresponding to the score of any given prompt and response. This model is trained on the human preference comparison data collected earlier from the supervised model. In particular, it is trained to minimize the following cross-entropy loss function:

$$\mathcal{L}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

where $K$ is the number of responses the labelers ranked, $r_\theta(x, y)$ is the output of the reward model for prompt $x$ and completion $y$, $y_w$ is the preferred completion over $y_l$, $\sigma$ denotes the sigmoid function, and $\mathbb{E}$ denotes the expected value. [15] This can be thought of as a form of logistic regression, where the model predicts the probability that a response $y_w$ is preferred over $y_l$.

This loss function essentially measures the difference between the reward model's predictions and the decisions made by humans. The goal is to make the model's guesses as close as possible to the humans' preferences by minimizing the difference measured by this equation. For example, in the case of only pairwise comparisons, $K = 2$, so the normalizing factor $\frac{1}{\binom{K}{2}}$ equals 1. [14] In general, all comparisons from each prompt are used for training as a single batch. [15]

After training, the outputs of the model are normalized, such that the reference completions have a mean score of 0. That is, a constant offset is subtracted so that $\mathbb{E}_{x}\!\left[r_\theta\big(x, y_{\text{ref}}(x)\big)\right] = 0$, where $y_{\text{ref}}(x)$ denotes the reference completion for each prompt $x$ in the dataset. [14]
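
The pairwise reward-model loss above can be written compactly in code. The following is a minimal PyTorch-style sketch, assuming a hypothetical reward_model that maps a batch of (prompt, completion) pairs to scalar scores; it illustrates the loss rather than reproducing the cited implementations.

# Minimal sketch of the pairwise reward-model loss.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, preferred, rejected):
    """Logistic loss pushing r(prompt, preferred) above r(prompt, rejected)."""
    r_preferred = reward_model(prompts, preferred)   # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)     # shape: (batch,)
    # -log sigma(r_w - r_l), averaged over the batch of comparisons
    return -F.logsigmoid(r_preferred - r_rejected).mean()

After training, a constant offset can be subtracted from the model's outputs so that reference completions score zero on average, matching the normalization step described above.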

Policy

Similarly to the reward model, the human feedback policy is also initialized from a pre-trained model. [14]

The key is to understand language generation as if it is a game to be learned by RL. In RL, a policy is a function that maps a game state to a game action. In RLHF, the "game" is the game of replying to prompts. A prompt is a game state, and a response is a game action. This is a fairly trivial kind of game, since every game lasts for exactly one step. Nevertheless, it is a game, and so RL algorithms can be applied to it.
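
The one-step view can be made concrete with a short sketch, where policy and reward_model are hypothetical callables standing in for the trained models.

# Sketch of the "one-step game" view of language generation.
def play_one_episode(policy, reward_model, prompt):
    response = policy(prompt)                 # the single "action" of the episode
    reward = reward_model(prompt, response)   # scalar score from the learned reward model
    return response, reward                   # the episode ends after exactly one step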

The first step in its training is supervised fine-tuning (SFT). This step does not require the reward model. Instead, the pre-trained model is trained on a dataset $D_{\text{SFT}}$ that contains prompt-response pairs $(x, y)$. Then, during SFT, the model is trained to autoregressively generate the corresponding response $y$ when given a random prompt $x$. The original paper recommends running SFT for only one epoch, since more than that causes overfitting.

The dataset is usually written by human contractors, who write both the prompts and responses.
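
A minimal sketch of the SFT loss is given below. It assumes a generic causal language model that maps token ids of shape (batch, sequence) to logits of shape (batch, sequence, vocabulary); the tensor names are placeholders rather than the API of any particular library.

# Sketch of supervised fine-tuning (SFT) on prompt-response pairs.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Autoregressive cross-entropy computed on the response tokens only."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids)                  # (batch, seq, vocab)
    targets = input_ids[:, 1:]                 # predict token t from tokens < t
    logits = logits[:, :-1, :]
    response_start = prompt_ids.size(1) - 1    # first position whose target is a response token
    return F.cross_entropy(
        logits[:, response_start:, :].reshape(-1, logits.size(-1)),
        targets[:, response_start:].reshape(-1),
    )

Following the recommendation mentioned above, this loss would typically be minimized for a single pass over the prompt-response pairs.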

The second step uses a policy gradient method to maximize the score assigned by the reward model. It uses a dataset $D_{\text{RL}}$, which contains prompts, but not responses. Like most policy gradient methods, this algorithm has an outer loop and two inner loops: the outer loop repeats the training iterations; within each iteration, one inner loop samples prompts from $D_{\text{RL}}$ and responses from the current policy and scores them with the reward model, and the other updates the policy parameters $\phi$ by gradient ascent on the objective function.

The objective function is

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\text{RL}}}}\!\left[ r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right]$$

Note that $(x, y) \sim D_{\pi_\phi^{\text{RL}}}$ is equivalent to $x \sim D_{\text{RL}},\ y \sim \pi_\phi^{\text{RL}}(\cdot \mid x)$, which means "sample a prompt from $D_{\text{RL}}$, then sample a response from the policy".

The objective function has two parts. The first part is simply the expected reward $\mathbb{E}[r_\theta(x, y)]$, and is standard for any RL algorithm. The second part is a "penalty term" involving the KL divergence. The strength of the penalty term is determined by the hyperparameter $\beta$.

This KL term works by penalizing the KL divergence (a measure of statistical distance between distributions) between the model being fine-tuned and the initial supervised model. By choosing an appropriate $\beta$, the training can balance learning from new data while retaining useful information from the initial model, increasing generalization by avoiding fitting too closely to the new data. Aside from preventing the new model from producing outputs too dissimilar from those of the initial model, a second motivation of including the KL term is to encourage the model to output high-entropy text, so as to prevent the model from collapsing to a small number of canned responses. [14]

In simpler terms, the objective function calculates how well the policy's responses are expected to align with human feedback. The policy generates responses to prompts, and each response is evaluated both on how well it matches human preferences (as measured by the reward model) and how similar it is to responses the model would naturally generate. The goal is to balance improving alignment with human preferences while ensuring the model's responses remain diverse and not too far removed from what it has learned during its initial training. This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses.
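
The following sketch shows, under simplifying assumptions, how the penalized objective can be computed for one sampled response. Here reward_model, policy_logprob, and sft_logprob are hypothetical callables returning the reward and the sequence log-probabilities under the current policy and the frozen supervised model, and beta is the penalty strength.

# Illustrative sketch of the KL-penalized RLHF objective for a sampled response.
def rlhf_objective(reward_model, policy_logprob, sft_logprob, prompt, response, beta=0.02):
    reward = reward_model(prompt, response)                        # r_theta(x, y)
    kl_penalty = policy_logprob(prompt, response) - sft_logprob(prompt, response)
    # Averaging this quantity over sampled (prompt, response) pairs estimates
    # the objective that the RL step maximizes.
    return reward - beta * kl_penalty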

Proximal policy optimization

The policy function is usually trained by the proximal policy optimization (PPO) algorithm. That is, the parameter $\phi$ is trained by gradient ascent on the clipped surrogate objective. [15] [14]

Technically, PPO uses generalized advantage estimation, which means that there is an extra value estimator $V_\xi(x)$ that is updated concurrently with the policy during PPO training. The value estimator is used only during training, and not outside of training.

PPO maximizes the following clipped surrogate objective:

$$L^{\text{CLIP}}(\phi) = \mathbb{E}\!\left[\min\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}\, A(x, y),\ \operatorname{clip}\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)},\, 1-\epsilon,\, 1+\epsilon\right) A(x, y)\right)\right]$$

where the advantage term is defined as

$$A(x, y) = r_\theta(x, y) - \beta \log\frac{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} - V_\xi(x)$$

That is, the advantage is the reward, minus the KL penalty, minus the value estimate. The policy is trained by gradient ascent on this objective, usually using a standard momentum-based optimizer such as the Adam optimizer.

The original paper initialized the value estimator from the trained reward model. [14] Since PPO is an actor-critic algorithm, the value estimator is updated concurrently with the policy, by minimizing the squared TD-error, which in this case equals the squared advantage term:

$$L_V(\xi) = \mathbb{E}\!\left[\left(r_\theta(x, y) - \beta \log\frac{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} - V_\xi(x)\right)^{2}\right]$$

which is minimized by gradient descent. Methods other than the squared TD-error might also be used; see the actor-critic algorithm page for details.
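
Schematically, the clipped surrogate objective and the squared-error value loss can be computed as below. The tensors are per-sample quantities for a batch of sampled (prompt, response) pairs, and the names are placeholders; this is a sketch of the equations above, not a full PPO training loop.

# Schematic sketch of the PPO losses used in RLHF.
import torch

def ppo_losses(new_logprob, old_logprob, reward, sft_logprob, value, beta=0.02, eps=0.2):
    # Advantage: reward, minus KL penalty, minus value estimate (as in the text).
    advantage = (reward - beta * (old_logprob - sft_logprob) - value).detach()
    ratio = torch.exp(new_logprob - old_logprob)                   # pi_new / pi_old
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_objective = torch.min(ratio * advantage, clipped_ratio * advantage).mean()
    # Squared TD-error for the value estimator (equals the squared advantage here).
    value_loss = ((reward - beta * (old_logprob - sft_logprob) - value) ** 2).mean()
    # Ascend on policy_objective (e.g. minimize its negation) and descend on value_loss.
    return policy_objective, value_loss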

Mixing pretraining gradients

A third term is commonly added to the objective function to prevent catastrophic forgetting. For example, if the model were fine-tuned only on customer-service data, it might forget general knowledge in areas such as geography. To prevent this, the RLHF process incorporates the original language modeling objective. That is, some random texts $x_{\text{pretrain}}$ are sampled from the original pretraining dataset $D_{\text{pretrain}}$, and the model is trained to maximize the log-likelihood of the text, $\log \pi_\phi^{\text{RL}}(x_{\text{pretrain}})$. The final objective function is written as

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\text{RL}}}}\!\left[ r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right] + \gamma\, \mathbb{E}_{x_{\text{pretrain}} \sim D_{\text{pretrain}}}\!\left[\log \pi_\phi^{\text{RL}}(x_{\text{pretrain}})\right]$$

where $\gamma$ controls the strength of this pretraining term. [15] This combined objective function is called PPO-ptx, where "ptx" means "Mixing Pretraining Gradients". [7] It was first used in the InstructGPT paper. [15]

In total, this objective function defines the method for adjusting the RL policy, blending the aim of aligning with human feedback and maintaining the model's original language understanding.

So, writing it out fully explicitly, the PPO-ptx objective function is

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_{\phi_{\text{old}}}^{\text{RL}}}}\!\left[\min\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)}\, A(x, y),\ \operatorname{clip}\!\left(\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi_{\phi_{\text{old}}}^{\text{RL}}(y \mid x)},\, 1-\epsilon,\, 1+\epsilon\right) A(x, y)\right)\right] + \gamma\, \mathbb{E}_{x_{\text{pretrain}} \sim D_{\text{pretrain}}}\!\left[\log \pi_\phi^{\text{RL}}(x_{\text{pretrain}})\right]$$

which is optimized by gradient ascent.
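
As a rough sketch, the pretraining term can be estimated from texts sampled out of the pretraining corpus and added to the PPO policy objective from the previous sketch; gamma and the tensor names are placeholders, and the code assumes the same generic model interface as the SFT sketch.

# Rough sketch of the PPO-ptx combination: PPO objective plus a weighted
# pretraining log-likelihood term that discourages catastrophic forgetting.
import torch.nn.functional as F

def pretraining_logprob(model, pretrain_ids):
    """Mean token log-likelihood of pretraining texts under the current policy model."""
    logits = model(pretrain_ids)                                   # (batch, seq, vocab)
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)
    token_logprobs = logprobs.gather(-1, pretrain_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.mean()

def ppo_ptx_objective(ppo_policy_objective, model, pretrain_ids, gamma):
    # gamma controls how strongly the original language-modeling objective is mixed in.
    return ppo_policy_objective + gamma * pretraining_logprob(model, pretrain_ids)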

Limitations

RLHF suffers from challenges with collecting human feedback, learning a reward model, and optimizing the policy. [38] Compared to data collection for techniques like unsupervised or self-supervised learning, collecting data for RLHF is less scalable and more expensive. Its quality and consistency may vary depending on the task, interface, and the preferences and biases of individual humans. [15] [39]

The effectiveness of RLHF depends on the quality of human feedback. For instance, the model may become biased, favoring certain groups over others, if the feedback lacks impartiality, is inconsistent, or is incorrect. [3] [40] There is a risk of overfitting, where the model memorizes specific feedback examples instead of learning to generalize. For instance, feedback predominantly from a specific demographic might lead the model to learn peculiarities or noise, along with the intended alignment. Excessive alignment to the specific feedback it received (that is, to the bias therein) can lead to the model performing sub-optimally in new contexts or when used by different groups. [41] A single reward function cannot always represent the opinions of diverse groups of people. Even with a representative sample, conflicting views and preferences may result in the reward model favoring the majority's opinion, potentially disadvantaging underrepresented groups. [38]

In some cases, as is possible in regular reinforcement learning, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards rather than genuinely improving its performance. [42] In the case of RLHF, a model may learn to exploit the fact that it is rewarded for what is evaluated positively and not necessarily for what is actually good, which can lead to it learning to persuade and manipulate. For example, models might learn that apparent confidence, even if inaccurate, garners higher rewards. Such behavior, if unchecked, is not just incentivized but can cause significant deployment issues due to the model's potential to mislead. Studies have found that humans are not skilled at identifying mistakes in LLM outputs in complex tasks; therefore, models learning to generate confident-sounding yet incorrect text can lead to significant issues when deployed. [38]

Alternatives

Reinforcement learning from AI feedback

Similarly to RLHF, reinforcement learning from AI feedback (RLAIF) relies on training a preference model, except that the feedback is automatically generated. [43] This is notably used in Anthropic's constitutional AI, where the AI feedback is based on the conformance to the principles of a constitution. [44]

Direct preference optimization

Another alternative to RLHF called Direct Preference Optimization (DPO) has been proposed to learn human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. Unlike RLHF, however, which first trains a separate intermediate model to understand what good outcomes look like and then teaches the main model how to achieve those outcomes, DPO simplifies the process by directly adjusting the main model according to people's preferences. It uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model, helping it understand and prioritize human preferences without needing a separate step. Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.

The idea of DPO is as follows:

Instead of doing the intermediate step of the reward model, DPO directly obtains the final policy.

First, solve directly for the optimal policy, which can be done by Lagrange multipliers, as usual in statistical mechanics:

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi^{\text{SFT}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x) = \sum_{y} \pi^{\text{SFT}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$ is the partition function. This is unfortunately not tractable, since it requires summing over all possible responses $y$. Next, invert this relationship to express the reward implicitly in terms of the optimal policy:

$$r(x, y) = \beta \log\frac{\pi^{*}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} + \beta \log Z(x)$$

Finally, plug this expression back into the maximum likelihood estimator for the reward model, which yields a loss written directly in terms of the policy [45] (Appendix A). Usually, DPO is used for modeling human preference in pairwise comparisons, so that $K = 2$. In that case, the intractable term $\beta \log Z(x)$ cancels, because both completions share the same prompt, and we have

$$\mathcal{L}_{\text{DPO}}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\phi(y_w \mid x)}{\pi^{\text{SFT}}(y_w \mid x)} - \beta \log\frac{\pi_\phi(y_l \mid x)}{\pi^{\text{SFT}}(y_l \mid x)}\right)\right]$$
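
A minimal sketch of the resulting pairwise DPO loss is shown below; policy_logprob and ref_logprob are hypothetical functions returning the sequence log-probabilities of a response under the model being trained and under the frozen reference (SFT) model.

# Minimal sketch of the DPO loss for a batch of pairwise preferences.
import torch.nn.functional as F

def dpo_loss(policy_logprob, ref_logprob, prompts, preferred, rejected, beta=0.1):
    preferred_margin = policy_logprob(prompts, preferred) - ref_logprob(prompts, preferred)
    rejected_margin = policy_logprob(prompts, rejected) - ref_logprob(prompts, rejected)
    # -log sigma( beta * (implicit reward of preferred - implicit reward of rejected) )
    return -F.logsigmoid(beta * (preferred_margin - rejected_margin)).mean()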

DPO eliminates the need for a separate reward model or reinforcement learning loop, treating alignment as a supervised learning problem over preference data. This is simpler to implement and train than RLHF and has been shown to produce comparable and sometimes superior results. [45] Nevertheless, RLHF has also been shown to beat DPO on some datasets, for example, on benchmarks that attempt to measure truthfulness. Therefore, the choice of method may vary depending on the features of the human preference data and the nature of the task. [46]

See also

Reinforcement learning
Markov decision process
Policy gradient method
Proximal policy optimization
Actor-critic algorithm
Exploration-exploitation dilemma
Reasoning language model

References

  1. Russell, Stuart J.; Norvig, Peter (2016). Artificial intelligence: a modern approach (Third, Global ed.). Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo: Pearson. pp. 830–831. ISBN   978-0-13-604259-4.
  2. Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv: 1909.08593 [cs.CL].
  3. Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
  4. Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017). "Proximal Policy Optimization Algorithms". arXiv: 1707.06347 [cs.LG].
  5. Tuan, Yi-Lin; Zhang, Jinzhi; Li, Yujia; Lee, Hung-yi (2018). "Proximal Policy Optimization and its Dynamic Version for Sequence Generation". arXiv: 1808.07982 [cs.CL].
  6. Amodei, Dario; Christiano, Paul; Ray, Alex (13 June 2017). "Learning from human preferences". openai.com. Retrieved 4 March 2023.
  7. Zheng, Rui; Dou, Shihan; Gao, Songyang; Hua, Yuan; Shen, Wei; Wang, Binghai; Liu, Yan; Jin, Senjie; Liu, Qin; Zhou, Yuhao; Xiong, Limao; Chen, Lu; Xi, Zhiheng; Xu, Nuo; Lai, Wenbin; Zhu, Minghao; Chang, Cheng; Yin, Zhangyue; Weng, Rongxiang; Cheng, Wensen; Huang, Haoran; Sun, Tianxiang; Yan, Hang; Gui, Tao; Zhang, Qi; Qiu, Xipeng; Huang, Xuanjing (2023). "Secrets of RLHF in Large Language Models Part I: PPO". arXiv: 2307.04964 [cs.CL].
  8. Knox, W. Bradley; Stone, Peter; Breazeal, Cynthia (2013). "Training a Robot via Human Feedback: A Case Study". Social Robotics. Lecture Notes in Computer Science. Vol. 8239. Springer International Publishing. pp. 460–470. doi:10.1007/978-3-319-02675-6_46. ISBN   978-3-319-02674-9 . Retrieved 26 February 2024.
  9. Akrour, Riad; Schoenauer, Marc; Sebag, Michèle (2012). "APRIL: Active Preference Learning-Based Reinforcement Learning". Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Vol. 7524. Springer. pp. 116–131. arXiv: 1208.0984 . doi:10.1007/978-3-642-33486-3_8. ISBN   978-3-642-33485-6 . Retrieved 26 February 2024.
  10. Wilson, Aaron; Fern, Alan; Tadepalli, Prasad (2012). "A Bayesian Approach for Policy Learning from Trajectory Preference Queries". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc. Retrieved 26 February 2024.
  11. Schoenauer, Marc; Akrour, Riad; Sebag, Michele; Souplet, Jean-Christophe (18 June 2014). "Programming by Feedback". Proceedings of the 31st International Conference on Machine Learning. PMLR: 1503–1511. Retrieved 26 February 2024.
  12. Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv: 1709.10163 . doi:10.1609/aaai.v32i1.11485. S2CID   4130751.
  13. MacGlashan, James; Ho, Mark K.; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv: 1701.06049 .
  14. Nisan Stiennon; Long Ouyang; Jeffrey Wu; Daniel Ziegler; Ryan Lowe; Chelsea Voss; Alec Radford; Dario Amodei; Paul F. Christiano (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
  15. Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). Training language models to follow instructions with human feedback. Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. arXiv: 2203.02155.
  16. Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv: 2204.05862 [cs.CL].
  17. Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  18. Abhishek, Gupta (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
  19. Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Pedro Henrique Martins; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv: 2305.00955 [cs.CL].
  20. Xie, Tengyang; Jiang, Nan; Wang, Huan; Xiong, Caiming; Bai, Yu (2021). "Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 27395–27407. arXiv: 2106.04895. Retrieved 10 March 2024.
  21. Pacchiano, Aldo; Saha, Aadirupa; Lee, Jonathan (2023-03-03). "Dueling RL: Reinforcement Learning with Trajectory Preferences". Proceedings of the 26th International Conference on Artificial Intelligence and Statistics. PMLR: 6263–6289. arXiv: 2111.04850.
  22. Zhu, Banghua; Jordan, Michael; Jiao, Jiantao (2023-07-03). "Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons". Proceedings of the 40th International Conference on Machine Learning. PMLR: 43037–43067. arXiv: 2301.11270.
  23. Li, Zihao; Yang, Zhuoran; Wang, Mengdi (20 June 2023). "Reinforcement learning with Human Feedback: Learning Dynamic Choices via Pessimism". ILHF Workshop ICML 2023. arXiv: 2305.18438 . Retrieved 10 March 2024.
  24. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv: 2203.02155 [cs.CL].
  25. Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
  26. Heikkilä, Melissa (21 February 2023). "How OpenAI is trying to make ChatGPT safer and less biased". MIT Technology Review. Retrieved 4 March 2023.
  27. Douglas Heaven, Will (30 November 2022). "ChatGPT is OpenAI's latest fix for GPT-3. It's slick but still spews nonsense". MIT Technology Review. Retrieved 4 March 2023.
  28. Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv: 2209.14375 [cs.LG].
  29. Goldman, Sharon (23 September 2022). "Why DeepMind isn't deploying its new AI chatbot — and what it means for responsible AI". VentureBeat. Retrieved 4 March 2023.
  30. The Sparrow team (22 September 2022). "Building safer dialogue agents". www.deepmind.com. Retrieved 4 March 2023.
  31. Pinchai, Sundar; Hassabis, Demis (6 December 2023). "Introducing Gemini: our largest and most capable AI model". Google. Retrieved 29 February 2024.
  32. Henshall, Will (18 July 2023). "What to Know About Claude 2, Anthropic's Rival to ChatGPT". TIME. Retrieved 6 March 2024.
  33. Fan, Ying; Watkins, Olivia; Du, Yuqing; Liu, Hao; Ryu, Moonkyung; Boutilier, Craig; Abbeel, Pieter; Ghavamzadeh, Mohammad; Lee, Kangwook; Lee, Kimin (2 November 2023). "DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models". NeurIPS 2023. arXiv: 2305.16381 . Retrieved 1 March 2024.
  34. Xu, Jiazheng; Liu, Xiao; Wu, Yuchen; Tong, Yuxuan; Li, Qinkai; Ding, Ming; Tang, Jie; Dong, Yuxiao (15 December 2023). "ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation". Advances in Neural Information Processing Systems. 36: 15903–15935. arXiv: 2304.05977 . Retrieved 1 March 2024.
  35. Lee, Kimin; Liu, Hao; Ryu, Moonkyung; Watkins, Olivia; Du, Yuqing; Boutilier, Craig; Abbeel, Pieter; Ghavamzadeh, Mohammad; Gu, Shixiang Shane (2023). "Aligning Text-to-Image Models using Human Feedback". arXiv: 2302.12192 [cs.LG].
  36. Leike, Jan; Martic, Miljan; Legg, Shane (12 June 2017). "Learning through human feedback". www.deepmind.com. Retrieved 4 March 2023.
  37. Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv: 1706.03741 . Retrieved 4 March 2023.
  38. Casper, Stephen; Davies, Xander; Shi, Claudia; Gilbert, Thomas Krendl; Scheurer, Jérémy; Rando, Javier; Freedman, Rachel; Korbak, Tomasz; Lindner, David; Freire, Pedro; Wang, Tony Tong; Marks, Samuel; Segerie, Charbel-Raphael; Carroll, Micah; Peng, Andi; Christoffersen, Phillip; Damani, Mehul; Slocum, Stewart; Anwar, Usman; Siththaranjan, Anand; Nadeau, Max; Michaud, Eric J.; Pfau, Jacob; Krasheninnikov, Dmitrii; Chen, Xin; Langosco, Lauro; Hase, Peter; Biyik, Erdem; Dragan, Anca; Krueger, David; Sadigh, Dorsa; Hadfield-Menell, Dylan (18 September 2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". Transactions on Machine Learning Research. arXiv: 2307.15217.
  39. Christiano, Paul (25 January 2023). "Thoughts on the impact of RLHF research" . Retrieved 4 March 2023.
  40. Belenguer, Lorenzo (2022). "AI bias: exploring discriminatory algorithmic decision-making models and the application of possible machine-centric solutions adapted from the pharmaceutical industry". AI and Ethics. 2 (4). AI Ethics: 771–787. doi:10.1007/s43681-022-00138-8. PMC   8830968 . PMID   35194591.
  41. Zhang, Chiyuan; Bengio, Samy; Hardt, Moritz; Recht, Benjamin; Vinyals, Oriol (4 November 2016). "Understanding deep learning requires rethinking generalization". International Conference on Learning Representations.
  42. Clark, Jack; Amodei, Dario (21 December 2016). "Faulty reward functions in the wild". OpenAI.
  43. Lee, Harrison; Phatale, Samrat; Mansoor, Hassan; Lu, Kellie Ren; Mesnard, Thomas; Ferret, Johan; Bishop, Colton; Hall, Ethan; Carbune, Victor; Rastogi, Abhinav (2023-10-13). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback". ICLR.
  44. Edwards, Benj (2023-05-09). "AI gains "values" with Anthropic's new Constitutional AI chatbot approach". Ars Technica. Retrieved 2024-04-27.
  45. Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv: 2305.18290 [cs.LG].
  46. Wang, Zhilin; Dong, Yi; Zeng, Jiaqi; Adams, Virginia; Sreedhar, Makesh Narsimhan; Egert, Daniel; Delalleau, Olivier; Scowcroft, Jane Polak; Kant, Neel; Swope, Aidan; Kuchaiev, Oleksii (2023). "HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM". arXiv: 2311.09528 [cs.CL].

Further reading