Imitation learning is a paradigm in reinforcement learning, where an agent learns to perform a task by supervised learning from expert demonstrations. It is also called learning from demonstration and apprenticeship learning. [1] [2] [3]
It has been applied to underactuated robotics, [4] self-driving cars, [5] [6] [7] quadcopter navigation, [8] helicopter aerobatics, [9] and locomotion. [10] [11]
Expert demonstrations are recordings of an expert performing the desired task, often collected as state-action pairs $(o_t, a_t)$.
Behavior Cloning (BC) is the most basic form of imitation learning. Essentially, it uses supervised learning to train a policy $\pi_\theta$ such that, given an observation $o$, it outputs an action distribution $\pi_\theta(a \mid o)$ that is approximately the same as the expert's action distribution $\pi^*(a \mid o)$. [12]
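As a rough illustration, behavior cloning reduces to ordinary supervised learning on the demonstration pairs. The sketch below is a minimal PyTorch version; the network architecture, the synthetic stand-in demonstrations, and all hyperparameters are illustrative assumptions rather than details from the cited works.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions for this sketch, not from the cited papers).
obs_dim, n_actions, n_demos = 8, 4, 1024

# Stand-in for expert demonstrations: observation-action pairs (o_t, a_t).
expert_obs = torch.randn(n_demos, obs_dim)
expert_actions = torch.randint(0, n_actions, (n_demos,))

# Policy pi_theta(a | o): maps an observation to logits over discrete actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # supervised loss against the expert's actions

for epoch in range(100):
    logits = policy(expert_obs)
    loss = loss_fn(logits, expert_actions)  # push pi_theta(a|o) toward the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the demonstrations would come from a logged dataset rather than random tensors, and the trained policy would then be rolled out in the environment, which is exactly where the distribution-shift issue discussed next arises.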
BC is susceptible to distribution shift. Specifically, if the trained policy differs from the expert policy, it may stray from the expert trajectory into observations that never occurred in the expert demonstrations. [12]
This problem was already noted in the ALVINN project, in which a neural network was trained to drive a van from human demonstrations. Because a human driver never strays far from the path, the network was never trained on what action to take if it ever found itself straying far from the path. [5]
DAgger (Dataset Aggregation) [13] improves on behavior cloning by iteratively growing the training dataset with expert-labelled data. In each iteration, the algorithm first collects data by rolling out the learned policy $\pi_\theta$. Then, it queries the expert for the optimal action $a^*$ on each observation $o$ encountered during the rollout. Finally, it aggregates the new data into the dataset, $D \leftarrow D \cup \{(o, a^*)\}$, and trains a new policy on the aggregated dataset. [12]
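The iteration just described can be summarized in a few lines. In the sketch below, `rollout`, `expert_action`, and `train` are hypothetical helper functions standing in for the environment rollout, the expert query, and the supervised training step; they are not part of any particular library.

```python
def dagger(policy, rollout, expert_action, train, n_iters=10):
    """Sketch of the DAgger loop over hypothetical helpers (not a library API)."""
    dataset = []  # aggregated (observation, expert action) pairs
    for _ in range(n_iters):
        # 1. Collect data by rolling out the current learned policy.
        observations = rollout(policy)
        # 2. Query the expert for the optimal action on each visited observation.
        labeled = [(o, expert_action(o)) for o in observations]
        # 3. Aggregate: D <- D ∪ {(o, a*)} and retrain on the whole dataset.
        dataset.extend(labeled)
        policy = train(dataset)
    return policy
```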
The Decision Transformer approach models reinforcement learning as a sequence modelling problem. [14] Similar to Behavior Cloning, it trains a sequence model, such as a Transformer, on rollout sequences $(R_1, o_1, a_1, R_2, o_2, a_2, \dots)$, where $R_t = \sum_{t' \ge t} r_{t'}$ is the sum of future rewards in the rollout. At training time, the sequence model is trained to predict each action $a_t$ given the preceding rollout $(R_1, o_1, a_1, \dots, R_t, o_t)$ as context. At inference time, to use the sequence model as an effective controller, it is simply given a very high reward prediction $R_1$, and it generalizes by predicting actions that would result in that high reward. This was shown to scale predictably to a Transformer with 1 billion parameters that is superhuman on 41 Atari games. [15]
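A small illustration of the return-to-go quantity $R_t$ used in these rollout sequences; the helper below is an assumption-laden sketch, not code from the cited papers.

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end of the rollout."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

# The model is trained to predict a_t from the context (R_1, o_1, a_1, ..., R_t, o_t).
# At inference time, R_1 is simply set to a desired (very high) return and the model
# predicts actions consistent with achieving it.
print(returns_to_go([1.0, 0.0, 2.0]))  # -> [3.0, 2.0, 2.0]
```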
Inverse Reinforcement Learning (IRL) learns a reward function that explains the expert's behavior and then uses reinforcement learning to find a policy that maximizes this reward. [18]
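In outline, IRL alternates between fitting a reward that explains the demonstrations and solving the resulting RL problem. The sketch below uses hypothetical helpers `fit_reward` and `rl_solve`; the alternating structure is the point, not any specific IRL algorithm.

```python
def inverse_rl(demos, fit_reward, rl_solve, n_iters=10):
    """Generic IRL recipe as a sketch (hypothetical helpers, not a library API)."""
    reward, policy = None, None
    for _ in range(n_iters):
        # Fit a reward under which the expert demonstrations look (near-)optimal,
        # possibly using the current policy as a contrast.
        reward = fit_reward(demos, policy)
        # Standard RL step: find a policy maximizing the learned reward.
        policy = rl_solve(reward)
    return reward, policy
```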
Generative Adversarial Imitation Learning (GAIL) uses generative adversarial networks (GANs) to match the distribution of agent behavior to the distribution of expert demonstrations. [19] It extends an earlier approach based on game theory. [20] [16]
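Up to notational conventions, the adversarial objective in GAIL has the saddle-point form below, where $D$ is a discriminator over state-action pairs, $\pi_E$ is the expert policy, and $H(\pi)$ is a causal-entropy regularizer with weight $\lambda$; treat this as a paraphrase of the cited paper rather than its exact statement:

$$\min_{\pi}\ \max_{D}\ \ \mathbb{E}_{\pi}\left[\log D(s,a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s,a)\right)\right] - \lambda H(\pi)$$

The discriminator acts as a learned cost that distinguishes agent behavior from expert behavior, while the policy is trained to make its state-action distribution indistinguishable from the expert's.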