Imitation learning

Imitation learning is a paradigm in reinforcement learning, where an agent learns to perform a task by supervised learning from expert demonstrations. It is also called learning from demonstration and apprenticeship learning. [1] [2] [3]

It has been applied to underactuated robotics, [4] self-driving cars, [5] [6] [7] quadcopter navigation, [8] helicopter aerobatics, [9] and locomotion. [10] [11]

Approaches

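Expert demonstrations are recordings of an expert performing the desired task, often collected as state-action pairs $(o_t, a_t)$, where $o_t$ is the observation at time step $t$ and $a_t$ is the action the expert took.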

Behavior Cloning

Behavior cloning (BC) is the most basic form of imitation learning. It uses supervised learning to train a policy $\pi_\theta$ such that, given an observation $o_t$, it outputs an action distribution $\pi_\theta(\cdot \mid o_t)$ that approximates the action distribution of the expert. [12]
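
As a minimal sketch of this idea, assume observations and actions are discrete, so that the supervised learning step reduces to counting how often the expert chose each action for each observation. The function name `behavior_cloning` and the lane-keeping demonstrations are hypothetical and only serve as an illustration; practical systems would fit a neural network instead of a lookup table.

```python
from collections import Counter, defaultdict

def behavior_cloning(demos):
    """Fit a tabular policy by maximum likelihood on (observation, action) pairs."""
    counts = defaultdict(Counter)
    for obs, action in demos:
        counts[obs][action] += 1
    # Normalise the counts into a conditional distribution pi(action | obs).
    policy = {}
    for obs, c in counts.items():
        total = sum(c.values())
        policy[obs] = {a: n / total for a, n in c.items()}
    return policy

# Hypothetical demonstrations: the expert steers back toward the lane centre.
demos = [
    ("left_of_centre", "steer_right"),
    ("left_of_centre", "steer_right"),
    ("centred", "straight"),
    ("centred", "straight"),
    ("centred", "steer_right"),
    ("right_of_centre", "steer_left"),
]

policy = behavior_cloning(demos)
```

The resulting table has no entry for any observation the expert never visited, which is the gap that the distribution-shift discussion is concerned with.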

BC is susceptible to distribution shift. Specifically, if the trained policy differs even slightly from the expert policy, it may stray from the expert's trajectories into observations that never occur in the demonstrations. The policy has no training data for those observations, so its errors can compound. [12]
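
One common way to quantify this effect comes from the analysis accompanying DAgger [13]. The symbols below are assumptions introduced here for illustration: $T$ is the task horizon, $J(\pi)$ is the expected total cost of a policy with per-step cost in $[0, 1]$, $\pi^*$ is the expert, and $\epsilon$ is the probability that the learned policy $\hat{\pi}$ disagrees with the expert on observations drawn from the expert's own distribution.

```latex
J(\hat{\pi}) \;\le\; J(\pi^*) + T^2 \epsilon
```

The quadratic dependence on $T$ reflects that a single mistake can move the learner into unfamiliar observations, where it may keep making mistakes for the rest of the rollout.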

This problem was already noted in the ALVINN project, in which a neural network was trained to drive a van from human demonstrations. The authors observed that because a human driver never strays far from the path, the network is never trained on what action to take if it finds itself far from the path. [5]

DAgger

DAgger (Dataset Aggregation) [13] improves on behavior cloning by iteratively training on a growing dataset of expert-labeled observations. In each iteration, the algorithm first collects data by rolling out the learned policy $\pi_\theta$. Then, it queries the expert for the optimal action $a_t^*$ on each observation $o_t$ encountered during the rollout. Finally, it aggregates the new data into the dataset, $D \leftarrow D \cup \{(o_1, a_1^*), (o_2, a_2^*), \dots, (o_T, a_T^*)\}$, and trains a new policy on the aggregated dataset. [12]
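
The loop can be sketched on a toy problem. Everything in the example below is an assumption made for illustration: the world is a one-dimensional line of integer states, the hypothetical `expert` always steps toward the origin, and the learner is a majority-vote lookup table that falls back to a fixed default action on unseen states.

```python
from collections import Counter, defaultdict

def expert(state):
    """Hypothetical expert: always step toward the origin."""
    return -1 if state > 0 else (1 if state < 0 else 0)

def train(dataset, default_action=1):
    """Majority-vote tabular policy; unseen states fall back to default_action."""
    votes = defaultdict(Counter)
    for s, a in dataset:
        votes[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, default_action)

def rollout(policy, start, horizon=6):
    """Run the policy in the toy 1-D world and return the states it visits."""
    states, s = [], start
    for _ in range(horizon):
        states.append(s)
        s = s + policy(s)
    return states

def dagger(initial_demos, start, iterations=5):
    dataset = list(initial_demos)
    policy = train(dataset)
    for _ in range(iterations):
        visited = rollout(policy, start)             # 1. roll out the learner
        labels = [(s, expert(s)) for s in visited]   # 2. expert labels visited states
        dataset.extend(labels)                       # 3. aggregate into the dataset
        policy = train(dataset)                      # 4. retrain on everything so far
    return policy, dataset

# The initial demonstrations only cover states near the origin.
initial = [(0, 0), (-1, 1)]
bc_only = train(initial)
policy, dataset = dagger(initial, start=3)
```

In this toy setting the policy trained only on `initial` drifts away from the origin when started at state 3, while the policy returned by `dagger` has been corrected on exactly the states the learner itself visited.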

Decision transformer

Figure: Architecture diagram of the Decision Transformer.

The Decision Transformer approach models reinforcement learning as a sequence modelling problem. [14] Similar to behavior cloning, it trains a sequence model, such as a Transformer, on rollout sequences $(R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_t, o_t, a_t)$, where $R_t = r_t + r_{t+1} + \dots + r_T$ is the sum of future rewards in the rollout. During training, the sequence model is trained to predict each action $a_t$, given the preceding rollout as context: $(R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_t, o_t)$. During inference, to use the sequence model as a controller, it is given a very high target return $R$, and it generalizes by predicting an action that would lead to that high return. This was shown to scale predictably to a Transformer with 1 billion parameters that is superhuman on 41 Atari games. [15]
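
The data preparation implied by this description can be sketched as follows. The helper names `returns_to_go` and `build_context` are hypothetical, the time index `t` is 0-based in the code, and the sequence model itself is omitted; the sketch only shows how the sum of future rewards and the training context for one action could be assembled.

```python
def returns_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T, accumulated from the end of the rollout."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def build_context(rewards, observations, actions, t):
    """Return the tokens (R_1, o_1, a_1), ..., (R_t, o_t) and the target action a_t."""
    rtg = returns_to_go(rewards)
    context = []
    for k in range(t):
        context += [rtg[k], observations[k], actions[k]]
    # The final step contributes only its return and observation;
    # the action at this step is what the model must predict.
    context += [rtg[t], observations[t]]
    return context, actions[t]

# A hypothetical three-step rollout.
rewards = [1.0, 0.0, 2.0]
observations = ["o1", "o2", "o3"]
actions = ["a1", "a2", "a3"]

context, target = build_context(rewards, observations, actions, t=1)
```

At inference time the first return token would be replaced by the desired target return, and in the original method [14] that target is reduced by each reward actually received as the rollout proceeds.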

Other approaches

See [16] [17] for more examples.

Inverse Reinforcement Learning (IRL) learns a reward function that explains the expert's behavior and then uses reinforcement learning to find a policy that maximizes this reward. [18]

Generative Adversarial Imitation Learning (GAIL) uses generative adversarial networks (GANs) to match the distribution of agent behavior to the distribution of expert demonstrations. [19] It extends a previous approach using game theory. [20] [16]
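
For reference, the saddle-point objective from the GAIL paper [19] can be written as below. The notation is that paper's rather than this article's: $D$ is a discriminator over state-action pairs $(s, a)$, $\pi_E$ is the expert policy, $H(\pi)$ is the causal entropy of the learner's policy, and $\lambda \ge 0$ weights the entropy term.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{\pi}\big[\log D(s, a)\big]
  + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s, a)\big)\big]
  - \lambda H(\pi)
```

The discriminator tries to tell the learner's state-action pairs from the expert's, while the policy is updated to make the two indistinguishable.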

References

  1. Russell, Stuart J.; Norvig, Peter (2021). "22.6 Apprenticeship and Inverse Reinforcement Learning". Artificial intelligence: a modern approach. Pearson series in artificial intelligence (Fourth ed.). Hoboken: Pearson. ISBN 978-0-13-461099-3.
  2. Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (Second ed.). Cambridge, Massachusetts: The MIT Press. p. 470. ISBN 978-0-262-03924-6.
  3. Hussein, Ahmed; Gaber, Mohamed Medhat; Elyan, Eyad; Jayne, Chrisina (2017-04-06). "Imitation Learning: A Survey of Learning Methods". ACM Comput. Surv. 50 (2): 21:1–21:35. doi:10.1145/3054912. hdl:10059/2298. ISSN 0360-0300.
  4. "Ch. 21 - Imitation Learning". underactuated.mit.edu. Retrieved 2024-08-08.
  5. Pomerleau, Dean A. (1988). "ALVINN: An Autonomous Land Vehicle in a Neural Network". Advances in Neural Information Processing Systems. 1. Morgan-Kaufmann.
  6. Bojarski, Mariusz; Del Testa, Davide; Dworakowski, Daniel; Firner, Bernhard; Flepp, Beat; Goyal, Prasoon; Jackel, Lawrence D.; Monfort, Mathew; Muller, Urs (2016-04-25). "End to End Learning for Self-Driving Cars". arXiv:1604.07316v1 [cs.CV].
  7. Kiran, B Ravi; Sobh, Ibrahim; Talpaert, Victor; Mannion, Patrick; Sallab, Ahmad A. Al; Yogamani, Senthil; Perez, Patrick (June 2022). "Deep Reinforcement Learning for Autonomous Driving: A Survey". IEEE Transactions on Intelligent Transportation Systems. 23 (6): 4909–4926. arXiv:2002.00444. doi:10.1109/TITS.2021.3054625. ISSN 1524-9050.
  8. Giusti, Alessandro; Guzzi, Jerome; Ciresan, Dan C.; He, Fang-Lin; Rodriguez, Juan P.; Fontana, Flavio; Faessler, Matthias; Forster, Christian; Schmidhuber, Jurgen; Caro, Gianni Di; Scaramuzza, Davide; Gambardella, Luca M. (July 2016). "A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots". IEEE Robotics and Automation Letters. 1 (2): 661–667. doi:10.1109/LRA.2015.2509024. ISSN 2377-3766.
  9. "Autonomous Helicopter: Stanford University AI Lab". heli.stanford.edu. Retrieved 2024-08-08.
  10. Nakanishi, Jun; Morimoto, Jun; Endo, Gen; Cheng, Gordon; Schaal, Stefan; Kawato, Mitsuo (June 2004). "Learning from demonstration and adaptation of biped locomotion". Robotics and Autonomous Systems. 47 (2–3): 79–91. doi:10.1016/j.robot.2004.03.003.
  11. Kalakrishnan, Mrinal; Buchli, Jonas; Pastor, Peter; Schaal, Stefan (October 2009). "Learning locomotion over rough terrain using terrain templates". 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. pp. 167–172. doi:10.1109/iros.2009.5354701. ISBN 978-1-4244-3803-7.
  12. CS 285 at UC Berkeley: Deep Reinforcement Learning. Lecture 2: Supervised Learning of Behaviors.
  13. Ross, Stephane; Gordon, Geoffrey; Bagnell, Drew (2011-06-14). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 627–635.
  14. Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Misha; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 15084–15097. arXiv:2106.01345.
  15. Lee, Kuang-Huei; Nachum, Ofir; Yang, Mengjiao; Lee, Lisa; Freeman, Daniel; Xu, Winnie; Guadarrama, Sergio; Fischer, Ian; Jang, Eric (2022-10-15). "Multi-Game Decision Transformers". arXiv:2205.15241. Retrieved 2024-10-22.
  16. Hester, Todd; Vecerik, Matej; Pietquin, Olivier; Lanctot, Marc; Schaul, Tom; Piot, Bilal; Horgan, Dan; Quan, John; Sendonaris, Andrew (2017-04-12). "Deep Q-learning from Demonstrations". arXiv:1704.03732v4 [cs.AI].
  17. Duan, Yan; Andrychowicz, Marcin; Stadie, Bradly; Ho, Jonathan; Schneider, Jonas; Sutskever, Ilya; Abbeel, Pieter; Zaremba, Wojciech (2017). "One-Shot Imitation Learning". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  18. Ng, Andrew Y.; Russell, Stuart (2000). "Algorithms for Inverse Reinforcement Learning". Proceedings of the 17th International Conference on Machine Learning: 663–670.
  19. Ho, Jonathan; Ermon, Stefano (2016). "Generative Adversarial Imitation Learning". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.03476.
  20. Syed, Umar; Schapire, Robert E (2007). "A Game-Theoretic Approach to Apprenticeship Learning". Advances in Neural Information Processing Systems. 20. Curran Associates, Inc.