Imitation learning is a paradigm in reinforcement learning, where an agent learns to perform a task by supervised learning from expert demonstrations. It is also called learning from demonstration and apprenticeship learning. [1] [2] [3]
It has been applied to underactuated robotics, [4] self-driving cars, [5] [6] [7] quadcopter navigation, [8] helicopter aerobatics, [9] and locomotion. [10] [11]
Expert demonstrations are recordings of an expert performing the desired task, often collected as state-action pairs $(o_t, a_t)$.
Behavior Cloning (BC) is the most basic form of imitation learning. Essentially, it uses supervised learning to train a policy $\pi_\theta$ such that, given an observation $o$, it outputs an action distribution $\pi_\theta(a \mid o)$ that is approximately the same as the expert's action distribution $\pi^*(a \mid o)$. [12]
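As a rough illustration, behavior cloning reduces to ordinary supervised learning on the demonstration pairs. The sketch below is a minimal PyTorch version; the network architecture, the synthetic stand-in demonstrations, and all hyperparameters are illustrative assumptions rather than details from the cited works.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions for this sketch, not from the cited papers).
obs_dim, n_actions, n_demos = 8, 4, 1024

# Stand-in for expert demonstrations: observation-action pairs (o_t, a_t).
expert_obs = torch.randn(n_demos, obs_dim)
expert_actions = torch.randint(0, n_actions, (n_demos,))

# Policy pi_theta(a | o): maps an observation to logits over discrete actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # supervised loss against the expert's actions

for epoch in range(100):
    logits = policy(expert_obs)
    loss = loss_fn(logits, expert_actions)  # push pi_theta(a|o) toward the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the demonstrations would come from a logged dataset rather than random tensors, and the trained policy would then be rolled out in the environment, which is exactly where the distribution-shift issue discussed next arises.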
BC is susceptible to distribution shift. Specifically, if the trained policy differs from the expert policy, it may stray from the expert trajectory into observations that never occurred in the expert demonstrations. [12]
This problem was already noted in the ALVINN project, in which a neural network was trained to drive a van from human demonstrations. Because a human driver never strays far from the path, the network was never trained on what action to take if it ever found itself straying far from the path. [5]
DAgger (Dataset Aggregation) [13] improves on behavior cloning by iteratively growing the training dataset with expert-labelled data. In each iteration, the algorithm first collects data by rolling out the learned policy $\pi_\theta$. Then, it queries the expert for the optimal action $a^*$ on each observation $o$ encountered during the rollout. Finally, it aggregates the new data into the dataset, $D \leftarrow D \cup \{(o, a^*)\}$, and trains a new policy on the aggregated dataset. [12]
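The iteration just described can be summarized in a few lines. In the sketch below, `rollout`, `expert_action`, and `train` are hypothetical helper functions standing in for the environment rollout, the expert query, and the supervised training step; they are not part of any particular library.

```python
def dagger(policy, rollout, expert_action, train, n_iters=10):
    """Sketch of the DAgger loop over hypothetical helpers (not a library API)."""
    dataset = []  # aggregated (observation, expert action) pairs
    for _ in range(n_iters):
        # 1. Collect data by rolling out the current learned policy.
        observations = rollout(policy)
        # 2. Query the expert for the optimal action on each visited observation.
        labeled = [(o, expert_action(o)) for o in observations]
        # 3. Aggregate: D <- D ∪ {(o, a*)} and retrain on the whole dataset.
        dataset.extend(labeled)
        policy = train(dataset)
    return policy
```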
The Decision Transformer approach models reinforcement learning as a sequence modelling problem. [14] Similar to Behavior Cloning, it trains a sequence model, such as a Transformer, on rollout sequences $(R_1, o_1, a_1, R_2, o_2, a_2, \dots)$, where $R_t = \sum_{t' \ge t} r_{t'}$ is the sum of future rewards in the rollout. At training time, the sequence model is trained to predict each action $a_t$ given the preceding rollout $(R_1, o_1, a_1, \dots, R_t, o_t)$ as context. At inference time, to use the sequence model as an effective controller, it is simply given a very high reward prediction $R_1$, and it generalizes by predicting actions that would result in that high reward. This was shown to scale predictably to a Transformer with 1 billion parameters that is superhuman on 41 Atari games. [15]
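A small illustration of the return-to-go quantity $R_t$ used in these rollout sequences; the helper below is an assumption-laden sketch, not code from the cited papers.

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end of the rollout."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

# The model is trained to predict a_t from the context (R_1, o_1, a_1, ..., R_t, o_t).
# At inference time, R_1 is simply set to a desired (very high) return and the model
# predicts actions consistent with achieving it.
print(returns_to_go([1.0, 0.0, 2.0]))  # -> [3.0, 2.0, 2.0]
```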
Inverse Reinforcement Learning (IRL) learns a reward function that explains the expert's behavior and then uses reinforcement learning to find a policy that maximizes this reward. [18]
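In outline, IRL alternates between fitting a reward that explains the demonstrations and solving the resulting RL problem. The sketch below uses hypothetical helpers `fit_reward` and `rl_solve`; the alternating structure is the point, not any specific IRL algorithm.

```python
def inverse_rl(demos, fit_reward, rl_solve, n_iters=10):
    """Generic IRL recipe as a sketch (hypothetical helpers, not a library API)."""
    reward, policy = None, None
    for _ in range(n_iters):
        # Fit a reward under which the expert demonstrations look (near-)optimal,
        # possibly using the current policy as a contrast.
        reward = fit_reward(demos, policy)
        # Standard RL step: find a policy maximizing the learned reward.
        policy = rl_solve(reward)
    return reward, policy
```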
Generative Adversarial Imitation Learning (GAIL) uses generative adversarial networks (GANs) to match the distribution of agent behavior to the distribution of expert demonstrations. [19] It extends an earlier approach based on game theory. [20] [16]
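Up to notational conventions, the adversarial objective in GAIL has the saddle-point form below, where $D$ is a discriminator over state-action pairs, $\pi_E$ is the expert policy, and $H(\pi)$ is a causal-entropy regularizer with weight $\lambda$; treat this as a paraphrase of the cited paper rather than its exact statement:

$$\min_{\pi}\ \max_{D}\ \ \mathbb{E}_{\pi}\left[\log D(s,a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s,a)\right)\right] - \lambda H(\pi)$$

The discriminator acts as a learned cost that distinguishes agent behavior from expert behavior, while the policy is trained to make its state-action distribution indistinguishable from the expert's.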