Apprenticeship learning

In artificial intelligence, apprenticeship learning (or learning from demonstration or imitation learning) is the process of learning by observing an expert. [1] [2] It can be viewed as a form of supervised learning, where the training dataset consists of task executions by a demonstration teacher. [2]

Mapping function approach

Mapping methods try to mimic the expert by forming a direct mapping either from states to actions, [2] or from states to reward values. [1] For example, in 2002 researchers used such an approach to teach an AIBO robot basic soccer skills. [2]
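
A minimal sketch of the state-to-action mapping idea, framed as ordinary supervised learning (often called behavioral cloning). The toy demonstration data and the choice of a nearest-neighbour regressor are illustrative assumptions, not the method of the cited AIBO work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Expert demonstrations: each row of `states` is an observed state, and the matching
# row of `actions` is the action the expert took there (illustrative toy values).
states = np.array([[0.0, 1.2], [0.5, 0.9], [1.0, 0.4], [1.5, -0.1]])   # e.g. position, velocity
actions = np.array([[0.8], [0.6], [0.1], [-0.3]])                      # e.g. motor command

# Fit the direct mapping pi(state) ~ action with standard supervised learning.
policy = KNeighborsRegressor(n_neighbors=2).fit(states, actions)

# At run time the learner queries the mapping in whatever state it encounters.
print(policy.predict(np.array([[0.7, 0.7]])))
```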

Inverse reinforcement learning approach

Inverse reinforcement learning (IRL) is the process of deriving a reward function from observed behavior. While ordinary "reinforcement learning" involves using rewards and punishments to learn behavior, in IRL the direction is reversed, and a robot observes a person's behavior to figure out what goal that behavior seems to be trying to achieve. [3] The IRL problem can be defined as: [4]

Given 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (including the agent's body): Determine the reward function that the agent is optimizing.
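
One common way to state this formally (a sketch using standard Markov decision process notation, not a formula taken from the cited papers): assuming a discount factor \(\gamma\) and an observed expert policy \(\pi_E\), IRL seeks a reward function \(R\) under which the expert's behavior is at least as good as any alternative policy \(\pi\):

```latex
\text{find } R \quad \text{such that} \quad
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi_E \right]
\;\ge\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi \right]
\quad \text{for all policies } \pi .
```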

IRL researcher Stuart J. Russell proposes that IRL might be used to observe humans and attempt to codify their complex "ethical values", in an effort to create "ethical robots" that might someday know "not to cook your cat" without needing to be explicitly told. [5] The scenario can be modeled as a "cooperative inverse reinforcement learning game", where a "person" player and a "robot" player cooperate to secure the person's implicit goals, despite these goals not being explicitly known by either the person or the robot. [6] [7]

In 2017, OpenAI and DeepMind applied deep learning to cooperative inverse reinforcement learning in simple domains such as Atari games and straightforward robot tasks such as backflips. The human role was limited to answering queries from the robot as to which of two candidate behaviors it preferred. The researchers found evidence that the techniques may be economically scalable to modern systems. [8] [9]
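
A minimal sketch of the preference-comparison idea behind that line of work: fit a reward model so that the behavior the human preferred receives the higher predicted return (a Bradley-Terry / logistic formulation). The toy feature vectors, learning rate, and gradient-ascent loop are illustrative assumptions, not the published implementation.

```python
import numpy as np

# Each preference compares two behavior segments; a segment is summarized here by a
# feature vector (illustrative assumption). label = 1 means the human preferred segment A.
prefs = [
    (np.array([1.0, 0.2]), np.array([0.3, 0.9]), 1),
    (np.array([0.1, 0.8]), np.array([0.9, 0.1]), 0),
]

w = np.zeros(2)          # parameters of a linear reward model r(x) = w . x
lr = 0.5

for _ in range(200):     # simple gradient ascent on the Bradley-Terry log-likelihood
    grad = np.zeros_like(w)
    for seg_a, seg_b, label in prefs:
        p_a = 1.0 / (1.0 + np.exp(-(w @ seg_a - w @ seg_b)))   # P(human prefers A)
        grad += (label - p_a) * (seg_a - seg_b)
    w += lr * grad

print("learned reward weights:", w)   # an agent would then be trained against r(x) = w . x
```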

Apprenticeship via inverse reinforcement learning (AIRP) was developed in 2004 by Pieter Abbeel, Professor in Berkeley's EECS department, and Andrew Ng, Associate Professor in Stanford University's Computer Science Department. AIRP addresses the setting of a "Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform". [1] AIRP has been used to model reward functions in highly dynamic scenarios where no reward function is intuitively obvious. Take driving as an example: many objectives are in play simultaneously, such as maintaining a safe following distance, keeping a good speed, and not changing lanes too often. The task may seem easy at first glance, but a naively specified reward function may not converge to the desired policy.
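
A compressed sketch of the feature-expectation idea underlying apprenticeship learning via IRL: the reward is assumed linear in state features, and the learner searches for reward weights under which its own expected feature counts approach the expert's. The helper `solve_mdp_for_reward`, the update loop, and the feature vectors are hypothetical placeholders; this is not the exact 2004 algorithm.

```python
import numpy as np

def feature_expectations(trajectories, gamma=0.9):
    """Discounted average of per-step state feature vectors over a set of trajectories."""
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        mu += sum((gamma ** t) * phi for t, phi in enumerate(traj))
    return mu / len(trajectories)

def solve_mdp_for_reward(w):
    # Hypothetical stub: a real system would run any RL/planning algorithm with
    # reward r(s) = w . phi(s) and return trajectories of the resulting policy.
    # Here it returns a fixed toy trajectory so the sketch runs end to end.
    return [[np.array([0.9, 0.05]), np.array([0.7, 0.2])]]

# Expert demonstrations, each a list of per-step feature vectors (illustrative assumption).
expert_trajs = [[np.array([1.0, 0.0]), np.array([0.8, 0.1])]]
mu_expert = feature_expectations(expert_trajs)

mu_learner = np.zeros_like(mu_expert)        # start from an arbitrary policy
for _ in range(10):                          # simplified matching loop
    w = mu_expert - mu_learner               # point the reward toward the expert's feature counts
    learner_trajs = solve_mdp_for_reward(w)  # standard "forward" RL step
    mu_learner = feature_expectations(learner_trajs)
    if np.linalg.norm(mu_expert - mu_learner) < 1e-2:
        break                                # learner's feature counts now resemble the expert's
```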

One domain where AIRP has been used extensively is helicopter control. While simple trajectories can be derived intuitively, complicated tasks such as aerobatics for air shows have also been learned successfully. These include aerobatic maneuvers such as in-place flips, in-place rolls, loops, hurricanes, and even auto-rotation landings. This work was developed by Pieter Abbeel, Adam Coates, and Andrew Ng in "Autonomous Helicopter Aerobatics through Apprenticeship Learning". [10]

System model approach

System models try to mimic the expert by modeling world dynamics. [2]
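
A minimal sketch of the system-model idea: fit a model of the world dynamics, s' ≈ f(s, a), from the demonstration data, and then plan against that model. The toy transition data and the linear regressor are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Demonstration data as (state, action, next_state) transitions (illustrative toy values).
states      = np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.7]])
actions     = np.array([[0.5], [0.4], [0.3]])
next_states = np.array([[0.1, 0.9], [0.2, 0.7], [0.3, 0.4]])

# Learn the world dynamics s' = f(s, a); a planner can then search this model for good actions.
dynamics = LinearRegression().fit(np.hstack([states, actions]), next_states)
print(dynamics.predict(np.hstack([[[0.3, 0.4]], [[0.2]]])))
```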

Plan approach

The system learns rules to associate preconditions and postconditions with each action. In one 1994 demonstration, a humanoid learns a generalized plan from only two demonstrations of a repetitive ball collection task. [2]
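
A sketch of what such learned rules might look like as a data structure, in the style of classical planning operators; the predicate names and the task are illustrative assumptions, not the representation used by the cited 1994 system.

```python
# Hypothetical learned rule for one step of a ball-collection task: the action is applicable
# when its preconditions hold, and its postconditions describe the expected effect.
pick_up_ball = {
    "action": "pick_up_ball",
    "preconditions": {"hand_empty", "ball_visible", "ball_reachable"},
    "postconditions": {"holding_ball"},
}

def applicable(rule, world_state):
    """A rule learned from demonstration can fire only if all its preconditions are satisfied."""
    return rule["preconditions"] <= world_state   # subset test on the set of true facts

print(applicable(pick_up_ball, {"hand_empty", "ball_visible", "ball_reachable", "near_bin"}))
```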

Example

Learning from demonstration is often explained from the perspective that a working robot control system is already available and the human demonstrator is using it. And indeed, if the software works, the human operator takes the robot arm, makes a move with it, and the robot reproduces the action later. For example, the operator teaches the robot arm how to put a cup under a coffee maker and press the start button; in the replay phase, the robot imitates this behavior 1:1. But that is not how the system works internally; it is only what the audience can observe. In reality, learning from demonstration is much more complex. One of the first works on learning by robot apprentices (anthropomorphic robots learning by imitation) was Adrian Stoica's PhD thesis in 1995. [11]

In 1997, robotics expert Stefan Schaal was working on the Sarcos robot arm. The goal was simple: solve the pendulum swing-up task. The robot itself can execute a movement, and as a result the pendulum moves. The problem is that it is unclear which actions will result in which movement. Swing-up is an optimal control problem that can be described with mathematical formulas but is hard to solve. Schaal's idea was not to use a brute-force solver but to record the movements of a human demonstration: the angle of the pendulum is logged over three seconds, yielding a trajectory such as the one in the table below. [12]

Trajectory over time

  time (seconds)   angle (radians)
  0.0              -3.0
  0.5              -2.8
  1.0              -4.5
  1.5              -1.0

In computer animation, the principle is called spline animation: [13] the time is given on the x-axis, for example 0.5 seconds, 1.0 seconds, 1.5 seconds, while the controlled variable is given on the y-axis. In most cases that variable is the position of an object; for the inverted pendulum it is the angle.
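
A minimal sketch of turning the recorded samples above into a continuous reference trajectory with a cubic spline; the use of SciPy's CubicSpline and the query times are illustrative choices, not the method of the cited work.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Recorded demonstration: pendulum angle (radians) sampled over time (seconds), as in the table.
t = np.array([0.0, 0.5, 1.0, 1.5])
angle = np.array([-3.0, -2.8, -4.5, -1.0])

# Fit a spline through the keyframes so the reference angle is defined at every instant.
reference = CubicSpline(t, angle)

# Query the desired angle at arbitrary times during replay.
print(reference(np.array([0.25, 0.75, 1.25])))
```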

The overall task consists of two parts: recording the angle over time and reproducing the recorded motion. The reproduction step is surprisingly simple: as input, we know which angle the pendulum must have at each time step. Driving the system to a desired state is called tracking control, often realized with a PID controller. That means we have a trajectory over time and must find control actions that keep the system on this trajectory. Other authors call the principle "steering behavior", [14] because the aim is to bring a robot onto a given line.
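
A minimal sketch of such a tracking loop: at each time step a PID controller compares the measured angle against the spline reference and outputs a torque command. The gains, the time step, and the crude plant stand-in are hypothetical placeholders, not the Sarcos controller.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Reference trajectory: the recorded pendulum angles from the table above.
reference = CubicSpline([0.0, 0.5, 1.0, 1.5], [-3.0, -2.8, -4.5, -1.0])

# Hypothetical PID tracking loop; gains, time step, and the plant stand-in are assumptions.
kp, ki, kd = 8.0, 0.5, 1.0
dt = 0.01
integral, prev_error = 0.0, 0.0
angle = -3.0                                   # stand-in for the measured angle at the start

for step in range(int(1.5 / dt)):
    target = float(reference(step * dt))       # where the demonstration says the pendulum should be
    error = target - angle
    integral += error * dt
    derivative = (error - prev_error) / dt
    torque = kp * error + ki * integral + kd * derivative   # control action sent to the motor
    prev_error = error
    angle += 0.1 * torque * dt                 # crude plant stand-in so the sketch runs end to end

print("final angle:", angle, "final target:", float(reference(1.5)))
```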

References

  1. Abbeel, Pieter; Ng, Andrew Y. (2004). "Apprenticeship learning via inverse reinforcement learning". Proceedings of the 21st International Conference on Machine Learning (ICML).
  2. Argall, Brenna D.; Chernova, Sonia; Veloso, Manuela; Browning, Brett (May 2009). "A survey of robot learning from demonstration". Robotics and Autonomous Systems. 57 (5): 469–483. CiteSeerX 10.1.1.145.345. doi:10.1016/j.robot.2008.10.024. S2CID 1045325.
  3. Wolchover, Natalie. "This Artificial Intelligence Pioneer Has a Few Concerns". WIRED. Retrieved 22 January 2018.
  4. Russell, Stuart (1998). "Learning agents for uncertain environments". Proceedings of the Eleventh Annual Conference on Computational Learning Theory. pp. 101–103. doi:10.1145/279943.279964. S2CID 546942.
  5. Havens, John C. (23 June 2015). "The ethics of AI: how to stop your robot cooking your cat". The Guardian. Retrieved 22 January 2018.
  6. "Artificial Intelligence And The King Midas Problem". Huffington Post. 12 December 2016. Retrieved 22 January 2018.
  7. Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; Dragan, A. (2016). "Cooperative inverse reinforcement learning". Advances in Neural Information Processing Systems. pp. 3909–3917.
  8. "Two Giants of AI Team Up to Head Off the Robot Apocalypse". WIRED. 7 July 2017. Retrieved 29 January 2018.
  9. Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. (2017). "Deep reinforcement learning from human preferences". Advances in Neural Information Processing Systems. pp. 4302–4310.
  10. Abbeel, Pieter; Coates, Adam; Ng, Andrew Y. (2010). "Autonomous Helicopter Aerobatics through Apprenticeship Learning". International Journal of Robotics Research. 29 (13).
  11. Stoica, Adrian (1995). Motion Learning by Robot Apprentices: A Fuzzy Neural Approach (PhD thesis). Victoria University of Technology. https://vuir.vu.edu.au/15323/
  12. Atkeson, Christopher G.; Schaal, Stefan (1997). "Learning tasks from a single demonstration". Proceedings of the International Conference on Robotics and Automation. Vol. 2. IEEE. pp. 1706–1712. CiteSeerX 10.1.1.385.3520. doi:10.1109/robot.1997.614389. ISBN 978-0-7803-3612-4. S2CID 1945873.
  13. Akgun, Baris; Cakmak, Maya; Jiang, Karl; Thomaz, Andrea L. (2012). "Keyframe-based Learning from Demonstration". International Journal of Social Robotics. 4 (4): 343–355. doi:10.1007/s12369-012-0160-0. S2CID 10004846.
  14. Reynolds, Craig W. (1999). "Steering behaviors for autonomous characters". Game Developers Conference. pp. 763–782.