Mountain car problem

Figure: Diagram of the mountain car problem.

Mountain Car, a standard testing domain in reinforcement learning, is a problem in which an under-powered car must drive up a steep hill. Since gravity is stronger than the car's engine, the car cannot simply accelerate up the steep slope, even at full throttle. The car starts in a valley and must learn to build up momentum by first driving up the opposite hill before it can reach the goal at the top of the rightmost hill. The domain has been used as a test bed in various reinforcement learning papers.

Introduction

The mountain car problem, although fairly simple, is commonly used because it requires a reinforcement learning agent to learn over two continuous variables: position and velocity. In any given state (position and velocity) of the car, the agent can drive left, drive right, or not use the engine at all. In the standard version of the problem, the agent receives a negative reward at every time step on which the goal is not reached; the agent has no information about the goal until its first success.

History

The mountain car problem first appeared in Andrew Moore's PhD thesis (1990). [1] It was later defined more strictly in Singh and Sutton's reinforcement learning paper on eligibility traces. [2] The problem became more widely studied when Sutton and Barto added it to their book Reinforcement Learning: An Introduction (1998). [3] Over the years many versions of the problem have been used, such as those that modify the reward function, the termination condition, or the start state.

Techniques used to solve mountain car

Q-learning and similar techniques that map discrete states to discrete actions must be extended to handle the problem's continuous state space. Approaches usually fall into one of two categories: state space discretization or function approximation.

Discretization

In this approach, the two continuous state variables are mapped to discrete states by bucketing each variable into a fixed number of bins. This works with properly tuned parameters, but it has a disadvantage: information gathered in one state is not used to evaluate any other state. Tile coding improves on plain discretization by mapping the continuous variables into several sets of buckets (tilings) that are offset from one another. Each training step then has a wider impact on the value function approximation, because the offset grids are summed and the information is spread across neighbouring states. [4]
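The difference between plain bucketing and tile coding can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the state bounds, the bin counts, and the function names `bucket`, `discretize`, and `active_tiles` are assumptions introduced here, and the evenly spaced offsets are only one possible choice.

```python
# Assumed state bounds; they are not stated in the text above.
POS_RANGE = (-1.2, 0.6)
VEL_RANGE = (-0.07, 0.07)


def bucket(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in 0..n_bins-1."""
    ratio = (value - low) / (high - low)
    index = int(ratio * n_bins)
    return min(max(index, 0), n_bins - 1)  # clamp values on or past the edges


def discretize(position, velocity, n_bins=10):
    """Plain bucketing: every state falls into exactly one grid cell."""
    return (bucket(position, POS_RANGE[0], POS_RANGE[1], n_bins),
            bucket(velocity, VEL_RANGE[0], VEL_RANGE[1], n_bins))


def active_tiles(position, velocity, n_tilings=4, n_bins=8):
    """Tile coding: one active tile per tiling, each tiling shifted
    by a different fraction of a tile width."""
    pos_width = (POS_RANGE[1] - POS_RANGE[0]) / n_bins
    vel_width = (VEL_RANGE[1] - VEL_RANGE[0]) / n_bins
    tiles = []
    for t in range(n_tilings):
        shift = t / n_tilings  # offset of this tiling, in tile widths
        p = int((position - POS_RANGE[0]) / pos_width + shift)
        v = int((velocity - VEL_RANGE[0]) / vel_width + shift)
        # Indices run 0..n_bins so that shifted grids still cover the range.
        p = min(max(p, 0), n_bins)
        v = min(max(v, 0), n_bins)
        tiles.append((t, p, v))
    return tiles
```

Two nearby states usually share some, but not all, of their active tiles, so an update to one of them also moves the estimate for the other. With plain bucketing the two states either share a cell completely or not at all.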

Function approximation

Function approximation is another way to solve the mountain car problem. By choosing a set of basis functions beforehand, or by generating them as the car drives, the agent can approximate the value function at each state. Unlike the stepwise value function produced by discretization, function approximation can more cleanly estimate the true smooth function of the mountain car domain. [5]
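As an illustration of the fixed-basis case, the sketch below approximates the value function as a weighted sum of Gaussian radial basis functions and adjusts the weights with a semi-gradient TD(0) step. The choice of radial basis functions, the state bounds, the grid of centres, the step size, and all function names are assumptions made for this example; the text above does not prescribe a particular basis or learning rule.

```python
import math

# Assumed state bounds; they are not stated in the text above.
POS_RANGE = (-1.2, 0.6)
VEL_RANGE = (-0.07, 0.07)


def make_centers(n_per_dim=5):
    """Place basis-function centres on a regular grid over the
    normalised state space [0, 1] x [0, 1]."""
    steps = [i / (n_per_dim - 1) for i in range(n_per_dim)]
    return [(p, v) for p in steps for v in steps]


def features(position, velocity, centers, width=0.25):
    """Gaussian radial basis features of a state."""
    p = (position - POS_RANGE[0]) / (POS_RANGE[1] - POS_RANGE[0])
    v = (velocity - VEL_RANGE[0]) / (VEL_RANGE[1] - VEL_RANGE[0])
    return [math.exp(-((p - cp) ** 2 + (v - cv) ** 2) / (2 * width ** 2))
            for cp, cv in centers]


def value(weights, phi):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, phi))


def td0_update(weights, phi, reward, phi_next,
               alpha=0.05, gamma=1.0, terminal=False):
    """One semi-gradient TD(0) step, modifying weights in place.
    Returns the TD error."""
    bootstrap = 0.0 if terminal else gamma * value(weights, phi_next)
    delta = reward + bootstrap - value(weights, phi)
    for i, f in enumerate(phi):
        weights[i] += alpha * delta * f
    return delta
```

Because every basis function is non-zero over a region of the state space, a single update changes the estimate smoothly for all nearby states rather than for one cell only.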

Eligibility traces

An interesting aspect of the problem is the delay of the actual reward. The agent cannot learn about the goal until it completes an episode successfully. With a naive approach, each trial backs up the reward of the goal only slightly. This is a problem for naive discretization because each discrete state is backed up only once, so a large number of episodes is needed to learn the task. Eligibility traces alleviate this by automatically backing up the reward to the states visited earlier, which dramatically speeds up learning. Eligibility traces can be viewed as a bridge from temporal difference learning methods to Monte Carlo methods. [6]
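The effect can be seen on a toy example. The sketch below runs tabular TD(λ) with replacing traces over one recorded episode. The five-state chain, the reward of one at the goal, the parameter values, and the function name `td_lambda_episode` are all assumptions chosen for illustration; they are not taken from the cited sources.

```python
def td_lambda_episode(values, trajectory, rewards,
                      alpha=0.5, gamma=1.0, lam=0.9):
    """Tabular TD(lambda) with replacing traces over one recorded episode.

    trajectory: visited state indices s_0 .. s_T, where s_T is terminal.
    rewards:    rewards[t] is received on the move from s_t to s_{t+1}.
    The list `values` is updated in place and returned.
    """
    traces = [0.0] * len(values)
    last = len(rewards) - 1
    for t, reward in enumerate(rewards):
        s, s_next = trajectory[t], trajectory[t + 1]
        bootstrap = 0.0 if t == last else gamma * values[s_next]
        delta = reward + bootstrap - values[s]
        # Decay every trace, then mark the current state as fully eligible.
        traces = [gamma * lam * e for e in traces]
        traces[s] = 1.0
        # Every recently visited state shares in the TD error.
        for i in range(len(values)):
            values[i] += alpha * delta * traces[i]
    return values


# One successful episode along a chain 0 -> 1 -> 2 -> 3 -> 4 (goal).
path = [0, 1, 2, 3, 4]
goal_rewards = [0.0, 0.0, 0.0, 1.0]
no_traces = td_lambda_episode([0.0] * 5, path, goal_rewards, lam=0.0)
with_traces = td_lambda_episode([0.0] * 5, path, goal_rewards, lam=0.9)
```

With `lam=0.0` only the state next to the goal changes after the first success, whereas with `lam=0.9` every state on the path receives part of the update in the same episode.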

Technical details

The mountain car problem has undergone many iterations. This section focuses on the standard, well-defined version from Sutton (2008). [7]

State variables

Two-dimensional continuous state space: position and velocity.

Actions

One-dimensional discrete action space: drive left, drive right, or do not use the engine.

Reward

For every time step on which the goal is not reached, the agent receives a negative reward.

Update function

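At every time step, the velocity is updated from the chosen action and the slope of the hill, and the position is then updated from the new velocity.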

Starting condition

The car starts in the valley. Optionally, many implementations randomize both the starting position and the starting velocity to show better generalized learning.

Termination condition

End the simulation when the car reaches the goal at the top of the rightmost hill.
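The elements of this section (state, actions, reward, update, start, and termination) can be sketched together in Python. The numeric constants below are not given in the text above. They are values commonly quoted for this domain and should be treated as assumptions, since bounds and goal position differ between descriptions. The function names `reset`, `step`, and `run_episode` are likewise hypothetical, and `reset` randomizes only the position, which is one possible choice.

```python
import math
import random

# Assumed constants; published descriptions differ in the details.
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE = 0.001
GRAVITY = 0.0025


def reset(rng=None):
    """Start in the valley at rest; pass an rng to randomise the position."""
    position = -0.5 if rng is None else rng.uniform(-0.6, -0.4)
    return position, 0.0


def step(position, velocity, action):
    """Advance one time step. action is -1 (left), 0 (coast) or 1 (right).
    Returns (position, velocity, reward, done)."""
    if action not in (-1, 0, 1):
        raise ValueError("action must be -1, 0 or 1")
    velocity += action * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    done = position >= GOAL_POS
    return position, velocity, -1.0, done


def run_episode(policy, max_steps=1000, rng=None):
    """Run policy(position, velocity) -> action until the goal or max_steps.
    Returns (steps_taken, reached_goal)."""
    position, velocity = reset(rng)
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity, _, done = step(position, velocity, action)
        if done:
            return t, True
    return max_steps, False


def full_throttle(position, velocity):
    """Always drive right."""
    return 1


def swing(position, velocity):
    """Hand-written heuristic: always push in the direction of motion."""
    return 1 if velocity >= 0 else -1
```

Under these assumed constants, `run_episode(full_throttle, max_steps=200)` never reaches the goal, while `run_episode(swing)` does, which illustrates why the car must first drive away from the goal to build momentum. Passing `rng=random.Random(0)` gives a reproducible randomized start.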

Variations

There are many versions of the mountain car problem that deviate in different ways from the standard model. Variations include, but are not limited to, changing the constants of the problem (gravity and steepness) so that tuning for a specific policy becomes irrelevant, and altering the reward function to change how the agent learns. Examples are setting the reward equal to the distance from the goal, or setting the reward to zero everywhere and one at the goal. A 3D mountain car with a 4D continuous state space has also been used. [8]
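The reward variants mentioned here can be written as interchangeable functions of the car's position. This is only a sketch: the goal position and the function names are assumptions, and the distance-based variant uses the negative distance so that being closer to the goal scores higher, which is one reading of the description above.

```python
GOAL_POS = 0.5  # assumed goal position; not stated in the text above


def step_cost_reward(position):
    """Standard form: a constant negative reward on every step."""
    return -1.0


def sparse_goal_reward(position):
    """Zero everywhere, one at the goal."""
    return 1.0 if position >= GOAL_POS else 0.0


def distance_reward(position):
    """Distance-based shaping: negative distance to the goal,
    so positions nearer the goal receive a higher reward."""
    return -abs(GOAL_POS - position)
```

Swapping one of these functions into an environment's step routine changes how much guidance the agent receives before its first success.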

References

  1. [Moore, 1990] Moore, A. Efficient Memory-Based Learning for Robot Control. PhD thesis, University of Cambridge, November 1990.
  2. [Singh and Sutton, 1996] Singh, S.P. and Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning 22(1/2/3):123-158.
  3. [Sutton and Barto, 1998] Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. A Bradford Book. The MIT Press, Cambridge, Massachusetts; London, England, 1998.
  4. "8.3.2 Tile Coding". Archived from the original on 28 April 2012. Retrieved 14 December 2011.
  5. "8.4 Control with Function Approximation". Archived from the original on 30 April 2012. Retrieved 14 December 2011.
  6. Sutton, Richard S.; Barto, Andrew G.; Bach, Francis (13 November 2018). "7. Eligibility Traces". Reinforcement Learning: An Introduction (Second ed.). A Bradford Book. ISBN 9780262039246.
  7. [Sutton, 2008] Sutton, Richard S. Mountain Car Software. http://www.cs.ualberta.ca/~sutton/MountainCar/MountainCar.html Archived 12 October 2009 at the Wayback Machine.
  8. "Mountain Car 3D (CPP) - RL-Library". Archived from the original on 26 April 2012. Retrieved 14 December 2011.
