Model-free (reinforcement learning)


In reinforcement learning (RL), a model-free algorithm (as opposed to a model-based one) is an algorithm which does not estimate the transition probability distribution (and the reward function) associated with the Markov decision process (MDP),[1] which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm.[1] Typical examples of model-free algorithms include Monte Carlo RL, Sarsa, and Q-learning.

In model-free reinforcement learning, Monte Carlo (MC) estimation is a central component of a large class of algorithms. MC learning is essentially a branch of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated by estimating its value function; based on that evaluation, a greedy search then outputs an improved policy. MC estimation is mainly applied to the first step, policy evaluation: the returns of all collected samples are averaged to judge the effectiveness of the current policy. As more experience is accumulated, the estimate converges to the true value by the law of large numbers. Hence, MC policy evaluation requires no prior knowledge of the environment dynamics; all it needs is experience, that is, samples of states, actions, and rewards generated by interacting with a real environment.[2]
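
The sketch below illustrates first-visit MC policy evaluation as described above: returns are averaged over collected episodes, with no access to a transition model. The episode format (lists of state, action, reward tuples) is an illustrative assumption, not taken from the cited source.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """First-visit Monte Carlo estimation of the state-value function.

    `episodes` is a list of trajectories, each a list of (state, action, reward)
    tuples collected by following the policy being evaluated. No knowledge of
    the transition model or reward function is used; values are simply averages
    of observed returns, which converge by the law of large numbers.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = reward + gamma * G
            # First-visit: record the return only at the first occurrence of the state.
            if all(episode[k][0] != state for k in range(t)):
                returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```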

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference learning. TD methods can learn from an incomplete sequence of events without waiting for the final outcome, approximating the future return as a function of the current state. Like MC, TD uses only experience to estimate the value function, without any prior knowledge of the environment dynamics. The advantage of TD lies in the fact that it can update the value function based on its current estimate. Therefore, TD learning algorithms can learn from incomplete episodes or continuing tasks in a step-by-step manner, while MC must be implemented in an episode-by-episode fashion.[2]
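
A tabular TD(0) sketch of this idea is shown below. The generic `env` interface (`reset()` and `step(action)` returning next state, reward, and a done flag) and the step size `alpha` are illustrative assumptions.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) estimation of the state-value function of `policy`.

    Unlike Monte Carlo, the update bootstraps from the current estimate of the
    next state's value, so learning proceeds step by step rather than waiting
    for the end of an episode.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: observed reward plus the bootstrapped value of the next state.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```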

Model-free reinforcement learning algorithms

Model-free reinforcement learning algorithms can start from a blank policy candidate and achieve superhuman performance in many complex tasks, including Atari games, StarCraft and Go. Deep neural networks are responsible for many recent artificial intelligence breakthroughs, and combining them with reinforcement learning has produced systems such as DeepMind's AlphaGo. Mainstream model-free RL algorithms include Deep Q-Network (DQN), Dueling DQN, Double DQN (DDQN), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Soft Actor-Critic (SAC), and Distributional Soft Actor-Critic (DSAC).[2] Some model-free algorithms, especially those based on deep learning, are listed in the following table.

Algorithm | Description | Model | Policy | Action Space | State Space | Operator
DQN | Deep Q Network | Model-free | Off-policy | Discrete | Typically Discrete or Continuous | Q-value
DDPG | Deep Deterministic Policy Gradient | Model-free | Off-policy | Continuous | Discrete or Continuous | Q-value
A3C | Asynchronous Advantage Actor-Critic | Model-free | On-policy | Continuous | Discrete or Continuous | Advantage
TRPO | Trust Region Policy Optimization | Model-free | On-policy | Continuous or Discrete | Discrete or Continuous | Advantage
PPO | Proximal Policy Optimization | Model-free | On-policy | Continuous or Discrete | Discrete or Continuous | Advantage
TD3 | Twin Delayed Deep Deterministic Policy Gradient | Model-free | Off-policy | Continuous | Continuous | Q-value
SAC | Soft Actor-Critic | Model-free | Off-policy | Continuous | Discrete or Continuous | Advantage
DSAC[3] | Distributional Soft Actor-Critic | Model-free | Off-policy | Continuous | Continuous | Value distribution
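
As a concrete illustration of one entry in the table above, the sketch below shows the core of a DQN-style update: an epsilon-greedy behaviour policy and a bootstrapped Q-value target computed with a separate target network over a minibatch drawn from a replay buffer. It assumes PyTorch and NumPy are available, uses illustrative function and variable names, and omits many practical details; it is not the original DQN implementation.

```python
import random

import numpy as np
import torch
import torch.nn as nn

def build_q_network(obs_dim, n_actions):
    """A small fully connected network mapping observations to one Q-value per action."""
    return nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
    )

def select_action(q_net, obs, epsilon, n_actions):
    """Epsilon-greedy behaviour policy over the network's Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
    return int(q_values.argmax().item())

def dqn_update(q_net, target_net, optimizer, replay, batch_size=64, gamma=0.99):
    """One gradient step on a minibatch of (state, action, reward, next_state, done) tuples."""
    states, actions, rewards, next_states, dones = zip(*random.sample(replay, batch_size))
    s = torch.as_tensor(np.array(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    done = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target computed with a separate, slowly updated target network.
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```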

Related Research Articles

The following outline is provided as an overview of and topical guide to statistics.

Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
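
For example, a plain Monte Carlo estimate of a definite integral needs nothing but repeated random sampling. The sketch below is illustrative; the integrand x^2 on [0, 1] (true value 1/3) is chosen only as an example.

```python
import random

def mc_integral(f, a, b, n_samples=100_000):
    """Estimate the integral of f over [a, b] by averaging f at uniform random points."""
    total = sum(f(random.uniform(a, b)) for _ in range(n_samples))
    return (b - a) * total / n_samples

# The estimate converges to 1/3 as n_samples grows.
print(mc_integral(lambda x: x * x, 0.0, 1.0))
```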

Reinforcement learning – Field of machine learning

Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution. Various algorithms exist for constructing chains, including the Metropolis–Hastings algorithm.
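
A minimal random-walk Metropolis–Hastings sampler is sketched below for an unnormalized one-dimensional target density; the standard-normal target and the proposal step size are illustrative assumptions.

```python
import math
import random

def metropolis_hastings(log_target, n_samples=10_000, step=1.0, x0=0.0):
    """Random-walk Metropolis-Hastings.

    Propose x' ~ N(x, step^2) and accept with probability
    min(1, target(x') / target(x)); the chain's equilibrium distribution is the
    (possibly unnormalized) target.
    """
    samples, x = [], x0
    log_p = log_target(x)
    for _ in range(n_samples):
        x_new = x + random.gauss(0.0, step)
        log_p_new = log_target(x_new)
        if random.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = x_new, log_p_new  # accept the proposal
        samples.append(x)  # on rejection, the current state is recorded again
    return samples

# Example: sample from a standard normal via its unnormalized log-density.
draws = metropolis_hastings(lambda x: -0.5 * x * x)
```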

Maven is an artificial intelligence Scrabble player, created by Brian Sheppard. It has been used in official licensed Hasbro Scrabble games.

In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many disciplines, including robotics, automatic control, economics and manufacturing. The name of MDPs comes from the Russian mathematician Andrey Markov as they are an extension of Markov chains.
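
An MDP can be written down directly as transition probabilities and rewards and then solved by dynamic programming. The sketch below runs value iteration on a hand-written two-state toy MDP; the example MDP itself is purely illustrative.

```python
# A tiny MDP given as explicit tables: P[s][a] is a list of (prob, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Dynamic-programming solution of the MDP: iterate the Bellman optimality backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

print(value_iteration(P))
```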

Temporal difference learning – Computer programming concept

Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods.

Q-learning – Model-free reinforcement learning algorithm

Q-learning is a model-free reinforcement learning algorithm that learns the value of an action in a particular state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.
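
A tabular Q-learning sketch is given below, assuming a generic environment with `reset()` and `step(action)` methods; the interface and hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: off-policy TD control.

    The target uses the max over next actions regardless of the action the
    behaviour policy actually takes next, which is what makes Q-learning off-policy.
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```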

Multi-armed bandit – Resource problem in machine learning

In probability theory and machine learning, the multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines, who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.
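
A minimal epsilon-greedy strategy for this problem makes the exploration–exploitation tradeoff explicit. The Bernoulli arms and parameter values below are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(arm_probs, n_rounds=10_000, epsilon=0.1):
    """Play Bernoulli arms using sample-average value estimates.

    With probability epsilon an arm is chosen at random (exploration); otherwise
    the arm with the highest estimated mean reward is pulled (exploitation).
    """
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if random.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

print(epsilon_greedy_bandit([0.2, 0.5, 0.7]))
```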

A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a belief over the possible states, computed from a sensor model (observation probabilities) together with the underlying MDP. Unlike the policy function in an MDP, which maps the underlying states to actions, a POMDP's policy is a mapping from the history of observations to actions.
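
The belief over states is updated with Bayes' rule after each action and observation. A sketch with illustrative table names (`P` for transitions, `O` for the sensor model) is shown below.

```python
def belief_update(belief, action, observation, P, O):
    """Bayesian belief update for a POMDP.

    belief[s]         : current probability of being in state s
    P[s][action][s2]  : probability of moving to s2 after taking `action` in s
    O[s2][action][o]  : probability of observing o in s2 after taking `action`

    The new belief is proportional to O(o | s2, a) * sum_s P(s2 | s, a) * b(s).
    """
    new_belief = {}
    for s2 in P:
        predicted = sum(belief[s] * P[s][action].get(s2, 0.0) for s in belief)
        new_belief[s2] = O[s2][action].get(observation, 0.0) * predicted
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the model")
    return {s: p / norm for s, p in new_belief.items()}
```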

A learning automaton is a type of machine learning algorithm studied since the 1970s. Learning automata select their current action based on past experiences from the environment. They fall within the scope of reinforcement learning when the environment is stochastic and a Markov decision process (MDP) is used.

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are derived via training.

In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree.
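
The selection step of MCTS commonly uses the UCB1 rule, which balances a node's average value against how rarely it has been visited. A sketch of that rule and the resulting child selection is given below; the node structure (dicts with `value_sum` and `visits`) is an illustrative assumption.

```python
import math

def ucb1_score(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1: exploitation term (average value) plus an exploration bonus."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    return child_value_sum / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits
    )

def select_child(children):
    """Pick the child with the highest UCB1 score.

    `children` is a list of dicts with keys 'value_sum' and 'visits'; the parent
    visit count is taken as the sum of the children's visits.
    """
    parent_visits = max(1, sum(ch["visits"] for ch in children))
    return max(
        children,
        key=lambda ch: ucb1_score(ch["value_sum"], ch["visits"], parent_visits),
    )
```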

Outline of machine learning – Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning.

Deep reinforcement learning – Machine learning that combines deep learning and reinforcement learning

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.

Multi-agent reinforcement learning – Sub-field of reinforcement learning

Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment. Each agent is motivated by its own rewards and takes actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex group dynamics.

MuZero – Game-playing artificial intelligence

MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules. Its release in 2019 included benchmarks of its performance in go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go, and improved on the state of the art in mastering a suite of 57 Atari games, a visually-complex domain.

Artificial intelligence agents sometimes misbehave due to faulty objective functions that fail to adequately encapsulate the programmers' intended goals. The misaligned objective function may look correct to the programmer, and may even perform well in a limited test environment, yet may still produce unanticipated and undesired results when deployed.

Proximal Policy Optimization – Model-free reinforcement learning algorithm

Proximal Policy Optimization (PPO) is an algorithm in the field of reinforcement learning that trains a computer agent's decision function to accomplish difficult tasks. PPO was developed by John Schulman in 2017, and has become the default reinforcement learning algorithm at the American artificial intelligence company OpenAI. PPO has enjoyed a wide variety of successes, such as controlling a robotic arm, beating professional players at Dota 2, and excelling in Atari games. Many experts have called PPO the state of the art because it seems to strike a balance between performance and comprehensibility. Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency.
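
The central idea of PPO is its clipped surrogate objective, which discourages the updated policy from moving too far from the policy that collected the data. A PyTorch-style sketch of that loss is given below; the tensor names are illustrative, and a full PPO implementation would add value and entropy terms around this core.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO.

    ratio = pi_new(a|s) / pi_old(a|s). Taking the minimum of the unclipped and
    clipped terms removes the incentive to push the ratio outside
    [1 - clip_eps, 1 + clip_eps] in the improving direction. The result is
    negated so it can be minimized with gradient descent.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```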

References

  1. Sutton, Richard S.; Barto, Andrew G. (November 13, 2018). Reinforcement Learning: An Introduction (2nd ed.). A Bradford Book. p. 552. ISBN 0262039249. Retrieved 18 February 2019.
  2. Li, Shengbo Eben (2023). Reinforcement Learning for Sequential Decision and Optimal Control (1st ed.). Singapore: Springer Verlag. pp. 1–460. doi:10.1007/978-981-19-7784-8. ISBN 978-9-811-97783-1. S2CID 257928563.
  3. Duan, J.; Guan, Y.; Li, S. (2021). "Distributional Soft Actor-Critic: Off-policy reinforcement learning for addressing value estimation errors". IEEE Transactions on Neural Networks and Learning Systems. 33 (11): 6584–6598. arXiv:2001.02811. doi:10.1109/TNNLS.2021.3082568. PMID 34101599. S2CID 211259373.