# human-level control through deep reinforcement learning

## Contents

## Introduction

Reinforcement learning is "the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments" <ref>Dayan, Peter and Watkins, Christopher. "Reinforcement Learning." Encyclopedia of Cognitive Science.</ref>. Attempting to teach a dog a new trick is an example of this type of learning in animals, as the dog will (hopefully) learn that when it does the desired trick, it obtains a reward. The process is similar in artificial systems. For a given task, the algorithm for reinforcement learning selects an action that maximizes the expected cumulative future reward at every step in time, given the current state of the process. Then, it will carry out this action, observe the reward, and update the model to incorporate information about the reward generated by the action. The natural connection to neuroscience and animal behaviour makes reinforcement learning an attractive machine learning paradigm.

When creating artificial systems based on machine learning, however, we run into the problem of finding efficient representations from high-dimensional inputs. Furthermore, these representations need to use high-dimensional information from past experiences and generalize to new situations. The human brain is adept at this type of learning, using systems involving dopamine in the neurons with a similar structure to reinforcement learning algorithms <ref> Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997) </ref>. Unfortunately, previous models have only been able replicate the success of humans on fully-observable, low-dimensional problems, such as backgammon <ref> Tesauro, Gerald. TD-Gammon, a self-teaching backgammon program, achieves master's play. AAAI Techinical Report (1993)</ref>.

In the paper that is summarized below <ref name = "main">Mnih, Volodymyr et. al. "Human-level control through deep reinforcement learning." Nature 518, 529-533 (2015) </ref>, a new structure called a deep Q-network (DQN) is proposed to handle high-dimensional inputs directly. The task that the DQN model is tested on is playing Atari 2600 games, where the only input data are the pixels displayed on the screen and the score of the game. It turns out that the DQN produces state-of-the-art results in this particular task, as we will see.

## Problem Description

The goal of this research is to create a framework that would be able to excel at a variety of challenging learning tasks - a common goal in artificial intelligence. To that end, the DQN network is proposed, which combines reinforcement learning with deep neural networks <ref name = "main"></ref>. In particular, a deep convolutional network will be used to approximate the so-called optimal action-value function

- [math]Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots | s_t=s, a_t=a, \pi\right][/math]

which is the maximum cumulative sum of rewards [math]r_t\,[/math] discounted by [math]\,\gamma[/math] at each timestep [math]t\,[/math]. Here, [math]\,\gamma \in [0,1] [/math] denotes the discout factor that wighs the importance of more distant rewards. Setting [math]\,\gamma[/math] to zero results in a single-step RL problem in which only current reward is important. Setting [math]\,\gamma[/math] to one results in a system in which cumulative reward needs to be optimized. .This sum can be achieved from a policy [math]\pi = P\left(a|s\right)[/math] after making an observation [math]\,s[/math] and taking an action [math]\,a[/math] <ref name = "main"></ref>.

During learning, we apply Q-learning updates, on samples (or minibatches) of experience [math](s,a,r,s')\sim U(D)[/math], drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration *I* uses the following loss function:

[math] L_i(\theta_i) = \mathbb E_{(s,a,r,s')\sim U(D)}[(r+\gamma \max_{a'} Q(s',a';\theta_i^-)-Q(s,a;\theta_i))^2] [/math]

### Instability of Neural Networks as Function Estimate

Unfortunately, current methods which use deep networks to estimate [math]Q\,[/math]suffer from instability or divergence for the following reasons:

- Correlation within sequence of observations
- Small updates to [math]Q\,[/math]can significantly change the policy, and thus the data distribution
- The action values [math]Q\,[/math]are correlated with the target values [math]\,y = r_t + \gamma \max_{a'}Q(s', a')[/math]

### Overcoming Instability

One method of overcoming this instability is through the use of a biologically-inspired mechanism called experience replay. This involves storing [math]e_t = \left(s_t, a_t, r_t, s_{t+1}\right)[/math] - known as the "experiences" - at each time step in a dataset [math]D_t = \left(e_1, e_2, \ldots, e_t\right)[/math]. Then, during each learning update, minibatch updates are performed, where points are sampled uniformly from [math]\,D_t[/math]. In practice, only [math]N[/math] experiences are stored, where [math]N[/math] is some large, finite number (e.g. [math]N=10^6[/math]). Incorporating experience replay into the algorithm removes the correlation between observations and smooths out the data distribution, since it takes an average over the past [math]N[/math] states. This makes it much more unlikely that instability or divergence will occur.

This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. Or in other words, the updates to the network parameters and the selection of data samples used to calculate these updates are not independent processes. This can lead to unwanted feedback loops in the sample-update cycle that force the network to converge on a poor local optimum of the loss function. The use of experience replay solves this problem essentially by introducing a greater degree of independence between the choice of training samples and the current network parameters at a particular iteration in the learning procedure.

Another method used to combat instability is to use a separate network for generating the targets [math]y_i[/math] as opposed to the same network. This is implemented by cloning the network [math]\,Q[/math] every [math]\,C[/math] iterations, and using this static, cloned network to generate the next [math]\,C[/math] target values. Thus, the target network parameters are only updated with the Q-network parameters every [math]\,C[/math] steps and are held fixed between individual updates. This additional modification reduces correlation between the target values and the action values, providing more stability to the network. In this paper, [math]\,C = 10^4[/math].

They also found that clipping the error term to be between -1 and 1 can further improve the stability of the algorithm. This clipping is equivalent to constraining the magnitude of the gradient used to update the network parameters on each iteration of the algorithm described below, since the error term in question comprises part of the derivative of the loss function with respect to the parameters. With such a constraint on the magnitude of the gradient, each parameter update can only have a minimal effect on the network's approximation of the optimal action-value function. Hence, it is not surprising that this clipping technique improves the stability of the algorithm.

### Data & Preprocessing

The data used for this experiment is initially frame-by-frame pixel data from the Atari 2600 emulator, along with the game score (and the number of lives in applicable games). The frames are 210 x 160 images with colour, so some preprocessing is performed to simplify the data. The first step to encode a single frame is to take the maximum value for each pixel colour value over the frame being encoded and the previous frame <ref name = "main"></ref>. This removes flickering between frames, as sometimes images are only shown on every even or odd frame. Then, the image is converted to greyscale and downsampled to 84 x 84. This process is applied to the [math]m[/math] most recent frames (here [math]m=4[/math]), and these 84 x 84 x 4 images are the inputs to the network. To be more specific, the network takes the last four frames as input and outputs the action value of each action.

## Model Architecture

The framework for building an agent that can learn from environmental inputs is an iterative reward-based system. The agent will perform actions and constantly re-assess how its actions have affected its environment. To model this type of behaviour, the agent attempts to build up a model that relates actions to rewards over time. The underlying model relating actions to rewards, [math]Q[/math], is estimated by a deep convolutional network, and is updated at every step in time.

The structure of the network itself is as follows. There are separate output units for each possible action, and the only input to the network is the state representation. The outputs are the predicted Q-values for each action performed on the input state. The first hidden layer convolves 32 8x8 filters with stride 4, then applies a rectified nonlinear function <ref>Jarrett K. et. al. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146-2153 (2009)</ref>. In the next hidden layer, 64 4x4 filters with stride 2 are convolved and again followed by rectified non-linearity. The next layer is the final convolutional layer, with 64 3x3 filters of stride 1, followed by the rectifier. The final hidden layer in the network is fully-connected, with 512 rectifying units. The output layer is a fully-connected linear layer with a single output for each valid action, of which there ranged from 4 to 18 in any particular game<ref name = "main"></ref>.

## Training

### Framework and Additional Setup Details

Forty-nine Atari games were considered as experiments. A unique DQN is trained for each game, but the same structure, algorithm, and global parameters (e.g. [math]C[/math] or [math]m[/math] as above, among others) were used throughout. The value of the global parameters was selected by performing an informal search on a small subset of the 49 games. The goal is to use minimal prior knowledge and perform end-to-end training of these models based on game experience.

The reward structure for games was slightly changed, clipping negative rewards at -1 and positive rewards at 1, since score varies from game to game. This could be problematic since the agent may not properly prioritize higher-scoring actions, but it also helps stabilize the network and allows it to generalize to more games. For example, the same learning rate can be used across all games. However, the game itself is not otherwise changed.

There is also a frame-skipping technique employed, in which the agent only performs an action every [math]k^{th}[/math] frame to allow the agent to play [math]k[/math] times more games, as the network does not have to be trained on the skipped frames ([math]k=4[/math] here). Furthermore, I believe this creates a more realistic experience for the agent, as human players would not be able to change their own actions every single frame.

The agents are trained on 50 million frames of game play, which is about 38 days. The experience replay memory used contains the previous 1 million frames of game play. The RMSProp <ref>Hinton, Geoffrey et al.Overview of Minibatch Gradient Descent. University of Toronto.</ref> algorithm, which performs stochastic gradient descent in small batches, is used to train the network.

### Algorithm Background

#### Markovian Decision Process

A Markovian Decision Process (MDP) is described by:

- A set of states [math]S[/math]
- A set of actions [math]A[/math]
- Stochastic transition [math]p(s, a, s')[/math], describing the stochastic system
- Cost function [math]c: S X A \mapsto R[/math]
- Policy [math]\pi^{*}: S \mapsto A [/math]

The objective is to find the optimal policy [math]\pi[/math] that minimizes the cost function [math]c[/math].

An example graph of an MDP is shown below between the set of states {LAZY,WORKING,FEEL GOOD} with several example actions as edges between the vertices. Multiple actions transitioning between states are possible as well.

#### Q Learning

Q learning is a reinforcement learning technique that is used to find an optimal policy (sequence of actions) for any Markov Decision Process (MDP). The Q learning algorithm at its core computes the best sequence of future actions based on previous states. It is an incremental Dynamic Programming procedure where all possible states have to be computed and stored in a table to find an optimal policy.

[math]Q_{k+1}(s, a) := (1 - \alpha) Q(s, a) + \alpha(c(s, a) + \gamma \min_{b} Q_{k}(S^{'}, b))[/math]

Where:

- [math]s[/math]: state where the transition starts
- [math]a[/math]: action applied
- [math]s'[/math]: resulting state
- [math]\alpha[/math]: learning parameter
- [math]\gamma[/math]: discount factor (between 0 (short-term) and 1 (long-term))

Typically Q-learning is performed on-line. Riedmiller (2005) <ref>Riedmiller, Martin. "Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method." Machine Learning: ECML 2005. Springer Berlin Heidelberg, 2005. 317-328.</ref> argues that in principle while Q-learning can be directly implemented in a neural network, the problem with online approach is it creates unstable results. Intuitively this makes sense because if Q-learning were to be used for controls, 1 sample point is not indicative whether the action applied is optimal or not. For reasons listed above Mnih et al (2015) <ref>Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.</ref> devised a Q-learning updates on minibatches of experience. In real world the Markov assumption is often violated and we need a modification for Q-learning. One approach for capturing dynamics of many real world environments is using Partially Observable Markov Decision Process (POMDP)<ref>Hausknecht M and Stone P. "Deep Recurrent Q-Learning for Partially Observable MDPs" Nature 518.7540 (2015): 529-533.</ref>.

#### Q-Iteration Learning

At each step in time, the agent selects an action [math]a_t\,[/math] from the set of legal game actions [math]\mathbb{A}[/math]. The agent observes an image [math]x_t \in \mathbb{R}^d[/math] from the emulator, along with a reward [math]r_t\,[/math]. It is impossible to fully understand the current game situation from a single screen, so a sequence of actions and observations [math]s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t[/math] is the input state.

Recall that if we define [math]R_t = \sum_{t'=t}^T \gamma^{t'-t}r_t[/math], where [math]\gamma\,[/math] is the discount factor and [math]\,T[/math] is the step in which the game terminates, then

- [math]Q^*\left(s,a\right) = \max_\pi \mathop{\mathbb{E}}\left[R_t| s_t=s, a_t=a, \pi\right][/math]

is the optimal action-value function.

#### The Bellman Equation in the Loss Framework

The optimal action-value function obeys the Bellman Equation:

- [math]Q^*\left(s,a\right) = \mathop{\mathbb{E}_{s'}}\left[r + \gamma \max_{a'}Q^*\left(s',a'\right) | s, a\right] [/math]

The intuition behind this identity is as follows: if the optimal value [math]Q^*(s',a')\,[/math] at the next time step was known for all possible actions [math]a'[/math], then the optimal strategy is to select the action [math]a'[/math] maximizing the expected value above <ref name = "main"> </ref>. Using the Bellman Equation as an iterative update formula is impractical, however, since the action-value function is estimated separately for each sequence and cannot generalize.

It is necessary, in practice, to operate with an approximation of the action-value function. When a neural network with weights [math]\,\theta[/math] is used, it is referred to as a Q-Network. A Q-Network is trained by adjusting [math]\,\theta_t[/math] to reduce the mean-squared error in the Bellman Equation. The new target values for training are given by [math]y = r + \gamma\max_{a'} Q\left(s', a'; \theta_t^-\right)[/math], where [math]\theta_t^-\,[/math] are the parameters from some previous iteration.

### The Full Algorithm

Now, with all the background behind us, the full Deep Q-Learning with Experience Replay algorithm is presented below:

Some notes about the algorithm:

- Replay memory is used to implement the experience replay technique described above, but only the last N experience tuples are stored in the replay memory, and sampled uniformly at random from D when performing updates. They believe that a more sophisticated sampling strategy might emphasize transitions from which we can learn the most, instead of giving equal importance to all transitions in the replay memory.
- An episode is one try in a game
- Correlations between target values and the action function [math]Q\,[/math]are mitigated by using [math]\hat{Q}[/math] for the target values, where [math]\hat{Q}[/math] is the target action-value function, which updated only once every [math]\,C[/math] steps.
- The gradient used in the algorithm is defined as [math] \nabla_{\theta_i} L_i(\theta_i) = \mathbb E_{(s,a,r,s')}[(r+\gamma \max_{a'} Q(s',a';\theta_i^-)-Q(s,a;\theta_i)) \nabla_{\theta_i} Q(s,a;\theta_i)] [/math], adjusted to accommodate the use of stochastic gradient descent (meaning that the full expectations are not computed).

## Results

### Evaluation Procedure

The trained networks played each game 30 times, up to 5 minutes at a time. The random agent, which is the baseline comparison, chooses a random action every 6 frames (10 Hz). The human player uses the same emulator as the agents, and played under controlled conditions (most notably without sound). The human performance is the average reward from around 20 episodes of the game lasting up to 5 minutes, after 2 hours of practice playing each game. The human performance is set to be 100%, and the random agent has performance set to 0%.

### Raw Score Results

The DQN agent outperforms the best existing reinforcement learning methods on 43 of the games without incorporating prior knowledge about Atari games. Furthermore, the agent scores at least 75% of the human score on more than half of the games. Also, DQN performs well across a number of types of games. However, games which involve extended planning strategies still pose a major problem to DQN (e.g. Montezuma's Revenge). These results are visualized in the figure below:

### t-SNE

The researchers also explored the representation learned by DQN that underpinned the most successful agent in the context of the game Space Invaders by using a technique developed for visualizing high-dimensional data called 't-SNE'. As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. It was also discovered that t-SNE created similar embeddings for DQN representations that were close in terms of expected reward but perceptually dissimilar. This is consistent with the notion that the network can learn representations that support adaptive behaviour from high-dimensional sensory inputs. See the figure below for a depiction of the t-SNE.

### Results with model components removed

Two important advances in this paper were presented: experience replay and creating a separate network to evaluate the targets. To visualize the impact of these advances, the network was trained with and without both of these concepts, and evaluated on its performance in each case. The results are shown in the table below, in percentage form as above:

Game | With Replay and Target Q | With Replay, Without Target Q | Without Replay, With Target Q | Without Replay, Without Target Q |
---|---|---|---|---|

Breakout | 316.8 | 240.7 | 10.2 | 3.2 |

Enduro | 1006.3 | 831.4 | 141.9 | 29.1 |

River Raid | 7446.6 | 4102.8 | 2867.7 | 1453.0 |

Seaquest | 2894.4 | 822.6 | 1003.0 | 275.8 |

Space Invaders | 1088.9 | 826.3 | 373.2 | 302.0 |

Clearly, experience replay and maintaining a secondary network for computing target values are important. From these results, it seems that experience replay is more important on its own, except in Seaquest.

## Conclusion

The framework presented has demonstrated the ability to learn how to play Atari games, given minimal prior knowledge of the game and very basic inputs. Using reinforcement learning with the Q-network architecture was more effective than previous similar attempts, since experience replay and a separate target network were utilized in training. These two modifications removed correlations between sequential inputs, which improved stability in the network. Future work should be undertaken to improve the experience replay algorithm: instead of sampling uniformly from the replay memory, the sampling should be biased towards high-reward events. However, this may add a layer of instability to the network, but it is certainly worth investigating.

## Discussion

- Using simulators for reinforcement learning is very similar to Evolutionary Robotics, where instead of using Q-learning variants to learn neural networks, evolutionary algorithms are used to evolve neural network topology and weights <ref>Nolfi, Stefano, and Dario Floreano. Evolutionary robotics: The biology, intelligence, and technology of self-organizing machines. MIT press, 2000.</ref>. MarI/O for example by Seth Bling, used an NeuroEvolutionary algorithm to evolve a Neural Network to play a level in Mario video game, although the Neural Network trained is very different, the experimental evaluation is very similar to the one described by Mnih et al (2015).

- The problem of using simulators for solving control problems is that they do not translate well in the real world engineering <ref>Sofge, D. A., et al. "Challenges and opportunities of evolutionary robotics." arXiv preprint arXiv:0706.0457 (2007).</ref>, additionally Q-iteration makes the assumption that the state of the environment is fully observable.

- A lot of the games that the system performed poorly on are adventure games that require semantic understanding of what's being seen and the ability to do induction. For example, the complex path finding and puzzle solving of Pitfall. Obviously, Q-learning is incapable of accomplishing this alone and must be embedded in a cognitive system.

Generally, the popular Q-learning algorithm is known to overestimate action values under certain conditions. The introduced setting is a best case scenario for Q-learning in some cases, because the deep neural network results in flexible function approximation which has a low asymptotic approximation error. Plus, the determinism of the environments prevents the harmful effects of noise. Surprisingly, it has been shown that even in this setting DQN sometimes overestimates the values of the actions. Double Q-learning can provide a solution to this issue. Double DQN not only yields more accurate value estimates, but results in much higher scores on several games. This demonstrates that the overestimations of DQN were indeed leading to poorer policies and that double DQN is beneficial to reduce them.<ref> Hasselt, H. V., et al. [http://arxiv.org/pdf/1509.06461.pdf " Deep reinforcement learning with double Q_learning." arXiv preprint arXiv:1509.06461v1 (2015).</ref>

## Bibliography

<references />