Q-learning

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations. [1]

For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. [2] Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. [2] "Q" refers to the function that the algorithm computes – the expected rewards for an action taken in a given state. [3]

Reinforcement learning

Reinforcement learning involves an agent, a set of states $S$, and a set $A$ of actions per state. By performing an action $a \in A$, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).

The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. [1]

As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train doors as soon as they open, minimizing your initial wait time. If the train is crowded, however, then you will have a slow entry after that initial action, as people fight past you to leave the train while you attempt to board. The total boarding time, or cost, is then:

0 seconds wait time + 15 seconds fight time

On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now:

5 seconds wait time + 0 seconds fight time

Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy.

Algorithm

[Figure: Q-learning table of states by actions, initialized to zero; each cell is updated through training.]

After $\Delta t$ steps into the future, the agent will decide some next step. The weight for this step is calculated as $\gamma^{\Delta t}$, where $\gamma$ (the discount factor) is a number between 0 and 1 ($0 \le \gamma \le 1$). Assuming $\gamma < 1$, it has the effect of valuing rewards received earlier more highly than those received later (reflecting the value of a "good start"). $\gamma$ may also be interpreted as the probability of succeeding (or surviving) at every step $\Delta t$.

The algorithm, therefore, has a function that calculates the quality of a state–action combination:

$$Q : S \times A \to \mathbb{R}.$$

Before learning begins, $Q$ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time $t$ the agent selects an action $a_t$, observes a reward $r_t$, enters a new state $s_{t+1}$ (that may depend on both the previous state $s_t$ and the selected action), and $Q$ is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: [4]

$$Q^{\mathrm{new}}(s_t, a_t) \leftarrow (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right)$$

where $r_t$ is the reward received when moving from the state $s_t$ to the state $s_{t+1}$, and $\alpha$ is the learning rate ($0 < \alpha \le 1$).

Note that $Q^{\mathrm{new}}(s_t, a_t)$ is the sum of three factors:

- $(1 - \alpha) \cdot Q(s_t, a_t)$: the current value, weighted by the learning rate;
- $\alpha \cdot r_t$: the reward obtained if action $a_t$ is taken in state $s_t$, weighted by the learning rate;
- $\alpha \cdot \gamma \cdot \max_{a} Q(s_{t+1}, a)$: the maximum reward that can be obtained from state $s_{t+1}$, weighted by the learning rate and discount factor.

An episode of the algorithm ends when state $s_{t+1}$ is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states $s_f$, $Q(s_f, a)$ is never updated, but is set to the reward value $r$ observed for state $s_f$. In most cases, $Q(s_f, a)$ can be taken to equal zero.
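
As a minimal illustration of the update rule above, the following Python sketch implements tabular Q-learning with an ε-greedy exploration policy (one common way to realize the "partly random policy" mentioned earlier). The environment interface (`reset`/`step`), the hyperparameter values, and the ε-greedy choice are illustrative assumptions, not part of the algorithm's definition.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), with integer state indices."""
    Q = np.zeros((n_states, n_actions))          # arbitrary fixed initial values
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: a partly random policy
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # weighted average of the old value and the new information;
            # for terminal states only the observed reward is used
            target = reward if done else reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```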

Influence of variables

Learning rate

The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of $\alpha_t = 1$ is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as $\alpha_t = 0.1$ for all $t$. [5]
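
For illustration, a constant learning rate and one possible decaying schedule might look as follows; the specific functional forms are assumptions, chosen only to satisfy the conditions described above.

```python
def constant_alpha(t):
    # common practical choice: a fixed step size for all t
    return 0.1

def decaying_alpha(visits):
    # one schedule that decreases to zero as a state-action pair is
    # visited more often, as the stochastic convergence conditions require
    return 1.0 / (1.0 + visits)
```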

Discount factor

The discount factor $\gamma$ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. $r_t$ (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For $\gamma = 1$, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. [6] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. [7] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. [8]

Initial conditions (Q0)

Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", [9] can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternatives, thus increasing their choice probability. The first reward $r$ can be used to reset the initial conditions. [10] According to this idea, the first time an action is taken, the reward is used to set the value of $Q$. This allows immediate learning in the case of fixed deterministic rewards. A model that incorporates reset of initial conditions (RIC) is expected to predict participants' behavior better than a model that assumes an arbitrary initial condition (AIC). [10] RIC seems to be consistent with human behaviour in repeated binary choice experiments. [10]
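
The sketch below illustrates the two initialization schemes discussed above; the table sizes, the optimistic constant, and the helper names are arbitrary illustrative choices, not prescribed by the cited works.

```python
import numpy as np

n_states, n_actions = 10, 4                      # illustrative sizes

# Optimistic initial conditions: start every action value high, so each
# observed reward pulls the tried action below the untried alternatives.
Q = np.full((n_states, n_actions), 10.0)         # 10.0 is an arbitrary optimistic value

# Reset of initial conditions (RIC): the first experience of (s, a)
# overwrites the initial value instead of being blended into it.
visited = np.zeros((n_states, n_actions), dtype=bool)

def ric_update(s, a, target, alpha=0.1):
    if not visited[s, a]:
        Q[s, a] = target                         # first reward sets the value directly
        visited[s, a] = True
    else:
        Q[s, a] += alpha * (target - Q[s, a])    # ordinary Q-learning update afterwards
```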

Implementation

Q-learning at its simplest stores data in tables. This approach falters with increasing numbers of states/actions since the likelihood of the agent visiting a particular state and performing a particular action is increasingly small.

Function approximation

Q-learning can be combined with function approximation. [11] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous.

One solution is to use an (adapted) artificial neural network as a function approximator. [12] Another possibility is to integrate Fuzzy Rule Interpolation (FRI) and use sparse fuzzy rule-bases [13] instead of discrete Q-tables or ANNs, which has the advantage of being a human-readable knowledge representation form. Function approximation may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
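
As a simple illustration, the sketch below uses linear function approximation (a semi-gradient Q-learning update on a weight vector), which is one of the simplest approximators; the feature map and hyperparameters are illustrative assumptions rather than the methods of the cited works.

```python
import numpy as np

def phi(state, action, n_actions):
    """Illustrative feature map: the raw state vector copied into the slot
    belonging to `action`, zeros elsewhere (one simple linear encoding)."""
    state = np.asarray(state, dtype=float)
    features = np.zeros(state.size * n_actions)
    features[action * state.size:(action + 1) * state.size] = state
    return features

def semi_gradient_q_step(w, s, a, r, s_next, n_actions,
                         alpha=0.01, gamma=0.9, done=False):
    """One semi-gradient Q-learning update on the linear weights w."""
    q_sa = w @ phi(s, a, n_actions)
    q_next = 0.0 if done else max(w @ phi(s_next, b, n_actions)
                                  for b in range(n_actions))
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a, n_actions)   # gradient of w·phi is phi
```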

Quantization

Another technique to decrease the state/action space quantizes possible values. Consider the example of learning to balance a stick on a finger. Describing the state at a certain point in time involves the position of the finger in space, its velocity, the angle of the stick, and the angular velocity of the stick. This yields a four-element vector that describes one state, i.e. a snapshot of one state encoded into four values. The problem is that infinitely many possible states are present. To shrink the space of possible states, multiple values can be assigned to a bucket. The exact distance of the finger from its starting position (from negative to positive infinity) is then not represented; all that matters is whether it is far away or not (Near, Far). [14]
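
A sketch of such a bucketing scheme for the stick-balancing example is shown below; the value ranges and the number of bins per dimension are illustrative assumptions. A table-based agent would then index its Q-table with the returned integer.

```python
import numpy as np

def discretize(observation, bins_per_dim=9):
    """Map a continuous 4-element state (finger position, finger velocity,
    stick angle, stick angular velocity) to a single Q-table row index.
    The bounds and bin count here are illustrative assumptions."""
    lows  = np.array([-2.4, -3.0, -0.21, -3.0])
    highs = np.array([ 2.4,  3.0,  0.21,  3.0])
    clipped = np.clip(np.asarray(observation, dtype=float), lows, highs)
    ratios = (clipped - lows) / (highs - lows)
    buckets = np.minimum((ratios * bins_per_dim).astype(int), bins_per_dim - 1)
    # combine per-dimension bucket indices into one discrete state index
    return int(np.ravel_multi_index(tuple(buckets), (bins_per_dim,) * 4))
```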

History

Q-learning was introduced by Chris Watkins in 1989. [15] A convergence proof was presented by Watkins and Peter Dayan in 1992. [16]

Watkins was addressing "Learning from delayed rewards", the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name "delayed reinforcement learning", was solved by Bozinovski's Crossbar Adaptive Array (CAA). [17] [18] The memory matrix was the same as the Q-table of Q-learning introduced eight years later. The architecture introduced the term "state evaluation" in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration:

- in state $s$ perform action $a$;
- receive the consequence state $s'$;
- compute the state evaluation $v(s')$;
- update the crossbar value: $w'(a, s) = w(a, s) + v(s')$.

The term “secondary reinforcement” is borrowed from animal learning theory, to model state values via backpropagation: the state value of the consequence situation is backpropagated to the previously encountered situations. CAA computes state values vertically and actions horizontally (the "crossbar"). Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral states), which were computed by the state evaluation function. This learning system was a forerunner of the Q-learning algorithm. [19]

In 2014, Google DeepMind patented [20] an application of Q-learning to deep learning, titled "deep reinforcement learning" or "deep Q-learning", that can play Atari 2600 games at expert human levels.

Variants

Deep Q-learning

The DeepMind system used a deep convolutional neural network, with layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy of the agent and the data distribution, and the correlations between Q and the target values. The method can be used for stochastic search in various domains and applications. [1] [21]

The technique used experience replay, a biologically inspired mechanism that uses a random sample of prior actions instead of the most recent action to proceed. [3] This removes correlations in the observation sequence and smooths changes in the data distribution. Iterative updates adjust Q towards target values that are only periodically updated, further reducing correlations with the target. [22]
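
A minimal sketch of a uniform experience-replay buffer follows; the capacity and batch size are illustrative assumptions. A learner would call `add` after every transition and train on minibatches drawn with `sample`.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions; sampling minibatches uniformly
    at random breaks correlations in the observation sequence."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # copy to a list so random.sample sees a plain sequence
        return random.sample(list(self.buffer), batch_size)
```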

Double Q-learning

Because the future maximum approximated action value in Q-learning is evaluated using the same Q function as the current action-selection policy, in noisy environments Q-learning can sometimes overestimate the action values, slowing learning. A variant called double Q-learning was proposed to correct this. Double Q-learning [23] is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.

In practice, two separate value functions, $Q^A$ and $Q^B$, are trained in a mutually symmetric fashion using separate experiences. The double Q-learning update step is then as follows:

$$Q^{A}_{t+1}(s_t, a_t) = Q^{A}_{t}(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^{B}_{t}\!\left(s_{t+1}, \arg\max_{a} Q^{A}_{t}(s_{t+1}, a)\right) - Q^{A}_{t}(s_t, a_t) \right),$$ and

$$Q^{B}_{t+1}(s_t, a_t) = Q^{B}_{t}(s_t, a_t) + \alpha_t(s_t, a_t) \left( r_t + \gamma \, Q^{A}_{t}\!\left(s_{t+1}, \arg\max_{a} Q^{B}_{t}(s_{t+1}, a)\right) - Q^{B}_{t}(s_t, a_t) \right).$$

Now the estimated value of the discounted future is evaluated using a different policy, which solves the overestimation issue.
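
A tabular sketch of this update is given below, assuming the common scheme in which a coin flip decides which of the two tables is updated at each step; the function name and hyperparameters are illustrative.

```python
import numpy as np

def double_q_step(Q_A, Q_B, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """One tabular double Q-learning update: the table being updated selects
    the next action, while the other table evaluates it."""
    if np.random.rand() < 0.5:
        Q_A, Q_B = Q_B, Q_A                           # swap roles half the time
    best_next = int(np.argmax(Q_A[s_next]))           # select with the updated table
    q_eval = 0.0 if done else Q_B[s_next, best_next]  # evaluate with the other table
    Q_A[s, a] += alpha * (r + gamma * q_eval - Q_A[s, a])
```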

This algorithm was later modified in 2015 and combined with deep learning, [24] as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm. [25]

Others

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm, with probably approximately correct (PAC) learning. [26]

Greedy GQ is a variant of Q-learning to use in combination with (linear) function approximation. [27] The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

Distributional Q-learning is a variant of Q-learning which seeks to model the distribution of returns rather than the expected return of each action. It has been observed to facilitate estimation by deep neural networks and can enable alternative control methods, such as risk-sensitive control. [28]

Multi-agent learning

Q-learning has been proposed in the multi-agent setting (see Section 4.1.2 in [29]). One approach consists of treating the environment as passive. [30] Littman proposes the minimax Q-learning algorithm. [31]

Limitations

The standard Q-learning algorithm (using a table) applies only to discrete action and state spaces. Discretization of these values leads to inefficient learning, largely due to the curse of dimensionality. However, there are adaptations of Q-learning that attempt to solve this problem such as Wire-fitted Neural Network Q-Learning. [32]


References

  1. Li, Shengbo (2023). Reinforcement Learning for Sequential Decision and Optimal Control (First ed.). Springer Verlag, Singapore. pp. 1–460. doi:10.1007/978-981-19-7784-8. ISBN 978-9-811-97783-1. S2CID 257928563.
  2. Melo, Francisco S. "Convergence of Q-learning: a simple proof" (PDF).
  3. Matiisen, Tambet (December 19, 2015). "Demystifying Deep Reinforcement Learning". neuro.cs.ut.ee. Computational Neuroscience Lab. Retrieved 2018-04-06.
  4. Dietterich, Thomas G. (21 May 1999). "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition". arXiv: cs/9905014 .
  5. Sutton, Richard; Barto, Andrew (1998). Reinforcement Learning: An Introduction. MIT Press.
  6. Russell, Stuart J.; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. p. 649. ISBN   978-0136042594.
  7. Baird, Leemon (1995). "Residual algorithms: Reinforcement learning with function approximation" (PDF). ICML: 30–37.
  8. François-Lavet, Vincent; Fonteneau, Raphael; Ernst, Damien (2015-12-07). "How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies". arXiv: 1512.02011 [cs.LG].
  9. Sutton, Richard S.; Barto, Andrew G. "2.7 Optimistic Initial Values". Reinforcement Learning: An Introduction. Archived from the original on 2013-09-08. Retrieved 2013-07-18.
  10. Shteingart, Hanan; Neiman, Tal; Loewenstein, Yonatan (May 2013). "The role of first impression in operant learning" (PDF). Journal of Experimental Psychology: General. 142 (2): 476–488. doi:10.1037/a0029550. ISSN 1939-2222. PMID 22924882.
  11. Hasselt, Hado van (5 March 2012). "Reinforcement Learning in Continuous State and Action Spaces". In Wiering, Marco; Otterlo, Martijn van (eds.). Reinforcement Learning: State-of-the-Art. Springer Science & Business Media. pp. 207–251. ISBN   978-3-642-27645-3.
  12. Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58–68. doi: 10.1145/203330.203343 . S2CID   8763243 . Retrieved 2010-02-08.
  13. Vincze, David (2017). "Fuzzy rule interpolation and reinforcement learning" (PDF). 2017 IEEE 15th International Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE. pp. 173–178. doi:10.1109/SAMI.2017.7880298. ISBN   978-1-5090-5655-2. S2CID   17590120.
  14. Krishnan, Srivatsan; Lam, Maximilian; Chitlangia, Sharad; Wan, Zishen; Barth-Maron, Gabriel; Faust, Aleksandra; Reddi, Vijay Janapa (13 November 2022). "QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning". arXiv: 1910.01055 [cs.LG].
  15. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards (PDF) (Ph.D. thesis). University of Cambridge. EThOS   uk.bl.ethos.330022.
  16. Watkins, Chris; Dayan, Peter (1992). "Q-learning". Machine Learning. 8 (3–4): 279–292. doi: 10.1007/BF00992698 . hdl: 21.11116/0000-0002-D738-D .
  17. Bozinovski, S. (15 July 1999). "Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem". In Dobnikar, Andrej; Steele, Nigel C.; Pearson, David W.; Albrecht, Rudolf F. (eds.). Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999. Springer Science & Business Media. pp. 320–325. ISBN   978-3-211-83364-3.
  18. Bozinovski, S. (1982). "A self learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North Holland. pp. 397–402. ISBN   978-0-444-86488-8.
  19. Barto, A. (24 February 1997). "Reinforcement learning". In Omidvar, Omid; Elliott, David L. (eds.). Neural Systems for Control. Elsevier. ISBN   978-0-08-053739-9.
  20. "Methods and Apparatus for Reinforcement Learning, US Patent #20150100530A1" (PDF). US Patent Office. 9 April 2015. Retrieved 28 July 2018.
  21. Matzliach B.; Ben-Gal I.; Kagan E. (2022). "Detection of Static and Mobile Targets by an Autonomous Agent with Deep Q-Learning Abilities" (PDF). Entropy. 24 (8): 1168. Bibcode:2022Entrp..24.1168M. doi: 10.3390/e24081168 . PMC   9407070 . PMID   36010832.
  22. Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K. (Feb 2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. ISSN   0028-0836. PMID   25719670. S2CID   205242740.
  23. van Hasselt, Hado (2011). "Double Q-learning" (PDF). Advances in Neural Information Processing Systems. 23: 2613–2622.
  24. van Hasselt, Hado; Guez, Arthur; Silver, David (8 December 2015). "Deep Reinforcement Learning with Double Q-learning". arXiv: 1509.06461 [cs.LG].
  25. van Hasselt, Hado; Guez, Arthur; Silver, David (2015). "Deep reinforcement learning with double Q-learning" (PDF). AAAI Conference on Artificial Intelligence: 2094–2100. arXiv: 1509.06461 .
  26. Strehl, Alexander L.; Li, Lihong; Wiewiora, Eric; Langford, John; Littman, Michael L. (2006). "Pac model-free reinforcement learning" (PDF). Proc. 22nd ICML: 881–888.
  27. Maei, Hamid; Szepesvári, Csaba; Bhatnagar, Shalabh; Sutton, Richard (2010). "Toward off-policy learning control with function approximation in Proceedings of the 27th International Conference on Machine Learning" (PDF). pp. 719–726. Archived from the original (PDF) on 2012-09-08. Retrieved 2016-01-25.
  28. Hessel, Matteo; Modayil, Joseph; van Hasselt, Hado; Schaul, Tom; Ostrovski, Georg; Dabney, Will; Horgan, Dan; Piot, Bilal; Azar, Mohammad; Silver, David (February 2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning". Proceedings of the AAAI Conference on Artificial Intelligence. 32. arXiv: 1710.02298 . doi:10.1609/aaai.v32i1.11796. S2CID   19135734.
  29. Shoham, Yoav; Powers, Rob; Grenager, Trond (1 May 2007). "If multi-agent learning is the answer, what is the question?". Artificial Intelligence. 171 (7): 365–377. doi:10.1016/j.artint.2006.02.006. ISSN   0004-3702 . Retrieved 4 April 2023.
  30. Sen, Sandip; Sekaran, Mahendra; Hale, John (1 August 1994). "Learning to coordinate without sharing information". Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence. AAAI Press: 426–431. Retrieved 4 April 2023.
  31. Littman, Michael L. (10 July 1994). "Markov games as a framework for multi-agent reinforcement learning". Proceedings of the Eleventh International Conference on International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.: 157–163. ISBN   9781558603356 . Retrieved 4 April 2023.
  32. Gaskett, Chris; Wettergreen, David; Zelinsky, Alexander (1999). "Q-Learning in Continuous State and Action Spaces" (PDF).