MuZero

MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules. [1] [2] [3] Its release in 2019 included benchmarks of its performance in Go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go (setting a new world record), and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain.

MuZero was trained via self-play, with no access to rules, opening books, or endgame tablebases. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20 percent fewer computation steps per node in the search tree. [4]
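
The training procedure can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering (the function and tensor names are illustrative assumptions, not DeepMind's released pseudocode) of the core idea: each position sampled from self-play is unrolled for several steps through the learned model, which is trained to match the observed rewards, the policies produced by the search, and the bootstrapped values at each step.

    import torch
    import torch.nn.functional as F

    def unrolled_loss(model, observation, actions, target_rewards, target_policies, target_values):
        # Illustrative K-step unrolled training loss. Assumes a model exposing the three
        # MuZero functions: representation (observation -> hidden state),
        # dynamics (hidden state, action -> next hidden state, reward), and
        # prediction (hidden state -> policy logits, value).
        state = model.representation(observation)
        loss = torch.tensor(0.0)
        for k, action in enumerate(actions):
            policy_logits, value = model.prediction(state)
            loss = loss + F.cross_entropy(policy_logits, target_policies[k])  # match the MCTS visit-count distribution
            loss = loss + F.mse_loss(value, target_values[k])                 # match the n-step return
            state, reward = model.dynamics(state, action)
            loss = loss + F.mse_loss(reward, target_rewards[k])               # match the observed reward
        return loss

In the published system the value and reward targets are represented as categorical distributions rather than scalars, but the structure of the unrolled loss is the same.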

History

"MuZero really is discovering for itself how to build a model and understand it just from first principles."

David Silver, DeepMind, Wired [5]

On November 19, 2019, the DeepMind team released a preprint introducing MuZero. The accompanying paper was published in Nature in December 2020. [4]

Derivation from AlphaZero

MuZero (MZ) is a combination of the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visual video games.

MuZero was derived directly from AZ code, sharing its rules for setting hyperparameters. Differences between the approaches include: [6]

AZ's planning process uses a simulator that knows the exact rules of the game; MZ has no access to the rules and instead learns a model of the environment dynamics with its neural networks, planning over the hidden states that this model produces.

AZ's networks predict a policy and a value for a given board position; MZ's networks additionally predict the immediate reward of each transition, since in general domains such as Atari, rewards arrive throughout play rather than only at the end of a game.

AZ has a single model of the game, mapping a board position to predictions; MZ splits its model into a representation function, a dynamics function, and a prediction function, applied repeatedly during search (see the sketch below).
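
A minimal sketch of the learned model may help make this split concrete. The class below is a toy, fully connected stand-in (the class name, method names, and layer choices are illustrative assumptions, not the published architecture, which uses the convolutional and residual blocks described above); it exposes the three functions that MZ applies during search.

    import torch
    import torch.nn as nn

    class MuZeroNetworks(nn.Module):
        """Toy stand-in for MuZero's three learned functions (illustrative only)."""

        def __init__(self, obs_dim, action_count, hidden_dim=256):
            super().__init__()
            self._represent = nn.Linear(obs_dim, hidden_dim)
            self._dynamics = nn.Linear(hidden_dim + action_count, hidden_dim + 1)
            self._predict = nn.Linear(hidden_dim, action_count + 1)
            self.action_count = action_count

        def representation(self, observation):
            # h: encode the raw observation into an initial hidden state
            return self._represent(observation)

        def dynamics(self, state, action):
            # g: advance the hidden state given an action and predict the immediate reward
            one_hot = nn.functional.one_hot(action, self.action_count).float()
            out = self._dynamics(torch.cat([state, one_hot], dim=-1))
            return out[..., :-1], out[..., -1]

        def prediction(self, state):
            # f: predict a policy prior over actions and a value for the hidden state
            out = self._predict(state)
            return out[..., :-1], out[..., -1]

During search, the representation function is applied once to the current observation; planning then proceeds entirely through repeated calls to the dynamics and prediction functions, with no further access to the environment or its rules.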

Comparison with R2D2

The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN. [7]

MuZero surpassed both R2D2's mean and median performance across the suite of games, though it did not do better in every game.

Training and results

For board games, MuZero used 16 third-generation tensor processing units (TPUs) for training and 1,000 TPUs for self-play, with 800 simulations per step; for Atari games, it used 8 TPUs for training and 32 TPUs for self-play, with 50 simulations per step.

AlphaZero used 64 second-generation TPUs for training and 5,000 first-generation TPUs for self-play. As TPU design has improved (third-generation chips are individually twice as powerful as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are comparable training setups.

R2D2 was trained for 5 days through 2 million training steps.

Initial results

MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in Go after 500,000 training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500,000 training steps and surpassed it by 1 million steps, though it never performed well on six games in the suite.

MuZero was viewed as a significant advancement over AlphaZero, [8] and a generalizable step forward in unsupervised learning techniques. [9] [10] The work was seen as advancing understanding of how to compose systems from smaller components, a systems-level development more than a pure machine-learning development. [11]

While the development team released only pseudocode, Werner Duvaud produced an open-source implementation based on it. [12]

MuZero has been used as a reference implementation in other work, for instance as a way to generate model-based behavior. [13]

In late 2021, a more efficient variant of MuZero was proposed, named EfficientZero. It "achieves 194.3 percent mean human performance and 109.0 percent median performance on the Atari 100k benchmark with only two hours of real-time game experience". [14]

In early 2022, a variant called Stochastic MuZero was proposed for playing stochastic games (for example, 2048 and backgammon). It uses afterstate dynamics and chance codes to account for the stochastic nature of the environment when training the dynamics network, as sketched below. [15]
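
The following is a minimal, hypothetical sketch of that split (the class, method, and layer names are illustrative assumptions, not the published architecture): the agent's action deterministically produces an afterstate, a distribution over discrete chance outcomes is predicted from the afterstate, and the sampled outcome drives the transition to the next hidden state.

    import torch
    import torch.nn as nn

    class StochasticDynamics(nn.Module):
        """Toy stand-in for Stochastic MuZero's afterstate/chance-code split (illustrative only)."""

        def __init__(self, hidden_dim, action_count, num_chance_codes):
            super().__init__()
            self._afterstate = nn.Linear(hidden_dim + action_count, hidden_dim)
            self._chance_prior = nn.Linear(hidden_dim, num_chance_codes)
            self._dynamics = nn.Linear(hidden_dim + num_chance_codes, hidden_dim + 1)
            self.action_count = action_count
            self.num_chance_codes = num_chance_codes

        def afterstate(self, state, action):
            # Deterministic consequence of the agent's action, before the environment's random outcome.
            a = nn.functional.one_hot(action, self.action_count).float()
            return self._afterstate(torch.cat([state, a], dim=-1))

        def chance_distribution(self, afterstate):
            # Distribution over discrete chance outcomes (e.g. which tile spawns in 2048, or a dice roll in backgammon).
            return torch.softmax(self._chance_prior(afterstate), dim=-1)

        def next_state(self, afterstate, chance_code):
            # Transition given a sampled chance outcome; also predicts the immediate reward.
            c = nn.functional.one_hot(chance_code, self.num_chance_codes).float()
            out = self._dynamics(torch.cat([afterstate, c], dim=-1))
            return out[..., :-1], out[..., -1]

In the paper, the chance codes themselves are learned from data rather than hand-specified.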

See also

AlphaGo
AlphaGo Zero
AlphaZero
General game playing
Monte Carlo tree search
Deep reinforcement learning
Self-play

References

  1. Wiggers, Kyle (20 November 2019). "DeepMind's MuZero teaches itself how to win at Atari, chess, shogi, and Go". VentureBeat. Retrieved 22 July 2020.
  2. Friedel, Frederic. "MuZero figures out chess, rules and all". ChessBase GmbH. Retrieved 22 July 2020.
  3. Rodriguez, Jesus. "DeepMind Unveils MuZero, a New Agent that Mastered Chess, Shogi, Atari and Go Without Knowing the Rules". KDnuggets. Retrieved 22 July 2020.
  4. Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; Lillicrap, Timothy; Silver, David (2020). "Mastering Atari, Go, chess and shogi by planning with a learned model". Nature. 588 (7839): 604–609. arXiv:1911.08265. Bibcode:2020Natur.588..604S. doi:10.1038/s41586-020-03051-4. PMID 33361790. S2CID 208158225.
  5. "What AlphaGo Can Teach Us About How People Learn". Wired. ISSN 1059-1028. Retrieved 2020-12-25.
  6. Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (5 December 2017). "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm". arXiv:1712.01815 [cs.AI].
  7. Kapturowski, Steven; Ostrovski, Georg; Quan, John; Munos, Remi; Dabney, Will (2019). "Recurrent Experience Replay in Distributed Reinforcement Learning". ICLR 2019, via OpenReview.
  8. Shorten, Connor (2020-01-18). "The Evolution of AlphaGo to MuZero". Medium. Retrieved 2020-06-07.
  9. Shah, Rohin. "[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee - LessWrong 2.0". www.lesswrong.com. Retrieved 2020-06-07.
  10. Wu, Jun. "Reinforcement Learning, Deep Learning's Partner". Forbes. Retrieved 2020-07-15.
  11. "Machine Learning & Robotics: My (biased) 2019 State of the Field". cachestocaches.com. Retrieved 2020-07-15.
  12. Duvaud, Werner (2020-07-15), werner-duvaud/muzero-general, retrieved 2020-07-15.
  13. van Seijen, Harm; Nekoei, Hadi; Racah, Evan; Chandar, Sarath (2020-07-06). "The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning". arXiv:2007.03158 [cs.LG].
  14. Ye, Weirui; Liu, Shaohuai; Kurutach, Thanard; Abbeel, Pieter; Gao, Yang (2021-12-11). "Mastering Atari Games with Limited Data". arXiv: 2111.00210 [cs.LG].
  15. Antonoglou, Ioannis; Schrittwieser, Julian; Ozair, Sherjil; Hubert, Thomas; Silver, David (2022-01-28). "Planning in Stochastic Environments with a Learned Model". Retrieved 2023-12-12.