MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules. Its release in 2019 included benchmarks of its performance in Go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go (setting a new world record), and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain.
MuZero was trained via self-play, with no access to rules, opening books, or endgame tablebases. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20 percent fewer computation steps per node in the search tree.
MuZero's capacity to plan and learn effectively without explicit rules makes it a notable advance in reinforcement learning and artificial intelligence.
History
On November 19, 2019, the DeepMind team released a preprint introducing MuZero.
Derivation from AlphaZero
MuZero (MZ) combines the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visual video games.
MuZero was derived directly from AZ code, sharing its rules for setting hyperparameters. Differences between the approaches include:
* AZ's planning process uses a simulator, which knows the rules of the game and has to be explicitly programmed. A neural network then predicts the policy and value of a future position. Perfect knowledge of game rules is used in modeling state transitions in the search tree, actions available at each node, and termination of a branch of the tree. MZ does not have access to the rules and instead learns a model of the game with neural networks.
* AZ has a single model for the game (from board state to predictions); MZ has separate models for ''representation'' of the current state (from board state into its internal embedding), ''dynamics'' of states (how actions change representations of board states), and ''prediction'' of policy and value of a future position (given a state's representation).
* MZ's hidden model may be complex, and it may turn out that it can host computation; the details of the hidden model in a trained instance of MZ remain a topic for future exploration.
* MZ does not expect a two-player game where winners take all. It works with standard reinforcement-learning scenarios, including single-agent environments with continuous intermediate rewards, possibly of arbitrary magnitude and with time discounting. AZ was designed for two-player games that could be won, drawn, or lost.
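The split into ''representation'', ''dynamics'', and ''prediction'' models described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function bodies are hand-written toy rules standing in for trained neural networks, the hidden state is a plain list of floats, there are only two actions, and the discount value is arbitrary. The point is only that a sequence of actions can be scored inside the model, by summing discounted predicted rewards and a final predicted value, without ever consulting the rules of the environment.

```python
import math
from typing import List, Tuple

State = List[float]  # hidden state: an internal embedding, not a board position


def representation(observation: List[float]) -> State:
    """Encode a raw observation into a hidden state (toy stand-in for a network)."""
    return [math.tanh(x) for x in observation]


def dynamics(state: State, action: int) -> Tuple[State, float]:
    """Predict the next hidden state and the immediate reward for an action."""
    shift = 0.1 * (action + 1)  # hypothetical effect of the action
    next_state = [math.tanh(s + shift) for s in state]
    reward = sum(next_state) / len(next_state)
    return next_state, reward


def prediction(state: State) -> Tuple[List[float], float]:
    """Predict a policy over two actions and a value from a hidden state."""
    mean = sum(state) / len(state)
    p1 = 1.0 / (1.0 + math.exp(-mean))
    return [1.0 - p1, p1], mean


def evaluate_plan(observation: List[float], actions: List[int],
                  discount: float = 0.99) -> float:
    """Score an action sequence using only the model, never the real rules."""
    state = representation(observation)
    total, scale = 0.0, 1.0
    for action in actions:
        state, reward = dynamics(state, action)
        total += scale * reward      # time-discounted intermediate reward
        scale *= discount
    _, value = prediction(state)     # bootstrap from the predicted value
    return total + scale * value


if __name__ == "__main__":
    print(evaluate_plan([0.0, 0.0], [0, 0]))
    print(evaluate_plan([0.0, 0.0], [1, 1]))
```

In a full system a tree search would call functions like these many times per move to compare candidate action sequences; the sketch only evaluates one fixed sequence at a time.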
Comparison with R2D2
The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN.
MuZero surpassed both R2D2's mean and median performance across the suite of games, though it did not do better in every game.
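A small sketch shows how one agent can lead on both mean and median across a suite while still losing individual games. The scores below are made up purely for illustration and are not results from either system; the game names and the helper functions `summarize` and `games_won` are likewise hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Tuple


def summarize(scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean, median) of per-game scores."""
    return mean(scores.values()), median(scores.values())


def games_won(a: Dict[str, float], b: Dict[str, float]) -> List[str]:
    """List the games in which agent a scores strictly higher than agent b."""
    return sorted(game for game in a if a[game] > b[game])


# Made-up normalized scores for two hypothetical agents on five games.
new_agent = {"game_a": 5.0, "game_b": 2.0, "game_c": 0.5, "game_d": 1.5, "game_e": 0.1}
old_agent = {"game_a": 2.0, "game_b": 1.0, "game_c": 0.8, "game_d": 1.2, "game_e": 0.3}

if __name__ == "__main__":
    print(summarize(new_agent))             # higher mean and median
    print(summarize(old_agent))
    print(games_won(old_agent, new_agent))  # games the older agent still wins
```

Here the newer agent has the higher mean and the higher median, yet the older agent still comes out ahead on two of the five games.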
Training and results
For board games, MuZero used 16 third-generation tensor processing units (TPUs) for training and 1000 TPUs for self-play, with 800 simulations per step. For Atari games, it used 8 TPUs for training and 32 TPUs for self-play, with 50 simulations per step.
AlphaZero used 64 second-generation TPUs for training, and 5000 first-generation TPUs for self-play. As TPU design has improved (third-generation chips are individually twice as powerful as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are comparable training setups.
R2D2 was trained for 5 days through 2M training steps.
Initial results
MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in Go after 500,000 training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500,000 training steps and surpassed it by 1 million steps, though it never performed well on 6 games in the suite.
Reactions and related work
MuZero was viewed as a significant advancement over AlphaZero, and a generalizable step forward in unsupervised learning techniques. The work was seen as advancing understanding of how to compose systems from smaller components, a systems-level development more than a pure machine-learning development.
While the development team released only pseudocode, Werner Duvaud produced an open source implementation based on it.
MuZero has been used as a reference implementation in other work, for instance as a way to generate model-based behavior.
In late 2021, a more efficient variant of MuZero was proposed, named EfficientZero. It "achieves 194.3 percent mean human performance and 109.0 percent median performance on the Atari 100k benchmark with only two hours of real-time game experience".
In early 2022, a variant of MuZero called Stochastic MuZero was proposed to play stochastic games (for example 2048 and backgammon). It uses afterstate dynamics and chance codes to account for the stochastic nature of the environment when training the dynamics network.
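The idea of afterstates and chance codes can be sketched as a two-stage transition: a deterministic step from a state and an action to an afterstate, followed by a step that depends on a discrete chance code standing for the random outcome. The sketch below is a toy under stated assumptions, not the published method: the hidden state is a single float, there are only two chance codes with a fixed hand-written distribution, and all function names and numbers are hypothetical stand-ins for learned networks.

```python
from typing import Dict, Tuple


def afterstate_dynamics(state: float, action: int) -> float:
    """Deterministic part: the hidden state after the agent acts,
    before the environment's randomness is resolved."""
    return state + action


def chance_distribution(afterstate: float) -> Dict[int, float]:
    """Distribution over chance codes (fixed here; learned in a real system)."""
    return {0: 0.75, 1: 0.25}


def dynamics(afterstate: float, chance_code: int) -> Tuple[float, float]:
    """Stochastic part: next hidden state and reward for a given chance code."""
    bonus = 2.0 if chance_code == 1 else 0.0
    reward = 1.0 if chance_code == 1 else 0.0
    return afterstate + bonus, reward


def expected_reward(state: float, action: int) -> float:
    """Average the predicted reward over the chance codes of the afterstate."""
    afterstate = afterstate_dynamics(state, action)
    return sum(prob * dynamics(afterstate, code)[1]
               for code, prob in chance_distribution(afterstate).items())


if __name__ == "__main__":
    print(expected_reward(0.0, 1))
```

Separating the two stages lets a planner treat the agent's choice and the environment's random outcome as distinct nodes when looking ahead.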
See also
* General game playing
* Unsupervised learning