Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment. Each agent is motivated by its own rewards and takes actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex group dynamics.

Multi-agent reinforcement learning is closely related to game theory, especially repeated games, as well as multi-agent systems. Its study combines the search for ideal algorithms that maximize rewards with a more sociological set of concepts. While research in single-agent reinforcement learning is concerned with finding the algorithm that earns the highest reward for one agent, research in multi-agent reinforcement learning evaluates and quantifies social metrics, such as cooperation, reciprocity, equity, social influence, language and discrimination.

Definition

Similarly to single-agent reinforcement learning, multi-agent reinforcement learning is modeled as some form of a Markov decision process (MDP). For example, the model may consist of:

* A set $S$ of environment states.
* One set of actions $\mathcal{A}_i$ for each of the agents $i \in I = \{1, \dots, N\}$.
* $P_{\overrightarrow{a}}(s,s') = \Pr(s_{t+1}=s' \mid s_t=s, \overrightarrow{a}_t=\overrightarrow{a})$ is the probability of transition (at time $t$) from state $s$ to state $s'$ under joint action $\overrightarrow{a}$.
* $\overrightarrow{R}_{\overrightarrow{a}}(s,s')$ is the immediate joint reward after the transition from $s$ to $s'$ with joint action $\overrightarrow{a}$.

In settings with perfect information, such as the games of chess and Go, the MDP would be fully observable. In settings with imperfect information, especially in real-world applications like self-driving cars, each agent would access an observation that only has part of the information about the current state. In the partially observable setting, the core model is the partially observable stochastic game in the general case, and the decentralized POMDP in the cooperative case.
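The components of the definition above can be sketched as a small data structure. This is only an illustrative sketch for the fully observable case: the class name `MarkovGame`, its field names, and the two-agent "doorway" toy environment are assumptions made for this example and do not come from any particular library.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]  # one action per agent, ordered by agent index


@dataclass
class MarkovGame:
    """Hypothetical container mirroring the components S, A_i, P and R."""
    states: List[State]            # S: the environment states
    action_sets: List[List[str]]   # A_i: one action set per agent i
    # P: maps (s, joint action) to a probability distribution over s'
    transition: Callable[[State, JointAction], Dict[State, float]]
    # R: maps (s, joint action, s') to one reward per agent
    reward: Callable[[State, JointAction, State], Tuple[float, ...]]

    def step(self, s: State, joint_action: JointAction, rng: random.Random):
        """Sample s' from P and return it together with the joint reward."""
        dist = self.transition(s, joint_action)
        next_states = list(dist)
        weights = [dist[n] for n in next_states]
        s_next = rng.choices(next_states, weights=weights, k=1)[0]
        return s_next, self.reward(s, joint_action, s_next)


# Toy environment (assumed for illustration): two agents at a narrow doorway.
# The doorway clears only when exactly one agent goes and the other waits.
def doorway_transition(s: State, a: JointAction) -> Dict[State, float]:
    if s == "blocked" and a.count("go") == 1:
        return {"clear": 1.0}
    return {s: 1.0}


def doorway_reward(s: State, a: JointAction, s_next: State) -> Tuple[float, ...]:
    if s == "blocked" and s_next == "clear":
        # The agent that went gets 1.0, the agent that waited gets 0.5.
        return tuple(1.0 if act == "go" else 0.5 for act in a)
    return tuple(0.0 for _ in a)


game = MarkovGame(
    states=["blocked", "clear"],
    action_sets=[["go", "wait"], ["go", "wait"]],
    transition=doorway_transition,
    reward=doorway_reward,
)
```

Calling `game.step("blocked", ("go", "wait"), random.Random(0))` moves the toy environment to `"clear"` and returns one reward per agent. A partially observable variant would additionally need a per-agent observation function, which this sketch omits.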

Cooperation vs. competition

When multiple agents are acting in a shared environment, their interests might be aligned or misaligned. MARL allows exploring all the different alignments and how they affect the agents' behavior:

* In pure competition settings, the agents' rewards are exactly opposite to each other, and therefore they are playing against each other.
* Pure cooperation settings are the other extreme, in which agents get the exact same rewards, and therefore they are playing with each other.
* Mixed-sum settings cover all the games that combine elements of both cooperation and competition.
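For a two-agent game with a finite payoff table, the three categories above can be told apart by inspecting the rewards directly. The following is a minimal sketch under that assumption; the function name `classify_setting` and the example payoff values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# payoffs[(a1, a2)] = (reward to agent 1, reward to agent 2)
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]


def classify_setting(payoffs: Payoffs) -> str:
    """Label a two-agent payoff table by how the agents' rewards relate."""
    # Rewards exactly opposite in every outcome: pure competition.
    # (A table of all zeros satisfies both tests and is reported here first.)
    if all(r1 == -r2 for r1, r2 in payoffs.values()):
        return "pure competition"
    # Rewards identical in every outcome: pure cooperation.
    if all(r1 == r2 for r1, r2 in payoffs.values()):
        return "pure cooperation"
    return "mixed-sum"


matching_pennies: Payoffs = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
shared_goal: Payoffs = {
    ("L", "L"): (1, 1), ("L", "R"): (0, 0),
    ("R", "L"): (0, 0), ("R", "R"): (1, 1),
}
prisoners_dilemma: Payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
```

Here `classify_setting` labels the three example tables as pure competition, pure cooperation and mixed-sum respectively.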

Pure competition settings

When two agents are playing a zero-sum game, they are in pure competition with each other. Many traditional games such as chess and Go fall under this category, as do two-player variants of modern games like StarCraft. Because each agent can only win at the expense of the other agent, many complexities are stripped away. There is no prospect of communication or social dilemmas, as neither agent is incentivized to take actions that benefit its opponent.

The Deep Blue and AlphaGo projects demonstrate how to optimize the performance of agents in pure competition settings.

One complexity that is not stripped away in pure competition settings is autocurricula. As the agents' policies are improved using self-play, multiple layers of learning may occur.

Pure cooperation settings

MARL is used to explore how separate agents with identical interests can communicate and work together. Pure cooperation settings are explored in recreational cooperative games such as Overcooked, as well as real-world scenarios in robotics.

In pure cooperation settings all the agents get identical rewards, which means that social dilemmas do not occur.

In pure cooperation settings there is often an arbitrary number of coordination strategies, and agents converge to specific "conventions" when coordinating with each other. The notion of conventions has been studied in language and also alluded to in more general multi-agent collaborative tasks.

Mixed-sum settings

Figure: Multi give way (4 agents, each trying to reach a specific point).

Most real-world scenarios involving multiple agents have elements of both cooperation and competition. For example, when multiple self-driving cars are planning their respective paths, each of them has interests that are diverging but not exclusive: each car is minimizing the amount of time it takes to reach its destination, but all cars have the shared interest of avoiding a traffic collision.

Mixed-sum settings can be explored using classic matrix games such as prisoner's dilemma, more complex sequential social dilemmas, and recreational mixed-sum games such as Diplomacy and Among Us. Mixed-sum settings can give rise to communication and social dilemmas.

Social dilemmas

As in game theory, much of the research in MARL revolves around social dilemmas, such as chicken and stag hunt. While game theory research might focus on Nash equilibria and what an ideal policy for an agent would be, MARL research focuses on how the agents would learn these ideal policies using a trial-and-error process.

The algorithms that are used to train the agents maximize each agent's own reward; the conflict between the needs of the agents and the needs of the group is a subject of active research. Various techniques have been explored in order to induce cooperation in agents: modifying the environment rules, adding intrinsic rewards, and more.
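One way to picture the effect of an intrinsic reward is to add a bonus proportional to the partner's reward and check how each agent's incentives change. The sketch below does this for a prisoner's dilemma. The payoff values (3, 0, 5, 1), the "own reward plus weight times partner's reward" form of the bonus, and the function names are all assumptions chosen for illustration; they are not a specific method from the literature discussed here.

```python
# Prisoner's dilemma: PAYOFF[(mine, theirs)] = my extrinsic reward.
# "C" = cooperate, "D" = defect. Values are conventional illustrative numbers.
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}
ACTIONS = ("C", "D")


def shaped_reward(mine: str, theirs: str, weight: float) -> float:
    """Own reward plus a hypothetical intrinsic bonus: weight * partner's reward."""
    return PAYOFF[(mine, theirs)] + weight * PAYOFF[(theirs, mine)]


def dominant_action(weight: float):
    """Return the action that is strictly best against every partner action.

    Returns None when neither action is strictly dominant.
    """
    for action in ACTIONS:
        alternatives = [b for b in ACTIONS if b != action]
        if all(
            shaped_reward(action, theirs, weight) > shaped_reward(alt, theirs, weight)
            for theirs in ACTIONS
            for alt in alternatives
        ):
            return action
    return None
```

With these numbers, `dominant_action(0.0)` is `"D"` (a purely self-interested agent always prefers to defect), `dominant_action(1.0)` is `"C"` (an agent that values its partner's reward as much as its own always prefers to cooperate), and `dominant_action(0.5)` is `None` because the best action then depends on what the partner does.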

Sequential social dilemmas

Social dilemmas like prisoner's dilemma, chicken and stag hunt are "matrix games". Each agent takes only one action from a choice of two possible actions, and a simple 2×2 matrix is used to describe the reward that each agent will get, given the actions that each agent took.

In humans and other living creatures, social dilemmas tend to be more complex. Agents take multiple actions over time, and the distinction between cooperating and defecting is not as clear-cut as in matrix games. The concept of a sequential social dilemma (SSD) was introduced in 2017 as an attempt to model that complexity. There is ongoing research into defining different kinds of SSDs and showing cooperative behavior in the agents that act in them.
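The 2×2 matrix games named at the start of this section are small enough to write out in full. The sketch below encodes one conventional choice of payoffs for each game (the specific numbers are assumptions, as only their ordering matters) and enumerates the pure-strategy Nash equilibria by checking that neither agent can gain from a unilateral change of action. The function name `pure_nash_equilibria` is hypothetical.

```python
from itertools import product

# "C" = cooperate, "D" = defect (in chicken: swerve / go straight;
# in stag hunt: hunt stag / hunt hare).
ACTIONS = ("C", "D")

# Each game maps (row action, column action) -> (row reward, column reward).
# Payoff numbers are illustrative assumptions.
GAMES = {
    "prisoners_dilemma": {
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    "chicken": {
        ("C", "C"): (3, 3), ("C", "D"): (1, 5),
        ("D", "C"): (5, 1), ("D", "D"): (0, 0),
    },
    "stag_hunt": {
        ("C", "C"): (5, 5), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (1, 1),
    },
}


def pure_nash_equilibria(game):
    """List action pairs from which no agent gains by deviating alone."""
    equilibria = []
    for row, col in product(ACTIONS, repeat=2):
        row_reward, col_reward = game[(row, col)]
        # Row agent: no alternative action does better against this column action.
        row_ok = all(game[(alt, col)][0] <= row_reward for alt in ACTIONS)
        # Column agent: no alternative action does better against this row action.
        col_ok = all(game[(row, alt)][1] <= col_reward for alt in ACTIONS)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria
```

With these payoffs the prisoner's dilemma has mutual defection as its only pure equilibrium, chicken has the two asymmetric outcomes, and stag hunt has both mutual cooperation and mutual defection. A sequential social dilemma cannot be summarized by such a table, because cooperation and defection there are properties of whole policies rather than of single actions.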

Autocurricula

An autocurriculum (plural: autocurricula) is a reinforcement learning concept that is salient in multi-agent experiments. As agents improve their performance, they change their environment; this change in the environment affects themselves and the other agents. The feedback loop results in several distinct phases of learning, each depending on the previous one. The stacked layers of learning are called an autocurriculum. Autocurricula are especially apparent in adversarial settings, where each group of agents is racing to counter the current strategy of the opposing group.

The Hide and Seek game is an accessible example of an autocurriculum occurring in an adversarial setting. In this experiment, a team of seekers is competing against a team of hiders. Whenever one of the teams learns a new strategy, the opposing team adapts its strategy to give the best possible counter. When the hiders learn to use boxes to build a shelter, the seekers respond by learning to use a ramp to break into that shelter. The hiders respond by locking the ramps, making them unavailable for the seekers to use. The seekers then respond by "box surfing", exploiting a glitch in the game to penetrate the shelter. Each "level" of learning is an emergent phenomenon, with the previous level as its premise. This results in a stack of behaviors, each dependent on its predecessor.

Autocurricula in reinforcement learning experiments are compared to the stages of the evolution of life on Earth and the development of human culture. A major stage in evolution happened 2-3 billion years ago, when photosynthesizing life forms started to produce massive amounts of oxygen, changing the balance of gases in the atmosphere. In the next stages of evolution, oxygen-breathing life forms evolved, eventually leading up to land mammals and human beings. These later stages could only happen after the photosynthesis stage made oxygen widely available. Similarly, human culture could not have gone through the Industrial Revolution in the 18th century without the resources and insights gained by the agricultural revolution at around 10,000 BC.

Applications

AI alignment

Multi-agent reinforcement learning has been used in research into AI alignment. The relationship between the different agents in a MARL setting can be compared to the relationship between a human and an AI agent. Research efforts at the intersection of these two fields attempt to simulate possible conflicts between a human's intentions and an AI agent's actions, and then explore which variables could be changed to prevent these conflicts.

Limitations

Multi-agent deep reinforcement learning has some inherent difficulties. From the perspective of any single agent, the environment is no longer stationary, because the other agents change their policies as they learn. The Markov property is therefore violated: transitions and rewards do not depend only on the current state of an agent.
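The non-stationarity can be seen even in a stateless game. In the sketch below, the expected reward of each of one agent's actions is computed for two different policies of the other agent; the action that looks best flips although the first agent's own situation is unchanged. The stag hunt payoffs and the function names are assumptions made for this illustration.

```python
# Rewards for one agent in a stag hunt: REWARD[(mine, theirs)].
# Values are illustrative assumptions.
REWARD = {
    ("stag", "stag"): 5.0,
    ("stag", "hare"): 0.0,
    ("hare", "stag"): 3.0,
    ("hare", "hare"): 1.0,
}
ACTIONS = ("stag", "hare")


def expected_reward(my_action: str, p_other_stag: float) -> float:
    """Expected reward when the other agent hunts stag with probability p_other_stag."""
    return (
        p_other_stag * REWARD[(my_action, "stag")]
        + (1.0 - p_other_stag) * REWARD[(my_action, "hare")]
    )


def best_response(p_other_stag: float) -> str:
    """Action with the highest expected reward against the other agent's policy."""
    return max(ACTIONS, key=lambda a: expected_reward(a, p_other_stag))
```

Against a partner that hunts stag 90% of the time, `best_response(0.9)` is `"stag"`; once that partner has shifted to hunting stag only 10% of the time, `best_response(0.1)` is `"hare"`. A value estimate learned under the first policy is therefore stale under the second, which is the difficulty a single-agent algorithm faces when the other agents keep learning.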

Software

There are various tools and frameworks for working with multi-agent reinforcement learning environments.
