A partially observable Markov decision process (POMDP) is a generalization of a
Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a sensor model (the
probability distribution
of different observations given the underlying state) and the underlying MDP. Unlike the policy function in an MDP, which maps the underlying states to actions, a POMDP's policy is a mapping from the history of observations (or belief states) to actions.
The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The general framework of Markov decision processes with
imperfect information
was described by
Karl Johan Åström in 1965 in the case of a discrete state space, and it was further studied in the
operations research
community where the acronym POMDP was coined. It was later adapted for problems in
artificial intelligence
and
automated planning
by
Leslie P. Kaelbling and
Michael L. Littman.
An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes the expected reward (or minimizes the cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
Definition
Formal definition
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple <math>(S, A, T, R, \Omega, O, \gamma)</math>, where
* <math>S</math> is a set of states,
* <math>A</math> is a set of actions,
* <math>T</math> is a set of conditional transition probabilities between states,
* <math>R: S \times A \to \mathbb{R}</math> is the reward function,
* <math>\Omega</math> is a set of observations,
* <math>O</math> is a set of conditional observation probabilities, and
* <math>\gamma \in [0,1)</math> is the discount factor.
At each time period, the environment is in some state <math>s \in S</math>. The agent takes an action <math>a \in A</math>, which causes the environment to transition to state <math>s'</math> with probability <math>T(s' \mid s, a)</math>. At the same time, the agent receives an observation <math>o \in \Omega</math> which depends on the new state of the environment, <math>s'</math>, and on the just taken action, <math>a</math>, with probability <math>O(o \mid s', a)</math> (or sometimes <math>O(o \mid s')</math>, depending on the sensor model). Finally, the agent receives a reward <math>r</math> equal to <math>R(s, a)</math>. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: <math>E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]</math>, where <math>r_t</math> is the reward earned at time <math>t</math>. The discount factor <math>\gamma</math> determines how much immediate rewards are favored over more distant rewards. When <math>\gamma = 0</math> the agent only cares about which action will yield the largest expected immediate reward; when <math>\gamma \to 1</math> the agent cares about maximizing the expected sum of future rewards.
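To make the tuple concrete, the following is a minimal sketch (not from any particular reference) of a hypothetical two-state, two-action, two-observation POMDP encoded with NumPy arrays, assuming the layout T[a][s, s'] for transition probabilities, O[a][s', o] for observation probabilities, and R[s, a] for rewards; all numbers are illustrative only, and the belief-update sketch later in the article reuses this layout.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical two-state POMDP, encoded with one stochastic matrix per action.
# States S = {0, 1}, actions A = {0, 1}, observations Omega = {0, 1}.
gamma = 0.95                      # discount factor

# T[a][s, s'] = Pr(s' | s, a): state-transition probabilities.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.5, 0.5]]])

# O[a][s', o] = Pr(o | s', a): observation probabilities given the new state.
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.8, 0.2],
               [0.3, 0.7]]])

# R[s, a]: immediate reward for taking action a in state s.
R = np.array([[ 1.0,  0.0],
              [-1.0,  0.0]])
</syntaxhighlight>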
Discussion
Because the agent does not directly observe the environment's state, the agent must make decisions under uncertainty of the true environment state. However, by interacting with the environment and receiving observations, the agent may update its belief in the true state by updating the probability distribution of the current state. A consequence of this property is that the optimal behavior may often include (information gathering) actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in the future.
It is instructive to compare the above definition with the definition of a
Markov decision process. An MDP does not include the observation set, because the agent always knows with certainty the environment's current state. Alternatively, an MDP can be reformulated as a POMDP by setting the observation set to be equal to the set of states and defining the observation conditional probabilities to deterministically select the observation that corresponds to the true state.
Belief update
After having taken the action <math>a</math> and observing <math>o</math>, an agent needs to update its belief in the state the environment may (or not) be in. Since the state is Markovian (by assumption), maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted <math>b' = \tau(b, a, o)</math>. Below we describe how this belief update is computed.
After reaching <math>s'</math>, the agent observes <math>o \in \Omega</math> with probability <math>O(o \mid s', a)</math>. Let <math>b</math> be a probability distribution over the state space <math>S</math>. <math>b(s)</math> denotes the probability that the environment is in state <math>s</math>. Given <math>b(s)</math>, then after taking action <math>a</math> and observing <math>o</math>,
:<math>b'(s') = \eta\, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)</math>
where <math>\eta = 1/\Pr(o \mid b, a)</math> is a normalizing constant with <math>\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)</math>.
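The update above is a few lines of linear algebra. The following sketch computes <math>b' = \tau(b, a, o)</math> and the normalizing constant <math>\Pr(o \mid b, a)</math> under the assumed array layout introduced earlier; it is illustrative rather than a reference implementation.
<syntaxhighlight lang="python">
import numpy as np

def belief_update(b, a, o, T, O):
    """Return b' with b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) b(s).

    Assumed layout: b has shape (|S|,), T[a][s, s'] = Pr(s' | s, a),
    O[a][s', o] = Pr(o | s', a).
    """
    predicted = b @ T[a]               # sum_s T(s' | s, a) b(s)
    unnormalized = O[a][:, o] * predicted
    prob_o = unnormalized.sum()        # Pr(o | b, a), i.e. 1 / eta
    if prob_o == 0.0:
        raise ValueError("observation o has zero probability under belief b and action a")
    return unnormalized / prob_o
</syntaxhighlight>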
Belief MDP
A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting ''belief MDP'' will thus be defined on a continuous state space (even if the "originating" POMDP has a finite number of states: there are infinitely many belief states (in <math>B</math>) because there are infinitely many probability distributions over the states (of <math>S</math>)).
Formally, the belief MDP is defined as a tuple <math>(B, A, \tau, r, \gamma)</math> where
* <math>B</math> is the set of belief states over the POMDP states,
* <math>A</math> is the same finite set of actions as for the original POMDP,
* <math>\tau</math> is the belief state transition function,
* <math>r: B \times A \to \mathbb{R}</math> is the reward function on belief states,
* <math>\gamma</math> is the discount factor equal to the <math>\gamma</math> in the original POMDP.
Of these, <math>\tau</math> and <math>r</math> need to be derived from the original POMDP. <math>\tau</math> is
:<math>\tau(b, a, b') = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid b, a),</math>
where <math>\Pr(o \mid b, a)</math> is the value derived in the previous section and
:<math>\Pr(b' \mid b, a, o) = \begin{cases} 1 & \text{if the belief update with arguments } b, a, o \text{ returns } b' \\ 0 & \text{otherwise} \end{cases}.</math>
The belief MDP reward function (<math>r</math>) is the expected reward from the POMDP reward function over the belief state distribution:
:<math>r(b, a) = \sum_{s \in S} b(s)\, R(s, a)</math>.
The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
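Under the same assumed layout, the derived quantities can be sketched as follows: the belief reward <math>r(b, a)</math>, and, for the transition function <math>\tau</math>, the distribution over successor beliefs reachable from <math>b</math> under action <math>a</math> (one successor per observation with positive probability). These helper names are hypothetical.
<syntaxhighlight lang="python">
import numpy as np

def belief_reward(b, a, R):
    """r(b, a) = sum_s b(s) R(s, a), with R[s, a] as the assumed reward layout."""
    return b @ R[:, a]

def belief_successors(b, a, T, O):
    """List of (Pr(o | b, a), o, b') triples: the belief MDP places probability
    Pr(o | b, a) on the successor belief b' = tau(b, a, o)."""
    predicted = b @ T[a]
    successors = []
    for o in range(O[a].shape[1]):
        unnormalized = O[a][:, o] * predicted
        prob_o = unnormalized.sum()
        if prob_o > 0.0:
            successors.append((prob_o, o, unnormalized / prob_o))
    return successors
</syntaxhighlight>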
Policy and value function
Unlike the "originating" POMDP (where each action is available from only one state), in the corresponding belief MDP all belief states allow all actions, since you (almost) always have ''some'' probability of believing you are in any (originating) state. As such, <math>\pi</math> specifies an action <math>a = \pi(b)</math> for any belief <math>b</math>.
Here it is assumed the objective is to maximize the expected total discounted reward over an infinite horizon. When <math>R</math> defines a cost, the objective becomes the minimization of the expected cost.
The expected reward for policy <math>\pi</math> starting from belief <math>b_0</math> is defined as
:<math>V^{\pi}(b_0) = \sum_{t=0}^{\infty} \gamma^t\, r(b_t, a_t) = \sum_{t=0}^{\infty} \gamma^t\, E\left[R(s_t, a_t) \mid b_0, \pi\right]</math>
where <math>\gamma < 1</math> is the discount factor. The optimal policy <math>\pi^*</math> is obtained by optimizing the long-term reward:
:<math>\pi^* = \underset{\pi}{\operatorname{argmax}}\ V^{\pi}(b_0)</math>
where <math>b_0</math> is the initial belief.
The optimal policy, denoted by <math>\pi^*</math>, yields the highest expected reward value for each belief state, compactly represented by the optimal value function <math>V^*</math>. This value function is the solution to the Bellman optimality equation:
:<math>V^*(b) = \max_{a \in A}\left[ r(b, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a)\, V^*(\tau(b, a, o)) \right]</math>
For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex. It can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate <math>V^*</math> arbitrarily closely, and its shape remains convex. Value iteration applies a dynamic programming update to gradually improve on the value until convergence to an <math>\epsilon</math>-optimal value function, and it preserves piecewise linearity and convexity. By improving the value, the policy is implicitly improved. Another dynamic programming technique, called policy iteration, explicitly represents and improves the policy instead.
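For a finite horizon, the Bellman backup can be unrolled directly. The sketch below (reusing the hypothetical belief_reward and belief_successors helpers above) evaluates the depth-<math>d</math> value of a belief by exhaustive recursion; it is exponential in the horizon and is only meant to make the backup explicit, not to be a practical solver.
<syntaxhighlight lang="python">
def value(b, depth, A, T, O, R, gamma=0.95):
    """V_d(b) = max_a [ r(b, a) + gamma * sum_o Pr(o | b, a) * V_{d-1}(tau(b, a, o)) ]."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in A:
        q = belief_reward(b, a, R)
        for prob_o, _o, b_next in belief_successors(b, a, T, O):
            q += gamma * prob_o * value(b_next, depth - 1, A, T, O, R, gamma)
        best = max(best, q)
    return best
</syntaxhighlight>
The action attaining the maximum at the root belief is a one-step-lookahead policy for that belief.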
Approximate POMDP solutions
In practice, POMDPs are often computationally
intractable to solve exactly. This intractability is often due to the
curse of dimensionality
or the curse of history (the fact that optimal policies may depend on the entire history of actions and observations). To address these issues, computer scientists have developed various approximate POMDP solutions. These solutions typically attempt to approximate the problem or solution with a limited number of parameters, plan only over a small part of the belief space online, or summarize the action-observation history compactly.
Grid-based algorithms comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques, and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states.
For example, adaptive grids and point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.
Dimensionality reduction
using
PCA has also been explored.
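As an illustration of the grid-based idea (a sketch, not any particular published algorithm), value iteration can be run over a fixed set of belief points, with the value of an off-grid successor belief approximated by its nearest grid point; belief_reward and belief_successors are the hypothetical helpers from above.
<syntaxhighlight lang="python">
import numpy as np

def grid_value_iteration(grid, A, T, O, R, gamma=0.95, sweeps=50):
    """Approximate value iteration on a fixed list of belief points `grid`,
    using nearest-neighbour interpolation for off-grid successor beliefs."""
    V = np.zeros(len(grid))

    def interpolate(b):
        distances = [np.linalg.norm(b - g) for g in grid]
        return V[int(np.argmin(distances))]

    for _ in range(sweeps):
        new_V = np.empty_like(V)
        for i, b in enumerate(grid):
            new_V[i] = max(
                belief_reward(b, a, R)
                + gamma * sum(p * interpolate(b_next)
                              for p, _o, b_next in belief_successors(b, a, T, O))
                for a in A)
        V = new_V
    return V
</syntaxhighlight>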
Online planning algorithms approach large POMDPs by constructing a new policy for the current belief each time a new observation is received. Such a policy only needs to consider future beliefs reachable from the current belief, which is often only a very small part of the full belief space. This family includes variants of
Monte Carlo tree search
and heuristic search. Similar to MDPs, it is possible to construct online algorithms that find arbitrarily near-optimal policies and have no direct computational complexity dependence on the size of the state and observation spaces.
Another line of approximate solution techniques for solving POMDPs relies on using (a subset of) the history of previous observations, actions and rewards up to the current time step as a pseudo-state. Usual techniques for solving MDPs based on these pseudo-states can then be used (e.g.
Q-learning). Ideally the pseudo-states should contain the most important information from the whole history (to reduce bias) while being as compressed as possible (to reduce overfitting).
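A minimal sketch of this pseudo-state idea is tabular Q-learning over truncated action-observation histories; here env, reset() and step() are a hypothetical episodic simulator interface, and the truncation to the last k pairs is the compression mentioned above.
<syntaxhighlight lang="python">
import random
from collections import defaultdict

def q_learning_over_histories(env, actions, episodes=1000, k=4,
                              alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning where the pseudo-state is the last k (action, observation) pairs."""
    Q = defaultdict(float)                       # (history, action) -> value
    for _ in range(episodes):
        obs = env.reset()                        # hypothetical simulator API
        history = ((None, obs),)
        done = False
        while not done:
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(history, x)])
            obs, reward, done = env.step(a)      # hypothetical simulator API
            next_history = (history + ((a, obs),))[-k:]
            target = reward if done else reward + gamma * max(
                Q[(next_history, x)] for x in actions)
            Q[(history, a)] += alpha * (target - Q[(history, a)])
            history = next_history
    return Q
</syntaxhighlight>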
POMDP theory
Planning in POMDPs is
undecidable in general. However, some settings have been identified to be decidable. Different objectives have been considered. Büchi objectives are defined by
Büchi automata. Reachability is an example of a Büchi condition (for instance, reaching a good state in which all robots are home). coBüchi objectives correspond to traces that do not satisfy a given Büchi condition (for instance, not reaching a bad state in which some robot died). Parity objectives are defined via
parity games; they make it possible to define complex objectives, such as reaching a good state every 10 timesteps. The objective can be satisfied:
* almost surely, that is, the probability of satisfying the objective is 1;
* positively, that is, the probability of satisfying the objective is strictly greater than 0;
* quantitatively, that is, the probability of satisfying the objective is greater than a given threshold.
We also consider the finite memory case in which the agent is a
finite-state machine
, and the general case in which the agent has an infinite memory.
Applications
POMDPs can be used to model many kinds of real-world problems. Notable applications include the use of a POMDP in management of patients with ischemic heart disease, assistive technology for persons with dementia,
the conservation of the critically endangered and difficult to detect Sumatran tigers, and aircraft collision avoidance.
One application is a teaching case, the crying baby problem, where a parent needs to sequentially decide whether to feed the baby based on observing whether the baby is crying or not, an imperfect signal of the baby's actual state of hunger.
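A hypothetical encoding of this teaching case, in the same assumed array layout as the earlier sketches (all numbers are illustrative, not taken from a specific reference): states are hungry/sated, actions are feed/ignore, observations are crying/quiet. Reusing the belief_update sketch, hearing crying after ignoring the baby shifts the belief toward "hungry".
<syntaxhighlight lang="python">
import numpy as np

# States: 0 = hungry, 1 = sated.  Actions: 0 = feed, 1 = ignore.
# Observations: 0 = crying, 1 = quiet.  All numbers are illustrative.
T = np.array([[[0.0, 1.0],    # feed: the baby becomes sated
               [0.0, 1.0]],
              [[1.0, 0.0],    # ignore: a hungry baby stays hungry,
               [0.1, 0.9]]])  #         a sated baby may get hungry
O = np.array([[[0.8, 0.2],    # a hungry baby usually cries,
               [0.1, 0.9]],   # a sated baby is usually quiet
              [[0.8, 0.2],
               [0.1, 0.9]]])
R = np.array([[-15.0, -10.0],  # hunger is costly; feeding adds a small extra cost
              [ -5.0,   0.0]])

b = np.array([0.5, 0.5])                      # initially unsure whether the baby is hungry
b = belief_update(b, a=1, o=0, T=T, O=O)      # ignored the baby and heard crying
print(b)                                      # belief mass shifts toward "hungry"
</syntaxhighlight>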
External links
* APPL, a fast point-based POMDP solver
* Finite-state Controllers using Branch-and-Bound, an exact POMDP solver for policies of a bounded size
* pomdp: Infrastructure for Partially Observable Markov Decision Processes (POMDP), an R package which includes an interface to Tony Cassandra's pomdp-solve program
* POMDPs.jl, an interface for defining and solving MDPs and POMDPs in Julia and Python with a variety of solvers
* pyPOMDP, a (PO)MDP toolbox (simulator, solver, learner, file reader) for Python by Oliver Stollmann and Bastian Migge
* zmdp, a POMDP solver by Trey Smith