Thompson sampling,
named after
William R. Thompson, is a
heuristic for choosing actions that addresses the
exploration-exploitation dilemma
The exploration-exploitation dilemma, also known as the explore-exploit tradeoff, is a fundamental concept in decision-making that arises in many domains. It is depicted as the balancing act between two opposing strategies. Exploitation involves c ...
in the
multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
Description
Consider a set of contexts
, a set of actions
, and rewards in
. In each round, the player obtains a context
, plays an action
and receives a reward
following a distribution that depends on the context and the issued action. The aim of the player is to play actions such as to maximize the cumulative rewards.
The elements of Thompson sampling are as follows:
# a likelihood function
;
# a set
of parameters
of the distribution of
;
# a prior distribution
on these parameters;
# past observations triplets
;
# a posterior distribution
, where
is the likelihood function.
Thompson sampling consists in playing the action
according to the probability that it maximizes the expected reward; action
is chosen with probability
:
where
is the
indicator function
In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
.
In practice, the rule is implemented by sampling. In each round, parameters
are sampled from the posterior
, and an action
chosen that maximizes