In probability theory and machine learning, the multi-armed bandit problem (sometimes called the ''K''- or ''N''-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.

In the problem, each machine provides a random reward from a probability distribution specific to that machine, which is not known a priori. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in machine learning. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, like a science foundation or a pharmaceutical company. In early versions of the problem, the gambler begins with no initial knowledge about the machines.

Herbert Robbins, realizing the importance of the problem, constructed convergent population selection strategies in "Some aspects of the sequential design of experiments" (1952). A theorem, the Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward.


Empirical motivation

The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize its decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. There are many practical applications of the bandit model, for example:

* clinical trials investigating the effects of different experimental treatments while minimizing patient losses (Press, 1986),
* adaptive routing efforts for minimizing delays in a network,
* financial portfolio design.

In these practical examples, the problem requires balancing reward maximization based on the knowledge already acquired with attempting new actions to further increase knowledge. This is known as the ''exploitation vs. exploration tradeoff'' in machine learning. The model has also been used to control dynamic allocation of resources to different projects, answering the question of which project to work on, given uncertainty about the difficulty and payoff of each possibility.

Originally considered by Allied scientists in World War II, the problem proved so intractable that, according to Peter Whittle, it was proposed to be dropped over Germany so that German scientists could also waste their time on it. The version of the problem now commonly analyzed was formulated by Herbert Robbins in 1952.


The multi-armed bandit model

The multi-armed bandit (short: ''bandit'' or MAB) can be seen as a set of real distributions B = \{R_1, \dots, R_K\}, each distribution being associated with the rewards delivered by one of the K \in \mathbb{N}^+ levers. Let \mu_1, \dots, \mu_K be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon H is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret \rho after T rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:

\rho = T\mu^* - \sum_{t=1}^T \widehat{r}_t,

where \mu^* is the maximal reward mean, \mu^* = \max_k \{\mu_k\}, and \widehat{r}_t is the reward in round ''t''. A ''zero-regret strategy'' is a strategy whose average regret per round \rho / T tends to zero with probability 1 when the number of played rounds tends to infinity. Intuitively, zero-regret strategies are guaranteed to converge to a (not necessarily unique) optimal strategy if enough rounds are played.
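As a concrete illustration of these definitions, the following minimal Python sketch simulates T rounds against Bernoulli arms and computes the (realized) regret T\mu^* minus the sum of collected rewards. The arm means and the placeholder choose_arm policy are assumptions made purely for illustration.

import random

# Hypothetical Bernoulli arms; the true means are unknown to the policy.
true_means = [0.3, 0.5, 0.7]          # mu_1, ..., mu_K
K, T = len(true_means), 10_000

def choose_arm(t, history):
    """Placeholder policy: pull arms uniformly at random."""
    return random.randrange(K)

history = []                           # list of (arm, reward) pairs
total_reward = 0.0
for t in range(T):
    arm = choose_arm(t, history)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    history.append((arm, reward))
    total_reward += reward

mu_star = max(true_means)
# rho = T * mu_star - sum of collected rewards (realized rather than expected here)
regret = T * mu_star - total_reward
print(f"empirical regret after {T} rounds: {regret:.1f}")

Replacing choose_arm with any of the strategies described later in this article turns the same loop into a benchmark of that strategy's regret.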


Variations

A common formulation is the ''Binary multi-armed bandit'' or ''Bernoulli multi-armed bandit'', which issues a reward of one with probability p, and otherwise a reward of zero. Another formulation of the multi-armed bandit has each arm representing an independent Markov machine. Each time a particular arm is played, the state of that machine advances to a new one, chosen according to the Markov state evolution probabilities. There is a reward depending on the current state of the machine. In a generalization called the "restless bandit problem", the states of non-played arms can also evolve over time. There has also been discussion of systems where the number of choices (about which arm to play) increases over time. Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite (asymptotic) time horizons for both stochastic and non-stochastic arm payoffs.


Bandit strategies

A major breakthrough was the construction of optimal population selection strategies, or policies (that possess uniformly maximum convergence rate to the population with highest mean) in the work described below.


Optimal solutions

In the paper "Asymptotically efficient adaptive allocation rules", Lai and Robbins (following papers of Robbins and his co-workers going back to Robbins in the year 1952) constructed convergent population selection policies that possess the fastest rate of convergence (to the population with highest mean) for the case that the population reward distributions are the one-parameter exponential family. Then, in Katehakis and Robbins simplifications of the policy and the main proof were given for the case of normal populations with known variances. The next notable progress was obtained by Burnetas and Katehakis in the paper "Optimal adaptive policies for sequential allocation problems", where index based policies with uniformly maximum convergence rate were constructed, under more general conditions that include the case in which the distributions of outcomes from each population depend on a vector of unknown parameters. Burnetas and Katehakis (1996) also provided an explicit solution for the important case in which the distributions of outcomes follow arbitrary (i.e., non-parametric) discrete, univariate distributions. Later in "Optimal adaptive policies for Markov decision processes" Burnetas and Katehakis studied the much larger model of Markov Decision Processes under partial information, where the transition law and/or the expected one period rewards may depend on unknown parameters. In this work, the authors constructed an explicit form for a class of adaptive policies with uniformly maximum convergence rate properties for the total expected finite horizon reward under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations. These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett, Ortner Filippi, Cappé, and Garivier, and Honda and Takemura. For Bernoulli multi-armed bandits, Pilarski et al. studied computation methods of deriving fully optimal solutions (not just asymptotically) using dynamic programming in the paper "Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge." Via indexing schemes, lookup tables, and other techniques, this work provided practically applicable optimal solutions for Bernoulli bandits provided that time horizons and numbers of arms did not become excessively large. Pilarski et al. later extended this work in "Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI" to create a method of determining the optimal policy for Bernoulli bandits when rewards may not be immediately revealed following a decision and may be delayed. This method relies upon calculating expected values of reward outcomes which have not yet been revealed and updating posterior probabilities when rewards are revealed. When optimal solutions to multi-arm bandit tasks are used to derive the value of animals' choices, the activity of neurons in the amygdala and ventral striatum encodes the values derived from these policies, and can be used to decode when the animals make exploratory versus exploitative choices. Moreover, optimal policies better predict animals' choice behavior than alternative strategies (described below). 
This suggests that the optimal solutions to multi-arm bandit problems are biologically plausible, despite being computationally demanding. * UCBC (Historical Upper Confidence Bounds with clusters): The algorithm adapts UCB for a new setting such that it can incorporate both clustering and historical information. The algorithm incorporates the historical observations by utilizing both in the computation of the observed mean rewards and the uncertainty term. The algorithm incorporates the clustering information by playing at two levels: first picking a cluster using a UCB-like strategy at each time step, and subsequently picking an arm within the cluster, again using a UCB-like strategy.
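To make the dynamic-programming idea concrete, the sketch below computes a Bayes-optimal finite-horizon policy for a Bernoulli bandit by backward induction over posterior success/failure counts. It is a minimal illustration of the approach, not the specific algorithms of Pilarski et al.; uniform Beta(1,1) priors, two arms, and a small horizon are assumptions, since the state space grows quickly with the horizon.

from functools import lru_cache

K = 2          # number of Bernoulli arms (assumption for this toy example)
HORIZON = 20   # remaining pulls; kept small because the state space grows quickly

@lru_cache(maxsize=None)
def value(counts, h):
    """Expected total future reward with h pulls left.

    counts is a tuple of (successes, failures) per arm under Beta(1, 1) priors.
    """
    if h == 0:
        return 0.0
    best = 0.0
    for i, (s, f) in enumerate(counts):
        p = (s + 1) / (s + f + 2)                 # posterior mean of arm i
        win  = list(counts); win[i]  = (s + 1, f)
        lose = list(counts); lose[i] = (s, f + 1)
        q = p * (1 + value(tuple(win), h - 1)) + (1 - p) * value(tuple(lose), h - 1)
        best = max(best, q)
    return best

def best_arm(counts, h):
    """Arm attaining the maximum in the Bellman recursion above."""
    scores = []
    for i, (s, f) in enumerate(counts):
        p = (s + 1) / (s + f + 2)
        win  = list(counts); win[i]  = (s + 1, f)
        lose = list(counts); lose[i] = (s, f + 1)
        scores.append(p * (1 + value(tuple(win), h - 1)) + (1 - p) * value(tuple(lose), h - 1))
    return max(range(len(counts)), key=scores.__getitem__)

start = tuple((0, 0) for _ in range(K))
print("optimal first arm under the prior:", best_arm(start, HORIZON))

The practical contribution of the cited work lies precisely in taming this exponential state space (indexing schemes, lookup tables) so that such exact solutions remain feasible for realistic horizons.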


Approximate solutions

Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.


Semi-uniform strategies

Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All these strategies have in common a greedy behavior where the ''best'' lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.

* Epsilon-greedy strategy: The best lever is selected for a proportion 1 - \epsilon of the trials, and a lever is selected at random (with uniform probability) for a proportion \epsilon. A typical parameter value might be \epsilon = 0.1, but this can vary widely depending on circumstances and predilections. (A minimal code sketch follows this list.)
* Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. For N trials in total, the exploration phase occupies \epsilon N trials and the exploitation phase (1 - \epsilon) N trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
* Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of \epsilon decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
* Adaptive epsilon-greedy strategy based on value differences (VDBE): Similar to the epsilon-decreasing strategy, except that epsilon is reduced on the basis of the learning progress instead of manual tuning (Tokic, 2010). High fluctuations in the value estimates lead to a high epsilon (high exploration, low exploitation); low fluctuations to a low epsilon (low exploration, high exploitation). Further improvements can be achieved by a softmax-weighted action selection in the case of exploratory actions (Tokic & Palm, 2011).
* Adaptive epsilon-greedy strategy based on Bayesian ensembles (Epsilon-BMC): An adaptive epsilon adaptation strategy for reinforcement learning similar to VDBE, with monotone convergence guarantees. In this framework, the epsilon parameter is viewed as the expectation of a posterior distribution weighting a greedy agent (that fully trusts the learned reward) and a uniform learning agent (that distrusts the learned reward). This posterior is approximated using a suitable Beta distribution under the assumption of normality of observed rewards. In order to address the possible risk of decreasing epsilon too quickly, uncertainty in the variance of the learned reward is also modeled and updated using a normal-gamma model (Gimelfarb et al., 2019).
* Contextual-Epsilon-greedy strategy: Similar to the epsilon-greedy strategy, except that the value of \epsilon is computed with regard to the situation in the experiment process, which lets the algorithm be context-aware. It is based on dynamic exploration/exploitation and can adaptively balance the two aspects by deciding which situation is most relevant for exploration or exploitation, resulting in highly explorative behavior when the situation is not critical and highly exploitative behavior in critical situations.
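Referenced in the epsilon-greedy item above, here is a minimal Python sketch of that strategy. The simulated Bernoulli arms and \epsilon = 0.1 are assumptions chosen only for illustration.

import random

epsilon = 0.1                     # exploration rate; a typical but arbitrary choice
true_means = [0.3, 0.5, 0.7]      # hypothetical Bernoulli arms, unknown to the agent
K, T = len(true_means), 10_000

counts = [0] * K                  # pulls per arm
values = [0.0] * K                # empirical mean reward per arm

for t in range(T):
    if random.random() < epsilon:
        arm = random.randrange(K)                     # explore uniformly at random
    else:
        arm = max(range(K), key=values.__getitem__)   # exploit current best estimate
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print("estimated means:", [round(v, 3) for v in values])

The other semi-uniform strategies differ only in how (or whether) epsilon is scheduled over time, so they can be obtained by replacing the fixed epsilon with a decreasing or adaptive schedule.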


Probability matching strategies

Probability matching strategies reflect the idea that the number of pulls for a given lever should ''match'' its actual probability of being the optimal lever. Probability matching strategies are also known as Thompson sampling or Bayesian Bandits, and are surprisingly easy to implement if one can sample from the posterior for the mean value of each alternative. Probability matching strategies also admit solutions to so-called contextual bandit problems.
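For Bernoulli rewards, sampling from the posterior is particularly simple: each arm keeps a Beta posterior, one sample is drawn per arm, and the arm with the largest sample is played. The sketch below is a minimal illustration under assumed Beta(1,1) priors and a simulated environment.

import random

true_means = [0.3, 0.5, 0.7]      # hypothetical Bernoulli arms
K, T = len(true_means), 10_000

alpha = [1] * K                   # Beta(1, 1) priors: 1 + observed successes
beta  = [1] * K                   # 1 + observed failures

for t in range(T):
    # Sample a plausible mean for each arm from its posterior, play the best sample.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
    arm = max(range(K), key=samples.__getitem__)
    reward = 1 if random.random() < true_means[arm] else 0
    alpha[arm] += reward
    beta[arm]  += 1 - reward

print("posterior means:", [round(alpha[i] / (alpha[i] + beta[i]), 3) for i in range(K)])

Because arms are chosen with the probability that they are currently believed to be optimal, exploration fades automatically as the posteriors concentrate.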


Pricing strategies

Pricing strategies establish a ''price'' for each lever. For example, as illustrated with the POKER algorithm, the price can be the sum of the expected reward plus an estimate of the extra future rewards that will be gained through the additional knowledge. The lever of highest price is always pulled.


Strategies with ethical constraints

*Behavior Constrained Thompson Sampling (BCTS): The authors detail an online agent that learns a set of behavioral constraints by observation and uses these learned constraints as a guide when making decisions in an online setting, while still being reactive to reward feedback. To define this agent, the solution was to adopt a novel extension of the classical contextual multi-armed bandit setting and to provide a new algorithm, called Behavior Constrained Thompson Sampling (BCTS), that allows for online learning while obeying exogenous constraints. The agent learns a constrained policy that implements the observed behavioral constraints demonstrated by a teacher agent, and then uses this constrained policy to guide the reward-based online exploration and exploitation.

These strategies minimize the assignment of any patient to an inferior arm ("physician's duty"). In a typical case, they minimize expected successes lost (ESL), that is, the expected number of favorable outcomes that were missed because of assignment to an arm later proved to be inferior. Another version minimizes resources wasted on any inferior, more expensive, treatment.


Contextual bandit

A useful generalization of the multi-armed bandit is the contextual multi-armed bandit. At each iteration an agent still has to choose between arms, but it also sees a d-dimensional feature vector (the context vector), which it can use together with the rewards of the arms played in the past to choose the arm to play. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other, so that it can predict the next best arm to play by looking at the feature vectors.


Approximate solutions for contextual bandit

Many strategies exist that provide an approximate solution to the contextual bandit problem, and can be put into two broad categories detailed below.


Online linear bandits

* LinUCB ''(Upper Confidence Bound)'' algorithm: the authors assume a linear dependency between the expected reward of an action and its context and model the representation space using a set of linear predictors. (A minimal sketch follows this list.)
* LinRel (Linear Associative Reinforcement Learning) algorithm: similar to LinUCB, but utilizes singular value decomposition rather than ridge regression to obtain an estimate of confidence.
* HLINUCBC (Historic LINUCB with clusters): proposed in the paper, extends the LinUCB idea with both historical and clustering information.
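As referenced in the LinUCB item above, the sketch below shows the core idea under common simplifying assumptions: disjoint per-arm ridge-regression models, a hand-picked exploration parameter alpha, and a synthetic linear environment. It is a minimal illustration of the upper-confidence-bound construction, not the exact published algorithm.

import numpy as np

d, K, alpha = 5, 3, 1.0           # context dimension, arms, exploration parameter

A = [np.eye(d) for _ in range(K)]          # per-arm ridge-regression Gram matrices
b = [np.zeros(d) for _ in range(K)]        # per-arm reward-weighted context sums

def select(x):
    """Pick the arm with the largest upper confidence bound for context x."""
    scores = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                            # ridge estimate for arm a
        ucb = theta @ x + alpha * np.sqrt(x @ A_inv @ x)
        scores.append(ucb)
    return int(np.argmax(scores))

def update(arm, x, reward):
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Illustrative loop with a synthetic linear environment (an assumption).
rng = np.random.default_rng(0)
theta_true = rng.normal(size=(K, d))
for t in range(2000):
    x = rng.normal(size=d)
    arm = select(x)
    reward = theta_true[arm] @ x + rng.normal(scale=0.1)
    update(arm, x, reward)

The confidence width sqrt(x^T A^{-1} x) shrinks along directions of context space that have been observed often, which is what drives exploration toward poorly understood contexts.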


Online non-linear bandits

* UCBogram algorithm: The nonlinear reward functions are estimated using a piecewise constant estimator called a ''regressogram'' in nonparametric regression. Then, UCB is employed on each constant piece. Successive refinements of the partition of the context space are scheduled or chosen adaptively.
* Generalized linear algorithms: The reward distribution follows a generalized linear model, an extension to linear bandits.
* NeuralBandit algorithm: In this algorithm several neural networks are trained to model the value of rewards knowing the context, and a multi-expert approach is used to choose online the parameters of multi-layer perceptrons.
* KernelUCB algorithm: a kernelized non-linear version of LinUCB, with efficient implementation and finite-time analysis.
* Bandit Forest algorithm: a random forest is built and analyzed with respect to the random forest built knowing the joint distribution of contexts and rewards.
* Oracle-based algorithm: The algorithm reduces the contextual bandit problem into a series of supervised learning problems, and does not rely on a typical realizability assumption on the reward function.


Constrained contextual bandit

* Context Attentive Bandits or Contextual Bandit with Restricted Context: The authors consider a novel formulation of the multi-armed bandit model, called contextual bandit with restricted context, where only a limited number of features can be accessed by the learner at every iteration. This formulation is motivated by different online problems arising in clinical trials, recommender systems, and attention modeling. Here, they adapt the standard multi-armed bandit algorithm known as Thompson Sampling to take advantage of the restricted context setting, and propose two novel algorithms, called Thompson Sampling with Restricted Context (TSRC) and Windows Thompson Sampling with Restricted Context (WTSRC), for handling stationary and nonstationary environments, respectively.

In practice, there is usually a cost associated with the resource consumed by each action, and the total cost is limited by a budget in many applications such as crowdsourcing and clinical trials. Constrained contextual bandit (CCB) is a model that considers both the time and budget constraints in a multi-armed bandit setting. A. Badanidiyuru et al. first studied contextual bandits with budget constraints, also referred to as Resourceful Contextual Bandits, and showed that an O(\sqrt{T}) regret is achievable. However, their work focuses on a finite set of policies, and the algorithm is computationally inefficient. A simple algorithm with logarithmic regret is proposed in:

* UCB-ALP algorithm: UCB-ALP is a simple algorithm that combines the UCB method with an Adaptive Linear Programming (ALP) algorithm, and can be easily deployed in practical systems. It is the first work that shows how to achieve logarithmic regret in constrained contextual bandits. Although it is devoted to a special case with a single budget constraint and fixed cost, the results shed light on the design and analysis of algorithms for more general CCB problems.


Adversarial bandit

Another variant of the multi-armed bandit problem is called the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration, an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest generalizations of the bandit problem, as it removes all assumptions about the reward distribution, and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems.


Example: Iterated prisoner's dilemma

An example often considered for adversarial bandits is the iterated prisoner's dilemma. In this example, each adversary has two arms to pull: they can either Deny or Confess. Standard stochastic bandit algorithms do not work very well with these iterations. For example, if the opponent cooperates in the first 100 rounds, defects for the next 200, then cooperates in the following 300, etc., then algorithms such as UCB won't be able to react very quickly to these changes. This is because, after a certain point, sub-optimal arms are rarely pulled, in order to limit exploration and focus on exploitation. When the environment changes, the algorithm is unable to adapt, or may not even detect the change.


Approximate solutions


Exp3

EXP3 is a popular algorithm for adversarial multi-armed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there has been increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multi-armed bandits with side information [Seldin et al., 2011] and to multi-armed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. The paper presented an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving "logarithmic" regret in stochastic environments.


Algorithm

Parameters: Real \gamma \in (0,1]

Initialisation: \omega_i(1) = 1 for i = 1, \dots, K

For each t = 1, 2, \dots, T
 1. Set p_i(t) = (1 - \gamma)\frac{\omega_i(t)}{\sum_{j=1}^K \omega_j(t)} + \frac{\gamma}{K}, for i = 1, \dots, K
 2. Draw i_t randomly according to the probabilities p_1(t), \dots, p_K(t)
 3. Receive reward x_{i_t}(t) \in [0,1]
 4. For j = 1, \dots, K set:
    \hat{x}_j(t) = \begin{cases} x_j(t)/p_j(t) & \text{if } j = i_t \\ 0 & \text{otherwise} \end{cases}
    \omega_j(t+1) = \omega_j(t) \exp(\gamma \hat{x}_j(t)/K)
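A direct transcription of the pseudocode above into Python; the Bernoulli reward simulator and the value \gamma = 0.07 are assumptions used only to exercise the algorithm.

import math, random

def exp3(K, T, gamma, draw_reward):
    """Run Exp3; draw_reward(arm, t) must return a reward in [0, 1]."""
    w = [1.0] * K                                   # omega_i(1) = 1
    for t in range(1, T + 1):
        total = sum(w)
        p = [(1 - gamma) * w[i] / total + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=p)[0]
        x = draw_reward(i_t, t)                     # observed reward in [0, 1]
        x_hat = x / p[i_t]                          # unbiased importance-weighted estimate
        w[i_t] *= math.exp(gamma * x_hat / K)       # only the played arm's weight changes
    return w

# Hypothetical stochastic environment used only for illustration.
means = [0.3, 0.5, 0.7]
weights = exp3(K=3, T=10_000, gamma=0.07,
               draw_reward=lambda arm, t: 1.0 if random.random() < means[arm] else 0.0)
print("final weights:", [round(x, 2) for x in weights])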


Explanation

Exp3 mixes exploitation and exploration: with weight 1 - \gamma it samples arms in proportion to their current weights, preferring arms with higher weights (exploitation), and with weight \gamma it samples uniformly at random (exploration). After receiving the reward, the weight of the played arm is updated; the exponential update significantly increases the weight of good arms.


Regret analysis

The (external) regret of the Exp3 algorithm is at most O(\sqrt{KT\log K}).


Follow the perturbed leader (FPL) algorithm


Algorithm

Parameters: Real \eta

Initialisation: \forall i: R_i(1) = 0

For each t = 1, 2, \dots, T
 1. For each arm, generate random noise from an exponential distribution: \forall i: Z_i(t) \sim \mathrm{Exp}(\eta)
 2. Pull arm I(t) = \arg\max_i \{R_i(t) + Z_i(t)\} (add noise to each arm and pull the one with the highest value)
 3. Update value: R_{I(t)}(t+1) = R_{I(t)}(t) + x_{I(t)}(t); the other values remain the same
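A corresponding Python sketch of Follow the Perturbed Leader, using the same hypothetical Bernoulli environment as before. It illustrates the perturbation idea under bandit feedback; the reading of Exp(\eta) as an exponential distribution with rate \eta is an assumption, since the pseudocode does not fix the parametrisation.

import random

def fpl(K, T, eta, draw_reward):
    """Follow the perturbed leader: perturb cumulative rewards, pull the argmax."""
    R = [0.0] * K                                        # cumulative observed reward per arm
    for t in range(1, T + 1):
        Z = [random.expovariate(eta) for _ in range(K)]  # exponential noise, rate eta (assumption)
        arm = max(range(K), key=lambda i: R[i] + Z[i])
        R[arm] += draw_reward(arm, t)                    # only the pulled arm's total is updated
    return R

means = [0.3, 0.5, 0.7]
totals = fpl(K=3, T=10_000, eta=1.0,
             draw_reward=lambda arm, t: 1.0 if random.random() < means[arm] else 0.0)
print("cumulative rewards per arm:", [round(x, 1) for x in totals])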


Explanation

We follow the arm that we think has had the best performance so far, adding exponential noise to the accumulated rewards to provide exploration.


Exp3 vs FPL

Exp3 maintains explicit sampling probabilities through its weights, which underpins its regret guarantee but requires computing the exponential weight updates; FPL avoids computing sampling probabilities altogether and works directly on perturbed cumulative rewards, making it computationally lighter, although the standard FPL does not enjoy comparable theoretical guarantees in the bandit setting.


Infinite-armed bandit

In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable K. In the infinite-armed case, introduced by Agrawal (1995), the "arms" are a continuous variable in K dimensions.


Non-stationary bandit

This framework refers to the multi-armed bandit problem in a ''non-stationary'' setting (i.e., in the presence of concept drift). In the non-stationary setting, it is assumed that the expected reward for an arm k can change at every time step t \in \mathcal{T}: \mu_{t+1}^k \neq \mu_t^k. Thus, \mu_t^k no longer represents the whole sequence of expected (stationary) rewards for arm k. Instead, \mu^k denotes the sequence of expected rewards for arm k, defined as \mu^k = \{\mu_t^k\}_{t=1}^T.

A ''dynamic oracle'' represents the optimal policy to be compared with other policies in the non-stationary setting. The dynamic oracle optimises the expected reward at each step t \in \mathcal{T} by always selecting the best arm, with expected reward \mu_t^*. Thus, the cumulative expected reward \mathcal{D}(T) for the dynamic oracle at final time step T is defined as:

\mathcal{D}(T) = \sum_{t=1}^T \mu_t^*

Hence, the ''regret'' \rho^\pi(T) for policy \pi is computed as the difference between \mathcal{D}(T) and the cumulative expected reward at step T for policy \pi:

\rho^\pi(T) = \sum_{t=1}^T \mu_t^* - \mathbb{E}_\pi^\mu \left[ \sum_{t=1}^T r_t \right] = \mathcal{D}(T) - \mathbb{E}_\pi^\mu \left[ \sum_{t=1}^T r_t \right]

Garivier and Moulines derive some of the first results with respect to bandit problems where the underlying model can change during play. A number of algorithms were presented to deal with this case, including Discounted UCB and Sliding-Window UCB. A similar approach based on the Thompson Sampling algorithm is the f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS) proposed by Cavenaghi et al. The f-dsw TS algorithm exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. Another work by Burtini et al. introduces a weighted least squares Thompson sampling approach (WLS-TS), which proves beneficial in both the known and unknown non-stationary cases (Burtini, Loeppky and Lawrence, "Improving Online Marketing Experiments with Drifting Multi-armed Bandits", 2015). For the known non-stationary case, the authors produce an alternative solution, a variant of UCB named Adjusted Upper Confidence Bound (A-UCB), which assumes a stochastic model and provides upper bounds on the regret.
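As an illustration of the sliding-window idea mentioned above, the sketch below restricts the UCB statistics to the last \tau observations, so that old rewards eventually stop influencing the index. The window length, confidence bonus, and drifting environment are assumptions; this is a simplified rendering of the sliding-window approach rather than the exact Sliding-Window UCB algorithm of Garivier and Moulines.

import math, random
from collections import deque

K, T, tau = 3, 20_000, 1_000        # arms, horizon, window length (assumptions)
history = deque()                    # (arm, reward) pairs, at most tau of them

def drifting_mean(arm, t):
    """Hypothetical non-stationary environment: the arm means swap halfway through."""
    base = [0.3, 0.5, 0.7]
    return base[arm] if t < T // 2 else base[::-1][arm]

for t in range(1, T + 1):
    # Window statistics: counts and reward sums over the last tau pulls only.
    counts = [0] * K
    sums = [0.0] * K
    for arm, r in history:
        counts[arm] += 1
        sums[arm] += r

    def index(i):
        if counts[i] == 0:
            return float("inf")                       # force each arm to be tried
        bonus = math.sqrt(2 * math.log(min(t, tau)) / counts[i])
        return sums[i] / counts[i] + bonus

    arm = max(range(K), key=index)
    reward = 1.0 if random.random() < drifting_mean(arm, t) else 0.0
    history.append((arm, reward))
    if len(history) > tau:
        history.popleft()                             # forget observations outside the window

Because the index only ever reflects the most recent \tau rounds, the policy re-explores after the change point instead of remaining locked onto the formerly best arm.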


Other variants

Many variants of the problem have been proposed in recent years.


Dueling bandit

The dueling bandit variant was introduced by Yue et al. (2012) to model the exploration-versus-exploitation tradeoff for relative feedback. In this variant the gambler is allowed to pull two levers at the same time, but only receives binary feedback telling which lever provided the better reward. The difficulty of this problem stems from the fact that the gambler has no way of directly observing the reward of their actions. The earliest algorithms for this problem are Interleaved Filter and Beat-the-Mean. The relative feedback of dueling bandits can also lead to voting paradoxes. A solution is to take the Condorcet winner as a reference. More recently, researchers have generalized algorithms from traditional MAB to dueling bandits: Relative Upper Confidence Bounds (RUCB), Relative EXponential weighing (REX3), Copeland Confidence Bounds (CCB), Relative Minimum Empirical Divergence (RMED), and Double Thompson Sampling (DTS).


Collaborative bandit

The collaborative filtering bandits (i.e., COFIBA) were introduced by Li, Karatzoglou and Gentile (SIGIR 2016). Classical collaborative filtering and content-based filtering methods try to learn a static recommendation model given training data. These approaches are far from ideal in highly dynamic recommendation domains such as news recommendation and computational advertisement, where the set of items and users is very fluid. In this work, they investigate an adaptive clustering technique for content recommendation based on exploration-exploitation strategies in contextual multi-armed bandit settings. Their algorithm (COFIBA, pronounced "Coffee Bar") takes into account the collaborative effects that arise due to the interaction of the users with the items, by dynamically grouping users based on the items under consideration and, at the same time, grouping items based on the similarity of the clusterings induced over the users. The resulting algorithm thus takes advantage of preference patterns in the data in a way akin to collaborative filtering methods. They provide an empirical analysis on medium-size real-world datasets, showing scalability and increased prediction performance (as measured by click-through rate) over state-of-the-art methods for clustering bandits. They also provide a regret analysis within a standard linear stochastic noise setting.


Combinatorial bandit

The Combinatorial Multiarmed Bandit (CMAB) problem arises when, instead of a single discrete variable to choose from, an agent needs to choose values for a set of variables. Assuming each variable is discrete, the number of possible choices per iteration is exponential in the number of variables. Several CMAB settings have been studied in the literature, from settings where the variables are binary to more general settings where each variable can take an arbitrary set of values.


Bandits with temporal structure

Multi-armed bandits are mainly studied under two extreme settings known as stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines and marketing and advertising, in which rewards change stochastically over time. Motivated by this, the authors introduce and study dynamic MABs with stochastic temporal structure. In this problem, the expected reward of each arm follows an auto-regressive (AR) model that governs the temporal structure of the rewards. Due to the dynamic nature of the rewards, simple "explore and commit" policies fail, as all arms have to be explored continuously over time.


See also

* Gittins index – a powerful, general strategy for analyzing bandit problems
* Greedy algorithm
* Optimal stopping
* Search theory
* Stochastic scheduling


References



Further reading



External links


* MABWiser: open-source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies, with built-in parallelization and simulation capability.
* PyMaBandits: open-source implementation of bandit strategies in Python and Matlab.
* Contextual: open-source R package facilitating the simulation and evaluation of both context-free and contextual multi-armed bandit policies.
* bandit.sourceforge.net Bandit project: open-source implementation of bandit strategies.
* Banditlib: open-source implementation of bandit strategies in C++.
* Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
* Tutorial: Introduction to Bandits: Algorithms and Theory. Part 1, Part 2.
* A classic example (with known answer) of the exploitation vs. exploration tradeoff.
* S. Bubeck and N. Cesa-Bianchi. A Survey on Bandits.
* A Survey on Contextual Multi-armed Bandits: a survey/tutorial for contextual bandits.
* Animated, interactive plots illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.