Mixture of experts

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. MoE represents a form of ensemble learning. They were also called committee machines.


Basic theory

MoE always has the following components, but they are implemented and combined differently according to the problem being solved:

* Experts f_1, ..., f_n, each taking the same input x, and producing outputs f_1(x), ..., f_n(x).
* A weighting function (also known as a gating function) w, which takes input x and produces a vector of outputs (w(x)_1, ..., w(x)_n). This may or may not be a probability distribution, but in both cases, its entries are non-negative.
* \theta = (\theta_0, \theta_1, ..., \theta_n) is the set of parameters. The parameter \theta_0 is for the weighting function. The parameters \theta_1, \dots, \theta_n are for the experts.
* Given an input x, the mixture of experts produces a single output by combining f_1(x), ..., f_n(x) according to the weights w(x)_1, ..., w(x)_n in some way, usually by f(x) = \sum_i w(x)_i f_i(x).

Both the experts and the weighting function are trained by minimizing some loss function, generally via gradient descent. There is much freedom in choosing the precise form of experts, the weighting function, and the loss function.
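The following sketch in Python/NumPy shows how these components fit together. It is a minimal illustration, not a reference implementation: the linear-softmax gate and the linear experts are assumptions chosen for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_output(x, experts, gate_params):
    """Combine expert outputs with a softmax gating function.

    Each expert is any callable f_i(x); the weighting function here is
    assumed to be a linear-softmax gate w(x) = softmax(K x + b).
    """
    K, b = gate_params                      # gate parameters (theta_0)
    w = softmax(K @ x + b)                  # non-negative weights, sum to 1
    outputs = np.stack([f(x) for f in experts])
    return w @ outputs                      # f(x) = sum_i w(x)_i f_i(x)

# Usage: three linear experts on a 4-dimensional input.
rng = np.random.default_rng(0)
experts = [lambda x, A=rng.normal(size=(2, 4)): A @ x for _ in range(3)]
gate_params = (rng.normal(size=(3, 4)), np.zeros(3))
print(moe_output(rng.normal(size=4), experts, gate_params))  # output of shape (2,)
```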


Meta-pi network

The meta-pi network, reported by Hampshire and Waibel, uses f(x) = \sum_i w(x)_i f_i(x) as the output. The model is trained by performing gradient descent on the mean-squared error loss L := \frac{1}{N} \sum_k \|y_k - f(x_k)\|^2. The experts may be arbitrary functions. In their original publication, they were solving the problem of classifying phonemes in a speech signal from 6 different Japanese speakers, 2 female and 4 male. They trained 6 experts, each being a "time-delayed neural network" (essentially a multilayered convolutional network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.


Adaptive mixtures of local experts

The adaptive mixtures of local experts uses a Gaussian mixture model. Each expert simply predicts a Gaussian distribution, and totally ignores the input. Specifically, the i-th expert predicts that the output is y \sim N(\mu_i, I), where \mu_i is a learnable parameter. The weighting function is a linear-softmax function:

w(x)_i = \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}}

The mixture of experts predicts that the output is distributed according to the log-probability density function:

\ln f_\theta(y \mid x) = \ln\left[\sum_i w(x)_i N(y \mid \mu_i, I)\right] = \ln\left[(2\pi)^{-d/2} \sum_i \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}} e^{-\frac{1}{2}\|y - \mu_i\|^2}\right]

It is trained by maximum likelihood estimation, that is, gradient ascent on \ln f_\theta(y \mid x). The gradient for the i-th expert is

\nabla_{\mu_i} \ln f_\theta(y \mid x) = \frac{w(x)_i N(y \mid \mu_i, I)}{\sum_j w(x)_j N(y \mid \mu_j, I)} (y - \mu_i)

and the gradient for the weighting function is

\nabla_{(k_i, b_i)} \ln f_\theta(y \mid x) = \begin{bmatrix} x \\ 1 \end{bmatrix} \frac{w(x)_i}{f_\theta(y \mid x)} \left(N(y \mid \mu_i, I) - f_\theta(y \mid x)\right)

For each input-output pair (x, y), the weighting function is changed to increase the weight on all experts that performed above average, and decrease the weight on all experts that performed below average. This encourages the weighting function to learn to select only the experts that make the right predictions for each input. The i-th expert is changed to make its prediction closer to y, but the amount of change is proportional to w(x)_i N(y \mid \mu_i, I). This has a Bayesian interpretation. Given input x, the prior probability that expert i is the right one is w(x)_i, and N(y \mid \mu_i, I) is the likelihood of evidence y. So, \frac{w(x)_i N(y \mid \mu_i, I)}{\sum_j w(x)_j N(y \mid \mu_j, I)} is the posterior probability for expert i, and so the rate of change for the i-th expert is proportional to its posterior probability.

In words, the experts that, in hindsight, seemed like the good experts to consult, are asked to learn on the example. The experts that, in hindsight, were not, are left alone. The combined effect is that the experts become specialized: suppose two experts are both good at predicting a certain kind of input, but one is slightly better; then the weighting function would eventually learn to favor the better one. After that happens, the lesser expert is unable to obtain a high gradient signal, and becomes even worse at predicting that kind of input. Conversely, the lesser expert can become better at predicting other kinds of input, and is increasingly pulled away into another region. This has a positive feedback effect, causing each expert to move apart from the rest and take care of a local region alone (thus the name "''local'' experts").
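A minimal NumPy sketch of one gradient-ascent step follows, assuming the linear-softmax gate with parameters (K, b) and Gaussian experts with means mu as above; the learning rate and array shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gradient_step(x, y, mu, K, b, lr=0.1):
    """One gradient-ascent step on ln f_theta(y | x) for the adaptive
    mixture of local experts (a sketch). Shapes assumed:
    mu (n, d), K (n, dim_x), b (n,), x (dim_x,), y (d,)."""
    w = softmax(K @ x + b)                        # gate weights w(x)_i
    # Gaussian likelihoods N(y | mu_i, I) up to the shared (2*pi)^(-d/2) factor,
    # which cancels in every ratio below.
    lik = np.exp(-0.5 * np.sum((y - mu) ** 2, axis=1))
    post = w * lik / np.sum(w * lik)              # posterior over experts
    # Expert update: each mu_i moves toward y in proportion to its posterior.
    mu = mu + lr * post[:, None] * (y - mu)
    # Gate update: raise the weight of experts whose likelihood beat the mixture.
    mix = np.sum(w * lik)                         # proportional to f_theta(y | x)
    grad_logits = w * (lik - mix) / mix           # d ln f / d(k_i^T x + b_i)
    K = K + lr * np.outer(grad_logits, x)
    b = b + lr * grad_logits
    return mu, K, b
```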


Hierarchical MoE

Hierarchical mixtures of experts uses multiple levels of gating in a tree. Each gating is a probability distribution over the next level of gatings, and the experts are on the leaf nodes of the tree. They are similar to decision trees. For example, a 2-level hierarchical MoE would have a first order gating function w_i, second order gating functions w_{ij}, and experts f_{ij}. The total prediction is then \sum_i w_i(x) \sum_j w_{ij}(x) f_{ij}(x).
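A sketch of the 2-level computation in Python/NumPy; the linear-softmax form of both gating levels is an assumption made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_moe(x, top_gate, sub_gates, experts):
    """Two-level hierarchical MoE (a sketch).

    top_gate:  (K1, b1) parameters of the first-level gate w_i.
    sub_gates: list of (K2, b2), one second-level gate per branch.
    experts:   experts[i][j] is the callable f_ij on leaf (i, j).
    Returns sum_i w_i(x) * sum_j w_ij(x) * f_ij(x).
    """
    K1, b1 = top_gate
    w1 = softmax(K1 @ x + b1)
    out = 0.0
    for i, (K2, b2) in enumerate(sub_gates):
        w2 = softmax(K2 @ x + b2)
        branch = sum(w2[j] * experts[i][j](x) for j in range(len(w2)))
        out = out + w1[i] * branch
    return out
```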


Variants

The mixture of experts, being similar to the Gaussian mixture model, can also be trained by the expectation-maximization algorithm, just like Gaussian mixture models. Specifically, during the expectation step, the "burden" for explaining each data point is assigned over the experts, and during the maximization step, the experts are trained to improve the explanations they got a high burden for, while the gate is trained to improve its burden assignment. This can converge faster than gradient ascent on the log-likelihood.

The choice of gating function is often softmax. Other than that, gating may use Gaussian distributions and exponential families.

Instead of performing a weighted sum of all the experts, in hard MoE, only the highest ranked expert is chosen. That is, f(x) = f_{\arg\max_i w(x)_i}(x). This can accelerate training and inference time.

The experts can use more general forms of multivariate Gaussian distributions. For example, one proposal uses f_i(y \mid x) = N(y \mid A_i x + b_i, \Sigma_i), where A_i, b_i, \Sigma_i are learnable parameters. In words, each expert learns to do linear regression, with a learnable uncertainty estimate.

One can use experts other than Gaussian distributions. For example, one can use the Laplace distribution, or Student's t-distribution. For binary classification, logistic regression experts have also been proposed, with

f_i(y \mid x) = \begin{cases} \frac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 0 \\ 1 - \frac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 1 \end{cases}

where \beta_i, \beta_{i,0} are learnable parameters. This was later generalized for multi-class classification, with multinomial logistic regression experts.

One paper proposed mixture of softmaxes for autoregressive language modelling. Specifically, consider a language model that, given a previous text c, predicts the next word x. The network encodes the text into a vector v_c, and predicts the probability distribution of the next word as \mathrm{softmax}(v_c W) for an embedding matrix W. In mixture of softmaxes, the model outputs multiple vectors v_{c,1}, \dots, v_{c,n}, and predicts the next word as \sum_{i=1}^n p_i \; \mathrm{softmax}(v_{c,i} W_i), where p_i is a probability distribution produced by a linear-softmax operation on the activations of the hidden neurons within the model. The original paper demonstrated its effectiveness for recurrent neural networks. This was later found to work for Transformers as well.
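The mixture-of-softmaxes computation can be sketched as follows in Python/NumPy. The tanh projection used to produce the per-component vectors v_{c,i}, and the use of separate matrices W_i, are assumptions of this sketch rather than details confirmed by the text above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_of_softmaxes(h, proj, Ws, gate):
    """Mixture-of-softmaxes next-word distribution (a sketch).

    h:    hidden state of the language model, shape (d,)
    proj: list of matrices mapping h to per-component contexts v_{c,i}
    Ws:   list of embedding matrices W_i, each of shape (d, vocab)
    gate: matrix producing the linear-softmax mixture weights p_i
    Returns sum_i p_i * softmax(v_{c,i} W_i), a distribution over the vocab.
    """
    p = softmax(gate @ h)                    # mixture weights p_i
    dist = np.zeros(Ws[0].shape[1])
    for p_i, P_i, W_i in zip(p, proj, Ws):
        v_ci = np.tanh(P_i @ h)              # per-component context vector (assumed tanh)
        dist += p_i * softmax(v_ci @ W_i)
    return dist                              # entries are non-negative and sum to 1
```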


Deep learning

The previous section described MoE as it was used before the era of deep learning. After deep learning, MoE found applications in running the largest models, as a simple way to perform ''conditional computation'': only parts of the model are used, the parts chosen according to what the input is.

The earliest paper that applies MoE to deep learning dates back to 2013; it proposed to use a different gating network at each layer in a deep neural network. Specifically, each gating is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network. Since the output from the gating is not sparse, all expert outputs are needed, and no conditional computation is performed.

The key goal when using MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE. In classical MoE, the output for each query is a weighted sum of ''all'' experts' outputs. In deep learning MoE, the output for each query can only involve a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.


Sparsely-gated MoE layer

The sparsely-gated MoE layer, published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating. Similar to the previously proposed hard MoE, they achieve sparsity by a weighted sum of only the top-k experts, instead of the weighted sum of all of them. Specifically, in an MoE layer, there are feedforward networks f_1, ..., f_n, and a gating network w. The gating network is defined by w(x) = \mathrm{softmax}(\mathrm{top}_k(W x + \text{noise})), where \mathrm{top}_k is a function that keeps the top-k entries of a vector the same, but sets all other entries to -\infty. The addition of noise helps with load balancing.

The choice of k is a hyperparameter that is chosen according to application. Typical values are k = 1, 2. The k = 1 version is also called the Switch Transformer. The original Switch Transformer was applied to a T5 language model.

As a demonstration, they trained a series of models for machine translation with alternating layers of MoE and LSTM, and compared them with deep LSTM models. Table 3 of their paper shows that the MoE models used less inference-time compute, despite having 30x more parameters.
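A sketch of the noisy top-k gate and the resulting conditional computation in Python/NumPy. The softplus-scaled Gaussian noise is one common choice and an assumption of this sketch; the array shapes are likewise illustrative.

```python
import numpy as np

def sparse_moe_layer(x, experts, W_gate, W_noise, k=2, rng=None):
    """Sparsely-gated MoE forward pass (a sketch of the noisy top-k gate).

    Keeps the top-k gate logits, sets the rest to -inf, then applies softmax,
    so that w(x) = softmax(top_k(W x + noise)) and only k experts run.
    """
    rng = rng or np.random.default_rng()
    logits = W_gate @ x
    # Trainable noise, here assumed to be standard Gaussian scaled by softplus(W_noise @ x).
    logits = logits + rng.standard_normal(len(logits)) * np.log1p(np.exp(W_noise @ x))
    top_k = np.argsort(logits)[-k:]               # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    masked[top_k] = logits[top_k]
    w = np.exp(masked - masked[top_k].max())      # exp(-inf) = 0 for masked entries
    w = w / w.sum()                               # softmax over the kept entries
    # Only the k selected experts are actually evaluated (conditional computation).
    return sum(w[i] * experts[i](x) for i in top_k)
```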


Load balancing

Vanilla MoE tends to have issues with load balancing: some experts are consulted often, while others rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions. This was improved by the Switch Transformer into a single auxiliary loss function. Specifically, let n be the number of experts; then for a given batch of queries \{x_1, x_2, ..., x_T\}, the auxiliary loss for the batch is

n \sum_{i=1}^n f_i P_i

Here, f_i = \frac{1}{T} \#(\text{queries that chose expert } i) is the fraction of tokens that chose expert i, and P_i = \frac{1}{T} \sum_{j=1}^T w(x_j)_i is the fraction of weight placed on expert i. This loss is minimized at 1, precisely when every expert has equal weight 1/n in all situations.

Researchers at DeepSeek designed a variant of MoE with "shared experts" that are always queried, and "routed experts" that might not be. They found that standard load balancing encourages the experts to be equally consulted, but this then causes experts to replicate the same core capacity, such as English grammar. They proposed the shared experts to learn core capacities that are often used, and the routed experts to learn the peripheral capacities that are rarely used. They also proposed an "auxiliary-loss-free load balancing strategy", which does not use an auxiliary loss. Instead, each expert i has an extra "expert bias" b_i. If an expert is being neglected, its bias increases, and vice versa. During token assignment, each token picks the top-k experts, but with the bias added in. That is:

f(x) = \sum_{i \in \mathrm{top}_k(w(x)_1 + b_1, \dots, w(x)_n + b_n)} w(x)_i f_i(x)

Note that the expert bias matters for picking the experts, but not for adding up the responses from the experts.
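A sketch of the single auxiliary loss in Python/NumPy; it assumes top-1 routing when counting which expert each token "chose", as in the Switch Transformer.

```python
import numpy as np

def load_balancing_loss(gate_probs):
    """Switch-Transformer-style auxiliary loss over a batch (a sketch).

    gate_probs: array of shape (T, n) of routing probabilities w(x_j)_i.
    f_i = fraction of tokens whose top-1 expert is i,
    P_i = average routing probability assigned to expert i,
    loss = n * sum_i f_i * P_i, minimized at 1 under a balanced assignment.
    """
    T, n = gate_probs.shape
    chosen = gate_probs.argmax(axis=1)               # top-1 expert per token
    f = np.bincount(chosen, minlength=n) / T
    P = gate_probs.mean(axis=0)
    return n * np.sum(f * P)

# Usage: 8 tokens and 4 experts, each expert clearly chosen by 2 tokens.
probs = np.full((8, 4), 0.1)
probs[np.arange(8), np.arange(8) % 4] = 0.7
print(load_balancing_loss(probs))  # 1.0 when balanced, larger when skewed
```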


Capacity factor

Suppose there are n experts in a layer. For a given batch of queries \{x_1, x_2, ..., x_T\}, each query is routed to one or more experts. For example, if each query is routed to one expert as in Switch Transformers, and if the experts are load-balanced, then each expert should expect on average T/n queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches, one expert might be underworked, while in other batches it would be overworked. Since the inputs cannot move through the layer until every expert in the layer has finished the queries it is assigned, load balancing is important. The capacity factor is sometimes used to enforce a hard constraint on load balancing: each expert is only allowed to process up to c \cdot T/n queries in a batch. The ST-MoE report found c \in [1.25, 2] to work well in practice.
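A sketch of the hard capacity constraint in Python/NumPy; the first-come-first-served tie-breaking and the ceiling rounding are illustrative assumptions.

```python
import numpy as np

def route_with_capacity(chosen_expert, n_experts, capacity_factor=1.25):
    """Enforce a per-expert capacity of c * T / n tokens (a sketch).

    chosen_expert: array of shape (T,) with each token's top-1 expert.
    Returns a boolean mask of tokens that are kept; tokens routed to an
    expert that is already full are "dropped" and pass through the layer
    unchanged via the residual connection.
    """
    T = len(chosen_expert)
    capacity = int(np.ceil(capacity_factor * T / n_experts))
    counts = np.zeros(n_experts, dtype=int)
    kept = np.zeros(T, dtype=bool)
    for t, e in enumerate(chosen_expert):
        if counts[e] < capacity:
            counts[e] += 1
            kept[t] = True
    return kept

# Usage: 8 tokens, 4 experts, so each expert may take ceil(1.25 * 8 / 4) = 3 tokens.
print(route_with_capacity(np.array([0, 0, 0, 0, 1, 2, 3, 3]), 4))
# -> [ True  True  True False  True  True  True  True]
```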


Routing

In the original sparsely-gated MoE, only the top-k experts are queried, and their outputs are weighted-summed. There are other methods. Generally speaking, routing is an assignment problem: how to assign tokens to experts such that a variety of constraints are followed (such as throughput, load balancing, etc.)?

There are typically three classes of routing algorithm: the experts choose the tokens ("expert choice"), the tokens choose the experts (the original sparsely-gated MoE), and a global assigner matching experts and tokens. During inference, the MoE works over a large batch of tokens at any time. If the tokens were to choose the experts, then some experts might get few tokens, while a few experts get so many tokens that it exceeds their maximum batch size, so they would have to ignore some of the tokens. Similarly, if the experts were to choose the tokens, then some tokens might not be picked by any expert. This is the "token drop" problem. Dropping a token is not necessarily a serious problem, since in Transformers, due to residual connections, a "dropped" token does not disappear; instead, its vector representation simply passes through the feedforward layer without change.

Other approaches include solving the assignment as a constrained linear programming problem, and using reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, like in RL). The token-expert match may involve no learning ("static routing"): it can be done by a deterministic hash function or a random number generator.
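As an illustration of the first class, an "expert choice" router can be sketched as follows in Python/NumPy; the token-expert affinity matrix and the fixed per-expert capacity are assumptions of this sketch.

```python
import numpy as np

def expert_choice_routing(scores, capacity):
    """"Expert choice" routing (a sketch): each expert picks its own
    top-`capacity` tokens by affinity score, so no expert can be overloaded,
    but some tokens may be picked by no expert at all ("token drop").

    scores: array of shape (T, n) of token-expert affinities.
    Returns a list of length n with the token indices assigned to each expert.
    """
    assignments = []
    for i in range(scores.shape[1]):
        top_tokens = np.argsort(scores[:, i])[-capacity:]   # this expert's picks
        assignments.append(sorted(top_tokens.tolist()))
    return assignments

# Usage: 6 tokens, 2 experts, each expert takes its 3 highest-affinity tokens.
rng = np.random.default_rng(0)
print(expert_choice_routing(rng.random((6, 2)), capacity=3))
```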


Applications to transformer models

MoE layers are used in the largest transformer models, for which learning and inferring over the full model is too costly. They are typically sparsely-gated, with sparsity 1 or 2. In Transformer models, the MoE layers are often used to select the feedforward layers (typically a linear-ReLU-linear network), appearing in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger. For example, in the Palm-540B model, 90% of parameters are in its feedforward layers.

A trained Transformer can be converted to an MoE by duplicating its feedforward layers, with randomly initialized gating, and then training it further. This is a technique called "sparse upcycling" (a sketch appears at the end of this section).

There are a large number of design choices involved in Transformer MoE that affect training stability and final performance. The OLMoE report describes these in some detail. Models large enough to use MoE tend to be large language models, where each expert has on the order of 10 billion parameters. Other than language models, Vision MoE is a Transformer model with MoE layers, demonstrated by training a model with 15 billion parameters. MoE Transformers have also been applied to diffusion models.

A series of large language models from Google used MoE. GShard uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, and the second-ranked expert is selected with probability proportional to that expert's weight according to the gating function. Later, GLaM demonstrated a language model with 1.2 trillion parameters, each MoE layer using top-2 out of 64 experts. Switch Transformers use top-1 in all MoE layers.

The NLLB-200 by Meta AI is a machine translation model for 200 languages. Each MoE layer uses a hierarchical MoE with two levels. On the first level, the gating function chooses to use either a "shared" feedforward layer, or to use the experts. If using the experts, then another gating function computes the weights and chooses the top-2 experts.

MoE large language models can be adapted for downstream tasks by instruction tuning. In December 2023, Mistral AI released Mixtral 8x7B under the Apache 2.0 license. It is an MoE language model with 46.7B parameters, 8 experts, and sparsity 2. They also released a version finetuned for instruction following. In March 2024, Databricks released DBRX. It is an MoE language model with 132B parameters, 16 experts, and sparsity 4. They also released a version finetuned for instruction following.
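The sparse upcycling recipe mentioned above can be sketched as follows in Python/NumPy; the dictionary representation of the feedforward weights and the gate initialization scale are illustrative assumptions, not details from any particular implementation.

```python
import numpy as np

def sparse_upcycle(ffn_weights, n_experts, d_model, rng=None):
    """Sparse upcycling (a sketch): turn one trained feedforward block into
    an MoE layer by copying its weights into every expert and adding a
    freshly (randomly) initialized gating matrix for further training.

    ffn_weights: dict of arrays for the dense FFN, e.g. {"W1": ..., "W2": ...}.
    Returns the expert copies and the new gate.
    """
    rng = rng or np.random.default_rng()
    experts = [{k: v.copy() for k, v in ffn_weights.items()}
               for _ in range(n_experts)]
    gate = rng.normal(scale=0.02, size=(n_experts, d_model))  # random init
    return experts, gate
```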


See also

* Product of experts
* Mixture models
* Mixture of Gaussians
* Ensemble learning


Further reading

* Literature review for the deep learning era
** Cai, Weilin; Jiang, Juyong; Wang, Fan; Tang, Jing; Kim, Sunghun; Huang, Jiayi (2025). "A Survey on Mixture of Experts in Large Language Models". IEEE Transactions on Knowledge and Data Engineering: 1–20. arXiv:2407.06204. doi:10.1109/TKDE.2025.3554028.