Mixture of experts

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. [1] MoE represents a form of ensemble learning. [2]

Basic theory

MoE always has the following components, but they are implemented and combined differently according to the problem being solved:

- Experts $f_1, \dots, f_n$, each taking the same input $x$ and producing outputs $f_1(x), \dots, f_n(x)$.
- A weighting (or gating) function $w$, which takes input $x$ and produces a vector of weights $(w(x)_1, \dots, w(x)_n)$.
- A rule for combining the expert outputs according to the weights, typically the weighted sum $\sum_i w(x)_i f_i(x)$.

Both the experts and the weighting function are trained by minimizing some loss function, generally via gradient descent. There is much freedom in choosing the precise form of experts, the weighting function, and the loss function.
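
As an illustration of these components, here is a minimal, hypothetical sketch in NumPy: linear experts, a linear-softmax weighting function, and a squared-error loss on the combined output. The parameter shapes and the choice of linear experts are assumptions for the example, not the formulation of any particular MoE paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 4, 3, 5

# Experts: here, simple linear maps f_i(x) = A_i x (illustrative choice).
A = rng.standard_normal((n_experts, d_out, d_in))
# Gating parameters: w(x) = softmax(K x + b).
K = rng.standard_normal((n_experts, d_in))
b = np.zeros(n_experts)

def moe_output(x):
    logits = K @ x + b
    w = np.exp(logits - logits.max())
    w /= w.sum()                      # gating weights, sum to 1
    f = A @ x                         # expert outputs, shape (n_experts, d_out)
    return w @ f                      # weighted sum of expert outputs

x, y = rng.standard_normal(d_in), rng.standard_normal(d_out)
loss = np.sum((y - moe_output(x)) ** 2)   # squared-error loss to be minimized
print(loss)
```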

Meta-pi network

The meta-pi network, reported by Hampshire and Waibel, [3] uses $f(x) = \sum_i w(x)_i f_i(x)$ as the output. The model is trained by performing gradient descent on the mean-squared error loss $L = \frac{1}{N} \sum_k \| y_k - f(x_k) \|^2$. The experts may be arbitrary functions.

In their original publication, they were solving the problem of classifying phonemes in a speech signal from 6 different Japanese speakers, 2 female and 4 male. They trained 6 experts, each a "time-delayed neural network" [4] (essentially a multilayered convolutional network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.

Adaptive mixtures of local experts

The adaptive mixtures of local experts [5] [6] uses a gaussian mixture model. Each expert simply predicts a gaussian distribution and ignores the input entirely. Specifically, the $i$-th expert predicts that the output is $y \sim N(\mu_i, I)$, where $\mu_i$ is a learnable parameter. The weighting function is a linear-softmax function:

$$w(x)_i = \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}}$$

The mixture of experts predicts that the output is distributed according to the probability density function:

$$f_\theta(y \mid x) = \sum_i w(x)_i \, N(y \mid \mu_i, I)$$

It is trained by maximum likelihood estimation, that is, gradient ascent on $\ln f_\theta(y \mid x)$. The gradient for the $i$-th expert is

$$\nabla_{\mu_i} \ln f_\theta(y \mid x) = \frac{w(x)_i \, N(y \mid \mu_i, I)}{f_\theta(y \mid x)} \, (y - \mu_i)$$

and the gradient for the weighting function is

$$\nabla_{(k_i, b_i)} \ln f_\theta(y \mid x) = \left( \frac{w(x)_i \, N(y \mid \mu_i, I)}{f_\theta(y \mid x)} - w(x)_i \right) \begin{pmatrix} x \\ 1 \end{pmatrix}$$

For each input-output pair $(x, y)$, the weighting function is changed to increase the weight on all experts that performed above average, and to decrease the weight on all experts that performed below average. This encourages the weighting function to learn to select only the experts that make the right predictions for each input.

The $i$-th expert is changed to make its prediction closer to $y$, but the amount of change is proportional to $w(x)_i N(y \mid \mu_i, I)$. This has a Bayesian interpretation. Given input $x$, the prior probability that expert $i$ is the right one is $w(x)_i$, and $N(y \mid \mu_i, I)$ is the likelihood of the evidence $y$. So, $\frac{w(x)_i N(y \mid \mu_i, I)}{f_\theta(y \mid x)}$ is the posterior probability for expert $i$, and the rate of change for the $i$-th expert is proportional to its posterior probability.

In words, the experts that, in hindsight, seemed like the good experts to consult, are asked to learn on the example. The experts that, in hindsight, were not, are left alone.

The combined effect is that the experts become specialized: suppose two experts are both good at predicting a certain kind of input, but one is slightly better; then the weighting function will eventually learn to favor the better one. After that happens, the lesser expert is unable to obtain a high gradient signal and becomes even worse at predicting that kind of input. Conversely, the lesser expert can become better at predicting other kinds of input, and is increasingly pulled away into another region. This has a positive feedback effect, causing each expert to move away from the rest and take care of a local region alone (hence the name "local experts").
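
A minimal sketch of the training rule described above, assuming the gaussian model with linear-softmax gating: one gradient-ascent step on $\ln f_\theta(y \mid x)$ moves each expert's mean toward $y$ in proportion to its posterior responsibility, and shifts the gate toward experts that performed above average. The dimensions, learning rate, and random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, n = 4, 2, 3
mu = rng.standard_normal((n, d_y))        # each expert's mean (ignores the input)
K = rng.standard_normal((n, d_x))         # gating weights
b = np.zeros(n)                           # gating biases
lr = 0.1

def step(x, y):
    global mu, K, b
    logits = K @ x + b
    w = np.exp(logits - logits.max()); w /= w.sum()        # prior gate weights w(x)_i
    lik = np.exp(-0.5 * np.sum((y - mu) ** 2, axis=1))     # N(y | mu_i, I) up to a constant
    post = w * lik / np.sum(w * lik)                       # posterior responsibility of each expert
    mu += lr * post[:, None] * (y - mu)                    # experts move toward y, scaled by posterior
    K  += lr * np.outer(post - w, x)                       # gate: above-average experts gain weight
    b  += lr * (post - w)

step(rng.standard_normal(d_x), rng.standard_normal(d_y))
```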

Hierarchical MoE

Hierarchical mixtures of experts [7] [8] uses multiple levels of gating in a tree. Each gating is a probability distribution over the next level of gatings, and the experts are on the leaf nodes of the tree. They are similar to decision trees.

For example, a 2-level hierarchical MoE would have a first-order gating function $w_i$, second-order gating functions $w_{j \mid i}$, and experts $f_{i,j}$. The total prediction is then $\sum_i w_i(x) \sum_j w_{j \mid i}(x) \, f_{i,j}(x)$.
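
A sketch of this 2-level prediction, under the illustrative assumption of linear-softmax gates at both levels and scalar linear experts $f_{i,j}(x)$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n1, n2 = 4, 2, 3                       # input dim, first- and second-level branching
W1 = rng.standard_normal((n1, d))         # first-level gate
W2 = rng.standard_normal((n1, n2, d))     # one second-level gate per first-level branch
E  = rng.standard_normal((n1, n2, d))     # scalar linear experts f_ij(x) = E_ij . x

def predict(x):
    w1 = softmax(W1 @ x)
    out = 0.0
    for i in range(n1):
        w2 = softmax(W2[i] @ x)
        # each leaf expert is weighted by the product of gate probabilities along its path
        out += w1[i] * np.sum(w2 * (E[i] @ x))
    return out

print(predict(rng.standard_normal(d)))
```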

Variants

Being similar to a gaussian mixture model, the mixture of experts can also be trained by the expectation-maximization (EM) algorithm. Specifically, during the expectation step, the "burden" for explaining each data point is distributed over the experts; during the maximization step, each expert is trained to improve the explanations it received a high burden for, while the gate is trained to improve its burden assignment. This can converge faster than gradient ascent on the log-likelihood. [8] [9]
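
A sketch of one EM iteration for the gaussian model from the earlier section. The E-step computes each expert's "burden" (responsibility) for each data point; the M-step for the expert means has a closed form, while the gate's M-step is approximated here by a single gradient step (a simplification; the exact gate M-step generally requires an inner optimization).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_x, d_y, n = 32, 4, 2, 3
X, Y = rng.standard_normal((T, d_x)), rng.standard_normal((T, d_y))
mu = rng.standard_normal((n, d_y))
K, b = rng.standard_normal((n, d_x)), np.zeros(n)

# E-step: "burden" (responsibility) of each expert for each data point.
W = softmax(X @ K.T + b)                                      # (T, n) prior gate weights
lik = np.exp(-0.5 * ((Y[:, None, :] - mu) ** 2).sum(-1))      # (T, n) N(y_t | mu_i, I) up to a constant
R = W * lik
R /= R.sum(axis=1, keepdims=True)                             # responsibilities

# M-step for the experts: each mean becomes the responsibility-weighted average of its targets.
mu = (R.T @ Y) / R.sum(axis=0)[:, None]

# Approximate M-step for the gate: one gradient step toward matching the responsibilities.
grad = X.T @ (R - W) / T                                      # (d_x, n)
K += 0.5 * grad.T
b += 0.5 * (R - W).mean(axis=0)
```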

The choice of gating function is often softmax. Other gating functions include gaussian distributions [10] and exponential families. [9]

Instead of performing a weighted sum over all the experts, hard MoE [11] uses only the highest-ranked expert. That is, $f(x) = f_{\arg\max_i w(x)_i}(x)$. This can accelerate training and inference. [12]

The experts can use more general forms of multivariate gaussian distributions. For example, [7] proposed $f_i(y \mid x) = N(y \mid A_i x + b_i, \Sigma_i)$, where $A_i, b_i, \Sigma_i$ are learnable parameters. In words, each expert learns to do linear regression with a learnable uncertainty estimate.

One can also use experts other than gaussian distributions. For example, one can use the Laplace distribution [13] or Student's t-distribution. [14] For binary classification, logistic regression experts have also been proposed, with

$$f_i(y \mid x) = \sigma(\beta_i^T x + \beta_{i,0})^{\,y} \, \bigl(1 - \sigma(\beta_i^T x + \beta_{i,0})\bigr)^{\,1-y}$$

where $\sigma$ is the logistic sigmoid and $\beta_i, \beta_{i,0}$ are learnable parameters. This was later generalized to multi-class classification, with multinomial logistic regression experts. [15]

One paper proposed a mixture of softmaxes for autoregressive language modelling. [16] Specifically, consider a language model that, given previous text $c$, predicts the next word $x$. The network encodes the text into a vector $v(c)$ and predicts the probability distribution of the next word as $\mathrm{softmax}(v(c) W)$ for an embedding matrix $W$. In a mixture of softmaxes, the model outputs multiple vectors $v_1(c), \dots, v_n(c)$ and predicts the next word as $\sum_i p_i \, \mathrm{softmax}(v_i(c) W)$, where $p_i$ is a probability distribution obtained by a linear-softmax operation on the activations of the hidden neurons within the model. The original paper demonstrated its effectiveness for recurrent neural networks; this was later found to work for Transformers as well. [17]
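
A sketch of a mixture-of-softmaxes output layer. How the mixture weights and the vectors $v_i(c)$ are produced from a single hidden state $h$, and all dimensions, are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab, n_mix = 16, 100, 4
W_emb  = rng.standard_normal((d_model, vocab))      # shared output embedding matrix W
W_proj = rng.standard_normal((n_mix, d_model, d_model))
W_pi   = rng.standard_normal((n_mix, d_model))      # produces the mixture weights

h = rng.standard_normal(d_model)                    # hidden state encoding the previous text
pi = softmax(W_pi @ h)                              # mixture weights, one per softmax component
v = np.tanh(W_proj @ h)                             # n_mix context vectors v_1(c), ..., v_n(c)
p = pi @ softmax(v @ W_emb)                         # mixture of softmaxes over the vocabulary
assert np.isclose(p.sum(), 1.0)                     # still a valid probability distribution
```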

Deep learning

The previous section described MoE as it was used before the era of deep learning. After deep learning, MoE found application in running the largest models, as a simple way to perform conditional computation: only parts of the model are used, with the parts chosen according to the input. [18]

The earliest paper that applies MoE to deep learning dates back to 2013, [19] which proposed to use a different gating network at each layer in a deep neural network. Specifically, each gating is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network. Since the output from the gating is not sparse, all expert outputs are needed, and no conditional computation is performed.

The key goal when using MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE. In classical MoE, the output for each query is a weighted sum of all experts' outputs. In deep learning MoE, the output for each query can only involve a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.

Sparsely-gated MoE layer

The sparsely-gated MoE layer, [20] published by researchers from Google Brain, uses feedforward networks as experts and linear-softmax gating. Similar to the previously proposed hard MoE, it achieves sparsity by taking a weighted sum of only the top-k experts instead of all of them. Specifically, in a MoE layer there are feedforward networks $f_1, \dots, f_n$ and a gating network $w$. The gating network is defined by $w(x) = \mathrm{softmax}(\mathrm{top}_k(W x + \text{noise}))$, where $\mathrm{top}_k$ is a function that keeps the top-k entries of a vector unchanged and sets all other entries to $-\infty$ (so that their gate weights become zero after the softmax). The addition of noise helps with load balancing.
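
A minimal sketch of noisy top-k gating with feedforward experts. The plain gaussian noise and the softmax taken over only the kept logits are simplifications of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2
W_gate = rng.standard_normal((n_experts, d))
# Each expert: a small feedforward (linear-ReLU-linear) network.
W1 = rng.standard_normal((n_experts, 16, d))
W2 = rng.standard_normal((n_experts, d, 16))

def moe_layer(x, train=True):
    logits = W_gate @ x
    if train:
        logits = logits + rng.standard_normal(n_experts)   # noisy gating (training only)
    top = np.argsort(logits)[-k:]                          # indices of the top-k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                                     # softmax over the kept entries only
    y = np.zeros(d)
    for g, i in zip(gate, top):                            # only k experts are actually evaluated
        y += g * (W2[i] @ np.maximum(W1[i] @ x, 0.0))
    return y

print(moe_layer(rng.standard_normal(d)))
```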

The choice of $k$ is a hyperparameter chosen according to the application. Typical values are $k = 1, 2$. The $k = 1$ version is also called the Switch Transformer. The original Switch Transformer was applied to a T5 language model. [21]

As a demonstration, they trained a series of models for machine translation with alternating layers of MoE and LSTM, and compared them with deep LSTM models. [22] Table 3 of their paper shows that the MoE models used less inference-time compute, despite having 30x more parameters.

Load balancing

Vanilla MoE tends to have load-balancing issues: some experts are consulted often, while others are consulted rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer has two auxiliary loss functions. This was improved by the Switch Transformer [21] into a single auxiliary loss function. Specifically, let $n$ be the number of experts; then for a given batch of queries $\{x_1, \dots, x_T\}$, the auxiliary loss for the batch is

$$L_{\text{aux}} = n \sum_{i=1}^{n} f_i P_i$$

Here, $f_i$ is the fraction of tokens that chose expert $i$, and $P_i$ is the fraction of gate weight assigned to expert $i$, averaged over the batch. This loss is minimized at 1, precisely when every expert receives an equal share of tokens and gate weight.
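
A sketch of this auxiliary loss for a batch, assuming top-1 routing as in the Switch Transformer and taking the gate probabilities as given:

```python
import numpy as np

def load_balancing_loss(gate_probs):
    """gate_probs: (T, n) softmax gate outputs for a batch of T tokens and n experts."""
    T, n = gate_probs.shape
    chosen = gate_probs.argmax(axis=1)                 # top-1 expert per token
    f = np.bincount(chosen, minlength=n) / T           # fraction of tokens routed to each expert
    P = gate_probs.mean(axis=0)                        # fraction of gate probability per expert
    return n * np.sum(f * P)                           # equals 1 under a perfectly uniform split

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(load_balancing_loss(probs))
```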

Figure: The DeepSeek MoE architecture. Also shown is MLA, a variant of the attention mechanism in Transformers (DeepSeek-V2).

Researchers at DeepSeek designed a variant of MoE with "shared experts" that are always queried and "routed experts" that might not be. They found that standard load balancing encourages all experts to be consulted equally, which then causes the experts to replicate the same core capability, such as English grammar. They proposed shared experts to learn the core capabilities that are often used, leaving the routed experts to learn the peripheral capabilities that are rarely used. [24]

They also proposed an "auxiliary-loss-free load balancing strategy", which does not use an auxiliary loss. Instead, each expert $i$ has an extra "expert bias" $b_i$. If an expert is being neglected, its bias increases, and vice versa. During token assignment, each token picks the top-k experts, but with the bias added in; that is, the selected experts are $\arg\mathrm{top}_k \{ w(x)_i + b_i \}$. [25] Note that the expert bias matters for picking the experts, but not for adding up the responses from the experts.
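
A sketch of bias-adjusted expert selection. The fixed-step sign update of the bias is a simplified stand-in for the actual update rule; the point it illustrates is that the bias changes which experts are picked but not the combination weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, gamma = 8, 2, 0.01
bias = np.zeros(n_experts)                                # per-expert routing bias

def route(scores):
    """scores: (T, n) gate affinity of each token for each expert."""
    chosen = np.argsort(scores + bias, axis=1)[:, -k:]    # bias affects which experts are picked...
    weights = np.take_along_axis(scores, chosen, axis=1)  # ...but not the combination weights
    return chosen, weights

scores = rng.random((128, n_experts))
chosen, _ = route(scores)
load = np.bincount(chosen.ravel(), minlength=n_experts)
bias += gamma * np.sign(load.mean() - load)               # neglected experts get a higher bias
```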

Capacity factor

Suppose there are $n$ experts in a layer. For a given batch of queries $\{x_1, \dots, x_T\}$, each query is routed to one or more experts. For example, if each query is routed to one expert as in Switch Transformers, and if the experts are load-balanced, then each expert should expect on average $T/n$ queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches one expert might be underworked, while in others it would be overworked.

Since the inputs cannot move past the layer until every expert in the layer has finished the queries assigned to it, load balancing is important. A capacity factor $c$ is sometimes used to enforce a hard constraint on load balancing: each expert is only allowed to process up to $c \cdot T/n$ queries in a batch. The ST-MoE report found capacity factors in the range of 1.25 to 2 to work well in practice. [26]
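
A sketch of enforcing an expert capacity of $c \cdot T/n$ with top-1 routing; tokens arriving after an expert is full are dropped (and, in a Transformer, would pass through via the residual connection, as discussed in the next section). The first-come-first-served overflow rule is an illustrative choice.

```python
import numpy as np

def assign_with_capacity(chosen, n_experts, capacity_factor=1.25):
    """chosen: (T,) top-1 expert index per token. Returns a mask of tokens that are kept."""
    T = len(chosen)
    capacity = int(np.ceil(capacity_factor * T / n_experts))
    counts = np.zeros(n_experts, dtype=int)
    keep = np.zeros(T, dtype=bool)
    for t, e in enumerate(chosen):           # tokens are processed in order; later ones overflow first
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True                   # dropped tokens skip the expert (residual pass-through)
    return keep

rng = np.random.default_rng(0)
keep = assign_with_capacity(rng.integers(0, 8, size=256), n_experts=8)
print(keep.sum(), "of", len(keep), "tokens kept")
```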

Routing

In the original sparsely-gated MoE, only the top-k experts are queried, and their outputs are weighted-summed. There are other methods. [26] Generally speaking, routing is an assignment problem: how should tokens be assigned to experts, such that a variety of constraints (such as throughput and load balancing) are satisfied? There are typically three classes of routing algorithms: the experts choose the tokens ("expert choice"), [27] the tokens choose the experts (the original sparsely-gated MoE), or a global assigner matches experts and tokens. [28]

During inference, the MoE works over a large batch of tokens at any time. If the tokens choose the experts, some experts might receive few tokens, while a few experts receive so many tokens that their maximum batch size is exceeded, so they have to ignore some of the tokens. Similarly, if the experts choose the tokens, some tokens might not be picked by any expert. This is the "token drop" problem. Dropping a token is not necessarily a serious problem, since in Transformers, due to residual connections, a "dropped" token does not disappear: its vector representation simply passes through the feedforward layer without change. [28]

Other approaches include solving routing as a constrained linear programming problem [29] and using reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, as in RL). [30] The token-expert match may also involve no learning ("static routing"): it can be done by a deterministic hash function [31] or a random number generator. [32]
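
A minimal sketch of static hash routing, in which the expert is a fixed function of the token identity and nothing is learned; the particular hash is an arbitrary illustrative choice.

```python
def hash_route(token_id: int, n_experts: int) -> int:
    # Static routing: the expert is a deterministic function of the token identity,
    # so no gating parameters need to be learned.
    return (token_id * 2654435761) % n_experts   # illustrative multiplicative hash

print([hash_route(t, 8) for t in range(10)])
```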

Applications to transformer models

MoE layers are used in the largest transformer models, for which learning and inference over the full model is too costly. They are typically sparsely gated, with sparsity 1 or 2. In Transformer models, the MoE layers are often used to select the feedforward layers (typically a linear-ReLU-linear network), which appear in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger. For example, in the PaLM-540B model, 90% of the parameters are in its feedforward layers. [33]

A trained Transformer can be converted to an MoE by duplicating its feedforward layers into multiple experts, adding randomly initialized gating, and then training further. This technique is called "sparse upcycling". [34]
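
A sketch of the upcycling step for one feedforward layer: the dense layer's weights are copied into every expert and a new gate is randomly initialized. The shapes and initialization scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n_experts = 16, 64, 8

# Pretrained dense feedforward layer (linear-ReLU-linear).
dense_W1 = rng.standard_normal((d_ff, d))
dense_W2 = rng.standard_normal((d, d_ff))

# Sparse upcycling: every expert starts as an exact copy of the dense layer,
# while the gating network is initialized from scratch.
expert_W1 = np.repeat(dense_W1[None], n_experts, axis=0)
expert_W2 = np.repeat(dense_W2[None], n_experts, axis=0)
gate_W = 0.02 * rng.standard_normal((n_experts, d))   # small random init for the new gate
```

Under a gating convention where the kept weights sum to 1, the upcycled layer initially computes the same function as the dense layer; further training then lets the experts diverge.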

There are a large number of design choices involved in Transformer MoE that affect the training stability and final performance. The OLMoE report describes these in some detail. [35]

As of 2023, models large enough to use MoE tend to be large language models, where each expert has on the order of 10 billion parameters. Besides language models, Vision MoE [36] is a Transformer model with MoE layers, demonstrated by training a model with 15 billion parameters. MoE Transformers have also been applied to diffusion models. [37]

A series of large language models from Google used MoE. GShard [38] uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, while the second-ranked expert is selected with probability proportional to that expert's weight according to the gating function. Later, GLaM [39] demonstrated a language model with 1.2 trillion parameters, each MoE layer using top-2 out of 64 experts. Switch Transformers [21] use top-1 in all MoE layers.

The NLLB-200 by Meta AI is a machine translation model for 200 languages. [40] Each MoE layer uses a hierarchical MoE with two levels. On the first level, the gating function chooses between a "shared" feedforward layer and the experts. If the experts are used, another gating function computes the weights and chooses the top-2 experts. [41]

MoE large language models can be adapted for downstream tasks by instruction tuning. [42]

In December 2023, Mistral AI released Mixtral 8x7B under Apache 2.0 license. It is a MoE language model with 46.7B parameters, 8 experts, and sparsity 2. They also released a version finetuned for instruction following. [43] [44]

In March 2024, Databricks released DBRX. It is a MoE language model with 132B parameters, 16 experts, and sparsity 4. They also released a version finetuned for instruction following. [45] [46]

References

  1. Baldacchino, Tara; Cross, Elizabeth J.; Worden, Keith; Rowson, Jennifer (2016). "Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems". Mechanical Systems and Signal Processing. 66–67: 178–200. Bibcode:2016MSSP...66..178B. doi:10.1016/j.ymssp.2015.05.009.
  2. Rokach, Lior (November 2009). Pattern Classification Using Ensemble Methods. Series in Machine Perception and Artificial Intelligence. Vol. 75. WORLD SCIENTIFIC. p. 142. doi:10.1142/7238. ISBN   978-981-4271-06-6 . Retrieved 14 November 2024.
  3. Hampshire, J.B.; Waibel, A. (July 1992). "The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 14 (7): 751–769. doi:10.1109/34.142911.
  4. Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, Kevin J. Lang (1995). "Phoneme Recognition Using Time-Delay Neural Networks*". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation. Psychology Press. doi:10.4324/9780203763247. ISBN   978-0-203-76324-7.{{cite book}}: CS1 maint: multiple names: authors list (link)
  5. Nowlan, Steven; Hinton, Geoffrey E (1990). "Evaluation of Adaptive Mixtures of Competing Experts". Advances in Neural Information Processing Systems. 3. Morgan-Kaufmann.
  6. Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN   0899-7667. PMID   31141872. S2CID   572361.
  7. Jordan, Michael; Jacobs, Robert (1991). "Hierarchies of adaptive experts". Advances in Neural Information Processing Systems. 4. Morgan-Kaufmann.
  8. Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl: 1721.1/7206 . ISSN 0899-7667.
  9. Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl: 1721.1/6620 . ISSN 0893-6080.
  10. Xu, Lei; Jordan, Michael; Hinton, Geoffrey E (1994). "An Alternative Model for Mixtures of Experts". Advances in Neural Information Processing Systems. 7. MIT Press.
  11. Collobert, Ronan; Bengio, Samy; Bengio, Yoshua (2001). "A Parallel Mixture of SVMs for Very Large Scale Problems". Advances in Neural Information Processing Systems. 14. MIT Press.
  12. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "12: Applications". Deep learning. Adaptive computation and machine learning. Cambridge, Mass: The MIT press. ISBN   978-0-262-03561-3.
  13. Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN   0167-9473.
  14. Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv: 1701.07429 . doi:10.1016/j.neunet.2016.03.002. ISSN   0893-6080. PMID   27093693. S2CID   3171144.
  15. Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN   0893-6080. PMID   12662629.
  16. Yang, Zhilin; Dai, Zihang; Salakhutdinov, Ruslan; Cohen, William W. (2017-11-10). "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model". arXiv: 1711.03953 [cs.CL].
  17. Narang, Sharan; Chung, Hyung Won; Tay, Yi; Fedus, William; Fevry, Thibault; Matena, Michael; Malkan, Karishma; Fiedel, Noah; Shazeer, Noam (2021-02-23). "Do Transformer Modifications Transfer Across Implementations and Applications?". arXiv: 2102.11972 [cs.LG].
  18. Bengio, Yoshua; Léonard, Nicholas; Courville, Aaron (2013). "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation". arXiv: 1308.3432 [cs.LG].
  19. Eigen, David; Ranzato, Marc'Aurelio; Sutskever, Ilya (2013). "Learning Factored Representations in a Deep Mixture of Experts". arXiv: 1312.4314 [cs.LG].
  20. Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv: 1701.06538 [cs.LG].
  21. Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv: 2101.03961 . ISSN 1532-4435.
  22. Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus; Klingner, Jeff; Shah, Apurva; Johnson, Melvin; Liu, Xiaobing; Kaiser, Łukasz (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv: 1609.08144 [cs.CL].
  23. DeepSeek-AI; Liu, Aixin; Feng, Bei; Wang, Bin; Wang, Bingxuan; Liu, Bo; Zhao, Chenggang; Dengr, Chengqi; Ruan, Chong (19 June 2024), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, arXiv: 2405.04434 .
  24. Dai, Damai; Deng, Chengqi; Zhao, Chenggang; Xu, R. X.; Gao, Huazuo; Chen, Deli; Li, Jiashi; Zeng, Wangding; Yu, Xingkai (11 January 2024), DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, arXiv: 2401.06066
  25. DeepSeek-AI; Liu, Aixin; Feng, Bei; Xue, Bing; Wang, Bingxuan; Wu, Bochao; Lu, Chengda; Zhao, Chenggang; Deng, Chengqi (2024-12-27), DeepSeek-V3 Technical Report, arXiv, doi:10.48550/arXiv.2412.19437, arXiv:2412.19437
  26. Zoph, Barret; Bello, Irwan; Kumar, Sameer; Du, Nan; Huang, Yanping; Dean, Jeff; Shazeer, Noam; Fedus, William (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models". arXiv: 2202.08906 [cs.CL].
  27. Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew M.; Chen, Zhifeng; Le, Quoc V.; Laudon, James (2022-12-06). "Mixture-of-Experts with Expert Choice Routing". Advances in Neural Information Processing Systems. 35: 7103–7114. arXiv: 2202.09368 .
  28. Fedus, William; Dean, Jeff; Zoph, Barret (2022-09-04). "A Review of Sparse Expert Models in Deep Learning". arXiv: 2209.01667 .
  29. Lewis, Mike; Bhosale, Shruti; Dettmers, Tim; Goyal, Naman; Zettlemoyer, Luke (2021-07-01). "BASE Layers: Simplifying Training of Large, Sparse Models". Proceedings of the 38th International Conference on Machine Learning. PMLR: 6265–6274. arXiv: 2103.16716 .
  30. Bengio, Emmanuel; Bacon, Pierre-Luc; Pineau, Joelle; Precup, Doina (2015). "Conditional Computation in Neural Networks for faster models". arXiv: 1511.06297 [cs.LG].
  31. Roller, Stephen; Sukhbaatar, Sainbayar; szlam, arthur; Weston, Jason (2021). "Hash Layers For Large Sparse Models". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 17555–17566.
  32. Zuo, Simiao; Liu, Xiaodong; Jiao, Jian; Kim, Young Jin; Hassan, Hany; Zhang, Ruofei; Zhao, Tuo; Gao, Jianfeng (2022-02-03), Taming Sparsely Activated Transformer with Stochastic Experts, arXiv, doi:10.48550/arXiv.2110.04260, arXiv:2110.04260
  33. "Transformer Deep Dive: Parameter Counting". Transformer Deep Dive: Parameter Counting. Retrieved 2023-10-10.
  34. Komatsuzaki, Aran; Puigcerver, Joan; Lee-Thorp, James; Ruiz, Carlos Riquelme; Mustafa, Basil; Ainslie, Joshua; Tay, Yi; Dehghani, Mostafa; Houlsby, Neil (2023-02-17). "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints". arXiv: 2212.05055 [cs.LG].
  35. Muennighoff, Niklas; Soldaini, Luca; Groeneveld, Dirk; Lo, Kyle; Morrison, Jacob; Min, Sewon; Shi, Weijia; Walsh, Pete; Tafjord, Oyvind (2024-09-03), OLMoE: Open Mixture-of-Experts Language Models, arXiv: 2409.02060
  36. Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, André; Keysers, Daniel; Houlsby, Neil (2021). "Scaling Vision with Sparse Mixture of Experts". Advances in Neural Information Processing Systems. 34: 8583–8595. arXiv: 2106.05974 .
  37. Fei, Zhengcong; Fan, Mingyuan; Yu, Changqian; Li, Debang; Huang, Junshi (2024-07-16). "Scaling Diffusion Transformers to 16 Billion Parameters". arXiv: 2407.11633 [cs.CV].
  38. Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv: 2006.16668 [cs.CL].
  39. Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv: 2112.06905 [cs.CL].
  40. "200 languages within a single AI model: A breakthrough in high-quality machine translation". ai.facebook.com. 2022-06-19. Archived from the original on 2023-01-09.
  41. NLLB Team; Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv: 2207.04672 [cs.CL].
  42. Shen, Sheng; Hou, Le; Zhou, Yanqi; Du, Nan; Longpre, Shayne; Wei, Jason; Chung, Hyung Won; Zoph, Barret; Fedus, William; Chen, Xinyun; Vu, Tu; Wu, Yuexin; Chen, Wuyang; Webson, Albert; Li, Yunxuan (2023). "Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models". arXiv: 2305.14705 [cs.CL].
  43. AI, Mistral (2023-12-11). "Mixtral of experts". mistral.ai. Retrieved 2024-02-04.
  44. Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de las; Hanna, Emma Bou (2024-01-08). "Mixtral of Experts". arXiv: 2401.04088 [cs.LG].
  45. "Introducing DBRX: A New State-of-the-Art Open LLM". Databricks. 2024-03-27. Retrieved 2024-03-28.
  46. Knight, Will. "Inside the Creation of the World's Most Powerful Open Source AI Model". Wired. ISSN   1059-1028 . Retrieved 2024-03-28.
