Mixture of experts

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions. [1] It differs from ensemble techniques in that for MoE, typically only one or a few expert models are run for each input, whereas in ensemble techniques, all models are run on every input.

Basic theory

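A mixture of experts always has the same ingredients, though they are constructed and combined differently from one variant to another: a set of experts $f_1, \dots, f_n$ that each receive the same input $x$; a weighting function (also called a gating function) $w$ that maps $x$ to a vector of weights $(w(x)_1, \dots, w(x)_n)$; and a rule for combining the expert outputs $f_1(x), \dots, f_n(x)$ according to those weights.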

Both the experts and the weighting function are trained by minimizing some form of loss function, generally by gradient descent. There is a lot of freedom in choosing the precise form of experts, the weighting function, and the loss function.

Meta-pi network

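The meta-pi network, reported by Hampshire and Waibel, [2] uses $f(x) = \sum_i w(x)_i f_i(x)$ as the output, where $f_i$ is the $i$-th expert and $w(x)_i$ is the weight assigned to it for input $x$. The model is trained by performing gradient descent on the mean-squared error loss $L := \frac{1}{N} \sum_k \|y_k - f(x_k)\|^2$ over $N$ training pairs $(x_k, y_k)$. The experts may be arbitrary functions.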
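The weighted-sum output and the mean-squared error loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the experts are taken to be linear maps and the weighting function a linear-softmax, neither of which is prescribed by the text (the experts may be arbitrary functions), and the names `moe_output` and `mse_loss` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_output(x, expert_weights, gate_weights):
    """Weighted sum of expert outputs.

    x: (batch, d_in)
    expert_weights: (n_experts, d_in, d_out), one linear expert each (assumption)
    gate_weights: (d_in, n_experts), linear-softmax weighting (assumption)
    """
    w = softmax(x @ gate_weights)                        # (batch, n_experts)
    outs = np.einsum("bd,ndo->bno", x, expert_weights)   # every expert on every input
    return np.einsum("bn,bno->bo", w, outs)              # sum_i w(x)_i * f_i(x)

def mse_loss(pred, target):
    """Mean over the batch of the squared Euclidean error."""
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
experts = rng.normal(size=(4, 3, 2))
gate = rng.normal(size=(3, 4))
y = rng.normal(size=(5, 2))

pred = moe_output(x, experts, gate)
loss = mse_loss(pred, y)
```

With all gate weights set to zero the softmax is uniform, so the output reduces to the plain average of the experts, which is a convenient sanity check.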

In the original publication, the authors addressed the problem of classifying phonemes in speech signals from 6 different Japanese speakers, 2 female and 4 male. They trained 6 experts, each a "time-delay neural network" [3] (essentially a multilayered convolutional network over the mel spectrogram). They found that the resulting mixture of experts dedicated 5 experts to 5 of the speakers, but the 6th (male) speaker did not have a dedicated expert; instead, his voice was classified by a linear combination of the experts for the other 3 male speakers.

Adaptive mixtures of local experts

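The adaptive mixtures of local experts [4] [5] use a Gaussian mixture model. Each expert simply predicts a Gaussian distribution and ignores the input entirely. Specifically, the $i$-th expert predicts that the output is $y \sim N(\mu_i, I)$, where $\mu_i$ is a learnable parameter. The weighting function is a linear-softmax function:

$$w(x)_i = \frac{e^{k_i^T x + b_i}}{\sum_j e^{k_j^T x + b_j}}$$

The mixture of experts predicts that the output is distributed according to the probability density function:

$$f_\theta(y \mid x) = \sum_i w(x)_i \, N(y \mid \mu_i, I) = (2\pi)^{-d/2} \sum_i w(x)_i \, e^{-\frac{1}{2} \|y - \mu_i\|^2}$$

where $d$ is the dimension of $y$. It is trained by maximum likelihood estimation, that is, gradient ascent on $\ln f_\theta(y \mid x)$. The gradient for the $i$-th expert is

$$\nabla_{\mu_i} \ln f_\theta(y \mid x) = \frac{w(x)_i \, N(y \mid \mu_i, I)}{\sum_j w(x)_j \, N(y \mid \mu_j, I)} \, (y - \mu_i)$$

and the gradient for the weighting function is

$$\nabla_{[k_i, b_i]} \ln f_\theta(y \mid x) = \begin{bmatrix} x \\ 1 \end{bmatrix} \frac{w(x)_i}{f_\theta(y \mid x)} \left( N(y \mid \mu_i, I) - f_\theta(y \mid x) \right)$$

For each input-output pair $(x, y)$, the weighting function is changed to increase the weight on all experts that performed above average, and decrease the weight on all experts that performed below average. This encourages the weighting function to learn to select only the experts that make the right predictions for each input.

The $i$-th expert is changed to make its prediction closer to $y$, but the amount of change is proportional to $w(x)_i \, N(y \mid \mu_i, I)$. This has a Bayesian interpretation. Given input $x$, the prior probability that expert $i$ is the right one is $w(x)_i$, and $N(y \mid \mu_i, I)$ is the likelihood of the evidence $y$. So $\frac{w(x)_i \, N(y \mid \mu_i, I)}{\sum_j w(x)_j \, N(y \mid \mu_j, I)}$ is the posterior probability for expert $i$, and the rate of change for the $i$-th expert is proportional to its posterior probability.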

In words, the experts that, in hindsight, seemed like the good experts to consult, are asked to learn on the example. The experts that, in hindsight, were not, are left alone.
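The prior, likelihood and posterior reading of the update can be made concrete with a short NumPy sketch for a single input-output pair. It assumes identity-covariance Gaussian experts with one learnable mean each and a linear-softmax weighting function, as described above; the function and variable names (`gradients`, `K`, `b`, `mu`) are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def gate_prior(x, K, b):
    """Linear-softmax weights: the prior probability of each expert given x."""
    logits = K @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gaussian_pdf(y, mu):
    """Density of N(mu_i, I) at y for every expert; mu has shape (n, d)."""
    d = y.shape[-1]
    sq = np.sum((y - mu) ** 2, axis=-1)
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * sq)

def log_likelihood(x, y, mu, K, b):
    """Log of the mixture density at y given x."""
    return np.log(np.sum(gate_prior(x, K, b) * gaussian_pdf(y, mu)))

def gradients(x, y, mu, K, b):
    """Gradients of the log-likelihood for one (x, y) pair."""
    prior = gate_prior(x, K, b)
    lik = gaussian_pdf(y, mu)
    posterior = prior * lik / np.sum(prior * lik)
    grad_mu = posterior[:, None] * (y - mu)     # pull expert i toward y, scaled by posterior
    grad_logit = posterior - prior              # positive for above-average experts
    grad_K = grad_logit[:, None] * x[None, :]
    grad_b = grad_logit
    return posterior, grad_mu, grad_K, grad_b

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
mu = rng.normal(size=(4, 2))
K = rng.normal(size=(4, 3))
b = rng.normal(size=4)
posterior, grad_mu, grad_K, grad_b = gradients(x, y, mu, K, b)
```

The gate gradient simplifies to posterior minus prior: experts whose likelihood exceeds the mixture average have a posterior above their prior and so gain weight, while the rest lose it.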

The combined effect is that the experts become specialized. Suppose two experts are both good at predicting a certain kind of input, but one is slightly better. The weighting function would eventually learn to favor the better one. After that happens, the lesser expert no longer receives a strong gradient signal for that kind of input and becomes even worse at predicting it. Conversely, the lesser expert can become better at predicting other kinds of input, and is increasingly pulled away into another region. This is a positive feedback effect, causing each expert to move apart from the rest and take care of a local region alone (hence the name "local experts").

Hierarchical MoE

Hierarchical mixtures of experts [6] [7] use multiple levels of gating in a tree. Each gating is a probability distribution over the next level of gatings, and the experts are at the leaf nodes of the tree. They are similar to decision trees.

For example, a 2-level hierarchical MoE would have a first-order gating function $w_i$, second-order gating functions $w_{j|i}$, and experts $f_{j|i}$. The total prediction is then $\sum_i w_i(x) \sum_j w_{j|i}(x) f_{j|i}(x)$.
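A 2-level prediction of this kind can be sketched as follows. The sketch assumes linear-softmax gates at both levels and scalar-valued linear experts, which are simplifying choices for illustration; the array layout (`m` branches with `n` leaf experts each) and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def leaf_weights(x, top_gate, sub_gates):
    """Product of first-level and second-level gate probabilities.

    top_gate: (m, d); sub_gates: (m, n, d). Returns an (m, n) array.
    """
    w_top = softmax(top_gate @ x)     # (m,)   distribution over branches
    w_sub = softmax(sub_gates @ x)    # (m, n) one distribution per branch
    return w_top[:, None] * w_sub

def hierarchical_moe(x, top_gate, sub_gates, experts):
    """Total prediction: leaf-weighted sum of the leaf experts' outputs."""
    f = experts @ x                   # (m, n) scalar output per leaf expert
    return float(np.sum(leaf_weights(x, top_gate, sub_gates) * f))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
top_gate = rng.normal(size=(2, 3))
sub_gates = rng.normal(size=(2, 4, 3))
experts = rng.normal(size=(2, 4, 3))
prediction = hierarchical_moe(x, top_gate, sub_gates, experts)
```

Because each level is a probability distribution, the leaf weights sum to one, so the tree as a whole behaves like a single flat mixture over the leaf experts.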

Variants

Because it resembles the Gaussian mixture model, the mixture of experts can also be trained by the expectation-maximization algorithm. Specifically, during the expectation step, the "burden" for explaining each data point is divided among the experts, and during the maximization step, the experts are trained to improve the explanations for which they received a high burden, while the gate is trained to improve its burden assignment. This can converge faster than gradient ascent on the log-likelihood. [7] [8]
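The two steps can be sketched for the simplest case. The code below assumes identity-covariance Gaussian experts that ignore the input and a linear-softmax gate; under those assumptions the maximization step for the expert means has a closed form (a burden-weighted average of the targets). The gate update, which generally needs an inner iterative fit, is deliberately left out. All names are illustrative.

```python
import numpy as np

def e_step(X, Y, mu, K, b):
    """Expectation step: burden h[t, i] of expert i for data point t.

    X: (T, d_in), Y: (T, d_out), mu: (n, d_out), K: (n, d_in), b: (n,)
    """
    logits = X @ K.T + b
    logits = logits - logits.max(axis=1, keepdims=True)
    prior = np.exp(logits)
    prior /= prior.sum(axis=1, keepdims=True)                     # gate weights
    sq = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)     # (T, n)
    lik = np.exp(-0.5 * sq)       # Gaussian constant cancels when normalizing
    h = prior * lik
    return h / h.sum(axis=1, keepdims=True)

def m_step_means(Y, h):
    """Maximization step for the expert means: burden-weighted average of Y."""
    return (h.T @ Y) / h.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=(20, 2))
mu = rng.normal(size=(4, 2))
K = rng.normal(size=(4, 3))
b = np.zeros(4)

h = e_step(X, Y, mu, K, b)
mu_new = m_step_means(Y, h)
```

Each row of `h` is a probability distribution over experts, so a data point's total burden is always one; only its division among the experts changes between iterations.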

The gating function is most often a softmax. As alternatives, [9] proposed using Gaussian distributions, and [8] proposed using exponential families.

Instead of performing a weighted sum of all the experts, in hard MoE [10] only the highest ranked expert is chosen. That is, $f(x) = f_{\arg\max_i w_i(x)}(x)$. This can reduce training and inference time. [11]
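As a minimal sketch of this selection rule, assuming a linear scoring function for the gate and experts supplied as plain Python callables (both illustrative choices, not part of the cited work):

```python
import numpy as np

def hard_moe(x, gate_weights, experts):
    """Evaluate only the expert with the highest gate score.

    gate_weights: (n_experts, d); experts: list of callables.
    Returns the chosen index and that expert's output.
    """
    scores = gate_weights @ x
    best = int(np.argmax(scores))
    return best, experts[best](x)    # the other experts are never run

gate_weights = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
experts = [lambda v: v.sum(), lambda v: 2.0 * v.sum()]
x = np.array([0.2, 0.9])
chosen, output = hard_moe(x, gate_weights, experts)
```

The saving comes from the last line of `hard_moe`: the cost per input is one gate evaluation plus one expert, regardless of how many experts exist.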

The experts can use more general forms of multivariate Gaussian distributions. For example, [6] proposed $f_i(y \mid x) = N(y \mid A_i x + b_i, \Sigma_i)$, where $A_i, b_i, \Sigma_i$ are learnable parameters. In words, each expert learns to do linear regression, with a learnable uncertainty estimate.

The experts need not be Gaussian distributions. For example, one can use the Laplace distribution, [12] or Student's t-distribution. [13] For binary classification, logistic regression experts have also been proposed, with

$$f_i(y \mid x) = \begin{cases} \dfrac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 0 \\ 1 - \dfrac{1}{1 + e^{\beta_i^T x + \beta_{i,0}}}, & y = 1 \end{cases}$$

where $\beta_i, \beta_{i,0}$ are learnable parameters. This was later generalized to multi-class classification, with multinomial logistic regression experts. [14]

Deep learning

The previous section described MoE as it was used before the era of deep learning. With deep learning, MoE found applications in running the largest models, as a simple way to perform conditional computation: only parts of the model are used, and the parts are chosen according to what the input is. [15]

The earliest paper that applies MoE to deep learning is "Learning Factored Representations in a Deep Mixture of Experts" (Eigen, Ranzato, Sutskever) [16] which proposes to use a different gating network at each layer in a deep neural network. Specifically, each gating is a linear-ReLU-linear-softmax network, and each expert is a linear-ReLU network.

The key design desideratum for MoE in deep learning is to reduce computing cost. Consequently, for each query, only a small subset of the experts should be queried. This makes MoE in deep learning different from classical MoE. In classical MoE, the output for each query is a weighted sum of all experts' outputs. In deep learning MoE, the output for each query can only involve a few experts' outputs. Consequently, the key design choice in MoE becomes routing: given a batch of queries, how to route the queries to the best experts.

Sparsely-gated MoE layer

The sparsely-gated MoE layer, [17] published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating. Similar to the previously proposed hard MoE, it achieves sparsity by taking a weighted sum of only the top-k experts, instead of the weighted sum of all of them. Specifically, in a MoE layer, there are feedforward networks $f_1, \dots, f_n$, and a gating network $w$. The gating network is defined by $w(x) = \mathrm{softmax}(\mathrm{top}_k(W x + \text{noise}))$, where $\mathrm{top}_k$ is a function that keeps the top-k entries of a vector the same, but sets all other entries to $-\infty$. The addition of noise helps with load balancing.

The choice of $k$ is a hyperparameter that is chosen according to application. Typical values are $k = 1, 2$. The $k = 1$ version is also called the Switch Transformer. [18]
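The gating rule (keep the top-k logits, set the rest to negative infinity, then apply softmax) can be sketched in NumPy. One simplification is assumed and should be read as such: the noise here is Gaussian with a fixed standard deviation `noise_std`, a hypothetical parameter standing in for whatever noise model a real implementation uses.

```python
import numpy as np

def top_k_gate(logits, k):
    """Keep the k largest logits per row, set the rest to -inf, then softmax."""
    masked = np.full_like(logits, -np.inf)
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]     # indices of the k largest
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    masked -= masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)            # exp(-inf) == 0: unselected experts get weight 0
    return e / e.sum(axis=-1, keepdims=True)

def noisy_top_k_gate(x, W, k, noise_std, rng):
    """Linear gate with additive Gaussian noise (fixed std is an assumption)."""
    logits = x @ W
    logits = logits + noise_std * rng.normal(size=logits.shape)
    return top_k_gate(logits, k)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))       # 6 queries
W = rng.normal(size=(3, 5))       # 5 experts
gates = noisy_top_k_gate(x, W, k=2, noise_std=0.1, rng=rng)
```

Each row of `gates` has exactly `k` nonzero entries that sum to one, so only `k` experts need to be evaluated per query and their outputs combined with those weights.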

As a demonstration, they trained a series of models for machine translation with alternating layers of MoE and LSTM, and compared them with deep LSTM models. [19] Table 3 of the paper shows that the MoE models used less inference-time compute, despite having 30x more parameters.

Vanilla MoE tends to have load balancing issues: some experts are consulted often, while other experts are consulted rarely or not at all. To encourage the gate to select each expert with equal frequency (proper load balancing) within each batch, each MoE layer in the original design has two auxiliary loss functions. [18] simplified these into a single auxiliary loss function. Specifically, let $n$ be the number of experts. Then for a given batch of queries $\{x_1, \dots, x_T\}$, the auxiliary loss for the batch is

$$n \sum_{i=1}^{n} f_i P_i$$

Here, $f_i = \frac{1}{T} \#(\text{queries sent to expert } i)$ is the fraction of queries for which expert $i$ is ranked highest, and $P_i = \frac{1}{T} \sum_{j=1}^{T} w_i(x_j)$ is the fraction of weight on expert $i$. This loss is minimized at $1$, precisely when every expert has equal weight $1/n$ in all situations.
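A minimal sketch of this auxiliary loss, assuming the router's softmax probabilities for a batch are available as a `(T, n)` array (the name `gate_probs` is illustrative) and that each query is sent to its top-ranked expert:

```python
import numpy as np

def load_balancing_loss(gate_probs):
    """Auxiliary loss: n * sum_i (fraction routed to i) * (mean weight on i).

    gate_probs: (T, n) router probabilities for T queries and n experts.
    """
    T, n = gate_probs.shape
    top1 = gate_probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=n) / T    # share of queries whose top expert is i
    P = gate_probs.mean(axis=0)               # average router weight on expert i
    return n * np.sum(f * P)

balanced = np.eye(4)                          # 4 queries, each fully on a different expert
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))   # every query on expert 0
loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

The balanced batch gives a loss of 1, while the fully collapsed batch gives a loss equal to the number of experts, which is the penalty that pushes the router away from overusing a single expert.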

Routing

In sparsely-gated MoE, only the top-k experts are queried, and their outputs are weighted-summed. There are other methods. [20]

In Hash MoE, [21] routing is performed deterministically by a hash function that is fixed before learning begins. For example, if the model is a 4-layer Transformer, the input is a token for the word "eat", and the hash of "eat" is a tuple of four expert indices beginning $(1, 4, \dots)$, then the token would be routed to the 1st expert in layer 1, the 4th expert in layer 2, and so on. Despite its simplicity, it achieves performance competitive with sparsely-gated MoE with $k = 1$.
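Deterministic routing of this kind can be sketched with a standard-library hash. The use of SHA-256 over a `"layer:token"` string is purely an assumption for illustration and is not the hash function used in the cited work; the point is only that the route depends on the token and layer and never changes during training.

```python
import hashlib

def hash_route(token, layer, n_experts):
    """Map (token, layer) to a fixed expert index in [0, n_experts)."""
    digest = hashlib.sha256(f"{layer}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_experts

# One expert index per layer for the same token in a 4-layer model.
route = [hash_route("eat", layer, n_experts=8) for layer in range(4)]
route_again = [hash_route("eat", layer, n_experts=8) for layer in range(4)]
```

Since no parameters are involved, there is no gating network to train and no routing decision that can drift; the trade-off is that the assignment cannot adapt to what the experts learn.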

In soft MoE, suppose that in each batch each of the $n$ experts can process $p$ queries, so that $n \times p$ queries can be assigned per batch. For each batch of queries $\{x_1, \dots, x_N\}$, the soft MoE layer computes an array $w_{i,j,k}$, such that $(w_{i,j,1}, \dots, w_{i,j,N})$ is a probability distribution over queries, and the $i$-th expert's $j$-th query is $\sum_k w_{i,j,k} x_k$. [22] However, this does not work with autoregressive modelling, since the weights over one token depend on all other tokens in the batch. [23]
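The construction of each expert's inputs as weighted averages of the batch can be sketched as below. How the weights are produced is an assumption here: the sketch uses a softmax over dot products between the queries and one hypothetical learnable vector per expert slot (`Phi`). Only the dispatch side is shown; combining expert outputs back into per-query outputs is omitted.

```python
import numpy as np

def soft_moe_dispatch(X, Phi):
    """Build every expert slot's input as a convex combination of the queries.

    X: (N, d) batch of queries.
    Phi: (n, p, d) one parameter vector per (expert, slot) pair (assumption).
    Returns w of shape (n, p, N) and slot inputs of shape (n, p, d).
    """
    logits = np.einsum("ipd,kd->ipk", Phi, X)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the N queries
    slots = np.einsum("ipk,kd->ipd", w, X)      # slot (i, j) = sum_k w[i, j, k] * X[k]
    return w, slots

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # N = 6 queries of dimension 4
Phi = rng.normal(size=(3, 2, 4))   # n = 3 experts, p = 2 slots each
w, slots = soft_moe_dispatch(X, Phi)
```

Every slot input mixes all queries in the batch, which is exactly why the scheme conflicts with autoregressive modelling: a slot built this way would leak information from later tokens into earlier ones.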

Other approaches include solving routing as a constrained linear programming problem, [24] making each expert choose the top-k queries it wants (instead of each query choosing the top-k experts for it), [25] and using reinforcement learning to train the routing algorithm (since picking an expert is a discrete action, as in RL). [26]
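Of these, the expert-choice idea is the easiest to illustrate: rank the queries separately for each expert and let every expert take its best few. The sketch below assumes a precomputed `(T, n)` matrix of query-expert affinity scores and a fixed per-expert `capacity`; both names are illustrative, and the scoring itself is outside the sketch.

```python
import numpy as np

def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring queries.

    scores: (T, n) affinity of each of T queries for each of n experts.
    Returns an (n, capacity) array of query indices, best first.
    """
    order = np.argsort(-scores, axis=0)    # per expert, queries in descending score
    return order[:capacity].T

scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.7]])
chosen = expert_choice(scores, capacity=2)
```

In this example expert 0 takes queries 0 and 2 and expert 1 takes queries 1 and 2. Every expert processes exactly `capacity` queries, so load is balanced by construction, but a query may be picked by several experts or by none.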

Capacity factor

Suppose there are $n$ experts in a layer. For a given batch of queries $\{x_1, \dots, x_T\}$, each query is routed to one or more experts. For example, if each query is routed to one expert as in Switch Transformers, and if the experts are load-balanced, then each expert should expect on average $T/n$ queries in a batch. In practice, the experts cannot expect perfect load balancing: in some batches, one expert might be underworked, while in other batches, it would be overworked.

Since the inputs cannot move through the layer until every expert in the layer has finished the queries it is assigned, load balancing is important. As a hard constraint on load balancing, there is the capacity factor $c$: each expert is only allowed to process up to $c \cdot T/n$ queries in a batch. [20] found $c \in [1.25, 2]$ to work in practice.
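The capacity constraint can be sketched for top-1 routing as follows. Two choices are assumptions made for the sketch rather than anything stated above: the per-expert limit is rounded up with `ceil`, and queries over the limit are simply marked as dropped in order of arrival. What happens to dropped queries afterwards is implementation-specific and not covered here.

```python
import math
import numpy as np

def apply_capacity(assignments, n_experts, capacity_factor):
    """Mark which queries fit within each expert's capacity.

    assignments: (T,) top-1 expert index per query.
    Returns a boolean keep-mask of shape (T,) and the per-expert capacity.
    """
    T = len(assignments)
    capacity = math.ceil(capacity_factor * T / n_experts)
    kept = np.zeros(T, dtype=bool)
    load = np.zeros(n_experts, dtype=int)
    for q, e in enumerate(assignments):
        if load[e] < capacity:       # expert still has room in this batch
            load[e] += 1
            kept[q] = True
    return kept, capacity

assignments = np.array([0, 0, 0, 0, 1, 1, 2, 3])
kept, capacity = apply_capacity(assignments, n_experts=4, capacity_factor=1.25)
```

With 8 queries, 4 experts and a capacity factor of 1.25, each expert may take 3 queries, so the fourth query sent to expert 0 does not fit. A larger capacity factor drops fewer queries at the cost of more padding and memory per expert.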

Applications to transformer models

MoE layers are used in the largest transformer models, for which training and inference over the full model are too costly. They are typically sparsely-gated, with sparsity 1 or 2. In Transformer models, the MoE layers are often used to select the feedforward layers (typically a linear-ReLU-linear network), appearing in each Transformer block after the multiheaded attention. This is because the feedforward layers take up an increasing portion of the computing cost as models grow larger. For example, in the PaLM-540B model, 90% of parameters are in its feedforward layers. [27]

As of 2023, models large enough to use MoE tend to be large language models, where each expert has on the order of 10 billion parameters. Other than language models, Vision MoE [28] is a Transformer model with MoE layers. They demonstrated it by training a model with 15 billion parameters.

A series of large language models from Google used MoE. GShard [29] uses MoE with up to top-2 experts per layer. Specifically, the top-1 expert is always selected, and the second-ranked expert is selected with probability proportional to that expert's weight according to the gating function. Later, GLaM [30] demonstrated a language model with 1.2 trillion parameters, each MoE layer using top-2 out of 64 experts. Switch Transformers [18] use top-1 in all MoE layers.

The NLLB-200 by Meta AI is a machine translation model for 200 languages. [31] Each MoE layer uses a hierarchical MoE with two levels. On the first level, the gating function chooses to use either a "shared" feedforward layer, or to use the experts. If using the experts, then another gating function computes the weights and chooses the top-2 experts. [32]

MoE large language models can be adapted for downstream tasks by instruction tuning. [33]

In December 2023, Mistral AI released Mixtral 8x7B under the Apache 2.0 license. It is a MoE language model with 46.7B parameters, 8 experts, and sparsity 2. They also released a version finetuned for instruction following. [34] [35]

In March 2024, Databricks released DBRX. It is a MoE language model with 132B parameters, 16 experts, and sparsity 4. They also released a version finetuned for instruction following. [36] [37]

References

  1. Baldacchino, Tara; Cross, Elizabeth J.; Worden, Keith; Rowson, Jennifer (2016). "Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems". Mechanical Systems and Signal Processing. 66–67: 178–200. Bibcode:2016MSSP...66..178B. doi:10.1016/j.ymssp.2015.05.009.
  2. Hampshire, J.B.; Waibel, A. (July 1992). "The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 14 (7): 751–769. doi:10.1109/34.142911.
  3. Waibel, Alexander; Hanazawa, Toshiyuki; Hinton, Geoffrey; Shikano, Kiyohiro; Lang, Kevin J. (1995). "Phoneme Recognition Using Time-Delay Neural Networks". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation. Psychology Press. doi:10.4324/9780203763247. ISBN 978-0-203-76324-7.
  4. Nowlan, Steven; Hinton, Geoffrey E (1990). "Evaluation of Adaptive Mixtures of Competing Experts". Advances in Neural Information Processing Systems. 3. Morgan-Kaufmann.
  5. Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
  6. Jordan, Michael; Jacobs, Robert (1991). "Hierarchies of adaptive experts". Advances in Neural Information Processing Systems. 4. Morgan-Kaufmann.
  7. Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl:1721.1/7206. ISSN 0899-7667.
  8. Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.
  9. Xu, Lei; Jordan, Michael; Hinton, Geoffrey E (1994). "An Alternative Model for Mixtures of Experts". Advances in Neural Information Processing Systems. 7. MIT Press.
  10. Collobert, Ronan; Bengio, Samy; Bengio, Yoshua (2001). "A Parallel Mixture of SVMs for Very Large Scale Problems". Advances in Neural Information Processing Systems. 14. MIT Press.
  11. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "12: Applications". Deep learning. Adaptive computation and machine learning. Cambridge, Mass: The MIT Press. ISBN 978-0-262-03561-3.
  12. Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN 0167-9473.
  13. Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
  14. Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.
  15. Bengio, Yoshua; Léonard, Nicholas; Courville, Aaron (2013). "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation". arXiv:1308.3432 [cs.LG].
  16. Eigen, David; Ranzato, Marc'Aurelio; Sutskever, Ilya (2013). "Learning Factored Representations in a Deep Mixture of Experts". arXiv:1312.4314 [cs.LG].
  17. Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv:1701.06538 [cs.LG].
  18. Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv:2101.03961. ISSN 1532-4435.
  19. Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus; Klingner, Jeff; Shah, Apurva; Johnson, Melvin; Liu, Xiaobing; Kaiser, Łukasz (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144 [cs.CL].
  20. Zoph, Barret; Bello, Irwan; Kumar, Sameer; Du, Nan; Huang, Yanping; Dean, Jeff; Shazeer, Noam; Fedus, William (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models". arXiv:2202.08906 [cs.CL].
  21. Roller, Stephen; Sukhbaatar, Sainbayar; Szlam, Arthur; Weston, Jason (2021). "Hash Layers For Large Sparse Models". Advances in Neural Information Processing Systems. 34. Curran Associates: 17555–17566. arXiv:2106.04426.
  22. Puigcerver, Joan; Riquelme, Carlos; Mustafa, Basil; Houlsby, Neil (2023). "From Sparse to Soft Mixtures of Experts". arXiv:2308.00951 [cs.LG].
  23. Wang, Phil (2023-10-04), lucidrains/soft-moe-pytorch, retrieved 2023-10-08.
  24. Lewis, Mike; Bhosale, Shruti; Dettmers, Tim; Goyal, Naman; Zettlemoyer, Luke (2021-07-01). "BASE Layers: Simplifying Training of Large, Sparse Models". Proceedings of the 38th International Conference on Machine Learning. PMLR: 6265–6274. arXiv:2103.16716.
  25. Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew M.; Chen, Zhifeng; Le, Quoc V.; Laudon, James (2022-12-06). "Mixture-of-Experts with Expert Choice Routing". Advances in Neural Information Processing Systems. 35: 7103–7114. arXiv:2202.09368.
  26. Bengio, Emmanuel; Bacon, Pierre-Luc; Pineau, Joelle; Precup, Doina (2015). "Conditional Computation in Neural Networks for faster models". arXiv:1511.06297 [cs.LG].
  27. "Transformer Deep Dive: Parameter Counting". Retrieved 2023-10-10.
  28. Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, André; Keysers, Daniel; Houlsby, Neil (2021). "Scaling Vision with Sparse Mixture of Experts". Advances in Neural Information Processing Systems. 34: 8583–8595. arXiv:2106.05974.
  29. Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv:2006.16668 [cs.CL].
  30. Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv:2112.06905 [cs.CL].
  31. "200 languages within a single AI model: A breakthrough in high-quality machine translation". ai.facebook.com. 2022-06-19. Archived from the original on 2023-01-09.
  32. NLLB Team; Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04672 [cs.CL].
  33. Shen, Sheng; Hou, Le; Zhou, Yanqi; Du, Nan; Longpre, Shayne; Wei, Jason; Chung, Hyung Won; Zoph, Barret; Fedus, William; Chen, Xinyun; Vu, Tu; Wu, Yuexin; Chen, Wuyang; Webson, Albert; Li, Yunxuan (2023). "Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models". arXiv:2305.14705 [cs.CL].
  34. Mistral AI (2023-12-11). "Mixtral of experts". mistral.ai. Retrieved 2024-02-04.
  35. Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de las; Hanna, Emma Bou (2024-01-08), Mixtral of Experts, arXiv:2401.04088, retrieved 2024-02-04.
  36. "Introducing DBRX: A New State-of-the-Art Open LLM". Databricks. 2024-03-27. Retrieved 2024-03-28.
  37. Knight, Will. "Inside the Creation of the World's Most Powerful Open Source AI Model". Wired. ISSN 1059-1028. Retrieved 2024-03-28.