List of large language models

Last updated

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

Contents

This page lists notable large language models.

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Also, only the largest model's cost is written.

NameRelease date [a] DeveloperNumber of parameters (billion) [b] Corpus sizeTraining cost (petaFLOP-day)License [c] Notes
GPT-1 June 2018 OpenAI 0.1171 [1] MIT [2] First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT October 2018 Google 0.340 [3] 3.3 billion words [3] 9 [4] Apache 2.0 [5] An early and influential language model. [6] Encoder-only and thus not built to be prompted or generative. [7] Training took 4 days on 64 TPUv2 chips. [8]
T5 October 2019Google11 [9] 34 billion tokens [9] Apache 2.0 [10] Base model for many Google projects, such as Imagen. [11]
XLNet June 2019 Google 0.340 [12] 33 billion words330Apache 2.0 [13] An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. [14]
GPT-2 February 2019 OpenAI 1.5 [15] 40GB [16] (~10 billion tokens) [17] 28 [18] MIT [19] Trained on 32 TPUv3 chips for 1 week. [18]
GPT-3 May 2020OpenAI175 [20] 300 billion tokens [17] 3640 [21] proprietaryA fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. [22]
GPT-NeoMarch 2021 EleutherAI 2.7 [23] 825 GiB [24] MIT [25] The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. [25]
GPT-J June 2021 EleutherAI 6 [26] 825 GiB [24] 200 [27] Apache 2.0GPT-3-style language model
Megatron-Turing NLGOctober 2021 [28] Microsoft and Nvidia 530 [29] 338.6 billion tokens [29] 38000 [30] Restricted web accessTrained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours. [30]
Ernie 3.0 TitanDecember 2021 Baidu 260 [31] 4 TbProprietaryChinese-language LLM. Ernie Bot is based on this model.
Claude [32] December 2021 Anthropic 52 [33] 400 billion tokens [33] betaFine-tuned for desirable behavior in conversations. [34]
GLaM (Generalist Language Model)December 2021Google1200 [35] 1.6 trillion tokens [35] 5600 [35] ProprietarySparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
GopherDecember 2021 DeepMind 280 [36] 300 billion tokens [37] 5833 [38] ProprietaryLater developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications)January 2022Google137 [39] 1.56T words, [39] 168 billion tokens [37] 4110 [40] ProprietarySpecialized for response generation in conversations.
GPT-NeoXFebruary 2022 EleutherAI 20 [41] 825 GiB [24] 740 [27] Apache 2.0based on the Megatron architecture
Chinchilla March 2022 DeepMind 70 [42] 1.4 trillion tokens [42] [37] 6805 [38] ProprietaryReduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model)April 2022Google540 [43] 768 billion tokens [42] 29,250 [38] ProprietaryTrained for ~60 days on ~6000 TPU v4 chips. [38] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer)May 2022 Meta 175 [44] 180 billion tokens [45] 310 [27] Non-commercial research [d] GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. [46]
YaLM 100BJune 2022 Yandex 100 [47] 1.7TB [47] Apache 2.0English-Russian model based on Microsoft's Megatron-LM.
MinervaJune 2022Google540 [48] 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server [48] ProprietaryFor solving "mathematical and scientific questions using step-by-step reasoning". [49] Initialized from PaLM models, then finetuned on mathematical and scientific data.
BLOOM July 2022Large collaboration led by Hugging Face 175 [50] 350 billion tokens (1.6TB) [51] Responsible AIEssentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
GalacticaNovember 2022 Meta 120106 billion tokens [52] unknownCC-BY-NC-4.0Trained on scientific text and modalities.
AlexaTM (Teacher Models)November 2022 Amazon 20 [53] 1.3 trillion [54] proprietary [55] bidirectional sequence-to-sequence architecture
Neuro-sama December 2022IndependentUnknownUnknownprivately-ownedA language model designed for live-streaming on Twitch.
LLaMA (Large Language Model Meta AI)February 2023 Meta AI 65 [56] 1.4 trillion [56] 6300 [57] Non-commercial research [e] Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters. [56]
GPT-4 March 2023OpenAIUnknown [f] (According to rumors: 1760) [59] UnknownUnknownproprietaryAvailable for ChatGPT Plus users and used in several products.
ChameleonJune 2024 Meta AI 34 [60] 4.4 trillion
Cerebras-GPTMarch 2023 Cerebras 13 [61] 270 [27] Apache 2.0Trained with Chinchilla formula.
FalconMarch 2023 Technology Innovation Institute 40 [62] 1 trillion tokens, from RefinedWeb (filtered web text corpus) [63] plus some "curated corpora". [64] 2800 [57] Apache 2.0 [65]
BloombergGPTMarch 2023 Bloomberg L.P. 50363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets [66] ProprietaryTrained on financial data from proprietary sources, for financial tasks.
PanGu-Σ March 2023 Huawei 1085329 billion tokens [67] Proprietary
OpenAssistant [68] March 2023 LAION 171.5 trillion tokensApache 2.0Trained on crowdsourced open data
Jurassic-2 [69] March 2023 AI21 Labs UnknownUnknownProprietaryMultilingual [70]
PaLM 2 (Pathways Language Model 2)May 2023Google340 [71] 3.6 trillion tokens [71] 85,000 [57] ProprietaryWas used in Bard chatbot. [72]
Llama 2July 2023Meta AI70 [73] 2 trillion tokens [73] 21,000Llama 2 license1.7 million A100-hours. [74]
Claude 2 July 2023AnthropicUnknownUnknownUnknownProprietaryUsed in Claude chatbot. [75]
Granite 13b July 2023 IBM UnknownUnknownUnknownProprietaryUsed in IBM Watsonx. [76]
Mistral 7BSeptember 2023 Mistral AI 7.3 [77] UnknownApache 2.0
Claude 2.1 November 2023AnthropicUnknownUnknownUnknownProprietaryUsed in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. [78]
Grok-1 [79] November 2023 xAI 314UnknownUnknownApache 2.0Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter). [80]
Gemini 1.0 December 2023 Google DeepMind UnknownUnknownUnknownProprietaryMultimodal model, comes in three sizes. Used in the chatbot of the same name. [81]
Mixtral 8x7BDecember 2023 Mistral AI 46.7UnknownUnknownApache 2.0Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. [82] Mixture of experts model, with 12.9 billion parameters activated per token. [83]
Mixtral 8x22BApril 2024 Mistral AI 141UnknownUnknownApache 2.0 [84]
Phi-2 December 2023Microsoft2.71.4T tokens419 [85] MITTrained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. [85]
Gemini 1.5 February 2024 Google DeepMind UnknownUnknownUnknownProprietaryMultimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens. [86]
Gemini Ultra February 2024 Google DeepMind UnknownUnknownUnknown
GemmaFebruary 2024 Google DeepMind 76T tokensUnknownGemma Terms of Use [87]
Claude 3 March 2024AnthropicUnknownUnknownUnknownProprietaryIncludes three models, Haiku, Sonnet, and Opus. [88]
Nova October 2024 Rubik's AI UnknownUnknownUnknownProprietaryIncludes three models, Nova-Instant, Nova-Air, and Nova-Pro.
DBRX March 2024 Databricks and Mosaic ML 13612T TokensDatabricks Open Model LicenseTraining cost 10 million USD.
Fugaku-LLMMay 2024 Fujitsu, Tokyo Institute of Technology, etc.13380B TokensThe largest model ever trained on CPU-only, on the Fugaku. [89]
Phi-3 April 2024Microsoft14 [90] 4.8T TokensMITMicrosoft markets them as "small language model". [91]
Granite Code Models May 2024 IBM UnknownUnknownUnknownApache 2.0
Qwen2June 2024 Alibaba Cloud 72 [92] 3T TokensMultiple sizes, the smallest being 0.5B.
Nemotron-4June 2024 Nvidia 3409T Tokens200,000NVIDIA Open Model LicenseTrained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024. [93] [94]
Llama 3.1July 2024Meta AI40515.6T tokens440,000Llama 3 license405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs. [95] [96]
DeepSeek V3December 2024 DeepSeek 67114.8T tokens440,00DeepSeek License2.788M hours on H800 GPUs. [97]
Amazon NovaDecember 2024 Amazon UnknownUnknownUnknownProprietaryIncludes three models, Nova Micro, Nova Lite, and Nova Pro [98]

See also

Notes

  1. This is the date that documentation describing the model's architecture was first released.
  2. In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
  3. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
  4. The smaller models including 66B are publicly available, while the 175B model is available on request.
  5. Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
  6. As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..." [58]

Related Research Articles

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

<span class="mw-page-title-main">Cerebras</span> American semiconductor company

Cerebras Systems Inc. is an American artificial intelligence (AI) company with offices in Sunnyvale, San Diego, Toronto, and Bangalore, India. Cerebras builds computer systems for complex AI deep learning applications.

<span class="mw-page-title-main">GPT-2</span> 2019 text-generating language model

Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by full release of the 1.5-billion-parameter model on November 5, 2019.

OpenAI Codex is an artificial intelligence model developed by OpenAI. It parses natural language and generates code in response. It powers GitHub Copilot, a programming autocompletion tool for select IDEs, like Visual Studio Code and Neovim. Codex is a descendant of OpenAI's GPT-3 model, fine-tuned for use in programming applications.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative artificial intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform. A prompt for a text-to-text language model can be a query such as "what is Fermat's little theorem?", a command such as "write a poem in the style of Edgar Allan Poe about leaves falling", or a longer statement including context, instructions, and conversation history.

Hugging Face, Inc. is an American company incorporated under the Delaware General Corporation Law and based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

<span class="mw-page-title-main">Stable Diffusion</span> Image-generating machine learning model

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

Prompt injection is a family of related computer security exploits carried out by getting a machine learning model which was trained to follow human-given instructions to follow instructions provided by a malicious user. This stands in contrast to the intended operation of instruction-following systems, wherein the ML model is intended only to follow trusted instructions (prompts) provided by the ML model's operator.

<span class="mw-page-title-main">Hallucination (artificial intelligence)</span> Erroneous material generated by AI

In the field of artificial intelligence (AI), a hallucination or artificial hallucination is a response generated by AI that contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where hallucination typically involves false percepts. However, there is a key difference: AI hallucination is associated with erroneous responses rather than perceptual experiences.

Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model trained and created by OpenAI and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot. As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data and "data licensed from third-party providers" is used to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.

<span class="mw-page-title-main">Generative pre-trained transformer</span> Type of large language model

A generative pre-trained transformer (GPT) is a type of large language model (LLM) and a prominent framework for generative artificial intelligence. It is an artificial neural network that is used in natural language processing by machines. It is based on the transformer deep learning architecture, pre-trained on large data sets of unlabeled text, and able to generate novel human-like content. As of 2023, most LLMs had these characteristics and are sometimes referred to broadly as GPTs.

<span class="mw-page-title-main">GPT-J</span> Open source artificial intelligence text generating language model developed by EleutherAI

GPT-J or GPT-J-6B is an open-source large language model (LLM) developed by EleutherAI in 2021. As the name suggests, it is a generative pre-trained transformer model designed to produce human-like text that continues from a prompt. The optional "6B" in the name refers to the fact that it has 6 billion parameters.

<span class="mw-page-title-main">EleutherAI</span> Artificial intelligence research collective

EleutherAI is a grass-roots non-profit artificial intelligence (AI) research group. The group, considered an open-source version of OpenAI, was formed in a Discord server in July 2020 by Connor Leahy, Sid Black, and Leo Gao to organize a replication of GPT-3. In early 2023, it formally incorporated as the EleutherAI Institute, a non-profit research institute.

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

In deep learning, fine-tuning is an approach to transfer learning in which the parameters of a pre-trained neural network model are trained on new data. Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are "frozen". A model may also be augmented with "adapters" that consist of far fewer parameters than the original model, and fine-tuned in a parameter-efficient way by tuning the weights of the adapters and leaving the rest of the model's weights frozen.

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones.

<span class="mw-page-title-main">Llama (language model)</span> Large language model by Meta AI

Llama is a family of autoregressive large language models (LLMs) released by Meta AI starting in February 2023. The latest version is Llama 3.3, released in December 2024.

Vicuna LLM is an omnibus Large Language Model used in AI research. Its methodology is to enable the public at large to contrast and compare the accuracy of LLMs "in the wild" and to vote on their output; a question-and-answer chat format is used. At the beginning of each round two LLM chatbots from a diverse pool of nine are presented randomly and anonymously, their identities only being revealed upon voting on their answers. The user has the option of either replaying ("regenerating") a round, or beginning an entirely fresh one with new LLMs. Based on Llama 2, it is an open source project, and it itself has become the subject of academic research in the burgeoning field. A non-commercial, public demo of the Vicuna-13b model is available to access using LMSYS.

llama.cpp Software library for LLM inference

llama.cpp is an open source software library that performs inference on various large language models such as Llama. It is co-developed alongside the GGML project, a general-purpose tensor library.

References

  1. "Improving language understanding with unsupervised learning". openai.com. June 11, 2018. Archived from the original on 2023-03-18. Retrieved 2023-03-18.
  2. "finetune-transformer-lm". GitHub. Archived from the original on 19 May 2023. Retrieved 2 January 2024.
  3. 1 2 Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv: 1810.04805v2 [cs.CL].
  4. Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models". The Next Platform. Archived from the original on 2023-06-20. Retrieved 2023-06-20.
  5. "BERT". March 13, 2023. Archived from the original on January 13, 2021. Retrieved March 13, 2023 via GitHub.
  6. Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus. 151 (2): 127–138. doi: 10.1162/daed_a_01905 . S2CID   248377870. Archived from the original on 2023-11-17. Retrieved 2023-03-09.
  7. Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris (2022). "Bidirectional Language Models Are Also Few-shot Learners". arXiv: 2209.14500 [cs.LG].
  8. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv: 1810.04805v2 [cs.CL].
  9. 1 2 Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv: 1910.10683 . ISSN   1533-7928.
  10. google-research/text-to-text-transfer-transformer, Google Research, 2024-04-02, archived from the original on 2024-03-29, retrieved 2024-04-04
  11. "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Archived from the original on 2024-03-27. Retrieved 2024-04-04.
  12. "Pretrained models — transformers 2.0.0 documentation". huggingface.co. Archived from the original on 2024-08-05. Retrieved 2024-08-05.
  13. "xlnet". GitHub. Archived from the original on 2 January 2024. Retrieved 2 January 2024.
  14. Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January 2020). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv: 1906.08237 [cs.CL].
  15. "GPT-2: 1.5B Release". OpenAI. 2019-11-05. Archived from the original on 2019-11-14. Retrieved 2019-11-14.
  16. "Better language models and their implications". openai.com. Archived from the original on 2023-03-16. Retrieved 2023-03-13.
  17. 1 2 "OpenAI's GPT-3 Language Model: A Technical Overview". lambdalabs.com. 3 June 2020. Archived from the original on 27 March 2023. Retrieved 13 March 2023.
  18. 1 2 "openai-community/gpt2-xl · Hugging Face". huggingface.co. Archived from the original on 2024-07-24. Retrieved 2024-07-24.
  19. "gpt-2". GitHub. Archived from the original on 11 March 2023. Retrieved 13 March 2023.
  20. Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter". TechCrunch. Archived from the original on 16 March 2023. Retrieved 9 March 2023.
  21. Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (May 28, 2020). "Language Models are Few-Shot Learners". arXiv: 2005.14165v4 [cs.CL].
  22. "ChatGPT: Optimizing Language Models for Dialogue". OpenAI. 2022-11-30. Archived from the original on 2022-11-30. Retrieved 2023-01-13.
  23. "GPT Neo". March 15, 2023. Archived from the original on March 12, 2023. Retrieved March 12, 2023 via GitHub.
  24. 1 2 3 Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv: 2101.00027 [cs.CL].
  25. 1 2 Iyer, Abhishek (15 May 2021). "GPT-3's free alternative GPT-Neo is something to be excited about". VentureBeat. Archived from the original on 9 March 2023. Retrieved 13 March 2023.
  26. "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront". www.forefront.ai. Archived from the original on 2023-03-09. Retrieved 2023-02-28.
  27. 1 2 3 4 Dey, Nolan; Gosal, Gurpreet; Zhiming; Chen; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom, Marvin; Hestness, Joel (2023-04-01). "Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster". arXiv: 2304.03208 [cs.LG].
  28. Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". Microsoft Research. Archived from the original on 13 March 2023. Retrieved 13 March 2023.
  29. 1 2 Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv: 2201.11990 [cs.CL].
  30. 1 2 Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong (2022-07-21), DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, arXiv: 2201.05596
  31. Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang, Junyuan; Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai, Yangfan; Chen, Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu, Tian; Zeng, Wei; Li, Ge; Gao, Wen; Wang, Haifeng (December 23, 2021). "ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation". arXiv: 2112.12731 [cs.CL].
  32. "Product". Anthropic. Archived from the original on 16 March 2023. Retrieved 14 March 2023.
  33. 1 2 Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. (9 December 2021). "A General Language Assistant as a Laboratory for Alignment". arXiv: 2112.00861 [cs.CL].
  34. Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. (15 December 2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv: 2212.08073 [cs.CL].
  35. 1 2 3 Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM". ai.googleblog.com. Archived from the original on 2023-03-12. Retrieved 2023-03-09.
  36. "Language modelling at scale: Gopher, ethical considerations, and retrieval". www.deepmind.com. 8 December 2021. Archived from the original on 20 March 2023. Retrieved 20 March 2023.
  37. 1 2 3 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. (29 March 2022). "Training Compute-Optimal Large Language Models". arXiv: 2203.15556 [cs.CL].
  38. 1 2 3 4 Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways Archived 2023-06-10 at the Wayback Machine
  39. 1 2 Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything". ai.googleblog.com. Archived from the original on 2022-03-25. Retrieved 2023-03-09.
  40. Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-Tze; Jin, Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven; Ghafouri, Amin; Menegali, Marcelo (2022-01-01). "LaMDA: Language Models for Dialog Applications". arXiv: 2201.08239 [cs.CL].
  41. Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. (2022-05-01). GPT-NeoX-20B: An Open-Source Autoregressive Language Model. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. Vol. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. pp. 95–136. Archived from the original on 2022-12-10. Retrieved 2022-12-19.
  42. 1 2 3 Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical analysis of compute-optimal large language model training". Deepmind Blog. Archived from the original on 13 April 2022. Retrieved 9 March 2023.
  43. Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". ai.googleblog.com. Archived from the original on 2022-04-04. Retrieved 2023-03-09.
  44. Susan Zhang; Mona Diab; Luke Zettlemoyer. "Democratizing access to large-scale language models with OPT-175B". ai.facebook.com. Archived from the original on 2023-03-12. Retrieved 2023-03-12.
  45. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv: 2205.01068 [cs.CL].
  46. "metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq". GitHub. Retrieved 2024-10-18.
  47. 1 2 Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay (2022-06-22), YaLM 100B, archived from the original on 2023-06-16, retrieved 2023-03-18
  48. 1 2 Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant (30 June 2022). "Solving Quantitative Reasoning Problems with Language Models". arXiv: 2206.14858 [cs.CL].
  49. "Minerva: Solving Quantitative Reasoning Problems with Language Models". ai.googleblog.com. 30 June 2022. Retrieved 20 March 2023.
  50. Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?". Nature. 615 (7951): 202–205. Bibcode:2023Natur.615..202A. doi:10.1038/d41586-023-00641-w. PMID   36890378. S2CID   257380916. Archived from the original on 16 March 2023. Retrieved 9 March 2023.
  51. "bigscience/bloom · Hugging Face". huggingface.co. Archived from the original on 2023-04-12. Retrieved 2023-03-13.
  52. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv: 2211.09085 [cs.CL].
  53. "20B-parameter Alexa model sets new marks in few-shot learning". Amazon Science. 2 August 2022. Archived from the original on 15 March 2023. Retrieved 12 March 2023.
  54. Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. (3 August 2022). "AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model". arXiv: 2208.01448 [cs.CL].
  55. "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". aws.amazon.com. 17 November 2022. Archived from the original on 13 March 2023. Retrieved 13 March 2023.
  56. 1 2 3 "Introducing LLaMA: A foundational, 65-billion-parameter large language model". Meta AI. 24 February 2023. Archived from the original on 3 March 2023. Retrieved 9 March 2023.
  57. 1 2 3 "The Falcon has landed in the Hugging Face ecosystem". huggingface.co. Archived from the original on 2023-06-20. Retrieved 2023-06-20.
  58. "GPT-4 Technical Report" (PDF). OpenAI . 2023. Archived (PDF) from the original on March 14, 2023. Retrieved March 14, 2023.
  59. Schreiner, Maximilian (2023-07-11). "GPT-4 architecture, datasets, costs and more leaked". THE DECODER. Archived from the original on 2023-07-12. Retrieved 2024-07-26.
  60. Dickson, Ben (22 May 2024). "Meta introduces Chameleon, a state-of-the-art multimodal model". VentureBeat.
  61. Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models". Cerebras. Archived from the original on March 28, 2023. Retrieved March 28, 2023.
  62. "Abu Dhabi-based TII launches its own version of ChatGPT". tii.ae. Archived from the original on 2023-04-03. Retrieved 2023-04-03.
  63. Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien (2023-06-01). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only". arXiv: 2306.01116 [cs.CL].
  64. "tiiuae/falcon-40b · Hugging Face". huggingface.co. 2023-06-09. Retrieved 2023-06-20.
  65. UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free Archived 2024-02-08 at the Wayback Machine , 31 May 2023
  66. Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur, Prabhanjan; Rosenberg, David; Mann, Gideon (March 30, 2023). "BloombergGPT: A Large Language Model for Finance". arXiv: 2303.17564 [cs.LG].
  67. Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun (March 19, 2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing". arXiv: 2303.10845 [cs.CL].
  68. Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations – Democratizing Large Language Model Alignment". arXiv: 2304.07327 [cs.CL].
  69. Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI". www.timesofisrael.com. Archived from the original on 2023-07-24. Retrieved 2023-07-24.
  70. Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race". TechCrunch. Archived from the original on 2023-07-24. Retrieved 2023-07-24.
  71. 1 2 Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor". CNBC . Archived from the original on 16 May 2023. Retrieved 18 May 2023.
  72. "Introducing PaLM 2". Google. May 10, 2023. Archived from the original on May 18, 2023. Retrieved May 18, 2023.
  73. 1 2 "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model". Meta AI. 2023. Archived from the original on 2024-01-05. Retrieved 2023-07-19.
  74. "llama/MODEL_CARD.md at main · meta-llama/llama". GitHub. Archived from the original on 2024-05-28. Retrieved 2024-05-28.
  75. "Claude 2". anthropic.com. Archived from the original on 15 December 2023. Retrieved 12 December 2023.
  76. Nirmal, Dinesh (2023-09-07). "Building AI for business: IBM's Granite foundation models". IBM Blog. Archived from the original on 2024-07-22. Retrieved 2024-08-11.
  77. "Announcing Mistral 7B". Mistral. 2023. Archived from the original on 2024-01-06. Retrieved 2023-10-06.
  78. "Introducing Claude 2.1". anthropic.com. Archived from the original on 15 December 2023. Retrieved 12 December 2023.
  79. xai-org/grok-1, xai-org, 2024-03-19, archived from the original on 2024-05-28, retrieved 2024-03-19
  80. "Grok-1 model card". x.ai. Retrieved 12 December 2023.
  81. "Gemini – Google DeepMind". deepmind.google. Archived from the original on 8 December 2023. Retrieved 12 December 2023.
  82. Franzen, Carl (11 December 2023). "Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance". VentureBeat. Archived from the original on 11 December 2023. Retrieved 12 December 2023.
  83. "Mixtral of experts". mistral.ai. 11 December 2023. Archived from the original on 13 February 2024. Retrieved 12 December 2023.
  84. AI, Mistral (2024-04-17). "Cheaper, Better, Faster, Stronger". mistral.ai. Archived from the original on 2024-05-05. Retrieved 2024-05-05.
  85. 1 2 Hughes, Alyssa (12 December 2023). "Phi-2: The surprising power of small language models". Microsoft Research. Archived from the original on 12 December 2023. Retrieved 13 December 2023.
  86. "Our next-generation model: Gemini 1.5". Google. 15 February 2024. Archived from the original on 16 February 2024. Retrieved 16 February 2024. This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we've also successfully tested up to 10 million tokens.
  87. "Gemma" via GitHub.
  88. "Introducing the next generation of Claude". www.anthropic.com. Archived from the original on 2024-03-04. Retrieved 2024-03-04.
  89. "Fugaku-LLM/Fugaku-LLM-13B · Hugging Face". huggingface.co. Archived from the original on 2024-05-17. Retrieved 2024-05-17.
  90. "Phi-3". azure.microsoft.com. 23 April 2024. Archived from the original on 2024-04-27. Retrieved 2024-04-28.
  91. "Phi-3 Model Documentation". huggingface.co. Archived from the original on 2024-05-13. Retrieved 2024-04-28.
  92. "Qwen2". GitHub . Archived from the original on 2024-06-17. Retrieved 2024-06-17.
  93. "nvidia/Nemotron-4-340B-Base · Hugging Face". huggingface.co. 2024-06-14. Archived from the original on 2024-06-15. Retrieved 2024-06-15.
  94. "Nemotron-4 340B | Research". research.nvidia.com. Archived from the original on 2024-06-15. Retrieved 2024-06-15.
  95. "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta
  96. "llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models". GitHub. Archived from the original on 2024-07-23. Retrieved 2024-07-23.
  97. deepseek-ai/DeepSeek-V3, DeepSeek, 2024-12-26, retrieved 2024-12-26
  98. Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards3, Amazon, 2024-12-27, retrieved 2024-12-27