A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Where a family includes multiple model sizes, only the largest model's cost is listed.
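As a quick sanity check, the short Python sketch below applies this conversion to one of the table entries; the helper names are illustrative rather than from any cited source, and the 6 × parameters × tokens rule of thumb at the end is an assumption, not something stated in this article.

```python
# Minimal sketch of the petaFLOP-day conversion used in the "Training cost"
# column. Helper names are illustrative, not from any cited source.

PFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/sec sustained for 1 day = 8.64E19 FLOP


def flop_to_petaflop_days(flop: float) -> float:
    """Convert a total FLOP count into petaFLOP-days."""
    return flop / PFLOP_DAY_IN_FLOP


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert a petaFLOP-day budget back into total FLOP."""
    return petaflop_days * PFLOP_DAY_IN_FLOP


if __name__ == "__main__":
    assert PFLOP_DAY_IN_FLOP == 8.64e19

    # Cross-check against the Llama 3.1 row below: its cited 3.8E25 FLOP
    # corresponds to the ~440,000 petaFLOP-days listed in the table.
    print(f"{flop_to_petaflop_days(3.8e25):,.0f} petaFLOP-days")  # ≈ 439,815

    # Rule-of-thumb check (an assumption, not from the cited sources):
    # training compute ≈ 6 × parameters × training tokens.
    # For 405e9 parameters and 15.6e12 tokens this gives ≈ 3.79e25 FLOP,
    # consistent with the cited figure.
    print(f"{6 * 405e9 * 15.6e12:.2e} FLOP")
```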
| Name | Release date [a] | Developer | Number of parameters (billion) [b] | Corpus size | Training cost (petaFLOP-day) | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | June 11, 2018 | OpenAI | 0.117 | Unknown | 1 [1] | MIT [2] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. [3] |
| BERT | October 2018 | Google | 0.340 [4] | 3.3 billion words [4] | 9 [5] | Apache 2.0 [6] | An early and influential language model. [7] Encoder-only and thus not built to be prompted or generative. [8] Training took 4 days on 64 TPUv2 chips. [9] |
| T5 | October 2019 | Google | 11 [10] | 34 billion tokens [10] | Unknown | Apache 2.0 [11] | Base model for many Google projects, such as Imagen. [12] |
| XLNet | June 2019 | Google | 0.340 [13] | 33 billion words | 330 | Apache 2.0 [14] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. [15] |
| GPT-2 | February 2019 | OpenAI | 1.5 [16] | 40GB [17] (~10 billion tokens) [18] | 28 [19] | MIT [20] | Trained on 32 TPUv3 chips for 1 week. [19] |
| GPT-3 | May 2020 | OpenAI | 175 [21] | 300 billion tokens [18] | 3640 [22] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. [23] |
| GPT-Neo | March 2021 | EleutherAI | 2.7 [24] | 825 GiB [25] | Unknown | MIT [26] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. [26] |
| GPT-J | June 2021 | EleutherAI | 6 [27] | 825 GiB [25] | 200 [28] | Apache 2.0 | GPT-3-style language model |
| Megatron-Turing NLG | October 2021 [29] | Microsoft and Nvidia | 530 [30] | 338.6 billion tokens [30] | 38000 [31] | Unreleased | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours [31] |
| Ernie 3.0 Titan | December 2021 | Baidu | 260 [32] | 4TB | Unknown | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude [33] | December 2021 | Anthropic | 52 [34] | 400 billion tokens [34] | Unknown | Proprietary | Fine-tuned for desirable behavior in conversations. [35] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200 [36] | 1.6 trillion tokens [36] | 5600 [36] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280 [37] | 300 billion tokens [38] | 5833 [39] | Proprietary | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137 [40] | 1.56T words, [40] 168 billion tokens [38] | 4110 [41] | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20 [42] | 825 GiB [25] | 740 [28] | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70 [43] | 1.4 trillion tokens [43] [38] | 6805 [39] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | April 2022 | Google | 540 [44] | 768 billion tokens [43] | 29,250 [39] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips. [39] |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175 [45] | 180 billion tokens [46] | 310 [28] | Non-commercial research [d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. [47] |
| YaLM 100B | June 2022 | Yandex | 100 [48] | 1.7TB [48] | Unknown | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM |
| Minerva | June 2022 | Google | 540 [49] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server [49] | Unknown | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning". [50] Initialized from PaLM models, then fine-tuned on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175 [51] | 350 billion tokens (1.6TB) [52] | Unknown | Responsible AI | Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages) |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens [53] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20 [54] | 1.3 trillion [55] | Unknown | Proprietary [56] | Bidirectional sequence-to-sequence architecture |
| Llama | February 2023 | Meta AI | 65 [57] | 1.4 trillion [57] | 6300 [58] | Non-commercial research [e] | Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters. [57] |
| GPT-4 | March 2023 | OpenAI | Unknown [f] (According to rumors: 1760) [60] | Unknown | Unknown, estimated 230,000 | Proprietary | Available to ChatGPT users and used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13 [61] | Unknown | 270 [28] | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40 [62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus) [63] plus some "curated corpora". [64] | 2800 [58] | Apache 2.0 [65] | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets [66] | Unknown | Unreleased | Trained on financial data from proprietary sources, for financial tasks |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens [67] | Unknown | Proprietary | |
| OpenAssistant [68] | March 2023 | LAION | 17 | 1.5 trillion tokens | Unknown | Apache 2.0 | Trained on crowdsourced open data |
| Jurassic-2 [69] | March 2023 | AI21 Labs | Unknown | Unknown | Unknown | Proprietary | Multilingual [70] |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340 [71] | 3.6 trillion tokens [71] | 85,000 [58] | Proprietary | Was used in the Bard chatbot. [72] |
| YandexGPT | May 17, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice chatbot. |
| Llama 2 | July 2023 | Meta AI | 70 [73] | 2 trillion tokens [73] | 21,000 | Llama 2 license | 1.7 million A100-hours. [74] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. [75] |
| Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx. [76] |
| Mistral 7B | September 2023 | Mistral AI | 7.3 [77] | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 2 | September 7, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice chatbot. |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. [78] |
| Grok 1 [79] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter). [80] |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name. [81] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. [82] Mixture of experts model, with 12.9 billion parameters activated per token. [83] |
| DeepSeek-LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens [84] : table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for 67B. 1e23 FLOPs for 7B [84] : figure 5 |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419 [85] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. [85] |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens. [86] |
| Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use [87] | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models, Haiku, Sonnet, and Opus. [88] |
| DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | Unknown | Databricks Open Model License [89] [90] | Training cost: 10 million USD. |
| YandexGPT 3 Pro | March 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice chatbot. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | Unknown | Fugaku-LLM Terms of Use [91] | The largest model ever trained using only CPUs, on the Fugaku supercomputer. [92] |
| Chameleon | May 2024 | Meta AI | 34 [93] | 4.4 trillion | Unknown | Non-commercial research [94] | |
| Mixtral 8x22B | April 17, 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [95] |
| Phi-3 | April 23, 2024 | Microsoft | 14 [96] | 4.8T tokens | Unknown | MIT | Microsoft markets them as "small language models". [97] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 3 Lite | May 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice chatbot. |
| Qwen2 | June 2024 | Alibaba Cloud | 72 [98] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M GPU-hours on H800 GPUs. [99] |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License [100] [101] | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024. [102] [103] |
| Claude 3.5 | June 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Initially, only one model, Sonnet, was released. [104] In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available. [105] |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million GPU-hours on H100-80GB, about 3.8E25 FLOP. [106] [107] |
| Grok-2 | August 14, 2024 | xAI | Unknown | Unknown | Unknown | xAI Community License Agreement [108] [109] | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025. [110] [111] |
| OpenAI o1 | September 12, 2024 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Reasoning model. [112] |
| YandexGPT 4 Lite and Pro | October 24, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice chatbot. |
| Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time. The latest version is 24.11. [113] |
| Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version which is under Apache 2 license. [113] |
| Phi-4 | December 12, 2024 | Microsoft | 14 [114] | 9.8T tokens | Unknown | MIT | Microsoft markets it as a "small language model". [115] |
| DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | MIT | 2.788M GPU-hours on H800 GPUs. [116] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025. [117] |
| Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models, Nova Micro, Nova Lite, and Nova Pro [118] |
| DeepSeek-R1 | January 2025 | DeepSeek | 671 | Not applicable | Unknown | MIT | No separate pretraining; trained with reinforcement learning on top of V3-Base. [119] [120] |
| Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | 7 dense models, with parameter count from 0.5B to 72B. They also released 2 MoE variants. [121] |
| MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens [122] | Unknown | Minimax Model license | [123] [122] |
| Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro [124] [125] [126] |
| Claude 3.7 | February 24, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | One model, Sonnet 3.7. [127] |
| YandexGPT 5 Lite Pretrain and Pro | February 25, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice Neural Network chatbot. |
| GPT-4.5 | February 27, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | OpenAI's largest non-reasoning model. [128] |
| Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown, estimated 5,800,000 | Proprietary | Training cost claimed "10x the compute of previous state-of-the-art models". [129] |
| Gemini 2.5 | March 25, 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro [130] |
| YandexGPT 5 Lite Instruct | March 31, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice Neural Network chatbot. |
| Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | Unknown | Llama 4 license | [131] [132] |
| OpenAI o3 and o4-mini | April 16, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Reasoning models. [133] |
| Qwen3 | April 2025 | Alibaba Cloud | 235 | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B. [134] |
| Claude 4 | May 22, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes two models, Sonnet and Opus. [135] |
| Grok 4 | July 9, 2025 | xAI | Unknown | Unknown | Unknown | Proprietary | |
| GLM-4.5 | July 29, 2025 | Zhipu AI | 355 | 22T tokens | Unknown | MIT | Released in 355B and 106B sizes. [136] Corpus size was calculated by combining the 15 trillion-token and 7 trillion-token pre-training mixes. [137] |
| GPT-OSS | August 5, 2025 | OpenAI | 117 | Unknown | Unknown | Apache 2.0 | Released in 20B and 120B sizes. [138] |
| Claude 4.1 | August 5, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes one model, Opus. [139] |
| GPT-5 | August 7, 2025 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and via the API, and includes thinking capabilities. [140] [141] |
| DeepSeek-V3.1 | August 21, 2025 | DeepSeek | 671 | 15.639T | Unknown | MIT | Training corpus: the 14.8T tokens of DeepSeek-V3 plus 839B tokens from the extension phases (630B + 209B). [142] A hybrid model that can switch between thinking and non-thinking modes. [143] |
| YandexGPT 5.1 Pro | August 28, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Used in Alice Neural Network chatbot. |
| Apertus | September 2, 2025 | ETH Zurich and EPF Lausanne | 70 | 15 trillion [144] | Unknown | Apache 2.0 | It is said to be the first LLM compliant with the EU's Artificial Intelligence Act. [145] |
| Claude 4.5 | September 29, 2025 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Only one variant is available, Sonnet. [146] |
| DeepSeek-V3.2-Exp | September 29, 2025 | DeepSeek | 685 | Unknown | Unknown | MIT | An experimental model built on V3.1-Terminus that uses an efficient attention mechanism called DeepSeek Sparse Attention (DSA). [147] [148] [149] |
| GLM-4.6 | September 30, 2025 | Zhipu AI | 357 | Unknown | Unknown | Apache 2.0 | [150] [151] [152] |
| Alice AI LLM 1.0 | October 28, 2025 | Yandex | Unknown | Unknown | Unknown | Proprietary | Available in Alice AI chatbot. |
Figure: Timeline of major LLM releases (2024–present)