A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
For the training cost column, 1 petaFLOP-day equals 1 petaFLOP/sec × 1 day, or 8.64×10^19 FLOP (floating-point operations). Only the cost of the largest model is shown. The number of parameters is measured in billions, [a] and the training cost is measured in petaFLOP-days.
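As an illustration of how the training-cost column can be read, the sketch below converts petaFLOP-days to raw FLOP and applies the common "compute ≈ 6 × parameters × training tokens" rule of thumb for dense transformer pre-training (an approximation used here for illustration only, not the method of the cited sources). With that approximation, a 70B-parameter model trained on 1.4 trillion tokens comes out near the Chinchilla entry of 6805 petaFLOP-days listed below.

```python
# Illustrative sketch only: unit conversion for the training-cost column,
# plus the common "compute ~ 6 * parameters * tokens" approximation for
# dense transformer pre-training (not the method used by the cited sources).

PETAFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/sec sustained for one day = 8.64e19 FLOP


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert a training-cost entry (petaFLOP-days) to floating-point operations."""
    return petaflop_days * PETAFLOP_DAY_IN_FLOP


def approx_petaflop_days(parameters: float, tokens: float) -> float:
    """Rough training cost from the ~6 * N * D approximation."""
    return 6 * parameters * tokens / PETAFLOP_DAY_IN_FLOP


# Example: 70 billion parameters trained on 1.4 trillion tokens (Chinchilla-sized)
print(f"{approx_petaflop_days(70e9, 1.4e12):,.0f} petaFLOP-days")  # ~6,806
print(f"{petaflop_days_to_flop(6805):.2e} FLOP")                   # ~5.88e+23
```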
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | Jun 11, 2018 | OpenAI | 0.117B | Unknown | 1 [1] | MIT [2] | |
| BERT | Oct 2018 | Google | 0.340B [4] | 3.3 billion words [4] | 9 [5] | Apache 2.0 [6] | An early and influential language model. [7] Encoder-only and thus not built to be prompted or generative. [8] Training took 4 days on 64 TPUv2 chips. [9] |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| T5 | Oct 2019 | Google | 11B [10] | 34 billion tokens [10] | Unknown | Apache 2.0 [11] | Base model for Google projects like Imagen. [12] |
| XLNet | Jun 2019 | Google | 0.340B [13] | 33 billion words | 330 | Apache 2.0 [14] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. [15] |
| GPT-2 | Feb 2019 | OpenAI | 1.5B [16] | 40GB [17] (~10 billion tokens) [18] | 28 [19] | MIT [20] | Trained on 32 TPUv3 chips for 1 week. [19] |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| GPT-Neo | Mar 2021 | EleutherAI | 2.7B [24] | 825 GiB [25] | Unknown | MIT [26] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. [26] |
| GPT-J | Jun 2021 | EleutherAI | 6B [27] | 825 GiB [25] | 200 [28] | Apache 2.0 | |
| Megatron-Turing NLG | Oct 2021 [29] | Microsoft and Nvidia | 530B [30] | 338.6 billion tokens [30] | 38000 [31] | Unreleased | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours. [31] |
| Ernie 3.0 Titan | Dec 2021 | Baidu | 260B [32] | 4TB | Unknown | Proprietary | |
| Claude [33] | Dec 2021 | Anthropic | 52B [34] | 400 billion tokens [34] | Unknown | Proprietary | Fine-tuned for desirable behavior in conversations. [35] |
| GLaM (Generalist Language Model) | Dec 2021 | Google | 1200B [36] | 1.6 trillion tokens [36] | 5600 [36] | Proprietary | |
| Gopher | Dec 2021 | Google DeepMind | 280B [37] | 300 billion tokens [38] | 5833 [39] | Proprietary |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| LaMDA (Language Models for Dialog Applications) | Jan 2022 | Google | 137B [40] | 1.56T words, [40] 168 billion tokens [38] | 4110 [41] | Proprietary | |
| GPT-NeoX | Feb 2022 | EleutherAI | 20B [42] | 825 GiB [25] | 740 [28] | Apache 2.0 | |
| Chinchilla | Mar 2022 | Google DeepMind | 70B [43] | 1.4 trillion tokens [43] [38] | 6805 [39] | Proprietary | |
| PaLM (Pathways Language Model) | Apr 2022 | Google | 540B [44] | 768 billion tokens [43] | 29,250 [39] | Proprietary | |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175B [45] | 180 billion tokens [46] | 310 [28] | Non-commercial research [d] | GPT-3 architecture with some adaptations from Megatron. The training logbook written by the team was published. [47] |
| YaLM 100B | Jun 2022 | Yandex | 100B [48] | 1.7TB [48] | Unknown | Apache 2.0 | |
| Minerva | Jun 2022 | Google | 540B [49] | 38.5B tokens from webpages filtered for math content and from arXiv [49] | Unknown | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning". [50] |
| BLOOM | Jul 2022 | Large collaboration led by Hugging Face | 175B [51] | 350 billion tokens (1.6TB) [52] | Unknown | Responsible AI | |
| Galactica | Nov 2022 | Meta | 120B | 106 billion tokens [53] | Unknown | CC-BY-NC-4.0 | |
| AlexaTM (Teacher Models) | Nov 2022 | Amazon | 20B [54] | 1.3 trillion [55] | Unknown | Proprietary [56] |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| Llama | Feb 2023 | Meta AI | 65B [57] | 1.4 trillion [57] | 6300 [58] | Non-commercial research [e] | |
| GPT-4 | Mar 2023 | OpenAI | Unknown [f] (According to rumors: 1760) [60] | Unknown | Unknown, estimated 230,000 | Proprietary | |
| Cerebras-GPT | Mar 2023 | Cerebras | 13B [61] | Unknown | 270 [28] | Apache 2.0 | |
| Falcon | Mar 2023 | Technology Innovation Institute | 40B [62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus) [63] plus some "curated corpora". [64] | 2800 [58] | Apache 2.0 [65] | |
| BloombergGPT | Mar 2023 | Bloomberg L.P. | 50B | 363 billion tokens from Bloomberg's proprietary data sources, plus 345 billion tokens from general purpose datasets [66] | Unknown | Unreleased | Designed for financial tasks. [66] |
| PanGu-Σ | Mar 2023 | Huawei | 1085B | 329 billion tokens [67] | Unknown | Proprietary | |
| OpenAssistant [68] | Mar 2023 | LAION | 17B | 1.5 trillion tokens | Unknown | Apache 2.0 | |
| Jurassic-2 [69] [70] | Mar 2023 | AI21 Labs | Unknown | Unknown | Unknown | Proprietary | |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340B [71] | 3.6 trillion tokens [71] | 85,000 [58] | Proprietary | |
| YandexGPT | May 17, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Llama 2 | Jul 2023 | Meta AI | 70B [73] | 2 trillion tokens [73] | 21,000 | Llama 2 | Trained for over 3.3 million A100 GPU-hours. [74] |
| Claude 2 | Jul 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. [75] |
| Granite 13b | Jul 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx. [76] |
| Mistral 7B | Sep 2023 | Mistral AI | 7.3B [77] | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 2 | Sep 7, 2023 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Claude 2.1 | Nov 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. [78] |
| Grok-1 [79] | Nov 2023 | xAI | 314B | Unknown | Unknown | Apache 2.0 | |
| Gemini 1.0 | Dec 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name. [81] |
| Mixtral 8x7B | Dec 2023 | Mistral AI | 46.7B | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. [82] Mixture of experts model, with 12.9 billion parameters activated per token. [83] |
| DeepSeek-LLM | Nov 29, 2023 | DeepSeek | 67B | 2T tokens [84]: table 2 | 12,000 | DeepSeek | Trained on English and Chinese text. Used roughly 10^24 training FLOPs for the 67B model and roughly 10^23 FLOPs for the 7B model. [84]: figure 5 |
| Phi-2 | Dec 2023 | Microsoft | 2.7B | 1.4T tokens | 419 [85] | MIT | Trained on real and synthetic "textbook-quality" data over 14 days on 96 A100 GPUs. [85] |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 | Feb 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model based on a MoE architecture. Context window above 1 million tokens. [86] |
| Gemini Ultra | Feb 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | |
| Gemma | Feb 2024 | Google DeepMind | 7B | 6T tokens | Unknown | Gemma Terms of Use [87] | |
| OLMo | Feb 2024 | Allen Institute for AI | 7B [88] | 2T tokens [89] | Unknown | Apache 2.0 | |
| Claude 3 | Mar 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus. [90] |
| DBRX | Mar 2024 | Databricks and Mosaic ML | 136B | 12T tokens | Unknown | Databricks Open Model [91] [92] | |
| YandexGPT 3 Pro | Mar 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Fugaku-LLM [93] | May 2024 | Fujitsu, Tokyo Institute of Technology, Tohoku University, RIKEN, etc. | 13B | 380B tokens | Unknown | Fugaku-LLM Terms of Use [94] | |
| Chameleon | May 2024 | Meta AI | 34B [96] | 4.4 trillion | Unknown | Non-commercial research [97] | |
| Mixtral 8x22B [98] | Apr 17, 2024 | Mistral AI | 141B | Unknown | Unknown | Apache 2.0 | |
| Phi-3 | Apr 23, 2024 | Microsoft | 14B [99] | 4.8T tokens [citation needed] | Unknown | MIT | Marketed by Microsoft as a "small language model". [100] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| YandexGPT 3 Lite | May 28, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Qwen2 | Jun 2024 | Alibaba Cloud | 72B [101] | 3T tokens | Unknown | Various | |
| DeepSeek-V2 | Jun 2024 | DeepSeek | 236B | 8.1T tokens | 28,000 | DeepSeek | 1.4M hours on H800. [102] |
| Nemotron-4 | Jun 2024 | Nvidia | 340B | 9T tokens | 200,000 | NVIDIA Open Model [103] [104] | |
| Claude 3.5 | Jun 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| Llama 3.1 | Jul 2024 | Meta AI | 405B | 15.6T tokens | 440,000 | Llama 3 | |
| Grok-2 | Aug 14, 2024 | xAI | Unknown | Unknown | Unknown | xAI Community License Agreement [111] [112] | |
| OpenAI o1 | Sep 12, 2024 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| Sarvam-1 | Oct 24, 2024 | Sarvam AI | 2B | ~2T tokens | Unknown | Sarvam AI Research | |
| YandexGPT 4 Lite and Pro | Oct 24, 2024 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Mistral Large | Nov 2024 | Mistral AI | 123B | Unknown | Unknown | Mistral Research | Upgraded over time. The latest version is 24.11. [119] |
| Pixtral | Nov 2024 | Mistral AI | 123B | Unknown | Unknown | Mistral Research | Multimodal. There is also a 12B version which is under Apache 2 license. [119] |
| OLMo 2 | Nov 2024 | Allen Institute for AI | 32B [120] [121] | 6.6T tokens [121] | 15,000 [121] | Apache 2.0 | |
| Phi-4 | Dec 12, 2024 | Microsoft | 14B [122] | 9.8T tokens | Unknown | MIT | Marketed by Microsoft as a "small language model". [123] |
| DeepSeek-V3 | Dec 2024 | DeepSeek | 671B | 14.8T tokens | 56,000 | MIT | |
| Amazon Nova | Dec 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro. [126] |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | Jan 20 | DeepSeek | 671B | Not applicable | Unknown | MIT | |
| Qwen2.5 | Jan 26 | Alibaba | 72B | 18T tokens | Unknown | Various | 7 dense models with parameter counts from 0.5B to 72B. Alibaba also released 2 MoE variants. [129] |
| MiniMax-Text-01 | Jan 14 | Minimax | 456B | 4.7T tokens [130] | Unknown | Minimax Model | |
| Gemini 2.0 | Feb 5 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | |
| Grok 3 | Feb 19 | xAI | Unknown | Unknown | Unknown | Proprietary | Training cost claimed to be "10x the compute of previous state-of-the-art models". [135] |
| Claude 3.7 | Feb 24 | Anthropic | Unknown | Unknown | Unknown | Proprietary | One model, Sonnet 3.7. [136] |
| YandexGPT 5 Lite Pretrain and Pro | Feb 25 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| GPT-4.5 | Feb 27 | OpenAI | Unknown | Unknown | Unknown | Proprietary | OpenAI's largest non-reasoning model at the time. [137] |
| Gemini 2.5 | Mar 25 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro. [138] |
| YandexGPT 5 Lite Instruct | Mar 31 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Llama 4 | Apr 5 | Meta AI | 400B | 40T tokens | Unknown | Llama 4 | |
| OpenAI o3 and o4-mini | Apr 16 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Reasoning models. [141] |
| Qwen3 | Apr 28 | Alibaba Cloud | 235B | 36T tokens | Unknown | Apache 2.0 | Multiple sizes, the smallest being 0.6B. [142] |
| Claude 4 | May 22 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes two models, Sonnet and Opus. [143] |
| Sarvam-M | May 23 | Sarvam AI | 24B | Unknown | Unknown | Apache 2.0 | |
| Grok 4 | Jul 9 | xAI | Unknown | Unknown | Unknown | Proprietary | |
| Param-1 | Jul 21 | BharatGen | 2.9B [146] | 5T tokens "focus[ed] on India’s linguistic landscape" [146] | Unknown | Apache 2.0 | |
| GLM-4.5 | Jul 29 | Z.ai | 355B | 22T tokens [148] [g] | Unknown | MIT | Released in 355B and 106B sizes. [149] |
| GPT-OSS | Aug 5 | OpenAI | 117B | Unknown | Unknown | Apache 2.0 | Released in 20B and 120B sizes. [150] |
| Claude 4.1 | Aug 5 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes one model, Opus. [151] |
| GPT-5 | Aug 7 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| DeepSeek-V3.1 | Aug 21 | DeepSeek | 671B | 15.639T | Unknown | MIT | |
| YandexGPT 5.1 Pro | Aug 28 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Apertus | Sep 2 | ETH Zurich and EPF Lausanne | 70B | 15 trillion [156] | Unknown | Apache 2.0 | |
| Claude Sonnet 4.5 | Sep 29 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| GLM-4.6 | Sep 30 | Z.ai | 357B | Unknown | Unknown | Apache 2.0 | |
| Alice AI LLM 1.0 | Oct 28 | Yandex | Unknown | Unknown | Unknown | Proprietary | |
| Gemini 3 | Nov 18 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Models released: Deep Think and Pro. [162] |
| Olmo 3 [163] | Nov 20 | Allen Institute for AI | 32B | 5.9T tokens [164] | Unknown | Apache 2.0 | Includes 7B and 32B parameter versions, alongside reasoning and instruction-following models. [164] |
| Claude Opus 4.5 | Nov 24 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Largest model in the Claude family. [165] |
| DeepSeek-V3.2 | Dec 1 | DeepSeek | 685B | Unknown | Unknown | MIT | |
| GPT-5.2 | Dec 11 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Solved an open problem in statistical learning theory that human researchers had been unable to resolve. [169] |
| GLM-4.7 | Dec 22 | Z.ai | 355B | Unknown | Unknown | Apache 2.0 |
| Name | Release date [b] | Developer | Number of parameters | Corpus size | Training cost | License [c] | Notes |
|---|---|---|---|---|---|---|---|
| Qwen3-Max-Thinking | Jan 26 | Alibaba Cloud | Unknown | Unknown | Unknown | Proprietary | Proprietary reasoning model with adaptive tool-use, test-time scaling, and iterative self-reflection. [170] |
| Kimi K2.5 | Jan 27 | Moonshot AI | 1040B | 15T tokens | Unknown | Modified MIT | |
| Step-3.5-Flash | Feb 12 | StepFun | 196B | Unknown | Unknown | Apache 2.0 | |
| Claude Opus 4.6 | Feb 5 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| GPT-5.3-Codex | Feb 5 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| GLM-5 | Feb 12 | Z.ai | 754B | Unknown | Unknown | MIT | |
| Claude Sonnet 4.6 | Feb 17 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| Param-2 | Feb 17 | BharatGen | 17B | ~22T tokens | Unknown | BharatGen Research [177] | Mixture-of-experts model, successor of Param-1; many more Indic languages are supported. Trained on H100 GPUs for 24 days. [178] |
| Sarvam-105B | Feb 18 [h] | Sarvam AI | 105B | Unknown | Unknown | Apache 2.0 | |
| Sarvam-30B | Feb 18 [h] | Sarvam AI | 30B | ~16T tokens | Unknown | Apache 2.0 | |
| GPT-5.4 | Mar 5 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| Mistral Small 4 | Mar 17 | Mistral AI | 119B | Unknown | Unknown | Apache 2.0 | |
| MiMo-V2-Pro | Mar 18 | Xiaomi | 1000B [185] | Unknown | Unknown | Proprietary | Mixture-of-experts (MoE) model with more than 1 trillion parameters (43 billion active). Designed for agentic scenarios. Initially available on OpenRouter under the codename "Hunter Alpha" before official release. [186] |
| Gemma 4 | Apr 2 | Google DeepMind | 31B | Unknown | Unknown | Apache 2.0 | |
| GLM-5.1 | Apr 7 | Z.ai | 754B | Unknown | Unknown | MIT | |
| Qwen3.6 (Qwen3.6-35B-A3B) | Apr 15 | Alibaba Cloud | 35B | Unknown | Unknown | Apache 2.0 | |
| Claude Opus 4.7 | Apr 16 | Anthropic | Unknown | Unknown | Unknown | Proprietary | |
| GPT-5.5 | Apr 23 | OpenAI | Unknown | Unknown | Unknown | Proprietary | |
| DeepSeek-V4-Flash | Apr 24 | DeepSeek | 284B | 32T | Unknown | MIT | Preview release [193] |
| DeepSeek-V4-Pro | Apr 24 | DeepSeek | 1.6T | Unknown | Unknown | MIT | Preview release [193] |