List of language model benchmarks

Last updated February 13, 2025

Language model benchmarks are standardized tests designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning.

Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field.

Evaluation methods

Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime.

The benchmark scores are of the following kinds:

pass@n: The model is given $n$ attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems.
cons@n: The model is given $n$ attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting".^[1]

The pass@n score can be estimated more accurately by making $N>n$ attempts, and use the unbiased estimator $1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}$ , where $c$ is the number of correct attempts.^[2]

Language

WSC (Winograd schema challenge): 273 sentences with ambiguous pronouns. The task is to determine what the pronoun refers to.^[3]
WinoGrande: A larger version of WSC with 44,000 items. Designed to be still challenging to the SOTA models of the time (2019) since the original had been saturated. This dataset consists of fill-in-the-blank style sentences, as opposed to the pronoun format of previous datasets.^[4]^[5]

HellaSwag: 10,000 descriptions of activities or events, each with four candidate endings; the model must choose the most plausible ending.^[6]^[7]
SQuAD (Stanford Question Answering Dataset): 100,000+ questions posed by crowdworkers on a set of Wikipedia articles. The task is, given a passage from Wikipedia and a question, find a span of text in the text that answers the question.^[8]
GLUE (General Language Understanding Evaluation): collection of 9 benchmarks designed for testing general language understanding. The tasks are in the format of sentence- or sentence-pair. There are over 1M items.^[9]^[10]
SuperGLUE: An update to GLUE. Designed to be still challenging to the SOTA models of the time (2019) since the original had been saturated. Includes 8 additional tasks (e.g. logical reasoning, commonsense inference, coreference resolution).^[11]
LAMBADA: 10,000 narrative passages from books, each with a missing last word that humans can guess if given the full passage but not from the last sentence alone.^[12]
TruthfulQA: 817 questions in health, law, finance and politics with common misconceptions.^[13]
ARC (AI2 Reasoning Challenge): Multiple choice questions, with a Challenge Set (2590 questions) and an Easy Set (5197 questions).^[14]
Big-Bench (Beyond the Imitation Game): A benchmark collection of 204 tasks.^[15] A particular subset of 23 tasks is called Big-Bench Hard.^[16]

Agency

GAIA: 450 questions with unambiguous answers that require information that can be obtained by browsing the Internet, requiring different levels of tooling and autonomy to solve. Divided into 3 difficulty levels.^[17]

Reasoning

Mathematics

GSM8K (Grade School Math): 8.5K linguistically diverse elementary school math word problems that require 2 to 8 basic arithmetic operations to solve.^[18]
MMLU (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.^[19]
MATH: 12,500 competition-level math problems divided into difficulty levels 1 to 5 (as the Art of Problem Solving), with AIME problems being level 5.^[20]
MathEval: An omnibus benchmark that contains 20 other benchmarks, such as GSM8K, MATH, and the math subsection of MMLU. Over 20,000 math problems. Difficulty ranges from elementary school to high school competition.^[21]
TheoremQA: 800 questions that test for the use of 350 theorems from math, physics, electric engineering, computer science, and finance.^[22]
MiniF2F (mini formal-to-formal): 488 Olympiad-level mathematics problems from AIME, AMC, and IMO, stated in formal languages (Metamath, Lean, Isabelle (partially) and HOL Light (partially)).^[23]
Omni-MATH: 4428 competition-level math problems with human annotation.^[24]
FrontierMath: Several hundred questions from areas of modern math that are difficult for professional mathematicians to solve. Many questions have integer answers, so that answers can be verified automatically. Held-out to prevent contamination.^[25]

Programming

APPS: 10,000 problems from Codewars, AtCoder, Kattis, and Codeforces.^[26]
HumanEval: 164 problems where the solution is always a python function, often just a few lines long.^[27]
CodeElo: 387 contest problems from Codeforces during 2024, annotated with metadata such as contest divisions, problem difficulty ratings, and problem algorithm tags. Benchmarking is run by directly submitting to Codeforces, resulting in an Elo rating. Limited to 8 submissions per problem.^[28]
SWE-Bench: 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue, the task is to edit the codebase to solve the issue.^[29] There are 3 subsets: Lite (300 problems that are faster to run), Verified (human-validated subset of 500 problems reviewed by software engineers).
SWE-Bench Multimodal: a variant of SWE-Bench, with 619 task instances from 17 popular JavaScript repositories, each featuring images that are required for solving the task.^[30]

General

GPQA (Google-Proof Q&A): 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, and requires PhD-level experts to solve. The "Diamond" subset contains the 198 hardest questions in it.^[31]
AGIEval: questions from 20 official, public, and high-standard admission and qualification exams, such as SAT, Gaokao, law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.^[32]
OlympicArena: 11,163 problems from 62 distinct Olympic competitions.^[33]
OlympiadBench: 8,476 math and physics problems in English and Chinese, sourced from International Olympiads, Chinese Olympiads, and Gaokao.^[34]
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence): Something similar to a Raven's Progressive Matrices test.^[35]
LiveBench: A series of benchmarks released monthly, including high school math competition questions, competitive coding questions, logic puzzles, and other tasks.^[36]
Humanity's Last Exam: 3,000 questions across over a hundred academic subjects, with a held-out private dataset left unreleased to prevent contamination. 10% of questions requires both image and text comprehension and the rest are fully text-based. 80% of questions are scored by exact-match, and the rest are multiple-choice.^[37]

References

↑ DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv: 2501.12948
↑ Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374
↑ Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012). The Winograd Schema Challenge. Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.
↑ Kocijan, Vid; Davis, Ernest; Lukasiewicz, Thomas; Marcus, Gary; Morgenstern, Leora (2023-07-11). "The defeat of the Winograd Schema Challenge". Artificial Intelligence. 325: 103971. arXiv: 2201.02387 . doi:10.1016/j.artint.2023.103971. ISSN 0004-3702. S2CID 245827747.
↑ Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin (2019). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv: 1907.10641 [cs.CL].
↑ Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019-05-19), HellaSwag: Can a Machine Really Finish Your Sentence?, arXiv: 1905.07830
↑ "HellaSwag". rowanzellers.com. Retrieved 2025-02-06.
↑ Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (2016-10-11), SQuAD: 100,000+ Questions for Machine Comprehension of Text, arXiv: 1606.05250
↑ Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv: 1804.07461 [cs.CL].
↑ "GLUE Benchmark". gluebenchmark.com. Retrieved 2019-02-25.
↑ Wang, Alex; Pruksachatkun, Yada; Nangia, Nikita; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2020-02-13), SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, arXiv: 1905.00537
↑ Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Quan Ngoc; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (2016-06-20), The LAMBADA dataset: Word prediction requiring a broad discourse context, arXiv: 1606.06031
↑ Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022-05-08), TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv: 2109.07958
↑ Clark, Peter; Cowhey, Isaac; Etzioni, Oren; Khot, Tushar; Sabharwal, Ashish; Schoenick, Carissa; Tafjord, Oyvind (2018-03-14), Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv: 1803.05457
↑ Srivastava, Aarohi; Rastogi, Abhinav; Rao, Abhishek; Shoeb, Abu Awal Md; Abid, Abubakar; Fisch, Adam; Brown, Adam R.; Santoro, Adam; Gupta, Aditya (2023-06-12), Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, arXiv: 2206.04615
↑ Suzgun, Mirac; Scales, Nathan; Schärli, Nathanael; Gehrmann, Sebastian; Tay, Yi; Chung, Hyung Won; Chowdhery, Aakanksha; Le, Quoc V.; Chi, Ed H. (2022-10-17), Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, arXiv: 2210.09261
↑ Mialon, Grégoire; Fourrier, Clémentine; Swift, Craig; Wolf, Thomas; LeCun, Yann; Scialom, Thomas (2023-11-21), GAIA: a benchmark for General AI Assistants, arXiv: 2311.12983
↑ Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv: 2110.14168
↑ Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021-01-12), Measuring Massive Multitask Language Understanding, arXiv: 2009.03300
↑ Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021-11-08), Measuring Mathematical Problem Solving With the MATH Dataset, arXiv: 2103.03874
↑ math-eval (2025-01-26), math-eval/MathEval , retrieved 2025-01-27
↑ Chen, Wenhu; Yin, Ming; Ku, Max; Lu, Pan; Wan, Yixin; Ma, Xueguang; Xu, Jianyu; Wang, Xinyi; Xia, Tony (December 2023). Bouamor, Houda; Pino, Juan; Bali, Kalika (eds.). "TheoremQA: A Theorem-driven Question Answering Dataset". Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics: 7889–7901. doi:10.18653/v1/2023.emnlp-main.489.
↑ openai/miniF2F, OpenAI, 2025-02-01, retrieved 2025-02-03
↑ Gao, Bofei; Song, Feifan; Yang, Zhe; Cai, Zefan; Miao, Yibo; Dong, Qingxiu; Li, Lei; Ma, Chenghao; Chen, Liang (2024-12-24), Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models, arXiv: 2410.07985
↑ Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Olsson, Caroline Falkman; Denain, Jean-Stanislas; Ho, Anson (2024-12-20), FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv: 2411.04872
↑ Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace (2021-11-08), Measuring Coding Challenge Competence With APPS, arXiv, doi:10.48550/arXiv.2105.09938, arXiv:2105.09938
↑ Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374
↑ "CodeElo". codeelo-bench.github.io. Retrieved 2025-02-13.
↑ Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2024-11-11), SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv: 2310.06770
↑ "SWE-bench". www.swebench.com. Retrieved 2025-02-11.
↑ Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023-11-20), GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv: 2311.12022
↑ Cui, Ruixiang (2025-02-03), ruixiangcui/AGIEval , retrieved 2025-02-03
↑ "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI". gair-nlp.github.io. Retrieved 2025-02-03.
↑ He, Chaoqun; Luo, Renjie; Bai, Yuzhuo; Hu, Shengding; Thai, Zhen Leng; Shen, Junhao; Hu, Jinyi; Han, Xu; Huang, Yujie (2024-06-06), OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, arXiv, doi:10.48550/arXiv.2402.14008, arXiv:2402.14008
↑ "ARC Prize". ARC Prize. Retrieved 2025-01-27.
↑ "LiveBench". livebench.ai. Retrieved 2025-01-27.
↑ "Humanity's Last Exam". lastexam.ai. Retrieved 2025-02-02.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv: 2501.12948

[:42-2] Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374

[Hector-3] Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012). The Winograd Schema Challenge. Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.

[4] Kocijan, Vid; Davis, Ernest; Lukasiewicz, Thomas; Marcus, Gary; Morgenstern, Leora (2023-07-11). "The defeat of the Winograd Schema Challenge". Artificial Intelligence. 325: 103971. arXiv: 2201.02387 . doi:10.1016/j.artint.2023.103971. ISSN 0004-3702. S2CID 245827747.

[Sakaguchi-5] Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin (2019). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv: 1907.10641 [cs.CL].

[6] Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019-05-19), HellaSwag: Can a Machine Really Finish Your Sentence?, arXiv: 1905.07830

[7] "HellaSwag". rowanzellers.com. Retrieved 2025-02-06.

[8] Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (2016-10-11), SQuAD: 100,000+ Questions for Machine Comprehension of Text, arXiv: 1606.05250

[9] Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv: 1804.07461 [cs.CL].

[10] "GLUE Benchmark". gluebenchmark.com. Retrieved 2019-02-25.

[11] Wang, Alex; Pruksachatkun, Yada; Nangia, Nikita; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2020-02-13), SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, arXiv: 1905.00537

[12] Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Quan Ngoc; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (2016-06-20), The LAMBADA dataset: Word prediction requiring a broad discourse context, arXiv: 1606.06031

[13] Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022-05-08), TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv: 2109.07958

[14] Clark, Peter; Cowhey, Isaac; Etzioni, Oren; Khot, Tushar; Sabharwal, Ashish; Schoenick, Carissa; Tafjord, Oyvind (2018-03-14), Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv: 1803.05457

[15] Srivastava, Aarohi; Rastogi, Abhinav; Rao, Abhishek; Shoeb, Abu Awal Md; Abid, Abubakar; Fisch, Adam; Brown, Adam R.; Santoro, Adam; Gupta, Aditya (2023-06-12), Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, arXiv: 2206.04615

[16] Suzgun, Mirac; Scales, Nathan; Schärli, Nathanael; Gehrmann, Sebastian; Tay, Yi; Chung, Hyung Won; Chowdhery, Aakanksha; Le, Quoc V.; Chi, Ed H. (2022-10-17), Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, arXiv: 2210.09261

[17] Mialon, Grégoire; Fourrier, Clémentine; Swift, Craig; Wolf, Thomas; LeCun, Yann; Scialom, Thomas (2023-11-21), GAIA: a benchmark for General AI Assistants, arXiv: 2311.12983

[:2-18] Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv: 2110.14168

[19] Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021-01-12), Measuring Massive Multitask Language Understanding, arXiv: 2009.03300

[20] Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021-11-08), Measuring Mathematical Problem Solving With the MATH Dataset, arXiv: 2103.03874

[21] th-eval (2025-01-26), math-eval/MathEval , retrieved 2025-01-27

[22] Chen, Wenhu; Yin, Ming; Ku, Max; Lu, Pan; Wan, Yixin; Ma, Xueguang; Xu, Jianyu; Wang, Xinyi; Xia, Tony (December 2023). Bouamor, Houda; Pino, Juan; Bali, Kalika (eds.). "TheoremQA: A Theorem-driven Question Answering Dataset". Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics: 7889–7901. doi:10.18653/v1/2023.emnlp-main.489.

[23] openai/miniF2F, OpenAI, 2025-02-01, retrieved 2025-02-03

[24] Gao, Bofei; Song, Feifan; Yang, Zhe; Cai, Zefan; Miao, Yibo; Dong, Qingxiu; Li, Lei; Ma, Chenghao; Chen, Liang (2024-12-24), Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models, arXiv: 2410.07985

[25] Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Olsson, Caroline Falkman; Denain, Jean-Stanislas; Ho, Anson (2024-12-20), FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv: 2411.04872

[26] Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace (2021-11-08), Measuring Coding Challenge Competence With APPS, arXiv, doi:10.48550/arXiv.2105.09938, arXiv:2105.09938

[:4-27] Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374

[28] "CodeElo". codeelo-bench.github.io. Retrieved 2025-02-13.

[29] Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2024-11-11), SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv: 2310.06770

[30] "SWE-bench". www.swebench.com. Retrieved 2025-02-11.

[31] Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023-11-20), GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv: 2311.12022

[32] Cui, Ruixiang (2025-02-03), ruixiangcui/AGIEval , retrieved 2025-02-03

[33] "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI". gair-nlp.github.io. Retrieved 2025-02-03.

[34] He, Chaoqun; Luo, Renjie; Bai, Yuzhuo; Hu, Shengding; Thai, Zhen Leng; Shen, Junhao; Hu, Jinyi; Han, Xu; Huang, Yujie (2024-06-06), OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, arXiv, doi:10.48550/arXiv.2402.14008, arXiv:2402.14008

[35] "ARC Prize". ARC Prize. Retrieved 2025-01-27.

[36] "LiveBench". livebench.ai. Retrieved 2025-01-27.

[37] "Humanity's Last Exam". lastexam.ai. Retrieved 2025-02-02.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]