List of language model benchmarks

Language model benchmarks are standardized tests designed to evaluate the performance of language models on various natural language processing tasks. They are used to compare different models' capabilities in areas such as language understanding, generation, and reasoning.

Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field.
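
As a rough illustration, this dataset-plus-metric structure can be sketched in a few lines of Python; the question-answer pairs, the model_answer stub, and the exact-match metric below are hypothetical placeholders rather than any particular benchmark:

    # A toy "benchmark": a dataset of annotated samples plus an evaluation metric.
    dataset = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "2 + 2 = ?", "answer": "4"},
    ]

    def model_answer(question: str) -> str:
        # Stand-in for a real language model; returns canned responses.
        return "Paris" if "France" in question else "4"

    def exact_match_accuracy(samples) -> float:
        # Metric: fraction of questions whose model answer matches the annotation exactly.
        correct = sum(model_answer(s["question"]).strip() == s["answer"] for s in samples)
        return correct / len(samples)

    print(f"accuracy = {exact_match_accuracy(dataset):.2f}")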

Evaluation methods

Generally, benchmarks are fully automated, which limits the questions that can be asked. For example, with mathematical questions, "prove a claim" would be difficult to check automatically, while "calculate a unique integer answer" can be checked automatically. With programming tasks, the answer can generally be checked by running unit tests against the generated code, with an upper limit on runtime.
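
For instance, a minimal sketch of such a unit-test check in Python (the candidate solution, the single test, and the five-second timeout are hypothetical placeholders) could be:

    # Run a generated program together with its unit test in a subprocess,
    # counting it as failed on a nonzero exit code or on exceeding the time limit.
    import subprocess
    import sys
    import tempfile

    candidate = "def add(a, b):\n    return a + b\n"   # model-generated code (hypothetical)
    test = "assert add(2, 3) == 5\n"                   # unit test for the task (hypothetical)

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + test)
        path = f.name

    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=5)
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False

    print("solution accepted" if passed else "solution rejected")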

The benchmark scores are of several kinds. A common one is pass@n: the model is given n attempts at each problem, a problem counts as solved if at least one of the attempts is correct, and the score is the fraction of problems solved. The pass@n score can be estimated more accurately by making $N > n$ attempts and using the unbiased estimator $1 - \binom{N-c}{n} / \binom{N}{n}$, where $c$ is the number of correct attempts.[2]
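
A direct transcription of this estimator into Python (the example counts are hypothetical) looks like:

    # Unbiased pass@n estimate from Chen et al. (2021): given N attempts with c correct,
    # pass@n = 1 - C(N - c, n) / C(N, n).
    from math import comb

    def pass_at_n(N: int, c: int, n: int) -> float:
        if N - c < n:
            # Fewer than n incorrect attempts exist, so every n-subset contains a correct one.
            return 1.0
        return 1.0 - comb(N - c, n) / comb(N, n)

    # Example: 3 correct answers out of 20 sampled attempts, estimating pass@5.
    print(pass_at_n(N=20, c=3, n=5))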

Language

Agency

Reasoning

Mathematics

Programming

General

See also

References

  1. DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv: 2501.12948
  2. Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374
  3. Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012). The Winograd Schema Challenge. Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.
  4. Kocijan, Vid; Davis, Ernest; Lukasiewicz, Thomas; Marcus, Gary; Morgenstern, Leora (2023-07-11). "The defeat of the Winograd Schema Challenge". Artificial Intelligence. 325: 103971. arXiv: 2201.02387. doi:10.1016/j.artint.2023.103971. ISSN 0004-3702. S2CID 245827747.
  5. Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin (2019). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv: 1907.10641 [cs.CL].
  6. Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019-05-19), HellaSwag: Can a Machine Really Finish Your Sentence?, arXiv: 1905.07830
  7. "HellaSwag". rowanzellers.com. Retrieved 2025-02-06.
  8. Rajpurkar, Pranav; Zhang, Jian; Lopyrev, Konstantin; Liang, Percy (2016-10-11), SQuAD: 100,000+ Questions for Machine Comprehension of Text, arXiv: 1606.05250
  9. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv: 1804.07461 [cs.CL].
  10. "GLUE Benchmark". gluebenchmark.com. Retrieved 2019-02-25.
  11. Wang, Alex; Pruksachatkun, Yada; Nangia, Nikita; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2020-02-13), SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, arXiv: 1905.00537
  12. Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Quan Ngoc; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (2016-06-20), The LAMBADA dataset: Word prediction requiring a broad discourse context, arXiv: 1606.06031
  13. Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022-05-08), TruthfulQA: Measuring How Models Mimic Human Falsehoods, arXiv: 2109.07958
  14. Clark, Peter; Cowhey, Isaac; Etzioni, Oren; Khot, Tushar; Sabharwal, Ashish; Schoenick, Carissa; Tafjord, Oyvind (2018-03-14), Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv: 1803.05457
  15. Srivastava, Aarohi; Rastogi, Abhinav; Rao, Abhishek; Shoeb, Abu Awal Md; Abid, Abubakar; Fisch, Adam; Brown, Adam R.; Santoro, Adam; Gupta, Aditya (2023-06-12), Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, arXiv: 2206.04615
  16. Suzgun, Mirac; Scales, Nathan; Schärli, Nathanael; Gehrmann, Sebastian; Tay, Yi; Chung, Hyung Won; Chowdhery, Aakanksha; Le, Quoc V.; Chi, Ed H. (2022-10-17), Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, arXiv: 2210.09261
  17. Mialon, Grégoire; Fourrier, Clémentine; Swift, Craig; Wolf, Thomas; LeCun, Yann; Scialom, Thomas (2023-11-21), GAIA: a benchmark for General AI Assistants, arXiv: 2311.12983
  18. Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv: 2110.14168
  19. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021-01-12), Measuring Massive Multitask Language Understanding, arXiv: 2009.03300
  20. Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021-11-08), Measuring Mathematical Problem Solving With the MATH Dataset, arXiv: 2103.03874
  21. math-eval (2025-01-26), math-eval/MathEval, retrieved 2025-01-27
  22. Chen, Wenhu; Yin, Ming; Ku, Max; Lu, Pan; Wan, Yixin; Ma, Xueguang; Xu, Jianyu; Wang, Xinyi; Xia, Tony (December 2023). Bouamor, Houda; Pino, Juan; Bali, Kalika (eds.). "TheoremQA: A Theorem-driven Question Answering Dataset". Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics: 7889–7901. doi:10.18653/v1/2023.emnlp-main.489.
  23. openai/miniF2F, OpenAI, 2025-02-01, retrieved 2025-02-03
  24. Gao, Bofei; Song, Feifan; Yang, Zhe; Cai, Zefan; Miao, Yibo; Dong, Qingxiu; Li, Lei; Ma, Chenghao; Chen, Liang (2024-12-24), Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models, arXiv: 2410.07985
  25. Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Olsson, Caroline Falkman; Denain, Jean-Stanislas; Ho, Anson (2024-12-20), FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv: 2411.04872
  26. Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace (2021-11-08), Measuring Coding Challenge Competence With APPS, arXiv: 2105.09938, doi:10.48550/arXiv.2105.09938
  27. Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374
  28. "CodeElo". codeelo-bench.github.io. Retrieved 2025-02-13.
  29. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2024-11-11), SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv: 2310.06770
  30. "SWE-bench". www.swebench.com. Retrieved 2025-02-11.
  31. Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023-11-20), GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv: 2311.12022
  32. Cui, Ruixiang (2025-02-03), ruixiangcui/AGIEval, retrieved 2025-02-03
  33. "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI". gair-nlp.github.io. Retrieved 2025-02-03.
  34. He, Chaoqun; Luo, Renjie; Bai, Yuzhuo; Hu, Shengding; Thai, Zhen Leng; Shen, Junhao; Hu, Jinyi; Han, Xu; Huang, Yujie (2024-06-06), OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, arXiv: 2402.14008, doi:10.48550/arXiv.2402.14008
  35. "ARC Prize". ARC Prize. Retrieved 2025-01-27.
  36. "LiveBench". livebench.ai. Retrieved 2025-01-27.
  37. "Humanity's Last Exam". lastexam.ai. Retrieved 2025-02-02.