Language model benchmark


Performance of AI models on various benchmarks from 1998 to 2024

A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning.


Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field.

Overview

Types

Benchmarks may be described by the following adjectives, not mutually exclusive:

The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, validation, and test. Both the validation and test splits are essentially benchmarks. What distinguishes a benchmark from a test/validation dataset is its intended use: a benchmark typically measures the performance of many different models that were not trained specifically to do well on it, whereas a test/validation set measures the performance of models trained on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set.

Conversely, certain benchmarks may be used as a training set, such as the English Gigaword [4] or the One Billion Word Benchmark, which in modern terms is simply the negative log-likelihood loss on a pretraining set of one billion words. [5] Indeed, the distinction between benchmarks and datasets in language modeling became sharper after the rise of the pretraining paradigm.

Lifecycle

Generally, the life cycle of a benchmark consists of the following steps: [6]

Construction

Like datasets, benchmarks are typically constructed by several methods, individually or in combination:

Evaluation

Generally, benchmarks are fully automated, which limits the questions that can be asked. For example, with mathematical questions, "prove a claim" would be difficult to check automatically, while "compute an answer that is a unique integer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime.
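As an illustration of this kind of automated checking, the following is a minimal Python sketch that runs a candidate solution's unit tests with a time limit; the pytest invocation, file paths, and timeout are illustrative assumptions, not any particular benchmark's harness.

```python
import subprocess

def check_solution(test_path: str, timeout_s: float = 10.0) -> bool:
    """Run the unit tests for a candidate solution, with an upper limit on runtime.

    Returns True if all tests pass within the time limit, False otherwise.
    The use of pytest and the file layout are illustrative assumptions.
    """
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_path, "-q"],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # exceeding the runtime limit counts as a failure
    return result.returncode == 0
```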

The benchmark scores are of the following kinds:

The pass@$n$ score can be estimated more accurately by making $N \ge n$ attempts and using the unbiased estimator $1 - \binom{N-c}{n} / \binom{N}{n}$, where $c$ is the number of correct attempts. [8]
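For instance, a short Python sketch of this estimator, following the numerically stable product form described in [8] (function name and example numbers are illustrative):

```python
import math

def pass_at_n(N: int, c: int, n: int) -> float:
    """Unbiased estimator of pass@n from N attempts, c of which were correct.

    Computes 1 - C(N-c, n) / C(N, n) as a running product for numerical stability.
    """
    if N - c < n:
        return 1.0  # every size-n subset of the attempts contains a correct one
    return 1.0 - math.prod(1.0 - n / i for i in range(N - c + 1, N + 1))

# Example: 200 attempts, 37 of them correct, estimating pass@10
print(pass_at_n(200, 37, 10))
```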

For open-ended tasks, where the output can be any sentence, commonly used scores include BLEU, ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, [9] SPICE, [10] etc.
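As an example of one such automatic score, sentence-level BLEU can be computed with an off-the-shelf library such as NLTK; the sentences and smoothing choice below are illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]   # list of tokenized reference sentences
candidate = "the cat is on the mat".split()      # tokenized model output

# Smoothing avoids zero scores when short sentences have no higher-order n-gram matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```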

Issues

List of benchmarks

General language modeling

Essentially any text dataset can be used as a benchmark for statistical language modeling, with perplexity (or, near-equivalently, negative log-likelihood and bits per character, as in Shannon's original estimate of the entropy of the English language [19] ) serving as the benchmark score. For example, the original GPT-2 announcement included the model's scores on WikiText-2, enwik8, text8, and WikiText-103 (all standard language modeling datasets derived from the English Wikipedia). [3] [20]
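As a minimal sketch of how these scores relate (assuming the summed negative log-likelihood is measured in nats):

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity is the exponential of the average negative log-likelihood per token."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Average negative log-likelihood per character, converted from nats to bits."""
    return total_nll_nats / num_chars / math.log(2)

# Example: 1.2 million nats of total loss over 1 million tokens gives perplexity exp(1.2) ~ 3.32
print(perplexity(1_200_000, 1_000_000))
```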

However, some datasets have been more commonly used, or specifically designed, for use as benchmarks.

General language understanding

See [22] for a review of over 100 such benchmarks.

General language generation

Open-book question-answering

Closed-book question-answering

Omnibus

Some benchmarks are "omnibus", meaning they are made by combining several previous benchmarks.

Multimodal

Some benchmarks specifically test for multimodal ability, usually between text, image, video, and audio.

Agency

Context length

Some benchmarks are designed specifically to test the processing of very long continuous text.

Reasoning

Mathematics

  • Alg514: 514 algebra word problems and associated equation systems gathered from Algebra.com. [115] [116]
  • Math23K: 23,164 elementary school Chinese mathematical word problems, collected from various online educational websites. [117]
  • AQuA-RAT (Algebra Question Answering with Rationales): Also known as just "AQuA". 100,000 algebraic word problems with 5 choices per problem, each annotated with the correct choice and a natural language rationale. 34,202 "seed problems" were collected from sources such as GMAT and GRE, and then expanded to the full dataset with Amazon Mechanical Turk. [118]
  • GSM8K (Grade School Math): 8.5K linguistically diverse elementary school math word problems that require 2 to 8 basic arithmetic operations to solve. [119] It contains some errors, which were corrected in GSM8K-Platinum. [120]
  • GSM1K: 1205 items with the same format and difficulty as GSM8K, held out more securely to avoid the data contamination concerns associated with GSM8K. [121]
  • MATH: 12,500 competition-level math problems divided into difficulty levels 1 to 5 (following the Art of Problem Solving scale), with AIME problems being level 5. There are 1,324 level-5 items. [122] An adversarial version, MATH-P, is obtained by modifying a few characters in the original questions. [123]
  • MathQA: 37,200 word problems in English. Each problem is drawn from AQuA-RAT and annotated with an "operation program" that exactly specifies the mathematical operations required to solve it, written in a domain-specific language with 58 operators. [124] A variant, MathQA-Python, consists of 23,914 problems produced by taking the solutions to a subset of MathQA and rewriting them into Python. [125]
  • MathEval: An omnibus benchmark that contains 20 other benchmarks, such as GSM8K, MATH, and the math subsection of MMLU. Over 20,000 math problems. Difficulty ranges from elementary school to high school competition. [126]
  • TheoremQA: 800 questions that test the use of 350 theorems from math, physics, electrical engineering, computer science, and finance. [127]
  • ProofNet: 371 theorems in undergraduate-level mathematics, each consisting of a formal statement in Lean, a natural language statement, and a natural language proof. There are two tasks: given an informal (formal) statement, produce a corresponding formal (informal) statement; and given an informal theorem statement, its informal proof, and its formal statement, produce a formal proof (a toy example of such a statement-and-proof pair is sketched after this list). [128] The benchmark was originally in Lean 3, [129] but the original authors deprecated it in favor of the Lean 4 version. [130]
  • miniF2F (mini formal-to-formal): 488 Olympiad-level mathematics problems from AIME, AMC, and IMO, stated in formal languages (Metamath, Lean, Isabelle (partially) and HOL Light (partially)). The task is to formally prove the formal statement, which can be verified automatically. [131]
  • U-MATH: 1100 math problems sourced from real-world university curricula, balanced across six subjects with 20% of problems including visual elements. [132]
  • MathBench: 3709 questions in English and Chinese, divided into 5 difficulty levels (basic arithmetic, primary school, middle school, high school, college). Divided into 2,209 questions of MathBench-T (theoretical) and 1,500 questions of MathBench-A (applied). [133]
  • PutnamBench: 1709 formalized versions of Putnam competition questions from 1962 to 2023. The task is to compute the numerical answer (if there is one) and to provide a formal proof. The formalizations are in Lean 4, Isabelle, and Coq. [134] [135]
  • Omni-MATH: 4428 competition-level math problems with human annotation. [136]
  • FrontierMath: Several hundred questions from areas of modern math that are difficult for professional mathematicians to solve. Many questions have integer answers, so that answers can be verified automatically. Held-out to prevent contamination. [137]
  • MathArena: Rather than a purpose-built problem set, the MathArena benchmark takes the latest math competitions (AIME and HMMT) as soon as they are released and uses them to benchmark LLMs, preventing contamination. [138]
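To illustrate the statement-and-proof format used by formal benchmarks such as ProofNet, miniF2F, and PutnamBench, here is a toy Lean 4 example, far simpler than actual benchmark items; the theorem name is made up for illustration.

```lean
-- A formal statement (commutativity of addition on the naturals)
-- together with a machine-checkable proof, in the spirit of items
-- from miniF2F / ProofNet. Real benchmark items are much harder.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```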

Programming

  • APPS: 10,000 problems from Codewars, AtCoder, Kattis, and Codeforces. [139]
  • MBPP (Mostly Basic Programming Problems): 974 short Python functions designed to be solved by entry-level programmers. Each comes with a text description and unit tests. They were written by an internal pool of crowdworkers who have basic knowledge of Python. [125]
  • DS-1000: 1000 data science problems obtained by reformulating 451 unique StackOverflow problems, requiring the use of 7 Python libraries, such as NumPy and Pandas. The responses are scored by running test cases and comparing outputs, and by checking for the presence or absence of specific APIs or keywords. [140] [141]
  • HumanEval: 164 problems where the solution is always a Python function, often just a few lines long. [8]
  • CodeElo: 387 contest problems from Codeforces during 2024, annotated with metadata such as contest divisions, problem difficulty ratings, and problem algorithm tags. Benchmarking is run by directly submitting to Codeforces, resulting in an Elo rating. Limited to 8 submissions per problem. [142]
  • Aider Polyglot: 225 of the hardest coding exercises from Exercism, in languages of C++, Go, Java, JavaScript, Python and Rust. [143]
  • BigCodeBench: 1140 tasks that require multiple function calls. The benchmark involves 139 libraries and 7 domains. BigCodeBench-Hard is a harder subset of 148 tasks. [144] [145]
  • SWE-bench: 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue, the task is to edit the codebase to solve the issue. [146] There are two subsets: Lite (300 problems that are faster to run) and Verified (a 500-problem subset validated by human software engineers). [147]
  • Multi-SWE-bench: 1,632 problems across 7 languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++. Similar to SWE-bench. [148]
  • SWE-bench Multimodal: a variant of SWE-bench, with 619 task instances from 17 popular JavaScript repositories, each featuring images that are required for solving the task. [149]
  • SWE-Lancer: 1,488 freelance software engineering tasks from Upwork. Includes implementation tasks (from $50 bug fixes to $32,000 feature implementations), called "IC" (for "Individual Contributor"), and "Management" tasks, where the model must choose between technical implementation proposals. [150] [151]
  • KernelBench: 250 PyTorch machine learning tasks, for which a CUDA kernel must be written. [152]
  • Cybench (cybersecurity bench): 40 professional-level Capture the Flag (CTF) tasks from 4 competitions. Tasks are broken down into subtasks for more fine-grained scoring. At least one professional-level human team at each competition was able to solve each of the tasks. The time it took the fastest team to solve each task ranged from 2 minutes to 25 hours. [153]
  • HCAST (Human-Calibrated Autonomy Software Tasks): 189 tasks in machine learning, cybersecurity, software engineering, and general reasoning. Each task has a "baseline": the measured average time required for a human skilled in the task domain, working under the same conditions as the AI agents. The baselines range from 1 minute to more than 8 hours. [154]
  • PaperBench: 8,316 individually gradable tasks that would be necessary for replicating 20 Spotlight and Oral papers from ICML 2024 from scratch. The human baseline, set by ML PhDs given 48 hours of effort (best of 3 attempts), is 41.4%. [155]
  • DSBench: 466 data analysis tasks and 74 data modeling tasks sourced from Kaggle and ModelOff competitions, spanning exploratory analysis, multi‑table joins, and predictive modeling with large CSVs and multimodal prompts. [156]
  • SpreadsheetBench: 912 real-world spreadsheet manipulation tasks scraped from public Excel help forums, spanning formula writing, data cleaning, filtering and layout edits in diverse formatting. Scored automatically on 2729 test cases at cell-, sheet- and overall levels. [157]

General

  • GPQA (Google-Proof Q&A): 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be PhD-level. The "Diamond" subset contains the 198 hardest questions. [158] [159] OpenAI found that human experts achieve an average score of 69.7% on the Diamond subset. [160]
  • SuperGPQA: 26,529 multiple-choice questions covering 285 graduate-level disciplines. The questions were gathered by individuals holding or pursuing a PhD and then refined and inspected with the help of large language models. [161]
  • MathVista: 6,141 questions involving quantitative reasoning that require reading a picture to solve. [162]
  • AGIEval: questions from 20 official, public, and high-standard admission and qualification exams, such as SAT, Gaokao, law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. [163]
  • OlympicArena: 11,163 problems from 62 distinct Olympic competitions. [164]
  • OlympiadBench: 8,476 math and physics problems in English and Chinese, sourced from International Olympiads, Chinese Olympiads, and Gaokao. [165]
  • ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence): Given three pairs of before-and-after diagrams of applying a rule, apply the same rule to the fourth before-diagram. It is similar to a Raven's Progressive Matrices test. [166]
  • LiveBench: A series of benchmarks released monthly, including high school math competition questions, competitive coding questions, logic puzzles, and other tasks. [167]
  • Humanity's Last Exam: 3,000 multimodal questions across over a hundred academic subjects, with a held-out private dataset left unreleased to prevent contamination. 10% of the questions require both image and text comprehension, and the rest are fully text-based. 80% of the questions are scored by exact string matching, and the rest are multiple-choice. [168]
  • SimpleBench: A multiple-choice text benchmark with over 200 questions covering spatio-temporal reasoning, social intelligence, and linguistic adversarial robustness (or trick questions). It is designed to test "everyday human reasoning". [169]

See also

References

  1. Chen, Danqi; Yih, Wen-tau (July 2020). Savary, Agata; Zhang, Yue (eds.). "Open-Domain Question Answering" . Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. Online: Association for Computational Linguistics: 34–37. doi:10.18653/v1/2020.acl-tutorials.8.
  2. Weng, Lilian (2020-10-29). "How to Build an Open-Domain Question Answering System?". lilianweng.github.io. Retrieved 2025-03-05.
  3. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (February 14, 2019). "Language Models are Unsupervised Multitask Learners" (PDF). OpenAI.
  4. "English Gigaword Fifth Edition". Linguistic Data Consortium . June 17, 2011. Retrieved 2025-05-17.
  5. Chelba, Ciprian; Mikolov, Tomas; Schuster, Mike; Ge, Qi; Brants, Thorsten; Koehn, Phillipp; Robinson, Tony (2013). "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling". arXiv: 1312.3005 [cs.CL].
  6. Dehghani, Mostafa; Tay, Yi; Gritsenko, Alexey A.; Zhao, Zhe; Houlsby, Neil; Diaz, Fernando; Metzler, Donald; Vinyals, Oriol (2021-07-14). "The Benchmark Lottery". arXiv: 2107.07002 [cs.LG].
  7. DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv: 2501.12948 [cs.CL].
  8. Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14). "Evaluating Large Language Models Trained on Code". arXiv: 2107.03374 [cs.LG].
  9. Vedantam, Ramakrishna; Lawrence Zitnick, C.; Parikh, Devi (2015). "CIDEr: Consensus-Based Image Description Evaluation". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 4566–4575.
  10. Anderson, Peter; Fernando, Basura; Johnson, Mark; Gould, Stephen (2016). "SPICE: Semantic Propositional Image Caption Evaluation". In Leibe, Bastian; Matas, Jiri; Sebe, Nicu; Welling, Max (eds.). Computer Vision – ECCV 2016. Lecture Notes in Computer Science. Vol. 9909. Cham: Springer International Publishing. pp. 382–398. arXiv: 1607.08822 . doi:10.1007/978-3-319-46454-1_24. ISBN   978-3-319-46454-1.
  11. Northcutt, Curtis G.; Athalye, Anish; Mueller, Jonas (2021-11-07). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks". arXiv: 2103.14749 [stat.ML].
  12. Richie, Russell; Grover, Sachin; Tsui, Fuchiang (Rich) (May 2022). Demner-Fushman, Dina; Cohen, Kevin Bretonnel; Ananiadou, Sophia; Tsujii, Junichi (eds.). "Inter-annotator agreement is not the ceiling of machine learning performance: Evidence from a comprehensive set of simulations". Proceedings of the 21st Workshop on Biomedical Language Processing. Dublin, Ireland: Association for Computational Linguistics: 275–284. doi: 10.18653/v1/2022.bionlp-1.26 .
  13. Artstein, Ron (2017), Ide, Nancy; Pustejovsky, James (eds.), "Inter-annotator Agreement" , Handbook of Linguistic Annotation, Dordrecht: Springer Netherlands, pp. 297–313, doi:10.1007/978-94-024-0881-2_11, ISBN   978-94-024-0881-2 , retrieved 2025-02-22
  14. Nie, Yixin; Zhou, Xiang; Bansal, Mohit (November 2020). "What Can We Learn from Collective Human Opinions on Natural Language Inference Data?". In Webber, Bonnie; Cohn, Trevor; He, Yulan; Liu, Yang (eds.). Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics. pp. 9131–9143. doi:10.18653/v1/2020.emnlp-main.734.
  15. Pavlick, Ellie; Kwiatkowski, Tom (November 2019). "Inherent Disagreements in Human Textual Inferences". Transactions of the Association for Computational Linguistics. 7: 677–694. doi: 10.1162/tacl_a_00293 . ISSN   2307-387X.
  16. Gururangan, Suchin; Swayamdipta, Swabha; Levy, Omer; Schwartz, Roy; Bowman, Samuel R.; Smith, Noah A. (2018-04-16). "Annotation Artifacts in Natural Language Inference Data". arXiv: 1803.02324 [cs.CL].
  17. Deng, Chunyuan; Zhao, Yilun; Tang, Xiangru; Gerstein, Mark; Cohan, Arman (June 2024). "Investigating Data Contamination in Modern Benchmarks for Large Language Models". In Duh, Kevin; Gomez, Helena; Bethard, Steven (eds.). Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics. pp. 8706–8719. arXiv: 2311.09783 . doi:10.18653/v1/2024.naacl-long.482.
  18. LI, Yanyang (2025-02-17), lyy1994/awesome-data-contamination , retrieved 2025-02-22
  19. Shannon, C. E. (1951). "Prediction and Entropy of Printed English" . Bell System Technical Journal. 30 (1): 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x. ISSN   1538-7305.
  20. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (February 14, 2019). "Better language models and their implications". OpenAI.
  21. Magnusson, Ian; Bhagia, Akshita; Hofmann, Valentin; Soldaini, Luca; Jha, Ananya Harsh; Tafjord, Oyvind; Schwenk, Dustin; Walsh, Evan Pete; Elazar, Yanai (2024-12-07). "Paloma: A Benchmark for Evaluating Language Model Fit". arXiv: 2312.10523 [cs.CL].
  22. Davis, Ernest (2023-10-23). "Benchmarks for Automated Commonsense Reasoning: A Survey". ACM Comput. Surv. 56 (4): 81:1–81:41. arXiv: 2302.04752 . doi:10.1145/3615355. ISSN   0360-0300.
  23. Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012). The Winograd Schema Challenge. Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.
  24. Kocijan, Vid; Davis, Ernest; Lukasiewicz, Thomas; Marcus, Gary; Morgenstern, Leora (2023-07-11). "The defeat of the Winograd Schema Challenge". Artificial Intelligence. 325 103971. arXiv: 2201.02387 . doi:10.1016/j.artint.2023.103971. ISSN   0004-3702. S2CID   245827747.
  25. Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin (2019). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv: 1907.10641 [cs.CL].
  26. "The Corpus of Linguistic Acceptability (CoLA)". nyu-mll.github.io. Archived from the original on 2025-03-11. Retrieved 2025-04-19.
  27. Warstadt, Alex; Singh, Amanpreet; Bowman, Samuel R. (November 2019). "Neural Network Acceptability Judgments". Transactions of the Association for Computational Linguistics. 7: 625–641. arXiv: 1805.12471 . doi:10.1162/tacl_a_00290. ISSN   2307-387X.
  28. Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D. (September 2015). "A large annotated corpus for learning natural language inference". In Màrquez, Lluís; Callison-Burch, Chris; Su, Jian (eds.). Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics. pp. 632–642. arXiv: 1508.05326 . doi:10.18653/v1/D15-1075.
  29. "The Stanford Natural Language Processing Group". nlp.stanford.edu. Retrieved 2025-02-22.
  30. Bojar, Ondřej; Buck, Christian; Federmann, Christian; Haddow, Barry; Koehn, Philipp; Leveling, Johannes; Monz, Christof; Pecina, Pavel; Post, Matt; Saint-Amand, Herve; Soricut, Radu; Specia, Lucia; Tamchyna, Aleš (June 2014). Bojar, Ondřej; Buck, Christian; Federmann, Christian; Haddow, Barry; Koehn, Philipp; Monz, Christof; Post, Matt; Specia, Lucia (eds.). "Findings of the 2014 Workshop on Statistical Machine Translation". Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics: 12–58. doi:10.3115/v1/W14-3302. hdl: 20.500.11820/789fbc29-61e0-4529-af4a-819461c57a8f .
  31. Williams, Adina; Nangia, Nikita; Bowman, Samuel R. (2018-02-19). "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference". arXiv: 1704.05426 [cs.CL].
  32. Chen, Danqi; Bolton, Jason; Manning, Christopher D. (2016-08-08). "A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task". arXiv: 1606.02858 [cs.CL].
  33. Zellers, Rowan; Bisk, Yonatan; Schwartz, Roy; Choi, Yejin (2018-08-16). "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference". arXiv: 1808.05326 [cs.CL].
  34. Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019-05-19). "HellaSwag: Can a Machine Really Finish Your Sentence?". arXiv: 1905.07830 [cs.CL].
  35. "HellaSwag". rowanzellers.com. Retrieved 2025-02-06.
  36. Lai, Guokun; Xie, Qizhe; Liu, Hanxiao; Yang, Yiming; Hovy, Eduard (2017-12-05). "RACE: Large-scale ReAding Comprehension Dataset From Examinations". arXiv: 1704.04683 [cs.CL].
  37. Paperno, Denis; Kruszewski, Germán; Lazaridou, Angeliki; Pham, Quan Ngoc; Bernardi, Raffaella; Pezzelle, Sandro; Baroni, Marco; Boleda, Gemma; Fernández, Raquel (2016-06-20). "The LAMBADA dataset: Word prediction requiring a broad discourse context". arXiv: 1606.06031 [cs.CL].
  38. Mishra, Swaroop; Khashabi, Daniel; Baral, Chitta; Hajishirzi, Hannaneh (2022-03-14). "Cross-Task Generalization via Natural Language Crowdsourcing Instructions". arXiv: 2104.08773 [cs.CL].
  39. Wang, Yizhong; Mishra, Swaroop; Alipoormolabashi, Pegah; Kordi, Yeganeh; Mirzaei, Amirreza; Arunkumar, Anjana; Ashok, Arjun; Dhanasekaran, Arut Selvan; Naik, Atharva (2022-10-24). "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks". arXiv: 2204.07705 [cs.CL].
  40. Zhou, Jeffrey; Lu, Tianjian; Mishra, Swaroop; Brahma, Siddhartha; Basu, Sujoy; Luan, Yi; Zhou, Denny; Hou, Le (2023-11-14). "Instruction-Following Evaluation for Large Language Models". arXiv: 2311.07911 [cs.CL].
  41. Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng (2023-12-24). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". arXiv: 2306.05685 [cs.CL].
  42. Sirdeshmukh, Ved; Deshpande, Kaustubh; Mols, Johannes; Jin, Lifeng; Cardona, Ed-Yeremai; Lee, Dean; Kritz, Jeremy; Primack, Willow; Yue, Summer; Xing, Chen (2025). "MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMS". arXiv: 2501.17399 [cs.CL].
  43. Daum, Shilo; Shapira, Tal; Bremler-Barr, Anat; Hay, David (2024). "Non-uniformity is All You Need: Efficient and Timely Encrypted Traffic Classification with ECHO". arXiv: 2406.01852 [cs.NI].
  44. Richardson, Matthew; Burges, Christopher J.C.; Renshaw, Erin (October 2013). "MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text". In Yarowsky, David; Baldwin, Timothy; Korhonen, Anna; Livescu, Karen; Bethard, Steven (eds.). Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics. pp. 193–203. doi:10.18653/v1/D13-1020.
  45. Baik, Jinho; Barraquand, Guillaume; Corwin, Ivan; Suidan, Toufic (2018). "Pfaffian Schur processes and last passage percolation in a half-quadrant". The Annals of Probability. 46 (6). arXiv: 1606.00525 . doi:10.1214/17-AOP1226.
  46. Wallis, Ben (2018). "Closed ideals of operators acting on some families of sequence spaces". arXiv: 1806.00382 [math.FA].
  47. Minev, Z. K.; Mundhada, S. O.; Shankar, S.; Reinhold, P.; Gutiérrez-Jáuregui, R.; Schoelkopf, R. J.; Mirrahimi, M.; Carmichael, H. J.; Devoret, M. H. (2019). "To catch and reverse a quantum jump mid-flight". Nature. 570 (7760): 200–204. arXiv: 1803.00545 . Bibcode:2019Natur.570..200M. doi:10.1038/s41586-019-1287-z. PMID   31160725.
  48. Reddy, Siva; Chen, Danqi; Manning, Christopher D. (2019-05-01). "CoQA: A Conversational Question Answering Challenge". Transactions of the Association for Computational Linguistics. 7: 249–266. arXiv: 1808.07042 . doi:10.1162/tacl_a_00266. ISSN   2307-387X.
  49. Berant, Jonathan; Chou, Andrew; Frostig, Roy; Liang, Percy (October 2013). "Semantic Parsing on Freebase from Question-Answer Pairs". In Yarowsky, David; Baldwin, Timothy; Korhonen, Anna; Livescu, Karen; Bethard, Steven (eds.). Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics. pp. 1533–1544. doi:10.18653/v1/D13-1160.
  50. Kwiatkowski, Tom; Palomaki, Jennimaria; Redfield, Olivia; Collins, Michael; Parikh, Ankur; Alberti, Chris; Epstein, Danielle; Polosukhin, Illia; Devlin, Jacob; Lee, Kenton; Toutanova, Kristina; Jones, Llion; Kelcey, Matthew; Chang, Ming-Wei; Dai, Andrew M. (2019-08-01). "Natural Questions: A Benchmark for Question Answering Research". Transactions of the Association for Computational Linguistics. 7: 453–466. doi: 10.1162/tacl_a_00276 . ISSN   2307-387X.
  51. Hague, Matthew; Meyer, Roland; Muskalla, Sebastian (2017). "Domains for Higher-Order Games". arXiv: 1705.00355 [cs.LO].
  52. Enns, John (2018). "Multiplicities in the ordinary part of mod $p$ cohomology for $\mathrm{GL}_n(\mathbb{Q}_p)$". arXiv: 1809.00278 [math.NT].
  53. Du Toit, E.J.; o'Brien, M.R.; Vann, R.G.L. (2017). "A Kinetic Study of Microwave Start-up of Tokamak Plasmas". EPJ Web of Conferences. 147: 01002. arXiv: 1704.00517 . Bibcode:2017EPJWC.14701002D. doi:10.1051/epjconf/201714701002.
  54. Wang, Yueyue; Zhao, Liang; Song, Zhijian; Wang, Manning (2018). "Organ at Risk Segmentation in Head and Neck CT Images by Using a Two-Stage Segmentation Framework Based on 3D U-Net". arXiv: 1809.00960 [cs.CV].
  55. Geva, Mor; Khashabi, Daniel; Segal, Elad; Khot, Tushar; Roth, Dan; Berant, Jonathan (2021-04-26). "Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies". Transactions of the Association for Computational Linguistics. 9: 346–361. doi: 10.1162/tacl_a_00370 . ISSN   2307-387X.
  56. Mendoza-Arenas, J. J.; Gómez-Ruiz, F. J.; Rodríguez, F. J.; Quiroga, L. (2019). "Enhancing violations of Leggett-Garg inequalities in nonequilibrium correlated many-body systems by interactions and decoherence". Scientific Reports. 9 (1): 17772. arXiv: 1903.00016 . Bibcode:2019NatSR...917772M. doi:10.1038/s41598-019-54121-1. PMC   6882789 . PMID   31780693.
  57. Khrennikov, Andrei; Ozawa, Masanao; Benninger, Felix; Shor, Oded (2024). "Coupling quantum-like cognition with the neuronal networks within generalized probability theory". arXiv: 2411.00036 [physics.soc-ph].
  58. Masry, Ahmed; Do, Xuan Long; Tan, Jia Qing; Joty, Shafiq; Hoque, Enamul (May 2022). "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning". In Muresan, Smaranda; Nakov, Preslav; Villavicencio, Aline (eds.). Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics. pp. 2263–2279. arXiv: 2203.10244 . doi:10.18653/v1/2022.findings-acl.177.
  59. "Industry Documents Library". industrydocuments.ucsf.edu. Retrieved 2025-04-05.
  60. "DocVQA". www.docvqa.org. Retrieved 2025-04-05.
  61. Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V. (2021). "DocVQA: A Dataset for VQA on Document Images": 2200–2209.
  62. "C-Eval: 一个适用于大语言模型的多层次多学科中文评估套件". cevalbenchmark.com. Retrieved 2025-02-25.
  63. Matias, José; Oliveira, Julio P.C.; Le Roux, Galo A.C.; Jäschke, Johannes (2022). "Steady-state real-time optimization using transient measurements on an experimental rig". Journal of Process Control. 115: 181–196. arXiv: 2109.00795 . doi:10.1016/j.jprocont.2022.04.015.
  64. Bisk, Yonatan; Zellers, Rowan; Bras, Ronan Le; Gao, Jianfeng; Choi, Yejin (2020-04-03). "PIQA: Reasoning about Physical Commonsense in Natural Language". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (5): 7432–7439. arXiv: 1911.11641 . doi:10.1609/aaai.v34i05.6239. ISSN   2374-3468.
  65. Jin, Di; Pan, Eileen; Oufattole, Nassim; Weng, Wei-Hung; Fang, Hanyi; Szolovits, Peter (January 2021). "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams". Applied Sciences. 11 (14): 6421. doi: 10.3390/app11146421 . hdl: 1721.1/136684.2 . ISSN   2076-3417.
  66. Lu, Pan; Mishra, Swaroop; Xia, Tanglin; Qiu, Liang; Chang, Kai-Wei; Zhu, Song-Chun; Tafjord, Oyvind; Clark, Peter; Kalyan, Ashwin (2022-12-06). "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering". Advances in Neural Information Processing Systems. 35: 2507–2521. arXiv: 2209.09513 .
  67. Jamal Abdul Nasir; Guan, Jingcheng; Jee, Woongkyu; Woodley, Scott M.; Sokol, Alexey A.; Catlow, C. Richard A.; Alin Marin Elena (2024). "Modelling Silica using MACE-MP-0 Machine Learnt Interatomic Potentials". arXiv: 2411.00436 [cond-mat.mtrl-sci].
  68. "Grok-1.5 Vision Preview | xAI". x.ai. Retrieved 2025-03-12.
  69. Majumdar, Arjun; Ajay, Anurag; Zhang, Xiaohan; Putta, Pranav; Yenamandra, Sriram; Henaff, Mikael; Silwal, Sneha; Mcvay, Paul; Maksymets, Oleksandr; Arnaud, Sergio; Yadav, Karmesh; Li, Qiyang; Newman, Ben; Sharma, Mohit; Berges, Vincent (2024). "OpenEQA: Embodied Question Answering in the Era of Foundation Models". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 16488–16498.
  70. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv: 1804.07461 [cs.CL].
  71. "GLUE Benchmark". gluebenchmark.com. Retrieved 2019-02-25.
  72. Herzig, Florian; Kozioł, Karol; Vignéras, Marie-France (2020). "On the Existence of Admissible Supersingular Representations of p-Adic Reductive Groups". Forum of Mathematics, Sigma. 8: e2. arXiv: 1905.00053 . doi:10.1017/fms.2019.50.
  73. Lovesey, S. W. (2022). "Polar magnetization unveiled by polarized neutron diffraction". Physical Review B. 106 (6) 064415. arXiv: 2206.00461 . Bibcode:2022PhRvB.106f4415L. doi:10.1103/PhysRevB.106.064415.
  74. Ddamulira, Mahadi; Emong, Paul; Geoffrey Ismail Mirumbe (2022). "Members of Narayana's cow sequence that are concatenations of two repdigits". arXiv: 2210.00926 [math.NT].
  75. Kazemi, Mehran; Fatemi, Bahare; Bansal, Hritik; Palowitch, John; Anastasiou, Chrysovalantis; Sanket Vaibhav Mehta; Jain, Lalit K.; Aglietti, Virginia; Jindal, Disha; Chen, Peter; Dikkala, Nishanth; Tyen, Gladys; Liu, Xin; Shalit, Uri; Chiappa, Silvia; Olszewska, Kate; Tay, Yi; Tran, Vinh Q.; Le, Quoc V.; Firat, Orhan (2025). "BIG-Bench Extra Hard". arXiv: 2502.19187 [cs.CL].
  76. Hernandez, A.; Woo, S.; Corrales, H.; Parra, I.; Kim, E.; Llorca, D. F.; Sotelo, M. A. (2020). "3D-DEEP: 3-Dimensional Deep-learning based on elevation patterns for road scene interpretation". 2020 IEEE Intelligent Vehicles Symposium (IV). pp. 892–898. arXiv: 2009.00330 . doi:10.1109/IV47402.2020.9304601. ISBN   978-1-7281-6673-5.
  77. Arjomandbigdeli, Ali; Mata, Andrew; Bak, Stanley (2024). "Verification of Neural Network Control Systems in Continuous Time". AI Verification. Lecture Notes in Computer Science. Vol. 14846. pp. 100–115. arXiv: 2406.00157 . doi:10.1007/978-3-031-65112-0_5. ISBN   978-3-031-65111-3.
  78. "openai/MMMLU · Datasets at Hugging Face". huggingface.co. 2024-10-22. Retrieved 2025-02-28.
  79. Zimmerman, Charlotte; Olsho, Alexis; Loverude, Michael; Suzanne White Brahmia (2023). "Expert covariational reasoning resources in physics graphing tasks". arXiv: 2306.00921 [physics.ed-ph].
  80. "MMMU". mmmu-benchmark.github.io. Retrieved 2025-02-28.
  81. Ates, Halim Cagri; Bhargava, Shruti; Li, Site; Lu, Jiarui; Maddula, Siddhardha; Moniz, Joel Ruben Antony; Nalamalapu, Anil Kumar; Nguyen, Roman Hoang; Ozyildirim, Melis; Patel, Alkesh; Piraviperumal, Dhivya; Renkens, Vincent; Samal, Ankit; Tran, Thy; Tseng, Bo-Hsiang; Yu, Hong; Zhang, Yuan; Zou, Shirley (2023). "MARRS: Multimodal Reference Resolution System". Proceedings of the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023). pp. 51–58. arXiv: 2311.01650 . doi:10.18653/v1/2023.crac-main.7.
  82. Hu, Kairui; Wu, Penghao; Pu, Fanyi; Xiao, Wang; Zhang, Yuanhan; Yue, Xiang; Li, Bo; Liu, Ziwei (2025). "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos". arXiv: 2501.13826 [cs.CV].
  83. "Video-MMMU". videommmu.github.io. Retrieved 2025-06-07.
  84. Ma, Yingjie; Guo, Jing; Maloney, Andrew; Braatz, Richard (2024). "Quasi-Steady-State Approach for Efficient Multiscale Simulation and Optimization of mAb Glycosylation in CHO Cell Culture". arXiv: 2409.00281 [math.NA].
  85. Padlewski, Piotr; Bain, Max; Henderson, Matthew; Zhu, Zhongkai; Relan, Nishant; Pham, Hai; Ong, Donovan; Aleksiev, Kaloyan; Ormazabal, Aitor; Phua, Samuel; Yeo, Ethan; Lamprecht, Eugenie; Liu, Qi; Wang, Yuqi; Chen, Eric; Fu, Deyu; Li, Lei; Zheng, Che; Cyprien de Masson d'Autume; Yogatama, Dani; Artetxe, Mikel; Tay, Yi (2024). "Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models". arXiv: 2405.02287 [cs.CL].
  86. "MMT-Bench". mmt-bench.github.io. Retrieved 2025-07-12.
  87. Bonneau, Pierre; Mazzilli, Emmanuel (2023). "Almost holomorphic curves in real analytic hypersurfaces". arXiv: 2311.01298 [math.CV].
  88. Ren, Kui; Soedjak, Nathan (2023). "Recovering coefficients in a system of semilinear Helmholtz equations from internal data". Inverse Problems. 40 (4): 045023. arXiv: 2307.01385 . Bibcode:2024InvPr..40d5023R. doi:10.1088/1361-6420/ad2cf9.
  89. Deng, Xiang; Gu, Yu; Zheng, Boyuan; Chen, Shijie; Stevens, Sam; Wang, Boshi; Sun, Huan; Su, Yu (2023-12-15). "Mind2Web: Towards a Generalist Agent for the Web". Advances in Neural Information Processing Systems. 36: 28091–28114. arXiv: 2306.06070 .
  90. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments". os-world.github.io. Retrieved 2025-02-24.
  91. "Windows Agent Arena: Evaluating Multi-modal OS Agents at Scale". microsoft.github.io. Retrieved 2025-02-24.
  92. Lin, Guying; Yang, Lei; Liu, Yuan; Zhang, Congyi; Hou, Junhui; Jin, Xiaogang; Komura, Taku; Keyser, John; Wang, Wenping (2024). "On Optimal Sampling for Learning SDF Using MLPS Equipped with Positional Encoding". arXiv: 2401.01391 [cs.CV].
  93. "Berkeley Function Calling Leaderboard". gorilla.cs.berkeley.edu. Retrieved 2025-03-11.
  94. Li, Tianyin (2024). "Quantum simulations of quantum electrodynamics in Coulomb gauge". arXiv: 2406.01204 [hep-lat].
  95. Barres, Victor; Dong, Honghua; Ray, Soham; Si, Xujie; Narasimhan, Karthik (2025). "$τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment". arXiv: 2506.07982 [cs.AI].
  96. "Terminal-Bench". Terminal-Bench. Retrieved 2025-05-25.
  97. Richarte, Martín G.; Toscano, Facundo; Lambas, Diego G.; Luparello, Heliana E.; Luiz Filipe Guimarães; Fabris, Júlio C. (2025). "Quasar pairs as large-scale structure tracers". Astronomy & Astrophysics. arXiv: 2504.01251 . doi:10.1051/0004-6361/202554998.
  98. https://x.com/GregKamradt/status/1722386725635580292
  99. Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (2020). "Long Range Arena: A Benchmark for Efficient Transformers". arXiv: 2011.04006 [cs.LG].
  100. Modarressi, Ali; Deilamsalehy, Hanieh; Dernoncourt, Franck; Bui, Trung; Rossi, Ryan A.; Yoon, Seunghyun; Schütze, Hinrich (2025). "NoLiMa: Long-Context Evaluation Beyond Literal Matching". arXiv: 2502.05167 [cs.CL].
  101. An, Chenxin; Gong, Shansan; Zhong, Ming; Zhao, Xingjian; Li, Mukai; Zhang, Jun; Kong, Lingpeng; Qiu, Xipeng (August 2024). Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek (eds.). "L-Eval: Instituting Standardized Evaluation for Long Context Language Models". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics: 14388–14411. arXiv: 2307.11088 . doi:10.18653/v1/2024.acl-long.776.
  102. Zhang, Xinrong; Chen, Yingfa; Hu, Shengding; Xu, Zihang; Chen, Junhao; Moo Khai Hao; Han, Xu; Zhen Leng Thai; Wang, Shuo; Liu, Zhiyuan; Sun, Maosong (2024). "$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens". arXiv: 2402.13718 [cs.CL].
  103. Shaham, Uri; Ivgi, Maor; Efrat, Avia; Berant, Jonathan; Levy, Omer (2023). "ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding". arXiv: 2305.14196 [cs.CL].
  104. Li, Tianle; Zhang, Ge; Quy Duc Do; Yue, Xiang; Chen, Wenhu (2024). "Long-context LLMS Struggle with Long In-context Learning". arXiv: 2404.02060 [cs.CL].
  105. "LongBench v2". longbench2.github.io. Retrieved 2025-02-21.
  106. Bai, Yushi; Tu, Shangqing; Zhang, Jiajie; Peng, Hao; Wang, Xiaozhi; Lv, Xin; Cao, Shulin; Xu, Jiazheng; Hou, Lei; Dong, Yuxiao; Tang, Jie; Li, Juanzi (2024). "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks". arXiv: 2412.15204 [cs.CL].
  107. Hsieh, Cheng-Ping; Sun, Simeng; Kriman, Samuel; Acharya, Shantanu; Rekesh, Dima; Jia, Fei; Zhang, Yang; Ginsburg, Boris (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?". arXiv: 2404.06654 [cs.CL].
  108. Lee, Jinhyuk; Chen, Anthony; Dai, Zhuyun; Dua, Dheeru; Devendra Singh Sachan; Boratko, Michael; Luan, Yi; Arnold, Sébastien M. R.; Perot, Vincent; Dalmia, Siddharth; Hu, Hexiang; Lin, Xudong; Pasupat, Panupong; Amini, Aida; Cole, Jeremy R.; Riedel, Sebastian; Naim, Iftekhar; Chang, Ming-Wei; Guu, Kelvin (2024). "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?". arXiv: 2406.13121 [cs.CL].
  109. Visser, Eline (2022). A grammar of Kalamang. Language Science Press. ISBN   978-3-96110-343-0.
  110. Visser, Eline (2021-09-24), dictionaria/kalamang: Kalamang dictionary, doi:10.5281/ZENODO.5526419 , retrieved 2025-04-05
  111. Tanzer, Garrett; Suzgun, Mirac; Visser, Eline; Jurafsky, Dan; Melas-Kyriazi, Luke (2023). "A Benchmark for Learning to Translate a New Language from One Grammar Book". arXiv: 2309.16575 [cs.CL].
  112. "FACTS Grounding: A new benchmark for evaluating the factuality of large language models". Google DeepMind. 2024-12-17. Retrieved 2025-06-07.
  113. Jacovi, Alon; Wang, Andrew; Alberti, Chris; Tao, Connie; Lipovetz, Jon; Olszewska, Kate; Haas, Lukas; Liu, Michelle; Keating, Nate; Bloniarz, Adam; Saroufim, Carl; Fry, Corey; Marcus, Dror; Kukliansky, Doron; Gaurav Singh Tomar; Swirhun, James; Xing, Jinwei; Wang, Lily; Gurumurthy, Madhu; Aaron, Michael; Ambar, Moran; Fellinger, Rachana; Wang, Rui; Zhang, Zizhao; Goldshtein, Sasha; Das, Dipanjan (2025). "The FACTS Grounding Leaderboard: Benchmarking LLMS' Ability to Ground Responses to Long-Form Input". arXiv: 2501.03200 [cs.CL].
  114. Vodrahalli, Kiran; Ontanon, Santiago; Tripuraneni, Nilesh; Xu, Kelvin; Jain, Sanil; Shivanna, Rakesh; Hui, Jeffrey; Dikkala, Nishanth; Kazemi, Mehran (2024-09-20). "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries". arXiv: 2409.12640 [cs.CL].
  115. Kushman, Nate; Artzi, Yoav; Zettlemoyer, Luke; Barzilay, Regina (June 2014). Toutanova, Kristina; Wu, Hua (eds.). "Learning to Automatically Solve Algebra Word Problems". Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics: 271–281. doi:10.3115/v1/P14-1026.
  116. Huang, Danqing; Shi, Shuming; Lin, Chin-Yew; Yin, Jian; Ma, Wei-Ying (August 2016). Erk, Katrin; Smith, Noah A. (eds.). "How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation" . Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics: 887–896. doi:10.18653/v1/P16-1084.
  117. Wang, Yan; Liu, Xiaojiang; Shi, Shuming (September 2017). "Deep Neural Solver for Math Word Problems". In Palmer, Martha; Hwa, Rebecca; Riedel, Sebastian (eds.). Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics. pp. 845–854. doi:10.18653/v1/D17-1088.
  118. Ling, Wang; Yogatama, Dani; Dyer, Chris; Blunsom, Phil (July 2017). Barzilay, Regina; Kan, Min-Yen (eds.). "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems". Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics: 158–167. arXiv: 1705.04146 . doi:10.18653/v1/P17-1015.
  119. Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob; Nakano, Reiichiro; Hesse, Christopher; Schulman, John (2021). "Training Verifiers to Solve Math Word Problems". arXiv: 2110.14168 [cs.LG].
  120. "madrylab/gsm8k-platinum · Datasets at Hugging Face". huggingface.co. Retrieved 2025-03-07.
  121. Zhang, Hugh; Da, Jeff; Lee, Dean; Robinson, Vaughn; Wu, Catherine; Song, Will; Zhao, Tiffany; Raja, Pranav; Zhuang, Charlotte; Slack, Dylan; Lyu, Qin; Hendryx, Sean; Kaplan, Russell; Lunati, Michele; Yue, Summer (2024). "A Careful Examination of Large Language Model Performance on Grade School Arithmetic". arXiv: 2405.00332 [cs.CL].
  122. Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021). "Measuring Mathematical Problem Solving with the MATH Dataset". arXiv: 2103.03874 [cs.LG].
  123. "MATH-Perturb". math-perturb.github.io. Retrieved 2025-04-09.
  124. Amini, Aida; Gabriel, Saadia; Lin, Peter; Koncel-Kedziorski, Rik; Choi, Yejin; Hajishirzi, Hannaneh (2019). "MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms". arXiv: 1905.13319 [cs.CL].
  125. Austin, Jacob; Odena, Augustus; Nye, Maxwell; Bosma, Maarten; Michalewski, Henryk; Dohan, David; Jiang, Ellen; Cai, Carrie; Terry, Michael; Le, Quoc; Sutton, Charles (2021). "Program Synthesis with Large Language Models". arXiv: 2108.07732 [cs.PL].
  126. math-eval (2025-01-26), math-eval/MathEval , retrieved 2025-01-27
  127. Chen, Wenhu; Yin, Ming; Ku, Max; Lu, Pan; Wan, Yixin; Ma, Xueguang; Xu, Jianyu; Wang, Xinyi; Xia, Tony (December 2023). "TheoremQA: A Theorem-driven Question Answering Dataset". In Bouamor, Houda; Pino, Juan; Bali, Kalika (eds.). Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics. pp. 7889–7901. arXiv: 2305.12524 . doi:10.18653/v1/2023.emnlp-main.489.
  128. Azerbayev, Zhangir; Piotrowski, Bartosz; Schoelkopf, Hailey; Ayers, Edward W.; Radev, Dragomir; Avigad, Jeremy (2023). "ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics". arXiv: 2302.12433 [cs.CL].
  129. Azerbayev, Zhangir (2025-04-02), zhangir-azerbayev/ProofNet , retrieved 2025-04-03
  130. deepseek-ai/DeepSeek-Prover-V1.5, DeepSeek, 2025-04-01, retrieved 2025-04-03
  131. openai/miniF2F, OpenAI, 2025-02-01, retrieved 2025-02-03
  132. Chernyshev, Konstantin; Polshkov, Vitaliy; Artemova, Ekaterina; Myasnikov, Alex; Stepanov, Vlad; Miasnikov, Alexei; Tilga, Sergei (2024). "U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMS". arXiv: 2412.03205 [cs.CL].
  133. Liu, Hongwei; Zheng, Zilong; Qiao, Yuxuan; Duan, Haodong; Fei, Zhiwei; Zhou, Fengzhe; Zhang, Wenwei; Zhang, Songyang; Lin, Dahua; Chen, Kai (2024). "MathBench: Evaluating the Theory and Application Proficiency of LLMS with a Hierarchical Mathematics Benchmark". arXiv: 2405.12209 [cs.CL].
  134. Tsoukalas, George; Lee, Jasper; Jennings, John; Xin, Jimmy; Ding, Michelle; Jennings, Michael; Thakur, Amitayush; Chaudhuri, Swarat (2024). "PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition". arXiv: 2407.11214 [cs.AI].
  135. "PutnamBench: A Multilingual Mathematics Benchmark for Formal Theorem-Proving". trishullab.github.io. Retrieved 2025-04-02.
  136. Gao, Bofei; Song, Feifan; Yang, Zhe; Cai, Zefan; Miao, Yibo; Dong, Qingxiu; Li, Lei; Ma, Chenghao; Chen, Liang; Xu, Runxin; Tang, Zhengyang; Wang, Benyou; Zan, Daoguang; Quan, Shanghaoran; Zhang, Ge; Sha, Lei; Zhang, Yichang; Ren, Xuancheng; Liu, Tianyu; Chang, Baobao (2024). "Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models". arXiv: 2410.07985 [cs.CL].
  137. Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Caroline Falkman Olsson; Denain, Jean-Stanislas; Ho, Anson; Emily de Oliveira Santos; Järviniemi, Olli; Barnett, Matthew; Sandler, Robert; Vrzala, Matej; Sevilla, Jaime; Ren, Qiuyu; Pratt, Elizabeth; Levine, Lionel; Barkley, Grant; Stewart, Natalie; Grechuk, Bogdan; Grechuk, Tetiana; Shreepranav Varma Enugandla; Wildon, Mark (2024). "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI". arXiv: 2411.04872 [cs.AI].
  138. "MathArena.ai". matharena.ai. Retrieved 2025-02-22.
  139. Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace; Song, Dawn; Steinhardt, Jacob (2021). "Measuring Coding Challenge Competence with APPS". arXiv: 2105.09938 [cs.SE].
  140. Lai, Yuhang; Li, Chengxi; Wang, Yiming; Zhang, Tianyi; Zhong, Ruiqi; Zettlemoyer, Luke; Scott Wen-tau Yih; Fried, Daniel; Wang, Sida; Yu, Tao (2022). "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". arXiv: 2211.11501 [cs.SE].
  141. "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". ds1000-code-gen.github.io. Retrieved 2025-03-11.
  142. "CodeElo". codeelo-bench.github.io. Retrieved 2025-02-13.
  143. Aider-AI/polyglot-benchmark, Aider AI, 2025-03-29, retrieved 2025-03-30
  144. Zhuo, Terry Yue; Chien, Vu Minh; Chim, Jenny; Hu, Han; Yu, Wenhao; Widyasari, Ratnadira; Yusuf, Imam Nur Bani; Zhan, Haolan; He, Junda; Paul, Indraneil; Brunner, Simon; Gong, Chen; Hoang, James; Zebaze, Armel Randy; Hong, Xiaoheng (2024-10-04). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions". Iclr 2025. arXiv: 2406.15877 .
  145. "BigCodeBench Leaderboard". bigcode-bench.github.io. Retrieved 2025-04-09.
  146. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?". arXiv: 2310.06770 [cs.CL].
  147. "Introducing SWE-bench Verified". openai.com.
  148. Zan, Daoguang; Huang, Zhirong; Liu, Wei; Chen, Hanwu; Zhang, Linhao; Xin, Shulin; Chen, Lu; Liu, Qi; Zhong, Xiaojian; Li, Aoyan; Liu, Siyao; Xiao, Yongsheng; Chen, Liangqiang; Zhang, Yuyu; Su, Jing; Liu, Tianyu; Long, Rui; Shen, Kai; Xiang, Liang (2025). "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving". arXiv: 2504.02605 [cs.SE].
  149. "SWE-bench". www.swebench.com. Retrieved 2025-02-11.
  150. openai/SWELancer-Benchmark, OpenAI, 2025-02-21, retrieved 2025-02-21
  151. Miserendino, Samuel; Wang, Michele; Patwardhan, Tejal; Heidecke, Johannes (2025). "SWE-Lancer: Can Frontier LLMS Earn $1 Million from Real-World Freelance Software Engineering?". arXiv: 2502.12115 [cs.LG].
  152. Ouyang, Anne; Guo, Simon; Arora, Simran; Zhang, Alex L.; Hu, William; Ré, Christopher; Mirhoseini, Azalia (2025). "KernelBench: Can LLMS Write Efficient GPU Kernels?". arXiv: 2502.10517 [cs.LG].
  153. "Cybench". cybench.github.io. Retrieved 2025-04-10.
  154. Rein, David; Becker, Joel; Deng, Amy; Nix, Seraphina; Canal, Chris; O'Connel, Daniel; Arnott, Pip; Bloom, Ryan; Broadley, Thomas; Garcia, Katharyn; Goodrich, Brian; Hasin, Max; Jawhar, Sami; Kinniment, Megan; Kwa, Thomas; Lajko, Aron; Rush, Nate; Lucas Jun Koba Sato; Sydney Von Arx; West, Ben; Chan, Lawrence; Barnes, Elizabeth (2025). "HCAST: Human-Calibrated Autonomy Software Tasks". arXiv: 2503.17354 [cs.AI].
  155. "PaperBench: Evaluating AI's Ability to Replicate AI Research". openai.com. Retrieved 2025-04-02.
  156. Jing, Liqiang; Huang, Zhehui; Wang, Xiaoyang; Yao, Wenlin; Yu, Wenhao; Ma, Kaixin; Zhang, Hongming; Du, Xinya; Yu, Dong (2024). "DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?". arXiv: 2409.07703 [cs.AI].
  157. Ma, Zeyao; Zhang, Bohan; Zhang, Jing; Yu, Jifan; Zhang, Xiaokang; Zhang, Xiaohan; Luo, Sijia; Wang, Xi; Tang, Jie (2024-12-16). "SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation". Advances in Neural Information Processing Systems. 37: 94871–94908.
  158. Rein, David; Betty Li Hou; Asa Cooper Stickland; Petty, Jackson; Richard Yuanzhe Pang; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark". arXiv: 2311.12022 [cs.AI].
  159. Rein, I. David (2025-08-24), idavidrein/gpqa , retrieved 2025-08-25
  160. "Learning to reason with LLMs". openai.com. September 12, 2024. Retrieved 2025-02-27.
  161. Team, M-A-P; et al. (2025). "SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines". arXiv: 2502.14739 [cs.CL].
  162. "MathVista: Evaluating Math Reasoning in Visual Contexts". mathvista.github.io. Retrieved 2025-03-07.
  163. Cui, Ruixiang (2025-02-03), ruixiangcui/AGIEval , retrieved 2025-02-03
  164. "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI". gair-nlp.github.io. Retrieved 2025-02-03.
  165. He, Chaoqun; Luo, Renjie; Bai, Yuzhuo; Hu, Shengding; Zhen Leng Thai; Shen, Junhao; Hu, Jinyi; Han, Xu; Huang, Yujie; Zhang, Yuxiang; Liu, Jie; Qi, Lei; Liu, Zhiyuan; Sun, Maosong (2024). "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems". arXiv: 2402.14008 [cs.CL].
  166. "ARC Prize". ARC Prize. Retrieved 2025-01-27.
  167. "LiveBench". livebench.ai. Retrieved 2025-01-27.
  168. "Humanity's Last Exam". lastexam.ai. Retrieved 2025-02-02.
  169. "SimpleBench". simple-bench.com. Retrieved 2025-04-09.