Humanity's Last Exam (HLE) is a language model benchmark consisting of over 2,500 expert-level questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI, and was designed to test reasoning abilities and human-like intelligence, as opposed to just pattern recognition.
Benchmark tests like Humanity's Last Exam have long been used to evaluate reasoning and learning capabilities in machines. [1] Early benchmarks, such as the Turing Test, measured whether machines could demonstrate human-like conversation abilities. [2] Other early benchmarks evaluated computer vision, such as MNIST for handwritten digit recognition and ImageNet for large-scale image classification. [3] The emergence of large language models (LLMs) in the 2020s prompted benchmarks to evolve, with greater emphasis on interpretability, reproducibility, and clearer evaluation criteria. Recent foundation model benchmarks, such as MMLU, HellaSwag, and ARC Challenge, illustrate this shift. [4]
Humanity's Last Exam was created to keep pace with the rapid progress of LLMs and to provide a more demanding assessment of these models. Leading models had reached about 90% accuracy on earlier benchmarks, creating the need for a more difficult exam. [5] Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to the popular AI benchmarks having reached "saturation". [6] AI systems can surpass more focused, task-oriented tests, yet few perform well on broader assessments of general ability. [10] HLE was designed to test reasoning abilities, which are considered a measure of "human" intelligence. [11]
The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety. Hendrycks stated that he was inspired to create the test after a conversation with Elon Musk, who thought that existing language model benchmarks, such as MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions. [7]
The questions were crowdsourced from subject matter experts at various institutions across the world. [8] [9] Submitted questions were first tested against the leading AI models. Questions that the models failed to answer correctly, or on which the models did worse than random guessing in the case of multiple-choice questions, were then reviewed by human experts for accuracy and wording in two rounds before being approved for inclusion in the dataset. The submitters of the top-rated questions were given prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset". [9]
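The model-based filtering step described above can be illustrated with a minimal sketch. It is not the benchmark's actual pipeline: the `Candidate` type, its field names, and the rule of comparing the average accuracy across models with the random-guess rate are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A submitted question plus the results of trying it on several models.

    All names here are hypothetical and chosen only for this sketch.
    """
    text: str
    num_choices: Optional[int]  # None for short-answer questions
    model_correct: List[bool]   # one entry per model that attempted it


def survives_model_filter(q: Candidate) -> bool:
    """Return True if the question is hard enough to go on to human review."""
    if not q.model_correct:
        raise ValueError("question has not been tried on any model")
    if q.num_choices is None:
        # Short-answer: keep the question only if every model got it wrong.
        return not any(q.model_correct)
    # Multiple-choice (assumed rule): keep the question only if the models,
    # on average, score below the random-guess rate of 1 / num_choices.
    mean_accuracy = sum(q.model_correct) / len(q.model_correct)
    return mean_accuracy < 1.0 / q.num_choices


hard_short = Candidate("q1", None, [False, False, False])
easy_short = Candidate("q2", None, [False, True, False])
hard_mc = Candidate("q3", 4, [False, False, False, False, True])  # 0.20 < 0.25
easy_mc = Candidate("q4", 5, [True, False, False, False, False])  # 0.20 is not < 0.20
```

In this sketch, `hard_short` and `hard_mc` would be passed on to the two rounds of expert review, while `easy_short` and `easy_mc` would be rejected.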
The benchmark consists of 2,500 questions in the publicly released set. The paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting. [9]
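The percentages above can be converted into approximate question counts. The following sketch assumes the rounded percentages apply to exactly 2,500 public questions, so the resulting counts are estimates rather than figures taken from the paper.

```python
TOTAL_QUESTIONS = 2500  # size of the public set, as stated above

# Subject shares in percent, as listed in the paragraph above.
subject_share = {
    "mathematics": 41,
    "physics": 9,
    "biology/medicine": 11,
    "humanities/social science": 9,
    "computer science/artificial intelligence": 10,
    "engineering": 4,
    "chemistry": 7,
    "other": 9,
}


def approx_count(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate a question count from a rounded percentage."""
    return round(total * percent / 100)


subject_counts = {name: approx_count(p) for name, p in subject_share.items()}
multimodal = approx_count(14)        # questions needing text and images
multiple_choice = approx_count(24)   # multiple-choice questions
short_answer = TOTAL_QUESTIONS - multiple_choice  # exact-match questions

for name, count in subject_counts.items():
    print(f"{name}: ~{count}")
print(f"multi-modal: ~{multimodal}")
print(f"multiple-choice: ~{multiple_choice}, short-answer: ~{short_answer}")
```

Under these assumptions, mathematics accounts for roughly 1,025 questions, about 350 questions are multi-modal, and about 600 are multiple-choice.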
An example question: [7]
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
An independent investigation by FutureHouse, published in July 2025, suggested that around 30% of the HLE answers for text-only chemistry and biology questions could be incorrect; the benchmark's team partially replicated the findings, and said they hope to institute a continuous revisions process. [12]
| Organization | Model | Accuracy (%) ↑ | Calibration Error (%) ↓ |
|---|---|---|---|
| Google DeepMind | Gemini 3 Pro Preview | 37.52 | 57 |
| OpenAI | GPT-5 Pro | 31.64 | 49 |
| Anthropic | Claude Opus 4.5 (Thinking) | 25.20 | 55 |
| Z.ai | GLM 4.5 | 8.32 | 79 |
| Meta AI | Llama 4 Maverick | 5.68 | 83 |
| Mistral AI | Mistral Medium 3 | 4.52 | 77 |
| Amazon Web Services | Nova Pro | 4.40 | 80 |
| Organization | Model | Accuracy (%) ↑ | Calibration Error (%) ↓ |
|---|---|---|---|
| OpenAI | gpt-oss-120b | 15.48 | 76 |
| Alibaba Cloud | Qwen3-235B-A22B-Thinking-2507 | 15.43 | 78 |
| DeepSeek | DeepSeek-R1-0528 | 14.04 | 78 |
| Moonshot AI | Kimi-K2-Instruct | 4.68 | 82 |
| Amazon Web Services | Nova Micro | 4.41 | 84 |
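The two columns in the tables above can be illustrated with a short sketch. Accuracy is the percentage of questions answered correctly. Calibration error measures how far a model's stated confidence is from its actual accuracy, so lower is better. The binned root-mean-square formulation below is one common way to compute such an error; the number of bins, the function names, and the sample data are assumptions for illustration and may differ from the benchmark's own evaluation code.

```python
import math
from typing import List


def accuracy_percent(correct: List[bool]) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * sum(correct) / len(correct)


def rms_calibration_error(confidences: List[float],
                          correct: List[bool],
                          n_bins: int = 10) -> float:
    """Binned RMS gap between stated confidence and accuracy, in percent.

    `confidences` are the model's self-reported probabilities in [0, 1].
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(1 for _, ok in members if ok) / len(members)
        weighted_sq_gap += (len(members) / n) * (mean_conf - bin_acc) ** 2
    return 100.0 * math.sqrt(weighted_sq_gap)


# Hypothetical overconfident model: 90% confident on every answer, 25% correct.
confs = [0.9, 0.9, 0.9, 0.9]
right = [True, False, False, False]
acc = accuracy_percent(right)
cal = rms_calibration_error(confs, right)
```

A model that is highly confident but mostly wrong, as in this made-up example, receives a low accuracy and a high calibration error, which is the pattern visible in most rows of the tables.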