Humanity's Last Exam

Last updated March 25, 2025

Humanity's Last Exam (HLE) is a language model benchmark comprising 3,000 unambiguous, easily verifiable academic questions in mathematics, the humanities, and the natural sciences. The questions were contributed by almost 1,000 subject-matter experts from over 500 institutions across 50 countries, and are intended to test models against expert-level human performance on closed-ended academic questions. It was developed collaboratively by the Center for AI Safety and Scale AI.^[1]^[2]

Background

As large language models (LLMs) have rapidly advanced, they have achieved over 90% accuracy on popular benchmarks such as Massive Multitask Language Understanding (MMLU), which limits how well those tests can measure state-of-the-art capabilities.^[citation needed] In response, HLE was introduced as a more challenging and comprehensive assessment tool.^[citation needed]

Dataset composition

The dataset is multimodal: approximately 10% of the questions require both image and text comprehension, while the remaining 90% are text-only.^[citation needed]
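As a rough illustration of what these proportions mean in absolute terms, the following Python sketch applies the approximate 10% share to the 3,000-question total stated above. Both figures are approximate in the article, so the resulting counts are estimates; the function name `split_counts` is hypothetical.

```python
# Figures taken from the article; both are approximate.
TOTAL_QUESTIONS = 3000
MULTIMODAL_PERCENT = 10  # share of questions requiring image and text


def split_counts(total: int, multimodal_percent: int) -> dict[str, int]:
    """Estimate how many questions are multimodal versus text-only."""
    multimodal = total * multimodal_percent // 100
    return {"multimodal": multimodal, "text_only": total - multimodal}


counts = split_counts(TOTAL_QUESTIONS, MULTIMODAL_PERCENT)
print(counts)  # {'multimodal': 300, 'text_only': 2700}
```

Under these assumptions, a model evaluated only on text would be scored on roughly 2,700 of the 3,000 questions.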

Results

State-of-the-art LLMs have demonstrated low accuracy on HLE, highlighting substantial room for improvement. For instance, GPT-4o and Grok-2 achieved accuracies of 3.3% and 3.8%, respectively, while o3-mini (high), evaluated only on the text questions, achieved 13%^[3] and OpenAI's Deep Research achieved 26.6%.^[4]
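To make the accuracy figures concrete, here is a minimal Python sketch of how accuracy on closed-ended questions could be computed. The article does not describe how HLE answers are actually graded, so exact matching after light normalization is an assumption made here for illustration. The function names and the toy data are hypothetical and are not drawn from the benchmark.

```python
def normalize(answer: str) -> str:
    """Lowercase an answer and collapse surrounding/internal whitespace."""
    return " ".join(answer.strip().casefold().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical toy data, not taken from HLE.
references = ["42", "Paris", "H2O", "7"]
predictions = ["42", "london", " h2o ", "8"]
print(f"{accuracy(predictions, references):.1f}%")  # 50.0%
```

A score computed this way depends on which questions are included in the denominator, which is why a text-only evaluation is not directly comparable to one run on the full dataset.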

References

[1] Roose, Kevin (2025-01-23). "When A.I. Passes This Test, Look Out". The New York Times. ISSN 0362-4331. Retrieved 2025-02-04.
[2] Dastin, Jeffrey; Paul, Katie (2024-09-16). "AI experts ready 'Humanity's Last Exam' to stump powerful tech". Reuters.
[3] "Humanity's Last Exam". 2025-02-10. Archived from the original on 2025-02-10. Retrieved 2025-02-10.
[4] "Introducing deep research". openai.com. Retrieved 2025-02-10.

