In artificial intelligence, Humanity's Last Exam (HLE) is a benchmark for evaluating the capabilities of large language models. It comprises 3,000 unambiguous, easily verifiable academic questions in mathematics, the humanities, and the natural sciences, contributed by nearly 1,000 subject experts from over 500 institutions across 50 countries, and is intended to measure model performance against expert-level humans on closed-ended academic questions. It was developed collaboratively by the Center for AI Safety and Scale AI.[1][2]
As LLMs have rapidly advanced, they have achieved over 90% accuracy on popular benchmarks such as Massive Multitask Language Understanding (MMLU), limiting the usefulness of these tests for measuring state-of-the-art capabilities.[citation needed] In response, HLE was introduced to provide a more challenging and comprehensive assessment.[citation needed]
The dataset is multi-modal: approximately 10% of the questions require both image and text comprehension, while the remaining 90% are text-based.[citation needed]
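As a minimal sketch, this split could be inspected programmatically, assuming the dataset is published on the Hugging Face Hub under the identifier cais/hle with a test split and an image field that is empty for text-only questions (all three are assumptions, not details confirmed by this article):

```python
# Minimal sketch: count the multi-modal share of HLE questions.
# Assumptions (not confirmed by the article): the dataset is hosted on the
# Hugging Face Hub as "cais/hle", exposes a "test" split, and each record
# has an "image" field that is empty for text-only questions.
from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")

with_image = sum(1 for row in dataset if row.get("image"))
total = len(dataset)

print(f"{with_image}/{total} questions ({with_image / total:.1%}) include an image")
```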
State-of-the-art LLMs have demonstrated low accuracy on HLE, highlighting substantial room for improvement. For instance, GPT-4o and Grok-2 achieved accuracies of 3.3% and 3.8%, respectively, while o3-mini (high), evaluated on the text-only questions, and Deep Research achieved accuracies of 13%[3] and 26.6%,[4] respectively.