
Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI.


Creation

Stanford HAI's AI Index 2025 Annual Report cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation". The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety, who stated that he was inspired to create the test after a conversation with Elon Musk, who thought the existing language model benchmarks, such as the MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions. The questions were crowdsourced from subject matter experts at institutions across the world. Each question was first filtered by the leading AI models; if the models failed to answer it, or did worse than random guessing in the case of multiple-choice questions, it was reviewed by human experts in two rounds and approved for inclusion in the dataset. The submitters of the top-rated questions were awarded prize money from a pool of 500,000 U.S. dollars: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".
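The pre-filtering criterion can be illustrated with a minimal sketch. The code below is not the organizers' actual pipeline; it only shows, for an assumed question record with "prompt", "answer", and optional "choices" fields and hypothetical model callables, the stated rule that a question advances to expert review only if the models fail it or, for multiple-choice items, perform no better than random guessing.

def advances_to_review(question, models, n_trials=5):
    """Illustrative filter: keep a question only if the given models fail it.

    `question` is assumed to be a dict with keys 'prompt', 'answer', and,
    for multiple-choice items, 'choices'. Each model is assumed to be a
    callable mapping a prompt string to an answer string.
    """
    if "choices" in question:
        # Multiple-choice: a model must not beat random guessing.
        chance = 1.0 / len(question["choices"])
        for model in models:
            correct = sum(
                model(question["prompt"]) == question["answer"]
                for _ in range(n_trials)
            )
            if correct / n_trials > chance:
                return False  # at least one model beats chance, so the question is dropped
        return True
    # Short-answer: keep the question only if no model produces the exact-match answer.
    return all(model(question["prompt"]) != question["answer"] for model in models)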


Composition

The benchmark consists of 2,500 questions in the publicly released set. The paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.
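As a rough illustration of how that breakdown could be reproduced, the sketch below tallies subject and answer-type counts over the public set using the Hugging Face datasets library. The dataset identifier "cais/hle", the split name, and the field names "category", "answer_type", and "image" are assumptions for illustration, not details stated above.

from collections import Counter
from datasets import load_dataset

# Load the publicly released question set; the dataset id and split name are assumptions.
hle = load_dataset("cais/hle", split="test")

total = len(hle)
subjects = Counter(row["category"] for row in hle)         # assumed field: broad subject label
answer_types = Counter(row["answer_type"] for row in hle)  # assumed field: multiple-choice vs. exact match
multimodal = sum(1 for row in hle if row.get("image"))     # assumed field: non-empty when an image is attached

print("Subject breakdown:")
for subject, count in subjects.most_common():
    print(f"  {subject}: {100 * count / total:.0f}%")
print("Answer types:")
for answer_type, count in answer_types.most_common():
    print(f"  {answer_type}: {100 * count / total:.0f}%")
print(f"Multi-modal questions: {100 * multimodal / total:.0f}%")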


Results


References



External links


Humanity's Last Exam at the Center for AI Safety.
Humanity's Last Exam at Scale AI.