Language model benchmarks are standardized tests designed to evaluate the performance of

language model A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...

s on various

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding,

generation A generation is all of the people born and living at about the same time, regarded collectively. It also is "the average period, generally considered to be about 20–⁠30 years, during which children are born and grow up, become adults, and b ...

, and

reasoning Reason is the capacity of consciously applying logic by drawing valid conclusions from new or existing information, with the aim of seeking the truth. It is associated with such characteristically human activities as philosophy, religion, scien ...

. Benchmarks generally consist of a

dataset A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record o ...

and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field.

Overview

Performance_of_AI_models_on_various_benchmarks_from_1998_to_2024

Types

Benchmarks may be described by the following adjectives, not mutually exclusive: * Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by

BLEU Bleu or BLEU may refer to: * '' Three Colors: Blue'', a 1993 film * BLEU (Bilingual Evaluation Understudy), a machine translation evaluation metric * Belgium–Luxembourg Economic Union * Blue cheese, a type of cheese * Parti bleu, 19th century ...

scores. * Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles

reading comprehension Reading comprehension is the ability to process written text, understanding, understand its meaning, and to integrate with what the reader already knows. Reading Comprehension of spoken language, comprehension relies on two abilities that are co ...

questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. * Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. * Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. * Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. * Agency: These tasks are for a language-model–based

software agent In computer science, a software agent is a computer program that acts for a user or another program in a relationship of agency. The term ''agent'' is derived from the Latin ''agere'' (to do): an agreement to act on one's behalf. Such "action on ...

that operates a computer for a user, such as editing images, browsing the web, etc. * Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after SOTA models have saturated a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. * Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark has access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm.

Lifecycle

Generally, the life cycle of a benchmark consists of the following steps: * Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). * Growth: More papers and models use the benchmark, and the performance on the benchmark grows. * Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. * Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress.

Construction

Like datasets, benchmarks are typically constructed by several methods, individually or in combination: * Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. * Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. * Crowd sourcing: Items may be constructed by paying people to write them, such as on

Amazon Mechanical Turk Amazon Mechanical Turk (MTurk) is a crowdsourcing website with which businesses can hire remotely located "crowdworkers" to perform discrete on-demand tasks that computers are currently unable to do as economically. It is operated under Amazon Web ...

. This was used for making the MCTest.

Evaluation

Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: * For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall,

F1 score In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number o ...

, etc. * pass@n: The model is given

n

attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. * k@n: The model makes

n

attempts to solve each problem, but only

k

attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. * cons@n: The model is given

n

attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making

N > n

attempts, and use the unbiased estimator

1- \frac

, where

c

is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores:

ROUGE,

METEOR A meteor, known colloquially as a shooting star, is a glowing streak of a small body (usually meteoroid) going through Earth's atmosphere, after being heated to incandescence by collisions with air molecules in the upper atmosphere, creating a ...

NIST The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. NIST's activities are organized into physical s ...

word error rate Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system. The WER metric typically ranges from 0 to 1, where 0 indicates that the compared pieces of text are exactly identical, and 1 (or larg ...

LEPOR LEPOR (Length Penalty, Precision, n-gram Position difference Penalty and Recall) is an automatic language independent machine translation evaluation metric with tunable parameters and reinforced factors. Background Since IBM proposed and realize ...

, CIDEr, SPICE, etc.

Issues

* error: Some benchmark answers may be wrong. * ambiguity: Some benchmark questions may be ambiguously worded. * subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a

formal language In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also c ...

is possible. * open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". * inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. * shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the sentences actually say. * contamination/leakage: Some benchmark questions may have answers already present in the training set. Also called "training on the test set". Some benchmarks (such as Big-Bench) may use a "canary string", so that documents containing the canary string can be voluntarily removed from the training set. * saturation: As time goes on, many models reach the highest performance level practically possible, and so the benchmark can no longer differentiate these models. For example, GLUE had been saturated, necessitating SuperGLUE. *

Goodhart's law Goodhart's law is an adage that has been stated as, "When a measure becomes a target, it ceases to be a good measure". It is named after British economist Charles Goodhart, who is credited with expressing the core idea of the adage in a 1975 arti ...

: If new models are designed or selected to score highly on a benchmark, the benchmark may cease to be a good indicator for model quality. *

cherry picking Cherry picking, suppressing evidence, or the fallacy of incomplete evidence is the act of pointing to individual cases or data that seem to confirm a particular position while ignoring a significant portion of related and similar cases or data th ...

: New model publications may only point to benchmark scores on which the new model performed well, avoiding benchmark scores that it did badly on.

List of benchmarks

General language modeling

Essentially any dataset can be used as a benchmark for statistical language modeling, with the

perplexity In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the ...

(or near-equivalently, negative

log-likelihood A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the j ...

and bits per character, as in the original Shannon's test of the entropy of the English language) being used as the benchmark score. For example, the original

GPT-2 Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of Generative pre-trained transformer, GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was par ...

announcement included those of the model on WikiText-2, enwik8, text8, and WikiText-103 (all being standard language datasets made from the English Wikipedia). However, there had been datasets more commonly used, or specifically designed, for use as a benchmark. * One Billion Word Benchmark: The negative log likelihood loss on a dataset of 1 billion words. * Penn Treebank: The error or negative log likelihood loss for part-of-speech tags on a dataset of text. * Paloma (Perplexity Analysis for Language Model Assessment): A collection of English and code texts, divided into 546 domains. Used to measure the perplexity of a model on specific domains.

General language understanding

See for a review of over 100 such benchmarks. *WSC ( Winograd schema challenge): 273 sentences with ambiguous pronouns. The task is to determine what the pronoun refers to. * WinoGrande: A larger version of WSC with 44,000 items. Designed to be adversarial to 2019 SOTA, since the original had been saturated. This dataset consists of fill-in-the-blank style sentences, as opposed to the pronoun format of previous datasets. * CoLA (Corpus of Linguistic Acceptability): 10,657 English sentences from published linguistics literature that were manually labeled either as grammatical or ungrammatical. * SNLI (Stanford Natural Language Inference: 570K human-written English sentence pairs manually labeled for balanced classification with 3 labels "

entailment Logical consequence (also entailment or logical implication) is a fundamental concept in logic which describes the relationship between statements that hold true when one statement logically ''follows from'' one or more statements. A valid l ...

", "contradiction", and "neutral". * WMT 2014 (Workshop on Statistical Machine Translation): a collection of 4

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

benchmarks at the Ninth Workshop on Statistical Machine Translation. The ''

Attention Is All You Need "Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism p ...

'' paper used it as a benchmark. * MultiNLI (Multi-Genre Natural Language Inference): Similar to SNLI, with 433K English sentence pairs from ten distinct genres of written and spoken English. * CNN/Daily Mail Reading Comprehension Task: Articles from

CNN Cable News Network (CNN) is a multinational news organization operating, most notably, a website and a TV channel headquartered in Atlanta. Founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable ne ...

(380K training, 3.9K development, 3.2K test) and

Daily Mail The ''Daily Mail'' is a British daily Middle-market newspaper, middle-market Tabloid journalism, tabloid conservative newspaper founded in 1896 and published in London. , it has the List of newspapers in the United Kingdom by circulation, h ...

(879K training, 64.8K development, 53.2K test) were scraped. The bullet point summaries accompanying the news articles were used. One entity in a bullet point was replaced with a placeholder, creating a cloze-style question. The goal is to identify the masked entity from the article. * SWAG (Situations With Adversarial Generations): 113K descriptions of activities or events, each with 4 candidate endings; the model must choose the most plausible ending. Adversarial against a few shallow language models ( MLP,

bag of words The bag-of-words (BoW) model is a model of text which uses an unordered collection (a "bag") of words. It is used in natural language processing and information retrieval (IR). It disregards word order (and thus most of syntax or grammar) but ca ...

, one-layer

, etc). * HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for SWAG): A harder version of SWAG. Contains 10K items. * RACE (ReAding Comprehension Examinations): 100,000 reading comprehension problems in 28,000 passages, collected from the English exams for middle and high school Chinese students in the age range between 12 to 18. * LAMBADA: 10,000 narrative passages from books, each with a missing last word that humans can guess if given the full passage but not from the last sentence alone.

General language generation

* NaturalInstructions: 61 distinct tasks with human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. * Super-NaturalInstructions: 1,616 diverse NLP tasks and their expert-written instructions, and 5M task instances. * IFEval (Instruction-Following Eval): 541 instructions to be followed, each containing at least one verifiable constraint, such as "mention the keyword of AI at least 3 times". * Chatbot Arena: Human users vote between two outputs from two language models. An

Elo rating The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess or esports. It is named after its creator Arpad Elo, a Hungarian-American chess master and physics professor. The Elo system wa ...

for each language model is computed based on these human votes. * MT-Bench (multi-turn benchmark): An automated version of Chatbot Arena where LLMs replace humans in generating votes. * MultiChallenge: 273 instances. Each instance is a multi-turn (up to 10 turns) conversation history between two parties, ending with a final user turn containing a requirement/question. Designed to test for instruction-following, context allocation, and in-context reasoning at the same time. Scored by LLM as judge with instance-level rubrics. * CharXiv: 9292 descriptive questions (examining basic chart elements) and 2323 reasoning questions (synthesizing information across complex visual elements) about 2323 charts from scientific papers.

Open-book question-answering

* MCTest (Machine Comprehension Test): 500 fictional stories, each with 4 multiple-choice questions (with at least 2 requiring multi-sentence understanding), designed to be understandable by a 7-year-old. The vocabulary was limited to approximately 8,000 words probably known by a 7-year-old. The stories were written by workers on

. * SQuAD (Stanford Question Answering Dataset): 100,000+ questions posed by crowd workers on 500+ Wikipedia articles. The task is, given a passage from Wikipedia and a question, find a span of text in the text that answers the question. * SQuAD 2.0: 50,000 unanswerable questions that look similar to SQuAD questions. Every such unanswerable question must be answered with an empty string. Written by crowd workers. * ARC (AI2 Reasoning Challenge): Multiple choice questions, with a Challenge Set (2590 questions) and an Easy Set (5197 questions). Designed specifically to be adversarial against models that had saturated SNLI and SQuAD. * CoQA (Conversational QA): 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. * WebQuestions: 6,642 question-answer pairs designed to be answerable with knowledge present in the 2013 version of Freebase. * Natural Questions: 323045 items. Each containing a question that had been searched on Google, a Wikipedia page relevant for answering the question, a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or "null" if no long/short answer is present. * TriviaQA: 650K question-answer-evidence triples. Includes 95K question-answer pairs scraped from 14 trivia and quiz-league websites, and (on average 6) evidence documents for each pair, gathered by searching with

Bing Bing most often refers to: * Bing Crosby (1903–1977), American singer * Microsoft Bing, a web search engine Bing may also refer to: Food and drink * Bing (bread), a Chinese flatbread * Bing (soft drink), a UK brand * Bing cherry, a varie ...

and Wikipedia. * OpenBookQA: 5960 multiple choice questions, each coming with an elementary level science fact (the "open book"). There are 1329 such facts in total. * SearchQA: 140,461 question-answer pairs from the J! Archive, with each pair augmented with (on average 50) snippets and urls obtained by searching the question on Google. * HotpotQA: 113K multi-hop questions that require reading multiple Wikipedia-based passages to answer. They were produced by showing crowd workers multiple supporting context documents and asking them to produce questions that requiring reasoning about all of the documents. * StrategyQA: 2,780 questions annotated with relevant passages from Wikipedia, such that the question require multi-hop reasoning over the passages to answer. For example, "Did Aristotle use a laptop?" is annotated with passages from the Wikipedia pages for "laptop" and "Aristotle". * DROP (Discrete Reasoning Over the content of Paragraphs): 96,567 questions along with Wikipedia passages, especially from narratives rich in numerical information (like sports summaries and history), often involving multi-step numerical reasoning over several text spans. Adversarial against 2019 SOTA. * GRS-QA: Graph Reasoning-Structured Question Answering Dataset. A dataset designed to evaluate question answering models on graph-based reasoning tasks. * ChartQA: 32,719 questions about 20,882

chart A chart (sometimes known as a graph) is a graphics, graphical representation for data visualization, in which "the data is represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart". A chart can repres ...

s crawled from four diverse online sources (

Statista Statista (styled in all lower case) is a German online platform that specializes in data gathering and visualization. In addition to publicly available third-party data, Statista also provides exclusive data via the platform, which is collect ...

Pew Research Center The Pew Research Center (also simply known as Pew) is a nonpartisan American think tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world. It ...

Our World In Data Our World in Data (OWID) is a scientific online publication that focuses on large global problems such as poverty, disease, hunger, war, climate change, population growth, existential risks, and inequality. It is a project of the Global Cha ...

OECD The Organisation for Economic Co-operation and Development (OECD; , OCDE) is an international organization, intergovernmental organization with 38 member countries, founded in 1961 to stimulate economic progress and international trade, wor ...

). Of these, 9,608 were human-written (in ChartQA-H), and 23,111 were machine-generated (in ChartQA-M). The answers are either verbatim texts from the chart or integers calculated based on the chart's data. * DocVQA: multimodal, 50,000 questions on 12,767 document images, sectioned from 6,071 distinct documents. The documents were sourced from 5 industries (tobacco, food, drug, fossil fuel, chemical) of the

UCSF The University of California, San Francisco (UCSF) is a public land-grant research university in San Francisco, California, United States. It is part of the University of California system and is dedicated entirely to health science and life ...

Industry Documents Library, mostly from the 1940-2010 period. Documents with structured elements like tables, forms, lists, and figures were prioritized. The answers are verbatim extracts from the document text.

Closed-book question-answering

*C-Eval (Chinese Eval): 13948 multiple choice questions about in 52 subjects at 4 levels of difficulty. In Chinese. * TruthfulQA: 817 questions in health, law, finance and politics with common misconceptions. Adversarial against

GPT-3 Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer model of deep neural network, which supersedes recurrence and convolution-based ...

and T5. * PIQA (Physical Interaction QA): 17951 two-choice questions. Each question gives a goal (like separating egg yolk from egg white with a water bottle), and 2 choices for accomplishing it. * MedQA: 61097 questions from professional medical board exams, in English, Simplified Chinese, Traditional Chinese. * ScienceQA: 21208 multiple choice questions in natural science, social science, and linguistics, with difficulty level from grade 1 to grade 12, sourced from elementary and high school science curricula. Some questions require reading a diagram. Most questions are annotated with lecture textual lectures and explanations. * SimpleQA: 4,326 short questions that are answerable with knowledge as of 2023. Each answer is graded as either "correct", "incorrect", or "not attempted". Adversarial against GPT-4 specifically. * RealWorldQA: 765 multimodal multiple-choice questions. Each containing an image and a question. Designed to test spatial understanding. Images are drawn from various real-world scenarios, including those captured from vehicles. * OpenEQA (Open Embodied QA): over 1600 questions accompanying about videos, scans of real-world environments, and simulations.

Omnibus

Some benchmarks are "omnibus", meaning they are made by combining several previous benchmarks. *GLUE (General Language Understanding Evaluation): collection of 9 benchmarks designed for testing general language understanding. The tasks are in the format of sentence- or sentence-pair. There are over 1M items. *SuperGLUE: An update to GLUE. Designed to be still challenging to the SOTA models of the time (2019) since the original had been saturated. Includes 8 additional tasks (e.g. logical reasoning, commonsense inference, coreference resolution). *Big-Bench (Beyond the Imitation Game): A benchmark collection of 204 tasks. A particular subset of 23 tasks is called BBH (Big-Bench Hard). An adversarial variant of BBH is called BBEH (Big-Bench Extra Hard), made by replacing each of the 23 tasks from BBH with a similar but adversarial variant. * MMLU (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine. Upgraded to MMLU-Pro which increases the number of choices from 4 to 10, eliminated the trivial and noisy questions from MMLU, and added harder problems. * MMMLU (Multilingual MMLU): The test set of MMLU, translated into 14 languages by professional human translators. * CMMLU (Chinese MMLU): 1,528 multiple-choice questions across 67 subjects, 16 of which are "China-specific", like

Classical Chinese Classical Chinese is the language in which the classics of Chinese literature were written, from . For millennia thereafter, the written Chinese used in these works was imitated and iterated upon by scholars in a form now called Literary ...

. Some data collected from non-publicly available materials, mock exam questions, and questions from quiz shows to avoid contamination. More than 80% of the data was crawled from PDFs after OCR. * MMT-Bench: A comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. Comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding.

Multimodal

Some benchmarks specifically test for multimodal ability, usually between text, image, video, and audio. *MMMU (Massive Multi-discipline Multimodal Understanding): A vision-language version of MMLU. 11550 questions collected from college exams, quizzes, and textbooks, covering 30 subjects. The questions require image-understanding to solve. Includes multiple-choice questions and open-ended QA (which are scored by

regex A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...

extraction). Human expert baseline is 89%. * VideoMMMU: Like MMMU, but with videos. Contains 300 college-level lecture videos in 30 subjects in 6 disciplines (Art, Business, Science, Medicine, Humanities, and Engineering), with 900 questions. * MMMU-Pro: 1730 multiple-choice multimodal questions in the same format as MMMU, designed to be adversarial against text-only models. Some problems in MMMU turned out to be answerable without looking at the images, necessitating MMMU-Pro. Each question has 10 choices, and presented in both text-image format, and screenshot/photo format. * Vibe-Eval: 269 visual understanding prompts, with standard responses written by experts. Of these, 100 were "hard" meaning they could not be solved by an LLM (Reka Core) at the time of publication. Automatic scoring by LLMs.

Agency

* GAIA: 450 questions with unambiguous answers that require information that can be obtained by browsing the Internet, requiring different levels of tooling and autonomy to solve. Divided into 3 difficulty levels. * WebArena: 241 mock-up websites based on real-world websites (

Reddit Reddit ( ) is an American Proprietary software, proprietary social news news aggregator, aggregation and Internet forum, forum Social media, social media platform. Registered users (commonly referred to as "redditors") submit content to the ...

GitLab GitLab is a software forge primarily developed by GitLab Inc. It is available as a community edition and a commercial edition. History GitLab was created in 2011 by Ukrainian programmer Dmitriy Zaporozhets as a side project written in Rub ...

Magento Magento is an open-source e-commerce platform written in PHP. Magento source code is distributed under the Open Software License. Magento was acquired by Adobe Inc in May 2018 for $1.68 billion. More than 150,000 online stores have been cre ...

's admin portal, etc), and 812 tasks to be performed on the websites. The tasks include information-seeking, site navigation, and content and configuration operation. * Mind2Web: 2,350 tasks collected from 137 websites, and crowdsourced action sequences. The task is to reproduce the action sequence. * OSWorld: 369 multimodal computer-using tasks, involving multiple real web and desktop apps and OS file I/O. In both

Windows Windows is a Product lining, product line of Proprietary software, proprietary graphical user interface, graphical operating systems developed and marketed by Microsoft. It is grouped into families and subfamilies that cater to particular sec ...

and

Ubuntu Ubuntu ( ) is a Linux distribution based on Debian and composed primarily of free and open-source software. Developed by the British company Canonical (company), Canonical and a community of contributors under a Meritocracy, meritocratic gover ...

. Each task includes an initial state setup configuration, and is tested by an execution-based evaluation script. * Windows Agent Arena: 154 multimodal tasks with the same format as OSWorld. Only in Windows. * WebVoyager: 643 multimodal tasks based on 15 popular websites. Evaluation is by screenshotting the action sequence and asking a vision language model to judge. * BFCL (Berkeley Function-Calling Leaderboard): The task is to write API calls according to a specification. Released in 3 versions, with 1760, 2251, and 1000 items respectively. Some calls are evaluated by parsing into an AST and comparing against the reference answer, while others are evaluated by calling and comparing the response against the reference response. Includes

Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...

Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...

JavaScript JavaScript (), often abbreviated as JS, is a programming language and core technology of the World Wide Web, alongside HTML and CSS. Ninety-nine percent of websites use JavaScript on the client side for webpage behavior. Web browsers have ...

SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...

, and

REST API REST (Representational State Transfer) is a software architectural style that was created to describe the design and guide the development of the architecture for the World Wide Web. REST defines a set of constraints for how the architecture of ...

. * TAU-bench (Tool-Agent-User benchmark, also written as τ-bench): Two environments (retail, airline booking) that test for an agent to fulfill user instructions, interactively over multiple turns of dialogue. The user is simulated by a language model. * terminal-bench: A collection of complex tasks in the Linux terminal. * BrowseComp: 1,266 questions that requires internet browsing for finding a short factual answer. Adversarial against GPT-4o with and without browsing, OpenAI o1, and an early version of the Deep Research model. *

Context length

Some benchmarks were designed specifically to test for processing continuous text that is very long. * Needle in a haystack tests (NIH): This is not a specific benchmark, but a method for benchmarking context lengths. In this method, a long context window is filled with text, such as Paul Graham's essays, and a random statement is inserted. The task is to answer a question about the inserted statement. * Long Range Arena: 6 synthetic tasks that required 1K to 16K tokens of context length to solve. * NoLiMa: Long-Context Evaluation Beyond Literal Matching. The benchmark assesses long-context models beyond simple keyword matching. Specifically, the words in the question have minimal or no direct lexical overlap with the words in the "needle" sentence. The "haystacks" are 10 open-licensed books. * L-Eval: 2,000+ human-labeled query-response pairs over 508 long documents in 20 tasks, including diverse task types, domains, and input length (3K--200K tokens). * InfiniteBench: 3946 items in 12 tasks from 5 domains (retrieval, code, math, novels, and dialogue) with context lengths exceeding 100K tokens. * ZeroSCROLLS: 4,378 items in 6 tasks. Includes 6 tasks from SCROLLS and introduces 4 new datasets. Named "zero" because it was designed for zero-shot learning during the early days of pretraining paradigm, back when zero-shot capability was uncommon. * LongBench: 4,750 tasks on 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). Updated with LongBench v2 that contained 503 more tasks, that require a context length ranging from 8K to 2M words, with the majority under 128K. * RULER: 13 tasks in 4 categories (retrieval, multi-hop, aggregation, question answering). Each task is specified by a program which can generate arbitrarily long instances of each task on demand. * LOFT (Long-Context Frontiers): 6 long-context task categories (text retrieval, visual retrieval, audio retrieval,

retrieval-augmented generation Retrieval-augmented generation (RAG) is a technique that enables large language model, large language models (LLMs) to retrieve and incorporate new information. With RAG, LLMs do not respond to user queries until they refer to a specified set of d ...

-like dataset query, many-shot in-context learning) in 35 datasets and 4 modalities. Up to 1 million tokens. * MTOB (Machine Translation from One Book): translate sentences between English and Kalamang after reading a grammar book of Kalamang (~570 pages), a bilingual word list (2,531 entries, with Part-of-Speech tags) and a small parallel corpus of sentence pairs (~400 train sentences, 100 test sentences, filtered to exclude examples from the book), both published on ''Dictionaria''. * FACTS Grounding: 1,719 items divided into a public set (860) and a private held-out (859) set. Each contains a document, a system instruction requiring the LLM to exclusively reference the provided document, and a user request that requires understanding of the document. Answers are scored by frontier LLMs.

Reasoning

Mathematics

* Alg514: 514 algebra word problems and associated equation systems gathered from Algebra.com. * Math23K: 23,164 elementary school Chinese mathematical word problems, collected from various online educational websites. * AQuA-RAT (Algebra Question Answering with Rationales): Also known as just "AQuA". 100,000 algebraic word problems with 5 choices per problem, and an annotation for the correct choice with natural language rationales. 34,202 "seed problems" were collected from many sources, such as GMAT and GRE, which were then expanded to the full dataset with Amazon Turk. * GSM8K (Grade School Math): 8.5K linguistically diverse

elementary school A primary school (in Ireland, India, the United Kingdom, Australia, New Zealand, Trinidad and Tobago, Jamaica, South Africa, and Singapore), elementary school, or grade school (in North America and the Philippines) is a school for primary ...

math word problems that require 2 to 8 basic arithmetic operations to solve. Contains errors that had been corrected with GSM8K-Platinum. * GSM1K: 1205 items with the same format and difficulty as GSM8K. More securely contained to avoid the data contamination concerns with the previous GSM8K. * MATH: 12,500 competition-level math problems divided into difficulty levels 1 to 5 (as the Art of Problem Solving), with AIME problems being level 5. There are 1,324 level 5 items. An adversarial version is MATH-P, obtained by modifying a few characters in the original questions. * MathQA: 37,200 word problems in English. Each problem came from AQuA-RAT, and annotated with an "operation program" which exactly specifies the mathematical operations required to solve the problem, written in a

domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...

with 58 operators. Has a variant, MathQA-Python, consisting of 23,914 problems, produced by taking the solutions to a subset of the MathQA dataset, and rewriting into Python. * MathEval: An omnibus benchmark that contains 20 other benchmarks, such as GSM8K, MATH, and the math subsection of MMLU. Over 20,000 math problems. Difficulty ranges from elementary school to high school competition. * TheoremQA: 800 questions that test for the use of 350 theorems from math, physics, electric engineering, computer science, and finance. * ProofNet: 371 theorems in undergraduate-level mathematics, each consisting of a formal statement in Lean, a natural language statement, and a natural language proof. There are two tasks: given an informal (formal) statement, produce a corresponding formal (informal) statement; given an informal theorem statement, its informal proof, and its formal statement, produce a formal proof. Originally was in Lean 3, but the original authors deprecated it in favor of the Lean 4 version. * miniF2F (mini formal-to-formal): 488 Olympiad-level mathematics problems from

AIME Aime (; ) is a former commune in the Savoie ''département'' in the Auvergne-Rhône-Alpes region in southeastern France. On 1 January 2016, it was merged into the new commune of Aime-la-Plagne.AMC AMC may refer to: Film and television * AMC Theatres, an American movie theater chain * AMC Networks, an American entertainment company ** AMC (TV channel) ** AMC+, streaming service ** AMC Networks International, an entertainment company *** ...

, and IMO, stated in formal languages (

Metamath Metamath is a formal language and an associated computer program (a proof assistant) for archiving and verifying mathematical proofs. Several databases of proved theorems have been developed using Metamath covering standard results in logic, set ...

, Lean,

Isabelle Isabel is a female name of Iberian origin. Isabelle is a name that is similar, but it is of French origin. It originates as the medieval Spanish form of '' Elisabeth'' (ultimately Hebrew ''Elisheba''). Arising in the 12th century, it became popul ...

(partially) and

HOL Light HOL Light is a proof assistant for classical higher-order logic. It is a member of the HOL theorem prover family. Compared with other HOL systems, HOL Light is intended to have relatively simple foundations. HOL Light is authored and maintained ...

(partially)). The task is to formally prove the formal statement, which can be verified automatically. * U-MATH: 1100 math problems sourced from real-world university curricula, balanced across six subjects with 20% of problems including visual elements. * MathBench: 3709 questions in English and Chinese, divided into 5 difficulty levels (basic arithmetic, primary school, middle school, high school, college). Divided into 2,209 questions of MathBench-T (theoretical) and 1,500 questions of MathBench-A (applied). * PutnamBench: 1709 formalized versions of Putnam competition questions during 1962 - 2023. The task is to compute the numerical answer (if there is a numerical answer) and to provide a formal proof. The formalizations are in Lean 4,

, and

Coq Coenzyme Q10 (CoQ10 ), also known as ubiquinone, is a naturally occurring biochemical cofactor (coenzyme) and an antioxidant produced by the human body. It can also be obtained from dietary sources, such as meat, fish, seed oils, vegetables, ...

. * Omni-MATH: 4428 competition-level math problems with human annotation. * FrontierMath: Several hundred questions from areas of modern math that are difficult for professional mathematicians to solve. Many questions have integer answers, so that answers can be verified automatically. Held-out to prevent contamination. * MathArena: Instead of a purpose-built benchmark, the MathArena benchmark simply takes the latest math competitions (AIME and HMMT) as soon as possible and uses those to benchmark LLMs, to prevent contamination.

Programming

* APPS: 10,000 problems from Codewars, AtCoder, Kattis, and

Codeforces Codeforces () is a website that hosts competitive programming contests. It is maintained by a group of competitive programmers from ITMO University led by Mikhail Mirzayanov. Since 2013, Codeforces claims to surpass TopCoder in terms of active co ...

. * MBPP (Mostly Basic Programming Problems): 974 short Python functions designed to be solved by entry-level programmers. Each comes with a text description and unit tests. They were written by an internal pool of crowdworkers who have basic knowledge of Python. * DS-1000: 1000

data science Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, stru ...

problems obtained by reformulating 451 unique

StackOverflow In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many fac ...

problems, requiring the use of 7 Python libraries, such as NumPy and Pandas. The resposes are scored by running test cases and comparing outputs, and checking for the presence/absence of specific APIs or keywords. * HumanEval: 164 problems where the solution is always a python function, often just a few lines long. * CodeElo: 387 contest problems from

during 2024, annotated with metadata such as contest divisions, problem difficulty ratings, and problem algorithm tags. Benchmarking is run by directly submitting to Codeforces, resulting in an

. Limited to 8 submissions per problem. * Aider Polyglot: 225 of the hardest coding exercises from Exercism, in languages of C++, Go, Java, JavaScript, Python and Rust. * BigCodeBench: 1140 tasks that requires multiple function calls. The benchmark involves 139 libraries and 7 domains. A subset BigCodeBench-Hard involves just a 148-task subset of the full benchmark. * SWE-bench: 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue, the task is to edit the codebase to solve the issue. There are 2 subsets: Lite (300 problems that are faster to run), Verified (human-validated subset of 500 problems reviewed by software engineers). * Multi-SWE-bench: 1,632 problems across 7 languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++. Similar to SWE-bench. * SWE-bench Multimodal: a variant of SWE-bench, with 619 task instances from 17 popular JavaScript repositories, each featuring images that are required for solving the task. * SWE-Lancer: 1,488 freelance software engineering tasks from

Upwork Upwork Inc., formerly Elance-oDesk, is an American freelancing platform headquartered in Santa Clara and San Francisco, California. The company was formed in 2013 as Elance-oDesk after the merger of Elance Inc. and oDesk Corp. The merged compa ...

. Includes implementation tasks (from $50 bug fixes to $32,000 feature implementations), called "IC" (for "Individual Contributor"), and "Management" tasks, where the model must choose between technical implementation proposals. * KernelBench: 250 PyTorch machine learning tasks, for which a CUDA kernel must be written. * Cybench (cybersecurity bench): 40 professional-level Capture the Flag (CTF) tasks from 4 competitions. Tasks are broken down into subtasks for more fine-grained scoring. At least one professional-level human team at each competition was able to solve each of the tasks. The time it took the fastest team to solve each task ranged from 2 minutes to 25 hours. * HCAST (Human-Calibrated Autonomy Software Tasks): 189 tasks in machine learning, cybersecurity, software engineering, and general reasoning. Each task has a "baseline", the measured average time required for a human skilled in the task domains, working under identical conditions as AI agents. The baseline ranges from 1 minute to 8+ hours. * PaperBench: 8,316 individually gradable tasks that would be necessary for replicating 20 Spotlight and Oral papers from ICML 2024 from scratch. The human baseline of ML PhDs (best of 3 attempts) at 48 hours of effort is 41.4%.

General

* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be PhD-level. The "Diamond" subset contains the 198 hardest questions in it. OpenAI found that human experts achieve an average score of 69.7% on the Diamond subset. * SuperGPQA: 26,529 multiple-choice questions collected by domain experts in 285 graduate-level disciplines. The questions were collected by individuals with or pursuing a PhD and then refined and inspected with the help of large language models. * MathVista: 6,141 questions involving quantitative reasoning that requires reading a picture to solve. * AGIEval: questions from 20 official, public, and high-standard admission and qualification exams, such as

SAT The SAT ( ) is a standardized test widely used for college admissions in the United States. Since its debut in 1926, its name and Test score, scoring have changed several times. For much of its history, it was called the Scholastic Aptitude Test ...

Gaokao The Nationwide Unified Examination for Admissions to General Universities and Colleges (), commonly abbreviated as the Gaokao (), is the annual nationally coordinated undergraduate admission exam in mainland China, held in early June. Despite the ...

, law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. * OlympicArena: 11,163 problems from 62 distinct Olympic competitions. * OlympiadBench: 8,476 math and physics problems in English and Chinese, sourced from International Olympiads, Chinese Olympiads, and Gaokao. * ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence): Given three pairs of before-and-after diagrams of applying a rule, apply the same rule to the fourth before-diagram. It is similar to a

Raven's Progressive Matrices Raven's Progressive Matrices (often referred to simply as Raven's Matrices) or RPM is a non-verbal test typically used to measure general human intelligence and abstract reasoning and is regarded as a non-verbal estimate of fluid intelligence. I ...

test. * LiveBench: A series of benchmarks released monthly, including high school math competition questions, competitive coding questions, logic puzzles, and other tasks. *

Humanity's Last Exam Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI. Creation Stanford HAI's AI Index 2025 Annual Report cites ...

: 3,000 multimodal questions across over a hundred academic subjects, with a held-out private dataset left unreleased to prevent contamination. 10% of questions requires both image and text comprehension and the rest are fully text-based. 80% of questions are scored by exact string matching, and the rest are multiple-choice. * SimpleBench: A multiple-choice text benchmark with over 200 questions covering spatio-temporal reasoning, social intelligence, and linguistic adversarial robustness (or trick questions). It is designed to test "everyday human reasoning".

External links

Epoch AI - AI Benchmarking Hub

References

{{DEFAULTSORT:Language model benchmarks Software comparisons Datasets in machine learning Natural language processing Benchmarks (computing)

Overview

Types

Lifecycle

Construction

Evaluation

Issues

List of benchmarks

General language modeling

General language understanding

General language generation

Open-book question-answering

Closed-book question-answering

Omnibus

Multimodal

Agency

Context length

Reasoning

Mathematics

Programming

General

See also

External links

References