MMLU

In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models.

Benchmark

MMLU consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. [1] [2]

MMLU was released in 2020 by Dan Hendrycks and a team of researchers. [3] It was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of MMLU's release, most existing language models performed at around the level of random chance (25%, since each question has four answer options), with the best-performing GPT-3 model achieving 43.9% accuracy. [3] The developers of MMLU estimate that human domain experts achieve around 89.8% accuracy. [3] As of 2024, some of the most powerful language models, such as o1, Gemini and Claude 3, were reported to achieve scores around 90%. [4] [5]
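
The scores quoted above are plain accuracy over four-option questions. As a minimal sketch, assuming predictions and gold answers are given as the letters A to D (the names `accuracy` and `chance_baseline` are illustrative and not part of any official MMLU tooling), the metric and the random-guess baseline could be computed as follows:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_baseline(num_options: int = 4) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1 / num_options


# Hypothetical toy data, not real MMLU items.
gold = ["B", "B", "A", "D"]
pred = ["B", "C", "A", "D"]
print(accuracy(pred, gold))  # 0.75
print(chance_baseline())     # 0.25
```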

An expert review of 3,000 randomly sampled questions found that over 9% of them are flawed (either the question is not well-defined or the given answer is wrong), which suggests that about 90% is essentially the maximum achievable score. [6]
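
One back-of-the-envelope reading of that ceiling, offered here as an assumption rather than as the review's own method: if a fraction of the questions is flawed, a model that answers every sound question correctly can only match the answer key on the flawed ones by luck. The function `score_ceiling` and its parameters are hypothetical names used for illustration.

```python
def score_ceiling(error_rate: float, match_rate_on_flawed: float = 0.0) -> float:
    """Upper bound on accuracy when a fraction of the questions is flawed.

    error_rate: fraction of flawed questions (e.g. 0.09 for 9%).
    match_rate_on_flawed: assumed rate at which a model still agrees with
        the answer key on flawed questions (0.0 = never, 0.25 = by chance).
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if not 0.0 <= match_rate_on_flawed <= 1.0:
        raise ValueError("match_rate_on_flawed must be between 0 and 1")
    return (1.0 - error_rate) + error_rate * match_rate_on_flawed


# With 9% flawed questions and no credit on them, the ceiling is about 91%.
print(f"{score_ceiling(0.09):.0%}")        # 91%
# Matching the key by chance on flawed items raises it only slightly.
print(f"{score_ceiling(0.09, 0.25):.0%}")  # 93%
```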

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. [3] The correct answer to each is (B).

Find all $c$ in $\mathbb{Z}_3$ such that $\mathbb{Z}_3[x]/(x^2 + c)$ is a field.

(A) 0 (B) 1 (C) 2 (D) 3
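
The abstract algebra item can be checked by brute force. This sketch assumes the question reads, as in the MMLU paper, "find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field". The quotient is a field exactly when x^2 + c is irreducible over Z_3, and a quadratic is irreducible over a field exactly when it has no root there. The helper name `is_field_quotient` is illustrative.

```python
def is_field_quotient(c: int, p: int = 3) -> bool:
    """True if x^2 + c has no root in Z_p (p prime).

    For a degree-2 polynomial over a field, having no root is equivalent
    to being irreducible, which makes Z_p[x]/(x^2 + c) a field.
    """
    return all((x * x + c) % p != 0 for x in range(p))


# Try every residue c in Z_3 and keep those that give a field.
print([c for c in range(3) if is_field_quotient(c)])  # [1]
```

Only c = 1 survives, which corresponds to option (B).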

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
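
To score items like these, a question and its four options are typically rendered into a single text prompt. The layout below is one plausible sketch that mirrors the (A) to (D) style shown above. The exact prompt wording used by any particular evaluation is not specified here, and `format_question` is a hypothetical helper.

```python
from typing import Sequence

LETTERS = "ABCD"


def format_question(question: str, options: Sequence[str]) -> str:
    """Render a four-option multiple-choice item as a plain-text prompt."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = [question]
    # Label each option with its letter, in order.
    lines += [f"({letter}) {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


print(format_question(
    "Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.",
    ["0", "1", "2", "3"],
))
```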

Leaderboard

Organisation   LLM                  MMLU
OpenAI         o1-preview           90.8 [4]
Anthropic      Claude 3.5 Sonnet    88.7
Meta           Llama-3.1 405B       88.6
xAI            Grok-2               87.5
Anthropic      Claude 3 Opus        86.8
Meta           Llama-3.1 70B        86.0
Google         Gemini-1.5 Pro       85.9
Inflection     Inflection-2.5       85.5
Mistral        Mistral Large 2      84.0
Reka           Reka Core            83.2
AI21           Jamba-1.5 Large      81.2

References

  1. Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
  2. "MMLU Dataset". HuggingFace. 24 July 2024.
  3. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). "Measuring Massive Multitask Language Understanding". arXiv: 2009.03300 [cs.CY].
  4. "OpenAI o1 System Card". OpenAI. p. 33. Retrieved 13 September 2024.
  5. "Multi-task Language Understanding on MMLU | Leaderboard". Papers with Code. Retrieved 10 October 2024.
  6. Gema, Aryo Pradipta; Leang, Joshua Ong Jun; Hong, Giwon; Devoto, Alessio; Mancino, Alberto Carlo Maria; Saxena, Rohit; He, Xuanli; Zhao, Yu; Du, Xiaotang (7 June 2024). "Are We Done with MMLU?". arXiv: 2406.04127 [cs.CL].