MMLU

Last updated

In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models.

Contents

Benchmark

It consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. [1] [2]

The MMLU was released by Dan Hendrycks and a team of researchers in 2020 [3] and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE) on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best performing GPT-3 model achieving 43.9% accuracy. [3] The developers of the MMLU estimate that human domain-experts achieve around 89.8% accuracy. [3] As of 2024, some of the most powerful language models, such as o1, Gemini and Claude 3, were reported to achieve scores around 90%. [4] [5]

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. [3] The correct answers are marked in boldface:

Find all in such that is a field.

(A) 0 (B) 1 (C) 2 (D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties

Leaderboard

OrganisationLLMMMLU
OpenAI o1-preview 90.8 [4]
Rubik's AI Nova-Pro 88.8
Anthropic Claude 3.5 Sonnet 88.7
Meta Llama-3.1 405B 88.6
xAI Grok-2 87.5
Anthropic Claude 3 Opus 86.8
Meta Llama-3.1 70B 86.0
Google Gemini-1.5 Pro 85.9
Inflection Inflection-2.5 85.5
Mistral Mistral Large 2 84.0
Reka Reka Core 83.2
AI21 Jamba-1.5 Large 81.2

Related Research Articles

In the field of artificial intelligence (AI), tasks that are hypothesized to require artificial general intelligence to solve are informally known as AI-complete or AI-hard. Calling a problem AI-complete reflects the belief that it cannot be solved by a simple specific algorithm.

<span class="mw-page-title-main">Complex number</span> Number with a real and an imaginary part

In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted i, called the imaginary unit and satisfying the equation ; every complex number can be expressed in the form , where a and b are real numbers. Because no real number satisfies the above equation, i was called an imaginary number by René Descartes. For the complex number ,a is called the real part, and b is called the imaginary part. The set of complex numbers is denoted by either of the symbols or C. Despite the historical nomenclature, "imaginary" complex numbers have a mathematical existence as firm as that of the real numbers, and they are fundamental tools in the scientific description of the natural world.

<span class="mw-page-title-main">Exponential function</span> Mathematical function, denoted exp(x) or e^x

The exponential function is a mathematical function denoted by or . Unless otherwise specified, the term generally refers to the positive-valued function of a real variable, although it can be extended to the complex numbers or generalized to other mathematical objects like matrices or Lie algebras. The exponential function originated from the operation of taking powers of a number, but various modern definitions allow it to be rigorously extended to all real arguments , including irrational numbers. Its ubiquity in pure and applied mathematics led mathematician Walter Rudin to consider the exponential function to be "the most important function in mathematics".

<span class="mw-page-title-main">Inequality (mathematics)</span> Mathematical relation expressed with < or ≤

In mathematics, an inequality is a relation which makes a non-equal comparison between two numbers or other mathematical expressions. It is used most often to compare two numbers on the number line by their size. The main types of inequality are less than (<) and greater than (>).

In the mathematical discipline of set theory, forcing is a technique for proving consistency and independence results. Intuitively, forcing can be thought of as a technique to expand the set theoretical universe to a larger universe by introducing a new "generic" object .

In mathematics, complex geometry is the study of geometric structures and constructions arising out of, or described by, the complex numbers. In particular, complex geometry is concerned with the study of spaces such as complex manifolds and complex algebraic varieties, functions of several complex variables, and holomorphic constructions such as holomorphic vector bundles and coherent sheaves. Application of transcendental methods to algebraic geometry falls in this category, together with more geometric aspects of complex analysis.

<span class="mw-page-title-main">Algebraic variety</span> Mathematical object studied in the field of algebraic geometry

Algebraic varieties are the central objects of study in algebraic geometry, a sub-field of mathematics. Classically, an algebraic variety is defined as the set of solutions of a system of polynomial equations over the real or complex numbers. Modern definitions generalize this concept in several different ways, while attempting to preserve the geometric intuition behind the original definition.

In abstract algebra, a semiring is an algebraic structure. Semirings are a generalization of rings, dropping the requirement that each element must have an additive inverse. At the same time, semirings are a generalization of bounded distributive lattices.

In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set.

In mathematics, the characteristic of a ring R, often denoted char(R), is defined to be the smallest positive number of copies of the ring's multiplicative identity (1) that will sum to the additive identity (0). If no such number exists, the ring is said to have characteristic zero.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

<span class="mw-page-title-main">Three-dimensional space</span> Geometric model of the physical space

In geometry, a three-dimensional space is a mathematical space in which three values (coordinates) are required to determine the position of a point. Most commonly, it is the three-dimensional Euclidean space, that is, the Euclidean space of dimension three, which models physical space. More general three-dimensional spaces are called 3-manifolds. The term may also refer colloquially to a subset of space, a three-dimensional region, a solid figure.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.

<span class="mw-page-title-main">Sine and cosine</span> Fundamental trigonometric functions

In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side that is opposite that angle to the length of the longest side of the triangle, and the cosine is the ratio of the length of the adjacent leg to that of the hypotenuse. For an angle , the sine and cosine functions are denoted as and .

<span class="mw-page-title-main">Real number</span> Number representing a continuous quantity

In mathematics, a real number is a number that can be used to measure a continuous one-dimensional quantity such as a distance, duration or temperature. Here, continuous means that pairs of values can have arbitrarily small differences. Every real number can be almost uniquely represented by an infinite decimal expansion.

Quantum volume is a metric that measures the capabilities and error rates of a quantum computer. It expresses the maximum size of square quantum circuits that can be implemented successfully by the computer. The form of the circuits is independent from the quantum computer architecture, but compiler can transform and optimize it to take advantage of the computer's features. Thus, quantum volumes for different architectures can be compared.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Deep learning architecture for modelling sequential data

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative artificial intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform. A prompt for a text-to-text language model can be a query such as "what is Fermat's little theorem?", a command such as "write a poem in the style of Edgar Allan Poe about leaves falling", or a longer statement including context, instructions, and conversation history.

Chinchilla is a family of large language models (LLMs) developed by the research team at Google DeepMind, presented in March 2022.

OpenAI o1 is a generative pre-trained transformer. A preview of o1 was released by OpenAI on September 12, 2024. o1 spends time "thinking" before it answers, making it more effective in complex reasoning tasks, science and programming.

References

  1. Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
  2. "MMLU Dataset". HuggingFace. 24 July 2024.
  3. 1 2 3 4 Hendrycks, Dan; Burns, Collin; Kossen, Andy; Steinhardt, Jacob; Mishkin, Pavel; Gimpel, Kevin; Zhu, Mark (2020). "Measuring Massive Multitask Language Understanding". arXiv: 2009.03300 [cs.CY].
  4. 1 2 OpenAI o1 System Card. OpenAI. p. 33. Retrieved 13 September 2024.
  5. "Multi-task Language Understanding on MMLU | Leaderboard". Papers with Code. Retrieved 2024-10-10.