In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models.
It consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. [1] [2]
The MMLU was released by Dan Hendrycks and a team of researchers in 2020 [3] and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE) on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best performing GPT-3 model achieving 43.9% accuracy. [3] The developers of the MMLU estimate that human domain-experts achieve around 89.8% accuracy. [3] As of 2024, some of the most powerful language models, such as o1, Gemini and Claude 3, were reported to achieve scores around 90%. [4] [5]
The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. [3] The correct answers are marked in boldface:
Find all in such that is a field.
(A) 0 (B) 1 (C) 2 (D) 3
Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
Organisation | LLM | MMLU |
---|---|---|
OpenAI | o1-preview | 90.8 [4] |
Rubik's AI | Nova-Pro | 88.8 |
Anthropic | Claude 3.5 Sonnet | 88.7 |
Meta | Llama-3.1 405B | 88.6 |
xAI | Grok-2 | 87.5 |
Anthropic | Claude 3 Opus | 86.8 |
Meta | Llama-3.1 70B | 86.0 |
Gemini-1.5 Pro | 85.9 | |
Inflection | Inflection-2.5 | 85.5 |
Mistral | Mistral Large 2 | 84.0 |
Reka | Reka Core | 83.2 |
AI21 | Jamba-1.5 Large | 81.2 |
In the field of artificial intelligence (AI), tasks that are hypothesized to require artificial general intelligence to solve are informally known as AI-complete or AI-hard. Calling a problem AI-complete reflects the belief that it cannot be solved by a simple specific algorithm.
In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted i, called the imaginary unit and satisfying the equation ; every complex number can be expressed in the form , where a and b are real numbers. Because no real number satisfies the above equation, i was called an imaginary number by René Descartes. For the complex number ,a is called the real part, and b is called the imaginary part. The set of complex numbers is denoted by either of the symbols or C. Despite the historical nomenclature, "imaginary" complex numbers have a mathematical existence as firm as that of the real numbers, and they are fundamental tools in the scientific description of the natural world.
The exponential function is a mathematical function denoted by or . Unless otherwise specified, the term generally refers to the positive-valued function of a real variable, although it can be extended to the complex numbers or generalized to other mathematical objects like matrices or Lie algebras. The exponential function originated from the operation of taking powers of a number, but various modern definitions allow it to be rigorously extended to all real arguments , including irrational numbers. Its ubiquity in pure and applied mathematics led mathematician Walter Rudin to consider the exponential function to be "the most important function in mathematics".
In mathematics, an inequality is a relation which makes a non-equal comparison between two numbers or other mathematical expressions. It is used most often to compare two numbers on the number line by their size. The main types of inequality are less than (<) and greater than (>).
In the mathematical discipline of set theory, forcing is a technique for proving consistency and independence results. Intuitively, forcing can be thought of as a technique to expand the set theoretical universe to a larger universe by introducing a new "generic" object .
In mathematics, complex geometry is the study of geometric structures and constructions arising out of, or described by, the complex numbers. In particular, complex geometry is concerned with the study of spaces such as complex manifolds and complex algebraic varieties, functions of several complex variables, and holomorphic constructions such as holomorphic vector bundles and coherent sheaves. Application of transcendental methods to algebraic geometry falls in this category, together with more geometric aspects of complex analysis.
Algebraic varieties are the central objects of study in algebraic geometry, a sub-field of mathematics. Classically, an algebraic variety is defined as the set of solutions of a system of polynomial equations over the real or complex numbers. Modern definitions generalize this concept in several different ways, while attempting to preserve the geometric intuition behind the original definition.
In abstract algebra, a semiring is an algebraic structure. Semirings are a generalization of rings, dropping the requirement that each element must have an additive inverse. At the same time, semirings are a generalization of bounded distributive lattices.
In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set.
In mathematics, the characteristic of a ring R, often denoted char(R), is defined to be the smallest positive number of copies of the ring's multiplicative identity (1) that will sum to the additive identity (0). If no such number exists, the ring is said to have characteristic zero.
A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
In geometry, a three-dimensional space is a mathematical space in which three values (coordinates) are required to determine the position of a point. Most commonly, it is the three-dimensional Euclidean space, that is, the Euclidean space of dimension three, which models physical space. More general three-dimensional spaces are called 3-manifolds. The term may also refer colloquially to a subset of space, a three-dimensional region, a solid figure.
In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side that is opposite that angle to the length of the longest side of the triangle, and the cosine is the ratio of the length of the adjacent leg to that of the hypotenuse. For an angle , the sine and cosine functions are denoted as and .
In mathematics, a real number is a number that can be used to measure a continuous one-dimensional quantity such as a distance, duration or temperature. Here, continuous means that pairs of values can have arbitrarily small differences. Every real number can be almost uniquely represented by an infinite decimal expansion.
Quantum volume is a metric that measures the capabilities and error rates of a quantum computer. It expresses the maximum size of square quantum circuits that can be implemented successfully by the computer. The form of the circuits is independent from the quantum computer architecture, but compiler can transform and optimize it to take advantage of the computer's features. Thus, quantum volumes for different architectures can be compared.
A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.
Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative artificial intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform. A prompt for a text-to-text language model can be a query such as "what is Fermat's little theorem?", a command such as "write a poem in the style of Edgar Allan Poe about leaves falling", or a longer statement including context, instructions, and conversation history.
Chinchilla is a family of large language models (LLMs) developed by the research team at Google DeepMind, presented in March 2022.
OpenAI o1 is a generative pre-trained transformer. A preview of o1 was released by OpenAI on September 12, 2024. o1 spends time "thinking" before it answers, making it more effective in complex reasoning tasks, science and programming.