MMLU

Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models. [1]

The MMLU was released by a team of researchers in 2020 [1] and was designed to be more challenging than then-existing benchmarks such as GLUE (2018), on which new language models were achieving better-than-human accuracy. [2] At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy. [2] The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy. [2] As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s. [3] Google's Gemini Ultra model achieved a score of 90%, the highest yet recorded. [1]
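
In practice, a model's MMLU score is simply its accuracy over the benchmark's multiple-choice questions. Below is a minimal sketch of that scoring loop, assuming the community "cais/mmlu" copy of the dataset on the Hugging Face Hub and a hypothetical `model_predict` stand-in for the model under evaluation:

```python
# Sketch of MMLU scoring, assuming the community "cais/mmlu" mirror on the
# Hugging Face Hub (the `datasets` library must be installed).
from datasets import load_dataset

def model_predict(question: str, choices: list[str]) -> int:
    """Hypothetical stand-in for the model under evaluation; returns 0-3."""
    return 0  # placeholder: always answer (A)

correct = total = 0
for subject in ["abstract_algebra", "international_law"]:  # 2 of the 57 tasks
    for row in load_dataset("cais/mmlu", subject, split="test"):
        # Each row carries a question, four choices, and the gold answer index.
        correct += int(model_predict(row["question"], row["choices"]) == row["answer"])
        total += 1

print(f"accuracy: {correct / total:.1%} (random guessing would average 25%)")
```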

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively; the correct answer to each is option (B). [2]

Find all c in ℤ₃ such that ℤ₃[x]/(x² + c) is a field.

(A) 0 (B) 1 (C) 2 (D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
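
The algebra question can be checked directly: ℤ₃[x]/(x² + c) is a field exactly when x² + c is irreducible over ℤ₃, and a quadratic over a field is irreducible precisely when it has no root. A brute-force verification (an illustrative sketch, not part of the benchmark):

```python
# Z_3[x]/(x^2 + c) is a field iff x^2 + c has no root mod 3,
# since a quadratic over a field is irreducible iff it has no root.
for c in range(3):
    roots = [x for x in range(3) if (x * x + c) % 3 == 0]
    verdict = "field" if not roots else f"not a field (roots {roots})"
    print(f"c = {c}: {verdict}")
# Only c = 1 works, matching answer (B).
```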

Related Research Articles

In the mathematical discipline of set theory, forcing is a technique for proving consistency and independence results. Intuitively, forcing can be thought of as a technique to expand the set-theoretic universe to a larger universe by introducing a new "generic" object.

In abstract algebra, a semiring is an algebraic structure. It is a generalization of a ring, dropping the requirement that each element must have an additive inverse. At the same time, it is a generalization of bounded distributive lattices.
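
One standard concrete instance is the min-plus ("tropical") semiring, sketched below: "addition" is taking a minimum, "multiplication" is ordinary addition, and no element has an additive inverse.

```python
# The min-plus ("tropical") semiring: "addition" is min, "multiplication"
# is ordinary +, with identities inf and 0 respectively. There is no
# additive inverse: min(a, x) = inf has no solution for finite a, which
# is exactly the ring requirement that a semiring drops.
INF = float("inf")

def s_add(a, b):  # semiring "addition"
    return min(a, b)

def s_mul(a, b):  # semiring "multiplication"
    return a + b

# The ring-like axioms that remain, e.g. distributivity, still hold:
a, b, c = 3, 5, 2
assert s_mul(a, s_add(b, c)) == s_add(s_mul(a, b), s_mul(a, c))
assert s_add(a, INF) == a and s_mul(a, 0) == a  # identity elements
```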

In mathematics, the characteristic of a ring R, often denoted char(R), is defined to be the smallest positive number of copies of the ring's multiplicative identity (1) that will sum to the additive identity (0). If no such number exists, the ring is said to have characteristic zero.
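
For example, char(ℤ/nℤ) = n. A small sketch computing the characteristic straight from the definition:

```python
# Compute char(Z/nZ) directly from the definition: add copies of the
# multiplicative identity 1 until the running sum hits the additive
# identity 0 (mod n); the number of copies needed is the characteristic.
def characteristic_mod_n(n: int) -> int:
    total, k = 0, 0
    while True:
        total = (total + 1) % n
        k += 1
        if total == 0:
            return k

assert characteristic_mod_n(12) == 12  # char(Z/12Z) = 12
assert characteristic_mod_n(3) == 3    # char(Z_3) = 3, as in the example above
```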

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
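
Concretely, the perplexity of a distribution p is 2^H(p), where H(p) is the Shannon entropy in bits; a uniform distribution over k outcomes therefore has perplexity exactly k. A short sketch:

```python
import math

def perplexity(probs: list[float]) -> float:
    """Perplexity = 2^H(p), where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

print(perplexity([0.25] * 4))   # uniform over 4 outcomes -> 4.0 (hardest to guess)
print(perplexity([0.9, 0.1]))   # skewed -> ~1.38 (much easier to guess)
```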

Anthropic PBC is a U.S.-based artificial intelligence (AI) startup company, founded in 2021, researching artificial intelligence as a public-benefit company to develop AI systems to “study their safety properties at the technological frontier” and use this research to deploy safe, reliable models for the public. Anthropic has developed a family of large language models (LLMs) named Claude as a competitor to OpenAI’s ChatGPT and Google’s Gemini.

<span class="mw-page-title-main">OpenAI</span> Artificial intelligence research organization

OpenAI is a U.S.-based artificial intelligence (AI) research organization founded in December 2015, researching artificial intelligence with the goal of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". As one of the leading organizations of the AI boom, it has developed several large language models, advanced image generation models, and previously, released open-source models. Its release of ChatGPT has been credited with starting the AI boom.

<span class="mw-page-title-main">Generative adversarial network</span> Deep learning method

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.
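
A minimal sketch of that zero-sum game, assuming PyTorch, with a generator learning to imitate a one-dimensional Gaussian (illustrative only; real GANs are far larger and notoriously fiddly to train):

```python
# Minimal GAN sketch (assumes PyTorch). The generator G maps noise to
# samples and tries to imitate N(4, 1.25); the discriminator D tries to
# tell real samples from generated ones -- the zero-sum game described above.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0   # samples from the target distribution
    fake = G(torch.randn(64, 8))             # generator's current attempt

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: G gains exactly where D loses -- fool D into outputting 1.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, generated samples should cluster near the target mean of 4.
print(G(torch.randn(1000, 8)).mean().item())
```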

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need". Text is converted into numerical representations called tokens, and each token is mapped to a vector by looking it up in a word embedding table. At each layer, every token is then contextualized with the other (unmasked) tokens in the context window via a parallel multi-head attention mechanism, which amplifies the signal from key tokens and diminishes that from less important ones. The architecture builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller proposed in 1992, which is similar to a transformer.
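
The core of each attention head is scaled dot-product attention, softmax(QKᵀ/√d)V. A single-head sketch in NumPy (illustrative; real transformers add learned projections, masking, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # how strongly each token attends to each other token
    return weights @ V                       # each output is a weighted mix of value vectors

# 5 tokens with 8-dimensional embeddings; in a real transformer Q, K and V
# come from learned linear projections of the token vectors.
x = np.random.default_rng(0).normal(size=(5, 8))
print(attention(x, x, x).shape)  # (5, 8): one contextualized vector per token
```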

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor GPT-2, it is a decoder-only transformer network, an architecture that replaces recurrence and convolution with a technique known as "attention", allowing the model to selectively focus on the segments of input text it predicts to be most relevant. GPT-3 uses a 2048-token context window, float16 (16-bit) precision, and a then-unprecedented 175 billion parameters, which require 350 GB of storage at two bytes per parameter; it has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks.
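
The 350 GB figure is just the parameter count times the size of a float16 value:

```python
params = 175e9        # 175 billion parameters
bytes_per_param = 2   # float16 = 16 bits = 2 bytes
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 350 GB
```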

<span class="mw-page-title-main">GPT-2</span> 2019 text-generating language model

Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by the full release of the 1.5-billion-parameter model on November 5, 2019.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.
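
As an illustration, a multiple-choice prompt of the kind commonly used to evaluate models on benchmarks such as MMLU can be assembled mechanically from the task data (the template wording here is a hypothetical example, not a fixed standard):

```python
# A hypothetical multiple-choice prompt template of the kind often used to
# evaluate models on benchmarks such as MMLU; the exact wording is an
# assumption for illustration.
def format_question(question: str, choices: list[str]) -> str:
    lines = [question]
    lines += [f"({letter}) {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = (
    "The following is a multiple-choice question about abstract algebra.\n\n"
    + format_question(
        "Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.",
        ["0", "1", "2", "3"],
    )
)
print(prompt)
```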

A foundation model is a machine learning or deep learning model that is trained on broad data such that it can be applied across a wide range of use cases. Foundation models have transformed artificial intelligence (AI), powering prominent generative AI applications like ChatGPT. The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) created and popularized the term.

<span class="mw-page-title-main">ChatGPT</span> Chatbot developed by OpenAI

ChatGPT is a chatbot developed by OpenAI and launched on November 30, 2022. Based on large language models, it enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language. Successive user prompts and replies are considered at each conversation stage as context.

Chinchilla is a family of large language models developed by the research team at DeepMind, presented in March 2022. It is named "chinchilla" because it is a further development over a previous model family named Gopher. Both model families were trained in order to investigate the scaling laws of large language models.
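
The Chinchilla paper's headline result is often summarized as a rule of thumb: compute-optimal training uses roughly 20 training tokens per parameter, so model size and training data should be scaled in proportion. A sketch of that rule (the 20:1 ratio is the commonly quoted approximation, not an exact law):

```python
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal training-token budget (~20 tokens/parameter)."""
    return params * tokens_per_param

# Chinchilla itself: 70 billion parameters trained on 1.4 trillion tokens, a 20:1 ratio.
print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens")  # -> 1.4T
```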

Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot. As a transformer-based model, GPT-4 was pre-trained to predict the next token using both public data and "data licensed from third-party providers", and was then fine-tuned with reinforcement learning from human and AI feedback for alignment and policy compliance.

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
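
That token-by-token loop can be made concrete with a toy stand-in for the model (a character-level bigram table here; a real LLM conditions on the entire context window rather than a single character):

```python
import random

# Toy stand-in for an LLM: a character-level bigram model "trained" on one string.
corpus = "the cat sat on the mat. the cat ate. "
bigrams: dict[str, list[str]] = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, []).append(nxt)

def next_token(context: str) -> str:
    """Sample the next character given only the last one; an LLM runs the
    same loop but conditions on the whole context."""
    return random.choice(bigrams.get(context[-1], [" "]))

text = "the"
for _ in range(40):        # autoregressive generation: predict, append, repeat
    text += next_token(text)
print(text)
```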

<span class="mw-page-title-main">Gemini (language model)</span> Large language model developed by Google

Gemini is a family of multimodal large language models developed by Google DeepMind, serving as the successor to LaMDA and PaLM 2. Comprising Gemini Ultra, Gemini Pro, and Gemini Nano, it was announced on December 6, 2023, positioned as a competitor to OpenAI's GPT-4. It powers the chatbot of the same name.

Mistral AI is a French company selling artificial intelligence (AI) products. It was founded in April 2023 by previous employees of Meta Platforms and Google DeepMind. The company raised €385 million in October 2023, and in December 2023, it was valued at more than $2 billion.

Claude is a family of large language models developed by Anthropic. The first model was released in March 2023. Claude 3, released in March 2024, can also analyze images.

References

  1. Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
  2. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300.
  3. "Introducing the next generation of Claude". Anthropic. 4 March 2024.