MMLU

Last updated

Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models. [1]

The MMLU was released by a team of researchers in 2020 [1] and was designed to be more challenging than then-existing benchmarks such as GLUE (2018) on which new language models were achieving better-than-human accuracy. [2] At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best performing GPT-3 model achieving 43.9% accuracy. [2] The developers of the MMLU estimate that human domain-experts achieve around 89.8% accuracy. [2] As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s. [3] Google's Gemini Ultra model achieved a score of 90%, the highest yet recorded. [1]

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. [2] The correct answers are marked in boldface:

Find all in such that is a field.

(A) 0 (B) 1 (C) 2 (D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties

Related Research Articles

<span class="mw-page-title-main">Exponential function</span> Mathematical function, denoted exp(x) or e^x

The exponential function is a mathematical function denoted by or . Unless otherwise specified, the term generally refers to the positive-valued function of a real variable, although it can be extended to the complex numbers or generalized to other mathematical objects like matrices or Lie algebras. The exponential function originated from the operation of taking powers of a number, but various modern definitions allow it to be rigorously extended to all real arguments , including irrational numbers. Its ubiquitous occurrence in pure and applied mathematics led mathematician Walter Rudin to consider the exponential function to be "the most important function in mathematics".

<span class="mw-page-title-main">Inequality (mathematics)</span> Mathematical relation expressed with < or ≤

In mathematics, an inequality is a relation which makes a non-equal comparison between two numbers or other mathematical expressions. It is used most often to compare two numbers on the number line by their size. The main types of inequality are less than and greater than.

In the mathematical discipline of set theory, forcing is a technique for proving consistency and independence results. Intuitively, forcing can be thought of as a technique to expand the set theoretical universe to a larger universe by introducing a new "generic" object .

In abstract algebra, a semiring is an algebraic structure. It is a generalization of a ring, dropping the requirement that each element must have an additive inverse. At the same time, it is a generalization of bounded distributive lattices.

In mathematics, the characteristic of a ring R, often denoted char(R), is defined to be the smallest positive number of copies of the ring's multiplicative identity (1) that will sum to the additive identity (0). If no such number exists, the ring is said to have characteristic zero.

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.

<span class="mw-page-title-main">OpenAI</span> Artificial intelligence research organization

OpenAI is an American artificial intelligence (AI) research organization founded in December 2015, researching artificial intelligence with the goal of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work." As one of the leading organizations of the AI boom, it has developed several large language models, advanced image generation models, and previously, released open-source models. Its release of ChatGPT has been credited with starting the AI boom.

<span class="mw-page-title-main">Generative adversarial network</span> Deep learning method

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in a 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, is based on the softmax-based attention mechanism proposed by Bahdanau et. al. in 2014 for machine translation, and the Fast Weight Controller, similar to a transformer, proposed in 1992.

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020.

<span class="mw-page-title-main">GPT-2</span> 2019 text-generating language model

Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by full release of the 1.5-billion-parameter model on November 5, 2019.

<span class="mw-page-title-main">DALL-E</span> Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions known as "prompts".

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

A foundation model, also known as large AI model, is a machine learning or deep learning model that is trained on broad data such that it can be applied across a wide range of use cases. Foundation models have transformed artificial intelligence (AI), powering prominent generative AI applications like ChatGPT. The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) created and popularized the term.

Chinchilla is a family of large language models developed by the research team at DeepMind, presented in March 2022. It is named "chinchilla" because it is a further development over a previous model family named Gopher. Both model families were trained in order to investigate the scaling laws of large language models.

Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot. As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data and "data licensed from third-party providers" is used to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.

A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.

Mistral AI is a French company selling artificial intelligence (AI) products. It was founded in April 2023 by previous employees of Meta Platforms and Google DeepMind. The company raised €385 million in October 2023, and in December 2023, it was valued at more than $2 billion.

GPT-4o is a multilingual, multimodal generative pre-trained transformer designed by OpenAI. It was announced by OpenAI's CTO Mira Murati during a live-streamed demo on 13 May 2024 and released the same day. GPT-4o is free, but with a usage limit that is 5 times higher for ChatGPT Plus subscribers. It can process and generate text, images and audio. Its API is twice as fast and half the price of its predecessor, GPT-4 Turbo.

References

  1. 1 2 3 Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
  2. 1 2 3 4 Hendrycks, Dan; Burns, Collin; Kossen, Andy; Steinhardt, Jacob; Mishkin, Pavel; Gimpel, Kevin; Zhu, Mark (2020). "Measuring Massive Multitask Language Understanding". arXiv: 2009.03300 .
  3. "Introducing the next generation of Claude". Anthropic AI. 4 March 2024.