Top-p sampling

Top-p sampling, also called nucleus sampling, is a technique for autoregressive language model decoding proposed by Ari Holtzman and co-authors in 2019. [1] Before the introduction of nucleus sampling, maximum likelihood decoding and beam search were the standard techniques for text generation, but both of these decoding strategies are prone to generating text that is repetitive and otherwise unnatural. [2] Top-p sampling avoids this by setting a threshold p and restricting sampling to the smallest set of most probable tokens whose cumulative probability is at least p; the probabilities within this set (the "nucleus") are renormalized and the next token is drawn from it.
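
A minimal sketch of the procedure in Python (using NumPy; the function name and the toy distribution are illustrative, not taken from the original paper):

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample a token index from `probs` using top-p (nucleus) filtering.

    `probs` is a 1-D array of next-token probabilities summing to 1.
    The smallest set of highest-probability tokens whose cumulative
    probability reaches p is kept, renormalized, and sampled from.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

# Toy next-token distribution over a 5-token vocabulary:
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
token = top_p_sample(probs, p=0.8)  # samples only from tokens 0, 1 and 2
```

With p = 0.8 the nucleus here contains the three most probable tokens (cumulative mass 0.5 + 0.25 + 0.15 = 0.9, the smallest prefix reaching 0.8); the two least likely tokens are never sampled.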

Top-k sampling is similar, except that the sample is taken from the k highest-probability tokens regardless of their cumulative probability. The advantage of top-p sampling is that it avoids the difficult problem of choosing an optimal value of k, which can vary with the shape of the output distribution and with the particular task and dataset. [3]
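
For comparison, a top-k filter under the same assumptions (reusing `np` and `probs` from the sketch above) keeps a fixed number of candidates regardless of how the probability mass is spread:

```python
def top_k_sample(probs, k=3, rng=None):
    """Sample a token index from the k highest-probability tokens only."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[::-1][:k]          # indices of the k most probable tokens
    top_probs = probs[top] / probs[top].sum()  # renormalize over the fixed-size set
    return rng.choice(top, p=top_probs)

token = top_k_sample(probs, k=3)  # always exactly 3 candidates, however peaked or flat probs is
```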

The top-p sampling technique is used in popular large language model applications like ChatGPT and is implemented in language modeling frameworks and APIs such as Hugging Face Transformers and Cohere. [4]
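
In the Hugging Face Transformers library, for example, nucleus sampling is enabled through the `generate` method; the model choice, prompt, and parameter values below are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Nucleus sampling avoids", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # stochastic decoding instead of greedy/beam search
    top_p=0.92,          # nucleus threshold p
    top_k=0,             # disable top-k filtering so only top-p applies
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```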

Related Research Articles

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.

Inverse transform sampling is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function.
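
A minimal sketch, using the exponential distribution as the standard textbook example (the function name is illustrative):

```python
import numpy as np

def sample_exponential(lam, size, rng=None):
    """Inverse transform sampling for an exponential distribution.

    The CDF is F(x) = 1 - exp(-lam * x), so its inverse
    F_inv(u) = -ln(1 - u) / lam maps uniform draws u in [0, 1)
    to exponentially distributed samples.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=size)
    return -np.log(1.0 - u) / lam

samples = sample_exponential(lam=2.0, size=10_000)  # sample mean should be close to 1/2
```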

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

In probability theory and statistics, the continuous uniform distributions or rectangular distributions are a family of symmetric probability distributions. Such a distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds. The bounds are defined by the parameters a and b, which are the minimum and maximum values. The interval can be either closed or open. The distribution is therefore often abbreviated U(a, b), where U stands for uniform distribution. The difference between the bounds defines the interval length; all intervals of the same length on the distribution's support are equally probable. It is the maximum entropy probability distribution for a random variable under no constraint other than that it is contained in the distribution's support.
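
The corresponding probability density function, with a and b as above, is the standard one:

```latex
f(x) =
\begin{cases}
  \dfrac{1}{b - a} & \text{for } a \le x \le b, \\
  0                & \text{otherwise.}
\end{cases}
```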

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.
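
For a discrete distribution p, the perplexity is the exponentiated entropy:

```latex
\mathrm{PP}(p) = 2^{H(p)} = 2^{-\sum_{x} p(x)\,\log_2 p(x)}
```

A uniform distribution over N outcomes therefore has perplexity N; a fair six-sided die, for example, has perplexity 6.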

A stochastic simulation is a simulation of a system that has variables that can change stochastically (randomly) with individual probabilities.

Non-uniform random variate generation or pseudo-random number sampling is the numerical practice of generating pseudo-random numbers (PRN) that follow a given probability distribution. Methods are typically based on the availability of a uniformly distributed PRN generator. Computational algorithms are then used to manipulate a single random variate, X, or often several such variates, into a new random variate Y such that these values have the required distribution. The first methods were developed for Monte-Carlo simulations in the Manhattan project, published by John von Neumann in the early 1950s.

Multimodal learning, in the context of machine learning, is a type of deep learning using multiple modalities of data, such as text, audio, or images.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.

A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.
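
The scaled dot-product attention used inside each head, as defined in "Attention Is All You Need" (Q, K, and V are the query, key, and value matrices, and d_k is the key dimension), is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```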

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments.

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative artificial intelligence (AI) model. A prompt is natural language text describing the task that an AI should perform: a prompt for a text-to-text language model can be a query such as "what is Fermat's little theorem?", a command such as "write a poem in the style of Edgar Allan Poe about leaves falling", or a longer statement including context, instructions, and conversation history.

Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom.

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.

GPT-J or GPT-J-6B is an open-source large language model (LLM) developed by EleutherAI in 2021. As the name suggests, it is a generative pre-trained transformer model designed to produce human-like text that continues from a prompt. The optional "6B" in the name refers to the fact that it has 6 billion parameters.

Ari Holtzman is a professor of computer science at the University of Chicago and an expert in natural language processing and computational linguistics. Previously, Holtzman was a PhD student at the University of Washington, where he was advised by Luke Zettlemoyer.

Text-to-image personalization is a task in deep learning for computer graphics that augments pre-trained text-to-image generative models. In this task, a generative model that was trained on large-scale data is adapted so that it can generate images of novel, user-provided concepts. These concepts are typically unseen during training and may represent specific objects or more abstract categories.

T5 is a series of large language models developed by Google AI and introduced in 2019. Like the original Transformer model, T5 models are encoder-decoder Transformers, where the encoder processes the input text and the decoder generates the output text.

References

  1. Holtzman, Ari; Buys, Jan; Du, Li; Forbes, Maxwell; Choi, Yejin (22 April 2019). "The Curious Case of Neural Text Degeneration". arXiv: 1904.09751 [cs.CL].
  2. Chiusano, Fabio (28 January 2022). "Two minutes NLP — Most used Decoding Methods for Language Models". Medium. Retrieved 23 August 2023.
  3. McCaffrey, James D. (14 October 2021). "Nucleus Sampling for Natural Language Processing". Retrieved 23 August 2023.
  4. von Platen, Patrick. "How to generate text: using different decoding methods for language generation with Transformers". Hugging Face. Retrieved 23 August 2023.