Perplexity

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker. [1]

Perplexity of a probability distribution

The perplexity PP of a discrete probability distribution p is a concept widely used in information theory, machine learning, and statistical modeling. It is defined as

PP(p) := 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)} = \prod_x p(x)^{-p(x)}

where H(p) is the entropy (in bits) of the distribution, and x ranges over the events. The base of the logarithm need not be 2: the perplexity is independent of the base, provided that the entropy and the exponentiation use the same base. In some contexts, this measure is also referred to as the (order-1 true) diversity.
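
To make the definition concrete, the following is a minimal sketch in Python (standard library only) of computing perplexity as the base raised to the entropy; the helper name perplexity is purely illustrative, not part of any standard API.

```python
import math

def perplexity(p, base=2.0):
    """Perplexity of a discrete distribution given as a sequence of probabilities.

    Computed as base ** H(p), where H(p) is the entropy taken in the same base.
    Zero-probability outcomes contribute nothing to the entropy.
    """
    entropy = -sum(px * math.log(px, base) for px in p if px > 0)
    return base ** entropy

print(perplexity([1/6] * 6))          # ~6.0: a fair six-sided die
print(perplexity([0.9, 0.05, 0.05]))  # ~1.48: a skewed, less "perplexing" distribution
```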

Perplexity of a random variable X may be defined as the perplexity of the distribution over its possible values x. It can be thought of as a measure of uncertainty or "surprise" related to the outcomes.

For a probability distribution p where exactly k outcomes each have a probability of 1/k and all other outcomes have a probability of zero, the perplexity of this distribution is simply k. This is because the distribution models a fair k-sided die, with each of the k outcomes being equally likely. In this context, the perplexity k indicates that there is as much uncertainty as there would be when rolling a fair k-sided die. Even if a random variable has more than k possible outcomes, the perplexity will still be k if the distribution is uniform over k outcomes and zero for the rest. Thus, a random variable with a perplexity of k can be described as being "k-ways perplexed," meaning it has the same level of uncertainty as a fair k-sided die.
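
As a quick numerical check of this paragraph, the sketch below (again plain Python, with illustrative numbers) computes the perplexity of a distribution that is uniform over k = 8 outcomes, with and without extra zero-probability outcomes appended:

```python
import math

k = 8
p = [1 / k] * k + [0.0] * 4   # four additional impossible outcomes
entropy = -sum(px * math.log2(px) for px in p if px > 0)
print(2 ** entropy)           # 8.0: the impossible outcomes change nothing
```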

Perplexity is sometimes used as a measure of the difficulty of a prediction problem. It is, however, generally not a straightforward representation of the relevant probability. For example, if you have two choices, one with probability 0.9, your chances of a correct guess using the optimal strategy are 90 percent. Yet, the perplexity is

2^{-0.9 \log_2 0.9 - 0.1 \log_2 0.1} \approx 1.38

The inverse of the perplexity, 1/1.38 \approx 0.72, does not correspond to the 0.9 probability.
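
The arithmetic of this example can be verified directly; a minimal sketch:

```python
import math

p = [0.9, 0.1]                                 # two outcomes
entropy = -sum(px * math.log2(px) for px in p)
print(2 ** entropy)                            # ~1.38: the perplexity
print(1 / 2 ** entropy)                        # ~0.72: not the 0.9 success probability
```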

The perplexity is the exponentiation of the entropy, a more straightforward quantity. Entropy measures the expected or "average" number of bits required to encode the outcome of the random variable using an optimal variable-length code. It can also be regarded as the expected information gain from learning the outcome of the random variable, providing insight into the uncertainty and complexity of the underlying probability distribution.

Perplexity of a probability model

A model of an unknown probability distribution p may be proposed based on a training sample that was drawn from p. Given a proposed probability model q, one may evaluate q by asking how well it predicts a separate test sample x_1, x_2, ..., x_N also drawn from p. The perplexity of the model q is defined as

PP(q) := b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b q(x_i)}

where b is customarily 2. Better models q of the unknown distribution p will tend to assign higher probabilities q(x_i) to the test events. Thus, they have lower perplexity because they are less surprised by the test sample. This is equivalent to saying that better models assign higher likelihoods to the test data, which leads to a lower perplexity value.

The exponent above may be regarded as the average number of bits needed to represent a test event x_i if one uses an optimal code based on q. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because q(x_i) tends to be high.
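
Under the definition above, evaluating a model on a held-out sample might look like the following sketch; the function name model_perplexity, the toy model q, and the test list are all illustrative, and a realistic evaluation would need smoothing so that q never assigns zero probability to a test event.

```python
import math

def model_perplexity(q, test_sample, base=2.0):
    """Perplexity of model q (a mapping from events to probabilities) on a
    test sample: base raised to the average negative log-probability."""
    n = len(test_sample)
    total_log_prob = sum(math.log(q[x], base) for x in test_sample)
    return base ** (-total_log_prob / n)

q = {"a": 0.5, "b": 0.3, "c": 0.2}      # toy model of an unknown distribution p
test = ["a", "a", "b", "c", "a", "b"]   # held-out sample drawn from p
print(model_perplexity(q, test))        # lower means q was less surprised
```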

The exponent may also be interpreted as a cross-entropy:

H(\tilde{p}, q) = -\sum_x \tilde{p}(x) \log_2 q(x)

where \tilde{p} denotes the empirical distribution of the test sample (i.e., \tilde{p}(x) = n/N if x appeared n times in the test sample of size N).

By the definition of KL divergence, it is also equal to H(\tilde{p}) + D_{KL}(\tilde{p} \| q), which is \geq H(\tilde{p}). Consequently, the perplexity is minimized when q = \tilde{p}.
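
A small numerical check of this identity, with an empirical distribution and model chosen purely for illustration:

```python
import math

p_tilde = {"a": 0.5, "b": 0.3, "c": 0.2}   # empirical distribution of a test sample
q       = {"a": 0.4, "b": 0.4, "c": 0.2}   # proposed model over the same events

cross_entropy = -sum(p * math.log2(q[x]) for x, p in p_tilde.items())
entropy       = -sum(p * math.log2(p) for p in p_tilde.values())
kl            =  sum(p * math.log2(p / q[x]) for x, p in p_tilde.items())

print(cross_entropy)   # ~1.52
print(entropy + kl)    # same value: H(p~) + KL(p~ || q), never below H(p~) (~1.49)
```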

Perplexity per token

In natural language processing, a corpus is a structured collection of texts or documents, and a language model is a probability distribution over entire texts or documents. Consequently, in NLP the more commonly used measure is perplexity per token (word or, more frequently, sub-word), defined as

\left( \prod_{i=1}^{n} q(x_i) \right)^{-1/N} = 2^{-\frac{1}{N} \sum_{i=1}^{n} \log_2 q(x_i)}

where x_1, \ldots, x_n are the documents in the corpus and N is the total number of tokens in the corpus. This normalizes the perplexity by the length of the text, allowing for more meaningful comparisons between different texts or models than a per-document perplexity would.

Suppose the average text x_i in the corpus has a probability of 2^{-190} according to the language model. This would give a model perplexity of 2^{190} per sentence. However, in NLP, it is more common to normalize by the length of a text. Thus, if the test sample has a length of 1,000 tokens, and could be coded using 7.95 bits per token, one could report a model perplexity of 2^{7.95} ≈ 247 per token. In other words, the model is as confused on the test data as if it had to choose uniformly and independently among 247 possibilities for each token.
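
Continuing with the numbers in this example, a minimal sketch of the per-token normalization (the figures are illustrative, taken from the prose above):

```python
num_tokens = 1000              # length of the test sample
bits_per_token = 7.95          # average code length per token under the model

total_bits = bits_per_token * num_tokens   # ~7950 bits for the whole sample
# Exponentiating the total (2 ** 7950) gives an astronomically large per-document
# perplexity; normalizing by length gives a number comparable across texts.
print(2 ** bits_per_token)                 # ~247: perplexity per token
```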

There are two standard evaluation metrics for language models: perplexity and word error rate (WER). The simpler of these measures, WER, is simply the percentage of erroneously recognized words E (deletions, insertions, substitutions) relative to the total number of words N in a speech recognition task, i.e.

WER = \frac{E}{N} \cdot 100\%

The second metric, perplexity (per token), is an information-theoretic measure that evaluates the similarity of a proposed model m to the original distribution p. It can be computed as the inverse of the (geometric) average probability of the test set T:

PP(T) = \left( \prod_{i=1}^{N} m(w_i) \right)^{-1/N}

where N is the number of tokens in the test set T. This equation can be seen as the exponentiated cross-entropy, where the cross-entropy H(p; m) is approximated as

H(p; m) \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 m(w_i)
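
Both metrics can be sketched in a few lines of Python; the names word_error_rate and test_set_perplexity, the toy unigram model m, and the sample data are assumptions made only for illustration.

```python
import math

def word_error_rate(errors, total_words):
    """WER as a percentage: erroneous words (deletions + insertions +
    substitutions) divided by the total number of reference words."""
    return 100.0 * errors / total_words

def test_set_perplexity(m, tokens, base=2.0):
    """Per-token perplexity of model m (token -> probability) on a test set:
    the inverse of the geometric mean of the assigned probabilities."""
    n = len(tokens)
    cross_entropy = -sum(math.log(m[t], base) for t in tokens) / n
    return base ** cross_entropy

print(word_error_rate(errors=12, total_words=100))           # 12.0 percent
m = {"the": 0.4, "cat": 0.3, "sat": 0.3}                     # toy unigram model
print(test_set_perplexity(m, ["the", "cat", "sat", "the"]))  # ~2.9
```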

Recent advances in language modeling

Since 2007, significant advancements in language modeling have emerged, particularly with the advent of deep learning techniques. Perplexity per token, a measure that quantifies the predictive power of a language model, has remained central to evaluating models such as BERT, GPT-4, and other transformer-based large language models (LLMs).

This measure was employed to compare different models on the same dataset and guide the optimization of hyperparameters, although it has been found sensitive to factors such as linguistic features and sentence length. [2]

Despite its pivotal role in language model development, perplexity has shown limitations, particularly as an inadequate predictor of speech recognition performance and of overfitting and generalization, [3] [4] raising questions about the benefits of blindly optimizing perplexity alone.

Brown Corpus

The lowest perplexity that had been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 is indeed about 247 per word/token, corresponding to a cross-entropy of \log_2 247 \approx 7.95 bits per word or 1.75 bits per letter [5] using a trigram model. While this figure represented the state of the art at the time, advancements in techniques such as deep learning have led to significant improvements in perplexity on other benchmarks, such as the One Billion Word Benchmark. [6]

In the context of the Brown Corpus, simply guessing that the next word is "the" will achieve an accuracy of 7 percent, contrasting with the 1/247 ≈ 0.4 percent that might be expected from a naive use of perplexity. This difference underscores the importance of the statistical model used and the nuanced nature of perplexity as a measure of predictiveness. [7] The guess is based on unigram statistics, not on the trigram statistics that yielded the perplexity of 247, and utilizing trigram statistics would further refine the prediction.


References

  1. Jelinek, F.; Mercer, R. L.; Bahl, L. R.; Baker, J. K. (1977). "Perplexity—a measure of the difficulty of speech recognition tasks". The Journal of the Acoustical Society of America. 62 (S1): S63. Bibcode:1977ASAJ...62Q..63J. doi:10.1121/1.2016299. ISSN 0001-4966.
  2. Miaschi, Alessio; Brunato, Dominique; Dell'Orletta, Felice; Venturi, Giulia (2021). "What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity". Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. pp. 40–47. doi:10.18653/v1/2021.deelio-1.5. Archived from the original on 2023-10-24. Retrieved 2023-08-24.
  3. Klakow, Dietrich; Peters, Jochen (2002). "Testing the correlation of word error rate and perplexity". Speech Communication. 38 (1–2): 19–28. doi:10.1016/S0167-6393(01)00041-3. ISSN 0167-6393.
  4. Chen, Stanley F.; Beeferman, Douglas; Rosenfeld, Roni (2018). "Evaluation Metrics for Language Models". Carnegie Mellon University. doi:10.1184/R1/6605324.v1.
  5. Brown, Peter F.; et al. (March 1992). "An Estimate of an Upper Bound for the Entropy of English" (PDF). Computational Linguistics. 18 (1). Archived (PDF) from the original on 2021-09-17. Retrieved 2007-02-07.
  6. Jozefowicz, Rafal; et al. (2016). "Exploring the Limits of Language Modeling". arXiv preprint arXiv:1602.02410. Archived 2021-05-04 at the Wayback Machine.
  7. Wilcox, Ethan Gotlieb; et al. (2020). "On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior". arXiv preprint arXiv:2006.01912. Archived 2023-08-25 at the Wayback Machine.