Perplexity

Last updated

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution. The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution. Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker. [1]

Contents

Perplexity of a probability distribution

The perplexity PP of a discrete probability distribution p is a concept widely used in information theory, machine learning, and statistical modeling. It is defined as

where H(p) is the entropy (in bits) of the distribution, and x ranges over the events. The base of the logarithm need not be 2: The perplexity is independent of the base, provided that the entropy and the exponentiation use the same base. In some contexts, this measure is also referred to as the (order-1 true) diversity .

Perplexity of a random variable X may be defined as the perplexity of the distribution over its possible values x. It can be thought of as a measure of uncertainty or "surprise" related to the outcomes.

For the special case of a distribution p with exactly k equaling the constant value 1/k and zeros otherwise, the product can be evaluated simply and the perplexity equals k. For example, this is the case when p models a fair k-sided die, i.e. a uniform distribution over k discrete events. In this sense, a random variable with perplexity k has the same uncertainty as a fair k-sided die. One is said to be "k-ways perplexed" about the value of the random variable. Unless it is a fair k-sided die, more than k values may be possible, but the overall uncertainty is not greater because some values may have a probability greater than 1/k.

Perplexity is sometimes used as a measure of the difficulty of a prediction problem. It is however generally not a straight forward representation of the relevant probability. For example, if you have two choices, one with probability 0.9, your chances of a correct guess using the optimal strategy are 90 percent. Yet, the perplexity is 2−0.9 log2 0.9 - 0.1 log2 0.1= 1.38. The inverse of the perplexity, 1/1.38 = 0.72, does not correspond to the 0.9 probability.

The perplexity is the exponentiation of the entropy, a more straightforward quantity. Entropy measures the expected or "average" number of bits required to encode the outcome of the random variable using an optimal variable-length code. It can also be regarded as the expected information gain from learning the outcome of the random variable, providing insight into the uncertainty and complexity of the underlying probability distribution.

Perplexity of a probability model

A model of an unknown probability distribution p, may be proposed based on a training sample that was drawn from p. Given a proposed probability model q, one may evaluate q by asking how well it predicts a separate test sample x1, x2, ..., xN also drawn from p. The perplexity of the model q is defined as

where is customarily 2. Better models q of the unknown distribution p will tend to assign higher probabilities q(xi) to the test events. Thus, they have lower perplexity: they are less surprised by the test sample.

The exponent above may be regarded as the average number of bits needed to represent a test event xi if one uses an optimal code based on q. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because q(xi) tends to be high.

The exponent may also be interpreted as a cross-entropy:

where denotes the empirical distribution of the test sample (i.e., if x appeared n times in the test sample of size N).

By the definition of KL divergence, it is also equal to

which is . Consequently, the perplexity is minimized when .

Perplexity per word

In natural language processing, a corpus is a set of sentences or texts, and a language model is a probability distribution over entire sentences or texts. Consequently, in NLP, the more commonly used measure is perplexity per word, defined as:

where are the sentences in the corpus, but is the number of words in the corpus. This normalizes the perplexity by the length of the text, allowing for more meaningful comparisons between different texts or models.

Suppose the average sentence xi in the corpus has a probability of according to the language model. This would give a model perplexity of 2190 per sentence. However, it is more common to normalize for sentence length. Thus, if the test sample's sentences comprised a total of 1,000 words, and could be coded using 7.95 bits per word, one could report a model perplexity of 27.95 = 247 per word. In other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each word.

There are two standard evaluation metrics for language models: perplexity or word error rate(WER). The simpler of these measures, WER, is simply the percentage of erroneously recognized words E (deletions, insertions, substitutions) to total number of words N, in a speech recognition task i.e.

The second metric, perplexity(per word), is an information theoretic measure that evaluates the similarity of proposed model m to the original distribution p. It can be computed as a inverse of (geometric) average probability of test set T

where N is the number of words in test set T. Equation 1 can be seen as exponentiated cross entropy, where cross entropy H(p;m) is approximated as

In many ways, WER is a better metric, as any improvement on language modeling benchmarks is meaningful only if it translates into improvements in Automatic Speech Recognition (ASR) or Machine Translation. The problem with WER is that it needs a complete ASR pipeline to evaluate. Also, almost all benchmarking datasets are behind a pay-wall, hence not readily available for evaluation.

Recent advances in language modeling

Since 2007, significant advancements in language modeling had emerged, particularly with the advent of deep learning techniques. Perplexity per word, a measure that quantifies the predictive power of a language model, remained central to evaluating models like transformers, BERT, GPT-4 and others. This measure was employed to compare different models on the same dataset and guide the optimization of hyperparameters, although it has been found sensitive to factors such as linguistic features and sentence length. [2] Despite its pivotal role in language model development, perplexity has shown limitations, particularly as an inadequate predictor of speech recognition performance, where it may not correlate would with word-error rates, [3] [4] raising questions about its accuracy.

Brown Corpus

The lowest perplexity that had been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 is indeed about 247 per word, corresponding to a cross-entropy of log2247 = 7.95 bits per word or 1.75 bits per letter [5] using a trigram model. While this figure represented the state of the art at the time, advancements in techniques such as deep learning have led to significant improvements in perplexity on other benchmarks, such as the One Billion Word Benchmark. [6]

In the context of the Brown Corpus, simply guessing that the next word is "the" will achieve an accuracy of 7 percent, contrasting with the 1/247 = 0.4 percent that might be expected from a naive use of perplexity. This difference underscores the importance of the statistical model used and the nuanced nature of perplexity as a measure of predictiveness. [7] The guess is based on unigram statistics, not on the trigram statistics that yielded the perplexity of 247, and utilizing trigram statistics would further refine the prediction.

See also

Related Research Articles

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field, in applied mathematics, is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.

<span class="mw-page-title-main">Entropy (information theory)</span> Expected amount of information needed to specify the output of a stochastic data source

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to , the entropy is

<span class="mw-page-title-main">Skewness</span> Measure of the asymmetry of random variables

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data.

<span class="mw-page-title-main">Gamma distribution</span> Probability distribution

In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

  1. With a shape parameter k and a scale parameter θ
  2. With a shape parameter and an inverse scale parameter , called a rate parameter.

In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of typical set used in theories of data compression.

<span class="mw-page-title-main">Helmholtz free energy</span> Thermodynamic potential

In thermodynamics, the Helmholtz free energy is a thermodynamic potential that measures the useful work obtainable from a closed thermodynamic system at a constant temperature (isothermal). The change in the Helmholtz energy during a process is equal to the maximum amount of work that the system can perform in a thermodynamic process in which temperature is held constant. At constant temperature, the Helmholtz free energy is minimized at equilibrium.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

A prior probability distribution of an uncertain quantity, often simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions, and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions, it satisfies a generalized Pythagorean theorem.

In information theory, the Rényi entropy is a quantity that generalizes various notions of entropy, including Hartley entropy, Shannon entropy, collision entropy, and min-entropy. The Rényi entropy is named after Alfréd Rényi, who looked for the most general way to quantify information while preserving additivity for independent events. In the context of fractal dimension estimation, the Rényi entropy forms the basis of the concept of generalized dimensions.

In information theory, the cross-entropy between two probability distributions and over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution , rather than the true distribution .

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Differential entropy is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average (surprisal) of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

In information theory, Fano's inequality relates the average information lost in a noisy channel to the probability of the categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

This article discusses how information theory is related to measure theory.

<span class="mw-page-title-main">Quantities of information</span>

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm. Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data-processing to refer to a binary value or stream regardless of its entropy Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.

<span class="mw-page-title-main">Dirichlet process</span> Family of stochastic processes

In probability theory, Dirichlet processes are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables—how likely it is that the random variables are distributed according to one or another particular distribution.

A Stein discrepancy is a statistical divergence between two probability measures that is rooted in Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers, but has since been used in diverse settings in statistics, machine learning and computer science.

References

  1. Jelinek, F.; Mercer, R. L.; Bahl, L. R.; Baker, J. K. (1977). "Perplexity—a measure of the difficulty of speech recognition tasks". The Journal of the Acoustical Society of America. 62 (S1): S63–S63. doi: 10.1121/1.2016299 . ISSN   0001-4966.
  2. Miaschi, Alessio; Brunato, Dominique; Dell'Orletta, Felice; Venturi, Giulia (2021). "What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity". Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. pp. 40--47. doi: 10.18653/v1/2021.deelio-1.5 . Archived from the original on 2023-10-24. Retrieved 2023-08-24.
  3. Klakow, Dietrich; Peters, Jochen (2002). "Testing the correlation of word error rate and perplexity". Speech Communication. 38 (1–2): 19–28. doi:10.1016/S0167-6393(01)00041-3. ISSN   0167-6393.
  4. Chen, Stanley F; Beeferman, Douglas; Rosenfeld, Roni (2018). "Evaluation Metrics For Language Models". Carnegie Mellon University.
  5. Brown, Peter F.; et al. (March 1992). "An Estimate of an Upper Bound for the Entropy of English" (PDF). Computational Linguistics. 18 (1). Archived (PDF) from the original on 2021-09-17. Retrieved 2007-02-07.
  6. Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016). Archived 2021-05-04 at the Wayback Machine
  7. Wilcox, Ethan Gotlieb, et al. "On the predictive power of neural language models for human real-time comprehension behavior." arXiv preprint arXiv:2006.01912 (2020). Archived 2023-08-25 at the Wayback Machine