BLEU

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" is the central idea behind BLEU. Invented at IBM in 2001,[1] BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and it remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness is taken into account.

BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.

Mathematical definition

Basic setup

A basic, first attempt at defining the BLEU score would take two arguments: a candidate string $\hat{y}$ and a list of reference strings $(y^{(1)}, \dots, y^{(N)})$. The idea is that $\text{BLEU}(\hat{y}; y^{(1)}, \dots, y^{(N)})$ should be close to 1 when $\hat{y}$ is similar to the references, and close to 0 if not.

As an analogy, the BLEU score is like a language teacher trying to score the quality of a student translation $\hat{y}$ by checking how closely it follows the reference answers $y^{(1)}, \dots, y^{(N)}$.

Since in natural language processing one should evaluate a large set of candidate strings, one must generalize the BLEU score to the case where one has a list of $M$ candidate strings (called a "corpus") $(\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$, and, for each candidate string $\hat{y}^{(i)}$, a list of reference candidate strings $(y^{(i,1)}, \dots, y^{(i,N_i)})$.

Given any string $y = y_1 y_2 \cdots y_K$ and any integer $n \geq 1$, we define the set of its n-grams to be

$$G_n(y) = \{ y_1 \cdots y_n,\; y_2 \cdots y_{n+1},\; \dots,\; y_{K-n+1} \cdots y_K \}$$

Note that it is a set of unique elements, not a multiset allowing redundant elements, so that, for example, $G_2(abab) = \{ab, ba\}$.

Given any two strings $s, y$, define the substring count $C(s, y)$ to be the number of appearances of $s$ as a substring of $y$. For example, $C(ab, abcbab) = 2$.
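To make these definitions concrete, here is a minimal Python sketch of the n-gram set $G_n$ and the substring count $C$. The function names ngram_set and substring_count are ours, chosen for illustration, not taken from the BLEU paper:

    def ngram_set(y, n):
        """G_n(y): the set (not multiset) of n-grams of the sequence y."""
        return {tuple(y[i:i + n]) for i in range(len(y) - n + 1)}

    def substring_count(s, y):
        """C(s, y): number of (possibly overlapping) occurrences of s in y."""
        n, target = len(s), tuple(s)
        return sum(1 for i in range(len(y) - n + 1) if tuple(y[i:i + n]) == target)

    # Reproduces the examples from the text:
    assert ngram_set("abab", 2) == {("a", "b"), ("b", "a")}
    assert substring_count("ab", "abcbab") == 2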

Now, fix a candidate corpus $\hat{S} := (\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$ and a reference candidate corpus $S = (S_1, \dots, S_M)$, where each $S_i := (y^{(i,1)}, \dots, y^{(i,N_i)})$.

Modified n-gram precision

Define the modified n-gram precision function to be

$$p_n(\hat{S}; S) := \frac{\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} \min\big( C(s, \hat{y}^{(i)}),\; \max_{y \in S_i} C(s, y) \big)}{\sum_{i=1}^{M} \sum_{s \in G_n(\hat{y}^{(i)})} C(s, \hat{y}^{(i)})}$$

The modified n-gram precision, which looks complicated, is merely a straightforward generalization of the prototypical case: one candidate sentence and one reference sentence. In this case, it is

$$p_n(\hat{y}; y) = \frac{\sum_{s \in G_n(\hat{y})} \min\big( C(s, \hat{y}), C(s, y) \big)}{\sum_{s \in G_n(\hat{y})} C(s, \hat{y})}$$

To work up to this expression, we start with the most obvious n-gram count summation:

$$\sum_{s \in G_n(\hat{y})} C(s, y)$$

This quantity measures how many n-grams in the reference sentence are reproduced by the candidate sentence. Note that we count the n-substrings, not n-grams. For example, when $\hat{y} = abab$, $y = abababa$, and $n = 2$, all the 2-substrings in $\hat{y}$ ($ab$ and $ba$) appear in $y$ 3 times each, so the count is 6, not 2.


In the above situation, however, the candidate string is too short. Instead of 3 appearances of $ba$, it contains only one, so we add a minimum function to correct for that:

$$\sum_{s \in G_n(\hat{y})} \min\big( C(s, \hat{y}), C(s, y) \big)$$

This count summation cannot be used to compare between sentences, since it is not normalized. If both the reference and the candidate sentences are long, the count could be big, even if the candidate is of very poor quality. So we normalize it:

$$p_n(\hat{y}; y) = \frac{\sum_{s \in G_n(\hat{y})} \min\big( C(s, \hat{y}), C(s, y) \big)}{\sum_{s \in G_n(\hat{y})} C(s, \hat{y})}$$

The normalization is such that the result is always a number in $[0, 1]$, allowing meaningful comparisons between corpuses. It is zero iff none of the n-substrings in the candidate appears in the reference. It is one iff every n-gram in the candidate appears in the reference at least as many times as it appears in the candidate. In particular, if the candidate is a substring of the reference, then it is one.
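The displayed formula for $p_n$ transcribes directly into code. A minimal Python sketch, reusing the hypothetical ngram_set and substring_count helpers from the earlier sketch:

    def modified_precision(candidates, reference_lists, n):
        """p_n: modified n-gram precision over a corpus.

        candidates: the candidate corpus, a list of token sequences.
        reference_lists: for each candidate, its list of reference sequences.
        """
        numerator = denominator = 0
        for cand, refs in zip(candidates, reference_lists):
            for s in ngram_set(cand, n):
                count_in_cand = substring_count(s, cand)
                numerator += min(count_in_cand,
                                 max(substring_count(s, ref) for ref in refs))
                denominator += count_in_cand
        return numerator / denominator if denominator else 0.0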

Brevity penalty

The modified n-gram precision unduly gives a high score to candidate strings that are "telegraphic", that is, strings containing all the n-grams of the reference strings, but as few times as possible.

In order to punish candidate strings that are too short, define the brevity penalty to be

$$BP(\hat{S}; S) := e^{-(r/c - 1)^+}$$

where $(x)^+ = \max(0, x)$ is the positive part of $x$.

$c$ is the length of the candidate corpus, that is,

$$c := \sum_{i=1}^{M} |\hat{y}^{(i)}|$$

where $|\hat{y}^{(i)}|$ is the length of $\hat{y}^{(i)}$. $r$ is the effective reference corpus length, that is,

$$r := \sum_{i=1}^{M} |y^{(i,j)}|$$

where $y^{(i,j)} = \arg\min_{y \in S_i} \big| |y| - |\hat{y}^{(i)}| \big|$, that is, the sentence from $S_i$ whose length is as close to $|\hat{y}^{(i)}|$ as possible.
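A minimal Python sketch of the brevity penalty under these definitions. One assumption to flag: when two references are equally close in length, this sketch breaks the tie toward the shorter one, which is one defensible reading of the arg min:

    import math

    def brevity_penalty(candidates, reference_lists):
        """BP = exp(-(r/c - 1)^+), with c the candidate corpus length and
        r the effective reference corpus length."""
        c = sum(len(cand) for cand in candidates)
        # For each candidate, use the reference whose length is closest to it.
        r = sum(
            min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
            for cand, refs in zip(candidates, reference_lists)
        )
        return math.exp(-max(0.0, r / c - 1))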

Final definition of BLEU

There is not a single definition of BLEU, but a whole family of them, parametrized by a weighting vector $w := (w_1, w_2, \dots)$. It is a probability distribution on $\{1, 2, 3, \dots\}$, that is, $\sum_{n=1}^{\infty} w_n = 1$ with each $w_n \in [0, 1]$.

With a choice of $w$, the BLEU score is

$$\text{BLEU}_w(\hat{S}; S) := BP(\hat{S}; S) \cdot \exp\left( \sum_{n=1}^{\infty} w_n \ln p_n(\hat{S}; S) \right)$$

In words, it is a weighted geometric mean of all the modified n-gram precisions, multiplied by the brevity penalty. We use the weighted geometric mean, rather than the weighted arithmetic mean, to strongly favor candidate corpuses that are simultaneously good according to multiple n-gram precisions.

The most typical choice, the one recommended in the original paper, is $w_n = \tfrac{1}{4}$ for $n = 1, 2, 3, 4$ and $w_n = 0$ otherwise. [2]
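Putting the pieces together, a sketch of the full score with the typical uniform weights, built on the hypothetical helpers defined above:

    def bleu(candidates, reference_lists, weights=(0.25, 0.25, 0.25, 0.25)):
        """BLEU_w: brevity penalty times the weighted geometric mean
        of the modified n-gram precisions p_1 ... p_len(weights)."""
        log_sum = 0.0
        for n, w in enumerate(weights, start=1):
            p = modified_precision(candidates, reference_lists, n)
            if p == 0:
                return 0.0  # the geometric mean vanishes if any p_n is zero
            log_sum += w * math.log(p)
        return brevity_penalty(candidates, reference_lists) * math.exp(log_sum)

Note the zero guard: because the geometric mean involves $\ln p_n$, a single vanishing precision sends the whole score to zero, which is one practical reason BLEU is usually computed at the corpus rather than sentence level.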

Algorithm

BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. This is illustrated in the following example from Papineni et al. (2002):

Example of poor machine translation output with high precision

    Candidate:    the the the the the the the
    Reference 1:  the cat is on the mat
    Reference 2:  there is a cat on the mat

Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of

$$P = \frac{m}{w_t} = \frac{7}{7} = 1$$

where $m$ is the number of words from the candidate that are found in the reference, and $w_t$ is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references.

The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count, $m_{max}$, in any of the reference translations. In the example above, the word "the" appears twice in reference 1, and once in reference 2. Thus $m_{max} = 2$.

For the candidate translation, the count $m_w$ of each word is clipped to a maximum of $m_{max}$ for that word. In this case, "the" has $m_w = 7$ and $m_{max} = 2$, so $m_w$ is clipped to 2. These clipped counts are then summed over all distinct words in the candidate. This sum is then divided by the total number of unigrams in the candidate translation. In the above example, the modified unigram precision score would be:

$$P = \frac{2}{7}$$
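The clipping arithmetic is easy to verify with Python's standard library; this is a quick check of the example, not code from the original paper:

    from collections import Counter

    candidate = "the the the the the the the".split()
    references = ["the cat is on the mat".split(),
                  "there is a cat on the mat".split()]

    cand_counts = Counter(candidate)                    # {'the': 7}
    clipped = {
        w: min(c, max(Counter(ref)[w] for ref in references))
        for w, c in cand_counts.items()
    }                                                   # {'the': 2}
    print(sum(clipped.values()) / len(candidate))       # 2/7 ≈ 0.2857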

In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using n-grams. The length which has the "highest correlation with monolingual human judgements" was found to be four. The unigram scores are found to account for the adequacy of the translation, how much information is retained. The longer n-gram scores account for the fluency of the translation, or to what extent it reads like "good English".

Comparing metrics for candidate "the the cat" (against the two references above)

    Model             Set of grams            Score
    Unigram           "the", "the", "cat"     (1 + 1 + 1) / 3 = 1
    Grouped Unigram   "the"*2, "cat"*1        (2 + 1) / 3 = 1
    Bigram            "the the", "the cat"    (0 + 1) / 2 = 1/2

An example of a candidate translation for the same references as above might be:

the cat

In this example, the modified unigram precision would be

$$P = \frac{1 + 1}{2} = 1$$

as the word "the" and the word "cat" appear once each in the candidate, and the total number of words is two. The modified bigram precision would be $1/1 = 1$, as the bigram "the cat" appears once in the candidate. It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be $2/6$ or $2/7$, depending on the reference used. The problem is that with multiple reference translations, a bad translation could easily have an inflated recall, such as a translation which consisted of all the words in each of the references.

To produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean, multiplied by a brevity penalty to prevent very short candidates from receiving too high a score. Let $r$ be the total length of the reference corpus, and $c$ the total length of the translation corpus. If $c \leq r$, the brevity penalty applies, defined to be $e^{(1 - r/c)}$. (In the case of multiple reference sentences, $r$ is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences. However, in the version of the metric used by NIST evaluations prior to 2009, the shortest reference sentence had been used instead.)
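For practical use there is no need to reimplement any of this; NLTK, for example, ships a standard implementation. A minimal usage sketch (assuming NLTK is installed; tokenization here is naive whitespace splitting):

    from nltk.translate.bleu_score import corpus_bleu

    # One hypothesis with two references, tokenized by whitespace.
    references = [["the cat is on the mat".split(),
                   "there is a cat on the mat".split()]]
    hypotheses = ["the cat is on the mat".split()]

    # Default weights are (0.25, 0.25, 0.25, 0.25), i.e. n-grams up to 4.
    score = corpus_bleu(references, hypotheses)
    print(score)  # 1.0 for this perfect match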

iBLEU is an interactive version of BLEU that allows a user to visually examine the BLEU scores obtained by the candidate translations. It also allows comparing two different systems in a visual and interactive manner which is useful for system development.

Performance

BLEU has frequently been reported as correlating well with human judgement, and it remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that, although in principle capable of evaluating translations of any language, BLEU cannot, in its present form, deal with languages lacking word boundaries. Although designed to be used with several reference translations, in practice it is often used with only a single one. [1] BLEU is infamously dependent on the tokenization technique, and scores achieved with different tokenizations are incomparable (which is often overlooked); the SacreBLEU variant was designed to improve reproducibility and comparability. [1]
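A minimal SacreBLEU usage sketch (assuming the sacrebleu package is installed); the point is that the library applies its own standardized tokenization internally, which is what makes reported scores comparable across systems:

    import sacrebleu

    hypotheses = ["the cat is on the mat"]
    # One inner list per reference set, each covering the whole corpus.
    references = [["the cat is on the mat"],
                  ["there is a cat on the mat"]]

    result = sacrebleu.corpus_bleu(hypotheses, references)
    print(result.score)  # BLEU reported on a 0-100 scale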

It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.

Notes

  1. ^ Papineni, K., et al. (2002)
  2. ^ Papineni, K., et al. (2002)
  3. ^ Coughlin, D. (2003)
  4. ^ Papineni, K., et al. (2002)
  5. ^ Papineni, K., et al. (2002)
  6. ^ Papineni, K., et al. (2002)
  7. ^ Coughlin, D. (2003)
  8. ^ Doddington, G. (2002)
  9. ^ Denoual, E. and Lepage, Y. (2005)
  10. ^ Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
  11. ^ Lee, A. and Przybocki, M. (2005)
  12. ^ Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
  13. ^ Lin, C. and Och, F. (2004)
  14. ^ Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
  15. ^ Madnani, N. (2011)


References

  1. "BLEU: A Misunderstood Metric from Another Age". 5 November 2022.
  2. Papineni, Kishore; Roukos, Salim; Ward, Todd; Zhu, Wei-Jing (2001). "BLEU: a method for automatic evaluation of machine translation". Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02. Morristown, NJ, USA: Association for Computational Linguistics: 311. doi:10.3115/1073083.1073135. S2CID 11080756.
