METEOR

Last updated July 01, 2024

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.

Algorithm

As with BLEU, the basic unit of evaluation is the sentence. The algorithm first creates an alignment (see illustrations) between two sentences: the candidate translation string and the reference translation string. The alignment is a set of mappings between unigrams. A mapping can be thought of as a line between a unigram in one string and a unigram in the other. The constraint is that every unigram in the candidate translation must map to zero or one unigram in the reference. Mappings are selected to produce an alignment as defined above. If two alignments have the same number of mappings, the one with the fewest crosses, that is, the fewest intersections between two mappings, is chosen. From the two alignments shown, alignment (a) would be selected at this point.

Mappings are produced in stages, one per matching module (exact, stemmer and synonymy, as in the table below). Stages are run consecutively, and each stage only adds to the alignment those unigrams which have not been matched in previous stages.

Examples of pairs of words which will be mapped by each module
Module	Candidate	Reference	Match
Exact	Good	Good	Yes
Stemmer	Goods	Good	Yes
Synonymy	well	Good	Yes

Once the final alignment is computed, the score is computed as follows. Unigram precision $P$ is calculated as:

P={\frac {m}{w_{t}}}

Where $m$ is the number of unigrams in the candidate translation that are also found in the reference translation, and $w_{t}$ is the number of unigrams in the candidate translation. Unigram recall $R$ is computed as:

R={\frac {m}{w_{r}}}

Where $m$ is as above, and $w_{r}$ is the number of unigrams in the reference translation. Precision and recall are combined using the harmonic mean in the following fashion, with recall weighted 9 times more than precision:

F_{mean}={\frac {10PR}{R+9P}}
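The staged matching and the three formulas above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: `toy_stem` and `TOY_SYNONYMS` are hypothetical stand-ins for a real stemmer and synonym lexicon, lowercasing before comparison is an assumption, and the sketch only counts matched unigrams $m$ without tracking word positions.

```python
# Hypothetical toy resources; a real system would use a proper stemmer
# and a synonym lexicon instead.
TOY_SYNONYMS = {"well": "good"}


def toy_stem(word):
    """Crude illustrative stemmer: strip a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def toy_synonym_key(word):
    """Map a word to a shared key if the toy lexicon lists a synonym."""
    stem = toy_stem(word)
    return TOY_SYNONYMS.get(stem, stem)


def count_matches(candidate, reference):
    """Count matched unigrams m using consecutive stages.

    Each stage only considers words left unmatched by earlier stages.
    """
    cand = [w.lower() for w in candidate]  # lowercasing is an assumption
    ref = [w.lower() for w in reference]
    stages = [lambda w: w, toy_stem, toy_synonym_key]  # exact, stemmer, synonymy
    matched = 0
    for normalise in stages:
        unmatched = []
        for word in cand:
            key = normalise(word)
            hit = next((i for i, r in enumerate(ref) if normalise(r) == key), None)
            if hit is None:
                unmatched.append(word)
            else:
                ref.pop(hit)  # a reference unigram can be used only once
                matched += 1
        cand = unmatched
    return matched


def precision_recall_fmean(m, w_t, w_r):
    """Return (P, R, F_mean) with recall weighted 9 times more than precision."""
    if m == 0:
        return 0.0, 0.0, 0.0
    p = m / w_t
    r = m / w_r
    return p, r, 10 * p * r / (r + 9 * p)


candidate = "the cat was sat on the mat".split()
reference = "the cat sat on the mat".split()
m = count_matches(candidate, reference)
p, r, f_mean = precision_recall_fmean(m, len(candidate), len(reference))
print(m, round(p, 4), round(r, 4), round(f_mean, 4))  # 6 0.8571 1.0 0.9836
```

Returning zeros when nothing matches is a design choice of this sketch to avoid a division by zero; the formulas above do not define that case.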

The measures that have been introduced so far only account for congruity with respect to single words but not with respect to larger segments that appear in both the reference and the candidate sentence. In order to take these into account, longer n-gram matches are used to compute a penalty $p$ for the alignment. The more mappings there are that are not adjacent in the reference and the candidate sentence, the higher the penalty will be.

In order to compute this penalty, unigrams are grouped into the fewest possible chunks, where a chunk is defined as a set of unigrams that are adjacent in the candidate and in the reference. The longer the adjacent mappings between the candidate and the reference, the fewer chunks there are. A translation that is identical to the reference will give just one chunk. The penalty $p$ is computed as follows:

p=0.5\left({\frac {c}{u_{m}}}\right)^{3}

Where $c$ is the number of chunks, and $u_{m}$ is the number of unigrams that have been mapped. The final score for a segment is calculated as $M$ below. The penalty has the effect of reducing the $F_{mean}$ by up to 50% if there are no bigram or longer matches.

M=F_{mean}(1-p)
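The chunk count, penalty and final score can be sketched as below. This is an illustrative sketch under stated assumptions: the alignment is supplied by hand as a list of (candidate index, reference index) pairs rather than computed, and the function names `count_chunks` and `meteor_segment_score` are hypothetical.

```python
def count_chunks(alignment):
    """Count runs of mappings that are adjacent in both candidate and reference."""
    chunks = 0
    previous = None
    for cand_i, ref_i in sorted(alignment):
        # A new chunk starts whenever either index fails to continue the run.
        if previous is None or (cand_i, ref_i) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (cand_i, ref_i)
    return chunks


def meteor_segment_score(alignment, w_t, w_r):
    """Compute M = F_mean * (1 - p) for one segment."""
    u_m = len(alignment)  # number of mapped unigrams
    if u_m == 0:
        return 0.0  # assumption: score 0 when nothing is matched
    precision = u_m / w_t
    recall = u_m / w_r
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(alignment) / u_m) ** 3
    return f_mean * (1 - penalty)


# Candidate "on the mat sat the cat" against reference "the cat sat on the mat",
# aligned by hand into the chunks "on the mat", "sat" and "the cat".
alignment = [(0, 3), (1, 4), (2, 5), (3, 2), (4, 0), (5, 1)]
print(count_chunks(alignment))                # 3
print(meteor_segment_score(alignment, 6, 6))  # 0.9375
```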

To calculate a score over a whole corpus, or collection of segments, the aggregate values for $P$, $R$ and $p$ are taken and then combined using the same formula. The algorithm also works for comparing a candidate translation against more than one reference translation. In this case the algorithm compares the candidate against each of the references and selects the highest score.
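The multiple-reference rule can be sketched as a maximum over per-reference scores. In this sketch `best_reference_score` is a hypothetical helper that accepts any segment-level scoring function, and `toy_overlap` is a placeholder scorer used only to make the example runnable; it is not METEOR.

```python
def best_reference_score(candidate, references, score_fn):
    """Score the candidate against every reference and keep the highest score."""
    if not references:
        raise ValueError("at least one reference translation is required")
    return max(score_fn(candidate, reference) for reference in references)


def toy_overlap(candidate, reference):
    """Placeholder scorer (not METEOR): share of candidate words found in the reference."""
    reference_words = set(reference)
    return sum(word in reference_words for word in candidate) / len(candidate)


candidate = "the cat sat".split()
references = ["a dog sat".split(), "the cat slept".split()]
best = best_reference_score(candidate, references, toy_overlap)
```

Here the second reference shares two of the three candidate words, so it determines the result.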

Examples

Reference	the	cat	sat	on	the	mat
Hypothesis	on	the	mat	sat	the	cat
Score	$0.9375={\underset {\text{Fmean}}{1.0000}}\times (1-{\underset {\text{Penalty}}{0.0625}})$
Fmean	$1.0000=10\times {\underset {\text{Precision}}{1.0000}}\times {\frac {\overset {\text{Recall}}{1.0000}}{{\underset {\text{Recall}}{1.0000}}+9\times {\underset {\text{Precision}}{1.0000}}}}$
Penalty	$0.0625=0.5\times {\underset {\text{Fragmentation}}{0.5^{3}}}$
Fragmentation	$0.5={\frac {\overset {\text{Chunks}}{3.0000}}{\underset {\text{Matches}}{6.0000}}}$

Reference	the	cat	sat	on	the	mat
Hypothesis	the	cat	sat	on	the	mat
Score	$0.9977={\underset {\text{Fmean}}{1.0000}}\times (1-{\underset {\text{Penalty}}{0.0023}})$
Fmean	$1.0000=10\times {\underset {\text{Precision}}{1.0000}}\times {\frac {\overset {\text{Recall}}{1.0000}}{{\underset {\text{Recall}}{1.0000}}+9\times {\underset {\text{Precision}}{1.0000}}}}$
Penalty	$0.0023=0.5\times {\underset {\text{Fragmentation}}{0.1667^{3}}}$
Fragmentation	$0.1667={\frac {\overset {\text{Chunks}}{1.0000}}{\underset {\text{Matches}}{6.0000}}}$

Reference	the	cat		sat	on	the	mat
Hypothesis	the	cat	was	sat	on	the	mat
Score	$0.9654={\underset {\text{Fmean}}{0.9836}}\times (1-{\underset {\text{Penalty}}{0.0185}})$
Fmean	$0.9836=10\times {\underset {\text{Precision}}{0.8571}}\times {\frac {\overset {\text{Recall}}{1.0000}}{{\underset {\text{Recall}}{1.0000}}+9\times {\underset {\text{Precision}}{0.8571}}}}$
Penalty	$0.0185=0.5\times {\underset {\text{Fragmentation}}{0.3333^{3}}}$
Fragmentation	$0.3333={\frac {\overset {\text{Chunks}}{2.0000}}{\underset {\text{Matches}}{6.0000}}}$

Notes

Banerjee, S. and Lavie, A. (2005)

References

Banerjee, S. and Lavie, A. (2005) "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.
Lavie, A., Sagae, K. and Jayaraman, S. (2004) "The Significance of Recall in Automatic Metrics for MT Evaluation" in Proceedings of AMTA 2004, Washington DC, September 2004.

External links

The METEOR Automatic Machine Translation Evaluation System (including link for download)
