LEPOR

LEPOR (Length Penalty, Precision, n-gram Position difference Penalty and Recall) is an automatic, language-independent machine translation evaluation metric with tunable parameters and reinforced factors.

Background

Since IBM proposed and realized BLEU [1] as an automatic metric for machine translation (MT) evaluation, [2] many other methods have been proposed to revise or improve it, such as TER and METEOR. [3] However, the traditional automatic evaluation metrics have several problems. Some metrics perform well on certain languages but poorly on others, which is usually called the language bias problem. Some metrics rely on many language features or much linguistic information, which makes it difficult for other researchers to repeat the experiments. LEPOR is an automatic evaluation metric that tries to address some of these problems. [4] LEPOR is designed with augmented factors and corresponding tunable parameters to address the language bias problem. The improved version of LEPOR, hLEPOR, [5] additionally uses optimized linguistic features extracted from treebanks. Another advanced version is the nLEPOR metric, [6] which adds n-gram features to the previous factors. The LEPOR metric has since been developed into a series of metrics. [7] [8]

LEPOR metrics have been studied and analyzed by researchers from different fields, such as machine translation, [9] natural-language generation, [10] and search, [11] and they continue to receive attention from researchers in natural language processing.

Design

LEPOR [12] is designed with the factors of enhanced length penalty, precision, n-gram word order penalty, and recall. The enhanced length penalty ensures that the hypothesis translation, usually produced by a machine translation system, is penalized if it is longer or shorter than the reference translation. The precision score reflects the accuracy of the hypothesis translation, while the recall score reflects its faithfulness to the reference translation or the source language. The n-gram based word order penalty factor accounts for differences in word positions between the hypothesis and reference translations. Such word order penalty factors have been shown to be useful by many researchers, such as in the work of Wong and Kit (2008). [13]
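The following is a minimal, word-level sketch of how these factors can be combined, loosely following the published formulation (a length penalty, an n-gram position-difference penalty, and a weighted harmonic mean of precision and recall). The greedy nearest-position alignment, the unigram-only matching and the default weights alpha = beta = 1 are simplifications for illustration; this is not the official implementation.

```python
import math
from collections import defaultdict

def lepor(hypothesis, reference, alpha=1.0, beta=1.0):
    """Simplified word-level LEPOR sketch (illustrative, not the released tool)."""
    hyp, ref = hypothesis.split(), reference.split()
    c, r = len(hyp), len(ref)

    # Enhanced length penalty: punish hypotheses longer or shorter than the reference.
    if c < r:
        lp = math.exp(1 - r / c)
    elif c > r:
        lp = math.exp(1 - c / r)
    else:
        lp = 1.0

    # Greedy alignment: each hypothesis token takes the unused reference occurrence
    # whose normalised position is closest.
    ref_positions = defaultdict(list)
    for j, tok in enumerate(ref):
        ref_positions[tok].append(j)
    used = set()
    position_diffs = []
    matches = 0
    for i, tok in enumerate(hyp):
        candidates = [j for j in ref_positions.get(tok, []) if j not in used]
        if not candidates:
            continue
        j = min(candidates, key=lambda j: abs(i / c - j / r))
        used.add(j)
        matches += 1
        position_diffs.append(abs(i / c - j / r))

    # Position-difference penalty (unigram matching only in this sketch).
    npd = sum(position_diffs) / c
    pos_penalty = math.exp(-npd)

    # Weighted harmonic mean of precision and recall.
    precision = matches / c
    recall = matches / r
    if precision == 0 or recall == 0:
        harmonic = 0.0
    else:
        harmonic = (alpha + beta) / (alpha / recall + beta / precision)

    return lp * pos_penalty * harmonic


print(lepor("the cat sat on the mat", "the cat is on the mat"))
```

The published metric additionally supports multiple references and weights alpha and beta tuned per language pair; the defaults above are placeholders.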

Because word-surface string matching metrics were criticized for lacking syntactic and semantic awareness, the further developed LEPOR metric (hLEPOR) investigates the integration of linguistic features such as part of speech (POS). [14] [15] POS carries information from both a syntactic and a semantic point of view: for example, if a token in the output sentence is a verb where a noun is expected, a penalty is applied; conversely, if the POS is the same but the exact word differs, e.g. good vs. nice, the candidate gains partial credit. The overall hLEPOR score is then calculated as a weighted combination of the word-level score and the POS-level score, as sketched below. Language-modelling-inspired n-gram knowledge is also extensively explored in nLEPOR. [16] [17] In addition to the n-gram position difference penalty, nLEPOR applies n-grams to precision and recall, with n as an adjustable parameter. Beyond the POS knowledge in hLEPOR, phrase structure from parsing information is included in a further variant, HPPR. [18] In HPPR, phrase structures such as noun phrases, verb phrases, prepositional phrases and adverbial phrases are considered when matching the candidate text to the reference text.
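A sketch of the two-level idea in hLEPOR, reusing the lepor() sketch above on word and POS sequences. The POS tags and the weights below are hypothetical placeholders; the tuned weights and the exact combination scheme used in the published metric are given in the cited papers.

```python
# hLEPOR-style combination: score the same sentence pair at two granularities.
hyp_words = "this film is good"
ref_words = "this film is nice"
hyp_pos = "DT NN VBZ JJ"   # hypothetical POS tags for the hypothesis
ref_pos = "DT NN VBZ JJ"   # hypothetical POS tags for the reference

word_score = lepor(hyp_words, ref_words)   # "good" vs "nice" mismatches at word level
pos_score = lepor(hyp_pos, ref_pos)        # full match at POS level, so partial credit overall

w_word, w_pos = 1.0, 1.0                   # tunable weights; placeholder values
hlepor_score = (w_word * word_score + w_pos * pos_score) / (w_word + w_pos)
print(word_score, pos_score, hlepor_score)
```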

Software implementation

LEPOR metrics were originally implemented in the Perl programming language, [19] and a Python version [20] has recently been made available by other researchers and engineers, [21] with a press announcement [22] from the Logrus Global language service company.

Performance

The LEPOR series has shown good performance in the ACL annual international workshop on statistical machine translation (ACL-WMT), which is held by the special interest group on machine translation (SIGMT) of the Association for Computational Linguistics (ACL). In ACL-WMT 2013, [23] there were two translation and evaluation tracks, English-to-other and other-to-English, where "other" covers Spanish, French, German, Czech and Russian. In the English-to-other direction, the nLEPOR metric achieved the highest system-level correlation with human judgments measured by the Pearson correlation coefficient, and the second highest measured by the Spearman rank correlation coefficient. In the other-to-English direction, nLEPOR performed moderately and METEOR yielded the highest correlation with human judgments. This is because nLEPOR uses only one concise linguistic feature, part-of-speech information, beyond the officially provided training data, whereas METEOR uses many other external resources, such as synonym dictionaries, paraphrases, and stemming.
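System-level correlation of this kind is typically computed by pairing one metric score and one human score per MT system and measuring Pearson and Spearman correlation over those pairs. The sketch below uses SciPy's standard correlation functions; the score values are invented purely for illustration and are not WMT13 data.

```python
from scipy.stats import pearsonr, spearmanr

# One (metric score, human score) pair per MT system; values invented for illustration.
metric_scores = [0.62, 0.58, 0.71, 0.66, 0.54]
human_scores = [0.60, 0.55, 0.74, 0.63, 0.50]

pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```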

An extended study of LEPOR's performance under different conditions, including pure word-surface form, POS features, and phrase-tag features, is described in a thesis from the University of Macau. [24]

A deeper statistical analysis of hLEPOR and nLEPOR performance in WMT13 shows that it performed as one of the best metrics "in both the individual language pair assessment for Spanish-to-English and the aggregated set of 9 language pairs"; see Graham et al. (2015), "Accurate Evaluation of Segment-level Machine Translation Metrics", NAACL 2015 (https://www.aclweb.org/anthology/N15-1124; https://github.com/ygraham/segment-mteval).

Applications

LEPOR automatic metrics have been applied by researchers from different fields of natural language processing, for instance in standard MT and neural MT. [25] They have also been used outside the MT community: LEPOR has been applied to search evaluation, [26] and its use for code (programming language) generation evaluation has been discussed. [27] Studies of the automatic evaluation of natural language generation [28] [29] have included LEPOR among the metrics examined and argued that automatic metrics can help with system-level evaluation. LEPOR has also been applied to image captioning evaluation. [30]

Notes

  1. Papineni et al. (2002)
  2. Han (2016)
  3. Banerjee and Lavie (2005)
  4. Han et al. (2012)
  5. Han et al. (2013a)
  6. Han et al. (2013b)
  7. Han et al. (2014)
  8. Han (2014)
  9. Graham et al. (2015)
  10. Novikova et al. (2017)
  11. Liu et al. (2021)
  12. Han et al. (2012)
  13. Wong and Kit (2008)
  14. Han et al. (2013a)
  15. Han (2014)
  16. Han et al. (2013b)
  17. Han (2014)
  18. Han et al. (2013c)
  19. "GitHub - aaronlifenghan/Aaron-project-lepor: LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors". GitHub . 8 January 2022.
  20. "HLepor: This is Python port of original algorithm by Aaron Li-Feng Han".
  21. "GitHub - lHan87/LEPOR". GitHub . 5 May 2021.
  22. Global, Logrus (2021-04-30). "Logrus Global Adds hLEPOR Translation-quality Evaluation Metric Python Implementation on PyPi.org". Slator (Press release). Retrieved 2022-11-02.
  23. ACL-WMT (2013)
  24. Han (2014)
  25. Marzouk and Hansen-Schirra (2019)
  26. Liu et al. (2021)
  27. Liguori et al. (2021)
  28. Novikova et al. (2017)
  29. Celikyilmaz et al. (2020)
  30. Qiu et al. (2020)

