Europarl Corpus

The Europarl Corpus is a corpus (set of documents) that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release in 2001, it covered eleven official languages of the European Union (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish). [1] With the political expansion of the EU, the official languages of the ten new member states were added to the corpus data. [1] The latest release (2012) [2] comprised up to 60 million words per language, with the newly added languages slightly underrepresented, as data for them is only available from 2007 onwards. This latest version includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. [1]

The data that makes up the corpus was extracted from the website of the European Parliament and then prepared for linguistic research. [1] After sentence splitting and tokenization, the sentences were aligned across languages using the length-based algorithm of Gale & Church (1993). [1]
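The Gale & Church method scores each candidate pairing of sentences by how plausible its length ratio is under a normal model, then finds the lowest-cost alignment by dynamic programming. A minimal sketch of the idea (the parameter values c = 1 and s² = 6.8 follow the published paper; the pattern penalties are approximations of its priors, and the toy sentences below are invented for illustration):

```python
import math

def _norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def _match_cost(len1, len2, c=1.0, s2=6.8):
    # Gale & Church length cost: penalize unlikely character-length ratios
    if len1 == 0 and len2 == 0:
        return 0.0
    mean = (len1 + len2 / c) / 2.0
    delta = (len2 - len1 * c) / math.sqrt(mean * s2)
    prob = 2 * (1 - _norm_cdf(abs(delta)))
    return -math.log(max(prob, 1e-12))

def align(src, tgt):
    """Align two lists of sentences; returns [(src_indices, tgt_indices), ...]."""
    # negative log priors for each alignment pattern (1-1, insertions, merges)
    patterns = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5,
                (2, 1): 2.3, (1, 2): 2.3, (2, 2): 3.3}
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), pen in patterns.items():
                if i + di > n or j + dj > m:
                    continue
                l1 = sum(len(s) for s in src[i:i + di])
                l2 = sum(len(t) for t in tgt[j:j + dj])
                c = cost[i][j] + pen + _match_cost(l1, l2)
                if c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = c
                    back[i + di][j + dj] = (di, dj)
    # trace the optimal path back from the end of both texts
    out, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        out.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(out))
```

Because the cost depends only on sentence lengths, the method needs no dictionary and works for any language pair, which made it practical across all of Europarl's languages.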

The corpus has been compiled and expanded by a group of researchers led by Philipp Koehn at the University of Edinburgh. Initially, it was designed for research purposes in statistical machine translation (SMT). However, since its first release it has been used for multiple other research purposes, including, for example, word sense disambiguation. EUROPARL is also available to search via the corpus management system Sketch Engine. [3]

Europarl Corpus and statistical machine translation

In his paper "Europarl: A Parallel Corpus for Statistical Machine Translation", [1] Koehn assesses the extent to which the Europarl corpus is useful for research in SMT. He uses the corpus to develop SMT systems translating each of the eleven languages into each of the other ten, yielding 110 systems. This enables Koehn to build SMT systems for uncommon language pairs that had not previously been considered by SMT developers, such as Finnish–Italian.
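The count of 110 follows directly from the setup: each ordered (source, target) pair of the eleven languages is one system. A quick sketch (the ISO 639-1 codes are used only for illustration):

```python
from itertools import permutations

# the eleven languages of the first Europarl release
langs = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]

# every ordered (source, target) pair is a separate translation system
pairs = list(permutations(langs, 2))
print(len(pairs))  # 11 * 10 = 110
```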

Quality assessment

The Europarl corpus may be used not only for developing SMT systems but also for assessing them. By measuring the output of the systems against the original corpus data for the target language, the adequacy of the translation can be evaluated. Koehn uses the BLEU metric of Papineni et al. (2002) for this, which counts the matches between the two compared versions (SMT output and corpus data) and calculates a score on this basis. [4] The more similar the two versions are, the higher the score, and therefore the higher the quality of the translation. [1] Results show that some SMT systems perform better than others, e.g. Spanish–French (40.2) in comparison to Dutch–Finnish (10.3). [1] Koehn attributes this to the fact that related languages are easier to translate into each other than unrelated ones. [1]
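BLEU combines clipped n-gram precisions (usually up to 4-grams) with a brevity penalty for candidates shorter than the reference. A simplified single-reference sketch, not Papineni et al.'s full corpus-level formulation (the floor on zero n-gram counts is an ad-hoc smoothing assumption for short sentences):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with one reference and uniform weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped match counts
        total = max(sum(cand_ngrams.values()), 1)
        # smoothing assumption: floor zero counts so log() is defined
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty (in log space): punish candidates shorter than the reference
    log_bp = min(0.0, 1 - len(ref) / len(cand))
    return math.exp(log_bp + sum(log_precisions) / max_n)
```

A perfect match scores 1.0; scores are conventionally reported scaled by 100, as in the figures above.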

Back translation

Furthermore, Koehn uses the SMT systems and the Europarl corpus data to investigate whether back translation is an adequate method for evaluating machine translation systems. For each language except English, he compares the BLEU scores for translating that language from and into English (e.g. English > Spanish, Spanish > English) with the scores obtained by measuring the original English data against the output of translating from English into each language and back into English (e.g. English > Spanish > English). [1] The results indicate that the scores for back translation are far higher than those for monodirectional translation and, more importantly, that they do not correlate at all with the monodirectional scores. For example, the monodirectional scores for English↔Greek (27.2 and 23.2) are lower than those for English↔Portuguese (30.1 and 27.2), yet the back translation score for Greek (56.5) is higher than that for Portuguese (53.6). [1] Koehn explains this by noting that errors committed in the translation process may simply be reversed by back translation, resulting in a high overlap between input and output. [1] This, however, allows no conclusions about the quality of the text in the actual target language. [1] Koehn therefore does not consider back translation an adequate method for assessing machine translation systems.

Notes and references

  1. Koehn, Philipp (2005): "Europarl: A Parallel Corpus for Statistical Machine Translation", in: MT Summit, pp. 79–86.
  2. European Parliament Proceedings Parallel Corpus 1996–2011.
  3. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., ... & Suchomel, V. (2014): "The Sketch Engine: Ten Years On", in: Lexicography 1(1), pp. 7–36.
  4. Papineni, Kishore et al. (2002): "BLEU: A Method for Automatic Evaluation of Machine Translation", in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318.
