Parallel text

The Rosetta Stone, a stele engraved with the same decree in both of the Ancient Egyptian scripts as well as Ancient Greek. Its discovery was key to deciphering the Ancient Egyptian language.

A parallel text is a text placed alongside its translation or translations. [1] [2] Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery made it possible to begin deciphering the Ancient Egyptian language.

Large collections of parallel texts are called parallel corpora (see text corpus). Sentence-level alignment of parallel corpora is a prerequisite for many areas of linguistic research. During translation, the translator may split, merge, delete, insert or reorder sentences, so the two texts rarely correspond one to one. This makes alignment a non-trivial task.
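
As an illustration of why splits, merges, deletions and insertions complicate alignment, the following is a minimal sketch of a length-based sentence aligner using dynamic programming. It is not a method described in the sources cited here: the function names, the set of allowed groupings and the penalty values are all assumptions chosen for the example, and the cost is a crude character-length comparison rather than a statistical model. It also assumes sentence order is preserved, so it cannot handle reordering.

```python
from typing import List, Tuple

# Allowed groupings as (source sentences, target sentences):
# 1-1 match, 1-0 deletion, 0-1 insertion, 2-1 merge, 1-2 split.
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

# Hypothetical fixed penalties per grouping type (an assumption for this sketch).
PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def bead_cost(src: List[str], tgt: List[str], bead: Tuple[int, int]) -> float:
    """Relative character-length mismatch plus the penalty for this grouping."""
    s_len = sum(len(s) for s in src)
    t_len = sum(len(t) for t in tgt)
    mismatch = abs(s_len - t_len) / max(s_len, t_len, 1)
    return mismatch + PENALTY[bead]


def align(source: List[str], target: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the lowest-cost monotone alignment as (source group, target group) pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one grouping.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(source[i:ni], target[j:nj], (di, dj))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)

    # Backward pass: recover the chosen groupings from the end.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((source[i - di:i], target[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["The cat sat.", "It was tired and it slept all day."]
    french = ["Le chat s'est assis.", "Il était fatigué.", "Il a dormi toute la journée."]
    for src_group, tgt_group in align(english, french):
        print(src_group, "<->", tgt_group)
```

Practical aligners typically replace the length heuristic with a probabilistic or lexicon-based score, but the search over groupings has the same overall shape.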

Parallel texts may be used in language education. [3]

Types of parallel corpora

Parallel corpora can be classified into four main categories:[ citation needed ]

Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events.

However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages. [4]
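
A very simple form of noise filtering can be sketched as follows. This is a heuristic stand-in, not the extraction technique from the cited work: it only discards candidate pairs that are empty, identical on both sides, or wildly different in length. The function names and the thresholds `max_ratio` and `min_chars` are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def is_plausible_pair(src: str, tgt: str, max_ratio: float = 2.5, min_chars: int = 3) -> bool:
    """Heuristically decide whether (src, tgt) could be a translation pair."""
    src, tgt = src.strip(), tgt.strip()
    # Very short or empty segments carry too little evidence.
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    # Identical text on both sides is often an untranslated copy
    # (this also rejects legitimate cases such as names or numbers).
    if src == tgt:
        return False
    # Translations usually have broadly similar lengths.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def filter_pairs(pairs: Iterable[Tuple[str, str]], **kwargs) -> List[Tuple[str, str]]:
    """Keep only the pairs that pass the plausibility check."""
    return [(s, t) for s, t in pairs if is_plausible_pair(s, t, **kwargs)]
```

Suitable thresholds depend on the language pair, since typical length ratios between translations vary across languages and scripts.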

Bitext

In the field of translation studies, a bitext is a merged document composed of both the source-language and target-language versions of a given text.

Bitexts are generated by a piece of software called an alignment tool, or a bitext tool, which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a bitext database or a bilingual corpus, and can be consulted with a search tool.
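
Consulting a bitext database with a search tool can be illustrated with a minimal sketch. It assumes the database is simply an in-memory list of aligned (source, target) sentence pairs; the `concordance` function and its `side` parameter are hypothetical names for this example and do not correspond to any particular tool.

```python
import re
from typing import List, Tuple

Bitext = List[Tuple[str, str]]  # aligned (source sentence, target sentence) pairs


def concordance(bitext: Bitext, query: str, side: str = "source") -> Bitext:
    """Return every aligned pair whose chosen side contains `query` as a whole word."""
    # Case-insensitive, whole-word match; the query is treated as literal text.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    index = 0 if side == "source" else 1
    return [pair for pair in bitext if pattern.search(pair[index])]


if __name__ == "__main__":
    bitext = [
        ("The house is red.", "La maison est rouge."),
        ("He bought a houseboat.", "Il a acheté une péniche."),
        ("Her house is old.", "Sa maison est vieille."),
    ]
    for source, target in concordance(bitext, "house"):
        print(source, "|", target)
```

Because each hit is returned together with its aligned counterpart, the user sees how the query was translated in context, which is the main use of a bilingual concordance.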

Bitexts and translation memories

Bitexts have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow preserving the original order of sentences.
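
To make the TMX case concrete, here is a minimal sketch that writes aligned sentence pairs as translation units in document order, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`, `xml:lang`) follow the TMX 1.4 layout as commonly used, but the header values are placeholders and the `build_tmx` function is a hypothetical helper; check the TMX specification before relying on the output.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs: Iterable[Tuple[str, str]], src_lang: str = "en", tgt_lang: str = "fr") -> str:
    """Serialize (source, target) pairs as TMX translation units, keeping their order."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values below are illustrative placeholders, not required settings.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")]))
```

Writing the units in the order of the source text is what lets a file of this kind retain sentence order, although a tool reading it is free to ignore that order.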

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance.

In his 1988 article introducing the concept, Harris also posited that a bitext represents how translators hold their source and target texts together in their working memory as they progress. However, this hypothesis has not been followed up. [5]

Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and TradooIT. [6] [7] [8]

References

  1. Chan, Sin-Wai (2015). Routledge Encyclopedia of Translation Technology. London: Routledge. ISBN 978-1-315-74912-9.
  2. Williams, Philip; Sennrich, Rico; Post, Matt; Koehn, Philipp (2016). Syntax-based Statistical Machine Translation. Morgan & Claypool. ISBN 978-1-62705-502-4.
  3. Abdallah, A. (2021). "Impact of using parallel text strategy on teaching reading to intermediate II level students". International Journal on Social and Education Sciences (IJonSES). 3 (1): 95–108. doi:10.46328/ijonses.48.
  4. Wołk, Krzysztof (2015). "Noisy-Parallel and Comparable Corpora Filtering Methodology for the Extraction of Bi-Lingual Equivalent Data at Sentence Level". Computer Science. 16 (2): 169–184. arXiv:1510.04500. Bibcode:2015arXiv151004500W. doi:10.7494/csci.2015.16.2.169. S2CID 12860633.
  5. Harris, B. (March 1988). "Bi-Text, A New Concept in Translation Theory" (PDF). Language Monthly. 54: 8–10. Archived from the original (PDF) on 2018-03-02.
  6. Genette, Marie (2016). How Reliable Are Online Bilingual Concordancers? An investigation of Linguee, TradooIT, WeBiText and ReversoContext and Their Reliability Through a Contrastive Analysis of Complex Prepositions from French to English (M.A. thesis). Université catholique de Louvain & Universitetet i Oslo. hdl:10852/51577.
  7. "TradooIT – Concordancier bilingue".
  8. Désilets, Alain; Farley, Benoît; Stojanović, Marta; Patenaude, Geneviève (2008). WeBiText: Building Large Heterogeneous Translation Memories from Parallel Web Content (PDF). Proceedings of Translating and the Computer. Vol. 30. pp. 27–28. S2CID 14586900.
