Apertium

Stable release 3.8.3 [1] / 1 November 2022
Repository github.com/apertium
Written in C++
Operating system POSIX compatible and Windows NT (limited support)
Available in 35 languages, see below
Type Rule-based machine translation
License GNU General Public License
Website www.apertium.org

Apertium is a free/open-source rule-based machine translation platform. It is free software and released under the terms of the GNU General Public License.

Overview

Apertium is a transfer-based machine translation system. It uses finite state transducers for all of its lexical transformations, and Constraint Grammar taggers together with hidden Markov models or perceptrons for part-of-speech tagging (word category disambiguation). [2] A structural transfer component is responsible for word movement and agreement; most Apertium language pairs to date have used "chunking", or shallow-transfer, rules, though newer pairs use (possibly recursive) rules defined in a context-free grammar. [3]
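As a rough illustration of what a finite state transducer does in this setting, the sketch below pairs surface forms with lexical forms (a lemma followed by tags in angle brackets) and runs the same machine in both directions. It is a toy under stated assumptions: the class name `ToyTransducer`, its methods and the tags are invented for the example, each entry gets its own linear path, and no state sharing or minimisation is attempted. It is not how Apertium's own tools are implemented.

```python
from collections import defaultdict


class ToyTransducer:
    """A deliberately naive finite state transducer: one linear path per entry."""

    def __init__(self):
        # state -> [(surface_symbol, lexical_symbol, next_state)]
        self.arcs = defaultdict(list)
        self.finals = set()
        self.next_state = 1  # state 0 is the start state

    def add_entry(self, surface, lexical):
        """Add a path pairing surface symbols with lexical symbols ("" = epsilon)."""
        state = 0
        for i in range(max(len(surface), len(lexical))):
            s = surface[i] if i < len(surface) else ""
            l = lexical[i] if i < len(lexical) else ""
            self.arcs[state].append((s, l, self.next_state))
            state = self.next_state
            self.next_state += 1
        self.finals.add(state)

    def _run(self, symbols, generate):
        """Depth-first search over all paths that consume `symbols`."""
        results, stack = [], [(0, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.append(out)
            for s, l, nxt in self.arcs.get(state, []):
                # Generation reads the lexical side and writes the surface side.
                read, write = (l, s) if generate else (s, l)
                if read == "":
                    stack.append((nxt, pos, out + write))
                elif pos < len(symbols) and symbols[pos] == read:
                    stack.append((nxt, pos + 1, out + write))
        return sorted(results)

    def analyse(self, word):
        return self._run(list(word), generate=False)

    def generate(self, lexical_symbols):
        return self._run(lexical_symbols, generate=True)


# Hypothetical entries; the tags are illustrative only.
fst = ToyTransducer()
fst.add_entry(list("cats"), ["c", "a", "t", "<n>", "<pl>"])
fst.add_entry(list("cat"), ["c", "a", "t", "<n>", "<sg>"])
fst.add_entry(list("walks"), ["w", "a", "l", "k", "<n>", "<pl>"])
fst.add_entry(list("walks"), ["w", "a", "l", "k", "<vblex>", "<pres>"])

print(fst.analyse("walks"))
print(fst.generate(["c", "a", "t", "<n>", "<pl>"]))
```

Analysing "walks" returns two readings, which is the kind of ambiguity a part-of-speech tagger then has to resolve; generating from a lexical form runs the same arcs with input and output swapped.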

Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and follow a language-independent specification, which makes it easier to contribute to Apertium, speeds up development, and supports the project's overall growth.

As of December 2020, Apertium had released 51 stable language pairs, [4] delivering fast translation with reasonably intelligible results (errors are easily corrected). As an open-source project, Apertium provides tools for prospective developers to build their own language pairs and contribute to the project.

History

Apertium originated as one of the machine translation engines in the project OpenTrad, which was funded by the Spanish government, and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has recently been expanded to treat more divergent language pairs. To create a new machine translation system, one just has to develop linguistic data (dictionaries, rules) in well-specified XML formats.
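To give a feel for what such XML linguistic data can look like, the sketch below parses a small bilingual dictionary fragment with Python's standard library. The fragment is a simplified, hand-written example modelled on Apertium's bilingual dictionary layout; the entries, the helper names `side` and `load_bidix`, and the assumption that each entry is a plain left/right pair are illustrative, so check the element names against the Apertium documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment; real dictionaries contain far more structure.
BIDIX_XML = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="m"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
  </section>
</dictionary>
"""


def side(element):
    """Return (lemma, tags) for one left or right side of an entry."""
    tags = tuple(s.get("n") for s in element.findall("s"))
    return element.text or "", tags


def load_bidix(xml_text):
    """Map each source-language (lemma, tags) to its target-language (lemma, tags)."""
    table = {}
    for pair in ET.fromstring(xml_text).iter("p"):
        table[side(pair.find("l"))] = side(pair.find("r"))
    return table


table = load_bidix(BIDIX_XML)
print(table[("dog", ("n",))])
```

Keeping the data in a declarative file like this is what lets a new language pair be added without touching the engine: the program above never mentions a specific language.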

Language data developed for it (in collaboration with the Universidade de Vigo, the Universitat Politècnica de Catalunya and the Universitat Pompeu Fabra) currently support (in stable version) the Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian (Bokmål and Nynorsk), Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh languages. A full list is available below. Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa.

The project has taken part in the 2009, [5] 2010, [6] 2011, [7] 2012, [8] 2013 [9] and 2014 [10] editions of Google Summer of Code and the 2010, [11] 2011, [12] 2012, [13] 2013, [14] 2014, [15] 2015, [16] 2016 [17] and 2017 [18] editions of Google Code-in.

Translation methodology

Figure: Pipeline of the Apertium machine translation system

This is an overall, step-by-step view of how Apertium works.

The diagram displays the steps that Apertium takes to translate a source-language text (the text we want to translate) into a target-language text (the translated text).

  1. Source language text is passed into Apertium for translation.
  2. The deformatter removes formatting markup (HTML, RTF, etc.) that should be kept in place but not translated.
  3. The morphological analyser segments the text (expanding elisions, marking set phrases, etc.) and looks up the segments in the language dictionaries, returning dictionary forms and tags for all matches. Pairs that involve agglutinative morphology, including a number of Turkic languages, use Helsinki Finite-State Technology (HFST); the others use lttoolbox, [19] an Apertium-specific finite state transducer system.
  4. The morphological disambiguator resolves ambiguous segments (i.e., those with more than one match) by choosing one match; together, the morphological analyser and the morphological disambiguator form the part-of-speech tagger. Apertium uses Constraint Grammar rules (with the vislcg3 parser [20] ) for most of its language pairs.
  5. Retokenisation uses a finite state transducer to match sequences of lexical units and may reorder or translate tags. It is often used to bring idiomatic expressions closer to the target-language grammar.
  6. Lexical transfer looks up disambiguated source-language basewords to find their target-language equivalents (i.e., mapping source language to target language). For lexical transfer, Apertium uses an XML-based dictionary format called bidix. [21]
  7. Lexical selection chooses between alternative translations when the source text word has alternative meanings. Apertium uses a specific XML-based technology, apertium-lex-tools, [22] to perform lexical selection.
  8. Structural transfer applies rules written in an XML format that allows complex structural transfer rules to be expressed. It can consist of one-step chunking transfer, three-step chunking transfer or a CFG-based transfer module. The chunking modules flag grammatical differences between the source language and the target language (e.g. gender or number agreement) by creating a sequence of chunks containing markers for them, and then reorder or modify the chunks to produce a grammatical translation in the target language. The newer CFG-based module parses input sequences into possible parse trees, selects the best-ranking one and applies transformation rules to that tree.
  9. The morphological generator uses the tags to deliver the correct target language surface form. The morphological generator is a morphological transducer, [23] just like the morphological analyser. A morphological transducer both analyses and generates forms.
  10. The post-generator makes any necessary orthographic changes due to the contact of words (e.g. elisions).
  11. The reformatter restores the formatting markup (HTML, RTF, etc.) that the deformatter removed in step 2.
  12. Apertium delivers the target-language translation.
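The steps above can be sketched as a chain of small functions. This is a minimal toy for a three-word English to Spanish example, not Apertium's implementation: all dictionaries, function names and the single disambiguation rule are invented for illustration, lexical and structural transfer are folded into one function, and retokenisation, lexical selection and post-generation are left out.

```python
import re

# Toy data for an English -> Spanish demo; every entry here is invented.
ANALYSES = {
    "the": [("the", "det")],
    "red": [("red", "adj")],
    "house": [("house", "vblex"), ("house", "n")],  # ambiguous on purpose
}
BIDIX = {("the", "det"): "el", ("red", "adj"): "rojo", ("house", "n"): "casa"}
GENDER = {"casa": "f"}
GENERATOR = {
    ("el", "det", "f"): "la",
    ("rojo", "adj", "f"): "roja",
    ("casa", "n", "f"): "casa",
}


def deformat(text):
    """Deformatter: split markup from text so only the text is translated."""
    return re.split(r"(<[^>]+>)", text)


def analyse(word):
    """Morphological analysis: return every (lemma, tag) reading."""
    return ANALYSES.get(word.lower(), [(word, "unknown")])


def disambiguate(readings, previous_tag):
    """Disambiguation: one hand-written rule, a noun is preferred after det/adj."""
    if previous_tag in ("det", "adj"):
        nouns = [r for r in readings if r[1] == "n"]
        if nouns:
            return nouns[0]
    return readings[0]


def transfer(units):
    """Lexical transfer, then noun-adjective reordering and gender agreement."""
    out = [(BIDIX.get(u, u[0]), u[1]) for u in units]
    for i in range(len(out) - 1):
        if out[i][1] == "adj" and out[i + 1][1] == "n":
            out[i], out[i + 1] = out[i + 1], out[i]
    gender = next((GENDER[l] for l, t in out if t == "n" and l in GENDER), "m")
    return [(l, t, gender) for l, t in out]


def generate(units):
    """Morphological generation: look up the surface form for lemma plus tags."""
    return " ".join(GENERATOR.get(u, u[0]) for u in units)


def translate(text):
    pieces = []
    for piece in deformat(text):
        if piece.startswith("<") or not piece.strip():
            pieces.append(piece)  # markup and whitespace pass through untouched
            continue
        units, prev = [], None
        for word in piece.split():
            reading = disambiguate(analyse(word), prev)
            units.append(reading)
            prev = reading[1]
        pieces.append(generate(transfer(units)))
    return "".join(pieces)  # reformatter: markup goes back where it was


print(translate("<b>the red house</b>"))
```

Each stage only consumes the output of the previous one, which is why individual modules (for example the tagger or the transfer component) can be swapped per language pair.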

Language pairs

The currently stable language pairs are listed below, one line per language, with its language code in parentheses. The arrow after each partner language gives the translation direction relative to the language that starts the line: (→) from it into the partner, (←) from the partner into it, (⇄) in both directions. Combinations that are not listed have no stable pair.

Afrikaans (af): Dutch (⇄)
Arabic (ar): Maltese (←)
Aragonese (an): Catalan (⇄), Spanish (⇄)
Asturian (ast): Spanish (⇄)
Basque (eu): English (→), Spanish (→)
Breton (br): French (→)
Bulgarian (bg): Macedonian (⇄)
Catalan (ca): Aragonese (⇄), English (⇄), Esperanto (→), French (⇄), Italian (←), Occitan (⇄), Portuguese (⇄), Sardinian (→), Spanish (⇄)
Danish (da): Norwegian Bokmål (⇄), Norwegian Nynorsk (⇄), Swedish (←)
Dutch (nl): Afrikaans (⇄)
English (en): Basque (←), Catalan (⇄), Esperanto (⇄), Galician (⇄), Icelandic (←), Macedonian (←), Serbo-Croatian (←), Spanish (⇄), Welsh (←)
Esperanto (eo): Catalan (←), English (⇄), French (←), Spanish (←)
Finnish (fi): German (⇄)
French (fr): Breton (←), Catalan (⇄), Esperanto (→), Occitan (→), Spanish (⇄)
Galician (gl): English (⇄), Portuguese (⇄), Spanish (⇄)
German (de): Finnish (⇄)
Hindi (hin): Urdu (⇄)
Icelandic (is): English (→), Swedish (⇄)
Indonesian (id): Malaysian (⇄)
Italian (it): Catalan (→), Sardinian (⇄)
Kazakh (kaz): Tatar (⇄)
Macedonian (mk): Bulgarian (⇄), English (→), Serbo-Croatian (←)
Malaysian (ms): Indonesian (⇄)
Maltese (mt): Arabic (→)
Northern Sami (sme): Norwegian Bokmål (→)
Norwegian (Bokmål) (nb): Danish (⇄), Northern Sami (←), Norwegian Nynorsk (⇄)
Norwegian (Nynorsk) (nn): Danish (⇄), Norwegian Bokmål (⇄)
Occitan (oc): Catalan (⇄), French (←), Spanish (⇄)
Portuguese (pt): Catalan (⇄), Galician (⇄), Spanish (⇄)
Romanian (ro): Spanish (→)
Sardinian (sc): Catalan (←), Italian (⇄)
Serbo-Croatian (hbs): English (→), Macedonian (→), Slovenian (⇄)
Slovenian (slv): Serbo-Croatian (⇄)
Spanish (es): Aragonese (⇄), Asturian (⇄), Basque (←), Catalan (⇄), English (⇄), Esperanto (→), French (⇄), Galician (⇄), Occitan (⇄), Portuguese (⇄), Romanian (←)
Swedish (sv): Danish (→), Icelandic (⇄)
Tatar (tat): Kazakh (⇄)
Urdu (urd): Hindi (⇄)
Welsh (cy): English (→)

Notes

  1. "Release 3.8.3". 1 November 2022. Retrieved 2 March 2023.
  2. Tyers, Francis M. (2010). "Rule-based Breton to French machine translation". Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT10), pp. 174–181. Archived 2016-11-17 at the Wayback Machine.
  3. Khanna, Tanmai; Washington, Jonathan N.; Tyers, Francis M.; Bayatlı, Sevilay; Swanson, Daniel G.; Pirinen, Tommi A.; Tang, Irene; Alòs i Font, Hèctor (1 December 2021). "Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages". Machine Translation. 35 (4): 475–502. doi:10.1007/s10590-021-09260-6.
  4. "Apertium".
  5. "Accepted organizations for Google Summer of Code 2009".
  6. "Accepted organizations for Google Summer of Code 2010".
  7. "Accepted organizations for Google Summer of Code 2011".
  8. "Accepted organizations for Google Summer of Code 2012".
  9. "Accepted organizations for Google Summer of Code 2013".
  10. "Accepted organizations for Google Summer of Code 2014".
  11. "Accepted organizations for Google Code-in 2010".
  12. "Accepted organizations for Google Code-in 2011".
  13. "Accepted organizations for Google Code-in 2012".
  14. "Accepted organizations for Google Code-in 2013".
  15. "Accepted organizations for Google Code-in 2014".
  16. "Accepted organizations for Google Code-in 2015".
  17. "Accepted organizations for Google Code-in 2016".
  18. "Accepted organizations for Google Code-in 2017".
  19. "Lttoolbox - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
  20. "VISL". beta.visl.sdu.dk. Retrieved 2016-01-19.
  21. "Bilingual dictionary - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
  22. "Constraint-based lexical selection module - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
  23. "Morphological dictionary - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
