Apertium

	Apertium-tolk, a simple desktop user interface for Apertium that translates as the user types
Stable release	3.9.4 / 28 December 2023
Repository	github.com/apertium
Written in	C++
Operating system	POSIX compatible and Windows NT (limited support)
Available in	35 languages, see below
Type	Rule-based machine translation
License	GNU General Public License
Website	www.apertium.org

Last updated August 25, 2024

Apertium is a free/open-source rule-based machine translation platform. It is free software and released under the terms of the GNU General Public License.

Overview

Apertium is a transfer-based machine translation system that uses finite-state transducers for all of its lexical transformations, and Constraint Grammar taggers as well as hidden Markov models or perceptrons for part-of-speech tagging (word category disambiguation).^[2] A structural transfer component is responsible for word movement and agreement; most Apertium language pairs to date have used "chunking" or shallow transfer rules, though newer pairs use (possibly recursive) rules defined in a context-free grammar.^[3]
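To make the role of finite-state transducers more concrete, the following is a minimal Python sketch of a letter transducer that maps surface forms to lexical forms (lemma plus tags). It is an illustration only and does not reflect how Apertium's own tools are implemented: the `Transducer` class, its method names, the tag names and the three demo entries are all assumptions made for this example.

```python
from collections import defaultdict


class Transducer:
    """Toy non-deterministic letter transducer (no epsilon transitions)."""

    def __init__(self):
        # transitions[state][input_char] -> list of (output_string, next_state)
        self.transitions = defaultdict(lambda: defaultdict(list))
        self.finals = set()
        self.state_count = 1  # state 0 is the start state

    def add_entry(self, surface, lexical):
        """Add a path that reads `surface` and writes `lexical`.

        For simplicity every character outputs nothing except the last one,
        which outputs the whole lexical form.
        """
        state = 0
        for i, ch in enumerate(surface):
            out = lexical if i == len(surface) - 1 else ""
            for existing_out, nxt in self.transitions[state][ch]:
                if existing_out == out:
                    state = nxt  # reuse a shared prefix
                    break
            else:
                nxt = self.state_count
                self.state_count += 1
                self.transitions[state][ch].append((out, nxt))
                state = nxt
        self.finals.add(state)

    def lookup(self, word):
        """Return every lexical form whose path accepts `word`."""
        frontier = [(0, "")]  # (current state, output written so far)
        for ch in word:
            frontier = [
                (nxt, written + out)
                for state, written in frontier
                for out, nxt in self.transitions.get(state, {}).get(ch, [])
            ]
        return sorted(w for state, w in frontier if state in self.finals)


# Hypothetical demo entries; the tag names are illustrative.
demo = Transducer()
demo.add_entry("cat", "cat<n><sg>")
demo.add_entry("cats", "cat<n><pl>")
demo.add_entry("sleep", "sleep<n><sg>")
demo.add_entry("sleep", "sleep<vblex><inf>")
```

Here `demo.lookup("sleep")` returns two readings, which is the kind of ambiguity that a part-of-speech tagger then has to resolve.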

Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and use a language-independent specification, which makes it easier to contribute to Apertium, speeds up development, and supports the project's overall growth.

As of December 2020, Apertium had released 51 stable language pairs,^[4] delivering fast translation with reasonably intelligible results whose errors are easily corrected. Being an open-source project, Apertium provides tools for potential developers to build their own language pairs and contribute to the project.

History

Apertium originated as one of the machine translation engines in the project OpenTrad, which was funded by the Spanish government and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has since been expanded to handle more divergent language pairs. To create a new machine translation system, one only has to develop linguistic data (dictionaries, rules) in well-specified XML formats.
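As a rough illustration of what working with such XML dictionary data can look like, the sketch below parses a small bilingual-dictionary fragment with Python's standard library. The fragment is simplified, and its element and attribute names (`e`, `p`, `l`, `r`, `s`, `n`) are written from memory of Apertium's dictionary format, so they are assumptions to be checked against the project's documentation. The helper names `side` and `load_bidix` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment modelled on a bilingual dictionary.
BIDIX_XML = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>cat<s n="n"/></l><r>gato<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/></r></p></e>
  </section>
</dictionary>
"""


def side(elem):
    """Return (lemma, tags) for one side (<l> or <r>) of an entry."""
    lemma = elem.text or ""
    tags = tuple(s.get("n") for s in elem.findall("s"))
    return lemma, tags


def load_bidix(xml_text):
    """Build a source -> target lookup table from the <p> pairs."""
    root = ET.fromstring(xml_text)
    table = {}
    for pair in root.iter("p"):
        table[side(pair.find("l"))] = side(pair.find("r"))
    return table


bidix = load_bidix(BIDIX_XML)
```

With this table, `bidix[("cat", ("n",))]` gives `("gato", ("n",))`; adding a new word to the toy dictionary only requires another `<e>` entry, not a code change.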

Language data developed for it (in collaboration with the Universidade de Vigo, the Universitat Politècnica de Catalunya and the Universitat Pompeu Fabra) currently support, in stable versions, Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian (Bokmål and Nynorsk), Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh. A full list is available below. Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa.

The project has taken part in the 2009,^[5] 2010,^[6] 2011,^[7] 2012,^[8] 2013^[9] and 2014^[10] editions of Google Summer of Code and the 2010,^[11] 2011,^[12] 2012,^[13] 2013,^[14] 2014,^[15] 2015,^[16] 2016^[17] and 2017^[18] editions of Google Code-In.

Translation methodology

Figure: Pipeline of the Apertium machine translation system (Apertium-pipeline.png)

This is an overall, step-by-step view of how Apertium works.

The diagram displays the steps that Apertium takes to translate a source-language text (the text we want to translate) into a target-language text (the translated text).

Source language text is passed into Apertium for translation.
The deformatter separates out formatting markup (HTML, RTF, etc.) that should be kept in place but not translated.
The morphological analyser segments the text (expanding elisions, marking set phrases, etc.), and looks up segments in the language dictionaries, returning dictionary forms and tags for all matches. In pairs that involve agglutinative morphology, including a number of Turkic languages, a Helsinki Finite State Transducer (HFST) is used. Otherwise, an Apertium-specific finite state transducer system called lttoolbox^[19] is used.
The morphological disambiguator resolves ambiguous segments (i.e., those with more than one match) by choosing one match; together, the morphological analyser and the morphological disambiguator form the part-of-speech tagger. Apertium uses Constraint Grammar rules (with the vislcg3 parser^[20]) for most of its language pairs.
Retokenisation uses a finite state transducer to match sequences of lexical units and may reorder or translate tags (often used for translating idiomatic expressions into something closer to the target-language grammar).
Lexical transfer looks up disambiguated source-language basewords to find their target-language equivalents (i.e., mapping source language to target language). For lexical transfer, Apertium uses an XML-based dictionary format called bidix.^[21]
Lexical selection chooses between alternative translations when the source text word has alternative meanings. Apertium uses a specific XML-based technology, apertium-lex-tools,^[22] to perform lexical selection.
Structural transfer (whose rules are written in an XML format that allows complex structural transfer rules) can consist of one-step chunking transfer, three-step chunking transfer or a CFG-based transfer module. The chunking modules flag grammatical differences between the source language and target language (e.g. gender or number agreement) by creating a sequence of chunks containing markers for this. They then reorder or modify chunks in order to produce a grammatical translation in the target language. The newer CFG-based module matches input sequences into possible parse trees, selecting the best-ranking one and applying transformation rules on the tree.
The morphological generator uses the tags to deliver the correct target language surface form. The morphological generator is a morphological transducer,^[23] just like the morphological analyser. A morphological transducer both analyses and generates forms.
The post-generator makes any necessary orthographic changes due to the contact of words (e.g. elisions).
The reformatter replaces formatting markup (HTML, RTF, etc.) that was removed by the deformatter in the first step.
Apertium delivers the target-language translation.
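The steps above can be sketched as a chain of small functions. The following Python toy is only an illustration under strong assumptions: the dictionaries, tag names and the single disambiguation heuristic are invented for the example, retokenisation, lexical selection, structural transfer and post-generation are omitted, and the determiner is hard-wired to a plural form because no agreement rules are modelled. None of the function names correspond to real Apertium modules.

```python
import re

# Toy monolingual dictionary: surface form -> list of (lemma, tags) readings.
ANALYSES = {
    "the": [("the", ("det",))],
    "cats": [("cat", ("n", "pl"))],
    "sleep": [("sleep", ("n", "sg")), ("sleep", ("vblex", "pres"))],
}
# Toy bilingual dictionary: (source lemma, first tag) -> target lemma.
BIDIX = {("the", "det"): "los", ("cat", "n"): "gato", ("sleep", "vblex"): "dormir"}
# Toy generation table: (target lemma, tags) -> target surface form.
GENERATE = {
    ("los", ("det",)): "los",
    ("gato", ("n", "pl")): "gatos",
    ("dormir", ("vblex", "pres")): "duermen",
}


def analyse(word):
    """Morphological analysis: return every reading of a word."""
    return ANALYSES.get(word.lower(), [(word, ("unknown",))])


def disambiguate(readings_per_word):
    """Choose one reading per word; prefer a verb right after a noun."""
    chosen = []
    for readings in readings_per_word:
        pick = readings[0]
        if chosen and chosen[-1][1][0] == "n":
            pick = next((r for r in readings if r[1][0] == "vblex"), pick)
        chosen.append(pick)
    return chosen


def transfer(reading):
    """Lexical transfer: map the source lemma to a target lemma."""
    lemma, tags = reading
    return BIDIX.get((lemma, tags[0]), lemma), tags


def generate(reading):
    """Morphological generation: lemma plus tags -> surface form."""
    return GENERATE.get(reading, reading[0])


def translate_text(text):
    """Run the toy analysis-to-generation chain on markup-free text."""
    chosen = disambiguate([analyse(w) for w in text.split()])
    out = " ".join(generate(transfer(r)) for r in chosen)
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return lead + out + trail


def translate(markup):
    """Deformat, translate the text segments, then reformat."""
    parts = re.split(r"(<[^>]+>)", markup)  # odd indexes are tags
    return "".join(
        p if i % 2 == 1 or not p.strip() else translate_text(p)
        for i, p in enumerate(parts)
    )
```

For example, `translate("<b>the cats sleep</b>")` keeps the markup in place and translates only the text between the tags. Because each text segment is disambiguated on its own, markup in the middle of a sentence cuts off context in this toy, which is one reason the sketch should not be read as a description of the real deformatter.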

Supported languages

As of August 2024, Apertium supports the following 108 pairs, covering 50 languages and language varieties.

Afrikaans to Dutch
Arabic to Maltese
Aragonese to Catalan
Aragonese to Spanish
Arpitan (Franco-Provençal) to French
Basque to English
Basque to Spanish
Belarusian to Russian
Breton to French
Bulgarian to Macedonian
Catalan to Aragonese
Catalan to English
Catalan to Esperanto
Catalan to French
Catalan to Italian
Catalan to Occitan
Catalan to Aranese
Catalan to Portuguese
Catalan to Brazilian Portuguese
Catalan to European Portuguese (traditional spelling)
Catalan to Romanian
Catalan to Sardinian
Catalan to Spanish
Crimean Tatar to Turkish
Danish to Norwegian (Bokmål)
Danish to Norwegian (Nynorsk)
Danish to Swedish
Dutch to Afrikaans
English to Catalan
English to Valencian
English to Esperanto
English to Galician
English to Serbo-Croatian
English to Spanish
Esperanto to English
French to Arpitan (Franco-Provençal)
French to Catalan
French to Esperanto
French to Occitan
French to Gascon
French to Spanish
Galician to English
Galician to Portuguese
Galician to Spanish
Hindi to Urdu
Icelandic to English
Icelandic to Swedish
Indonesian to Malay
Italian to Catalan
Italian to Sardinian
Italian to Spanish
Kazakh to Tatar
Macedonian to Bulgarian
Macedonian to English
Malay to Indonesian
Maltese to Arabic
Northern Sámi to Norwegian (Bokmål)
Norwegian (Bokmål) to Danish
Norwegian (Bokmål) to Norwegian (Nynorsk)
Norwegian (Bokmål) to East Norwegian, vi→vi
Norwegian (Bokmål) to Swedish
Norwegian (Nynorsk) to Danish
Norwegian (Nynorsk) to Norwegian (Bokmål)
Norwegian (Nynorsk) to East Norwegian, vi→vi
Norwegian (Nynorsk) to Swedish
East Norwegian, vi→vi to Norwegian (Nynorsk)
Occitan to Catalan
Occitan to French
Occitan to Spanish
Aranese to Catalan
Aranese to Spanish
Gascon to French
Polish to Silesian
Portuguese to Catalan
Portuguese to Galician
Portuguese to Spanish
Romanian to Catalan
Romanian to Spanish
Russian to Belarusian
Russian to Ukrainian
Sardinian to Italian
Serbo-Croatian to English
Serbo-Croatian to Macedonian
Serbo-Croatian to Slovenian
Silesian to Polish
Slovenian to Serbo-Croatian
Spanish to Aragonese
Spanish to Asturian
Spanish to Catalan
Spanish to Valencian
Spanish to English
Spanish to Esperanto
Spanish to French
Spanish to Galician
Spanish to Italian
Spanish to Occitan
Spanish to Aranese
Spanish to Portuguese
Spanish to Brazilian Portuguese
Swedish to Danish
Swedish to Icelandic
Swedish to Norwegian (Bokmål)
Swedish to Norwegian (Nynorsk)
Tatar to Kazakh
Turkish to Crimean Tatar
Ukrainian to Russian
Urdu to Hindi
Welsh to English

Notes

[1] https://github.com/apertium/apertium/releases/tag/v3.9.4. 28 December 2023.
[2] Francis M. Tyers (2010) "Rule-based Breton to French machine translation" (archived 2016-11-17 at the Wayback Machine). Proceedings of the 14th Annual Conference of the European Association of Machine Translation, EAMT10, pp. 174–181.
[3] Khanna, Tanmai; Washington, Jonathan N.; Tyers, Francis M.; Bayatlı, Sevilay; Swanson, Daniel G.; Pirinen, Tommi A.; Tang, Irene; Alòs i Font, Hèctor (1 December 2021). "Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages". Machine Translation. 35 (4): 475–502. doi: 10.1007/s10590-021-09260-6. hdl: 10037/22990.
[4] "Apertium".
[5] "Accepted organizations for Google Summer of Code 2009".
[6] "Accepted organizations for Google Summer of Code 2010".
[7] "Accepted organizations for Google Summer of Code 2011".
[8] "Accepted organizations for Google Summer of Code 2012".
[9] "Accepted organizations for Google Summer of Code 2013".
[10] "Accepted organizations for Google Summer of Code 2014".
[11] "Accepted organizations for Google Code-in 2010".
[12] "Accepted organizations for Google Code-in 2011".
[13] "Accepted organizations for Google Code In 2012".
[14] "Accepted organizations for Google Code-in 2013".
[15] "Accepted organizations for Google Code-in 2014".
[16] "Accepted organizations for Google Code-in 2015".
[17] "Accepted organizations for Google Code-in 2016".
[18] "Accepted organizations for Google Code-in 2017".
[19] "Lttoolbox - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
[20] "VISL". beta.visl.sdu.dk. Retrieved 2016-01-19.
[21] "Bilingual dictionary - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
[22] "Constraint-based lexical selection module - Apertium". wiki.apertium.org. Retrieved 2016-01-19.
[23] "Morphological dictionary - Apertium". wiki.apertium.org. Retrieved 2016-01-19.

References

Corbí-Bellot, M. et al. (2005) "An open-source shallow-transfer machine translation engine for the romance languages of Spain" in Proceedings of the European Association for Machine Translation, 10th Annual Conference, Budapest 2005, pp. 79–86
Armentano-Oller, C. et al. (2006) "Open-source Portuguese-Spanish machine translation" in Lecture Notes in Computer Science 3960 [Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006], pp. 50–59.
Forcada, M. L. et al. (2010) "Documentation of the Open-Source Shallow-Transfer Machine Translation Platform Apertium" in Departament de Llenguatges i Sistemes Informatics, University of Alacant.
Forcada, M. L. et al. (2011) "Apertium: a free/open-source platform for rule-based machine translation" in Machine Translation. doi: 10.1007/s10590-011-9090-0

External links

End-user services and software

(All services are based on the Apertium engine)

Online translation websites

Apertium Translation home
Prompsit Translator (archived 2016-12-26 at archive.today)
PoliTraductor Translator
Universitat d'Alacant Translator
Universitat Oberta de Catalunya Translator (archived 2016-01-17 at the Wayback Machine)

