Stable release | |
---|---|
Repository | github |
Written in | C++ |
Operating system | POSIX compatible and Windows NT (limited support) |
Available in | 35 languages, see below |
Type | Rule-based machine translation |
License | GNU General Public License |
Website | www |
Apertium is a free and open-source rule-based machine translation platform, released under the terms of the GNU General Public License.
Apertium is a transfer-based machine translation system that uses finite-state transducers for all of its lexical transformations, and Constraint Grammar taggers together with hidden Markov models or perceptrons for part-of-speech tagging (word category disambiguation). [2] A structural transfer component is responsible for word movement and agreement; most Apertium language pairs to date have used "chunking" or shallow transfer rules, though newer pairs use (possibly recursive) rules defined in a context-free grammar. [3]
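To illustrate the idea of a finite-state transducer used for a lexical transformation, the following is a minimal sketch and not Apertium's actual implementation. The state numbers, the single-word lexicon, and the `<n><sg>` tag notation are assumptions chosen for the example. The transducer reads a surface form one letter at a time and emits a lemma followed by grammatical tags.

```python
# Toy letter transducer: maps a surface form to a lexical form (lemma + tags).
# All states, symbols and tag names here are illustrative assumptions.

# (current state, input letter) -> (output string, next state)
TRANSITIONS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "s"): ("<n><pl>", 4),  # plural "s" is rewritten as tags
}

# Accepting states, each with the output emitted when the input ends there.
FINALS = {3: "<n><sg>", 4: ""}


def analyse(surface):
    """Return the lexical form for `surface`, or None if it is not accepted."""
    state, output = 0, []
    for letter in surface:
        step = TRANSITIONS.get((state, letter))
        if step is None:
            return None  # no transition: the word is unknown
        emitted, state = step
        output.append(emitted)
    if state not in FINALS:
        return None  # input ended in a non-accepting state
    output.append(FINALS[state])
    return "".join(output)


print(analyse("cat"))   # cat<n><sg>
print(analyse("cats"))  # cat<n><pl>
```

A real morphological dictionary would be non-deterministic and return several readings for an ambiguous word; choosing among those readings is the job of the part-of-speech tagger.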
Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and use a language-independent specification, which makes it easier to contribute to Apertium, speeds up development, and supports the project's overall growth.
As of December 2020, Apertium has released 51 stable language pairs, [4] delivering fast translation with reasonably intelligible results (errors are easily corrected). As an open-source project, Apertium provides tools for potential developers to build their own language pair and contribute to the project.
Apertium originated as one of the machine translation engines in the project OpenTrad, which was funded by the Spanish government, and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has recently been expanded to treat more divergent language pairs. To create a new machine translation system, one just has to develop linguistic data (dictionaries, rules) in well-specified XML formats.
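As a rough illustration of linguistic data kept in XML, the sketch below parses a small bilingual dictionary fragment with Python's standard library. The element names (`dictionary`, `section`, `e`, `p`, `l`, `r`, `s`) and the two entries are a simplified layout assumed for this example. They should not be read as the authoritative Apertium schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified bilingual dictionary fragment (assumed layout).
DICTIONARY_XML = """
<dictionary>
  <section id="main">
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
    <e><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="m"/></r></p></e>
  </section>
</dictionary>
"""


def side(element):
    """Return (lemma, tags) for one side of a dictionary entry."""
    lemma = (element.text or "").strip()
    tags = [symbol.get("n") for symbol in element.findall("s")]
    return lemma, tags


def load_entries(xml_text):
    """Map each source lemma to its (target lemma, target tags) pair."""
    root = ET.fromstring(xml_text)
    table = {}
    for pair in root.iter("p"):
        source_lemma, _ = side(pair.find("l"))
        table[source_lemma] = side(pair.find("r"))
    return table


entries = load_entries(DICTIONARY_XML)
print(entries["house"])  # ('casa', ['n', 'f'])
```

Because the data lives in declarative files like this, a new language pair can be added by writing dictionaries and rules without changing the engine itself.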
Language data developed for it (in collaboration with the Universidade de Vigo, the Universitat Politècnica de Catalunya and the Universitat Pompeu Fabra) currently support (in stable version) the Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian (Bokmål and Nynorsk), Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh languages. A full list is available below. Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa.
The project has taken part in the 2009, [5] 2010, [6] 2011, [7] 2012, [8] 2013 [9] and 2014 [10] editions of Google Summer of Code and the 2010, [11] 2011, [12] 2012, [13] 2013, [14] 2014, [15] 2015, [16] 2016 [17] and 2017 [18] editions of Google Code-In.
This is an overall, step-by-step view of how Apertium works.
The diagram displays the steps that Apertium takes to translate a source-language text (the text we want to translate) into a target-language text (the translated text).
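The flow from source-language text to target-language text can be sketched as a chain of small functions. This is a toy under stated assumptions: the stage names, the three-word lexicon, and the single reordering rule are invented for illustration and do not reproduce Apertium's actual modules or data.

```python
# Toy translation pipeline: analysis -> disambiguation -> lexical transfer
# -> structural transfer -> generation. All data below is hypothetical.

# Source word -> possible (lemma, category) readings.
LEXICON = {
    "the": [("the", "det")],
    "white": [("white", "adj")],
    "house": [("house", "n"), ("house", "vb")],
}

# Source lemma -> target form. Agreement is faked by storing inflected forms.
BILINGUAL = {"the": "la", "white": "blanca", "house": "casa"}


def analyse(text):
    """Attach every known reading to each word; unknown words get 'unk'."""
    return [(w, LEXICON.get(w, [(w, "unk")])) for w in text.lower().split()]


def disambiguate(analyses):
    """Keep the first reading; a stand-in for a real part-of-speech tagger."""
    return [readings[0] for _, readings in analyses]


def lexical_transfer(units):
    """Replace each source lemma with its target-language equivalent."""
    return [(BILINGUAL.get(lemma, lemma), tag) for lemma, tag in units]


def structural_transfer(units):
    """Word movement: reorder adjective + noun into noun + adjective."""
    out, i = list(units), 0
    while i < len(out) - 1:
        if out[i][1] == "adj" and out[i + 1][1] == "n":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def generate(units):
    """Join the target forms into the output sentence."""
    return " ".join(form for form, _ in units)


def translate(text):
    units = disambiguate(analyse(text))
    return generate(structural_transfer(lexical_transfer(units)))


print(translate("the white house"))  # la casa blanca
```

Keeping each stage as a separate step with a plain intermediate representation is what allows individual stages to be replaced, for example swapping one tagger for another, without touching the rest of the chain.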
The following table lists the currently stable language pairs. Rows are labelled with language names and columns with the corresponding language codes, in the same order.
Language | af | ar | an | ast | eu | br | bg | ca | da | nl | en | eo | fi | fr | gl | de | hin | is | id | it | kaz | mk | ms | mt | sme | nb | nn | oc | pt | ro | sc | hbs | slv | es | sv | tat | urd | cy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afrikaans | — | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Arabic | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (←) | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Aragonese | No | No | — | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No |
Asturian | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No |
Basque | No | No | No | No | — | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No |
Breton | No | No | No | No | No | — | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Bulgarian | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Catalan | No | No | Yes (⇄) | No | No | No | No | — | No | No | Yes (⇄) | Yes (→) | No | Yes (⇄) | No | No | No | No | No | Yes (←) | No | No | No | No | No | No | No | Yes (⇄) | Yes (⇄) | No | Yes (→) | No | No | Yes (⇄) | No | No | No | No |
Danish | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | Yes (⇄) | No | No | No | No | No | No | No | Yes (←) | No | No | No |
Dutch | Yes (⇄) | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
English | No | No | No | No | Yes (←) | No | No | Yes (⇄) | No | No | — | Yes (⇄) | No | No | Yes (⇄) | No | No | Yes (←) | No | No | No | Yes (←) | No | No | No | No | No | No | No | No | No | Yes (←) | No | Yes (⇄) | No | No | No | Yes (←) |
Esperanto | No | No | No | No | No | No | No | Yes (←) | No | No | Yes (⇄) | — | No | Yes (←) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Finnish | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
French | No | No | No | No | No | Yes (←) | No | Yes (⇄) | No | No | No | Yes (→) | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | Yes (⇄) | No | No | No |
Galician | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | Yes (⇄) | No | No | No | No |
German | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Hindi | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No |
Icelandic | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No |
Indonesian | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Italian | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No |
Kazakh | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No |
Macedonian | No | No | No | No | No | No | Yes (⇄) | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | Yes (←) | No | No | No | No | No | No |
Malaysian | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Maltese | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Northern Sami | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | No |
Norwegian (Bokmål) | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (←) | — | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No |
Norwegian (Nynorsk) | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | — | No | No | No | No | No | No | No | No | No | No | No |
Occitan | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | Yes (←) | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | Yes (⇄) | No | No | No | No |
Portuguese | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | Yes (⇄) | No | No | No | No |
Romanian | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No | Yes (←) | No | No | No | No |
Sardinian | No | No | No | No | No | No | No | Yes (←) | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | — | No | No | No | No | No | No | No |
Serbo-Croatian | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | — | Yes (⇄) | No | No | No | No | No |
Slovenian | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | — | No | No | No | No | No |
Spanish | No | No | Yes (⇄) | Yes (⇄) | Yes (←) | No | No | Yes (⇄) | No | No | Yes (⇄) | Yes (→) | No | Yes (⇄) | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | Yes (⇄) | Yes (←) | No | No | No | — | No | No | No | No |
Swedish | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No | No |
Tatar | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No | No |
Urdu | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes (⇄) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — | No |
Welsh | No | No | No | No | No | No | No | No | No | No | Yes (→) | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | — |
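
The arrow markers in the table can be turned into explicit translation directions. The sketch below assumes each arrow is read from the row language toward the column language, which is consistent with mirrored cells such as Basque/English (→) and English/Basque (←). The function name and the input layout are hypothetical.

```python
def directed_pairs(row_code, cells):
    """Expand one table row into a set of (source, target) code pairs.

    `cells` maps a column language code to the cell text of that row.
    Assumption: arrows point from the row language to the column language.
    """
    pairs = set()
    for column_code, cell in cells.items():
        if "⇄" in cell:    # both directions are available
            pairs.add((row_code, column_code))
            pairs.add((column_code, row_code))
        elif "→" in cell:  # row language into column language only
            pairs.add((row_code, column_code))
        elif "←" in cell:  # column language into row language only
            pairs.add((column_code, row_code))
    return pairs


# Part of the Basque row, copied from the table above.
basque = directed_pairs("eu", {"en": "Yes (→)", "es": "Yes (→)", "fr": "No"})
print(sorted(basque))  # [('eu', 'en'), ('eu', 'es')]
```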
(All services are based on the Apertium engine)