Transfer-based machine translation

Bernard Vauquois' pyramid showing comparative depths of intermediary representation with interlingual machine translation at the peak, followed by transfer-based, then direct translation.

Transfer-based machine translation is a type of machine translation (MT). It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: analysis of the source language text to determine its grammatical structure, transfer of the resulting structure to a structure suitable for generating text in the target language, and finally generation of this text. Transfer-based MT systems are thus capable of using knowledge of the source and target languages.[1]

Design

Both transfer-based and interlingua-based machine translation rest on the same idea: to produce a translation, it is necessary to build an intermediate representation that captures the "meaning" of the original sentence, from which the correct translation can be generated. In interlingua-based MT this intermediate representation must be independent of the languages in question, whereas in transfer-based MT it has some dependence on the language pair involved.

The way in which transfer-based machine translation systems work varies substantially, but in general they follow the same pattern: they apply sets of linguistic rules which are defined as correspondences between the structure of the source language and that of the target language. The first stage involves analysing the input text for morphology and syntax (and sometimes semantics) to create an internal representation. The translation is generated from this representation using both bilingual dictionaries and grammatical rules.
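As a loose illustration of such a correspondence rule, the following Python sketch turns English determiner-adjective-noun phrases into the Spanish determiner-noun-adjective order. The tag set, the single reordering rule, and the tiny bilingual dictionary are all invented for this example:

```python
# A toy illustration of transfer rules as structural correspondences.
# The dictionary and the single reordering rule are invented examples,
# not data from any real system.

BILINGUAL_DICT = {"the": "la", "white": "blanca", "house": "casa"}

def analyse(sentence):
    """Crude analysis: tag each word (DET/ADJ/N) for the toy fragment."""
    tags = {"the": "DET", "white": "ADJ", "house": "N"}
    return [(word, tags[word]) for word in sentence.lower().split()]

def transfer(tagged):
    """Structural transfer: English DET-ADJ-N becomes Spanish DET-N-ADJ."""
    pattern = [tag for _, tag in tagged]
    if pattern == ["DET", "ADJ", "N"]:
        det, adj, noun = tagged
        tagged = [det, noun, adj]
    return [(BILINGUAL_DICT[word], tag) for word, tag in tagged]

def generate(transferred):
    """Generation: linearise the target-language structure."""
    return " ".join(word for word, _ in transferred)

print(generate(transfer(analyse("the white house"))))  # la casa blanca
```

A real system would of course draw on wide-coverage dictionaries and hundreds of such rules rather than a single hard-coded pattern.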

It is possible with this translation strategy to obtain fairly high-quality translations, with accuracy often cited in the region of 90%, although this figure depends heavily on the language pair in question, for example on the linguistic distance between the two languages.

Operation

In a rule-based machine translation system the original text is first analysed morphologically and syntactically in order to obtain a syntactic representation. This representation can then be refined to a more abstract level putting emphasis on the parts relevant for translation and ignoring other types of information. The transfer process then converts this final representation (still in the original language) to a representation of the same level of abstraction in the target language. These two representations are referred to as "intermediate" representations. From the target language representation, the stages are then applied in reverse.
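A minimal sketch of this idea, assuming invented representations at each level: analysis climbs from the surface text to an abstract predicate-argument structure, the transfer step maps that structure across languages, and generation applies the same kinds of stages in reverse:

```python
# Invented example of matching levels of abstraction on both sides.
# English "she reads books" -> Spanish "ella lee libros".

def analyse_syntax(text):
    # Surface -> shallow syntactic representation (subject, verb, object).
    subj, verb, obj = text.split()
    return {"subj": subj, "verb": verb, "obj": obj}

def abstract(syntax):
    # Shallow syntax -> abstract predicate-argument structure,
    # keeping only what the transfer step needs.
    return ("READ", syntax["subj"], syntax["obj"])

def transfer(abstract_src):
    # Map the abstract source structure to an abstract target structure.
    lex = {"READ": "leer", "she": "ella", "books": "libros"}
    pred, subj, obj = abstract_src
    return (lex[pred], lex[subj], lex[obj])

def generate(abstract_tgt):
    # Apply the analysis stages in reverse: abstract structure -> surface.
    verb_inflections = {"leer": "lee"}  # 3rd person singular; invented table
    verb, subj, obj = abstract_tgt
    return f"{subj} {verb_inflections[verb]} {obj}"

print(generate(transfer(abstract(analyse_syntax("she reads books")))))
# ella lee libros
```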

Analysis and transformation

Various methods of analysis and transformation can be used before obtaining the final result. These rule-based methods may also be augmented with statistical approaches, producing hybrid systems. Which methods are chosen, and how much emphasis each receives, depends largely on the design of the system; however, most systems include at least the following stages (a skeletal pipeline follows the list):

- Morphological analysis: surface forms of the input text are classified by part of speech (noun, verb, etc.) and sub-category (number, gender, tense, etc.), and each form is analysed into its lemma plus morphological information.
- Lexical categorisation: the context of each word is examined to resolve ambiguity, for example by part-of-speech tagging or word-sense disambiguation.
- Lexical transfer: essentially dictionary translation; the source-language lemma, possibly with sense information, is looked up in a bilingual dictionary and a translation is chosen.
- Structural transfer: while the previous stages deal with individual words, this stage deals with larger constituents such as phrases and chunks, handling, for example, agreement of gender and number and the re-ordering of words or phrases.
- Morphological generation: the target-language surface forms are generated from the output of the structural transfer stage.
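The skeletal pipeline below names these stages in order; every function body is a placeholder standing in for what would be a substantial rule-based component in a real system:

```python
# Skeleton of the typical stage sequence; each stage here is a stub
# standing in for a substantial rule-based component.

def morphological_analysis(text):
    return [(w, {"lemma": w}) for w in text.split()]    # stub

def lexical_categorisation(units):
    return [(w, {**f, "pos": "?"}) for w, f in units]   # stub

def lexical_transfer(units, bilingual_dict):
    return [(bilingual_dict.get(w, w), f) for w, f in units]

def structural_transfer(units):
    return units                                        # stub: reordering rules

def morphological_generation(units):
    return " ".join(w for w, _ in units)

def translate(text, bilingual_dict):
    units = morphological_analysis(text)
    units = lexical_categorisation(units)
    units = lexical_transfer(units, bilingual_dict)
    units = structural_transfer(units)
    return morphological_generation(units)

print(translate("dog", {"dog": "perro"}))  # perro
```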

Transfer types

One of the main features of transfer-based machine translation systems is a phase that "transfers" an intermediate representation of the text in the original language to an intermediate representation of the text in the target language. This can work at one of two levels of linguistic analysis, or somewhere in between. The levels are (a contrast is sketched after the list):

- Superficial (or syntactic) transfer: syntactic structures are transferred between the source and target languages. This level is suitable for closely related languages, for example pairs within the Romance family.
- Deep (or semantic) transfer: a semantic representation, dependent on the source language, is constructed; it may consist of a series of structures representing meaning, over which predicates are transferred. Translation at this level usually also requires structural transfer, and it is used for more distantly related language pairs.
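The following sketch contrasts the two levels on the invented example of English "I like her", which Spanish renders with the role-swapping verb gustar ("ella me gusta"): a node-by-node syntactic mapping preserves the English frame and goes wrong, while a predicate-level mapping can swap the arguments. The lexicon and structures are made up for illustration:

```python
# Invented contrast for English "I like her" -> Spanish "ella me gusta",
# where "gustar" swaps the roles taken by the English "like".

LEX = {"I": "yo", "her": "ella", "like": "gustar"}

def superficial_transfer(tree):
    """Syntactic transfer: keep the source structure, swap the words.
    This yields the wrong Spanish frame ('yo gustar ella')."""
    return {role: LEX[word] for role, word in tree.items()}

def deep_transfer(frame):
    """Semantic transfer: map the predicate-argument structure,
    so that the stimulus becomes the Spanish subject."""
    pred, experiencer, stimulus = frame                  # LIKE(exp, stim)
    return ("gustar", LEX[stimulus], LEX[experiencer])   # GUSTAR(subj, obj)

print(superficial_transfer({"subj": "I", "verb": "like", "obj": "her"}))
print(deep_transfer(("LIKE", "I", "her")))  # ('gustar', 'ella', 'yo')
```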

Related Research Articles

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language to create an executable program.

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

Natural language processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of tokens. A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
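For instance, a minimal regex-based lexer in Python might look as follows; the token set is invented for a toy expression language:

```python
import re

# A minimal regex-based lexer for a toy expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) tokens, discarding whitespace."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind != "SKIP":
            yield (kind, match.group())

print(list(tokenize("x = 40 + 2")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```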

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
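Continuing the toy example, a small recursive-descent parser can turn a lexer's token stream into a tree according to a two-rule grammar, again invented purely for illustration:

```python
# A tiny recursive-descent parser for the grammar
#   expr -> term (('+'|'-') term)*
#   term -> NUMBER
# operating on (kind, value) tokens like those from the lexer above.

def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][1] in "+-":
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)           # left-associative tree
    return node, pos

def parse_term(tokens, pos):
    kind, value = tokens[pos]
    assert kind == "NUMBER", f"expected a number, got {value!r}"
    return int(value), pos + 1

tokens = [("NUMBER", "1"), ("OP", "+"), ("NUMBER", "2"),
          ("OP", "-"), ("NUMBER", "3")]
print(parse_expr(tokens)[0])  # ('-', ('+', 1, 2), 3)
```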

Dictionary-based machine translation

Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does: word by word, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis or lemmatisation. While this approach to machine translation is probably the least sophisticated, dictionary-based machine translation is ideally suited to the translation of long lists of phrases at the subsentential level, e.g. inventories or simple catalogs of products and services.
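A sketch of the approach, with an invented English-Spanish glossary and a deliberately naive lemmatisation fallback:

```python
# Word-by-word dictionary translation with an invented glossary and a
# crude lemmatisation fallback for English plural nouns.

GLOSSARY = {"red": "rojo", "apple": "manzana", "green": "verde", "pear": "pera"}

def lookup(word):
    if word in GLOSSARY:
        return GLOSSARY[word]
    if word.endswith("s") and word[:-1] in GLOSSARY:   # naive lemmatisation
        return GLOSSARY[word[:-1]] + "s"
    return word                                        # leave unknown words

def translate_line(line):
    return " ".join(lookup(w) for w in line.lower().split())

# Works tolerably on list-like input, e.g. a product catalogue line:
print(translate_line("red apples"))  # rojo manzanas
```

Note that the output keeps English word order and ignores agreement ("rojo manzanas" rather than "manzanas rojas"), which is exactly the lack of correlation of meaning described above.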

An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" IR must be accurate – capable of representing the source code without loss of information – and independent of any particular source or target language. An IR may take one of several forms: an in-memory data structure, or a special tuple- or stack-based code readable by the program. In the latter case it is also called an intermediate language.
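As a hypothetical example of the tuple-based form, the assignment a = b + c * 2 might be lowered to the following invented three-address-style IR, which a simple virtual machine could then walk:

```python
# An invented tuple-based IR: a = b + c * 2 in three-address style.
ir = [
    ("mul", "t1", "c", 2),     # t1 = c * 2
    ("add", "t2", "b", "t1"),  # t2 = b + t1
    ("store", "a", "t2"),      # a = t2
]

def evaluate(ir, env):
    """Walk the tuple IR the way a simple virtual machine might."""
    for instr in ir:
        if instr[0] in ("mul", "add"):
            _, dst, x, y = instr
            x, y = env.get(x, x), env.get(y, y)  # variable or literal
            env[dst] = x * y if instr[0] == "mul" else x + y
        elif instr[0] == "store":
            _, dst, src = instr
            env[dst] = env[src]
    return env

print(evaluate(ir, {"b": 1, "c": 3})["a"])  # 7
```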

Interlingual machine translation

Interlingual machine translation is one of the classic approaches to machine translation. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingua, i.e. an abstract language-independent representation. The target language is then generated from the interlingua. Within the rule-based machine translation paradigm, the interlingual approach is an alternative to the direct approach and the transfer approach.

Apertium

Apertium is a free/open-source rule-based machine translation platform. It is free software and released under the terms of the GNU General Public License.

Language production is the production of spoken or written language. In psycholinguistics, it describes all of the stages between having a concept to express and translating that concept into linguistic form. These stages have been described in two types of processing models: the lexical access models and the serial models. Through these models, psycholinguists can look into how speech is produced in different ways, such as when the speaker is bilingual. Psycholinguists learn more about these models and different kinds of speech by using language production research methods that include collecting speech errors and elicited production tasks.

Sentence processing takes place whenever a reader or listener processes a language utterance, either in isolation or in the context of a conversation or a text. Many studies of the human language comprehension process have focused on reading of single utterances (sentences) without context. Extensive research has shown that language comprehension is affected by context preceding a given utterance as well as many other factors.

Meaning–text theory (MTT) is a theoretical linguistic framework, first put forward in Moscow by Aleksandr Žolkovskij and Igor Mel’čuk, for the construction of models of natural language. The theory provides a large and elaborate basis for linguistic description and, due to its formal character, lends itself particularly well to computer applications, including machine translation, phraseology, and lexicography.

Rule-based machine translation (RBMT) denotes machine translation systems based on linguistic information about the source and target languages, retrieved mainly from dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. Given input sentences, an RBMT system generates output sentences on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in the translation task.

PROMT is a Russian company focused on the development of machine translation systems. PROMT translators currently exist for more than 25 languages. PROMT is headquartered in Saint Petersburg, Russia, and also has offices in Moscow, San Francisco, and Hamburg.

ETAP-3 is a proprietary linguistic processing system focusing on English and Russian. It was developed in Moscow, Russia at the Institute for Information Transmission Problems. It is a rule-based system which uses the Meaning-Text Theory as its theoretical foundation. At present, there are several applications of ETAP-3, such as a machine translation tool, a converter of the Universal Networking Language, an interactive learning tool for Russian language learners and a syntactically annotated corpus of Russian language. Demo versions of some of these tools are available online.

MedSLT is a medium-ranged open source spoken language translator developed by the University of Geneva. It is funded by the Swiss National Science Foundation. The system has been designed for the medical domain. It currently covers the doctor-patient diagnosis dialogues for the domains of headache, chest and abdominal pain in English, French, Japanese, Spanish, Catalan and Arabic. The vocabulary used ranges from 350 to 1000 words depending on the domain and language pair.

The following outline is provided as an overview of and topical guide to natural language processing:

Machine translation (MT) algorithms may be classified by their operating principle. MT may be based on a set of linguistic rules, or on large bodies (corpora) of already existing parallel texts. Rule-based methodologies may consist in a direct word-by-word translation, or operate via a more abstract representation of meaning: a representation either specific to the language pair, or a language-independent interlingua. Corpora-based methodologies rely on machine learning and may follow specific examples taken from the parallel texts, or may calculate statistical probabilities to select a preferred option out of all possible translations.

The Integrational theory of language is the general theory of language that has been developed within the general linguistic approach of integrational linguistics.

References

  1. Jurafsky, Daniel; Martin, James H. (2009). Speech and Language Processing. Pearson. pp. 906–908.
