Hybrid machine translation


Hybrid machine translation is a method of machine translation characterized by the use of multiple machine translation approaches within a single system. The motivation for developing hybrid systems stems from the failure of any single technique to achieve a satisfactory level of accuracy. Many hybrid systems have succeeded in improving translation accuracy, and several popular machine translation systems employ hybrid methods.


Approaches

Multi-engine

This approach to hybrid machine translation involves running multiple machine translation systems in parallel. The final output is generated by combining the output of all the subsystems. Most commonly, these systems use statistical and rule-based translation subsystems,[1] but other combinations have been explored. For example, researchers at Carnegie Mellon University have had some success combining example-based, transfer-based, knowledge-based and statistical translation subsystems into one machine translation system.[2]
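As an illustration, one simple combination strategy is hypothesis selection: every subsystem translates the full input, and a shared scoring function picks the output to keep. The sketch below is a minimal toy version of this idea; the engine functions are hypothetical stand-ins rather than any real system's API, and the scorer is a tiny target-side bigram model estimated from a few invented sentences.

    from concurrent.futures import ThreadPoolExecutor
    from collections import Counter

    # Hypothetical stand-ins for the rule-based and statistical subsystems;
    # a real system would call out to full MT engines here.
    def rule_based_engine(src: str) -> str:
        return "the red house is big"

    def statistical_engine(src: str) -> str:
        return "the house red is big"

    # Toy target-side bigram model built from a tiny monolingual sample,
    # used only to rank the competing hypotheses.
    SAMPLE = "the red house is big . the big house is red .".split()
    BIGRAMS = Counter(zip(SAMPLE, SAMPLE[1:]))
    UNIGRAMS = Counter(SAMPLE)

    def lm_score(sentence: str) -> float:
        # Average smoothed bigram probability of the hypothesis.
        toks = sentence.split()
        score = 0.0
        for a, b in zip(toks, toks[1:]):
            # Add-one smoothing keeps unseen bigrams from zeroing the score.
            score += (BIGRAMS[(a, b)] + 1) / (UNIGRAMS[a] + len(UNIGRAMS))
        return score / max(len(toks) - 1, 1)

    def translate(src: str) -> str:
        engines = [rule_based_engine, statistical_engine]
        with ThreadPoolExecutor() as pool:  # run the subsystems in parallel
            hypotheses = list(pool.map(lambda e: e(src), engines))
        return max(hypotheses, key=lm_score)

    print(translate("la casa roja es grande"))  # picks the fluent hypothesis

Real multi-engine systems use far stronger selection models, and some combine engines below the sentence level by splicing together fragments from different hypotheses.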

Statistical rule generation

This approach involves using statistical data to generate lexical and syntactic rules. The input is then processed with these rules as if by a rule-based translator.[1] This approach attempts to avoid the difficult and time-consuming task of creating a comprehensive set of fine-grained linguistic rules by extracting those rules from the training corpus. It still suffers from many of the problems of ordinary statistical machine translation, namely that translation accuracy depends heavily on the similarity of the input text to the text of the training corpus. As a result, this technique has had the most success in domain-specific applications, and it has the same difficulties with domain adaptation as many statistical machine translation systems.[3]
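As a deliberately crude illustration of rule extraction, the sketch below derives one-word lexical rules from a tiny invented parallel corpus by co-occurrence counting and then applies them like a rule-based dictionary. Production systems instead estimate translation probabilities with alignment models (for example, the IBM models) and extract syntactic as well as lexical rules.

    from collections import Counter, defaultdict

    # Tiny invented parallel corpus of (source, target) pairs; a real system
    # would train on millions of aligned sentences.
    CORPUS = [
        ("la casa", "the house"),
        ("la puerta", "the door"),
        ("una casa roja", "a red house"),
        ("una puerta", "a door"),
        ("la casa roja", "the red house"),
    ]

    # Count how often each source word co-occurs with each target word.
    cooc = defaultdict(Counter)
    for src, tgt in CORPUS:
        for s in src.split():
            cooc[s].update(tgt.split())

    # Lexical rule: map each source word to its most frequent co-occurring
    # target word. (Real systems estimate probabilities, not raw counts.)
    RULES = {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}

    def translate(src: str) -> str:
        # Apply the extracted rules as a rule-based translator would,
        # copying unknown words through unchanged.
        return " ".join(RULES.get(w, w) for w in src.split())

    # Word order still follows the source: a known limitation of purely
    # lexical rules without syntactic ones.
    print(translate("la puerta roja"))  # -> "the door red"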

Multi-pass

This approach involves serially processing the input multiple times. The most common technique in multi-pass machine translation systems is to pre-process the input with a rule-based machine translation system, whose output is then passed to a statistical machine translation system that produces the final translation. This technique limits the amount of information the statistical system needs to consider, significantly reducing the processing power required. It also removes the need for the rule-based system to be a complete translation system for the language, significantly reducing the human effort needed to build the system.[4]
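A minimal two-pass sketch follows, with invented rewrite rules and an invented phrase table for illustration only. The first pass applies hand-written rules to normalize the source text; the second, "statistical" pass is represented by a greedy longest-match lookup in a phrase table that a real system would learn from a parallel corpus.

    import re

    # Pass 1: rule-based pre-processing. Hand-written rewrite rules normalize
    # the source so the statistical stage sees fewer surface variants.
    PRE_RULES = [
        (re.compile(r"\bdel\b"), "de el"),  # expand the contraction "del"
        (re.compile(r"\bal\b"), "a el"),    # expand the contraction "al"
    ]

    def rule_based_pass(text: str) -> str:
        for pattern, replacement in PRE_RULES:
            text = pattern.sub(replacement, text)
        return text

    # Pass 2: stand-in for the statistical stage, here a toy phrase table.
    PHRASE_TABLE = {
        "de el": "of the",
        "a el": "to the",
        "la casa": "the house",
        "voy": "I go",
        "mercado": "market",
    }

    def statistical_pass(text: str) -> str:
        words = text.split()
        out, i = [], 0
        while i < len(words):
            # Greedy longest-match lookup against the phrase table.
            for j in range(len(words), i, -1):
                phrase = " ".join(words[i:j])
                if phrase in PHRASE_TABLE:
                    out.append(PHRASE_TABLE[phrase])
                    i = j
                    break
            else:
                out.append(words[i])  # copy unknown words through
                i += 1
        return " ".join(out)

    def translate(src: str) -> str:
        return statistical_pass(rule_based_pass(src))

    print(translate("voy al mercado"))  # -> "I go to the market"

Because the pre-processor only normalizes the input rather than translating it, it can remain small while still simplifying the statistical stage's job.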

Confidence-based

This approach differs from the other hybrid approaches in that, in most cases, only one translation technology is used at a time. A confidence metric is produced for each translated sentence, from which the system can decide whether to accept the initial output or retry with a secondary translation technology. For example, a statistical machine translation (SMT) fallback may be used when the primary neural machine translation (NMT) output exhibits common error patterns, such as the same word repeated many times in sequence, which typically occurs when the NMT attention mechanism becomes confused.
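The sketch below illustrates one such confidence heuristic, with hypothetical stand-ins for both engines: if the primary output repeats the same token too many times in a row (a typical symptom of degenerate attention), the system falls back to the secondary engine.

    # Hypothetical stand-ins for the primary (e.g. neural) and fallback
    # (e.g. statistical) engines; a real system would call full MT backends.
    def primary_engine(src: str) -> str:
        return "the the the the house"  # simulate a degenerate NMT output

    def fallback_engine(src: str) -> str:
        return "the red house"

    def max_consecutive_repeats(tokens: list) -> int:
        # Length of the longest run of identical consecutive tokens.
        longest = run = 1
        for prev, cur in zip(tokens, tokens[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        return longest

    def translate(src: str, max_repeat: int = 2) -> str:
        output = primary_engine(src)
        # Low confidence if the output shows pathological repetition.
        if max_consecutive_repeats(output.split()) > max_repeat:
            return fallback_engine(src)
        return output

    print(translate("la casa roja"))  # falls back -> "the red house"

Production systems derive confidence from richer signals, such as the model's own token probabilities or a dedicated quality-estimation model, but the accept-or-fallback decision has the same shape.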

See also

Computational linguistics
Machine translation
Natural language processing
Word-sense disambiguation
Machine learning
Parsing
Automatic summarization
Dictionary-based machine translation
Grammar induction
Statistical machine translation
Evaluation of machine translation
Transfer-based machine translation
Rule-based machine translation
Deep Linguistic Processing with HPSG Initiative (DELPH-IN)
Outline of natural language processing

References

  1. Hutchins, J. 2007. Machine translation: A concise history. Computer Aided Translation: Theory and Practice.
  2. Hogan, C. and Frederking, R. 1998. An Evaluation of the Multi-engine MT Architecture. Lecture Notes in Computer Science, 1529, pp. 113–123.
  3. Chang, J. and Su, K. 1997. Corpus-based statistics-oriented (CBSO) machine translation researches in Taiwan. AMTA (1997), pp. 165–173.
  4. Hovy, E. 1996. Deepening wisdom or compromised principles? The hybridization of statistical and symbolic MT systems. IEEE Expert, 11(2), pp. 16–18.