Rule-based machine translation

Last updated

Rule-based machine translation (RBMT; "Classical Approach" of MT) is machine translation systems based on linguistic information about source and target languages basically retrieved from (unilingual, bilingual or multilingual) dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language respectively. Having input sentences (in some source language), an RBMT system generates them to output sentences (in some target language) on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task.

Contents

History

The first RBMT systems were developed in the early 1970s. The most important steps of this evolution were the emergence of the following RBMT systems:

Today, other common RBMT systems include:

Types of RBMT

There are three different types of rule-based machine translation systems:

  1. Direct Systems (Dictionary Based Machine Translation) map input to output with basic rules.
  2. Transfer RBMT Systems (Transfer Based Machine Translation) employ morphological and syntactical analysis.
  3. Interlingual RBMT Systems (Interlingua) use an abstract meaning. [1] [2]

RBMT systems can also be characterized as the systems opposite to Example-based Systems of Machine Translation (Example Based Machine Translation), whereas Hybrid Machine Translations Systems make use of many principles derived from RBMT.

Basic principles

The main approach of RBMT systems is based on linking the structure of the given input sentence with the structure of the demanded output sentence, necessarily preserving their unique meaning. The following example can illustrate the general frame of RBMT:

A girl eats an apple. Source Language = English; Demanded Target Language = German

Minimally, to get a German translation of this English sentence one needs:

  1. A dictionary that will map each English word to an appropriate German word.
  2. Rules representing regular English sentence structure.
  3. Rules representing regular German sentence structure.

And finally, we need rules according to which one can relate these two structures together.

Accordingly, we can state the following stages of translation:

1st: getting basic part-of-speech information of each source word:
a = indef.article; girl = noun; eats = verb; an = indef.article; apple = noun
2nd: getting syntactic information about the verb "to eat":
NP-eat-NP; here: eat – Present Simple, 3rd Person Singular, Active Voice
3rd: parsing the source sentence:
(NP an apple) = the object of eat

Often only partial parsing is sufficient to get to the syntactic structure of the source sentence and to map it onto the structure of the target sentence.

4th: translate English words into German
a (category = indef.article) => ein (category = indef.article)
girl (category = noun) => Mädchen (category = noun)
eat (category = verb) => essen (category = verb)
an (category = indef. article) => ein (category = indef.article)
apple (category = noun) => Apfel (category = noun)
5th: Mapping dictionary entries into appropriate inflected forms (final generation):
A girl eats an apple. => Ein Mädchen isst einen Apfel.

Components

The RBMT system contains:

a SL dictionary - needed by the source language morphological analyser for morphological analysis,
a bilingual dictionary - used by the translator to translate source language words into target language words,
a TL dictionary - needed by the target language morphological generator to generate target language words. [3]

The RBMT system makes use of the following:

Advantages

Shortcomings

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of linguistics and computer science. It is primarily concerned with processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

<span class="mw-page-title-main">Syntax</span> System responsible for combining morphemes into complex structures

In linguistics, syntax is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituency), agreement, the nature of crosslinguistic variation, and the relationship between form and meaning (semantics). There are numerous approaches to syntax that differ in their central assumptions and goals.

Linguistics is the scientific study of human language. Someone who engages in this study is called a linguist. See also the Outline of linguistics, the List of phonetics topics, the List of linguists, and the List of cognitive science topics. Articles related to linguistics include:

Case roles, according to the work by Fillmore (1967), are the semantic roles of noun phrases in relation to the syntactic structures that contain these noun phrases. The term case role is most widely used for purely semantic relations, including theta roles and thematic roles, that can be independent of the morpho-syntax. The concept of case roles is related to the larger notion of Case which is defined as a system of marking dependent nouns for the type of semantic or syntactic relationship they bear to their heads. Case traditionally refers to inflectional marking.

Lexical semantics, as a subfield of linguistic semantics, is the study of word meanings. It includes the study of how words structure their meaning, how they act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.

Lexical functional grammar (LFG) is a constraint-based grammar framework in theoretical linguistics. It posits two separate levels of syntactic structure, a phrase structure grammar representation of word order and constituency, and a representation of grammatical functions such as subject and object, similar to dependency grammar. The development of the theory was initiated by Joan Bresnan and Ronald Kaplan in the 1970s, in reaction to the theory of transformational grammar which was current in the late 1970s. It mainly focuses on syntax, including its relation with morphology and semantics. There has been little LFG work on phonology.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Link grammar (LG) is a theory of syntax by Davy Temperley and Daniel Sleator which builds relations between pairs of words, rather than constructing constituents in a phrase structure hierarchy. Link grammar is similar to dependency grammar, but dependency grammar includes a head-dependent relationship, whereas Link Grammar makes the head-dependent relationship optional. Colored Multiplanar Link Grammar (CMLG) is an extension of LG allowing crossing relations between pairs of words. The relationship between words is indicated with link types, thus making the Link grammar closely related to certain categorial grammars.

In linguistics, nominalization or nominalisation is the use of a word that is not a noun as a noun, or as the head of a noun phrase. This change in functional category can occur through morphological transformation, but it does not always. Nominalization can refer, for instance, to the process of producing a noun from another part of speech by adding a derivational affix, but it can also refer to the complex noun that is formed as a result.

Ngalakan (Ngalakgan) is an Australian Aboriginal language of the Ngalakgan people. It has not been fully acquired by children since the 1930s. It is one of the Northern Non-Pama–Nyungan languages formerly spoken in the Roper river region of the Northern Territory. It is most closely related to Rembarrnga.

<span class="mw-page-title-main">Interlingual machine translation</span> Type of machine translation

Interlingual machine translation is one of the classic approaches to machine translation. In this approach, the source language, i.e. the text to be translated is transformed into an interlingua, i.e., an abstract language-independent representation. The target language is then generated from the interlingua. Within the rule-based machine translation paradigm, the interlingual approach is an alternative to the direct approach and the transfer approach.

Constraint grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation, inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally or globally. Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules, that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.

Linguistic categories include

Nanosyntax is an approach to syntax where the terminal nodes of syntactic parse trees may be reduced to units smaller than a morpheme. Each unit may stand as an irreducible element and not be required to form a further "subtree." Due to its reduction to the smallest terminal possible, the terminals are smaller than morphemes. Therefore, morphemes and words cannot be itemised as a single terminal, and instead are composed by several terminals. As a result, Nanosyntax can serve as a solution to phenomena that are inadequately explained by other theories of syntax.

A multiword expression (MWE), also called phraseme, is a lexeme-like unit made up of a sequence of two or more lexemes that has properties that are not predictable from the properties of the individual lexemes or their normal mode of combination. MWEs differ from lexemes in that the latter are required by many sources to have meaning that cannot be derived from the meaning of separate components. While MWEs must have some properties that cannot be derived from the same property of the components, the property in question does not need to be meaning.

<span class="mw-page-title-main">Transfer-based machine translation</span>

Transfer-based machine translation is a type of machine translation (MT). It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: analysis of the source language text to determine its grammatical structure, transfer of the resulting structure to a structure suitable for generating text in the target language, and finally generation of this text. Transfer-based MT systems are thus capable of using knowledge of the source and target languages.

The following outline is provided as an overview of and topical guide to natural-language processing:

Machine translation (MT) algorithms may be classified by their operating principle. MT may be based on a set of linguistic rules, or on large bodies (corpora) of already existing parallel texts. Rule-based methodologies may consist in a direct word-by-word translation, or operate via a more abstract representation of meaning: a representation either specific to the language pair, or a language-independent interlingua. Corpora-based methodologies rely on machine learning and may follow specific examples taken from the parallel texts, or may calculate statistical probabilities to select a preferred option out of all possible translations.

Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create treebanks of the world's languages. These treebanks are openly accessible and available. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The project's primary aim is to achieve cross-linguistic consistency of annotation, while still permitting language-specific extensions when necessary. The annotation scheme has it roots in three related projects: Stanford Dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets. The UD annotation scheme uses a representation in the form of dependency trees as opposed to a phrase structure trees. At the present time, there are just over 200 treebanks of more than 100 languages available in the UD inventory.

References

  1. Koehn, Philipp (2010). Statistical Machine Translation. Cambridge: Cambridge University Press. p. 15. ISBN   9780521874151.
  2. Nirenburg, Sergei (1989). "Knowledge-Based Machine Translation". Machine Trandation 4 (1989), 5 - 24. Kluwer Academic Publishers. 4 (1): 5–24. JSTOR   40008396.
  3. Hettige, B.; Karunananda, A.S. (2011). "Computational Model of Grammar for English to Sinhala Machine Translation". 2011 International Conference on Advances in ICT for Emerging Regions (ICTer). pp. 26–31. doi:10.1109/ICTer.2011.6075022. ISBN   978-1-4577-1114-5. S2CID   45871137.
  4. Lonsdale, Deryle; Mitamura, Teruko; Nyberg, Eric (1995). "Acquisition of Large Lexicons for Practical Knowledge-Based MT". Machine Translation. Kluwer Academic Publishers. 9 (3–4): 251–283. doi:10.1007/BF00980580. S2CID   1106335.
  5. Lagarda, A.-L.; Alabau, V.; Casacuberta, F.; Silva, R.; Díaz-de-Liaño, E. (2009). "Statistical Post-Editing of a Rule-Based Machine Translation System" (PDF). Proceedings of NAACL HLT 2009: Short Papers, pages 217–220, Boulder, Colorado. Association for Computational Linguistics. Retrieved 20 June 2012.

Literature