Arabic machine translation

Arabic has received attention from machine translation (MT) researchers since the earliest days of MT, particularly in the United States. The language has long been considered "due to its morphological, syntactic, phonetic and phonological properties [to be] one of the most difficult languages for written and spoken language processing." [1]

Arabic "differs tremendously in terms of its characters, morphology and diacritization from other languages." [1] Accordingly, researchers cannot always import solutions developed for other languages, and Arabic machine translation still requires further work, mainly in semantic representation systems, which are essential for achieving high-quality translation.
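As a small illustration of the diacritization issue, the same Arabic word can appear with or without short-vowel marks, so a processing pipeline often has to normalize the two spellings before comparing them. The sketch below is not taken from any system described here. The function name `strip_diacritics` is hypothetical, and the sketch assumes that removing the core mark range U+064B to U+0652 is sufficient for the example.

```python
import re

# Core Arabic diacritic marks (tanwin, short vowels, shadda, sukun),
# code points U+064B through U+0652.
# Assumption: this range is enough for the illustration; real text may
# also contain other marks that a fuller normalizer would handle.
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return text with the core Arabic diacritic marks removed."""
    return TASHKEEL.sub("", text)


# "kataba" (he wrote), written with a fatha after each consonant
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"

# The same word as it is usually written, without diacritics
bare = strip_diacritics(vocalized)

print(len(vocalized), len(bare))
```

Stripping the marks makes the two spellings comparable, but it also discards information that distinguishes otherwise identical consonant strings, which is one reason solutions from other languages do not transfer directly.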

Approaches for the study of machine processing of Arabic

Particularistic approaches

Particularistic approaches describe the linguistic features of Arabic and use them for local processing that is specific to the internal linguistic system of Arabic. They are concerned with the morphological and semantic aspects of the language. Sakhr is one of the Arabic-speaking groups systematically developing machine processing of Arabic. [2]

Universalist approaches

Universalist approaches use methods and systems that have proved useful for other languages, such as English or French, adapting them where necessary. The focus here is on the syntactic aspects of the linguistic system in general. Most companies producing software applications for Arabic follow this approach.

References

  1. Zughoul, Muhammad; Abu-Alshaar, Awatef (3 August 2005). "English/Arabic/English Machine Translation: A Historical Perspective". Translators' Journal. 50 (3): 1022–1041. Retrieved 2 June 2011.
  2. "Arabic Machine Translation". Archived from the original on 15 July 2011. Retrieved 4 June 2011.