Controlled language in machine translation

Using controlled language in machine translation poses several problems.

In automated translation, the first step toward understanding controlled language is to know what it is and how it differs from natural language.

The main problem in machine translation is linguistic. Language is ambiguous, and the system tries to model it lexically and grammatically. There are many ways to address this problem; for example, a glossary related to the text's topic can be used.
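As a rough illustration of how a topic glossary can resolve lexical ambiguity, the sketch below looks up an ambiguous source term in a glossary chosen by subject area. The glossary contents, the domain names and the function `translate_term` are assumptions made up for this example; they do not come from any particular machine translation system.

```python
# Hypothetical domain glossaries (illustrative data only): each maps an
# ambiguous English source term to the Spanish rendering preferred in
# that subject area.
GLOSSARIES = {
    "finance": {"bank": "banco", "interest": "interés"},
    "geography": {"bank": "orilla"},
}


def translate_term(term, domain, fallback=None):
    """Return the glossary translation of `term` for `domain`.

    If the domain or the term is unknown, return `fallback` so the caller
    can defer to the general-purpose translation engine.
    """
    return GLOSSARIES.get(domain, {}).get(term.lower(), fallback)


if __name__ == "__main__":
    for domain in ("finance", "geography"):
        print(domain, "->", translate_term("bank", domain))
```

The same source word receives a different target term depending on the glossary in force, which is the sense in which a topic glossary narrows the ambiguity the system has to model.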

Benefits of using a controlled language

A controlled language makes it possible to produce texts that are easier to read, more comprehensible and easier to retain, and that have better vocabulary and style. Reasons for introducing a controlled language include:

Controlled language and translation

One of the biggest challenges facing organizations that wish to reduce the cost and time involved in their translations is that, even in environments that combine content management systems with translation memory technology, the percentage of untranslated segments per new document remains fairly high. While it is certainly possible to manage content at the sentence or segment level, the current best practice seems to be to chunk at the topic level, which means that reuse occurs at a fairly coarse level of granularity.
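The share of untranslated segments in a new document can be estimated with a few lines of code. The following is a minimal sketch that assumes exact-match lookup only and a naive sentence splitter; the names `split_segments` and `untranslated_ratio` and the sample memory are hypothetical, and real translation memory tools also use fuzzy matching.

```python
import re


def split_segments(text):
    """Naive segmenter: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def untranslated_ratio(document, memory):
    """Return the fraction of segments in `document` that have no exact
    match among the source segments stored in `memory`."""
    segments = split_segments(document)
    if not segments:
        return 0.0
    missing = [s for s in segments if s not in memory]
    return len(missing) / len(segments)


if __name__ == "__main__":
    # Hypothetical translation memory: source segment -> stored translation.
    memory = {
        "Remove the cover.": "Retire la cubierta.",
        "Close the cover.": "Cierre la cubierta.",
    }
    new_document = "Remove the cover. Check the oil level. Close the cover."
    print(f"{untranslated_ratio(new_document, memory):.0%} of segments need translation")
```

Because lookup is by exact string match, any variation in wording between documents counts as a new segment. Consistent phrasing, which a controlled language encourages, lowers this ratio.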

Related Research Articles

Computer programming is the process of performing a particular computation, usually by designing and building an executable computer program. Programming involves tasks such as analysis, generating algorithms, profiling algorithms' accuracy and resource consumption, and the implementation of algorithms. The source code of a program is written in one or more languages that are intelligible to programmers, rather than machine code, which is directly executed by the central processing unit. The purpose of programming is to find a sequence of instructions that will automate the performance of a task on a computer, often for solving a given problem. Proficient programming thus usually requires expertise in several different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units that have previously been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”. Individual words are handled by terminology bases and are not within the domain of TM.

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.

A programming tool or software development tool is a computer program that software developers use to create, debug, maintain, or otherwise support other programs and applications. The term usually refers to relatively simple programs that can be combined to accomplish a task, much as one might use multiple hand tools to fix a physical object. The most basic tools are a source code editor and a compiler or interpreter, which are used ubiquitously and continuously. Other tools are used more or less depending on the language, development methodology, and individual engineer, and are often used for a discrete task, like a debugger or profiler. Tools may be discrete programs, executed separately – often from the command line – or may be parts of a single large program, called an integrated development environment (IDE). In many cases, particularly for simpler use, simple ad hoc techniques are used instead of a tool, such as print debugging instead of using a debugger, manual timing instead of a profiler, or tracking bugs in a text file or spreadsheet instead of a bug tracking system.

Computer-aided translation (CAT), also referred to as machine-assisted translation (MAT) or machine-aided human translation (MAHT), is the use of software to assist a human translator in the translation process. The translation is created by a human, and certain aspects of the process are facilitated by software; this is in contrast with machine translation (MT), in which the translation is created by a computer, optionally with some human intervention.

In computer science, the term automatic programming identifies a type of computer programming in which some mechanism generates a computer program to allow human programmers to write the code at a higher abstraction level.

ASD-STE100 Simplified Technical English (STE) is a controlled language developed in the early 1980s to help second-language speakers of English to unambiguously understand technical manuals written in English. It was initially applicable to commercial aviation. It then became a requirement for defense projects, including land and sea vehicles. As a consequence, today, many maintenance manuals are written in STE.

A translation management system (TMS), formerly globalization management system (GMS), is a type of software for automating many parts of the human language translation process and maximizing translator efficiency. The idea of a translation management system is to automate all repeatable and non-essential work that can be done by software and to leave only the creative work of translation and review to human beings. A translation management system generally includes at least two types of technology: process management technology to automate the flow of work, and linguistic technology to aid the translator.

Integrated logistic support (ILS) is a technology in systems engineering that lowers a product's life-cycle cost and decreases demand for logistics by optimizing the maintenance system to ease product support. Although originally developed for military purposes, it is also widely used in commercial customer service organisations.

OmegaT is a computer-assisted translation tool written in the Java programming language. It is free software originally developed by Keith Godfrey in 2000, and is currently developed by a team led by Aaron Madlon-Kay.

Technical translation is a type of specialized translation involving the translation of documents produced by technical writers, or more specifically, texts which relate to technological subject areas or texts which deal with the practical application of scientific and technological information. While the presence of specialized terminology is a feature of technical texts, specialized terminology alone is not sufficient for classifying a text as "technical" since numerous disciplines and subjects which are not "technical" possess what can be regarded as specialized terminology. Technical translation covers the translation of many kinds of specialized texts and requires a high level of subject knowledge and mastery of the relevant terminology and writing conventions.

Google Translator Toolkit was an online computer-assisted translation tool (CAT) - a web application designed to allow translators to edit the translations that Google Translate automatically generates using its own and/or user-uploaded files of appropriate glossaries and translation memory. With the Google Translator Toolkit, translators could organize their work and use shared translations, glossaries and translation memories. It allowed translators to upload and translate Microsoft Word documents, OpenDocument, RTF, HTML, text, and Wikipedia articles.

Post-editing is the process whereby humans amend machine-generated translation to achieve an acceptable final product. A person who post-edits is called a post-editor. The concept of post-editing is linked to that of pre-editing. In the process of translating a text via machine translation, best results may be gained by pre-editing the source text – for example by applying the principles of controlled language – and then post-editing the machine output. It is distinct from editing, which refers to the process of improving human-generated text. Post-edited text may afterwards be revised to ensure the quality of the language choices, or proofread to correct simple mistakes.
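The pre-editing step described above can be partly automated by checking source sentences against controlled-language rules before they are sent to the machine translation system. The sketch below is illustrative only: the 20-word limit, the list of discouraged words and the function `check_sentence` are assumptions chosen for the example, not rules taken from a specific controlled-language standard.

```python
import re

# Assumed rule set for illustration; a real controlled language defines its own.
MAX_WORDS = 20
DISCOURAGED = frozenset({"utilize", "commence"})


def check_sentence(sentence, max_words=MAX_WORDS, discouraged=DISCOURAGED):
    """Return a list of rule violations found in `sentence` (empty if none)."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    issues = []
    if len(words) > max_words:
        issues.append(f"too long: {len(words)} words (limit {max_words})")
    # Report discouraged words in alphabetical order for stable output.
    for word in sorted(set(words) & discouraged):
        issues.append(f"discouraged word: {word}")
    return issues


if __name__ == "__main__":
    for text in ("Open the valve.", "Commence the test."):
        print(text, "->", check_sentence(text) or "ok")
```

Sentences that pass such checks still need human review; the checker only flags candidates for rewriting before translation.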

Corpora in translation studies: the translator's workplace has changed gradually over the last ten years. Personal computers can now process information more easily and quickly than ever before, so today's computer can be considered an important or even essential tool in translation. However, problems arise in the use of computers in translation: the computer is no substitute for traditional tools such as monolingual and bilingual dictionaries, terminologies and encyclopaedias, whether on paper or in digital format, and although a large amount of information is easy to access, finding the right and reliable information remains necessary.

memoQ is a proprietary computer-assisted translation software suite which runs on Microsoft Windows operating systems. It is developed by the Hungarian software company memoQ Fordítástechnológiai Zrt., formerly Kilgray, a provider of translation management software established in 2004 and cited as one of the fastest-growing companies in the translation technology sector in 2012 and 2013. memoQ provides translation memory, terminology, machine translation integration and reference information management in desktop, client/server and web application environments.

MateCat is a web-based computer-assisted translation (CAT) tool. MateCat is released as open source software under the Lesser General Public License (LGPL) from the Free Software Foundation.

Lingotek is a cloud-based translation services provider, offering translation management software and professional linguistic services for web content, software platforms, product documentation and electronic documents.
