Caitra

From Wikipedia, the free encyclopedia. Last updated January 31, 2020.

Caitra is a computer-assisted translation (CAT) tool developed by the University of Edinburgh. Offered as an online platform, Caitra is based on Ajax-style Web 2.0 technologies and the Moses decoder. The web interface of the tool is implemented with Ruby on Rails, an open-source web framework, and C++.

History

Machine translation (MT) systems are typically used by readers who do not need a thorough translation and want quick access to text in a foreign language. Professional translators, by contrast, usually require more advanced tools that make their work easier and help them deliver higher-quality translations to their clients. The Trans-Type project (Langlais et al., 2000) pioneered the use of MT as an aid to human translators. Its translation tool suggests translations for a segment, and the translator can accept a suggestion or overwrite it with their own translation, which in turn triggers new suggestions from the tool. This approach is, however, not necessarily suitable for professional translators. Tools with post-editing facilities have also been developed as a middle ground between fully automatic MT and human translation, integrating the two to achieve the desired results. The School of Informatics and the Machine Translation Group of the University of Edinburgh created a research program, Caitra, to analyze the benefits of different types of MT and to explore the interaction between the machine and the user in order to develop new CAT tools.

Properties

Caitra is built with an open-source web framework, Ruby on Rails (Thomas and Hansson, 2008). The online platform uses Ajax-style Web 2.0 technologies (Raymond, 2007) connected to a MySQL database back-end. The machine translation back-end is powered by Moses, a statistical phrase-based MT system (Koehn et al., 2007). C++ is used to speed up the generation of translation suggestions. The School of Informatics provides the tool online both to study how users interact with it and to let users suggest additional features and fixes.

The user enters text into the provided text box, and Caitra processes it when the user clicks the "Upload" icon. Processing may take a few minutes, during which Caitra finds different options for the translation; one of them is selected by default. Once processing is finished, translators are offered several kinds of assistance in a single interface. The unit of translation is the sentence, so Caitra works with only one sentence at a time.

Interactive machine translation

The Trans-Type project (Langlais et al., 2000) investigated interactive machine translation, in which a CAT tool assists sentence-by-sentence translation by suggesting several different options. Human translators may choose one of the options or provide their own translation if none of the suggestions is suitable. The process is similar to the auto-completion feature found in several office programs.

A statistical translation system generates the predictions. They are presented as short phrases, in line with the statistical phrase-based translation model, which also makes them easier for the user to read. The suggestions and user actions are stored in a large database. During the user interaction, Caitra quickly matches user input against a graph using a string edit distance measure. The prediction is the optimal completion path that matches the user input with (a) minimal string edit distance and (b) highest sentence translation probability. This computation takes place on the server and is implemented in C++, as Philipp Koehn explains. [1] A new suggestion is displayed whenever the user accepts the current one or types a new segment. How often suggestions are accepted depends on the language pair and the complexity of the text. Preliminary studies of Caitra suggest that users usually accept 50-80% of the predictions generated by the system.
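
The matching criterion described above can be illustrated with a small sketch. It is not Caitra's implementation, which runs in C++ over a search graph; as a simplifying assumption, the graph is replaced here by a flat list of candidate sentences with probabilities, and the names best_prefix_match and predict are hypothetical. The sketch finds, for each candidate, the prefix closest to what the user has typed (by Levenshtein distance), then returns the rest of the candidate that has (a) the smallest distance and (b) the highest probability among equally close candidates.

```python
from typing import List, Tuple


def best_prefix_match(typed: str, candidate: str) -> Tuple[int, int]:
    """Return (distance, end), where candidate[:end] is the prefix of
    `candidate` with the smallest Levenshtein distance to `typed`.
    Ties are broken in favour of the longer prefix."""
    # prev[j] = edit distance between typed[:i] and candidate[:j]
    prev = list(range(len(candidate) + 1))
    for i, ch in enumerate(typed, start=1):
        curr = [i]
        for j, cj in enumerate(candidate, start=1):
            cost = 0 if ch == cj else 1
            curr.append(min(prev[j] + 1,           # delete a typed character
                            curr[j - 1] + 1,       # insert a candidate character
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    # The last row holds the distance from `typed` to every candidate prefix.
    end = min(range(len(prev)), key=lambda j: (prev[j], -j))
    return prev[end], end


def predict(typed: str, candidates: List[Tuple[str, float]]) -> str:
    """Return the completion from the candidate with (a) minimal edit
    distance to the typed text and (b) highest probability on ties.
    `candidates` is an assumed stand-in for the paths of a search graph."""
    best = None
    for text, prob in candidates:
        dist, end = best_prefix_match(typed, text)
        key = (dist, -prob)
        if best is None or key < best[0]:
            best = (key, text[end:])
    return best[1] if best else ""


if __name__ == "__main__":
    options = [("the house is small", 0.6), ("the home is tiny", 0.3)]
    print(repr(predict("the hose", options)))
    print(repr(predict("the hom", options)))
```

In this sketch a typo such as "the hose" is still aligned with "the house", so the suggested completion continues from the end of the matched prefix rather than restarting the sentence.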

Translation process

Once the text is uploaded, users can see the result of the machine translation and edit the text based on the predictions. Clicking the edit icon displays the prediction table. The text is divided into sentences, which are in turn divided into smaller units. Predictions for these units appear in a box, and the most likely suggestion is shown in a different colour at the top of the table. The user accepts a prediction by clicking on it, and the system updates its selection to match the user input. The database consists of a large number of source texts paired with their translations, and the most likely prediction is derived from previous matches in the database. The user's choices are stored in the database for use in future translations. These predictions help not only professional translators but also novice translators who do not know the vocabulary, as well as people with no knowledge of the foreign language.

Post-editing Machine Translation process

Users can review their translation and make any changes needed to correct mistakes. The changes appear in the output display.

User activity

Caitra records the time users take to accept a prediction or to write their own translation. How much weight an action carries for future predictions depends on what the user did and on the time they needed to complete the translation. Every action, pause or movement is considered relevant for improving future translations.

Related Research Articles

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units that have previously been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”. Individual words are handled by terminology bases and are not within the domain of TM.

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. The most famous example is the Rosetta Stone.

Computer-assisted translation or computer-aided translation (CAT) is a form of language translation in which a human translator uses computer software to support and facilitate the translation process.

Autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing. In graphical user interfaces, users can typically press the tab key to accept a suggestion or the down arrow key to accept one of several.

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations. It usually operates at sentence-level segments, but some translation technology allows matching at a phrasal level. It is used when the translator is working with translation memory (TM). It uses approximate string matching.

Open Language Tools is a Java project released by Sun Microsystems under the terms of Sun’s CDDL.

Microsoft Translator is a multilingual machine translation cloud service provided by Microsoft. Microsoft Translator is integrated across multiple consumer, developer, and enterprise products; including Bing, Microsoft Office, SharePoint, Microsoft Edge, Microsoft Lync, Yammer, Skype Translator, Visual Studio, Internet Explorer, and Microsoft Translator apps for Windows, Windows Phone, iPhone and Apple Watch, and Android phone and Android Wear.

Google Translator Toolkit was an online computer-assisted translation tool (CAT) - a web application designed to allow translators to edit the translations that Google Translate automatically generates using its own and/or user-uploaded files of appropriate glossaries and translation memory. With the Google Translator Toolkit, translators could organize their work and use shared translations, glossaries and translation memories. It allowed translators to upload and translate Microsoft Word documents, OpenDocument, RTF, HTML, text, and Wikipedia articles.

Rule-based machine translation (RBMT) covers machine translation systems based on linguistic information about the source and target languages, mainly retrieved from dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. Given input sentences, an RBMT system transforms them into output sentences on the basis of morphological, syntactic, and semantic analysis of both the source and the target languages involved in a concrete translation task.

Interactive machine translation (IMT) is a specific sub-field of computer-aided translation. Under this translation paradigm, the computer software that assists the human translator attempts to predict the text the user is going to input by taking into account all the information it has available. Whenever such a prediction is wrong and the user provides feedback to the system, a new prediction is made using the new information available. This process is repeated until the translation provided matches the user's expectations.

Philipp Koehn is a computer scientist and researcher in the field of machine translation. His primary research interest is statistical machine translation, and he is one of the inventors of a method called phrase-based machine translation, a sub-field of statistical translation methods that employs sequences of words as the basis of translation, extending the earlier word-based approaches. A 2003 paper that he authored with Franz Josef Och and Daniel Marcu, "Statistical phrase-based translation", has attracted wide attention in the machine translation community and has been cited over a thousand times. Phrase-based methods are widely used in machine translation applications in industry. Examples of such systems are Google Translate and Omniscien Technologies.

The Europarl Corpus is a corpus that consists of the proceedings of the European Parliament from 1996 to the present. In its first release in 2001, it covered eleven official languages of the European Union. With the political expansion of the EU, the official languages of the ten new member states have been added to the corpus data. The latest release (2012) comprised up to 60 million words per language, with the newly added languages being slightly underrepresented because data for them is only available from 2007 onwards. This latest version includes 21 European languages from the Romance, Germanic, Slavic, Finno-Ugric, and Baltic groups, plus Greek.

The following outline is provided as an overview of and topical guide to natural language processing:

memoQ is a proprietary computer-assisted translation software suite which runs on Microsoft Windows operating systems. It is developed by the Hungarian software company memoQ Fordítástechnológiai Zrt., formerly Kilgray, a provider of translation management software established in 2004 and cited as one of the fastest growing companies in the translation technology sector in 2012 and 2013. memoQ provides translation memory, terminology, machine translation integration and reference information management in desktop, client/server and web application environments.

MateCat is a web-based computer-assisted translation (CAT) tool, of which there are several on the current market. MateCat is released as open source software under the Lesser General Public License (LGPL) from the Free Software Foundation.

Yandex.Translate is a web service provided by Yandex, intended for the translation of text or web pages into another language.

Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.

The EuroMatrixPlus is a project that ran from March 2009 to February 2012. EuroMatrixPlus succeeded a project called EuroMatrix and continued in further development and improvement of machine translation (MT) systems for languages of the European Union (EU).

References

[1] Koehn, Philipp. "A Web-Based Interactive Computer Aided Translation Tool" (PDF). School of Informatics, University of Edinburgh.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. (2007) "Moses: Open Source Toolkit for Statistical Machine Translation". Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
Olivia Craciunescu, "Machine Translation and Computer-Assisted Translation: A New Way of Translating?"