MateCat

Figure 2 - The MateCat Tool editing page.

MateCat is a web-based computer-assisted translation (CAT) tool, released as open-source software under the Lesser General Public License (LGPL).

Overview

MateCat ("Machine Translation Enhanced Computer Assisted Translation") was a three-year research project (November 2011 to October 2014) funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 287688. [1] It received over €2,500,000 in European funds. [2]

The project consortium was led by FBK (Fondazione Bruno Kessler), an international research center based in Trento, Italy. It also included Translated, an AI-based language solution provider founded by Marco Trombetti and Isabelle Andrieu; the Université du Maine; and the University of Edinburgh.

CAT tools

CAT tools provide access to translation memories (TMs), terminology databases, concordance tools and, more recently, to machine translation (MT) engines. The integration of suggestions from an MT engine as a complement to TM matches is motivated by recent studies, [3] [4] [5] which have shown that post-editing MT suggestions improves the level of accuracy in translations.

MateCat facilitates the editing of machine translation results and manages the localization workflow. It leverages knowledge of field-specific language (for example, legal terminology) to improve translation suggestions, and also uses machine learning to improve suggestions automatically over time. [6] It is designed to function both as a translation workbench product and as a research platform for integrating new MT functions, running post-editing experiments, and measuring user productivity.

Technology

Statistical MT

MateCat runs as a web server that connects with other services via open APIs: the TM service MyMemory, [7] the commercial Google Translate (GT) service, ModernMT, and a list of Moses-based [8] services specified in a configuration file. While MyMemory and GT are always available, Moses servers have to be installed and set up. With Moses, MateCat can extend the GT API to support self-tuning, user-adaptive, and informative MT functions.

The open-source version of MateCat natively supports the XLIFF [9] file format, and converters can be configured to support other formats. The tool supports Unicode (UTF-8) encoding, including non-Latin alphabets and right-to-left languages, and handles texts that embed mark-up tags. It supports concordances, terminology databases, and customizable quality estimation components, and provides an API for the Moses Toolkit that can be customized to languages and domains.
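As a rough illustration of the XLIFF format mentioned above, the following Python sketch reads the translation units of a minimal XLIFF 1.2 document using only the standard library. The sample document and the function name `read_segments` are invented for this example; they do not reflect MateCat's own file handling code.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OASIS XLIFF 1.2 specification.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample file, not taken from MateCat.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.txt" source-language="en" target-language="it" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Hello world</source>
        <target>Ciao mondo</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Press <g id="1">Save</g> to continue</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_segments(xliff_text):
    """Return (id, source, target) tuples; target is None when untranslated."""
    root = ET.fromstring(xliff_text)
    segments = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        segments.append((
            unit.get("id"),
            # itertext() flattens inline mark-up such as <g> into plain text.
            "".join(source.itertext()),
            "".join(target.itertext()) if target is not None else None,
        ))
    return segments


for segment in read_segments(SAMPLE):
    print(segment)
```

Flattening inline tags is a simplification chosen to keep the sketch short. A real CAT tool has to preserve such tags so that they can be restored in the translated file.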

MT support

The tool supports Moses-based servers that provide enhanced CAT-MT communication. In particular, the GT API is augmented in two ways: feedback is sent to the MT engine every time a segment is post-edited, and the MT output is enriched with additional information, including confidence scores and word lattices. The MT server developed for the project supports multi-threading to serve multiple translators, handles text segments that include tags, and adapts to the post-edits performed by each user. [10]
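The feedback loop described above can be pictured with a small Python sketch. It is only a toy: it caches whole segments and uses an exponential decay chosen for illustration, whereas the cache-based adaptation cited in [10] is built into the MT engine's models. The class name `PostEditCache` and all of its methods are hypothetical.

```python
import math


class PostEditCache:
    """Toy cache of source/target pairs learned from post-edits (illustrative only)."""

    def __init__(self, max_age=100):
        self.max_age = max_age   # entries older than this many events are ignored
        self.clock = 0           # number of feedback events seen so far
        self.entries = {}        # source text -> (post-edited text, time of update)

    def feedback(self, source, post_edit):
        """Record the translator's confirmed post-edit for a segment."""
        self.clock += 1
        self.entries[source] = (post_edit, self.clock)

    def score(self, source):
        """Return a reward in (0, 1] that decays with entry age, or 0.0 if unusable."""
        if source not in self.entries:
            return 0.0
        _, stamp = self.entries[source]
        age = self.clock - stamp
        if age >= self.max_age:
            return 0.0
        return math.exp(-age / self.max_age)

    def suggest(self, source, mt_output):
        """Prefer a cached post-edit over raw MT output while the entry is fresh."""
        if self.score(source) > 0.0:
            return self.entries[source][0]
        return mt_output


cache = PostEditCache(max_age=2)
cache.feedback("Hello", "Ciao")
print(cache.suggest("Hello", "Salve"))    # cached post-edit is returned
print(cache.suggest("Goodbye", "Addio"))  # unknown segment falls back to MT
```

The decay means that recent corrections weigh more than old ones, which is one plausible way to keep an adaptive system responsive to the document currently being translated.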

Context-aware translation

MateCat also provides MT suggestions that are consistent not only with the segments already edited but also, in theory, with the whole document. This context information will be embedded in the statistical models and should enable better disambiguation, for instance between lexical alternatives. The context-based models will combine information about recurring terms and expressions extracted during document analysis with their chosen and confirmed translations as soon as these become available. In particular, specific statistical models will take into account translation constraints related to inter-sentence and intra-sentence anaphoric expressions, to syntactic concordances, and to lexical coherence.

Real-time processing

The core components of traditional MT systems, that is, the translation and the language models, are generally static: they never change after an initial training phase. This makes them unsuitable for a dynamic environment like the one that MateCat provides for translators. To model the dynamic changes described in the two previous sections, MateCat developed data structures that can be updated rapidly as soon as the user supplies a new translation, together with efficient algorithms that perform this adaptation in real time and transparently to the translator. Moreover, efficiency will be improved by taking advantage of multithreading on a single CPU, as well as distributed computing facilities running on private clusters or computer clouds.
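The contrast between static and updatable models can be sketched in Python with a count-based bigram model that accepts new sentences at any time. This is a deliberately simple stand-in and not MateCat's data structure: the class `IncrementalBigramModel`, its method names, and the use of a lock for concurrent updates are all assumptions made for the example.

```python
import threading
from collections import Counter


class IncrementalBigramModel:
    """Bigram counts that can be extended after training (illustrative only)."""

    def __init__(self):
        self._context_counts = Counter()  # how often each word starts a bigram
        self._bigram_counts = Counter()   # how often each (previous, word) pair occurs
        self._lock = threading.Lock()     # lets several translator threads update safely

    def update(self, sentence):
        """Add one confirmed translation; the cost is linear in sentence length."""
        tokens = ["<s>"] + sentence.split()
        with self._lock:
            for prev, word in zip(tokens, tokens[1:]):
                self._context_counts[prev] += 1
                self._bigram_counts[(prev, word)] += 1

    def probability(self, prev, word):
        """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
        with self._lock:
            total = self._context_counts[prev]
            if total == 0:
                return 0.0
            return self._bigram_counts[(prev, word)] / total


model = IncrementalBigramModel()
model.update("the contract is valid")
print(model.probability("the", "contract"))
model.update("the agreement is valid")
print(model.probability("the", "contract"))
```

Because each update only increments counters, the new evidence affects the very next probability query without any retraining step, which is the property the paragraph above asks of a real-time system.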

Edit log

Figure 1 - The MateCat Tool edit log page.

During post-editing, the tool collects timing information for each segment, which is updated every time the segment is opened and closed. For each segment, it also records the suggestions that were generated and the one that was actually post-edited. This information is accessible at any time through a link on the editing page named Editing Log. The Editing Log page (Figure 1) shows a summary of the editing performed so far on the project, such as the average translation speed, the post-editing effort, and the percentage of top suggestions coming from MT or from the TM. It also reports detailed statistics about the edit operations performed on each segment, with segments sorted from slowest to fastest in terms of translation speed. The same information, in more detail, can be downloaded as a CSV file. While the Editing Log page is useful for monitoring the progress of a translation project in real time, the CSV file is the main source of information for detailed productivity analyses once the project has ended.
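The kind of analysis that such a CSV export makes possible can be sketched in Python as follows. The column names (`segment_id`, `words`, `seconds`, `origin`, `suggestion`, `post_edit`) and the sample rows are hypothetical, since the actual layout of MateCat's export is not described here. The effort measure, one minus a word-level similarity ratio from `difflib`, is likewise a simple stand-in for whatever metric the tool itself reports.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical export; the real column names in MateCat's CSV may differ.
SAMPLE_CSV = """segment_id,words,seconds,origin,suggestion,post_edit
1,4,20,MT,the contract are valid,the contract is valid
2,2,4,TM,thank you,thank you
"""


def seconds_per_word(row):
    return float(row["seconds"]) / int(row["words"])


def summarize(csv_text):
    """Compute project-level statistics from per-segment rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total_words = sum(int(r["words"]) for r in rows)
    total_seconds = sum(float(r["seconds"]) for r in rows)
    efforts = []
    for r in rows:
        # Word-level similarity between the suggestion and its post-edited form.
        matcher = SequenceMatcher(None, r["suggestion"].split(), r["post_edit"].split())
        efforts.append(1.0 - matcher.ratio())
    return {
        "seconds_per_word": total_seconds / total_words,
        "mean_effort": sum(efforts) / len(efforts),
        "mt_share": sum(r["origin"] == "MT" for r in rows) / len(rows),
        # Segment ids ordered from slowest to fastest, as on the log page.
        "slowest_first": [r["segment_id"]
                          for r in sorted(rows, key=seconds_per_word, reverse=True)],
    }


print(summarize(SAMPLE_CSV))
```

The sketch assumes a non-empty file in which every segment has at least one word; a production script would need to guard against empty input and zero-length segments.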

Applications

MateCat has been used by the MateCat project to investigate new MT functions [11] and to evaluate them in a real professional setting, in which translators have at their disposal all the sources of information they are used to working with. Thanks to its flexibility and ease of use, the tool has also been used for data collection and for education (a course on CAT technology for students in translation studies). An initial version of the tool was leveraged by the CasmaCat project [12] to create a workbench [13] particularly suitable for investigating advanced interaction modalities such as interactive MT, eye tracking, and handwritten input. The tool is employed by the translation agency Translated for its internal translation projects and is being tested by several international companies, both language service providers and IT companies. This has made it possible to collect continuous feedback from hundreds of translators, which, besides helping the developers to improve the robustness of the tool, is also influencing the way new MT functions will be integrated to supply the best help to the final user.



References

  1. José, M., & Machado, B. (2014). Free and open-source software — a translator’s good friend, 3. Retrieved from http://ec.europa.eu/translation/portuguese/magazine
  2. European Commission. (2017). Commission Staff Working Document: Interim Evaluation of Horizon 2020, Annex 2. Brussels. Retrieved from http://ec.europa.eu/transparency/regdoc/rep/10102/2017/EN/SWD-2017-221-F1-EN-MAIN-PART-12.PDF
  3. Marcello Federico; Alessandro Cattelan; Marco Trombetti (2012). "Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA)" (PDF). Amta2012.amtaweb.org. Archived from the original (PDF) on 30 October 2014. Retrieved 30 October 2014.
  4. Spence Green; Jeffrey Heer; Christopher D Manning (2013). The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '13. Dl.acm.org. pp. 439–448. doi:10.1145/2470654.2470718. ISBN 9781450318990. S2CID 119828. Retrieved 30 October 2014.
  5. Samuel Läubli; Mark Fishel; Gary Massey; Maureen Ehrensberger-Dow; Martin Volk (2013). "Assessing Post-Editing Efficiency in a Realistic Translation Environment. In Michel Simard Sharon O'Brien and Lucia Specia (eds.), editors, Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice" (PDF). Nice, France: Mt-archive.info. pp. 83–91. Retrieved 30 October 2014.
  6. "MateCat".
  7. "MyMemory is the world's largest Translation Memory (TM) built collaboratively via MT and human contributions". Mymemory.translated.net. Retrieved 30 October 2014.
  8. "Moses is the most popular open source statistical MT toolkit". Statmt.org. Retrieved 30 October 2014.
  9. "Docs.oasis-open.org". Docs.oasis-open.org. Retrieved 30 October 2014.
  10. Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the MT Summit XIV, pages 35–42, Nice, France, September.
  11. Bertoldi et al., 2013; Cettolo et al., 2013; Turchi et al., 2013; Turchi et al., 2014
  12. "Casmacat.eu". Casmacat.eu. Retrieved 30 October 2014.
  13. Vicent Alabau, Ragnar Bonk, Christian Buck, Michael Carl, Francisco Casacuberta, Mercedes García-Martínez, Jesús González, Philipp Koehn, Luis Leiva, Bartolomé Mesa-Lao, Daniel Ortiz, Hervé Saint-Amand, Germán Sanchis, and Chara Tsoukala. 2013. Advanced computer-aided translation with a web-based workbench. In Proceedings of Workshop on Post-editing Technology and Practice, pages 55–62.