Fuzzy matching (computer-assisted translation)

Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% exact when finding correspondences between segments of a text and entries in a database of previous translations. It usually operates on sentence-level segments, but some translation technology allows matching at the phrasal level. It is used when the translator is working with a translation memory (TM), and it relies on approximate string matching.

Background

When an exact match cannot be found in the TM database for the text being translated, the translator can search for a match that is less than exact. The translator sets the fuzzy match threshold to a percentage value below 100%, and the database then returns any stored matches that meet that threshold. The primary function of fuzzy matching is to assist the translator by speeding up the translation process; it is not designed to replace the human translator.
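The threshold lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fuzzy_matches`, the sample memory `SAMPLE_TM`, and the default threshold of 0.75 are hypothetical, and the similarity score comes from the standard library's `difflib.SequenceMatcher`, which is not necessarily the measure any particular TM tool uses.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored translation.
SAMPLE_TM = {
    "Press the red button.": "Appuyez sur le bouton rouge.",
    "Close the window.": "Fermez la fenêtre.",
}

def fuzzy_matches(segment, tm, threshold=0.75):
    """Return (score, source, target) tuples whose score meets the threshold.

    The score is difflib's similarity ratio in the range 0.0 to 1.0,
    used here as a stand-in for a TM tool's match percentage.
    """
    results = []
    for source, target in tm.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold:
            results.append((score, source, target))
    # Best matches first.
    return sorted(results, reverse=True)

if __name__ == "__main__":
    for score, source, target in fuzzy_matches("Press the green button.", SAMPLE_TM):
        print(f"{score:.0%}  {source}  ->  {target}")
```

Lowering the threshold returns more candidates, but each one needs more editing; raising it to 1.0 restricts the lookup to exact matches (ignoring letter case in this sketch).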

History

Because of the polymorphous and dynamic nature of language, particularly English (which accounts for 90% of all source texts undergoing translation in the localisation industry[citation needed]), methods are always being sought to make the translation process easier and faster. Since the late 1980s, translation memory tools have been developed to increase productivity and make the whole translation process faster for the translator.

In the 1990s, fuzzy matching began to take off as a prominent feature of TM tools. Despite some issues concerning the extra work involved in editing a fuzzy match "proposal", it remains a popular part of TM technology and is currently a feature of most popular TM tools.

Methodology

The TM tool searches the database to locate segments that are an approximate match for a segment in a new source text to be translated. The TM, in effect, "proposes" the match to the translator; it is then up to the translator to accept this proposal or to edit it so that it corresponds more closely to the new source text. In this way, fuzzy matching can speed up the translation process and lead to increased productivity.
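The propose-then-review workflow can be illustrated with a small Python sketch. It assumes a word-level Levenshtein edit distance as the similarity measure, which is one common approach to approximate string matching; the article does not say which measure TM tools actually use. The names `edit_distance`, `match_percent`, `propose`, and `MEMORY`, and the 70% default threshold, are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def match_percent(new_segment, stored_segment):
    """Similarity as a percentage: 100 means identical word sequences."""
    a = new_segment.lower().split()
    b = stored_segment.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)

def propose(new_segment, memory, threshold=70.0):
    """Return the best (percent, source, target) at or above threshold, else None."""
    best = None
    for source, target in memory:
        pct = match_percent(new_segment, source)
        if pct >= threshold and (best is None or pct > best[0]):
            best = (pct, source, target)
    return best

# Hypothetical memory of (source, target) translation units.
MEMORY = [
    ("Save the file before closing.", "Enregistrez le fichier avant de fermer."),
    ("Open the file.", "Ouvrez le fichier."),
]

if __name__ == "__main__":
    print(propose("Save the document before closing.", MEMORY))
```

The returned target is only a proposal: in this example the stored translation still refers to the file rather than the document, so the translator has to edit it before accepting it.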

This raises questions about the quality of the resulting translations. A translator under pressure to deliver on time may accept a fuzzy match proposal without checking its suitability and context. TM databases are built up from the input of many different translators working on a variety of texts, so there is a danger that sentences extracted from this word "tapestry" will be a stitched-together hodgepodge of styles, the antithesis of the consistency being striven for: what some critics have dubbed "sentence salad". Trust in the TM's proposals can therefore be a problem when trying to strike a balance between a faster translation process and the quality of the translation. Nevertheless, fuzzy matching is still an important part of the translator's tool-kit.

Related Research Articles

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

A translation memory (TM) is a database that stores "segments", which can be sentences, paragraphs or sentence-like units that have previously been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”. Individual words are handled by terminology bases and are not within the domain of TM.

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.

Computer-aided translation (CAT), also referred to as computer-assisted translation or computer-aided human translation (CAHT), is the use of software to assist a human translator in the translation process. The translation is created by a human, and certain aspects of the process are facilitated by software; this is in contrast with machine translation (MT), in which the translation is created by a computer, optionally with some human intervention.

A translation management system (TMS), formerly globalization management system (GMS), is a type of software for automating many parts of the human language translation process and maximizing translator efficiency. The idea of a translation management system is to automate all repeatable and non-essential work that can be done by software, leaving only the creative work of translation and review to human beings. A translation management system generally includes at least two types of technology: process management technology to automate the flow of work, and linguistic technology to aid the translator.

OmegaT is a computer-assisted translation tool written in the Java programming language. It is free software originally developed by Keith Godfrey in 2000, and is currently developed by a team led by Aaron Madlon-Kay.

The name Wordfast is used for any number of translation memory products developed by Wordfast LLC. The original Wordfast product, now called Wordfast Classic, was developed by Yves Champollion in 1999 as a cheaper alternative to Trados, a well-known translation memory program. The current Wordfast products run on a variety of platforms, but use largely compatible translation memory formats, and often also have similar workflows. The software is most popular with freelance translators, although some of the products are also suited for corporate environments.

Various methods for the evaluation of machine translation have been employed. This article focuses on the evaluation of the output of machine translation, rather than on performance or usability evaluation.

Virtaal is a computer-assisted translation tool written in the Python programming language. It is free software developed and maintained by Translate.org.za.

Trados Studio is a computer-assisted translation software tool which offers a complete, centralised translation environment for editing, reviewing and managing translation projects and terminology, either offline in a desktop tool or online in the cloud. Trados Studio is part of the Trados product portfolio, a suite of translation products owned by RWS that enables freelance translators, language service providers (LSPs) and corporations to streamline processes and improve efficiency while keeping costs down.

GlobalSight is a free and open source translation management system (TMS) released under the Apache License 2.0. As of version 7.1 it supports the TMX and SRX 2.0 Localization Industry Standards Association standards. It was developed in the Java programming language and uses a MySQL database. GlobalSight also supports computer-assisted translation and machine translation.

Open Language Tools is a Java project released by Sun Microsystems under the terms of Sun’s CDDL.

openTMS is an acronym for Open Source Translation Management System.

Google Translator Toolkit was an online computer-assisted translation (CAT) tool, a web application designed to let translators edit the translations that Google Translate automatically generated, using its own or user-uploaded glossaries and translation memories. With the Google Translator Toolkit, translators could organize their work and use shared translations, glossaries and translation memories. It allowed translators to upload and translate Microsoft Word documents, OpenDocument, RTF, HTML, text, and Wikipedia articles.

Caitra is a computer-assisted translation (CAT) tool developed by the University of Edinburgh. Provided as an online platform, Caitra is based on AJAX Web 2.0 technologies and the Moses decoder. The web page of the tool is implemented with Ruby on Rails, an open source web framework, and C++.

Post-editing is the process whereby humans amend machine-generated translation to achieve an acceptable final product. A person who post-edits is called a post-editor. The concept of post-editing is linked to that of pre-editing. In the process of translating a text via machine translation, best results may be gained by pre-editing the source text (for example, by applying the principles of controlled language) and then post-editing the machine output. It is distinct from editing, which refers to the process of improving human-generated text. Post-edited text may afterwards be revised to ensure the quality of the language choices, or proofread to correct simple mistakes.

Using controlled language in machine translation poses several problems.

The name MetaTexis is used for several software products developed by MetaTexis Software and Services. The main software products are MetaTexis for Word and the MetaTexis Server. MetaTexis for Word is translation memory software, also called a computer-assisted translation tool, that runs inside Microsoft Word. The MetaTexis Server is server software for translation memories (TMs) and terminology databases (TDBs) that allows numerous translators to work with the same TMs and TDBs via LAN or Internet.

memoQ is a proprietary computer-assisted translation software suite which runs on Microsoft Windows operating systems. It is developed by the Hungarian software company memoQ Fordítástechnológiai Zrt., formerly Kilgray, a provider of translation management software established in 2004 and cited as one of the fastest-growing companies in the translation technology sector in 2012 and 2013. memoQ provides translation memory, terminology, machine translation integration and reference information management in desktop, client/server and web application environments.

MateCat is a web-based computer-assisted translation (CAT) tool. MateCat is released as open source software under the Lesser General Public License (LGPL) from the Free Software Foundation.