Philipp Koehn | |
---|---|
Born | 1 August 1971, Erlangen, West Germany |
Citizenship | Germany |
Alma mater | Albert Schweitzer High School (Erlangen), University of Erlangen-Nuremberg, University of Tennessee, University of Southern California |
Known for | Europarl corpus, Moses |
Awards | Finalist – 2013 EPO European Inventor Award |
Scientific career | |
Fields | computer science, natural language processing, machine translation, cross-language information retrieval |
Institutions | University of Edinburgh, Johns Hopkins University |
Doctoral advisor | Kevin Knight |
Philipp Koehn (born 1 August 1971 in Erlangen, West Germany) is a computer scientist and researcher in the field of machine translation. [1] [2] His primary research interest is statistical machine translation, and he is one of the inventors of phrase-based machine translation. This sub-field of statistical machine translation uses sequences of words (so-called "phrases") as the basic unit of translation, extending the earlier word-based approaches. A 2003 paper he co-authored with Franz Josef Och and Daniel Marcu, Statistical phrase-based translation, attracted wide attention in the machine translation community and has been cited over a thousand times. [3] Phrase-based methods are widely used in machine translation applications in industry.
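The idea of translating phrases rather than single words can be illustrated with a minimal sketch. It assumes a tiny hand-written phrase table (`PHRASE_TABLE`, with made-up probabilities) and a monotone search with no reordering and no language model, so it is a toy illustration rather than the method of the 2003 paper or of any real system. The function name `translate` and the parameter `max_len` are hypothetical.

```python
# Hypothetical phrase table: source phrase -> list of (target phrase, probability).
# All entries and probabilities are invented for illustration.
PHRASE_TABLE = {
    ("das",): [("the", 0.5), ("this", 0.3)],
    ("ist",): [("is", 0.9)],
    ("das", "ist"): [("this is", 0.6)],
    ("ein",): [("a", 0.8)],
    ("kleines",): [("small", 0.6)],
    ("haus",): [("house", 0.9)],
    ("kleines", "haus"): [("small house", 0.7)],
}


def translate(source, table, max_len=3):
    """Monotone phrase-based search by dynamic programming.

    best[i] holds the highest-scoring translation of the first i source
    words as (score, list of target phrases), or None if none exists.
    """
    words = source.split()
    best = [(1.0, [])] + [None] * len(words)
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue
            phrase = tuple(words[start:end])
            for target, prob in table.get(phrase, []):
                score = best[start][0] * prob
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [target])
    if best[-1] is None:
        return None  # some source word is not covered by the table
    score, parts = best[-1]
    return " ".join(parts), score


if __name__ == "__main__":
    print(translate("das ist ein kleines haus", PHRASE_TABLE))
```

With this table the search prefers the multi-word entries for "das ist" and "kleines haus" and returns "this is a small house", whereas translating word by word would score lower and begin with "the is".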
Philipp Koehn received his PhD in computer science in 2003 from the University of Southern California, where he worked at the Information Sciences Institute advised by Kevin Knight. After a year as a postdoctoral fellow under Michael Collins at the Massachusetts Institute of Technology, he joined the University of Edinburgh as a lecturer in the School of Informatics in 2005. He was appointed reader in 2010 and professor in 2012. In 2014, he was appointed professor at the computer science department of The Johns Hopkins University, where he is affiliated with the Center for Language and Speech Processing.
The Moses machine translation decoder is an open-source project that was created by, and is maintained under the guidance of, Philipp Koehn. [4] The Moses decoder is a platform for developing statistical machine translation systems for any language pair, given a parallel corpus. [5] The decoder was developed mainly by Hieu Hoang and Philipp Koehn at the University of Edinburgh, extended during a Johns Hopkins University Summer Workshop, and further developed under Euromatrix and GALE project funding. The decoder (which is part of a complete statistical machine translation toolkit) is the de facto benchmark for research in the field.
Koehn continues to play a major role in the development of Moses. The decoder has also been supported by the European Framework 6 projects Euromatrix and TC-Star; the European Framework 7 projects EuroMatrixPlus, Let's MT, META-NET and MosesCore; and the DARPA GALE project, as well as by several institutions, including the University of Edinburgh, the University of Maryland, ITC-irst, the Massachusetts Institute of Technology, and others. Other substantial contributors to the Moses decoder include Hieu Hoang, Chris Dyer, Josh Schroeder, Marcello Federico, Richard Zens, and Wade Shen.
The Europarl corpus is a set of documents that consists of the proceedings of the European Parliament from 1996 to the present. The corpus has been compiled and expanded by a group of researchers led by Philipp Koehn at the University of Edinburgh. The data that makes up the corpus was extracted from the website of the European Parliament and then prepared for linguistic research. The latest release (2012) comprised up to 60 million words per language, [6] with 21 European languages represented: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
Machine translation is the use of either rule-based or probabilistic machine learning approaches to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission.
In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
Eurotra was a machine translation project established and funded by the European Commission from 1978 until 1992.
A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.
The School of Informatics is an academic unit of the University of Edinburgh, in Scotland, responsible for research, teaching, outreach and commercialisation in informatics. It was created in 1998 from the former department of artificial intelligence, the Centre for Cognitive Science and the department of computer science, along with the Artificial Intelligence Applications Institute (AIAI) and the Human Communication Research Centre.
Statistical machine translation (SMT) was a machine translation approach that superseded the earlier rule-based approach, which required an explicit description of every linguistic rule; this was costly and often did not generalize to other languages. Since 2003, the statistical approach itself has been gradually superseded by the deep learning-based neural network approach.
Machine translation is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another.
Moses is a free-software statistical machine translation engine, developed by the University of Edinburgh, that can be used to train statistical models of text translation from a source language to a target language. Moses then allows new source-language text to be decoded using these models to produce automatic translations in the target language. Training requires a parallel corpus of passages in the two languages, typically manually translated sentence pairs. Moses is released under the LGPL licence and is available both as source code and as binaries for Windows and Linux. Its development is primarily supported by the EuroMatrix project, with funding from the European Commission.
The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.
Caitra is a computer-assisted translation (CAT) tool developed by the University of Edinburgh. Provided as an online platform, Caitra is based on AJAX Web 2.0 technologies and the Moses decoder. The tool's web page is implemented with Ruby on Rails, an open-source web framework, and C++.
Interactive machine translation (IMT) is a specific sub-field of computer-aided translation. Under this paradigm, the software that assists the human translator attempts to predict the text the user is going to input by taking into account all the information it has available. Whenever a prediction is wrong and the user provides feedback to the system, a new prediction is made using the new information. This process is repeated until the translation matches the user's expectations.
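The predict-and-correct loop of interactive machine translation can be sketched as follows. The sketch assumes that the system holds a small scored list of candidate translations and that the user's feedback is the longest correct prefix plus one corrected character. Both are simplifications, and the names `predict`, `simulate` and `CANDIDATES` are hypothetical and do not come from any IMT tool.

```python
# Hypothetical scored candidate translations for one source sentence.
CANDIDATES = {
    "the home is little": 0.5,
    "the house is little": 0.3,
    "the house is small": 0.2,
}


def predict(candidates, prefix):
    """Return the best-scoring candidate consistent with the typed prefix."""
    matching = [(score, text) for text, score in candidates.items()
                if text.startswith(prefix)]
    if not matching:
        return prefix  # nothing to suggest; echo what the user typed
    return max(matching)[1]


def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def simulate(candidates, target):
    """Simulate a user who wants `target` and corrects one character at a time."""
    prefix, history = "", []
    while True:
        prediction = predict(candidates, prefix)
        history.append(prediction)
        if prediction == target or prefix == target:
            return history
        # The user keeps the correct part and types the next correct character.
        n = common_prefix_len(prediction, target)
        prefix = target[: n + 1]


if __name__ == "__main__":
    for step in simulate(CANDIDATES, "the house is small"):
        print(step)
```

In this run the system needs two corrections: after the user types "the hou" it switches to "the house is little", and after "the house is s" it reaches the intended sentence.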
The Europarl Corpus is a corpus that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release in 2001, it covered eleven official languages of the European Union. With the political expansion of the EU, the official languages of the ten new member states were added to the corpus data. The latest release (2012) comprised up to 60 million words per language, with the newly added languages slightly underrepresented because data for them is available only from 2007 onwards. This latest version includes 21 European languages from the Romanic, Germanic, Slavic, Finno-Ugric and Baltic families, plus Greek.
MateCat is a web-based computer-assisted translation (CAT) tool, released as open-source software under the Lesser General Public License (LGPL).
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
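As an illustration of what predicting the likelihood of a sequence of words usually involves, the probability of a target sentence given a source sentence is commonly factored word by word with the chain rule. The symbols below are notation introduced here and do not come from the text above: $x$ is the source sentence, $y = (y_1, \dots, y_T)$ the target sentence, and $\theta$ the network parameters.

```latex
P(y \mid x; \theta) = \prod_{t=1}^{T} P\left(y_t \mid y_1, \dots, y_{t-1}, x; \theta\right)
```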
The EuroMatrix is a project that ran from September 2006 to February 2009. The project aimed to develop and improve machine translation (MT) systems between all official languages of the European Union (EU).
The EuroMatrixPlus is a project that ran from March 2009 to February 2012. EuroMatrixPlus succeeded a project called EuroMatrix and continued in further development and improvement of machine translation (MT) systems for languages of the European Union (EU).
Google Neural Machine Translation (GNMT) was a neural machine translation (NMT) system developed by Google and introduced in November 2016 that used an artificial neural network to increase fluency and accuracy in Google Translate. The neural network consisted of two main blocks, an encoder and a decoder, both of LSTM architecture with eight 1024-wide layers each and a simple 1-layer 1024-wide feedforward attention mechanism connecting them. The total number of parameters has been variously described as over 160 million, approximately 210 million, 278 million or 380 million. By 2020, the system had been replaced by another deep learning system based on transformers.
A confusion network is a natural language processing method that combines outputs from multiple automatic speech recognition or machine translation systems. Confusion networks are simple linear directed acyclic graphs with the property that each path from the start node to the end node goes through all the other nodes. The set of words represented by edges between two nodes is called a confusion set. In machine translation, the defining characteristic of confusion networks is that they allow multiple ambiguous inputs, deferring committal translation decisions until later stages of processing. This approach is used in the open-source machine translation software Moses and the proprietary translation API in IBM Bluemix Watson.
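The linear structure described above can be represented, in a minimal sketch, as a list of confusion sets, one per pair of adjacent nodes. The example network, its probabilities, and the names `NETWORK`, `best_path` and `all_paths` are invented for illustration and do not reflect the data format used by Moses or any other system. An empty string stands for an edge that emits no word.

```python
from itertools import product
from math import prod

# Each dict is one confusion set: alternative words on the edges between
# two adjacent nodes, with hypothetical posterior probabilities.
NETWORK = [
    {"I": 0.9, "eye": 0.1},
    {"saw": 0.6, "sore": 0.4},
    {"a": 0.7, "": 0.3},          # "" is an empty (skip) edge
    {"movie": 0.8, "moving": 0.2},
]


def best_path(slots):
    """Choose the most probable word in every confusion set."""
    words = [max(slot, key=slot.get) for slot in slots]
    return " ".join(w for w in words if w)


def all_paths(slots):
    """Yield every start-to-end path as (text, probability).

    Because the graph is linear, a path picks exactly one edge per set.
    """
    for combo in product(*(slot.items() for slot in slots)):
        text = " ".join(word for word, _ in combo if word)
        yield text, prod(p for _, p in combo)


if __name__ == "__main__":
    print(best_path(NETWORK))
    print(len(list(all_paths(NETWORK))), "paths in total")
```

Since every path passes through all nodes, the number of paths is the product of the confusion set sizes (here 2 × 2 × 2 × 2 = 16), and a later processing stage can choose among them instead of committing to a single input early.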
Arantza Díaz de Ilarraza Sánchez is a professor of informatics at the University of the Basque Country. In 1981, she began her work as a lecturer at the Faculty of Informatics of Donostia. As a specialist in language and computer technology, she has held positions of responsibility in Basque technology institutions.