Verbmobil was a long-term interdisciplinary language technology research project, centred on machine translation, that aimed to develop a system able to recognize, translate and produce natural utterances and thus "translate spontaneous speech robustly and bidirectionally for German/English and German/Japanese". [1]
Verbmobil research was carried out between 1993 and 2000 and received a total of 116 million German marks (roughly 60 million euros) in funding from Germany's Federal Ministry of Research and Technology (Bundesministerium für Forschung und Technologie); industry partners such as DaimlerChrysler, Siemens and Philips contributed an additional 52 million DM (roughly 26 million euros).
In the Verbmobil II project, the University of Tübingen created semi-automatically annotated treebanks for German, English and Japanese spontaneous speech. TüBa-D/S [2] contains approximately 38,000 sentences or 360,000 words; TüBa-E/S [3] approximately 30,000 sentences or 310,000 words; TüBa-J/S [4] approximately 18,000 sentences or 160,000 words.
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.
Machine translation is a sub-field of computational linguistics that investigates the use of software, whether rule-based or driven by probabilistic machine learning, to translate text or speech from one natural language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
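As a rough illustration of the two approach families mentioned above, the toy Python sketch below (all data, rules and category names are invented for the example) contrasts a hand-written rule-based classifier with a simple probabilistic naive Bayes model estimated from the same miniature corpus:

```python
# Toy contrast of rule-based vs. probabilistic NLP on an invented corpus.
from collections import Counter
import math

docs = [
    ("book a flight to berlin", "travel"),
    ("reserve a hotel room", "travel"),
    ("the parser builds a syntax tree", "linguistics"),
    ("annotate the corpus with part-of-speech tags", "linguistics"),
]

# Rule-based: hand-written keyword rules decide the category.
RULES = {"flight": "travel", "hotel": "travel",
         "parser": "linguistics", "corpus": "linguistics"}

def rule_based(text):
    for word in text.split():
        if word in RULES:
            return RULES[word]
    return "unknown"

# Probabilistic: a naive Bayes model estimated from the labelled corpus.
def train(corpus):
    word_counts, label_counts = {}, Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts

def naive_bayes(text, word_counts, label_counts):
    total = sum(label_counts.values())
    vocab = len({w for c in word_counts.values() for w in c})
    best, best_lp = None, -math.inf
    for label, count in label_counts.items():
        wc, n = word_counts[label], sum(word_counts[label].values())
        lp = math.log(count / total)
        for w in text.split():
            lp += math.log((wc[w] + 1) / (n + vocab))  # add-one smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

wc, lc = train(docs)
print(rule_based("book a flight"))            # -> travel
print(naive_bayes("tag the corpus", wc, lc))  # -> linguistics
```

The rule-based variant generalizes only as far as its hand-written rules reach, while the probabilistic variant derives its behaviour from counts over the corpus; this is the basic trade-off between the two approach families named above.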
Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference. Large collections of text allow linguists to run quantitative analyses of linguistic concepts that are otherwise hard to quantify.
In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
Link grammar (LG) is a theory of syntax by Davy Temperley and Daniel Sleator which builds relations between pairs of words, rather than constructing constituents in a phrase structure hierarchy. Link grammar is similar to dependency grammar, but whereas dependency grammar includes a head-dependent relationship, link grammar makes the head-dependent relationship optional. Colored Multiplanar Link Grammar (CMLG) is an extension of LG that allows crossing relations between pairs of words. The relationship between words is indicated with link types, which makes link grammar closely related to certain categorial grammars.
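To make the idea concrete, here is a minimal, hypothetical Python sketch (the sentence, link types and helper function are invented for illustration): a parse is simply a set of typed links between word pairs, and classic LG additionally requires that no two links cross, the constraint that CMLG relaxes:

```python
# Hypothetical illustration of a link-grammar-style parse: typed links
# between word pairs instead of phrase-structure constituents.
from typing import List, Tuple

Link = Tuple[int, int, str]  # (left word index, right word index, link type)

sentence = ["the", "cat", "chased", "a", "mouse"]

links: List[Link] = [
    (0, 1, "D"),  # determiner "the" connects to noun "cat"
    (1, 2, "S"),  # subject "cat" connects to verb "chased"
    (2, 4, "O"),  # verb "chased" connects to object "mouse"
    (3, 4, "D"),  # determiner "a" connects to noun "mouse"
]

def is_planar(links: List[Link]) -> bool:
    """True if no two links cross (the classic LG constraint)."""
    for i, (a1, b1, _) in enumerate(links):
        for a2, b2, _ in links[i + 1:]:
            # Two links cross iff exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                return False
    return True

print(is_planar(links))  # -> True: this parse is legal in classic LG
```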
The Technische Universität Darmstadt, commonly known as TU Darmstadt, is a research university in the city of Darmstadt, Germany. It was founded in 1877 and received the right to award doctorates in 1899. In 1882, it was the first university in the world to establish a chair in electrical engineering; in 1883, it founded the first faculty of electrical engineering and introduced the world's first degree course in the subject. In 2004, it became the first German university to be declared an autonomous university. TU Darmstadt has repeatedly assumed a pioneering role in Germany: computer science, electrical engineering, artificial intelligence, mechatronics, business informatics, political science and many other courses were introduced as scientific disciplines in Germany by Darmstadt faculty.
Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, and an API that helps developers build browser extensions and software applications. As of 2022, Google Translate supports 133 languages at various levels. As of April 2016 it claimed over 500 million total users, translating more than 100 billion words daily; in May 2013 the company had stated that it served over 200 million people daily.
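The consumer Translate website itself exposes no official public API; the API mentioned above is Google's separately offered Cloud Translation service. A minimal hedged sketch of calling its v2 REST endpoint from Python, assuming the third-party `requests` library and a valid API key (the key value below is a placeholder):

```python
# Hedged sketch: calls the Cloud Translation v2 REST endpoint.
# "YOUR_API_KEY" is a placeholder; a real key must be supplied.
import requests

def translate(text: str, target: str, api_key: str) -> str:
    resp = requests.post(
        "https://translation.googleapis.com/language/translate/v2",
        params={"key": api_key},
        json={"q": text, "target": target},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["translations"][0]["translatedText"]

# Example (requires a real key):
# print(translate("Guten Morgen", "en", "YOUR_API_KEY"))  # -> "Good morning"
```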
In the history of artificial intelligence, an AI winter is a period of reduced funding and interest in artificial intelligence research. The field has experienced several hype cycles, followed by disappointment and criticism, followed by funding cuts, followed by renewed interest years or even decades later.
The German Research Center for Artificial Intelligence (Deutsches Forschungszentrum für Künstliche Intelligenz, DFKI) is one of the world's largest nonprofit contract research institutes for software technology based on artificial intelligence (AI) methods. DFKI was founded in 1988 and has facilities in the German cities of Kaiserslautern, Saarbrücken, Lübeck, Oldenburg, Osnabrück, Bremen, Darmstadt and Berlin.
Jaime Guillermo Carbonell was a computer scientist who made seminal contributions to the development of natural language processing tools and technologies. His extensive research in machine translation resulted in several state-of-the-art language translation and artificial intelligence systems. He earned B.S. degrees in Physics and in Mathematics from MIT in 1975 and completed his Ph.D. under Roger Schank at Yale University in 1979. He joined Carnegie Mellon University as an assistant professor of computer science in 1979 and lived in Pittsburgh from then on. He was affiliated with the Language Technologies Institute, Computer Science Department, Machine Learning Department, and Computational Biology Department at Carnegie Mellon.
Michael Kohlhase is a German computer scientist and professor at University of Erlangen–Nuremberg, where he is head of the KWARC research group.
Hans Uszkoreit is a German computational linguist.
The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.
Deep Linguistic Processing with HPSG - INitiative (DELPH-IN) is a collaboration where computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.
Sebastian Möller is an expert in quality of experience and speech technology.
The following outline is provided as an overview of and topical guide to natural-language processing:
EuroMatrixPlus was a project that ran from March 2009 to February 2012. It succeeded the EuroMatrix project and continued the development and improvement of machine translation (MT) systems for the languages of the European Union (EU).
Google Neural Machine Translation (GNMT) is a neural machine translation (NMT) system developed by Google and introduced in November 2016 that uses an artificial neural network to increase fluency and accuracy in Google Translate. The neural network consists of two main blocks, an encoder and a decoder, both of LSTM architecture with 8 1024-wide layers each and a simple 1-layer 1024-wide feedforward attention mechanism connecting them. The total number of parameters has been variously described as over 160 million, approximately 210 million, 278 million or 380 million.
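A heavily reduced sketch of that shape (not Google's implementation), written in PyTorch with toy dimensions standing in for the 8-layer, 1024-wide LSTM stacks: an LSTM encoder, an LSTM decoder, and a single-layer feedforward (additive) attention connecting them:

```python
# Toy-scale sketch of the GNMT shape: LSTM encoder/decoder plus a
# 1-layer feedforward attention. Sizes are illustrative, not GNMT's.
import torch
import torch.nn as nn

class TinyGNMT(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # GNMT: 8 x 1024
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # GNMT: 8 x 1024
        # Single-layer feedforward (additive) attention scoring.
        self.att = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                 nn.Linear(dim, 1))
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, src, tgt):
        enc, _ = self.encoder(self.emb(src))  # (B, S, D) source states
        dec, _ = self.decoder(self.emb(tgt))  # (B, T, D) target states
        B, S, D = enc.shape
        T = dec.size(1)
        # Score every (decoder step, encoder step) pair with the FF net.
        e = enc.unsqueeze(1).expand(B, T, S, D)
        d = dec.unsqueeze(2).expand(B, T, S, D)
        scores = self.att(torch.cat([d, e], -1)).squeeze(-1)  # (B, T, S)
        ctx = torch.softmax(scores, -1) @ enc                 # (B, T, D)
        return self.out(torch.cat([dec, ctx], -1))            # (B, T, vocab)

model = TinyGNMT()
logits = model(torch.randint(0, 1000, (2, 7)),  # source token ids
               torch.randint(0, 1000, (2, 5)))  # target token ids
print(logits.shape)  # torch.Size([2, 5, 1000])
```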