Richard Sproat

Richard William Sproat
Alma mater: University of California, San Diego (B.A., 1981); Massachusetts Institute of Technology (Ph.D., 1985) [1]
Scientific career
Fields: Computational linguistics
Institutions: Google (2012–present)
Thesis: On Deriving the Lexicon (1985)
Doctoral advisor: Ken Hale

Richard Sproat is a computational linguist currently working for Google as a researcher on text normalization. [1]

Linguistics

Sproat received his PhD from the Massachusetts Institute of Technology in 1985, under the supervision of Kenneth L. Hale. [2] His PhD thesis is one of the earliest works to derive morphosyntactically complex forms from the module which produces the phonological form that realizes these morphosyntactic expressions, one of the core ideas in Distributed Morphology. [3]

One of Sproat's main contributions to computational linguistics is in text normalization: his 2001 paper with colleagues, "Normalization of non-standard words", [4] is considered a seminal work in formalizing this component of speech synthesis systems. He has also worked on computational morphology [5] and the computational analysis of writing systems. [6]

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

Greek language: Indo-European language

Greek is an independent branch of the Indo-European family of languages, native to Greece, Cyprus, Albania, other parts of the Eastern Mediterranean and the Black Sea. It has the longest documented history of any living Indo-European language, spanning at least 3,400 years of written records. Its writing system is the Greek alphabet, which has been used for over 2,600 years; previously, Greek was recorded in writing systems such as Linear B and the Cypriot syllabary. The alphabet arose from the Phoenician script and was in turn the basis of the Latin, Cyrillic, Armenian, Coptic, Gothic, and many other writing systems.

Language: Communication using symbols (such as words) structured with grammar

A language is a structured system of communication used by humans, based on speech and gesture, sign, or often writing. The structure of language is its grammar and the free components are its vocabulary. Many languages, including the most widely spoken ones, have writing systems that enable sounds or signs to be recorded for later reactivation.

The following outline is provided as an overview of and topical guide to linguistics:

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

In linguistics, morphology is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning. Morphology differs from morphological typology, which is the classification of languages based on their use of words, and lexicology, which is the study of words and how they make up a language's vocabulary.

Natural language processing: Field of computer science and linguistics

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Sanskrit: Ancient Indo-Aryan language of South Asia

Sanskrit is a classical language of South Asia belonging to the Indo-Aryan branch of the Indo-European languages. It arose in South Asia after its predecessor languages had diffused there from the northwest in the late Bronze Age. Sanskrit is the sacred language of Hinduism, the language of classical Hindu philosophy, and of historical texts of Buddhism and Jainism. It was a link language in ancient and medieval South Asia, and upon transmission of Hindu and Buddhist culture to Southeast Asia, East Asia and Central Asia in the early medieval era, it became a language of religion and high culture, and of the political elites in some of these regions. As a result, Sanskrit had a lasting impact on the languages of South Asia, Southeast Asia and East Asia, especially in their formal and learned vocabularies.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

In an English-speaking country, Standard English (SE) is the variety of English that has undergone substantial regularisation and is associated with formal schooling, language assessment, and official print publications, such as public service announcements and newspapers of record, etc. It is local to nowhere: its grammatical and lexical components are no longer regionally marked, although many of them originated in different, non-adjacent dialects, and it has very little of the variation found in spoken or earlier written varieties of English. According to Trudgill, Standard English is a dialect pre-eminently used in writing that is largely distinguishable from other English dialects by means of its grammar.

Classical Arabic: Form of the Arabic language used in Umayyad and Abbasid literary texts

Classical Arabic or Quranic Arabic is the standardized literary form of the Arabic language used from the 7th century and throughout the Middle Ages, most notably in Umayyad and Abbasid literary texts, such as poetry, elevated prose, and oratory, and is also the liturgical language of Islam.

Indus script: Short strings of symbols associated with the Indus Valley Civilization

The Indus script is a corpus of symbols produced by the Indus Valley Civilization. Most inscriptions containing these symbols are extremely short, making it difficult to judge whether or not these symbols constituted a script used to record a language, or even symbolise a writing system. In spite of many attempts, the 'script' has not yet been deciphered, but efforts are ongoing. There is no known bilingual inscription to help decipher the script, and the script shows no significant changes over time. However, some of the syntax varies depending upon location.

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.
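
As a rough illustration of the idea, the following Python sketch normalizes a sentence for a speech-synthesis-style front end by lowercasing, expanding abbreviations, and spelling out small integers. The `ABBREVIATIONS` table and the function names are assumptions made for this example only; they are not taken from any published system, and a practical normalizer would need far larger, context-sensitive resources.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def normalize(text: str) -> str:
    """Lowercase, expand known abbreviations, and spell out small integers."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,3}", token):
            words.append(number_to_words(int(token)))
        else:
            # Strip surrounding punctuation from ordinary tokens.
            words.append(token.strip(".,;:!?"))
    return " ".join(w for w in words if w)


print(normalize("Dr. Smith lives at 221 Baker St."))
# doctor smith lives at two hundred twenty one baker street
```

The table maps "st." to "street" unconditionally, although the same abbreviation can also stand for "saint". Ambiguities of this kind are one reason there is no all-purpose normalization procedure.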

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
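
For text without explicit word boundaries, one simple baseline is greedy longest-match segmentation against a word list. The Python sketch below is a toy version under stated assumptions: the `max_match` name and the tiny lexicon are hypothetical, and unspaced English stands in for a script that does not mark word boundaries.

```python
def max_match(text: str, lexicon: set[str]) -> list[str]:
    """Greedy left-to-right segmentation that always takes the longest known word."""
    longest = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


lexicon = {"the", "cat", "cats", "sat", "at"}
print(max_match("thecatsat", lexicon))
# ['the', 'cats', 'at']
```

The greedy choice returns "the cats at" even though "the cat sat" is also a valid reading given this lexicon, which illustrates why boundary signals can be ambiguous.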

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time.

Linguistics is the scientific study of language. It encompasses the analysis of every aspect of language, as well as the methods for studying and modeling them.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
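
A deliberately naive suffix-stripping stemmer can illustrate the point that related words only need to map to the same stem. The suffix list and the minimum stem length in this Python sketch are assumptions chosen for the example; it is not the Porter stemmer or any other published algorithm.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words: list[str]) -> dict[str, list[str]]:
    """Group words that share a stem, as a search engine might for query expansion."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


print(conflate(["walk", "walks", "walked", "walking", "running"]))
# {'walk': ['walk', 'walks', 'walked', 'walking'], 'runn': ['running']}
```

The stem "runn" is not a valid root, but it would still serve its purpose if every related form mapped to it. This stripper maps "runs" to "run" instead, which is the kind of gap that more careful stemming algorithms address.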

The following outline is provided as an overview of and topical guide to natural language processing:

References

  1. Sproat, Richard. "Richard Sproat". Retrieved 29 November 2020.
  2. Sproat, Richard. "On Deriving the Lexicon". MITWPL. Retrieved 29 November 2020.
  3. Wiltschko, Martina. The Universal Structure of Categories: Towards a Formal Typology. Cambridge. p. 83. ISBN 9781107038516.
  4. Sproat, Richard; Black, Alan W.; Chen, Stanley; Kumar, Shankar; Ostendorf, Mari; Richards, Christopher (1 July 2001). "Normalization of non-standard words". Computer Speech & Language. 15 (3): 287–333. doi:10.1006/csla.2001.0169.
  5. Sproat, Richard (1992). Morphology and Computation. MIT Press. ISBN 9780262527026.
  6. Sproat, Richard (2000). A Computational Theory of Writing Systems. Cambridge. ISBN 9780521663403.