Richard Sproat

Richard William Sproat
Alma mater University of California, San Diego (B.A., 1981)
Massachusetts Institute of Technology (Ph.D., 1985) [1]
Scientific career
Fields Computational linguistics
Institutions Sakana AI
Google (2012–2024)
Thesis On Deriving the Lexicon (1985)
Doctoral advisor Ken Hale

Richard Sproat is a computational linguist, currently a research scientist at Sakana AI.[1] Prior to joining Sakana AI, he worked at Google from 2012 to 2024[2] on text normalization[3] and speech recognition.[1]

Linguistics

Sproat graduated from the Massachusetts Institute of Technology in 1985, under the supervision of Kenneth L. Hale.[4] His PhD thesis is one of the earliest works to derive morphosyntactically complex forms from the module that produces their phonological realizations, one of the core ideas of Distributed Morphology.[5]

One of Sproat's main contributions to computational linguistics is in the field of text normalization: his 2001 paper with colleagues, "Normalization of non-standard words",[6] is considered a seminal work in formalizing this component of speech synthesis systems. He has also worked on computational morphology[7] and the computational analysis of writing systems.[8]
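Text normalization maps "non-standard words" such as digit strings and abbreviations onto ordinary speakable words before synthesis. The following is a purely illustrative rule-based sketch of the idea, not Sproat et al.'s actual taxonomy or system; the abbreviation table and the digit-by-digit number reading are assumptions made for the example:

```python
# Toy text normalizer: expand a few non-standard word (NSW) classes
# into speakable words, as a TTS front end must do before synthesis.

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_number(token):
    # Read a digit string digit by digit (one of several possible
    # expansions; real systems choose by NSW class and context).
    return " ".join(ONES[int(d)] for d in token)

def normalize(text):
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdigit():
            out.append(expand_number(token))
        else:
            out.append(token)
    return " ".join(out)

print(normalize("Dr. Smith lives at 221 Baker St."))
# → Doctor Smith lives at two two one Baker Street
```

A real normalizer must also disambiguate by context (e.g. "1985" as a year versus a quantity, "St." as "Street" versus "Saint"), which is what makes the problem linguistically nontrivial.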

References

  1. Richard Sproat
  2. Curriculum vitae
  3. Sodimana, Keshan; Silva, Pasindu De; Sproat, Richard; Theeraphol, A.; Li, Chen Fang; Gutkin, Alexander; Sarin, Supheakmungkol; Pipatsrisawat, Knot (2018). "Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems" (PDF). 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018). pp. 147–151. doi:10.21437/SLTU.2018-31. S2CID 53333966.
  4. Sproat, Richard. "On Deriving the Lexicon". MITWPL. Retrieved 29 November 2020.
  5. Wiltschko, Martina (24 July 2014). The Universal Structure of Categories: Towards a Formal Typology. Cambridge University Press. p. 83. ISBN 9781107038516.
  6. Sproat, Richard; Black, Alan W.; Chen, Stanley; Kumar, Shankar; Ostendorf, Mari; Richards, Christopher (1 July 2001). "Normalization of non-standard words". Computer Speech & Language. 15 (3): 287–333. doi:10.1006/csla.2001.0169.
  7. Sproat, Richard (1992). Morphology and Computation. MIT Press. ISBN 9780262527026.
  8. Sproat, Richard (2000). A Computational Theory of Writing Systems. Cambridge University Press. ISBN 9780521663403.