Stable release | 5-prealpha / August 3, 2015 |
---|---|
Written in | Java |
Operating system | Cross-platform |
Type | Speech recognition library |
License | BSD-style [1] |
Website | cmusphinx |

Stable release | 5-prealpha / August 5, 2015 |
---|---|
Written in | C |
Operating system | Cross-platform |
Type | Speech recognition library |
License | BSD-style |
Website | cmusphinx |
CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2–4) and an acoustic model trainer (SphinxTrain).
In 2000, the Sphinx group at Carnegie Mellon committed to open-sourcing several speech recognizer components, including Sphinx 2 and, later, Sphinx 3 (in 2001). The speech decoders come with acoustic models and sample applications. The available resources also include software for acoustic model training, language model compilation, and a public-domain pronunciation dictionary, cmudict.
Sphinx encompasses a number of software systems, described below.
Sphinx is a continuous-speech, speaker-independent recognition system that uses hidden Markov acoustic models (HMMs) and an n-gram statistical language model. It was developed by Kai-Fu Lee. Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition, the possibility of which was in dispute at the time (1986). [2]
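The roles of the two models can be summarized with the standard decoding objective (a generic formulation in common notation, not a quotation from the Sphinx documentation): the recognizer searches for the word sequence

$$
\hat{W} = \arg\max_{W} \, P(O \mid W)\,P(W), \qquad
P(W) \approx \prod_{i} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1}),
$$

where the acoustic likelihood $P(O \mid W)$ is scored by the HMMs and the prior $P(W)$ by the n-gram language model.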
Sphinx is of historical interest only; it has been superseded in performance by subsequent versions.
A fast performance-oriented recognizer, originally developed by Xuedong Huang at Carnegie Mellon and released as open source with a BSD-style license on SourceForge by Kevin Lenzo at LinuxWorld in 2000. Sphinx 2 focuses on real-time recognition suitable for spoken language applications. As such it incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching and so on. It is used in dialog systems and language learning systems. It can be used in computer-based PBX systems such as Asterisk. Sphinx 2 code has also been incorporated into a number of commercial products. It is no longer under active development (other than for routine maintenance). Current real-time decoder development is taking place in the Pocket Sphinx project. [3]
Sphinx 2 used a semi-continuous representation for acoustic modeling (i.e., a single set of Gaussians is used for all models, with individual models represented as a weight vector over these Gaussians). Sphinx 3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Recent developments (in algorithms and in hardware) have made Sphinx 3 "near" real-time, although not yet suitable for critical interactive applications. Sphinx 3 is under active development and in conjunction with SphinxTrain provides access to a number of modern modeling techniques, such as LDA/MLLT, MLLR and VTLN, that improve recognition accuracy (see the article on Speech Recognition for descriptions of these techniques).
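The distinction between the two acoustic-model representations can be written out explicitly (standard notation; the symbols are illustrative rather than taken from the Sphinx source). In a semi-continuous model every state $j$ shares a single codebook of $K$ Gaussians and differs only in its mixture weights, whereas in a continuous model each state owns its Gaussian parameters:

$$
b_j^{\text{semi}}(\mathbf{o}_t) = \sum_{k=1}^{K} w_{jk}\,\mathcal{N}(\mathbf{o}_t;\,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad
b_j^{\text{cont}}(\mathbf{o}_t) = \sum_{m=1}^{M_j} w_{jm}\,\mathcal{N}(\mathbf{o}_t;\,\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm}).
$$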
Sphinx 4 is a complete rewrite of the Sphinx engine with the goal of providing a more flexible framework for research in speech recognition, written entirely in the Java programming language. Sun Microsystems supported the development of Sphinx 4 and contributed software engineering expertise to the project. Participants included individuals at MERL, MIT and CMU. (Currently supported languages are C, C++, C#, Python, Ruby, Java, and JavaScript.)
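As an illustration of the Java framework, the following is a minimal sketch of offline transcription with the high-level `edu.cmu.sphinx.api` classes, following the published Sphinx-4 tutorial; the model and dictionary resource paths are those bundled with recent Sphinx-4 releases and should be treated as assumptions to adjust for a given installation.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscribeFile {
    public static void main(String[] args) throws Exception {
        // Point the recognizer at an acoustic model, pronunciation dictionary
        // and language model (paths assumed from the bundled en-us resources).
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // The recognizer expects 16 kHz, 16-bit, mono PCM audio.
        try (InputStream stream = new FileInputStream(args[0])) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```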
Current development goals include:
A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.
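To make "fixed-point arithmetic" concrete, here is a small sketch (kept in Java for consistency with the example above, although PocketSphinx itself is written in C) of evaluating a diagonal-covariance Gaussian log-likelihood with Q15 integers instead of floating point; the scaling choice and helper names are illustrative assumptions, not PocketSphinx internals.

```java
public class FixedPointGaussian {
    // Q15 fixed point: a real value x is stored as round(x * 2^15) in an int.
    static final int Q = 15;

    static int toFixed(double x)   { return (int) Math.round(x * (1 << Q)); }
    static double toDouble(int x)  { return x / (double) (1 << Q); }
    static int fxMul(int a, int b) { return (int) (((long) a * (long) b) >> Q); }

    // Un-normalized log-density of a diagonal-covariance Gaussian:
    //   log N(x) = const - 0.5 * sum_d (x_d - mu_d)^2 / var_d
    // evaluated with integer operations only, once the inputs are in Q15.
    static int logLikelihood(int[] x, int[] mean, int[] invVar) {
        int acc = 0;
        for (int d = 0; d < x.length; d++) {
            int diff = x[d] - mean[d];       // Q15
            int sq   = fxMul(diff, diff);    // Q15
            acc     += fxMul(sq, invVar[d]); // Q15
        }
        return -(acc >> 1);                  // multiply by -0.5
    }

    public static void main(String[] args) {
        int[] x      = { toFixed(0.20), toFixed(-0.10) };
        int[] mean   = { toFixed(0.10), toFixed(0.05)  };
        int[] invVar = { toFixed(2.0),  toFixed(4.0)   }; // inverse variances
        System.out.println(toDouble(logLikelihood(x, mean, invVar))); // about -0.055
    }
}
```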
Speech processing is the study of speech signals and the methods used to process them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Dabbala Rajagopal "Raj" Reddy is an Indian-American computer scientist and a winner of the Turing Award. He is one of the early pioneers of artificial intelligence and has served on the faculty of Stanford and Carnegie Mellon for over 50 years. He was the founding director of the Robotics Institute at Carnegie Mellon University. He was instrumental in helping to create Rajiv Gandhi University of Knowledge Technologies in India, to cater to the educational needs of low-income, gifted, rural youth. He was the founding chairman of the International Institute of Information Technology, Hyderabad. In 1994 he became the first person of Asian origin to receive the Turing Award, sometimes called the Nobel Prize of computer science, for his work in the field of artificial intelligence.
The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Substantial contributions have also been provided by Carnegie Mellon University and other sites. It is distributed under a free software license similar to the BSD License.
Julius is a speech recognition engine, specifically a high-performance, two-pass large-vocabulary continuous speech recognition (LVCSR) decoder for speech-related researchers and developers. It can perform almost real-time decoding on most current personal computers (PCs) on a 60,000-word dictation task using a word trigram (3-gram) language model and context-dependent hidden Markov models (HMMs). Major search methods are fully incorporated.
As of the early 2000s, several speech recognition (SR) software packages existed for Linux, some of them free and open-source software and others proprietary. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.
Project LISTEN was a 25-year research project at Carnegie Mellon University to improve children's reading skills. The project created a computer-based Reading Tutor that listens to a child reading aloud, corrects errors, helps when the child is stuck or encounters a hard word, provides hints, assesses progress, and presents more advanced text when the child is ready. The Reading Tutor has been used daily by hundreds of children in field tests at schools in the United States, Canada, Ghana, and India. Thousands of hours of usage logged at multiple levels of detail, including millions of words read aloud, have been stored in a database that has been mined to improve the Tutor's interactions with students. An extensive list of publications can be found at Carnegie Mellon University.
The CMU Pronouncing Dictionary is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
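Each entry maps a word to its pronunciation as a sequence of ARPABET phones, with digits on the vowels marking lexical stress; a few entries in the published format (the words are chosen here purely as illustration):

```
HELLO      HH AH0 L OW1
SPEECH     S P IY1 CH
RECOGNIZE  R EH1 K AH0 G N AY2 Z
```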
A non-native speech database is a speech database of non-native pronunciations of English. Such databases are used in the development of: multilingual automatic speech recognition systems, text to speech systems, pronunciation trainers, and second language learning systems.
Edmund Melson Clarke, Jr. was an American computer scientist and academic noted for developing model checking, a method for formally verifying hardware and software designs. He was the FORE Systems Professor of Computer Science at Carnegie Mellon University. Clarke, along with E. Allen Emerson and Joseph Sifakis, received the 2007 ACM Turing Award.
Xuedong David Huang is a Chinese American computer scientist and technology executive who has made contributions to spoken language processing and artificial intelligence, including Azure AI Services. He is Zoom's chief technology officer, a role he took after 30 years at Microsoft, where he served as a Technical Fellow and Azure AI Chief Technology Officer. Huang is a strong advocate of AI for Accessibility and AI for Cultural Heritage.
LumenVox is a privately held speech recognition software company based in San Diego, California. LumenVox has been described as one of the market leaders in the speech recognition software industry.
RWTH ASR is a proprietary speech recognition toolkit.
Janus Recognition Toolkit (JRTk), sometimes referred to as Janus, is a general purpose speech recognition toolkit developed and maintained by the Interactive Systems Laboratories at Carnegie Mellon University and Karlsruhe Institute of Technology. It is useful for both research and application development and is part of the JANUS speech-to-speech translation system.
The following outline is provided as an overview of and topical guide to natural-language processing:
In signal processing, feature-space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR transforms acoustic features into speaker-adapted features by multiplying them with a transformation matrix. In some literature, fMLLR is also known as Constrained Maximum Likelihood Linear Regression (cMLLR).
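In the usual formulation (standard notation, not specific to any one toolkit), the adapted feature vector is an affine function of the original one:

$$
\hat{\mathbf{x}}_t = A\,\mathbf{x}_t + \mathbf{b} = W\,\boldsymbol{\xi}_t,
\qquad
\boldsymbol{\xi}_t = \begin{bmatrix} \mathbf{x}_t \\ 1 \end{bmatrix},
\quad
W = [\,A \;\; \mathbf{b}\,],
$$

where the single transform $W$ is estimated for each speaker by maximizing the likelihood of that speaker's adaptation data under the speaker-independent acoustic model. The name "constrained" reflects the fact that the same transform is applied to the model means and covariances, which is equivalent to transforming the features.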
Roni Rosenfeld is an Israeli-American computer scientist and computational epidemiologist, currently serving as the head of the Machine Learning Department at Carnegie Mellon University. He is an international expert in machine learning, infectious disease forecasting, statistical language modeling and artificial intelligence.
Mei-Yuh Hwang is a Taiwanese speech recognition researcher who works for Mobvoi in Redmond, Washington, and holds a position as affiliate professor of electrical and computer engineering at the University of Washington.
Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.