CMU Sphinx

Last updated
Sphinx4
Stable release
5-prealpha / August 3, 2015;8 years ago (2015-08-03)
Written in Java
Operating system Cross-platform
Type Image library
License BSD-style [1]
Website cmusphinx.github.io/wiki/
Pocketsphinx
Stable release
5-prealpha / August 5, 2015;8 years ago (2015-08-05)
Written in C
Operating system Cross-platform
Type Image library
License BSD-style
Website cmusphinx.github.io/wiki/

CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2 - 4) and an acoustic model trainer (SphinxTrain).

Contents

In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001). The speech decoders come with acoustic models and sample applications. The available resources include in addition software for acoustic model training, language model compilation and a public domain pronunciation dictionary, cmudict.

Sphinx encompasses a number of software systems, described below.

Sphinx

Sphinx is a continuous-speech, speaker-independent recognition system making use of hidden Markov acoustic models (HMMs) and an n-gram statistical language model. It was developed by Kai-Fu Lee. Sphinx featured feasibility of continuous-speech, speaker-independent large-vocabulary recognition, the possibility of which was in dispute at the time (1986). Sphinx is of historical interest only; it has been superseded in performance by subsequent versions. An archival article [2] describes the system in detail.

Sphinx 2

A fast performance-oriented recognizer, originally developed by Xuedong Huang at Carnegie Mellon and released as open-source with a BSD-style license on SourceForge by Kevin Lenzo at LinuxWorld in 2000. Sphinx 2 focuses on real-time recognition suitable for spoken language applications. As such it incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching and so on. It is used in dialog systems and language learning systems. It can be used in computer based PBX systems such as Asterisk. Sphinx 2 code has also been incorporated into a number of commercial products. It is no longer under active development (other than for routine maintenance). Current real-time decoder development is taking place in the Pocket Sphinx project. An archival article [3] describes the system.

Sphinx 3

Sphinx 2 used a semi-continuous representation for acoustic modeling (i.e., a single set of Gaussians is used for all models, with individual models represented as a weight vector over these Gaussians). Sphinx 3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Recent developments (in algorithms and in hardware) have made Sphinx 3 "near" real-time, although not yet suitable for critical interactive applications. Sphinx 3 is under active development and in conjunction with SphinxTrain provides access to a number of modern modeling techniques, such as LDA/MLLT, MLLR and VTLN, that improve recognition accuracy (see the article on Speech Recognition for descriptions of these techniques).

Sphinx 4

Sphinx 4 is a complete rewrite of the Sphinx engine with the goal of providing a more flexible framework for research in speech recognition, written entirely in the Java programming language. Sun Microsystems supported the development of Sphinx 4 and contributed software engineering expertise to the project. Participants included individuals at MERL, MIT and CMU. (Currently supported languages are C, C++, C#, Python, Ruby, Java, and JavaScript.)

Current development goals include:

PocketSphinx

A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.

See also

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

In computing, cross-platform software is computer software that is designed to work in several computing platforms. Some cross-platform software requires a separate build for each platform, but some can be directly run on any platform without special preparation, being written in an interpreted language or compiled to portable bytecode for which the interpreters or run-time packages are common or standard components of all supported platforms.

<span class="mw-page-title-main">Raj Reddy</span> Indian-American computer scientist (born 1937)

Dabbala Rajagopal "Raj" Reddy is an Indian-American computer scientist and a winner of the Turing Award. He is one of the early pioneers of artificial intelligence and has served on the faculty of Stanford and Carnegie Mellon for over 50 years. He was the founding director of the Robotics Institute at Carnegie Mellon University. He was instrumental in helping to create Rajiv Gandhi University of Knowledge Technologies in India, to cater to the educational needs of the low-income, gifted, rural youth. He was the founding chairman of International Institute of Information Technology, Hyderabad. He is the first person of Asian origin to receive the Turing Award, in 1994, known as the Nobel Prize of Computer Science, for his work in the field of artificial intelligence.

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Substantial contributions have also been provided by Carnegie Mellon University and other sites. It is distributed under a free software license similar to the BSD License.

GRIB is a concise data format commonly used in meteorology to store historical and forecast weather data. It is standardized by the World Meteorological Organization's Commission for Basic Systems, known under number GRIB FM 92-IX, described in WMO Manual on Codes No.306. Currently there are three versions of GRIB. Version 0 was used to a limited extent by projects such as TOGA, and is no longer in operational use. The first edition is used operationally worldwide by most meteorological centers, for Numerical Weather Prediction output (NWP). A newer generation has been introduced, known as GRIB second edition, and data is slowly changing over to this format. Some of the second-generation GRIB is used for derived products distributed in the Eumetcast of Meteosat Second Generation. Another example is the NAM model.

Julius is a speech recognition engine, specifically a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. It can perform almost real-time computing (RTC) decoding on most current personal computers (PCs) in 60k word dictation task using word trigram (3-gram) and context-dependent Hidden Markov model (HMM). Major search methods are fully incorporated.

VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines.

As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.

Project LISTEN was a 25-year research project at Carnegie Mellon University to improve children's reading skills. Project LISTEN. The project created a computer-based Reading Tutor that listens to a child reading aloud, corrects errors, helps when the child is stuck or encounters a hard word, provides hints, assesses progress, and presents more advanced text when the child is ready. The Reading Tutor has been used daily by hundreds of children in field tests at schools in the United States, Canada, Ghana, and India. Thousands of hours of usage logged at multiple levels of detail, including millions of words read aloud, have been stored in a database that has been mined to improve the Tutor's interactions with students. An extensive list of publications can be found at Carnegie Mellon University.

The CMU Pronouncing Dictionary is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.

LumenVox is a privately held speech recognition software company based in San Diego, California. LumenVox has been described as one of the market leaders in the speech recognition software industry.

RWTH ASR is a proprietary speech recognition toolkit.

Janus Recognition Toolkit (JRTk), sometimes referred to as Janus, is a general purpose speech recognition toolkit developed and maintained by the Interactive Systems Laboratories at Carnegie Mellon University and Karlsruhe Institute of Technology. It is useful for both research and application development and is part of the JANUS speech-to-speech translation system.

The following outline is provided as an overview of and topical guide to natural-language processing:

In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that are typically applied in a speaker adaptive way, where fMLLR transforms acoustic features to speaker adapted features by a multiplication operation with a transformation matrix. In some literature, fMLLR is also known as the Constrained Maximum Likelihood Linear Regression (cMLLR).

OpenVINO is an open-source software toolkit for optimizing and deploying deep learning models. It enables programmers to develop scalable and efficient AI solutions with relatively few lines of code. It supports several popular model formats and categories, such as large language models, computer vision, and generative AI.

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.

References

  1. http://www.speech.cs.cmu.edu/sphinx
  2. lee_k_f_1990_1.pdf
  3. huang92sphinxii.pdf