Cepstral (company)

Cepstral LLC
Industry: Speech synthesis
Founded: June 2000
Founders: Kevin Lenzo, Alan W. Black
Headquarters: Pittsburgh, Pennsylvania
Products: Cepstral Text-To-Speech
Website: http://www.cepstral.com

Cepstral is a provider of speech synthesis technology and services. It was founded in June 2000 by scientists from Carnegie Mellon University including the computer scientists Kevin Lenzo and Alan W. Black. It is a privately held corporation with headquarters in Pittsburgh, Pennsylvania.

The company primarily produces synthetic voices for use in telephony systems,[1] mobile applications, desktop applications, and with other TTS software such as the open-source Festival system.[2]

Related Research Articles

Vocoder: Voice encryption, transformation, and synthesis device

A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Substantial contributions have also been provided by Carnegie Mellon University and other sites. It is distributed under a free software license similar to the BSD License.
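As an illustrative sketch (not drawn from the source article): Festival is commonly driven through its built-in Scheme command interpreter. Assuming a default installation with the standard KAL diphone voice available, a session might look like the following; the exact voice name depends on which voices are installed.

```scheme
;; Hypothetical Festival interpreter session (assumes a default
;; install; voice availability varies between installations).
(voice_kal_diphone)               ; select the KAL diphone voice
(SayText "Hello from Festival.")  ; synthesize and play one sentence
```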

PlainTalk is the collective name for several speech synthesis (MacinTalk) and speech recognition technologies developed by Apple Inc. In 1990, Apple invested a lot of work and money in speech recognition technology, hiring many researchers in the field. The result was "PlainTalk", released with the AV models in the Macintosh Quadra series from 1993. It was made a standard system component in System 7.1.2, and has since been shipped on all PowerPC and some 68k Macintoshes.

SpeechFX

SpeechFX, Inc., offers voice technology for mobile phone and wireless devices, interactive video games, toys, home appliances, computer telephony systems and vehicle telematics. SpeechFX speech solutions are based on the firm’s proprietary neural network-based automatic speech recognition (ASR) and Fonix DECtalk, a text-to-speech speech synthesis system (TTS). Fonix speech technology is user-independent, meaning no voice training is involved.

Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. It is a recommendation of the W3C's Voice Browser Working Group. SSML is often embedded in VoiceXML scripts to drive interactive telephony systems. However, it also may be used alone, such as for creating audio books. For desktop applications, other markup languages are popular, including Apple's embedded speech commands, and Microsoft's SAPI Text to speech (TTS) markup, also an XML language. It is also used to produce sounds via Azure Cognitive Services' Text to Speech API or when writing third-party skills for Google Assistant or Amazon Alexa.
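A minimal SSML sketch may help illustrate the markup described above. This is an assumed example, not taken from the article; the permitted values of attributes such as interpret-as are partly platform-dependent, so a given synthesizer may support only a subset.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- say-as controls how tokens are read; values vary by platform -->
  Your order number is
  <say-as interpret-as="digits">31415</say-as>.
  <!-- break and prosody control pacing and delivery -->
  <break time="500ms"/>
  <prosody rate="slow">Thank you for calling.</prosody>
</speak>
```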

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

DECtalk: Speech synthesizer and text-to-speech technology

DECtalk was a speech synthesizer and text-to-speech technology developed by Digital Equipment Corporation in 1983, based largely on the work of Dennis Klatt at MIT, whose source-filter algorithm was variously known as KlattTalk or MITalk.

Chinese speech synthesis is the application of speech synthesis to the Chinese language. It poses additional difficulties due to Chinese characters, the complex prosody that is essential to conveying the meaning of words, and the occasional difficulty of obtaining agreement among native speakers on the correct pronunciation of certain phonemes.

Kevin Lenzo is an American computer scientist. He wrote the initial infobot, founded The Perl Foundation and the Yet Another Perl Conferences (YAPC), released CMU Sphinx as open source, founded Cepstral LLC, and has been a major contributor to the Festival Speech Synthesis System, FestVox, and Flite. His voice is the basis for a number of synthetic voices, including FreeTTS, Flite, and the cmu_us_kal_diphone Festival voice. He has also contributed Perl modules to CPAN. Lenzo was also a founding member of the 1980s funk band "Leftover Funk".

Alan W. Black: Scottish computer scientist

Alan W Black is a Scottish computer scientist, known for his research on speech synthesis. He is a professor in the Language Technologies Institute at Carnegie Mellon University in Pittsburgh, Pennsylvania.

The Microsoft text-to-speech voices are speech synthesizers provided for use with applications that use the Microsoft Speech API (SAPI) or the Microsoft Speech Server Platform. There are client, server, and mobile versions of Microsoft text-to-speech voices. Client voices are shipped with Windows operating systems; server voices are available for download for use with server applications such as Speech Server, Lync etc. for both Windows client and server platforms, and mobile voices are often shipped with more recent versions.

Voice Elements is a Microsoft .NET development environment for building automated telephone systems. Voice Elements was released by Inventive Labs Corporation in 2008, based on their original CTI32 toolkit. Software developers who use C#, VB.NET or Delphi use Voice Elements to write telephony-based applications, such as Interactive Voice Response systems, voice dialers, auto attendants, call centers and more.

CereProc: Speech synthesis company

CereProc is a speech synthesis company based in Edinburgh, Scotland, founded in 2005. The company specialises in creating natural and expressive-sounding text-to-speech voices, synthesis voices with regional accents, and voice cloning.

Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. The firm currently has seven patents granted and three more pending for its automated methods of converting digital text into human-sounding speech, more accurately recognizing human speech and outputting the text representing the words and phrases of said speech, along with recognizing the speaker's emotional state.

NeoSpeech is a company that specializes in text-to-speech (TTS) software for embedded devices, mobile, desktop, and network/server applications. NeoSpeech was founded by two speech engineers in Fremont, California, US, in 2002. NeoSpeech is privately held, headquartered in Santa Clara, California.

WaveNet: Deep neural network for generating raw audio

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Multichannel Speaking Automaton

MUSA (MUltichannel Speaking Automaton) was an early prototype speech synthesis machine, with development beginning in 1975.

15.ai: Real-time text-to-speech tool using artificial intelligence

15.ai is a non-commercial freeware artificial intelligence web application that generates natural, emotive, high-fidelity text-to-speech voices for an assortment of fictional characters from a variety of media sources. Developed by an anonymous MIT researcher under the eponymous pseudonym 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real time, including for characters with very small amounts of trainable data.

Richard Sproat is a computational linguist currently working for Google as a researcher on text normalization and speech recognition.

References

  1. Adam Boretz. "Voice Elements Integrates Cepstral TTS". SpeechTechMag.com. Retrieved November 6, 2011.
  2. "Cepstral Text-to-Speech". Cepstral.com. Archived from the original on October 18, 2006. Retrieved November 6, 2011.