Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. [1] The firm currently holds seven granted patents [2] [3] [4] [5] [6] [7] [8] and has three more pending for its automated methods of converting digital text into human-sounding speech, of recognizing human speech more accurately and outputting the text for the recognized words and phrases, and of recognizing the speaker's emotional state.
The LTI technology is partly based on the work of the late Arthur Lessac, a professor of theater at the State University of New York and the creator of Lessac Kinesensic Training; LTI has licensed exclusive rights to exploit Arthur Lessac's copyrighted works in the fields of speech synthesis and speech recognition. Based on the view that music is speech and speech is music, Lessac's work and books focused on body and speech energies and how they work together. Arthur Lessac's textual annotation system, originally developed to help actors, singers, and orators mark up scripts in preparation for performance, is adapted in LTI's speech synthesis system as the basic representation of the speech to be synthesized (Lessemes), in contrast to many other systems, which use a phonetic representation. [9] [10] [11]
LTI's software has two major components: (1) a linguistic front-end that converts plain text into a sequence of prosodic and phonosensory graphic symbols (Lessemes) based on Arthur Lessac's annotation system, which specify the speech units to be synthesized; (2) a signal-processing back-end that takes the Lesseme sequence as input and produces human-sounding synthesized speech as output, using unit selection and concatenation.
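To make the two-stage architecture concrete, here is a toy sketch of such a pipeline. LTI's actual front-end and back-end are proprietary, so every name below (text_to_lessemes, UNIT_DB, the "LESSEME_*" symbols) is an invented placeholder rather than LTI's API; the point is only the flow from text, to annotation symbols, to concatenated recorded units.

```python
# Toy sketch of the two-stage pipeline described above. All names here
# (text_to_lessemes, UNIT_DB, the "LESSEME_*" symbols) are invented
# placeholders; LTI's real front-end and back-end are proprietary.
import numpy as np

UNIT_DB = {  # hypothetical database of recorded speech units per symbol
    "LESSEME_A": np.random.randn(800),  # ~50 ms of audio at 16 kHz
    "LESSEME_B": np.random.randn(800),
}

def text_to_lessemes(text):
    """Front-end stand-in: plain text -> sequence of annotation symbols."""
    return ["LESSEME_A", "LESSEME_B"]   # placeholder linguistic analysis

def synthesize(lessemes):
    """Back-end stand-in: unit selection and concatenation per symbol."""
    return np.concatenate([UNIT_DB[s] for s in lessemes])

audio = synthesize(text_to_lessemes("Hello world"))  # raw waveform samples
```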
LTI's text-to-speech system placed second in the worldwide Blizzard Challenge in both 2011 and 2012; the first-place team in 2011 also employed LTI's "front-end" technology, but with its own back-end. [12] [13] The Blizzard Challenge, conducted by the Language Technologies Institute of Carnegie Mellon University, was devised as a way to evaluate speech synthesis techniques by having different research groups build voices from the same voice-actor recordings and comparing the results through listening tests.
LTI was founded in 2000 by H. Donald Wilson (chairman), a lawyer, LexisNexis entrepreneur and business associate of Arthur Lessac; and Gary A. Marple (chief inventor), after Marple suggested that Arthur Lessac's kinesensic voice training might be applicable to computational linguistics. After Wilson's death in 2006, his nephew John Reichenbach became the firm's CEO.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.
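As a minimal illustration of a TTS system in action, the snippet below uses the open-source pyttsx3 Python library, chosen here purely as one convenient example of such an engine:

```python
# A minimal text-to-speech call using the open-source pyttsx3 library,
# shown purely as one example of a TTS engine.
import pyttsx3

engine = pyttsx3.init()   # selects a platform backend (SAPI5, NSSS, espeak, ...)
engine.say("Speech synthesis turns text into audible speech.")
engine.runAndWait()       # block until the utterance has been spoken
```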
Interactive voice response (IVR) is a technology that allows humans to interact with a computer-operated phone system through the use of voice and DTMF tones input via a keypad. In telecommunications, IVR allows customers to interact with a company's host system via a telephone keypad or by speech recognition, after which services can be inquired about through the IVR dialogue. IVR systems can respond with pre-recorded or dynamically generated audio to further direct users on how to proceed. IVR systems deployed in the network are sized to handle large call volumes and are also used for outbound calling, since IVR systems are more intelligent than many predictive dialer systems.
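The dialogue half of an IVR system can be pictured as a small keypad-driven state machine. The sketch below is a deliberately simplified toy; the prompts and menu options are invented for illustration, and no telephony stack is attached:

```python
# A deliberately simplified IVR menu as a keypad-driven lookup; the prompts
# and options are invented for illustration, with no telephony stack attached.
MENU = {
    "1": "You pressed 1: account balance.",
    "2": "You pressed 2: connecting you to an agent.",
}

def prompt():
    return "Press 1 for your balance, or 2 to speak to an agent."

def handle_dtmf(digit):
    # In a real deployment the reply would be played back as pre-recorded
    # or synthesized audio rather than printed.
    return MENU.get(digit, "Invalid option. " + prompt())

print(prompt())
print(handle_dtmf("1"))
print(handle_dtmf("9"))
```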
PlainTalk is the collective name for several speech synthesis (MacinTalk) and speech recognition technologies developed by Apple Inc. In 1990, Apple invested considerable effort and money in speech recognition technology, hiring many researchers in the field. The result was "PlainTalk", released with the AV models in the Macintosh Quadra series in 1993. It was made a standard system component in System 7.1.2, and has since shipped on all PowerPC and some 68k Macintoshes.
SpeechFX, Inc., offers voice technology for mobile phones and wireless devices, interactive video games, toys, home appliances, computer telephony systems and vehicle telematics. Fonix speech solutions are based on the firm's proprietary neural-network-based automatic speech recognition (ASR) and Fonix DECtalk, a text-to-speech (TTS) synthesis system. Fonix speech technology is speaker-independent, meaning no voice training is required.
Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. It is a recommendation of the W3C's Voice Browser Working Group. SSML is often embedded in VoiceXML scripts to drive interactive telephony systems. However, it also may be used alone, such as for creating audio books. For desktop applications, other markup languages are popular, including Apple's embedded speech commands, and Microsoft's SAPI text-to-speech (TTS) markup, also an XML language. It is also used to produce sounds via Azure Cognitive Services' Text to Speech API or when writing third-party skills for Google Assistant or Amazon Alexa.
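The fragment below shows what SSML markup looks like and, as one possible integration, sends it to a cloud TTS service (Amazon Polly via boto3). The voice name and the assumption that AWS credentials are already configured are details of this example, not of the SSML standard:

```python
# An SSML fragment and one possible way to render it, via Amazon Polly's
# synthesize_speech call in boto3. The voice name and the assumption that
# AWS credentials are configured are details of this example, not of SSML.
import boto3

ssml = """<speak>
  Hello! <break time="300ms"/>
  <prosody rate="slow" pitch="+5%">This sentence is slower and higher.</prosody>
</speak>"""

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=ssml, TextType="ssml", OutputFormat="mp3", VoiceId="Joanna")
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```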
The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.
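Because SAPI is exposed as a COM interface, it can be driven from languages other than C++ as well; the snippet below is a minimal example from Python on Windows, assuming the pywin32 package is installed:

```python
# Driving SAPI from Python through its COM interface (Windows only,
# pywin32 package required).
import win32com.client

voice = win32com.client.Dispatch("SAPI.SpVoice")
voice.Speak("This sentence is spoken by the default SAPI voice.")
```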
A voice font is a computer-generated voice that can be controlled by specifying parameters such as speed and pitch and made to pronounce text input. The concept is akin to that of a text font or a MIDI instrument in the sense that the same input may easily be represented in several different ways based on the design of each font. In spite of current shortcomings in the underlying technology for voice fonts, screen readers and other devices used to make text accessible to persons with disabilities can benefit from having more than one voice font, in the same way that users of a traditional word processor benefit from having more than one text font.
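The sketch below illustrates the voice-font idea using the pyttsx3 library, whose "rate" and "voice" properties correspond to the speed and voice-selection parameters described above; which voices are installed depends on the host system:

```python
# Rendering the same text with different "voice fonts" and parameters via
# pyttsx3; the set of installed voices depends on the host system.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 120)                # slower speaking rate
voices = engine.getProperty("voices")          # voices installed on this system
engine.say("The same text can be rendered in several different voices.")
if len(voices) > 1:
    engine.setProperty("voice", voices[1].id)  # switch to another voice
    engine.say("The same text can be rendered in several different voices.")
engine.runAndWait()
```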
Chinese speech synthesis is the application of speech synthesis to the Chinese language. It poses additional difficulties due to the Chinese characters, the complex prosody that is essential to conveying word meaning, and the occasional difficulty of obtaining agreement among native speakers about the correct pronunciation of certain phonemes.
Henry Donald Wilson, generally referred to as H. Donald Wilson, was a database pioneer and entrepreneur. He was also the first president and one of the principal creators of the Lexis legal information system, and Nexis. An attorney by training who became an information industry innovator and a venture capital consultant to numerous businesses, Wilson was also an internationalist and a conservationist. At the time of his death, he was chairman of Lessac Technologies Inc., a text-to-voice software venture based on nearly fifty years of partnership with Arthur Lessac.
The Java Speech API (JSAPI) is an application programming interface for cross-platform support of command and control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines only an interface, there are several implementations created by third parties, for example FreeTTS.
The CMU Pronouncing Dictionary is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
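Entries in the dictionary map words to ARPABET phoneme sequences, with stress digits attached to vowels. One quick way to inspect it is through NLTK's packaged copy (after a one-time nltk.download('cmudict')); the output shown in the comment is indicative:

```python
# Looking up ARPABET pronunciations in the CMU Pronouncing Dictionary via
# NLTK's packaged copy; run nltk.download('cmudict') once beforehand.
from nltk.corpus import cmudict

d = cmudict.dict()
print(d["hello"])  # e.g. [['HH', 'AH0', 'L', 'OW1'], ['HH', 'EH0', 'L', 'OW1']]
```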
A spoken dialog system is a computer system able to converse with a human by voice. It has two essential components that do not exist in a written text dialog system: a speech recognizer and a text-to-speech module. It can be further distinguished from command-and-control speech systems, which can respond to requests but do not attempt to maintain continuity over time.
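A skeletal version of such a system pairs any recognizer with any TTS module. The sketch below uses the speech_recognition and pyttsx3 packages as stand-ins for the two essential components; recognize_google requires internet access, and the echoing "dialog policy" is a hypothetical placeholder:

```python
# Skeletal spoken dialog turn: recognizer in, TTS out. speech_recognition
# and pyttsx3 stand in for the two essential components; recognize_google
# needs internet access, and the echo "policy" is a hypothetical placeholder.
import speech_recognition as sr
import pyttsx3

recognizer, tts = sr.Recognizer(), pyttsx3.init()

def reply(utterance):
    return f"You said: {utterance}"   # placeholder dialog policy

with sr.Microphone() as mic:
    audio = recognizer.listen(mic)        # speech recognizer: audio in
text = recognizer.recognize_google(audio) # -> transcribed text
tts.say(reply(text))                      # text-to-speech module: audio out
tts.runAndWait()
```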
eSpeakNG is a compact, open-source, software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size. Much of the programming for eSpeakNG's language support is done using rule files with feedback from native speakers.
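eSpeakNG is typically driven from its command line, so calling it from a script is a one-liner. The -v (voice/language), -s (speed) and -p (pitch) flags are standard espeak-ng options, and the binary is assumed to be on PATH:

```python
# Invoking the espeak-ng command-line synthesizer; -v selects the voice or
# language, -s the speed in words per minute, -p the pitch. Assumes the
# espeak-ng binary is on PATH.
import subprocess

subprocess.run(["espeak-ng", "-v", "en", "-s", "150", "-p", "40",
                "Formant synthesis keeps eSpeakNG small."], check=True)
```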
SVOX is an embedded speech technology company founded in 2000 and headquartered in Zurich, Switzerland. SVOX was acquired by Nuance Communications in 2011. The company's products included Automated Speech Recognition (ASR), Text-to-Speech (TTS) and Speech Dialog systems, with customers mostly being manufacturers and system integrators in automotive and mobile device industries.
CereProc is a speech synthesis company based in Edinburgh, Scotland, founded in 2005. The company specialises in creating natural and expressive-sounding text to speech voices, synthesis voices with regional accents, and in voice cloning.
The Medical Intelligence and Language Engineering Laboratory, also known as MILE lab, is a research laboratory at the Indian Institute of Science, Bangalore under the Department of Electrical Engineering. The lab is known for its work on image processing, online handwriting recognition, text-to-speech and optical character recognition systems, all of which are focused mainly on documents and speech in Indian languages. The lab is headed by A. G. Ramakrishnan.
NeoSpeech is a company that specializes in text-to-speech (TTS) software for embedded devices, mobile, desktop, and network/server applications. NeoSpeech was founded by two speech engineers in Fremont, California, US, in 2002. NeoSpeech is privately held, headquartered in Santa Clara, California.
Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.
WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.
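WaveNet's published building block is the dilated causal convolution: each layer looks only at past samples, and doubling the dilation at every layer grows the receptive field exponentially. Below is a minimal numpy sketch of that one idea; the full model additionally uses gated activations, residual and skip connections, and a softmax over quantized sample values, none of which are shown here:

```python
# Minimal numpy sketch of a dilated causal convolution stack, the building
# block behind WaveNet's direct waveform modelling. This is an illustration
# of the idea only, not DeepMind's implementation.
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])   # left-pad: no access to the future
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.random.randn(32)        # stand-in raw audio samples
out = x
for d in (1, 2, 4, 8):         # doubling dilations widen the receptive field
    out = np.tanh(causal_dilated_conv(out, np.array([0.6, 0.4]), d))
```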
15.ai is a non-commercial freeware artificial intelligence web application that generates natural, emotive, high-fidelity text-to-speech voices for an assortment of fictional characters from a variety of media sources. Developed by an anonymous MIT researcher under the eponymous pseudonym 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real time, even for characters with only a very small amount of training data.