Java Speech API

The Java Speech API (JSAPI) is an application programming interface for cross-platform support of command-and-control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines only an interface, several implementations have been created by third parties, for example FreeTTS.

Core technologies

Two core speech technologies are supported through the Java Speech API: speech synthesis and speech recognition.

Speech synthesis

Speech synthesis is the reverse of speech recognition: it produces synthetic speech from text generated by an application, an applet, or a user. It is often referred to as text-to-speech technology.

The major steps in producing speech from text are as follows:

Structure analysis: process the input text to determine where paragraphs, sentences, and other structures start and end, using punctuation and formatting cues.
Text pre-processing: analyze the input text for special constructs of the language, such as abbreviations, acronyms, dates, times, numbers, and currency amounts, and expand them into words.

The result of these first two steps is a spoken form of the written text. Here are examples of the differences between written and spoken text:

St. Matthew's hospital is on Main St. -> “Saint Matthew's hospital is on Main Street”
Add $20 to account 55374. -> “Add twenty dollars to account five five, three seven four.”

The remaining steps convert the spoken text to speech:

Text-to-phoneme conversion: convert each word to phonemes, the basic units of sound in a language.
Prosody analysis: process the sentence structure, words, and phonemes to determine the appropriate prosody (pitch, timing, pausing, and other speech characteristics) for the sentence.
Waveform production: use the phonemes and prosody information to produce the audio waveform.

Speech synthesizers can make errors in any of the processing steps described above. Human ears are well-tuned to detecting these errors, but careful work by developers can minimize errors and improve the speech output quality.

Speech recognition

Speech recognition provides computers with the ability to listen to spoken language and determine what has been said. In other words, it processes audio input containing speech by converting it to text.

The major steps of a typical speech recognizer are as follows:

Grammar design: define the words that may be spoken by the user and the patterns in which they may be spoken.
Signal processing: analyze the spectrum (frequency) characteristics of the incoming audio.
Phoneme recognition: compare the spectrum patterns to the phonemes of the language being recognized.
Word recognition: compare the sequence of likely phonemes against the words and word patterns specified by the active grammars.
Result generation: provide the application with information about the words the recognizer has detected in the incoming audio.

A grammar is an object in the Java Speech API that indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences.

The Java Speech API 1 supports two basic grammar types: rule grammars and dictation grammars. These types differ in various ways, including how applications set up the grammars; the types of sentences they allow; how results are provided; the amount of computational resources required; and how they are used in application design. Rule grammars are defined in JSAPI 1 by JSGF, the Java Speech Grammar Format.
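
As a rough illustration of a rule grammar in JSGF, the sketch below defines a small command-and-control grammar; the grammar name, rule names, and vocabulary are hypothetical and chosen only for this example.

    #JSGF V1.0;

    grammar commands;

    // A public rule the user can speak, composed of two private rules;
    // square brackets mark an optional word.
    public <command> = <action> <object> [please];

    <action> = open | close | delete;
    <object> = the window | the file | the menu;

A recognizer with this grammar enabled listens only for phrases such as "open the window" or "delete the file please", which is what makes rule-based recognition comparatively fast and accurate.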

The Java Speech API’s classes and interfaces

The different classes and interfaces that form the Java Speech API are grouped into the following three packages:

javax.speech: classes and interfaces for a generic speech engine
javax.speech.synthesis: classes and interfaces for speech synthesis
javax.speech.recognition: classes and interfaces for speech recognition

The EngineManager class acts as a factory used by all Java Speech API applications. It provides static methods that give access to speech synthesis and speech recognition engines. The Engine interface encapsulates the generic operations that a Java Speech API-compliant speech engine should provide to speech applications.

Speech applications primarily use its methods to perform actions such as retrieving the properties and state of the speech engine and allocating and deallocating the engine's resources. In addition, the Engine interface exposes mechanisms to pause and resume the audio stream generated or processed by the speech engine; the associated AudioManager can be used to manipulate that stream. The Engine interface is subclassed by the Synthesizer and Recognizer interfaces, which define additional speech synthesis and speech recognition functionality. The Synthesizer interface encapsulates the operations that a Java Speech API-compliant speech synthesis engine provides for speech applications.
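
As a minimal sketch of this engine lifecycle, the example below creates a synthesizer, allocates it, speaks a line of text, and deallocates it. It uses the JSAPI 1 classes, in which the factory role described above is played by javax.speech.Central rather than EngineManager, and it assumes that a JSAPI 1 implementation such as FreeTTS is available on the classpath; the class name and spoken text are arbitrary.

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    public class HelloSynthesis {
        public static void main(String[] args) throws Exception {
            // Ask the factory for a synthesizer that supports English.
            Synthesizer synth = Central.createSynthesizer(
                    new SynthesizerModeDesc(Locale.ENGLISH));

            synth.allocate();   // acquire the engine's resources
            synth.resume();     // leave the paused state so audio can be produced

            // Queue plain text for synthesis and wait until it has been spoken.
            synth.speakPlainText("Hello from the Java Speech API.", null);
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

            synth.deallocate(); // release the engine's resources
        }
    }

Which synthesizer is returned, if any, depends on the engines registered with the installed implementation.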

The Java Speech API is based on event handling. Events generated by the speech engine can be identified and handled as required. Speech events can be handled through the EngineListener interface, specifically through the RecognizerListener and the SynthesizerListener.
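
As a sketch of this event-driven style on the recognition side, the example below follows the pattern of the JSAPI 1 documentation: recognition results are delivered to a ResultListener (here via the ResultAdapter convenience class), a listener family closely related to the engine listeners named above. The class name, the grammar file name commands.gram, and the choice of locale are assumptions made for this example, and a JSAPI 1 recognition engine must be installed for it to run.

    import java.io.FileReader;
    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.EngineModeDesc;
    import javax.speech.recognition.Recognizer;
    import javax.speech.recognition.Result;
    import javax.speech.recognition.ResultAdapter;
    import javax.speech.recognition.ResultEvent;
    import javax.speech.recognition.ResultToken;
    import javax.speech.recognition.RuleGrammar;

    public class ListenForCommands extends ResultAdapter {

        // Called by the recognizer when an utterance matching an active grammar is accepted.
        public void resultAccepted(ResultEvent e) {
            Result result = (Result) e.getSource();
            StringBuilder spoken = new StringBuilder();
            for (ResultToken token : result.getBestTokens()) {
                spoken.append(token.getSpokenText()).append(' ');
            }
            System.out.println("Heard: " + spoken.toString().trim());
        }

        public static void main(String[] args) throws Exception {
            // Create and allocate a recognizer for English.
            Recognizer recognizer = Central.createRecognizer(
                    new EngineModeDesc(Locale.ENGLISH));
            recognizer.allocate();

            // Load a JSGF rule grammar (commands.gram is a hypothetical file)
            // and enable it so the recognizer listens for its rules.
            RuleGrammar grammar = recognizer.loadJSGF(new FileReader("commands.gram"));
            grammar.setEnabled(true);

            // Register for result events, commit the grammar changes, and start listening.
            recognizer.addResultListener(new ListenForCommands());
            recognizer.commitChanges();
            recognizer.requestFocus();
            recognizer.resume();
        }
    }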

The Java Speech API was written before the Java Community Process (JCP) existed and targeted the Java Platform, Standard Edition (Java SE). Subsequently, the Java Speech API 2 (JSAPI2) was created as JSR 113 under the JCP. JSAPI2 targets the Java Platform, Micro Edition (Java ME), but is also compatible with Java SE.

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

Java Platform, Standard Edition is a computing platform for development and deployment of portable code for desktop and server environments. Java SE was formerly known as Java 2 Platform, Standard Edition (J2SE).

In computing, the Java API for XML Processing, or JAXP, one of the Java XML application programming interfaces, provides the capability of validating and parsing XML documents. It has three basic parsing interfaces: the Document Object Model (DOM) interface, the Simple API for XML (SAX) interface, and the Streaming API for XML (StAX) interface.

VoiceXML (VXML) is a digital document standard for specifying interactive media and voice dialogs between humans and computers. It is used for developing audio and voice response applications, such as banking systems and automated customer service portals. VoiceXML applications are developed and deployed in a manner analogous to how a web browser interprets and visually renders the Hypertext Markup Language (HTML) it receives from a web server. VoiceXML documents are interpreted by a voice browser and in common deployment architectures, users interact with voice browsers via the public switched telephone network (PSTN).

PlainTalk is the collective name for several speech synthesis (MacinTalk) and speech recognition technologies developed by Apple Inc. In 1990, Apple invested a lot of work and money in speech recognition technology, hiring many researchers in the field. The result was "PlainTalk", released with the AV models in the Macintosh Quadra series from 1993. It was made a standard system component in System 7.1.2, and has since been shipped on all PowerPC and some 68k Macintoshes.

Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. It is a recommendation of the W3C's Voice Browser Working Group. SSML is often embedded in VoiceXML scripts to drive interactive telephony systems. However, it also may be used alone, such as for creating audio books. For desktop applications, other markup languages are popular, including Apple's embedded speech commands, and Microsoft's SAPI Text to speech (TTS) markup, also an XML language. It is also used to produce sounds via Azure Cognitive Services' Text to Speech API or when writing third-party skills for Google Assistant or Amazon Alexa.

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

Chinese speech synthesis is the application of speech synthesis to the Chinese language. It poses additional difficulties due to the Chinese characters, the complex prosody, which is essential to convey the meaning of words, and sometimes the difficulty in obtaining agreement among native speakers concerning what the correct pronunciation is of certain phonemes.

MBROLA is speech synthesis software as a worldwide collaborative project. The MBROLA project web page provides diphone databases for many spoken languages.

The Java Class Library (JCL) is a set of dynamically loadable libraries that Java Virtual Machine (JVM) languages can call at run time. Because the Java Platform is not dependent on a specific operating system, applications cannot rely on any of the platform-native libraries. Instead, the Java Platform provides a comprehensive set of standard class libraries, containing the functions common to modern operating systems.

Windows Speech Recognition (WSR) is speech recognition developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface; dictate text in electronic documents and email; navigate websites; perform keyboard shortcuts; and operate the mouse cursor. It supports custom macros to perform additional or supplementary tasks.

Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.

eSpeak is a free and open-source, cross-platform, compact, software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG is a continuation of the original developer's project with more feedback from native speakers.

In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The different sounds that can realize a particular phoneme are called its allophones.

The following outline is provided as an overview of and topical guide to natural-language processing.

Speech Services is a screen reader application developed by Google for its Android operating system. It powers applications to read aloud (speak) the text on the screen, with support for many languages. Text-to-Speech may be used by apps such as Google Play Books for reading books aloud, by Google Translate for reading translations aloud, providing useful insight into the pronunciation of words, by Google TalkBack and other spoken feedback accessibility-based applications, as well as by third-party apps. Users must install voice data for each language.
