Microsoft Speech API

The Speech Application Programming Interface (SAPI) is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. A number of versions of the API have been released, shipping either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

In general, all versions of the API have been designed such that a software developer can write an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, a third-party company can produce its own Speech Recognition and Text-To-Speech engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces, they can be used instead of the Microsoft-supplied engines.

In general, the Speech API is a freely redistributable component which can be shipped with any Windows application that wishes to use speech technology. Many versions (although not all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5, however, was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.

Basic architecture

The Speech API can be viewed as an interface or piece of middleware which sits between applications and speech engines (recognition and synthesis). In SAPI versions 1 to 4, applications could directly communicate with engines. The API included an abstract interface definition which applications and engines conformed to. Applications could also use simplified higher-level objects rather than directly call methods on the engines.

In SAPI 5 however, applications and engines do not directly communicate with each other. Instead, each talks to a runtime component (sapi.dll). There is an API implemented by this component which applications use, and another set of interfaces for engines.

Typically in SAPI 5 applications issue calls through the API (for example to load a recognition grammar; start recognition; or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, the loading of grammar from a file is done in the runtime, but then the grammar data is passed to the recognition engine to actually use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime DLL, and on to an event sink in the application.
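
As an illustration of this flow, the following minimal C++ sketch (illustrative only; error handling is omitted and the spoken text is arbitrary) creates the runtime's SpVoice object, registers interest in word-boundary and end-of-stream events, starts asynchronous synthesis, and then drains the event queue as notifications come back from the engine through the runtime:

    // Minimal SAPI 5 synthesis sketch: the application talks only to the
    // sapi.dll runtime, which forwards the request to the synthesis engine
    // and routes the engine's events back to the application.
    // Build note (assumption): link against sapi.lib and ole32.lib.
    #include <windows.h>
    #include <sapi.h>
    #include <stdio.h>

    int main()
    {
        ::CoInitialize(NULL);

        ISpVoice *pVoice = NULL;
        // Create the runtime's voice object; SAPI picks the default synthesis engine.
        ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                           IID_ISpVoice, (void **)&pVoice);

        // Ask the runtime to signal a Win32 event and queue word-boundary and
        // end-of-stream notifications generated by the engine.
        pVoice->SetNotifyWin32Event();
        ULONGLONG interest = SPFEI(SPEI_WORD_BOUNDARY) | SPFEI(SPEI_END_INPUT_STREAM);
        pVoice->SetInterest(interest, interest);

        // Start asynchronous synthesis; control returns to the application immediately.
        pVoice->Speak(L"Hello from the Speech API", SPF_ASYNC, NULL);

        // Drain events until the engine reports the end of the output stream.
        BOOL done = FALSE;
        while (!done && pVoice->WaitForNotifyEvent(5000) == S_OK)
        {
            SPEVENT evt = { };
            ULONG fetched = 0;
            while (pVoice->GetEvents(1, &evt, &fetched) == S_OK && fetched == 1)
            {
                if (evt.eEventId == SPEI_WORD_BOUNDARY)
                    printf("word boundary at audio byte offset %llu\n",
                           (unsigned long long)evt.ullAudioStreamOffset);
                else if (evt.eEventId == SPEI_END_INPUT_STREAM)
                    done = TRUE;
            }
        }

        pVoice->Release();
        ::CoUninitialize();
        return 0;
    }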

In addition to the actual API definition and runtime DLL, other components ship with all versions of SAPI to make a complete Speech Software Development Kit, typically including redistributable recognition and synthesis engines, sample applications, and documentation.

Versions

Xuedong Huang led Microsoft's early SAPI efforts.

SAPI 1-4 API family

SAPI 1

The first version of SAPI was released in 1995, and was supported on Windows 95 and Windows NT 3.51. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs.

SAPI 3

SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample applications and audio sources.

SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included the core COM API, C++ wrapper classes to make programming from C++ easier, and ActiveX controls to allow drag-and-drop Visual Basic development. It was shipped as part of an SDK that included recognition and synthesis engines, and it also shipped (with synthesis engines only) in Windows 2000.

The main components of the SAPI 4 API (which were all available in C++, COM, and ActiveX flavors) were:

  • Voice Command - high-level objects for command & control speech recognition
  • Voice Dictation - high-level objects for continuous dictation speech recognition
  • Voice Talk - high-level objects for speech synthesis
  • Voice Telephony - objects for writing telephone speech applications
  • Direct Speech Recognition - objects for direct control of recognition engine
  • Direct Text To Speech - objects for direct control of synthesis engine
  • Audio objects - for reading to and from an audio device or file

SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed through the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition, this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages was added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include shared and in-process recognizer objects, recognition grammar objects, a voice object for synthesis, audio input and output objects, a user lexicon, and object tokens for enumerating installed engines and voices.
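
A minimal C++ sketch of using the shared recognizer for dictation is shown below (illustrative only; error handling is omitted, and a real application would normally keep the grammar active and process many recognition events rather than a single one):

    // Minimal SAPI 5 dictation sketch using the shared recognizer.
    // Illustrative only; error handling and cleanup on failure are omitted.
    #include <windows.h>
    #include <sapi.h>
    #include <stdio.h>

    int main()
    {
        ::CoInitialize(NULL);

        ISpRecognizer  *pRecognizer = NULL;
        ISpRecoContext *pContext    = NULL;
        ISpRecoGrammar *pGrammar    = NULL;

        // The shared recognizer runs outside the application process and uses
        // the default audio input configured for the system.
        ::CoCreateInstance(CLSID_SpSharedRecognizer, NULL, CLSCTX_ALL,
                           IID_ISpRecognizer, (void **)&pRecognizer);
        pRecognizer->CreateRecoContext(&pContext);

        // Receive a Win32 event when a recognition result is available.
        pContext->SetNotifyWin32Event();
        pContext->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

        // Load and activate the default dictation grammar; the runtime passes
        // the grammar data on to the recognition engine.
        pContext->CreateGrammar(1, &pGrammar);
        pGrammar->LoadDictation(NULL, SPLO_STATIC);
        pGrammar->SetDictationState(SPRS_ACTIVE);

        // Wait for a single recognition event and print the recognized text.
        if (pContext->WaitForNotifyEvent(INFINITE) == S_OK)
        {
            SPEVENT evt = { };
            ULONG fetched = 0;
            if (pContext->GetEvents(1, &evt, &fetched) == S_OK &&
                evt.eEventId == SPEI_RECOGNITION)
            {
                // For SPEI_RECOGNITION the queued lParam carries the result object.
                ISpRecoResult *pResult = (ISpRecoResult *)evt.lParam;
                WCHAR *pwszText = NULL;
                pResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                 TRUE, &pwszText, NULL);
                wprintf(L"Recognized: %s\n", pwszText);
                ::CoTaskMemFree(pwszText);
                pResult->Release();
            }
        }

        pGrammar->Release();
        pContext->Release();
        pRecognizer->Release();
        ::CoUninitialize();
        return 0;
    }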

SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified Chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.

SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines were shipped in Windows XP. Windows XP Tablet PC Edition and Office 2003 also include this version, but with a substantially improved version 6 recognition engine and support for Traditional Chinese.

SAPI 5.2

This was a special version of the API for use only in the Microsoft Speech Server which shipped in 2004. It added support for SRGS and SSML mark-up languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.

SAPI 5.3

This is the version of the API that ships in Windows Vista together with new recognition and synthesis engines. As Windows Speech Recognition is now integrated into the operating system, the Speech SDK and APIs are a part of the Windows SDK. SAPI 5.3 includes the following new features:

  • Support for W3C XML speech grammars for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation (a short sketch of speaking SSML through the API follows this list).
  • The Speech Recognition Grammar Specification (SRGS) supports the definition of context-free grammars, with two limitations:
    • It does not support the use of SRGS to specify dual-tone multi-frequency (DTMF, touch-tone) grammars.
    • It does not support Augmented Backus–Naur form (ABNF).
  • Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with JavaScript for semantic interpretation to supplement the recognized text.
  • User-specified shortcuts in lexicons: the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
  • Additional functionality and ease-of-programming provided by new types.
  • Performance improvements, improved reliability, and security.
  • Version 8 of the speech recognition engine ("Microsoft Speech Recognizer")
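
As an example of the SSML support referenced above, an application can pass SSML 1.0 markup to ISpVoice::Speak with an XML speak flag. This is a sketch only; the markup, prosody values and text are arbitrary, and the use of SPF_IS_XML to submit the markup is an assumption about the simplest flag choice rather than the only option:

    // Illustrative only: speaking SSML 1.0 markup through SAPI.
    // The prosody values and text are arbitrary examples.
    #include <windows.h>
    #include <sapi.h>

    int main()
    {
        ::CoInitialize(NULL);

        ISpVoice *pVoice = NULL;
        ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                           IID_ISpVoice, (void **)&pVoice);

        const WCHAR *ssml =
            L"<speak version=\"1.0\""
            L" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
            L"This is <emphasis>marked up</emphasis> speech, spoken "
            L"<prosody rate=\"slow\" volume=\"loud\">slowly and loudly</prosody>."
            L"</speak>";

        // SPF_IS_XML tells the runtime to parse the string as XML markup
        // rather than as plain text before handing it to the engine.
        pVoice->Speak(ssml, SPF_IS_XML, NULL);

        pVoice->Release();
        ::CoUninitialize();
        return 0;
    }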

SAPI 5.4

This is an updated version of the API that ships in Windows 7.

SAPI 5 Voices

Microsoft Sam (Speech Articulation Module) is a commonly shipped SAPI 5 voice. In addition, Microsoft Office XP and Office 2003 installed the L&H Michael and Michelle voices. The SAPI 5.1 SDK installs two more voices, Mike and Mary. Windows Vista includes Microsoft Anna, which replaces Microsoft Sam and sounds more natural and intelligible. It is also installed on Windows XP by Microsoft Streets & Trips 2006 and later versions. The Chinese version of Vista and later Windows client versions also include a female voice named Microsoft Lili.
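
The sketch below shows how an application might enumerate the installed SAPI 5 voices and select one of them. It is illustrative only: it relies on the SpEnumTokens and SpGetDescription helpers from sphelper.h in the Speech/Windows SDK, and the match on "Anna" is just an example selection rule.

    // Illustrative only: list installed SAPI 5 voices and pick one by name.
    #include <windows.h>
    #include <sapi.h>
    #include <sphelper.h>   // SpEnumTokens, SpGetDescription helpers
    #include <stdio.h>
    #include <wchar.h>

    int main()
    {
        ::CoInitialize(NULL);

        ISpVoice *pVoice = NULL;
        ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                           IID_ISpVoice, (void **)&pVoice);

        IEnumSpObjectTokens *pEnum = NULL;
        SpEnumTokens(SPCAT_VOICES, NULL, NULL, &pEnum);   // all installed voices

        ISpObjectToken *pToken = NULL;
        while (pEnum->Next(1, &pToken, NULL) == S_OK)
        {
            WCHAR *pszDescription = NULL;
            SpGetDescription(pToken, &pszDescription);    // e.g. "Microsoft Anna"
            wprintf(L"Installed voice: %s\n", pszDescription);

            if (wcsstr(pszDescription, L"Anna") != NULL)  // example selection rule
                pVoice->SetVoice(pToken);

            ::CoTaskMemFree(pszDescription);
            pToken->Release();
        }

        pVoice->Speak(L"Hello", SPF_DEFAULT, NULL);

        pEnum->Release();
        pVoice->Release();
        ::CoUninitialize();
        return 0;
    }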

Managed code Speech API

A managed code API ships as part of the .NET Framework 3.0. [1] It has similar functionality to SAPI 5 but is more suitable for use by managed code applications. The new API is available on Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008.

The existing SAPI 5 API can also be used from managed code, to a limited extent, by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios; however, the new API should provide a more seamless experience equivalent to using any other managed code library.

However, a major obstacle to transitioning from the COM Interop is the fact that the managed implementation has subtle memory leaks, which lead to memory fragmentation and preclude the use of the library in any non-trivial application. As a workaround, Microsoft has suggested using a different API, which has fewer voices. [2]

Speech functionality in Windows Vista

Windows Vista includes a number of new speech-related features, including Windows Speech Recognition integrated into the operating system, the SAPI 5.3 runtime with new recognition and synthesis engines, and the new Microsoft Anna text-to-speech voice.

Microsoft Agent, most notably, and all other Microsoft speech applications use SAPI 5.

Compatibility

The Speech API is compatible with the following operating systems: [3]

SAPI 5

Windows 98 and Windows NT 4.0 and later, including Windows 2000, Windows XP, Windows Server 2003, Windows Vista and Windows 7.

SAPI 4

Windows 95, Windows 98, Windows NT 4.0 and Windows 2000.

Major applications using SAPI

  • Microsoft Office (speech recognition and text-to-speech in Office XP and Office 2003)
  • Microsoft Agent
  • Microsoft Speech Server
  • Windows Speech Recognition in Windows Vista and later

References

  1. Michael Dunn. "Speech synthesis and recognition in .NET - Give applications a voice". Redmond Developer News. Archived 14 January 2010 at the Wayback Machine. Retrieved 2011-11-09.
  2. "System.Speech has a memory leak". Microsoft Connect. connect.microsoft.com. Retrieved 2013-09-27.
  3. Microsoft Corporation. "SAPI System Requirements". MSDN. Retrieved 2006-04-12.