Developer(s) | Microsoft
---|---
Initial release | November 30, 2006
Operating system | Windows Vista and later
Type | Speech recognition
Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface, dictate text in electronic documents and email, navigate websites, perform keyboard shortcuts, and operate the mouse cursor. It supports custom macros to perform additional or supplementary tasks.
WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition, but adapts based on contexts, grammars, speech samples, training sessions, and vocabularies. It provides a personal dictionary that allows users to include or exclude words or expressions from dictation and to record pronunciations to increase recognition accuracy. Custom language models are also supported.
WSR was developed as an integrated component of Windows beginning with Windows Vista; speech recognition had previously been confined to individual applications such as Windows Media Player. It is also present in Windows 7, Windows 8, Windows 8.1, Windows RT, Windows 10, and Windows 11. The sound labeled "Startup" in Windows Vista beta builds is actually the startup sound of the speech recognition tutorial; the beta otherwise used Windows XP sounds. [1] [2]
Microsoft was involved in speech recognition and speech synthesis research for many years before WSR. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts; the company's research led to the development of the Speech API (SAPI), introduced in 1994. [3] Speech recognition had also been used in earlier Microsoft products. Office XP and Office 2003 provided speech recognition capabilities within Microsoft Office applications and Internet Explorer; [4] this component also enabled limited speech functionality in Windows 98, Windows Me, Windows NT 4.0, and Windows 2000. [5] Windows XP Tablet PC Edition 2002 included speech recognition capabilities with the Tablet PC Input Panel, [6] [7] and Microsoft Plus! for Windows XP enabled voice commands for Windows Media Player. [8] However, all of these required installation of speech recognition as a separate component; before Windows Vista, Windows did not include integrated or extensive speech recognition. [7] Office 2007 and later versions rely on WSR for speech recognition services. [9]
At WinHEC 2002 Microsoft announced that Windows Vista (codenamed "Longhorn") would include advances in speech recognition and in features such as microphone array support [10] as part of an effort to "provide a consistent quality audio infrastructure for natural (continuous) speech recognition and (discrete) command and control." [11] Bill Gates stated during PDC 2003 that Microsoft would "build speech capabilities into the system — a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time"; [12] [13] and pre-release builds during the development of Windows Vista included a speech engine with training features. [14] A PDC 2003 developer presentation stated Windows Vista would also include a user interface for microphone feedback and control, and user configuration and training features. [15] Microsoft clarified the extent to which speech recognition would be integrated when it stated in a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide." [16]
During WinHEC 2004 Microsoft included WSR as part of a strategy to improve productivity on mobile PCs. [17] [18] Microsoft later emphasized accessibility, new mobility scenarios, support for additional languages, and improvements to the speech user experience at WinHEC 2005. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between separate Commanding and Dictation modes, Windows Vista would introduce a dedicated interface for speech input on the desktop and would unify the separate speech modes; [19] users previously could not speak a command after dictating or vice versa without first switching between these two modes. [20] Windows Vista Beta 1 included integrated speech recognition. [21] To incentivize company employees to analyze WSR for software glitches and to provide feedback, Microsoft offered an opportunity for its testers to win a Premium model of the Xbox 360. [22]
During a demonstration by Microsoft on July 27, 2006, before Windows Vista's release to manufacturing (RTM), a notable incident occurred in which WSR produced the unintended output "Dear aunt, let's set so double the killer delete select all" after several dictation attempts led to consecutive recognition errors; [23] [24] the incident drew significant derision from analysts and journalists in the audience, [25] [26] although another demonstration of application management and navigation was successful. [23] Microsoft revealed these issues were due to an audio gain glitch that caused the recognizer to distort commands and dictation; the glitch was fixed before Windows Vista's release. [27]
Reports from early 2007 indicated that WSR was vulnerable to attackers using speech recognition to perform malicious operations by playing certain audio commands through a target's speakers; [28] [29] it was the first vulnerability discovered after Windows Vista's general availability. [30] Microsoft stated that although such an attack is theoretically possible, a number of mitigating factors and prerequisites would limit its effectiveness or prevent it altogether: the target's recognizer would need to be active and configured to properly interpret the commands; microphones and speakers would both need to be enabled and set to sufficient volume levels; and the attack would require the computer to perform visible operations and produce audible feedback without the user noticing. User Account Control would also prevent privileged operations from being performed. [31]
In Windows 7, WSR was updated to use Microsoft UI Automation, and its engine now uses the WASAPI audio stack, substantially enhancing its performance and enabling support for echo cancellation, respectively. The document harvester, which can analyze and collect text in email and documents to contextualize user terms, has improved performance and now runs periodically in the background instead of only after recognizer startup. Sleep mode has also seen performance improvements, and, to address security issues, the recognizer is turned off by default after users speak "stop listening" instead of being suspended. Windows 7 also introduces an option to submit speech training data to Microsoft to improve future recognizer versions. [32]
A new dictation scratchpad interface functions as a temporary document into which users can dictate or type text for insertion into applications that are not compatible with the Text Services Framework. [32] Windows Vista previously provided an "enable dictation everywhere" option for such applications. [33]
WSR can be used to control the Metro user interface in Windows 8, Windows 8.1, and Windows RT with commands to open the Charms bar ("Press Windows C"); to dictate or display commands in Metro-style apps ("Press Windows Z"); to perform tasks in apps (e.g., "Change to Celsius" in MSN Weather); and to display all installed apps listed by the Start screen ("Apps"). [34] [35]
WSR settings appear in the Settings app starting with the Windows 10 April 2018 Update (version 1803); the change first appeared in Insider Preview build 17083. [36] The April 2018 Update also introduces a new ⊞ Win+Ctrl+S keyboard shortcut to activate WSR. [37]
In Windows 11 version 22H2, a second Microsoft speech app, Voice Access, was added alongside WSR. [38] [39] In December 2023, Microsoft announced that WSR is deprecated in favor of Voice Access and may be removed in a future build or release of Windows. [40]
WSR allows a user to control applications and the Windows desktop user interface through voice commands. [41] Users can dictate text within documents, email, and forms; control the operating system user interface; perform keyboard shortcuts; and move the mouse cursor. [42] The majority of integrated applications in Windows Vista can be controlled; [41] third-party applications must support the Text Services Framework for dictation. [3] English (U.S.), English (U.K.), French, German, Japanese, Mandarin Chinese, and Spanish are supported languages. [43]
When started for the first time, WSR presents a microphone setup wizard and an optional interactive step-by-step tutorial that users can take to learn basic commands while adapting the recognizer to their specific voice characteristics; [41] the tutorial is estimated to require approximately 10 minutes to complete. [44] The accuracy of the recognizer increases through regular use, which adapts it to contexts, grammars, patterns, and vocabularies. [43] [45] Custom language models for the specific contexts, phonetics, and terminologies of users in particular occupational fields, such as legal or medical, are also supported. [46] With Windows Search, [47] the recognizer can also optionally harvest text from documents and email, as well as handwritten tablet PC input, to contextualize and disambiguate terms and improve accuracy; no information is sent to Microsoft. [45]
WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition. [48] Speech profiles that store information about users are retained locally. [45] Backups and transfers of profiles can be performed via Windows Easy Transfer. [49]
The WSR interface consists of a status area that displays instructions, information about commands (e.g., if a command is not heard by the recognizer), and the status of the recognizer; a voice meter displays visual feedback about volume levels. The status area represents the current state of WSR in one of three modes: listening, sleeping, and off.
Colors of the recognizer listening mode button denote its various modes of operation: blue when listening; blue-gray when sleeping; gray when turned off; and yellow when the user switches context (e.g., from the desktop to the taskbar) or when a voice command is misinterpreted. The status area can also display custom user information as part of Windows Speech Recognition Macros. [50] [51]
An alternates panel disambiguation interface lists items interpreted as being relevant to a user's spoken word(s); if the word or phrase that a user wanted to insert into an application is listed among the results, the user can speak its corresponding number and confirm the choice by speaking "OK" to insert it into the application. [52] The alternates panel also appears when launching applications or speaking commands that refer to more than one item (e.g., speaking "Start Internet Explorer" may list both the web browser and a separate version with add-ons disabled). An ExactMatchOverPartialMatch entry in the Windows Registry can limit commands to items with exact names if there is more than one instance included in results. [53]
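A minimal sketch of how such a registry tweak might be applied with the Win32 registry API is shown below. The value name ExactMatchOverPartialMatch comes from the paragraph above; the key path used here is an assumption for illustration only, as the exact location of the value is not specified in this article.

```cpp
// Illustrative sketch: create or set the ExactMatchOverPartialMatch DWORD value.
// NOTE: the key path below is a hypothetical per-user location used only for
// illustration; consult the cited reference for the actual key used by WSR.
#include <windows.h>
#include <iostream>

int main() {
    HKEY key = nullptr;
    const wchar_t* kKeyPath = L"Software\\Microsoft\\Speech\\Preferences";  // assumed path

    LONG rc = RegCreateKeyExW(HKEY_CURRENT_USER, kKeyPath, 0, nullptr,
                              REG_OPTION_NON_VOLATILE, KEY_SET_VALUE,
                              nullptr, &key, nullptr);
    if (rc != ERROR_SUCCESS) {
        std::cerr << "RegCreateKeyExW failed: " << rc << "\n";
        return 1;
    }

    DWORD value = 1;  // 1 = prefer exact name matches over partial matches
    rc = RegSetValueExW(key, L"ExactMatchOverPartialMatch", 0, REG_DWORD,
                        reinterpret_cast<const BYTE*>(&value), sizeof(value));
    RegCloseKey(key);
    return rc == ERROR_SUCCESS ? 0 : 1;
}
```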
Many WSR commands contain substitutable words (e.g., "direction" in "scroll direction" can be replaced with "down"). [42] A "start typing" command enables WSR to interpret all dictation commands as keyboard shortcuts. [52]
MouseGrid enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions gradually narrow as a user speaks the number(s) of the region on which to focus until the desired interface element is reached. Users can then issue commands including "Click number of region," which moves the mouse cursor to the desired region and then clicks it; and "Mark number of region", which allows an item (such as a computer icon) in a region to be selected, which can then be clicked with the previous click command. Users also can interact with multiple regions at once. [42]
Applications and interface elements that do not present identifiable commands can still be controlled by asking the system to overlay numbers on top of them through a Show Numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. [42] Show Numbers was designed so that users could interact with items that are not readily identifiable. [55]
WSR enables dictation of text in applications and Windows. If a dictation mistake occurs, it can be corrected by speaking "Correct word" or "Correct that," after which the alternates panel appears and provides suggestions for correction; a suggestion can be selected by speaking its number followed by "OK." If the desired item is not listed among the suggestions, a user can speak it so that it might appear. Alternatively, users can speak "Spell it" or "I'll spell it myself" to speak the desired word on a letter-by-letter basis; users can use their personal alphabet or the NATO phonetic alphabet (e.g., "N as in November") when spelling. [46]
Multiple words in a sentence can be corrected simultaneously; for example, if a user speaks "dictating" but the recognizer interprets it as "the thing," the user can say "correct the thing" to correct both words at once. Over 100,000 English words are recognized by default. [46]
A personal dictionary allows users to include or exclude certain words or expressions from dictation. [46] When a user adds a word beginning with a capital letter to the dictionary, the user can specify whether it should always be capitalized or whether capitalization depends on the context in which the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy; words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Information stored in the dictionary is included as part of a user's speech profile. [45] Users can open the speech dictionary by speaking the "show speech dictionary" command.
WSR supports custom macros through a supplementary application by Microsoft that enables additional natural language commands. [56] [57] As an example of this functionality, an email macro released by Microsoft enables a natural language command where a user can speak "send email to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted. [58] Microsoft has also released sample macros for the speech dictionary, [59] for Windows Media Player, [60] for Microsoft PowerPoint, [61] for speech synthesis, [62] to switch between multiple microphones, [63] to customize various aspects of audio device configuration such as volume levels, [64] and for general natural language queries such as "What is the weather forecast?" [65] "What time is it?" [62] and "What's the date?" [62] Responses to these user inquiries are spoken back to the user in the active Microsoft text-to-speech voice installed on the machine.
Application or item | Sample macro phrases (names, numbers, and other specifics are substitutable)
---|---
Microsoft Outlook | "Send email"; "Send email to"; "Send email to Makoto"; "Send email to Makoto Yamagishi"; "Send email to Makoto Yamagishi about"; "Send email to Makoto Yamagishi about This week's meeting"; "Refresh Outlook email contacts"
Microsoft PowerPoint | "Next slide"; "Previous slide"; "Next"; "Previous"; "Go forward 5 slides"; "Go back 3 slides"; "Go to slide 8"
Windows Media Player | "Next track"; "Previous song"; "Play Beethoven"; "Play something by Mozart"; "Play the CD that has In the Hall of the Mountain King"; "Play something written in 1930"; "Pause music"
Microphones in Windows | "Microphone"; "Switch microphone"; "Microphone Array microphone"; "Switch to Line"; "Switch to Microphone Array"; "Switch to Line microphone"; "Switch to Microphone Array microphone"
Volume levels in Windows | "Mute the speakers"; "Unmute the speakers"; "Turn off the audio"; "Increase the volume"; "Increase the volume by 2 times"; "Decrease the volume by 50"; "Set the volume to 66"
WSR Speech Dictionary | "Export the speech dictionary"; "Add a pronunciation"; "Add that [selected text] to the speech dictionary"; "Block that [selected text] from the speech dictionary"; "Remove that [selected text]"; "[Selected text] sounds like..."; "What does that [selected text] sound like?"
Speech Synthesis | "Read that [selected text]"; "Read the next 3 paragraphs"; "Read the previous sentence"; "Please stop reading"; "What time is it?"; "What's today's date?"; "Tell me the weather forecast for Redmond"
Users and developers can create their own macros based on text transcription and substitution; application execution (with support for command-line arguments); keyboard shortcuts; emulation of existing voice commands; or a combination of these items. XML, JScript, and VBScript are supported. [52] Macros can be limited to specific applications, [66] and rules for macros can be defined programmatically. [58] For a macro to load, it must be stored in a Speech Macros folder within the active user's Documents directory. All macros are digitally signed by default if a user certificate is available, to ensure that stored commands are not altered or loaded by third parties; if a certificate is not available, an administrator can create one. [67] Configurable security levels can prohibit unsigned macros from being loaded, prompt users to sign macros after creation, or allow unsigned macros to be loaded. [66]
As of 2017, WSR uses Microsoft Speech Recognizer 8.0, the version introduced in Windows Vista. For dictation, it was found to be 93.6% accurate without training by Mark Hachman, a senior editor at PC World, a rate that is not as accurate as competing software. According to Microsoft, the rate of accuracy when trained is 99%. Hachman opined that Microsoft does not publicly discuss the feature because of the 2006 incident during the development of Windows Vista, with the result that few users knew that documents could be dictated within Windows before the introduction of Cortana. [44]
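The Speech Recognizer 8.0 engine that WSR drives is also exposed to applications through SAPI as a "shared" recognizer. The sketch below is a minimal illustration under that assumption, not code taken from WSR itself: it connects to the shared recognizer, activates a dictation grammar, waits for one recognition event, and prints the recognized text. Error handling is abbreviated.

```cpp
// Minimal SAPI 5 sketch: receive one dictation result from the shared recognizer
// (the same system-wide engine used by Windows Speech Recognition).
#include <atlbase.h>   // CComPtr
#include <sapi.h>
#include <sphelper.h>  // CSpEvent helper
#include <iostream>

int main() {
    if (FAILED(::CoInitialize(nullptr))) return 1;
    {
        CComPtr<ISpRecognizer> recognizer;
        CComPtr<ISpRecoContext> context;
        CComPtr<ISpRecoGrammar> grammar;

        // Connect to the shared, system-wide recognizer.
        recognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
        recognizer->CreateRecoContext(&context);

        // Deliver recognition events via a Win32 event handle.
        context->SetNotifyWin32Event();
        context->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

        // Activate a plain dictation grammar.
        context->CreateGrammar(0, &grammar);
        grammar->LoadDictation(nullptr, SPLO_STATIC);
        grammar->SetDictationState(SPRS_ACTIVE);

        // Block until a phrase is recognized, then print its text.
        if (context->WaitForNotifyEvent(INFINITE) == S_OK) {
            CSpEvent evt;
            if (evt.GetFrom(context) == S_OK && evt.eEventId == SPEI_RECOGNITION) {
                LPWSTR text = nullptr;
                evt.RecoResult()->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                          TRUE, &text, nullptr);
                std::wcout << L"Heard: " << text << std::endl;
                ::CoTaskMemFree(text);
            }
        }
        grammar->SetDictationState(SPRS_INACTIVE);
    }
    ::CoUninitialize();
    return 0;
}
```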
Microsoft Agent is a technology developed by Microsoft which employs animated characters, text-to-speech engines, and speech recognition software to enhance interaction with computer users. It came pre-installed as part of Windows 2000 and later versions of Microsoft Windows up to Windows Vista. It was not included with Windows 7, and was completely discontinued in Windows 8. Microsoft Agent functionality was exposed as an ActiveX control that could be used by web pages.
Virtual PC is a discontinued x86 emulator for PowerPC Mac hosts and a hypervisor for Microsoft Windows hosts. It was created by Connectix in 1997 and acquired by Microsoft in 2003. The Mac version was discontinued in 2006 following the Mac transition to Intel, while the Windows version was discontinued in 2011 in favor of Hyper-V.
IBM ViaVoice was a range of language-specific continuous speech recognition software products offered by IBM. Its most recent versions were designed primarily for use in embedded devices. The latest stable version, ViaVoice 9.0, was able to transfer text directly into Word.
WordPad is a word processor included with Windows 95 and later. Similarly to its predecessor Microsoft Write, it is a basic word processor, positioned as more advanced than the Notepad text editor by supporting rich text editing, but with a subset of the functionality of Microsoft Word.
Microsoft Office XP is an office suite which was officially revealed in July 2000 by Microsoft for the Windows operating system. Office XP was released to manufacturing on March 5, 2001, and was later made available to retail on May 31, 2001. A Mac OS X equivalent, Microsoft Office v. X was released on November 19, 2001.
A voice-user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device is a device controlled with a voice user interface.
Windows Vista is a major release of the Windows NT operating system developed by Microsoft. It was the direct successor to Windows XP, released five years earlier, which was then the longest time span between successive releases of Microsoft Windows. It was released to manufacturing on November 8, 2006, and over the following two months, it was released in stages to business customers, original equipment manufacturers (OEMs), and retail channels. On January 30, 2007, it was released internationally and was made available for purchase and download from the Windows Marketplace; it is the first release of Windows to be made available through a digital distribution platform.
Microsoft Office 2007 is an office suite for Windows, developed and published by Microsoft. It was officially revealed on March 9, 2006 and was the 12th version of Microsoft Office. It was released to manufacturing on November 3, 2006; it was subsequently made available to volume license customers on November 30, 2006, and later to retail on January 30, 2007. The Mac OS X equivalent, Microsoft Office 2008 for Mac, was released on January 15, 2008.
Compared with previous versions of Microsoft Windows, features new to Windows Vista are numerous, covering most aspects of the operating system, including additional management features, new aspects of security and safety, new I/O technologies, new networking features, and new technical features. Windows Vista also removed some features found in earlier versions.
The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.
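As a minimal sketch of SAPI's synthesis side (illustrative only, not code from any Microsoft product), the following creates an ISpVoice instance and speaks one sentence with the default installed voice; the project must link against sapi.lib.

```cpp
// Minimal SAPI 5 text-to-speech sketch: speak one sentence with the default voice.
#include <sapi.h>
#include <iostream>

int main() {
    if (FAILED(::CoInitialize(nullptr))) return 1;

    ISpVoice* voice = nullptr;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, nullptr, CLSCTX_ALL,
                                    IID_ISpVoice, reinterpret_cast<void**>(&voice));
    if (SUCCEEDED(hr)) {
        // Synchronous synthesis through the default SAPI voice.
        hr = voice->Speak(L"Windows Speech Recognition is listening.", SPF_DEFAULT, nullptr);
        voice->Release();
    } else {
        std::cerr << "Failed to create the SAPI voice object.\n";
    }

    ::CoUninitialize();
    return SUCCEEDED(hr) ? 0 : 1;
}
```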
Windows SideShow was a feature by Microsoft introduced in Windows Vista to supply information such as e-mail, instant messages, and RSS feeds from a personal computer to a local or remote peripheral device or display. SideShow was intended to enhance the Windows experience by enabling new mobility scenarios for the Windows platform and by providing power saving benefits as part of Microsoft's broader efforts regarding a mobile initiative.
The Text Services Framework (TSF) is a COM framework and API in Windows XP and later Windows operating systems that supports advanced text input and text processing. The Language Bar is the core user interface for Text Services Framework.
Windows Vista has many significant new features compared with previous Microsoft Windows versions, covering most aspects of the operating system.
Task Scheduler is a job scheduler in Microsoft Windows that launches computer programs or scripts at pre-defined times or after specified time intervals. Microsoft introduced this component in Microsoft Plus! for Windows 95 as System Agent. Its core component is an eponymous Windows service. The Windows Task Scheduler infrastructure is the basis for the Windows PowerShell scheduled jobs feature introduced with PowerShell v3.
Some of the new features included in Windows 7 are advancements in touch, speech and handwriting recognition, support for virtual hard disks, support for additional file formats, improved performance on multi-core processors, improved boot performance, and kernel improvements.
The Microsoft text-to-speech voices are speech synthesizers provided for use with applications that use the Microsoft Speech API (SAPI) or the Microsoft Speech Server Platform. There are client, server, and mobile versions of Microsoft text-to-speech voices. Client voices are shipped with Windows operating systems; server voices are available for download for use with server applications such as Speech Server and Lync on both Windows client and server platforms; and mobile voices are often shipped with more recent versions.
Microsoft Tablet PC is a term coined by Microsoft for pen-enabled tablet computers that conform to hardware specifications devised by Microsoft and announced in 2001, and that run a licensed copy of the Windows XP Tablet PC Edition operating system or a derivative thereof.
Tazti is a speech recognition software package developed and sold by Voice Tech Group, Inc. for Windows personal computers. The most recent package is version 3.2, which supports Windows 10, Windows 8.1, Windows 8, and Windows 7 64-bit editions. Earlier versions of Tazti supported Windows Vista and Windows XP. Its primary features are playing PC video games by voice, controlling PC applications and programs by voice, and creating speech commands that trigger a browser to open web pages or the Windows operating system to open files, folders, or programs. Earlier versions of Tazti included a lite dictation feature that has been eliminated from the latest version.
Braina is a virtual assistant and speech-to-text dictation application for Microsoft Windows developed by Brainasoft. Braina uses a natural language interface, speech synthesis, and speech recognition technology to interact with its users and allows them to use natural language sentences to perform various tasks on a computer in most languages of the world. The name Braina is a short form of "Brain Artificial".