Windows Speech Recognition (WSR) is speech recognition developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface; dictate text in electronic documents and email; navigate websites; perform keyboard shortcuts; and operate the mouse cursor. It supports custom macros to perform additional or supplementary tasks.

WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition, but adapts based on contexts, grammars, speech samples, training sessions, and vocabularies. It provides a personal dictionary that allows users to include or exclude words or expressions from dictation and to record pronunciations to increase recognition accuracy. Custom language models are also supported.

With Windows Vista, WSR was developed to be part of Windows, as speech recognition was previously exclusive to applications such as Windows Media Player. It is present in Windows 7, Windows 8, Windows 8.1, Windows RT, Windows 10, and Windows 11.

History

Microsoft was involved in speech recognition and speech synthesis research for many years before WSR. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts; the company's research led to the development of the Speech API (SAPI) introduced in 1994. Speech recognition had also been used in previous Microsoft products. Office XP and Office 2003 provided speech recognition capabilities in Internet Explorer and Microsoft Office applications; these releases also enabled limited speech functionality in Windows 98, Windows Me, Windows NT 4.0, and Windows 2000. Windows XP Tablet PC Edition 2002 included speech recognition capabilities with the Tablet PC Input Panel, and Microsoft Plus! for Windows XP enabled voice commands for Windows Media Player. However, these all required installation of speech recognition as a separate component; before Windows Vista, Windows did not include integrated or extensive speech recognition. Office 2007 and later versions rely on WSR for speech recognition services.

Windows Vista

At WinHEC 2002 Microsoft announced that Windows Vista (codenamed "Longhorn") would include advances in speech recognition and in features such as microphone array support as part of an effort to "provide a consistent quality audio infrastructure for natural (continuous) speech recognition and (discrete) command and control." Bill Gates stated during PDC 2003 that Microsoft would "build speech capabilities into the system – a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time"; and pre-release builds during the development of Windows Vista included a speech engine with training features. A PDC 2003 developer presentation stated Windows Vista would also include a user interface for microphone feedback and control, and user configuration and training features. Microsoft clarified the extent to which speech recognition would be integrated when it stated in a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide."

During WinHEC 2004 Microsoft included WSR as part of a strategy to improve productivity on mobile PCs. Microsoft later emphasized accessibility, new mobility scenarios, support for additional languages, and improvements to the speech user experience at WinHEC 2005. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between separate Commanding and Dictation modes, Windows Vista would introduce a dedicated interface for speech input on the desktop and would unify the separate speech modes; users previously could not speak a command after dictating or vice versa without first switching between these two modes.

Windows Vista Beta 1 included integrated speech recognition. To incentivize company employees to analyze WSR for software glitches and to provide feedback, Microsoft offered an opportunity for its testers to win a Premium model of the Xbox 360.

During a demonstration by Microsoft on July 27, 2006, before Windows Vista's release to manufacturing (RTM), a notable incident involving WSR occurred: several attempts to dictate led to consecutive output errors, producing the unintended output "Dear aunt, let's set so double the killer delete select all." The incident was a subject of significant derision among analysts and journalists in the audience, despite another demonstration for application management and navigation being successful. Microsoft revealed these issues were due to an audio gain glitch that caused the recognizer to distort commands and dictations; the glitch was fixed before Windows Vista's release.

Reports from early 2007 indicated that WSR is vulnerable to attackers using speech recognition for malicious operations by playing certain audio commands through a target's speakers; it was the first vulnerability discovered after Windows Vista's general availability. Microsoft stated that although such an attack is theoretically possible, a number of mitigating factors and prerequisites would limit its effectiveness or prevent it altogether: a target would need the recognizer to be active and configured to properly interpret such commands; microphones and speakers would both need to be enabled and at sufficient volume levels; and an attack would require the computer to perform visible operations and produce audible feedback without users noticing. User Account Control would also prohibit the occurrence of privileged operations.

Windows 7

WSR was updated to use Microsoft UI Automation and its engine now uses the WASAPI audio stack, substantially enhancing its performance and enabling support for echo cancellation, respectively. The document harvester, which can analyze and collect text in email and documents to contextualize user terms, has improved performance and now runs periodically in the background instead of only after recognizer startup. Sleep mode has also seen performance improvements and, to address security issues, the recognizer is turned off by default after users speak "stop listening" instead of being suspended. Windows 7 also introduces an option to submit speech training data to Microsoft to improve future recognizer versions. A new dictation scratchpad interface functions as a temporary document into which users can dictate or type text for insertion into applications that are not compatible with the Text Services Framework. Windows Vista previously provided an "enable dictation everywhere" option for such applications.

Windows 8.x and Windows RT

WSR can be used to control the Metro user interface in Windows 8, Windows 8.1, and Windows RT with commands to open the Charms bar ("Press Windows C"); to dictate or display commands in Metro-style apps ("Press Windows Z"); to perform tasks in apps (e.g., "Change to Celsius" in MSN Weather); and to display all installed apps listed by the Start screen ("Apps").

Windows 10

WSR is featured in the Settings application starting with the Windows 10 April 2018 Update (Version 1803); the change first appeared in Insider Preview Build 17083. The April 2018 Update also introduces a new keyboard shortcut to activate WSR.

Windows 11

In Windows 11 version 22H2, a second Microsoft app, Voice Access, was added in addition to WSR.

Overview and features

WSR allows a user to control applications and the Windows desktop through voice commands. Users can dictate text within documents, email, and forms; control the operating system user interface; perform keyboard shortcuts; and move the mouse cursor. The majority of integrated applications in Windows Vista can be controlled; third-party applications must support the Text Services Framework for dictation. English (U.S.), English (U.K.), French, German, Japanese, Mandarin Chinese, and Spanish are supported languages.

When started for the first time, WSR presents a microphone setup wizard and an optional interactive step-by-step tutorial that users can follow to learn basic commands while adapting the recognizer to their specific voice characteristics; the tutorial is estimated to require approximately 10 minutes to complete. The accuracy of the recognizer increases through regular use, which adapts it to contexts, grammars, patterns, and vocabularies. Custom language models for the specific contexts, phonetics, and terminologies of users in particular occupational fields such as legal or medical are also supported. With Windows Search, the recognizer can also optionally harvest text in documents and email, as well as handwritten tablet PC input, to contextualize and disambiguate terms to improve accuracy; no information is sent to Microsoft.

WSR is a locally processed speech recognition platform; it does not rely on cloud computing for accuracy, dictation, or recognition. Speech profiles that store information about users are retained locally. Backups and transfers of profiles can be performed via Windows Easy Transfer.

Interface

The WSR interface consists of a status area that displays instructions, information about commands (e.g., if a command is not heard by the recognizer), and the status of the recognizer; a voice meter displays visual feedback about volume levels. The status area represents the current state of WSR in a total of three modes, listed below with their respective meanings:

* Listening: The recognizer is active and waiting for user input
* Sleeping: The recognizer will not listen for or respond to commands other than "Start listening"
* Off: The recognizer will not listen or respond to any commands; this mode can be enabled by speaking "Stop listening"

Colors of the recognizer listening mode button denote its various modes of operation: blue when listening; blue-gray when sleeping; gray when turned off; and yellow when the user switches context (e.g., from the desktop to the taskbar) or when a voice command is misinterpreted. The status area can also display custom user information as part of Windows Speech Recognition Macros.
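The three modes and their button colors can be read as a small state machine. The following Python sketch is illustrative only: the names `Mode`, `BUTTON_COLOR`, and `next_mode` are hypothetical, and the transitions encode just what is stated above (the text does not say how the Sleeping mode is entered, so no transition into it is modeled, and the transient yellow state is omitted).

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "listening"
    SLEEPING = "sleeping"
    OFF = "off"


# Button colors as described in the text (yellow is a transient state, not a mode).
BUTTON_COLOR = {
    Mode.LISTENING: "blue",
    Mode.SLEEPING: "blue-gray",
    Mode.OFF: "gray",
}


def next_mode(mode: Mode, phrase: str) -> Mode:
    """Return the mode after the recognizer hears `phrase` (assumed behavior)."""
    phrase = phrase.strip().lower()
    if mode is Mode.OFF:
        # Off: no command is listened for or acted upon.
        return Mode.OFF
    if mode is Mode.SLEEPING:
        # Sleeping: only "Start listening" is honored.
        return Mode.LISTENING if phrase == "start listening" else Mode.SLEEPING
    # Listening: "Stop listening" turns the recognizer off; other input keeps it active.
    return Mode.OFF if phrase == "stop listening" else Mode.LISTENING
```

For instance, `next_mode(Mode.SLEEPING, "Start listening")` yields `Mode.LISTENING`, while any phrase spoken in `Mode.OFF` leaves the mode unchanged.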

Alternates panel

An alternates panel disambiguation interface lists items interpreted as being relevant to a user's spoken word(s); if the word or phrase that a user desired to insert into an application is listed among results, a user can speak the corresponding number of the word or phrase in the results and confirm this choice by speaking "OK" to insert it within the application. The alternates panel also appears when launching applications or speaking commands that refer to more than one item (e.g., speaking "Start Internet Explorer" may list both the web browser and a separate version with add-ons disabled). An ''ExactMatchOverPartialMatch'' entry in the Windows Registry can limit commands to items with exact names if there is more than one instance included in results.
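The effect of preferring exact matches can be sketched in a few lines of Python. This is an illustration of the idea only, not the recognizer's actual matching logic: the functions `candidates` and `alternates_panel`, the substring-based notion of a partial match, and the application names are all assumptions.

```python
def candidates(spoken: str, installed: list[str], exact_over_partial: bool = False) -> list[str]:
    """List items whose name contains the spoken phrase.

    With exact_over_partial set, exact-name matches (if any) replace the
    broader list, mirroring the described registry behavior.
    """
    needle = spoken.casefold()
    partial = [name for name in installed if needle in name.casefold()]
    if exact_over_partial:
        exact = [name for name in partial if name.casefold() == needle]
        if exact:
            return exact
    return partial


def alternates_panel(items: list[str]) -> dict[int, str]:
    """Number the items so that one can be chosen by speaking its number."""
    return {number: item for number, item in enumerate(items, start=1)}
```

With a hypothetical list containing "Internet Explorer" and "Internet Explorer (No Add-ons)", the default call returns both entries for the panel, while `exact_over_partial=True` returns only the exact name, so no disambiguation step is needed.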

Common commands

Listed below are common WSR commands. Words in ''italics'' indicate a word that can be substituted for the desired item (e.g., "direction" in "scroll ''direction''" can be substituted with the word "''down''"). A "start typing" command enables WSR to interpret all dictation commands as keyboard shortcuts.

: Dictation commands: "New line"; "New paragraph"; "Tab"; "Literal ''word''"; "Numeral ''number''"; "Go to ''word''"; "Go after ''word''"; "No space"; "Go to start of sentence"; "Go to end of sentence"; "Go to start of paragraph"; "Go to end of paragraph"; "Go to start of document"; "Go to end of document"; "Go to ''field name''" (e.g., go to ''address'', ''cc'', or ''subject''). Special characters such as a comma are dictated by speaking the name of the special character.
: Navigation commands:
:: Keyboard shortcuts: "Press ''keyboard key''"; "Press ''key'' plus ''key''"; "Press capital ''letter''." Certain keys can be pressed without first giving the press command.
:: Mouse commands: "Click"; "Click ''that''"; "Double-click"; "Double-click ''that''"; "Mark"; "Mark ''that''"; "Right-click"; "Right-click ''that''"; "MouseGrid".
:: Window management commands: "Close (alternatively maximize, minimize, or restore) window"; "Close ''that''"; "Close ''name of open application''"; "Switch applications"; "Switch to ''name of open application''"; "Scroll ''direction''"; "Scroll ''direction'' in ''number of pages''"; "Show desktop"; "Show Numbers."
: Speech recognition commands: "Start listening"; "Stop listening"; "Show speech options"; "Open speech dictionary"; "Move speech recognition"; "Minimize speech recognition"; "Restore speech recognition".

In the English language, applicable commands can be shown by speaking "What can I say?" Users can also query the recognizer about tasks in Windows by speaking "How do I ''task name''" (e.g., "How do I install a printer?"), which opens related help documentation.

''MouseGrid''

''MouseGrid'' enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions gradually narrow as a user speaks the number(s) of the region on which to focus until the desired interface element is reached. Users can then issue commands including "Click ''number of region''," which moves the mouse cursor to the desired region and then clicks it; and "Mark ''number of region''", which allows an item (such as a computer icon) in a region to be selected, which can then be clicked with the previous ''click'' command. Users also can interact with multiple regions at once.
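The narrowing step lends itself to a short worked example. The Python sketch below assumes a 3×3 grid numbered 1 to 9 in row-major order (1 at the top left, 9 at the bottom right) and that the pointer lands at the center of the final region; both are assumptions for illustration, and the function names are hypothetical.

```python
def narrow(region, number):
    """Return the sub-region of `region` selected by a spoken number 1-9.

    `region` is (left, top, width, height); cells are assumed row-major.
    """
    if not 1 <= number <= 9:
        raise ValueError("region number must be between 1 and 9")
    left, top, width, height = region
    row, col = divmod(number - 1, 3)
    cell_w, cell_h = width / 3, height / 3
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def center(region):
    """Center point of a (left, top, width, height) region."""
    left, top, width, height = region
    return (left + width / 2, top + height / 2)


def mousegrid(screen, numbers):
    """Apply successive spoken numbers to the full screen and return the target point."""
    region = (0.0, 0.0, *screen)
    for number in numbers:
        region = narrow(region, number)
    return center(region)
```

Each spoken number shrinks the active area to one ninth of its previous size, so on a 900×900 area the sequence 9, 1 targets the point (650, 650), and a few more numbers are enough to reach a region only a few pixels wide.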

''Show Numbers''

Applications and interface elements that do not present identifiable commands can still be controlled by asking the system to overlay numbers on top of them through a ''Show Numbers'' command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. ''Show Numbers'' was designed so that users could interact with items that are not readily identifiable.

Dictation

WSR enables dictation of text in applications and Windows. If a dictation mistake occurs, it can be corrected by speaking "Correct ''word''" or "Correct that"; the alternates panel then appears and provides suggestions for correction, which can be selected by speaking the number of the suggestion followed by "OK." If the desired item is not listed among suggestions, a user can speak it so that it might appear. Alternatively, users can speak "Spell it" or "I'll spell it myself" to speak the desired word on a letter-by-letter basis; users can use their personal alphabet or the NATO phonetic alphabet (e.g., "N as in November") when spelling. Multiple words in a sentence can be corrected simultaneously (for example, if a user speaks "dictating" but the recognizer interprets this word as "the thing," a user can state "correct the thing" to correct both words at once). In the English language over 100,000 words are recognized by default.
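Letter-by-letter spelling of the "N as in November" form can be illustrated with a small parser. The Python sketch below is hypothetical and works on already-transcribed text rather than audio; it accepts either a bare letter or "''letter'' as in ''word''" and, to allow a personal alphabet, only checks that the word begins with the stated letter.

```python
import re

# A bare letter, optionally followed by "as in <word>".
_SPELLED = re.compile(r"^([a-z])(?:\s+as\s+in\s+([a-z]+))?$", re.IGNORECASE)


def spoken_letter(utterance: str) -> str:
    """Return the letter named by an utterance such as 'N as in November'."""
    match = _SPELLED.match(utterance.strip())
    if match is None:
        raise ValueError(f"not a spelled letter: {utterance!r}")
    letter, word = match.group(1), match.group(2)
    if word is not None and word[0].casefold() != letter.casefold():
        raise ValueError(f"{word!r} does not start with {letter!r}")
    return letter.lower()


def spell(utterances: list[str]) -> str:
    """Join a sequence of spelled letters into a word."""
    return "".join(spoken_letter(u) for u in utterances)
```

Because the key word is validated only against its first letter, both NATO words ("T as in Tango") and personal choices ("T as in tiger") resolve to the same letter.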

Speech dictionary

A personal dictionary allows users to include or exclude certain words or expressions from dictation. When a user adds a word beginning with a capital letter to the dictionary, a user can specify whether it should always be capitalized or if capitalization depends on the context in which the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy; words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Information stored within a dictionary is included as part of a user's speech profile. Users can open the speech dictionary by speaking the "show speech dictionary" command.

Macros

WSR supports custom macros through a supplementary application by Microsoft that enables additional natural language commands. As an example of this functionality, an email macro released by Microsoft enables a natural language command where a user can speak "send email to ''contact'' about ''subject''," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted. Microsoft has also released sample macros for the speech dictionary, for Windows Media Player, for Microsoft PowerPoint, to switch between multiple microphones, to customize various aspects of audio device configuration such as volume levels, and for general natural language queries such as "What is the weather forecast?" "What time is it?" and "What's the date?" Responses to these user inquiries are spoken back to the user in the active Microsoft text-to-speech voice installed on the machine.

Users and developers can create their own macros based on text transcription and substitution; application execution (with support for command-line arguments); keyboard shortcuts; emulation of existing voice commands; or a combination of these items. XML, JScript and VBScript are supported. Macros can be limited to specific applications and rules for macros can be defined programmatically. For a macro to load, it must be stored in a ''Speech Macros'' folder within the active user's ''Documents'' directory. All macros are digitally signed by default if a user certificate is available to ensure that stored commands are not altered or loaded by third parties; if a certificate is not available, an administrator can create one. Configurable security levels can prohibit unsigned macros from being loaded, prompt users to sign macros after creation, or allow unsigned macros to load.
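The way a natural language command such as "send email to ''contact'' about ''subject''" maps spoken words onto named slots can be sketched with a template matcher. The Python below is a conceptual illustration only and does not reproduce the XML macro format; the `{slot}` syntax and the functions `compile_template` and `match_command` are hypothetical.

```python
import re


def compile_template(template: str) -> re.Pattern:
    """Turn 'send email to {contact} about {subject}' into a regular expression.

    Literal text must match exactly (ignoring case); each {slot} captures
    one or more words under that slot's name.
    """
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)      # literal text between slots
        else:
            pattern += f"(?P<{part}>.+?)"   # named slot
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def match_command(template: str, utterance: str):
    """Return a dict of slot values if the utterance fits the template, else None."""
    match = compile_template(template).match(utterance.strip())
    return match.groupdict() if match else None
```

A macro host could then pass the captured values to an action, for example opening a new message with the contact and subject filled in; utterances that do not fit the template simply return `None` and fall through to other commands.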

Performance

WSR uses Microsoft Speech Recognizer 8.0, the version introduced in Windows Vista. Mark Hachman, a Senior Editor of ''PC World'', found it to be 93.6% accurate for dictation without training, a rate lower than that of competing software. According to Microsoft, the rate of accuracy when trained is 99%. Hachman opined that Microsoft does not publicly discuss the feature because of the 2006 incident during the development of Windows Vista, with the result being that few users knew that documents could be dictated within Windows before the introduction of Cortana.

External links

Windows Vista Speech Recognition demonstration at Microsoft Financial Analyst Meeting