
The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

In general, all versions of the API have been designed such that a software developer can write an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages. In addition, it is possible for a third-party company to produce its own speech recognition and text-to-speech engines or adapt existing engines to work with SAPI. In principle, as long as these engines conform to the defined interfaces they can be used instead of the Microsoft-supplied engines.

In general, the Speech API is a freely redistributable component which can be shipped with any Windows application that wishes to use speech technology. Many versions (although not all) of the speech recognition and synthesis engines are also freely redistributable.

There have been two main 'families' of the Microsoft Speech API. SAPI versions 1 through 4 are all similar to each other, with extra features in each newer version. SAPI 5, however, was a completely new interface, released in 2000. Since then several sub-versions of this API have been released.


Basic architecture

The Speech API can be viewed as an interface or piece of middleware which sits between ''applications'' and speech ''engines'' (recognition and synthesis). In SAPI versions 1 to 4, applications could directly communicate with engines. The API included an abstract ''interface definition'' which applications and engines conformed to. Applications could also use simplified higher-level objects rather than directly call methods on the engines.

In SAPI 5, however, applications and engines do not directly communicate with each other. Instead, each talks to a runtime component (sapi.dll). This component implements an API which applications use, and another set of interfaces for engines.

Typically in SAPI 5 applications issue calls through the API (for example to load a recognition grammar, start recognition, or provide text to be synthesized). The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces (for example, a grammar is loaded from a file by the runtime, but the grammar data is then passed to the recognition engine to actually use in recognition). The recognition and synthesis engines also generate events while processing (for example, to indicate an utterance has been recognized or to indicate word boundaries in the synthesized speech). These pass in the reverse direction, from the engines, through the runtime DLL, and on to an ''event sink'' in the application.

In addition to the actual API definition and runtime DLL, other components are shipped with all versions of SAPI to make a complete Speech Software Development Kit. The following components are among those included in most versions of the Speech SDK:
*''API definition files'' - in MIDL and as C or C++ header files.
*''Runtime components'' - e.g. sapi.dll.
*''Control Panel applet'' - to select and configure the default speech recognizer and synthesizer.
*''Text-To-Speech engines'' in multiple languages.
*''Speech Recognition engines'' in multiple languages.
*''Redistributable components'' to allow developers to package the engines and runtime with their application code to produce a single installable application.
*''Sample application code''.
*''Sample engines'' - implementations of the necessary engine interfaces but with no true speech processing, which could be used as a sample for those porting an engine to SAPI.
*''Documentation''.
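The SAPI 5 call and event flow described in this section can be illustrated with a small model. The following is a minimal sketch in Python, not SAPI code: the class names `ToyRuntime` and `ToyEngine`, their methods, and the event names are all hypothetical and chosen only to mirror the roles of the runtime, the engine and the application's event sink.

```python
from typing import Callable, List

EventSink = Callable[[str, str], None]


class ToyEngine:
    """Stand-in for a recognition engine; it does no real speech processing."""

    def __init__(self) -> None:
        self.phrases: List[str] = []

    def set_grammar(self, phrases: List[str]) -> None:
        # The engine only receives grammar data already loaded by the runtime.
        self.phrases = list(phrases)

    def process(self, utterance: str, emit: EventSink) -> None:
        # The outcome is reported as an event rather than returned to the caller.
        if utterance in self.phrases:
            emit("recognition", utterance)
        else:
            emit("false_recognition", utterance)


class ToyRuntime:
    """Plays the role of the runtime: the only component both sides talk to."""

    def __init__(self, engine: ToyEngine) -> None:
        self._engine = engine
        self._sinks: List[EventSink] = []

    def add_event_sink(self, sink: EventSink) -> None:
        self._sinks.append(sink)

    def load_grammar(self, text: str) -> None:
        # Parsing happens here; only the resulting data is handed to the engine.
        phrases = [line.strip() for line in text.splitlines() if line.strip()]
        self._engine.set_grammar(phrases)

    def recognize(self, utterance: str) -> None:
        self._engine.process(utterance, self._dispatch)

    def _dispatch(self, kind: str, text: str) -> None:
        # Events travel from the engine, through the runtime, to the application.
        for sink in self._sinks:
            sink(kind, text)


events = []
runtime = ToyRuntime(ToyEngine())
runtime.add_event_sink(lambda kind, text: events.append((kind, text)))
runtime.load_grammar("open file\nclose file\n")
runtime.recognize("open file")
runtime.recognize("save file")
print(events)
```

The application never holds a reference to the engine, so a different engine could be substituted without changing application code, which is the property the SAPI 5 design is described as aiming for.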


Versions

Xuedong Huang was a key person who led Microsoft's early SAPI efforts.


SAPI 1-4 API family


SAPI 1

The first version of SAPI was released in 1995, and was supported on Windows 95 and Windows NT 3.51. This version included low-level Direct Speech Recognition and Direct Text To Speech APIs which applications could use to directly control engines, as well as simplified 'higher-level' Voice Command and Voice Talk APIs.


SAPI 3

SAPI 3.0 was released in 1997. It added limited support for dictation speech recognition (discrete speech, not continuous), and additional sample applications and audio sources.


SAPI 4

SAPI 4.0 was released in 1998. This version of SAPI included the core COM API, C++ wrapper classes to make programming from C++ easier, and ActiveX controls to allow drag-and-drop Visual Basic development. This was shipped as part of an SDK that included recognition and synthesis engines. It also shipped (with synthesis engines only) in Windows 2000.

The main components of the SAPI 4 API (which were all available in C++, COM, and ActiveX flavors) were:
*Voice Command - high-level objects for command & control speech recognition
*Voice Dictation - high-level objects for continuous dictation speech recognition
*Voice Talk - high-level objects for speech synthesis
*Voice Telephony - objects for writing telephone speech applications
*Direct Speech Recognition - objects for direct control of the recognition engine
*Direct Text To Speech - objects for direct control of the synthesis engine
*Audio objects - for reading from and writing to an audio device or file


SAPI 5 API family

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification.

The design of the new API included the concept of strictly separating the application and engine so all calls were routed through the runtime sapi.dll. This change was intended to make the API more 'engine-independent', preventing applications from inadvertently depending on features of a specific engine. In addition, this change was aimed at making it much easier to incorporate speech technology into an application by moving some management and initialization code into the runtime.

The new API was initially a pure COM API and could be used easily only from C/C++. Support for VB and scripting languages was added later. Operating systems from Windows 98 and NT 4.0 upwards were supported.

Major features of the API include:
*Shared Recognizer. For desktop speech recognition applications, a recognizer object can be used that runs in a separate process (sapisvr.exe). All applications using the shared recognizer communicate with this single instance. This allows sharing of resources, removes contention for the microphone and allows for a global UI for control of all speech applications.
*In-proc recognizer. For applications that require explicit control of the recognition process, the in-proc recognizer object can be used instead of the shared one.
*Grammar objects. Speech grammars are used to specify the words that the recognizer is listening for. SAPI 5 defines an XML markup for specifying a grammar, as well as mechanisms to create grammars dynamically in code. Methods also exist for instructing the recognizer to load a built-in dictation language model.
*Voice object. This performs speech synthesis, producing an audio stream from text. A markup language (similar to XML, but not strictly XML) can be used for controlling the synthesis process.
*Audio interfaces. The runtime includes objects for performing speech input from the microphone or speech output to speakers (or any sound device), as well as to and from wave files. It is also possible to write a custom audio object to stream audio to or from a non-standard location.
*User lexicon object. This allows custom words and pronunciations to be added by a user or application. These are added to the recognition or synthesis engine's built-in lexicons.
*Object tokens. This is a concept allowing recognition and TTS engines, audio objects, lexicons and other categories of object to be registered, enumerated and instantiated in a common way.
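The object token idea from the feature list (one common way to register, enumerate and instantiate engines, voices and other objects) can be modelled in a few lines. This is a minimal sketch in Python under stated assumptions: `TokenRegistry`, `ToyVoice`, the `"voices"` category name and the `Gender` attribute are hypothetical illustrations, not the SAPI interfaces or registry layout.

```python
from typing import Callable, Dict, List, Tuple


class ToyVoice:
    """Placeholder for an object that a token would instantiate."""

    def __init__(self, name: str) -> None:
        self.name = name


class TokenRegistry:
    """Maps category -> token id -> (attributes, factory)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, Dict[str, Tuple[Dict[str, str], Callable[[], object]]]] = {}

    def register(self, category: str, token_id: str,
                 attributes: Dict[str, str], factory: Callable[[], object]) -> None:
        self._tokens.setdefault(category, {})[token_id] = (dict(attributes), factory)

    def enumerate(self, category: str, **required: str) -> List[str]:
        # Return the ids of tokens whose attributes match every requirement.
        matches = []
        for token_id, (attributes, _factory) in self._tokens.get(category, {}).items():
            if all(attributes.get(key) == value for key, value in required.items()):
                matches.append(token_id)
        return matches

    def create(self, category: str, token_id: str) -> object:
        # Instantiate the object that the token describes.
        _attributes, factory = self._tokens[category][token_id]
        return factory()


registry = TokenRegistry()
for name, gender in [("Microsoft Sam", "Male"),
                     ("Microsoft Anna", "Female"),
                     ("Microsoft Lili", "Female")]:
    # name=name binds the current value so each factory builds its own voice.
    registry.register("voices", name, {"Gender": gender},
                      lambda name=name: ToyVoice(name))

female = registry.enumerate("voices", Gender="Female")
voice = registry.create("voices", female[0])
print(female, voice.name)
```

The same three operations would apply unchanged to a `"recognizers"` or `"lexicons"` category, which is the point of handling all object kinds through one mechanism.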


SAPI 5.0

This version shipped in late 2000 as part of the Speech SDK version 5.0, together with version 5.0 recognition and synthesis engines. The recognition engines supported continuous dictation and command & control and were released in U.S. English, Japanese and Simplified Chinese versions. In the U.S. English system, special acoustic models were available for children's speech and telephony speech. The synthesis engine was available in English and Chinese. This version of the API and recognition engines also shipped in Microsoft Office XP in 2001.


SAPI 5.1

This version shipped in late 2001 as part of the Speech SDK version 5.1. Automation-compliant interfaces were added to the API to allow use from Visual Basic, scripting languages such as JScript, and managed code. This version of the API and TTS engines were shipped in Windows XP. Windows XP Tablet PC Edition and Office 2003 also include this version but with a substantially improved version 6 recognition engine and support for Traditional Chinese.
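As an illustration of what the Automation interfaces make possible, the following is a minimal sketch that lists installed voices and speaks a sentence from Python. It assumes a Windows machine and the third-party pywin32 package; Python is not one of the languages named above and is used here only as an example of a scripting language with a COM bridge. The helper names `list_sapi_voices` and `speak` are hypothetical. On other platforms, or if the COM object cannot be created, both helpers fall back to doing nothing.

```python
import sys
from typing import List


def _create_voice():
    """Return a SAPI voice automation object, or None if it is unavailable."""
    if sys.platform != "win32":
        return None
    try:
        import win32com.client  # third-party pywin32 package (assumed installed)
        return win32com.client.Dispatch("SAPI.SpVoice")
    except Exception:
        # pywin32 is missing, or the SAPI COM object is not registered.
        return None


def list_sapi_voices() -> List[str]:
    """Return descriptions of the installed SAPI 5 voices, or [] if unavailable."""
    voice = _create_voice()
    if voice is None:
        return []
    try:
        tokens = voice.GetVoices()
        return [str(tokens.Item(i).GetDescription()) for i in range(tokens.Count)]
    except Exception:
        return []


def speak(text: str) -> bool:
    """Speak text with the default voice; return False if SAPI is unavailable."""
    voice = _create_voice()
    if voice is None:
        return False
    try:
        voice.Speak(text)
        return True
    except Exception:
        return False


print(list_sapi_voices())
```

Calling `speak("Hello from SAPI")` on a suitably configured Windows system should produce audio through the default voice; the call is left out of the listing so that running it has no audible side effects.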


SAPI 5.2

This was a special version of the API for use only in the Microsoft Speech Server which shipped in 2004. It added support for SRGS and SSML mark-up languages, as well as additional server features and performance improvements. The Speech Server also shipped with the version 6 desktop recognition engine and the version 7 server recognition engine.


SAPI 5.3

This is the version of the API that ships in Windows Vista together with new recognition and synthesis engines. As Windows Speech Recognition is now integrated into the operating system, the Speech SDK and APIs are a part of the Windows SDK. SAPI 5.3 includes the following new features:
* Support for W3C XML speech markup languages for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation.
* Support for the Speech Recognition Grammar Specification (SRGS) for the definition of context-free grammars, with two limitations:
** It does not support the use of SRGS to specify dual-tone multi-frequency (DTMF, touch-tone) grammars.
** It does not support the Augmented Backus–Naur form (ABNF) of SRGS.
* Support for semantic interpretation script within grammars. SAPI 5.3 enables an SRGS grammar to be annotated with JavaScript for semantic interpretation to supplement the recognized text.
* User-specified shortcuts in lexicons, which is the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
* Additional functionality and ease of programming provided by new types.
* Performance improvements, improved reliability, and security.
* Version 8 of the speech recognition engine ("Microsoft Speech Recognizer").
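To make the two W3C formats concrete, the following is a minimal sketch that builds a small SRGS XML grammar and an SSML document using Python's standard library. The element and attribute names follow the W3C SRGS 1.0 and SSML 1.0 recommendations; the helper names `build_srgs_grammar` and `build_ssml` and the sample phrases are hypothetical, and whether a particular SAPI 5.3 engine accepts these exact documents is an assumption that has not been verified here.

```python
import xml.etree.ElementTree as ET
from typing import List

SRGS_NS = "http://www.w3.org/2001/06/grammar"
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_srgs_grammar(rule_id: str, phrases: List[str], lang: str = "en-US") -> str:
    """Return an SRGS XML grammar whose root rule accepts any one of the phrases."""
    grammar = ET.Element("grammar", {
        "version": "1.0",
        "xmlns": SRGS_NS,
        "xml:lang": lang,
        "mode": "voice",  # a voice grammar, not a DTMF one
        "root": rule_id,
    })
    rule = ET.SubElement(grammar, "rule", {"id": rule_id, "scope": "public"})
    one_of = ET.SubElement(rule, "one-of")
    for phrase in phrases:
        ET.SubElement(one_of, "item").text = phrase
    return ET.tostring(grammar, encoding="unicode")


def build_ssml(text: str, emphasized: str, lang: str = "en-US") -> str:
    """Return SSML that speaks text slowly and loudly, stressing the last part."""
    speak = ET.Element("speak", {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang})
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "volume": "loud"})
    prosody.text = text + " "
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    return ET.tostring(speak, encoding="unicode")


grammar_xml = build_srgs_grammar("color", ["red", "green", "blue"])
ssml_xml = build_ssml("The selected color is", "green")
print(grammar_xml)
print(ssml_xml)
```

The grammar uses only the XML form and `mode="voice"`, which stays within the two limitations listed above (no DTMF grammars, no ABNF form).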


SAPI 5.4

This is an updated version of the API that ships in Windows 7.


SAPI 5 Voices

Microsoft Sam (Speech Articulation Module) is a commonly shipped SAPI 5 voice. In addition, Microsoft Office XP and Office 2003 installed L&H Michael and Michelle voices. The SAPI 5.1 SDK installs two more voices, ''Mike'' and ''Mary''. Windows Vista includes Microsoft Anna, which replaces Microsoft Sam and sounds more natural and intelligible. Microsoft Anna is also installed on Windows XP by Microsoft Streets & Trips 2006 and later versions. The Chinese version of Vista and later Windows client versions also include a female voice named Microsoft Lili.


Managed code Speech API

A managed code API ships as part of the .NET Framework 3.0. It has similar functionality to SAPI 5 but is more suitable for use by managed code applications. The new API is available on Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008.

The existing SAPI 5 API can also be used from managed code to a limited extent by creating COM Interop code (helper code designed to assist in accessing COM interfaces and classes). This works well in some scenarios, but the new API should provide a more seamless experience, equivalent to using any other managed code library. However, a major obstacle to transitioning from COM Interop is that the managed implementation has subtle memory leaks which lead to memory fragmentation and exclude the use of the library in any non-trivial application. As a workaround, Microsoft has suggested using a different API, which has fewer voices ("System.Speech has a memory leak", Microsoft Connect, connect.microsoft.com. Retrieved on 2013-09-27).


Speech functionality in Windows Vista

Windows Vista includes a number of new speech-related features including:
* Speech control of the full Windows GUI and applications
* New tutorial, microphone wizard, and UI for controlling speech recognition
* New version of the Speech API runtime: SAPI 5.3
* Built-in updated Speech Recognition engine (Version 8)
* New Speech Synthesis engine and SAPI voice Microsoft Anna
* Managed code speech API (codenamed SpeechFX)
* Speech recognition support for 8 languages at release time: U.S. English, U.K. English, traditional Chinese, simplified Chinese, Japanese, Spanish, French, and German, with more languages to be released later.

Microsoft Agent, most notably, and all other Microsoft speech applications use SAPI 5.


Compatibility

The Speech API is compatible with the following operating systems:


SAPI 5

* Microsoft Windows 11
* Microsoft Windows 10
* Microsoft Windows 8
* Microsoft Windows 7
* Microsoft Windows Vista
* Microsoft Windows XP


SAPI 4

* Microsoft Windows Millennium Edition
* Microsoft Windows 2000
* Microsoft Windows 98
* Microsoft Windows NT 4.0, Service Pack 6a, in English, Japanese and Simplified Chinese.
* Microsoft Windows 95


Major applications using SAPI

* Microsoft Windows XP Tablet PC Edition includes SAPI 5.1 and speech recognition engines 6.1 for English, Japanese, and Chinese (simplified and traditional)
* Windows Speech Recognition in Windows Vista and later
* Microsoft Narrator in Windows 2000 and later Windows operating systems
* Microsoft Office XP and Office 2003
* Microsoft Excel 2002, Microsoft Excel 2003, and Microsoft Excel 2007 for speaking spreadsheet data
* Microsoft Voice Command for Windows Pocket PC and Windows Mobile
* Microsoft Plus! Voice Command for Windows Media Player
* Adobe Reader uses voice output to read document content
* CoolSpeech, a text-to-speech application that reads text aloud from a variety of sources
* Window-Eyes screen reader
* JAWS screen reader
* NonVisual Desktop Access (NVDA), a free and open source screen reader


See also

* Comparison of speech synthesizers
* List of speech recognition software


External links


* Microsoft Cognitive Services Ignite 2018 event blog post
* Microsoft site for SAPI
* Microsoft download site for Speech API Software Developers Kit version 5.1
* Microsoft Systems Journal Whitepaper by Mike Rozak on the first version of SAPI
* Microsoft Speech Team blog