A voice-user interface (VUI) enables spoken human interaction with computers, using

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device is a device controlled with a voice user interface. Voice user interfaces have been added to

automobile A car, or an automobile, is a motor vehicle with wheels. Most definitions of cars state that they run primarily on roads, Car seat, seat one to eight people, have four wheels, and mainly transport private transport#Personal transport, peopl ...

home automation Home automation or domotics is building automation for a home. A home automation system will monitor and/or control home attributes such as lighting, climate, entertainment systems, and appliances. It may also include home security such ...

systems, computer

operating system An operating system (OS) is system software that manages computer hardware and software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ...

home appliance A home appliance, also referred to as a domestic appliance, an electric appliance or a household appliance, is a machine which assists in household functions such as cooking, cleaning and food preservation. The domestic application attached to ...

s like

washing machine A washing machine (laundry machine, clothes washer, washer, or simply wash) is a machine designed to laundry, launder clothing. The term is mostly applied to machines that use water. Other ways of doing laundry include dry cleaning (which uses ...

s and

microwave oven A microwave oven, or simply microwave, is an electric oven that heats and cooks food by exposing it to electromagnetic radiation in the microwave frequency range. This induces Dipole#Molecular dipoles, polar molecules in the food to rotate and ...

s, and television

remote control A remote control, also known colloquially as a remote or clicker, is an consumer electronics, electronic device used to operate another device from a distance, usually wirelessly. In consumer electronics, a remote control can be used to operat ...

s. They are the primary way of interacting with

virtual assistant A virtual assistant (VA) is a software agent that can perform a range of tasks or services for a user based on user input such as commands or questions, including verbal ones. Such technologies often incorporate chatbot capabilities to streaml ...

s on

smartphones A smartphone is a mobile phone with advanced computing capabilities. It typically has a touchscreen interface, allowing users to access a wide range of applications and services, such as web browsing, email, and social media, as well as mult ...

and

smart speaker A smart speaker is a type of loudspeaker and voice command device with an integrated virtual assistant (artificial intelligence), virtual assistant that offers interactive actions and Hands-free computing, hands-free activation with the help of o ...

s. Older

automated attendant In telephony, an automated attendant (also auto attendant, auto-attendant, autoattendant, automatic phone menus, AA, or virtual receptionist) allows callers to be automatically transferred to an extension without the intervention of an operator/ ...

s (which route phone calls to the correct extension) and

interactive voice response Interactive voice response (IVR) is a technology that allows telephone users to interact with a computer-operated telephone system through the use of voice and DTMF tones input with a keypad. In telephony, IVR allows customers to interact with a ...

systems (which conduct more complicated transactions over the phone) can respond to the pressing of keypad buttons via

DTMF Dual-tone multi-frequency (DTMF) signaling is a telecommunication signaling system using the voice-frequency band over telephone lines between telephone equipment and other communications devices and switching centers. DTMF was first developed ...

tones, but those with a full voice user interface allow callers to speak requests and responses without having to press any buttons. Newer voice command devices are speaker-independent, so they can respond to multiple voices, regardless of accent or dialectal influences. They are also capable of responding to several commands at once, separating vocal messages, and providing appropriate

feedback Feedback occurs when outputs of a system are routed back as inputs as part of a chain of cause and effect that forms a circuit or loop. The system can then be said to ''feed back'' into itself. The notion of cause-and-effect has to be handle ...

, accurately imitating a natural conversation.

Overview

A VUI is the interface to any speech application. Only a short time ago, controlling a machine by simply talking to it was only possible in

science fiction Science fiction (often shortened to sci-fi or abbreviated SF) is a genre of speculative fiction that deals with imaginative and futuristic concepts. These concepts may include information technology and robotics, biological manipulations, space ...

. Until recently, this area was considered to be

artificial intelligence Artificial intelligence (AI) is the capability of computer, computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of re ...

. However, advances in technologies like text-to-speech, speech-to-text,

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

, and cloud services contributed to the mass adoption of these types of interfaces. VUIs have become more commonplace, and people are taking advantage of the value that these hands-free, eyes-free interfaces provide in many situations. VUIs need to respond to input reliably, or they will be rejected and often ridiculed by their users. Designing a good VUI requires interdisciplinary talents of

computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...

linguistics Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...

and human factors

psychology Psychology is the scientific study of mind and behavior. Its subject matter includes the behavior of humans and nonhumans, both consciousness, conscious and Unconscious mind, unconscious phenomena, and mental processes such as thoughts, feel ...

– all of which are skills that are expensive and hard to come by. Even with advanced development tools, constructing an effective VUI requires an in-depth understanding of both the tasks to be performed, as well as the target audience that will use the final system. The closer the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction. A VUI designed for the general public should emphasize ease of use and provide a lot of help and guidance for first-time callers. In contrast, a VUI designed for a small group of

power users A power user is a user of computers, software and other electronic devices who uses advanced features of computer hardware, operating systems, programs, or websites which are not used by the average user. A power user might not have extensive tech ...

(including field service workers), should focus more on productivity and less on help and guidance. Such applications should streamline the call flows, minimize prompts, eliminate unnecessary iterations and allow elaborate "mixed initiative dialogs", which enable callers to enter several pieces of information in a single utterance and in any order or combination. In short, speech applications have to be carefully crafted for the specific business process that is being automated. Not all business processes render themselves equally well for speech automation. In general, the more complex the inquiries and transactions are, the more challenging they will be to automate, and the more likely they will be to fail with the general public. In some scenarios, automation is simply not applicable, so live agent assistance is the only option. A legal advice hotline, for example, would be very difficult to automate. On the flip side, speech is perfect for handling quick and routine transactions, like changing the status of a work order, completing a time or expense entry, or transferring funds between accounts.

History

Early applications for VUI included voice-activated dialing of phones, either directly or through a (typically

Bluetooth Bluetooth is a short-range wireless technology standard that is used for exchanging data between fixed and mobile devices over short distances and building personal area networks (PANs). In the most widely used mode, transmission power is li ...

) headset or vehicle audio system. In 2007, a

CNN Cable News Network (CNN) is a multinational news organization operating, most notably, a website and a TV channel headquartered in Atlanta. Founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable ne ...

business article reported that voice command was over a billion dollar industry and that companies like Google and

Apple An apple is a round, edible fruit produced by an apple tree (''Malus'' spp.). Fruit trees of the orchard or domestic apple (''Malus domestica''), the most widely grown in the genus, are agriculture, cultivated worldwide. The tree originated ...

were trying to create speech recognition features. In the years since the article was published, the world has witnessed a variety of voice command devices. Additionally, Google has created a speech recognition engine called Pico TTS and Apple released Siri. Voice command devices are becoming more widely available, and innovative ways for using the human voice are always being created. For example, Business Week suggests that the future remote controller is going to be the human voice. Currently

Xbox Live The Xbox network, formerly known and commonly referred to as Xbox Live, is an online multiplayer gaming and digital media delivery service created and operated by Microsoft Gaming for the Xbox brand. It was first made available to the origina ...

allows such features and Jobs hinted at such a feature on the new

Apple TV Apple TV is a digital media player and a microconsole developed and marketed by Apple. It is a small piece of networking hardware that sends received media data such as video and audio to a TV or external display. Its media services include ...

Voice command software products on computing devices

Both Apple Mac and

Windows Windows is a Product lining, product line of Proprietary software, proprietary graphical user interface, graphical operating systems developed and marketed by Microsoft. It is grouped into families and subfamilies that cater to particular sec ...

PC provide built in speech recognition features for their latest

operating systems An operating system (OS) is system software that manages computer hardware and software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ...

Microsoft Windows

Two Microsoft operating systems,

Windows 7 Windows 7 is a major release of the Windows NT operating system developed by Microsoft. It was Software release life cycle#Release to manufacturing (RTM), released to manufacturing on July 22, 2009, and became generally available on October 22, ...

and

Windows Vista Windows Vista is a major release of the Windows NT operating system developed by Microsoft. It was the direct successor to Windows XP, released five years earlier, which was then the longest time span between successive releases of Microsoft W ...

, provide speech recognition capabilities. Microsoft integrated voice commands into their operating systems to provide a mechanism for people who want to limit their use of the mouse and keyboard, but still want to maintain or increase their overall productivity.

Windows Vista

With Windows Vista voice control, a user may dictate documents and emails in mainstream applications, start and switch between applications, control the operating system, format documents, save documents, edit files, efficiently correct errors, and fill out forms on the

Web Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...

. The speech recognition software learns automatically every time a user uses it, and speech recognition is available in English (U.S.), English (U.K.), German (Germany), French (France), Spanish (Spain), Japanese, Chinese (Traditional), and Chinese (Simplified). In addition, the software comes with an interactive tutorial, which can be used to train both the user and the speech recognition engine.

Windows 7

In addition to all the features provided in Windows Vista, Windows 7 provides a wizard for setting up the microphone and a tutorial on how to use the feature.

Mac OS X

All

Mac OS X macOS, previously OS X and originally Mac OS X, is a Unix, Unix-based operating system developed and marketed by Apple Inc., Apple since 2001. It is the current operating system for Apple's Mac (computer), Mac computers. With ...

computers come pre-installed with the speech recognition software. The software is user-independent, and it allows for a user to, "navigate menus and enter keyboard shortcuts; speak checkbox names, radio button names, list items, and button names; and open, close, control, and switch among applications." However, the Apple website recommends a user buy a commercial product called Dictate.

Commercial products

If a user is not satisfied with the built in speech recognition software or a user does not have a built speech recognition software for their OS, then a user may experiment with a commercial product such as Braina Pro or DragonNaturallySpeaking for Windows PCs, and Dictate, the name of the same software for Mac OS.

Voice command mobile devices

Any mobile device running Android OS, Microsoft Windows Phone, iOS 9 or later, or Blackberry OS provides voice command capabilities. In addition to the built-in speech recognition software for each mobile phone's operating system, a user may download third party voice command applications from each operating system's application store:

Apple App store The App Store is an app marketplace developed and maintained by Apple, for mobile apps on its iOS and iPadOS operating systems. The store allows users to browse and download approved apps developed within Apple's iOS SDK. Apps can be download ...

Google Play Google Play, also known as the Google Play Store, Play Store, or sometimes the Android Store (and was formerly Android Market), is a digital distribution service operated and developed by Google. It serves as the official app store for certifie ...

Windows Phone Marketplace Windows Phone Store was a digital distribution platform developed by Microsoft for Windows Phone for publishers to release apps on and users to download apps to their smartphones. The app store was initially launched as Windows Phone Marketplac ...

(initially Windows Marketplace for Mobile), or

BlackBerry App World BlackBerry World was an Digital distribution, application distribution service (app marketplace) by BlackBerry Limited. The service provided BlackBerry users with an environment to browse, download, and update mobile apps, including third-party ap ...

Android OS

Google has developed an open source operating system called Android, which allows a user to perform voice commands such as: send text messages, listen to music, get directions, call businesses, call contacts, send email, view a map, go to websites, write a note, and search Google. The speech recognition software is available for all devices since Android 2.2 "Froyo", but the settings must be set to English. Google allows for the user to change the language, and the user is prompted when he or she first uses the speech recognition feature if he or she would like their voice data to be attached to their Google account. If a user decides to opt into this service, it allows Google to train the software to the user's voice. Google introduced the

Google Assistant Google Assistant is a virtual assistant software application developed by Google that is primarily available on home automation and mobile devices. Based on artificial intelligence, Google Assistant can engage in two-way conversations, unlike ...

with Android 7.0 "Nougat". It is much more advanced than the older version.

Amazon.com Amazon.com, Inc., doing business as Amazon, is an American multinational technology company engaged in e-commerce, cloud computing, online advertising, digital streaming, and artificial intelligence. Founded in 1994 by Jeff Bezos in Bellevu ...

has the

Echo In audio signal processing and acoustics, an echo is a reflection of sound that arrives at the listener with a delay after the direct sound. The delay is directly proportional to the distance of the reflecting surface from the source and the lis ...

that uses Amazon's custom version of Android to provide a voice interface.

Microsoft Windows

Windows Phone Windows Phone (WP) is a discontinued mobile operating system developed by Microsoft Mobile for smartphones as the replacement successor to Windows Mobile and Zune. Windows Phone featured a new user interface derived from the Metro design languag ...

Microsoft Microsoft Corporation is an American multinational corporation and technology company, technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the company became influential in the History of personal computers#The ear ...

's mobile device's operating system. On Windows Phone 7.5, the speech app is user independent and can be used to: call someone from your contact list, call any phone number, redial the last number, send a text message, call your voice mail, open an application, read appointments, query phone status, and search the web. In addition, speech can also be used during a phone call, and the following actions are possible during a phone call: press a number, turn the speaker phone on, or call someone, which puts the current call on hold. Windows 10 introduces Cortana, a voice control system that replaces the formerly used voice control on Windows phones.

iOS

Apple added Voice Control to its family of iOS devices as a new feature of

iPhone OS 3 iPhone OS 3 (''stylized as iPhone OS 3.0'') is the third major release of the iOS mobile operating system developed by Apple Inc., succeeding iPhone OS 2. It was announced on March 17, 2009, and was released on June 17, 2009. It was succeeded ...

. The

iPhone 4S The is a smartphone that was developed and marketed by Apple Inc. It is the List of iPhone models, fifth generation of the iPhone, succeeding the iPhone 4 and preceding the iPhone 5. It was announced on October 4, 2011, at Apple's Cupertino ...

, iPad 3, iPad Mini 1G,

iPad Air The iPad is a brand of tablet computers developed and marketed by Apple that run the company's mobile operating systems iOS and later iPadOS. The first-generation iPad was introduced on January 27, 2010. Since then, the iPad product lin ...

, iPad Pro 1G, iPod Touch 5G and later, all come with a more advanced voice assistant called

Siri Siri ( , backronym: Speech Interpretation and Recognition Interface) is a digital assistant purchased, developed, and popularized by Apple Inc., which is included in the iOS, iPadOS, watchOS, macOS, Apple TV, audioOS, and visionOS operating sys ...

. Voice Control can still be enabled through the Settings menu of newer devices. Siri is a user independent built-in speech recognition feature that allows a user to issue voice commands. With the assistance of Siri a user may issue commands like, send a text message, check the weather, set a reminder, find information, schedule meetings, send an email, find a contact, set an alarm, get directions, track your stocks, set a timer, and ask for examples of sample voice command queries. In addition, Siri works with

and wired headphones. Apple introduced Personal Voice as an accessibility feature in

iOS 17 iOS 17 is the seventeenth major release of Apple's iOS operating system for the iPhone. It is the direct successor to iOS 16. It was announced on June 5, 2023, at Apple's annual Worldwide Developers Conference alongside watchOS 10, iPadOS 1 ...

, launched on September 18, 2023. This feature allows users to create a personalized, machine learning-generated (AI) version of their voice for use in

text-to-speech Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or Computer hardware, hardware products. A text-to-speech (TTS) system conv ...

applications. Designed particularly for individuals with speech impairments, Personal Voice helps preserve the unique sound of a user's voice. It enhances

and other accessibility tools by providing a more personalized and inclusive

user experience User experience (UX) is how a user interacts with and experiences a product, system or service. It includes a person's perceptions of utility, ease of use, and efficiency. Improving user experience is important to most companies, designers, a ...

. Personal Voice reflects Apple's ongoing commitment to accessibility and

innovation Innovation is the practical implementation of ideas that result in the introduction of new goods or service (economics), services or improvement in offering goods or services. ISO TC 279 in the standard ISO 56000:2020 defines innovation as "a n ...

Amazon Alexa

In 2014 Amazon introduced the Alexa smart home device. Its main purpose was just a smart speaker, that allowed the consumer to control the device with their voice. Eventually, it turned into a novelty device that had the ability to control home appliance with voice. Now almost all the appliances are controllable with Alexa, including light bulbs and temperature. By allowing voice control, Alexa can connect to smart home technology allowing you to lock your house, control the temperature, and activate various devices. This form of A.I allows for someone to simply ask it a question, and in response the Alexa searches for, finds, and recites the answer back to you.

Speech recognition in cars

As car technology improves, more features will be added to cars and these features could potentially distract a driver. Voice commands for cars, according to CNET, should allow a driver to issue commands and not be distracted. CNET stated that Nuance was suggesting that in the future they would create a software that resembled Siri, but for cars. Most speech recognition software on the market in 2011 had only about 50 to 60 voice commands, but Ford Sync had 10,000. However, CNET suggested that even 10,000 voice commands was not sufficient given the complexity and the variety of tasks a user may want to do while driving. Voice command for cars is different from voice command for mobile phones and for computers because a driver may use the feature to look for nearby restaurants, look for gas, driving directions, road conditions, and the location of the nearest hotel. Currently, technology allows a driver to issue voice commands on both a portable GPS like a

Garmin Garmin Ltd. is an American multinational technology company based in Olathe, Kansas. The company designs, develops, manufactures, markets, and distributes GPS-enabled products and other navigation, communication, sensor-based, and information ...

and a car manufacturer navigation system. List of Voice Command Systems Provided By Motor Manufacturers: *

Ford Sync Ford Sync (stylized Ford SYNC) is a factory-installed, integrated in-car communications and entertainment system, in-vehicle communications and entertainment system that allows users to make hands-free telephone calls, control music and perform ...

* Lexus Voice Command * Chrysler UConnect *

Honda Accord The , also known as the in Japan and China for certain generations, is a series of automobiles manufactured by Honda since 1976, best known for its four-door sedan variant, which has been one of the best-selling cars in the United States sinc ...

* GM IntelliLink * BMW * Mercedes * Pioneer * Harman * Hyundai

Non-verbal input

While most voice user interfaces are designed to support interaction through spoken human language, there have also been recent explorations in designing interfaces take non-verbal human sounds as input. In these systems, the user controls the interface by emitting non-speech sounds such as humming, whistling, or blowing into a microphone. One such example of a non-verbal voice user interface is Blendie, an interactive art installation created by Kelly Dobson. The piece comprised a classic 1950s-era blender which was retrofitted to respond to microphone input. To control the blender, the user must mimic the whirring mechanical sounds that a blender typically makes: the blender will spin slowly in response to a user's low-pitched growl, and increase in speed as the user makes higher-pitched vocal sounds. Another example is VoiceDraw, a research system that enables digital drawing for individuals with limited motor abilities. VoiceDraw allows users to "paint" strokes on a digital canvas by modulating vowel sounds, which are mapped to brush directions. Modulating other paralinguistic features (e.g. the loudness of their voice) allows the user to control different features of the drawing, such as the thickness of the brush stroke. Other approaches include adopting non-verbal sounds to augment touch-based interfaces (e.g. on a mobile phone) to support new types of gestures that wouldn't be possible with finger input alone.

Design challenges

Voice interfaces pose a substantial number of challenges for usability. In contrast to graphical user interfaces (GUIs), best practices for voice interface design are still emergent.

Discoverability

With purely audio-based interaction, voice user interfaces tend to suffer from low

discoverability Discoverability is the degree to which something, especially a piece of content or information, can be found in a search of a file, database, or other information system. Discoverability is a concern in library and information science, many aspects ...

: it is difficult for users to understand the scope of a system's capabilities. In order for the system to convey what is possible without a visual display, it would need to enumerate the available options, which can become tedious or infeasible. Low discoverability often results in users reporting confusion over what they are "allowed" to say, or a mismatch in expectations about the breadth of a system's understanding.

Transcription

While

technology has improved considerably in recent years, voice user interfaces still suffer from parsing or transcription errors in which a user's speech is not interpreted correctly. These errors tend to be especially prevalent when the speech content uses technical vocabulary (e.g. medical terminology) or unconventional spellings such as musical artist or song names.

Understanding

Effective system design to maximize conversational understanding remains an open area of research. Voice user interfaces that interpret and manage conversational state are challenging to design due to the inherent difficulty of integrating complex natural language processing tasks like coreference resolution,

named-entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pr ...

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

, and dialog management. Most voice assistants today are capable of executing single commands very well but limited in their ability to manage dialogue beyond a narrow task or a couple turns in a conversation.

Privacy implications

Privacy concerns are raised by the fact that voice commands are available to the providers of voice-user interfaces in unencrypted form, and can thus be shared with third parties and be processed in an unauthorized or unexpected manner. Additionally to the linguistic content of recorded speech, a user's manner of expression and voice characteristics can implicitly contain information about his or her biometric identity, personality traits, body shape, physical and mental health condition, sex, gender, moods and emotions, socioeconomic status and geographical origin.

References

External links

Voice Interfaces: Assessing the Potential
by Jakob Nielsen
The Rise of Voice: A TimelineVoice First Glossary of TermsVoice First A Reading List
{{Virtual assistants User interface techniques Voice technology Speech recognition History of human–computer interaction

Overview

History

Voice command software products on computing devices

Microsoft Windows

Windows Vista

Windows 7

Mac OS X

Commercial products

Voice command mobile devices

Android OS

Microsoft Windows

iOS

Amazon Alexa

Speech recognition in cars

Non-verbal input

Design challenges

Discoverability

Transcription

Understanding

Privacy implications

See also

References

External links