Voice computing is the discipline that develops hardware or software to process voice inputs. It spans many other fields including human-computer interaction, conversational computing, linguistics, natural language processing, automatic speech recognition, speech synthesis, audio engineering, digital signal processing, cloud computing, data science, ethics, law, and information security. Voice computing has become increasingly significant in modern times, especially with the advent of smart speakers and virtual assistants like the Amazon Echo and Google Assistant, a shift towards serverless computing, and improved accuracy of speech recognition and text-to-speech models.

History

Voice computing has a rich history. First, scientists like Wolfgang von Kempelen started to build speech machines to produce the earliest synthetic speech sounds. This led to further work by Thomas Edison to record audio with dictation machines and play it back in corporate settings. In the 1950s-1960s there were primitive attempts to build automated speech recognition systems by Bell Labs, IBM, and others. However, it was not until the 1980s, when hidden Markov models were used to recognize up to 1,000 words, that speech recognition systems became relevant.

In 2011, Siri emerged on Apple iPhones as the first voice assistant accessible to consumers. This innovation led to a dramatic shift to building voice-first computing architectures. The PS4 was released by Sony in North America in 2013 (70+ million devices), Amazon released the Amazon Echo in 2014 (30+ million devices), Microsoft released Cortana (2015 - 400 million Windows 10 users), Google released Google Assistant (2016 - 2 billion active monthly users on Android phones), and Apple released HomePod (2018 - 500,000 devices sold and 1 billion devices active with iOS/Siri). These shifts, along with advancements in cloud infrastructure (e.g. Amazon Web Services) and codecs, have solidified the voice computing field and made it widely relevant to the public at large.

Hardware

A voice computer is hardware and software assembled to process voice inputs. Note that voice computers do not necessarily need a screen, as with the traditional Amazon Echo. In other embodiments, traditional laptop computers or mobile phones could be used as voice computers. Moreover, the number of interfaces for voice computers has grown with the advent of IoT-enabled devices, such as those within cars or televisions. As of September 2018, there were over 20,000 types of devices compatible with Amazon Alexa.

Software

Voice computing software can read/write, record, clean, encrypt/decrypt, play back, transcode, transcribe, compress, publish, featurize, model, and visualize voice files.
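Several of these operations need nothing beyond a language's standard library. The following is a minimal sketch, in Python, of three of them: writing a voice file, reading it back, and featurizing it. It uses only the standard `wave`, `struct`, and `math` modules. The function names (`write_tone`, `read_samples`, `featurize`), the 16 kHz sample rate, and the choice of RMS energy and zero-crossing rate as features are illustrative assumptions rather than part of any particular voice computing package, and a synthetic sine tone stands in for recorded speech.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000  # samples per second; an assumed rate for this sketch


def write_tone(target, freq_hz=440.0, seconds=0.5, amplitude=0.5):
    """Write a mono 16-bit PCM sine tone to a path or binary file-like object."""
    n_samples = int(SAMPLE_RATE * seconds)
    frames = bytearray()
    for i in range(n_samples):
        value = amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        frames += struct.pack("<h", int(value * 32767))  # little-endian int16
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))


def read_samples(source):
    """Read a mono 16-bit PCM WAV; return (sample_rate, samples scaled to [-1, 1])."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, [s / 32768.0 for s in ints]


def featurize(samples):
    """Compute two simple features: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return {"rms": rms, "zcr": crossings / (len(samples) - 1)}


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
write_tone(buffer)
buffer.seek(0)
rate, samples = read_samples(buffer)
print(rate, len(samples), featurize(samples))
```

Passing a file path instead of the `io.BytesIO` buffer writes a regular `.wav` file. Real speech pipelines typically replace the hand-rolled features with richer ones, but the read, process, write shape stays the same.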

Applications

Voice computing applications span many industries including voice assistants, healthcare, e-commerce, finance, supply chain, agriculture, text-to-speech, security, marketing, customer support, recruiting, cloud computing, microphones, speakers, and podcasting. Voice technology is projected to grow at a compound annual growth rate (CAGR) of 19-25% through 2025, making it an attractive industry for startups and investors alike.
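To make the growth figure concrete: a compound annual growth rate r sustained for n years multiplies a starting value by (1 + r)^n. The sketch below applies that formula to the 19% and 25% bounds quoted above. The five-year horizon and the normalized starting value of 1.0 are illustrative assumptions, not figures from the projection, and the function names are hypothetical.

```python
def compound_growth(start_value: float, cagr: float, years: int) -> float:
    """Value after `years` of growth at a constant annual rate `cagr`."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Inverse of compound_growth: the constant annual rate linking two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


HORIZON_YEARS = 5  # assumed horizon, for illustration only

for rate in (0.19, 0.25):
    multiple = compound_growth(1.0, rate, HORIZON_YEARS)
    print(f"CAGR {rate:.0%} over {HORIZON_YEARS} years -> {multiple:.2f}x")
```

Under those assumptions, the market would end the period at roughly 2.4 to 3.1 times its starting size.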

Legal considerations

In the United States, the states have varying telephone call recording laws. In some states, it is legal to record a conversation with the consent of only one party; in others, the consent of all parties is required. Moreover, the Children's Online Privacy Protection Act (COPPA) is a significant law to protect minors using the Internet. With an increasing number of minors interacting with voice computing devices (e.g. Amazon Alexa), on October 23, 2017 the Federal Trade Commission relaxed the COPPA rule so that children can issue voice searches and commands. Lastly, the General Data Protection Regulation (GDPR) is a European law that governs the right to be forgotten, among many other provisions, for EU citizens. GDPR also makes clear that companies need to outline clear measures to obtain consent if audio recordings are made and define the purpose and scope as to how these recordings will be used, e.g., for training purposes. The bar for valid consent has been raised under the GDPR. Consent must be freely given, specific, informed, and unambiguous; tacit consent is no longer sufficient.

Research conferences

There are many research conferences that relate to voice computing. Some of these include:

* International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
* Interspeech
* AVEC
* IEEE Int'l Conf. on Automatic Face and Gesture Recognition
* ACII2019, the 8th Int'l Conf. on Affective Computing and Intelligent Interaction

Developer community

Google Assistant had roughly 2,000 actions as of January 2018, and there were over 50,000 Alexa skills worldwide as of September 2018. In June 2017, Google released AudioSet, a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. It contains 1,010,480 clips of human speech, or 2,793.5 hours in total. It was released as part of the IEEE ICASSP 2017 Conference. In November 2017, the Mozilla Foundation released the Common Voice Project, a collection of speech files to help contribute to the larger open source machine learning community. The voicebank is currently 12 GB in size, with more than 500 hours of English-language voice data that have been collected from 112 countries since the project's inception in June 2017. This dataset has already resulted in creative projects like the DeepSpeech model, an open source transcription model (DeepSpeech, https://github.com/mozilla/DeepSpeech).
