Speech processing is the study of

speech Speech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, suc ...

signals A signal is both the process and the result of Signal transmission, transmission of data over some transmission media, media accomplished by embedding some variation. Signals are important in multiple subject fields including signal processin ...

and the processing methods of signals. The signals are usually processed in a

digital Digital usually refers to something using discrete digits, often binary digits. Businesses *Digital bank, a form of financial institution *Digital Equipment Corporation (DEC) or Digital, a computer company *Digital Research (DR or DRI), a software ...

representation, so speech processing can be regarded as a special case of

digital signal processing Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are a ...

, applied to speech signals. Aspects of speech processing includes the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

speech synthesis Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal langua ...

, speaker diarization, speech enhancement,

speaker recognition Speaker recognition is the identification of a person from characteristics of voices. It is used to answer the question "Who is speaking?" The term voice recognition can refer to ''speaker recognition'' or speech recognition. Speaker verification ...

, etc.

History

Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple

phonetic Phonetics is a branch of linguistics that studies how humans produce and perceive sounds or, in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians ...

elements such as vowels. In 1952, three researchers at Bell Labs, Stephen. Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker. Pioneering works in field of speech recognition using analysis of its spectrum were reported in the 1940s.

Linear predictive coding Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model ...

(LPC), a speech processing algorithm, was first proposed by

Fumitada Itakura is a Japanese scientist. He did pioneering work in statistical signal processing, and its application to speech analysis, synthesis and coding, including the development of the linear predictive coding (LPC) and line spectral pairs (LSP) metho ...

Nagoya University , abbreviated to or NU, is a Japanese national research university located in Chikusa-ku, Nagoya. It was established in 1939 as the last of the nine Imperial Universities in the then Empire of Japan, and is now a Designated National Universit ...

and Shuzo Saito of

Nippon Telegraph and Telephone (NTT) is a Japanese telecommunications holding company headquartered in Tokyo, Japan. Ranked 55th in ''Fortune'' Global 500, NTT is the fourth largest telecommunications company in the world in terms of revenue, as well as the third largest pu ...

(NTT) in 1966. Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at

Bell Labs Nokia Bell Labs, commonly referred to as ''Bell Labs'', is an American industrial research and development company owned by Finnish technology company Nokia. With headquarters located in Murray Hill, New Jersey, Murray Hill, New Jersey, the compa ...

during the 1970s. LPC was the basis for voice-over-IP (VoIP) technology, as well as speech synthesizer chips, such as the

Texas Instruments LPC Speech Chips The Texas Instruments LPC Speech Chips are a series of speech synthesizer digital signal processor integrated circuits created by Texas Instruments beginning in 1978. They continued to be developed and marketed for many years, though the speech de ...

used in the Speak & Spell toys from 1978. One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by

AT&T AT&T Inc., an abbreviation for its predecessor's former name, the American Telephone and Telegraph Company, is an American multinational telecommunications holding company headquartered at Whitacre Tower in Downtown Dallas, Texas. It is the w ...

in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary. By the early 2000s, the dominant speech processing strategy started to shift away from Hidden Markov Models towards more modern

neural networks A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...

and

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

. In 2012, Geoffrey Hinton and his team at the

University of Toronto The University of Toronto (UToronto or U of T) is a public university, public research university whose main campus is located on the grounds that surround Queen's Park (Toronto), Queen's Park in Toronto, Ontario, Canada. It was founded by ...

demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large vocabulary continuous speech recognition tasks. This breakthrough led to widespread adoption of deep learning techniques in the industry. By the mid-2010s, companies like

Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...

Microsoft Microsoft Corporation is an American multinational corporation and technology company, technology conglomerate headquartered in Redmond, Washington. Founded in 1975, the company became influential in the History of personal computers#The ear ...

Amazon Amazon most often refers to: * Amazon River, in South America * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon (company), an American multinational technology company * Amazons, a tribe of female warriors in Greek myth ...

, and

Apple An apple is a round, edible fruit produced by an apple tree (''Malus'' spp.). Fruit trees of the orchard or domestic apple (''Malus domestica''), the most widely grown in the genus, are agriculture, cultivated worldwide. The tree originated ...

had integrated advanced speech recognition systems into their virtual assistants such as

Google Assistant Google Assistant is a virtual assistant software application developed by Google that is primarily available on home automation and mobile devices. Based on artificial intelligence, Google Assistant can engage in two-way conversations, unlike ...

, Cortana, Alexa, and

Siri Siri ( , backronym: Speech Interpretation and Recognition Interface) is a digital assistant purchased, developed, and popularized by Apple Inc., which is included in the iOS, iPadOS, watchOS, macOS, Apple TV, audioOS, and visionOS operating sys ...

. These systems utilized deep learning models to provide more natural and accurate voice interactions. The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition. These models enabled more context-aware and semantically rich understanding of speech. In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps like feature extraction and acoustic modeling. This approach has streamlined the development process and improved performance.

Techniques

Dynamic time warping

Dynamic time warping (DTW) is an

algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...

for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules. The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values.

Hidden Markov models

A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the

Markov property In probability theory and statistics, the term Markov property refers to the memoryless property of a stochastic process, which means that its future evolution is independent of its history. It is named after the Russian mathematician Andrey Ma ...

, the

conditional probability distribution In probability theory and statistics, the conditional probability distribution is a probability distribution that describes the probability of an outcome given the occurrence of a particular event. Given two jointly distributed random variables X ...

of the hidden variable ''x''(''t'') at time ''t'', given the values of the hidden variable ''x'' at all times, depends ''only'' on the value of the hidden variable ''x''(''t'' − 1). Similarly, the value of the observed variable ''y''(''t'') only depends on the value of the hidden variable ''x''(''t'') (both at time ''t'').

Artificial neural networks

An artificial neural network (ANN) is based on a collection of connected units or nodes called

artificial neuron An artificial neuron is a mathematical function conceived as a model of a biological neuron in a neural network. The artificial neuron is the elementary unit of an ''artificial neural network''. The design of the artificial neuron was inspired ...

s, which loosely model the

neuron A neuron (American English), neurone (British English), or nerve cell, is an membrane potential#Cell excitability, excitable cell (biology), cell that fires electric signals called action potentials across a neural network (biology), neural net ...

s in a biological

brain The brain is an organ (biology), organ that serves as the center of the nervous system in all vertebrate and most invertebrate animals. It consists of nervous tissue and is typically located in the head (cephalization), usually near organs for ...

. Each connection, like the

synapse In the nervous system, a synapse is a structure that allows a neuron (or nerve cell) to pass an electrical or chemical signal to another neuron or a target effector cell. Synapses can be classified as either chemical or electrical, depending o ...

s in a biological

, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a

real number In mathematics, a real number is a number that can be used to measure a continuous one- dimensional quantity such as a duration or temperature. Here, ''continuous'' means that pairs of values can have arbitrarily small differences. Every re ...

, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.

Phase-aware processing

Phase is usually supposed to be random uniform variable and thus useless. This is due wrapping of phase: result of

arctangent In mathematics, the inverse trigonometric functions (occasionally also called ''antitrigonometric'', ''cyclometric'', or ''arcus'' functions) are the inverse functions of the trigonometric functions, under suitably restricted domains. Specific ...

function is not continuous due to periodical jumps on

2 \pi

. After phase unwrapping (see, Chapter 2.3; Instantaneous phase and frequency), it can be expressed as:

\phi(h,l) = \phi_(h,l) + \Psi(h,l)

, where

\phi_(h,l) = \omega_0(l') _\Delta t

is linear phase (

_\Delta t

is temporal shift at each frame of analysis),

\Psi(h,l)

is phase contribution of the vocal tract and phase source. Obtained phase estimations can be used for noise reduction: temporal smoothing of instantaneous phase and its derivatives by time (

instantaneous frequency Instantaneous phase and frequency are important concepts in signal processing that occur in the context of the representation and analysis of time-varying functions. The instantaneous phase (also known as local phase or simply phase) of a ''compl ...

) and frequency (

group delay In signal processing, group delay and phase delay are functions that describe in different ways the delay times experienced by a signal’s various sinusoidal frequency components as they pass through a linear time-invariant (LTI) system (such as ...

), smoothing of phase across frequency. Joined amplitude and phase estimators can recover speech more accurately basing on assumption of von Mises distribution of phase.

Applications

Interactive voice response Interactive voice response (IVR) is a technology that allows telephone users to interact with a computer-operated telephone system through the use of voice and DTMF tones input with a keypad. In telephony, IVR allows customers to interact with a ...

Virtual Assistants Virtual may refer to: * Virtual image, an apparent image of an object (as opposed to a real object), in the study of optics * Virtual (horse), a thoroughbred racehorse * Virtual channel, a channel designation which differs from that of the actual ...

* Voice Identification *

Emotion Recognition Emotion recognition is the process of identifying human emotion. People vary widely in their accuracy at recognizing the emotions of others. Use of technology to help people with emotion recognition is a relatively nascent research area. Gener ...

* Call Center Automation *

Robotics Robotics is the interdisciplinary study and practice of the design, construction, operation, and use of robots. Within mechanical engineering, robotics is the design and construction of the physical structures of robots, while in computer s ...

References

{{Authority control Speech Signal processing