Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.
Common applications of speech coding are mobile telephony and voice over IP (VoIP). The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.
The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 to 3500 Hz is transmitted, but the reconstructed signal retains adequate intelligibility.
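The band limitation described above can be illustrated with a minimal sketch. It assumes an idealized brick-wall filter built on a naive DFT, which is only an illustration and not how any particular codec filters its input; the function names (`dft`, `idft_real`, `bandlimit`), the 8000 Hz sample rate and the two test tones are all hypothetical choices made for the example.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def bandlimit(signal, sample_rate, low_hz=400.0, high_hz=3500.0):
    """Zero every DFT bin whose frequency lies outside [low_hz, high_hz]."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # For a real signal, bins k and n - k carry the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0.0
    return idft_real(spectrum)

# Assumed demo input: a 1000 Hz tone (inside the band) plus a 100 Hz tone (outside).
sample_rate = 8000
n = 400  # 50 ms of signal, so DFT bins are 20 Hz apart
in_band = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]
out_of_band = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(n)]
mixed = [a + b for a, b in zip(in_band, out_of_band)]
filtered = bandlimit(mixed, sample_rate)
```

In this sketch the 100 Hz component is discarded while the 1000 Hz component passes through unchanged, which is the sense in which only the voiceband portion of the signal is transmitted.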
Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and pleasantness of speech while using a constrained amount of transmitted data. In addition, most speech applications require low coding delay, as latency interferes with speech interaction.
Categories
Speech coders are of two classes:
# Waveform coders
#* Time-domain: PCM, ADPCM
#* Frequency-domain: sub-band coding, ATRAC
# Vocoders
#* Linear predictive coding (LPC)
#* Formant coding
#* Machine learning, i.e. neural vocoder
Sample companding viewed as a form of speech coding
The A-law and μ-law algorithms used in G.711 PCM digital telephony can be seen as an early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. Logarithmic companding is consistent with human hearing perception in that a low-amplitude noise is heard along a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
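The effect of logarithmic companding on quiet samples can be sketched as follows. This is a minimal illustration that assumes the continuous μ-law formula with μ = 255 followed by a plain uniform 8-bit quantizer; it is not the segmented, bit-exact encoding that G.711 specifies, and the helper names (`mu_law_compress`, `mu_law_expand`, `quantize`) are hypothetical.

```python
import math

MU = 255.0  # value commonly used for 8-bit mu-law; treated here as an assumption

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale, also in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize(value, bits=8):
    """Uniformly quantize a value in [-1, 1] to 2**bits levels and map it back."""
    top = 2 ** bits - 1
    code = round((value + 1.0) / 2.0 * top)
    return code / top * 2.0 - 1.0

# Compare the error on a quiet sample with and without companding.
quiet = 0.01
uniform_error = abs(quantize(quiet) - quiet)
companded_error = abs(mu_law_expand(quantize(mu_law_compress(quiet))) - quiet)
```

For the quiet sample used here, the companded path gives a noticeably smaller error than direct uniform quantization with the same 8 bits, which reflects the idea that the available levels are concentrated where low-amplitude speech needs them.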
A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.
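The 33% figure is consistent with the bit depths quoted for companding: 8 transmitted bits per sample in place of 12 bits of effective resolution. The second line is an added illustration that assumes the 8 kHz sampling rate of G.711, which the text here does not state.

```latex
% Bandwidth reduction from 12 effective bits down to 8 transmitted bits
\frac{12 - 8}{12} = \frac{1}{3} \approx 33\%
% Resulting bit rate, assuming 8000 samples per second (not stated in the text)
8000\ \text{samples/s} \times 8\ \text{bit/sample} = 64\ \text{kbit/s}
```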
In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.
Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were used to achieve effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI circuits, than was available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
The most widely used speech coding algorithms are based on linear predictive coding (LPC). In particular, the most common speech coding scheme is the LPC-based code-excited linear prediction (CELP) coding, which is used for example in the GSM standard. In CELP, the modeling is divided into two stages: a linear predictive stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model. In CELP, linear prediction coefficients (LPC) are computed and quantized, usually as line spectral pairs (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. In order to get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
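The linear predictive stage can be sketched as follows. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion and a synthetic second-order test signal; it covers only the prediction stage and the residual it leaves behind, not the codebook search, LSP quantization or channel coding of an actual CELP coder. All function names and the test signal parameters are hypothetical.

```python
import random

def autocorrelation(frame, max_lag):
    """Autocorrelation of the frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for the prediction-error filter a (a[0] == 1) from autocorrelation r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k  # remaining prediction-error energy
    return a, err

def residual(frame, a):
    """Filter the frame with the prediction-error filter to obtain the residual."""
    return [sum(a[j] * frame[t - j] for j in range(len(a)) if t - j >= 0)
            for t in range(len(frame))]

# Assumed test signal: x[t] = 1.3 x[t-1] - 0.6 x[t-2] + white noise.
random.seed(0)
frame = [0.0, 0.0]
for _ in range(4000):
    frame.append(1.3 * frame[-1] - 0.6 * frame[-2] + random.gauss(0.0, 1.0))
frame = frame[2:]

a, err = levinson_durbin(autocorrelation(frame, 2), 2)
e = residual(frame, a)
signal_energy = sum(v * v for v in frame)
residual_energy = sum(v * v for v in e)
```

The estimated filter should come out close to the coefficients used to generate the signal (with the opposite sign convention, since `a` is the prediction-error filter), and the residual carries much less energy than the input. That lower-energy residual is what the second, codebook-based stage of CELP is left to represent.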
The modified discrete cosine transform (MDCT) is used in the LD-MDCT technique used by the AAC-LD format introduced in 1999.
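The lapped structure of the MDCT can be sketched as follows. This is a minimal illustration, assuming a sine window and a direct (slow) evaluation of the transform with 50% overlapping blocks; it is not the LD-MDCT filterbank of AAC-LD, and the function names, block size and test signal are hypothetical.

```python
import math
import random

def mdct_basis(n_half):
    """Cosine basis: n_half rows (coefficients) by 2 * n_half columns (samples)."""
    return [[math.cos(math.pi / n_half * (n + 0.5 + n_half / 2.0) * (k + 0.5))
             for n in range(2 * n_half)] for k in range(n_half)]

def sine_window(n_half):
    """Window with w[n]**2 + w[n + n_half]**2 == 1, needed for alias cancellation."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def mdct_analysis(signal, n_half):
    """Transform overlapping blocks of 2 * n_half samples into n_half coefficients each."""
    basis, window = mdct_basis(n_half), sine_window(n_half)
    blocks = []
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        windowed = [window[n] * signal[start + n] for n in range(2 * n_half)]
        blocks.append([sum(row[n] * windowed[n] for n in range(2 * n_half))
                       for row in basis])
    return blocks

def mdct_synthesis(blocks, n_half):
    """Inverse transform each block, window it again, and overlap-add the results."""
    basis, window = mdct_basis(n_half), sine_window(n_half)
    out = [0.0] * ((len(blocks) + 1) * n_half)
    for i, coeffs in enumerate(blocks):
        for n in range(2 * n_half):
            value = sum(coeffs[k] * basis[k][n] for k in range(n_half))
            out[i * n_half + n] += window[n] * (2.0 / n_half) * value
    return out

# Assumed demo input: random samples, ten half-blocks long.
random.seed(0)
n_half = 8
signal = [random.gauss(0.0, 1.0) for _ in range(10 * n_half)]
blocks = mdct_analysis(signal, n_half)
reconstructed = mdct_synthesis(blocks, n_half)
```

Each block of 2 × `n_half` input samples yields only `n_half` coefficients, yet overlap-adding neighbouring blocks cancels the aliasing, so every sample covered by two blocks is recovered. The first and last `n_half` samples are covered by a single block and are not reconstructed in this sketch.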
MDCT has since been widely adopted in voice-over-IP (VoIP) applications, such as the G.729.1 wideband audio codec introduced in 2006, Apple's FaceTime (using AAC-LD) introduced in 2010, and the CELT codec introduced in 2011.
Presentation of the CELT codec by Timothy B. Terriberry (65 minutes of video; see also presentation slides in PDF)
Opus is a free software audio coder. It combines the speech-oriented LPC-based SILK algorithm and the lower-latency MDCT-based CELT algorithm, switching between or combining them as needed for maximal efficiency. It is widely used for VoIP calls in WhatsApp. The PlayStation 4 video game console also uses Opus for its PlayStation Network system party chat.
A number of codecs with even lower bit rates have been demonstrated. Codec2, which targets very low bit rates, sees use in amateur radio. NATO currently uses MELPe, which offers intelligible speech at very low bit rates. Neural vocoder approaches have also emerged: Lyra by Google gives an "almost eerie" quality at a low bit rate. Microsoft's Satin also uses machine learning, but uses a higher tunable bitrate and is wideband.
Sub-fields
; Wideband audio coding
* Linear predictive coding (LPC)
** AMR-WB for WCDMA networks
** VMR-WB for CDMA2000 networks
** Speex, IP-MR, SILK (part of Opus), and USAC/xHE-AAC for VoIP and videoconferencing
* Modified discrete cosine transform (MDCT)
** AAC-LD, G.722.1, G.729.1, CELT and Opus for VoIP and videoconferencing
* Adaptive differential pulse-code modulation (ADPCM)
** G.722 for VoIP
* Neural speech coding
** Lyra (Google): V1 uses neural network reconstruction of log-mel spectrogram; V2 is an end-to-end autoencoder.
** Satin (Microsoft)
** LPCNet (Mozilla, Xiph): neural network reconstruction of LPC features
; Narrowband audio coding
* LPC
** FNBDT for military applications
** SMV for CDMA networks
** Full Rate, Half Rate, EFR and AMR for GSM networks
** G.723.1, G.728, G.729, G.729.1 and iLBC for VoIP or videoconferencing
* ADPCM
** G.726 for VoIP
* Multi-Band Excitation (MBE)
** AMBE+ for digital mobile radio and satellite phone
** Codec 2
See also
* Digital signal processing
* Speech interface guideline
* Speech processing
* Speech synthesis
* Vector quantization
External links
* ITU-T Test Signals for Telecommunication Systems Test Samples
* ITU-T Perceptual evaluation of speech quality (PESQ) tool Sources