Speech coding is an application of
data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...
of
digital audio
Digital audio is a representation of sound recorded in, or converted into, digital form. In digital audio, the sound wave of the audio signal is typically encoded as numerical samples in a continuous sequence. For example, in CD audio, samp ...
signals containing
speech
Speech is a human vocal communication using language. Each language uses phonetic combinations of vowel and consonant sounds that form the sound of its words (that is, all English words sound different from all French words, even if they are th ...
. Speech coding uses speech-specific
parameter estimation using
audio signal processing
Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves— longitudinal waves which travel through air, consist ...
techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.
Some applications of speech coding are
mobile telephony
Mobile telephony is the provision of telephone services to phones which may move around freely rather than stay fixed in one location. Telephony is supposed to specifically point to a voice-only service or connection, though sometimes the l ...
and
voice over IP
Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for the delivery of voice communications and multimedia sessions over Internet Protocol (IP) networks, such as the Internet. The terms Internet t ...
(VoIP). The most widely used speech coding technique in mobile telephony is
linear predictive coding
Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive mod ...
(LPC), while the most widely used in VoIP applications are the LPC and
modified discrete cosine transform
The modified discrete cosine transform (MDCT) is a transform based on the type-IV discrete cosine transform (DCT-IV), with the additional property of being lapped: it is designed to be performed on consecutive blocks of a larger dataset, where ...
(MDCT) techniques.
The techniques employed in speech coding are similar to those used in
audio data compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressi ...
and
audio coding
An audio coding format (or sometimes audio compression format) is a content representation format for storage or transmission of digital audio (such as in digital television, digital radio and in audio and video files). Examples of audio coding f ...
where knowledge in
psychoacoustics
Psychoacoustics is the branch of psychophysics involving the scientific study of sound perception and audiology—how humans perceive various sounds. More specifically, it is the branch of science studying the psychological responses associated wi ...
is used to transmit only data that is relevant to the human auditory system. For example, in
voiceband
A voice frequency (VF) or voice band is the range of audio frequencies used for the transmission of speech.
Frequency band
In telephony, the usable voice frequency band ranges from approximately 300 to 3400 Hz. It is for this reason t ...
speech coding, only information in the frequency band 400 to 3500 Hz is transmitted but the reconstructed signal is still adequate for
intelligibility.
Speech coding differs from other forms of audio coding in that speech is a simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information that is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and ''pleasantness'' of speech, with a constrained amount of transmitted data. In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.
Categories
Speech coders are of two types:
# Waveform coders
#* Time-domain:
PCM
Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the am ...
,
ADPCM
Adaptive differential pulse-code modulation (ADPCM) is a variant of differential pulse-code modulation (DPCM) that varies the size of the quantization step, to allow further reduction of the required data bandwidth for a given signal-to-noise ratio ...
#* Frequency-domain:
sub-band coding
In signal processing, sub-band coding (SBC) is any form of transform coding that breaks a signal into a number of different frequency bands, typically by using a fast Fourier transform, and encodes each one independently. This decomposition is ...
,
ATRAC
Adaptive Transform Acoustic Coding (ATRAC) is a family of proprietary audio compression algorithms developed by Sony. MiniDisc was the first commercial product to incorporate ATRAC in 1992. ATRAC allowed a relatively small disc like MiniDisc to h ...
#
Vocoder
A vocoder (, a portmanteau of ''voice'' and ''encoder'') is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.
The vocoder ...
s
#*
Linear predictive coding
Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive mod ...
(LPC)
#*
Formant coding
Sample companding viewed as a form of speech coding
The
A-law
An A-law algorithm is a standard companding algorithm, used in European 8-bit PCM digital communications systems to optimize, i.e. modify, the dynamic range of an analog signal for digitizing. It is one of two versions of the G.711 standar ...
and
μ-law algorithm
The μ-law algorithm (sometimes written Mu (letter), mu-law, often typographic approximation, approximated as u-law) is a companding algorithm, primarily used in 8-bit PCM Digital data, digital telecommunication systems in North America and Jap ...
s (
G.711
G.711 is a narrowband audio codec originally designed for use in telephony that provides toll-quality audio at 64 kbit/s. G.711 passes audio signals in the range of 300–3400 Hz and samples them at the rate of 8,000 samples per second ...
) used in traditional
PCM digital telephony
Telephony ( ) is the field of technology involving the development, application, and deployment of telecommunication services for the purpose of electronic transmission of voice, fax, or data, between distant parties. The history of telephony is i ...
can be seen as an earlier precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. The logarithmic companding laws are consistent with human hearing perception in that a low-amplitude noise is heard along a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a
periodic waveform having a single
fundamental frequency
The fundamental frequency, often referred to simply as the ''fundamental'', is defined as the lowest frequency of a periodic waveform. In music, the fundamental is the musical pitch of a note that is perceived as the lowest partial present. In ...
with occasional added noise bursts, make these very simple instantaneous compression algorithms acceptable for speech.
A wide variety of other algorithms were tried at the time, mostly
delta modulation
A delta modulation (DM or Δ-modulation) is an analog-to-digital and digital-to-analog signal conversion technique used for transmission of voice information where quality is not of primary importance. DM is the simplest form of differential pulse ...
variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.
In 2008,
G.711.1
G.711 is a narrowband audio codec originally designed for use in telephony that provides toll-quality audio at 64 kbit/s. G.711 passes audio signals in the range of 300–3400 Hz and samples them at the rate of 8,000 samples per second ...
codec, which has a scalable structure, was standardized by ITU-T. The input sampling rate is 16 kHz.
Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for
secure military radios, where very low data rates were required to allow effective operation in a hostile radio environment. At the same time, far more
processing power
In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency and speed of executing computer program instruction ...
was available, in the form of
VLSI circuits, than was available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital
mobile phone network
A cellular network or mobile network is a communication network where the link to and from end nodes is wireless. The network is distributed over land areas called "cells", each served by at least one fixed-location transceiver (typically thre ...
s with substantially higher channel capacities than the analog systems that preceded them.
The most widely used speech coding algorithms are based on
linear predictive coding
Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive mod ...
(LPC). In particular, the most common speech coding scheme is the LPC-based
code-excited linear prediction
Code-excited linear prediction (CELP) is a linear predictive speech coding algorithm originally proposed by Manfred R. Schroeder and Bishnu S. Atal in 1985. At the time, it provided significantly better quality than existing low bit-rate algori ...
(CELP) coding, which is used for example in the
GSM
The Global System for Mobile Communications (GSM) is a standard developed by the European Telecommunications Standards Institute (ETSI) to describe the protocols for second-generation ( 2G) digital cellular networks used by mobile devices such as ...
standard. In CELP, the modeling is divided in two stages, a
linear predictive stage that models the spectral envelope and a code-book-based model of the residual of the linear predictive model. In CELP, linear prediction coefficients (LPC) are computed and quantized, usually as
line spectral pairs
Line spectral pairs (LSP) or line spectral frequencies (LSF) are used to represent linear prediction coefficients (LPC) for transmission over a channel. LSPs have several properties (e.g. smaller sensitivity to quantization noise) that make them s ...
(LSPs). In addition to the actual speech coding of the signal, it is often necessary to use
channel coding
In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea is ...
for transmission, to avoid losses due to transmission errors. In order to get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
The
modified discrete cosine transform
The modified discrete cosine transform (MDCT) is a transform based on the type-IV discrete cosine transform (DCT-IV), with the additional property of being lapped: it is designed to be performed on consecutive blocks of a larger dataset, where ...
(MDCT), a type of
discrete cosine transform (DCT) algorithm, was adapted into a speech coding algorithm called LD-MDCT, used for the
AAC-LD
The MPEG-4 Low Delay Audio Coder (a.k.a. AAC Low Delay, or AAC-LD) is audio compression standard designed to combine the advantages of perceptual audio coding with the low delay necessary for two-way communication. It is closely derived from the ...
format introduced in 1999.
MDCT has since been widely adopted in
voice-over-IP
Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for the delivery of voice communications and multimedia sessions over Internet Protocol (IP) networks, such as the Internet. The terms Internet t ...
(VoIP) applications, such as the
G.729.1
G.729.1 is an 8-32 kbit/s embedded speech and audio codec providing bitstream interoperability with G.729, G.729 Annex A and G.729 Annex B. Its official name is ''G.729-based embedded variable bit rate codec: An 8-32 kbit/s scalable wideband cod ...
wideband audio
Wideband audio, also known as wideband voice or HD voice, is high definition voice quality for telephony audio, contrasted with standard digital telephony "toll quality". It extends the frequency range of audio signals transmitted over telephone ...
codec introduced in 2006,
Apple
An apple is an edible fruit produced by an apple tree (''Malus domestica''). Apple fruit tree, trees are agriculture, cultivated worldwide and are the most widely grown species in the genus ''Malus''. The tree originated in Central Asia, wh ...
's
FaceTime
FaceTime is a Proprietary software, proprietary videotelephony product developed by Apple Inc. FaceTime is available on supported iOS mobile devices running iOS 4 and later and Mac computers that run and later. FaceTime supports any iOS devic ...
(using AAC-LD) introduced in 2010,
and the
CELT
The Celts (, see pronunciation for different usages) or Celtic peoples () are. "CELTS location: Greater Europe time period: Second millennium B.C.E. to present ancestry: Celtic a collection of Indo-European peoples. "The Celts, an ancient ...
codec introduced in 2011.
[Presentation of the CELT codec](_blank)
by Timothy B. Terriberry (65 minutes of video, see als
presentation slides
in PDF)
Opus
''Opus'' (pl. ''opera'') is a Latin word meaning "work". Italian equivalents are ''opera'' (singular) and ''opere'' (pl.).
Opus or OPUS may refer to:
Arts and entertainment Music
* Opus number, (abbr. Op.) specifying order of (usually) publicatio ...
is a
free software
Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, no ...
speech coder. It combines both the MDCT and LPC audio compression algorithms. It is widely used for VoIP calls in
WhatsApp
WhatsApp (also called WhatsApp Messenger) is an internationally available freeware, cross-platform, centralized instant messaging (IM) and voice-over-IP (VoIP) service owned by American company Meta Platforms (formerly Facebook). It allows us ...
.
The
PlayStation 4
The PlayStation 4 (PS4) is a home video game console developed by Sony Interactive Entertainment. Announced as the successor to the PlayStation 3 in February 2013, it was launched on November 15, 2013, in North America, November 29, 2013 in ...
video game console also uses Opus for its
PlayStation Network
PlayStation Network (PSN) is a digital media entertainment service provided by Sony Interactive Entertainment. Launched in November 2006, PSN was originally conceived for the PlayStation video game consoles, but soon extended to encompass smartp ...
system party chat.
Codec2 Codec 2 is a low-bitrate speech audio codec (speech coding) that is patent free and open source. Codec 2 compresses speech using sinusoidal coding, a method specialized for human speech. Bit rates of 3200 to 450 bit/s have been successfully cre ...
is another free software speech coder, which operates at
bit rate
In telecommunications and computing, bit rate (bitrate or as a variable ''R'') is the number of bits that are conveyed or processed per unit of time.
The bit rate is expressed in the unit bit per second (symbol: bit/s), often in conjunction w ...
s as low as 450 bit/s.
Sub-fields
;
Wideband audio
Wideband audio, also known as wideband voice or HD voice, is high definition voice quality for telephony audio, contrasted with standard digital telephony "toll quality". It extends the frequency range of audio signals transmitted over telephone ...
coding
*
Linear predictive coding
Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive mod ...
(LPC)
**
AMR-WB
Adaptive Multi-Rate Wideband (AMR-WB) is a patented wideband speech audio coding standard developed based on Adaptive Multi-Rate encoding, using a similar methodology to algebraic code-excited linear prediction (ACELP). AMR-WB provides improved sp ...
for
WCDMA
The Universal Mobile Telecommunications System (UMTS) is a third generation mobile cellular system for networks based on the GSM standard. Developed and maintained by the 3GPP (3rd Generation Partnership Project), UMTS is a component of the Inte ...
networks
**
VMR-WB
Variable-Rate Multimode Wideband (VMR-WB) is a source-controlled variable-rate multimode codec designed for robust encoding/decoding of wideband/narrowband speech. The operation of VMR-WB is controlled by speech signal characteristics (i.e., source ...
for
CDMA2000
CDMA2000 (also known as C2K or IMT Multi‑Carrier (IMT‑MC)) is a family of 3G mobile technology standards for sending voice, data, and signaling data between mobile phones and cell sites. It is developed by 3GPP2 as a backwards-compatible ...
networks
**
Speex
Speex is an audio compression codec specifically tuned for the reproduction of human speech and also a free software speech codec that may be used on VoIP applications and podcasts. It is based on the CELP speech coding algorithm.Xiph.OrIntrodu ...
, IP-MR,
SILK
Silk is a natural protein fiber, some forms of which can be woven into textiles. The protein fiber of silk is composed mainly of fibroin and is produced by certain insect larvae to form cocoons. The best-known silk is obtained from the coc ...
and
Opus
''Opus'' (pl. ''opera'') is a Latin word meaning "work". Italian equivalents are ''opera'' (singular) and ''opere'' (pl.).
Opus or OPUS may refer to:
Arts and entertainment Music
* Opus number, (abbr. Op.) specifying order of (usually) publicatio ...
for
voice-over-IP
Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for the delivery of voice communications and multimedia sessions over Internet Protocol (IP) networks, such as the Internet. The terms Internet t ...
(VoIP) and
videoconferencing
Videotelephony, also known as videoconferencing and video teleconferencing, is the two-way or multipoint reception and transmission of audio signal, audio and video signals by people in different locations for Real-time, real time communication. ...
*
Modified discrete cosine transform
The modified discrete cosine transform (MDCT) is a transform based on the type-IV discrete cosine transform (DCT-IV), with the additional property of being lapped transform, lapped: it is designed to be performed on consecutive blocks of a larger ...
(MDCT)
**
AAC-LD
The MPEG-4 Low Delay Audio Coder (a.k.a. AAC Low Delay, or AAC-LD) is audio compression standard designed to combine the advantages of perceptual audio coding with the low delay necessary for two-way communication. It is closely derived from the ...
,
G.722.1
G.722.1 is a licensed royalty-free ITU-T standard audio codec providing high quality, moderate bit rate (24 and 32 kbit/s) wideband (50 Hz – 7 kHz audio bandwidth, 16 ksps (kilo- samples per second) audio coding. It is a partial imple ...
,
G.729.1
G.729.1 is an 8-32 kbit/s embedded speech and audio codec providing bitstream interoperability with G.729, G.729 Annex A and G.729 Annex B. Its official name is ''G.729-based embedded variable bit rate codec: An 8-32 kbit/s scalable wideband cod ...
,
CELT
The Celts (, see pronunciation for different usages) or Celtic peoples () are. "CELTS location: Greater Europe time period: Second millennium B.C.E. to present ancestry: Celtic a collection of Indo-European peoples. "The Celts, an ancient ...
and Opus for VoIP and videoconferencing
*
Adaptive differential pulse-code modulation
Adaptive differential pulse-code modulation (ADPCM) is a variant of differential pulse-code modulation (DPCM) that varies the size of the quantization step, to allow further reduction of the required data bandwidth for a given signal-to-noise ratio ...
(ADPCM)
**
G.722
G.722 is an ITU-T standard 7 kHz wideband audio codec operating at 48, 56 and 64 kbit/s. It was approved by ITU-T in November 1988. Technology of the codec is based on sub-band ADPCM (SB-ADPCM). The corresponding narrow-band codec based on ...
for VoIP
;
Narrowband
Narrowband signals are signals that occupy a narrow range of frequencies or that have a small fractional bandwidth. In the audio spectrum, narrowband sounds are sounds that occupy a narrow range of frequencies. In telephony, narrowband is usua ...
audio coding
* LPC
**
FNBDT
The Secure Communications Interoperability Protocol (SCIP) is a US standard for secure voice and data communication, foone-to-one connections, not packet-switched networks. SCIP derived from the US Government Future Narrowband Digital Terminal ( ...
for military applications
**
SMV for
CDMA
Code-division multiple access (CDMA) is a channel access method used by various radio communication technologies. CDMA is an example of multiple access, where several transmitters can send information simultaneously over a single communication ...
networks
**
Full Rate
Full Rate (FR or GSM-FR or GSM 06.10 or sometimes simply GSM) was the first digital speech coding standard used in the GSM digital mobile phone system. It uses linear predictive coding (LPC). The bit rate of the codec is 13 kbit/s, or 1.625 bits/a ...
,
Half Rate
Half Rate (HR or GSM-HR or GSM 06.20) is a speech coding system for GSM, developed in the early 1990s.
Since the codec, operating at 5.6 kbit/s, requires half the bandwidth of the Full Rate codec, network capacity for voice traffic is doubled, at ...
,
EFR and
AMR for
GSM
The Global System for Mobile Communications (GSM) is a standard developed by the European Telecommunications Standards Institute (ETSI) to describe the protocols for second-generation ( 2G) digital cellular networks used by mobile devices such as ...
networks
**
G.723.1
G.723.1 is an audio codec for voice that compresses voice audio in frames. An algorithmic look-ahead of duration means that total algorithmic delay is . Its official name is ''Dual rate speech coder for multimedia communications transmitting at ...
,
G.728,
G.729
G.729 is a royalty-free narrow-band vocoder-based audio data compression algorithm using a frame length of 10 milliseconds. It is officially described as ''Coding of speech at 8 kbit/s using code-excited linear prediction'' speech coding (CS-ACEL ...
,
G.729.1
G.729.1 is an 8-32 kbit/s embedded speech and audio codec providing bitstream interoperability with G.729, G.729 Annex A and G.729 Annex B. Its official name is ''G.729-based embedded variable bit rate codec: An 8-32 kbit/s scalable wideband cod ...
and
iLBC
Internet Low Bitrate Codec (iLBC) is a royalty-free narrowband speech audio coding format and an open-source reference implementation (codec), developed by Global IP Solutions (GIPS) formerly Global IP Sound (acquired by Google Inc in 2011). It w ...
for VoIP or videoconferencing
* ADPCM
**
G.726
G.726 is an ITU-T ADPCM speech codec standard covering the transmission of voice at rates of 16, 24, 32, and 40 kbit/s. It was introduced to supersede both G.721, which covered ADPCM at 32 kbit/s, and G.723, which described ADPCM for ...
for VoIP
*
Multi-Band Excitation
In telecommunications, a multi-band device (including (2) dual-band, (3) tri-band, (4) quad-band and (5) penta-band devices) is a communication device (especially a mobile phone) that supports multiple radio frequency bands. All devices which ha ...
(MBE)
**
AMBE+ for
digital
Digital usually refers to something using discrete digits, often binary digits.
Technology and computing Hardware
*Digital electronics, electronic circuits which operate using digital signals
**Digital camera, which captures and stores digital i ...
mobile radio
Mobile radio or mobiles refer to wireless communications systems and devices which are based on radio frequencies(using commonly UHF or VHF frequencies), and where the path of communications is movable on either end. There are a variety of view ...
and
satellite telephone
A satellite telephone, satellite phone or satphone is a type of mobile phone that connects to other phones or the telephone network by radio through orbiting satellites instead of terrestrial cell sites, as cellphones do. The advantage of a ...
**
Codec 2 Codec 2 is a low-bitrate speech audio codec (speech coding) that is patent free and open source. Codec 2 compresses speech using sinusoidal coding, a method specialized for human speech. Bit rates of 3200 to 450 bit/s have been successfully cre ...
See also
*
Digital signal processing
Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are ...
*
Speech interface guideline
Speech interface guideline is a guideline with the aim for guiding decisions and criteria regarding designing interfaces operated by human voice. Speech interface system has many advantages such as consistent service and saving cost. However, for ...
*
Speech processing
Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied t ...
*
Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal languag ...
*
Vector quantization
Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by di ...
References
External links
ITU-T Test Signals for Telecommunication Systems Test SamplesITU-T Perceptual evaluation of speech quality (PESQ) tool Sources
{{Compression Methods