Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation, based on audio signal processing techniques, to model the speech signal, combined with generic data compression algorithms to represent the resulting model parameters in a compact bitstream.
Some applications of speech coding are mobile telephony and voice over IP (VoIP). The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.
The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility.
Speech coding differs from other forms of audio coding in that speech is a simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information that is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and ''pleasantness'' of speech, with a constrained amount of transmitted data. In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.
Categories
Speech coders are of two types:
# Waveform coders
#* Time-domain: PCM, ADPCM
#* Frequency-domain: sub-band coding, ATRAC
# Vocoders
#* Linear predictive coding (LPC)
#* Formant coding
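As a rough illustration of the time-domain waveform coders listed above, the following sketch shows the general idea behind ADPCM: quantize the difference between each sample and a running prediction, and adapt the quantizer step size to the signal. It is a toy scheme, not G.726 or any standardized codec; the function names, the 4-bit code width, the initial step and the adaptation factors (1.5 and 0.8) are all assumptions chosen for readability.

```python
import math

def _update(predicted, step, code, max_code):
    """Shared state update so that encoder and decoder stay in lockstep."""
    predicted += code * step          # reconstruct the sample from the code
    if abs(code) >= max_code - 1:     # large codes: signal moves fast, widen step
        step *= 1.5
    elif abs(code) <= 1:              # small codes: signal moves slowly, narrow step
        step *= 0.8
    step = min(max(step, 1e-4), 0.5)  # keep the step in a sane range (assumed limits)
    return predicted, step

def adpcm_encode(samples, bits=4, initial_step=0.02):
    """Toy ADPCM encoder: one signed `bits`-bit code per input sample."""
    max_code = 2 ** (bits - 1) - 1
    predicted, step = 0.0, initial_step
    codes = []
    for s in samples:
        code = round((s - predicted) / step)
        code = max(-max_code, min(max_code, code))  # clip to the code range
        codes.append(code)
        predicted, step = _update(predicted, step, code, max_code)
    return codes

def adpcm_decode(codes, bits=4, initial_step=0.02):
    """Toy ADPCM decoder: replays the same state updates as the encoder."""
    max_code = 2 ** (bits - 1) - 1
    predicted, step = 0.0, initial_step
    out = []
    for code in codes:
        predicted, step = _update(predicted, step, code, max_code)
        out.append(predicted)
    return out

if __name__ == "__main__":
    # 200 Hz tone sampled at 8 kHz (both values are illustrative assumptions).
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
    decoded = adpcm_decode(adpcm_encode(tone))
    worst = max(abs(a - b) for a, b in zip(tone, decoded))
    print("largest reconstruction error:", worst)
```

Because the decoder derives the step size from the codes alone, no side information has to be transmitted; this is the property that distinguishes a waveform coder of this kind from a vocoder, which transmits model parameters instead of a quantized waveform.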
Sample companding viewed as a form of speech coding
The A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as an early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. The logarithmic companding laws are consistent with human hearing perception, in that low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
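The effect of logarithmic companding can be sketched with the continuous μ-law formula. The code below is an illustration only: it applies the textbook compression curve with μ = 255 and then quantizes uniformly to 8 bits, whereas G.711 itself specifies a segmented, bit-exact encoding that this sketch does not reproduce. The function names are hypothetical.

```python
import math

MU = 255  # mu-law parameter; 255 is the value associated with G.711 mu-law

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto [-1, 1] with logarithmic compression."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def encode(x, bits=8):
    """Compress, then quantize uniformly to a signed integer code."""
    levels = 2 ** (bits - 1) - 1
    return round(mu_law_compress(x) * levels)

def decode(code, bits=8):
    """Map an integer code back to an approximate sample value."""
    levels = 2 ** (bits - 1) - 1
    return mu_law_expand(code / levels)

if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, 1.0):
        companded = decode(encode(x))
        uniform = round(x * 127) / 127  # plain 8-bit uniform quantizer for comparison
        print(f"x={x}: companded error={abs(companded - x):.6f}, "
              f"uniform error={abs(uniform - x):.6f}")
```

Quiet samples receive much finer quantization steps than loud ones, which is why 8 companded bits can stand in for roughly 12 linear bits on low-level speech while the coarser steps at high amplitudes are masked by the signal itself.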
A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.
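The 33% figure is consistent with replacing 12-bit linear samples by 8-bit companded samples. The short calculation below makes the arithmetic explicit; the 8 kHz sampling rate is an assumption (it is the usual rate for narrowband telephony but is not stated in the text above), and the reduction ratio does not depend on it.

```python
SAMPLE_RATE_HZ = 8000   # assumption: typical narrowband telephony sampling rate
LINEAR_BITS = 12        # effective resolution quoted in the text
COMPANDED_BITS = 8      # bits per sample after A-law/mu-law companding

def bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bit/s of an uncompressed stream of fixed-size samples."""
    return sample_rate_hz * bits_per_sample

linear_rate = bit_rate(SAMPLE_RATE_HZ, LINEAR_BITS)
companded_rate = bit_rate(SAMPLE_RATE_HZ, COMPANDED_BITS)
reduction = 1 - companded_rate / linear_rate

if __name__ == "__main__":
    print(f"linear PCM:    {linear_rate} bit/s")
    print(f"companded PCM: {companded_rate} bit/s")
    print(f"reduction:     {reduction:.1%}")
```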
In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.
Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were required to allow effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI circuits, than was available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital mobile phone networks with substantially higher channel capacities than the analog systems that preceded them.
The most widely used speech coding algorithms are based on linear predictive coding (LPC). In particular, the most common speech coding scheme is the LPC-based code-excited linear prediction (CELP) coding, which is used for example in the GSM standard. In CELP, the modeling is divided into two stages: a linear predictive stage that models the spectral envelope, and a code-book-based model of the residual of the linear predictive model. In CELP, linear prediction coefficients (LPC) are computed and quantized, usually as line spectral pairs (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. In order to get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.
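The linear predictive stage described above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration, not the procedure of GSM or any particular CELP codec: there is no windowing, no quantization, no conversion to line spectral pairs and no codebook search. The test signal is a synthetic decaying resonance, a crude stand-in for one formant excited by a single pulse; its coefficients (1.6 and -0.8) and all function names are assumptions.

```python
def synth_resonance(n_samples, a1=1.6, a2=-0.8):
    """Impulse response of x[n] = a1*x[n-1] + a2*x[n-2] + excitation[n]."""
    x = []
    for n in range(n_samples):
        excitation = 1.0 if n == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(excitation + a1 * prev1 + a2 * prev2)
    return x

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of a frame (zero outside the frame)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a so that x[n] ~ sum(a[k] * x[n-1-k])."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error energy
    return a, err

def residual(frame, a):
    """Prediction residual: what a CELP coder would model with its codebook."""
    out = []
    for n in range(len(frame)):
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(frame[n] - pred)
    return out

if __name__ == "__main__":
    frame = synth_resonance(200)
    r = autocorrelation(frame, 2)
    coeffs, err = levinson_durbin(r, 2)
    res = residual(frame, coeffs)
    print("estimated coefficients:", coeffs)
    print("signal energy:", r[0], "residual energy:", sum(e * e for e in res))
```

The residual carries far less energy than the signal, which is the reason the two-stage split pays off: the predictor coefficients capture the spectral envelope in a few numbers, and only the low-energy residual remains to be coded.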
The modified discrete cosine transform (MDCT), a type of discrete cosine transform (DCT) algorithm, was adapted into a speech coding algorithm called LD-MDCT, used for the AAC-LD format introduced in 1999. MDCT has since been widely adopted in voice-over-IP (VoIP) applications, such as the G.729.1 wideband audio codec introduced in 2006, Apple's FaceTime (using AAC-LD) introduced in 2010, and the CELT codec introduced in 2011.
Presentation of the CELT codec by Timothy B. Terriberry (65 minutes of video; see also the presentation slides in PDF)
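To make the MDCT mentioned above more concrete, here is a minimal sketch of the transform and its inverse with a sine window and 50% overlap-add. It evaluates the textbook definition directly, which is far slower than the fast algorithms real codecs use, and it omits everything that makes a codec a codec (quantization, bit allocation, low-delay framing as in LD-MDCT). The function names and the block size are assumptions.

```python
import math

def mdct(frame):
    """Direct MDCT: 2N windowed input samples -> N coefficients."""
    n_half = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / n_half
                                    * (n + 0.5 + n_half / 2) * (k + 0.5))
                for n in range(2 * n_half))
            for k in range(n_half)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples.

    The 2/N scaling assumes the same window is applied before the MDCT
    and again after the IMDCT.
    """
    n_half = len(coeffs)
    return [(2.0 / n_half) * sum(coeffs[k] * math.cos(math.pi / n_half
                                 * (n + 0.5 + n_half / 2) * (k + 0.5))
                                 for k in range(n_half))
            for n in range(2 * n_half)]

def sine_window(length):
    """Sine window; satisfies w[n]^2 + w[n + length/2]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def analysis_synthesis(signal, n_half):
    """Transform overlapping frames and rebuild the signal by overlap-add."""
    window = sine_window(2 * n_half)
    tail = n_half + (-len(signal)) % n_half
    padded = [0.0] * n_half + list(signal) + [0.0] * tail
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = [padded[start + n] * window[n] for n in range(2 * n_half)]
        coeffs = mdct(frame)   # a codec would quantize these coefficients
        recon = imdct(coeffs)
        for n in range(2 * n_half):
            out[start + n] += recon[n] * window[n]
    return out[n_half:n_half + len(signal)]

if __name__ == "__main__":
    signal = [math.sin(0.3 * n) + 0.25 * math.cos(1.1 * n) for n in range(64)]
    rebuilt = analysis_synthesis(signal, n_half=8)
    print("max reconstruction error:",
          max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

Each hop of N new samples produces exactly N coefficients even though frames overlap by half, and the aliasing introduced by each frame cancels against its neighbours during overlap-add. That combination of critical sampling and smooth frame boundaries is a large part of why the transform suits low-bit-rate coding.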
Opus is a free software speech coder. It combines both the MDCT and LPC audio compression algorithms. It is widely used for VoIP calls in WhatsApp.
The PlayStation 4 video game console also uses Opus for its PlayStation Network system party chat.
Codec 2 is another free software speech coder, which operates at bit rates as low as 450 bit/s.
Sub-fields
; Wideband audio coding
* Linear predictive coding (LPC)
** AMR-WB for WCDMA networks
** VMR-WB for CDMA2000 networks
** Speex, IP-MR, SILK and Opus for voice-over-IP (VoIP) and videoconferencing
* Modified discrete cosine transform (MDCT)
** AAC-LD, G.722.1, G.729.1, CELT and Opus for VoIP and videoconferencing
* Adaptive differential pulse-code modulation (ADPCM)
** G.722 for VoIP
; Narrowband audio coding
* LPC
** FNBDT for military applications
** SMV for CDMA networks
** Full Rate, Half Rate, EFR and AMR for GSM networks
** G.723.1, G.728, G.729, G.729.1 and iLBC for VoIP or videoconferencing
* ADPCM
** G.726 for VoIP
* Multi-Band Excitation (MBE)
** AMBE+ for digital mobile radio and satellite telephone
** Codec 2
See also
* Digital signal processing
* Speech interface guideline
* Speech processing
* Speech synthesis
* Vector quantization
References
External links
* ITU-T Test Signals for Telecommunication Systems Test Samples
* ITU-T Perceptual evaluation of speech quality (PESQ) tool Sources