Audio time stretching and pitch scaling

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance. Pitch control is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording. These processes are often used to match the pitches and tempos of two pre-recorded clips for mixing when the clips cannot be reperformed or resampled. Time stretching is often used to adjust radio commercials and the audio of television advertisements to fit exactly into the 30 or 60 seconds available. It can be used to conform longer material to a designated time slot, such as a 1-hour broadcast.


Resampling

The simplest way to change the duration or pitch of an audio recording is to change the playback speed. For a digital audio recording, this can be accomplished through sample rate conversion. Unfortunately, the frequencies in the recording are always scaled at the same ratio as the speed, transposing its perceived pitch up or down in the process. Slowing down the recording to increase duration also lowers the pitch; speeding it up for a shorter duration also raises the pitch, creating the Chipmunk effect. Thus the two effects cannot be separated when using this method. A drum track containing no pitched instruments can be moderately sample-rate converted to adjust tempo without adverse effects, but a pitched track cannot.
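The coupling between speed and pitch under resampling can be illustrated with a minimal sketch. It assumes NumPy and uses plain linear interpolation in place of a proper sample-rate converter (no anti-aliasing filter); the names `change_speed` and `dominant_frequency` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def change_speed(signal, rate):
    """Resample by linear interpolation so playback is `rate` times faster."""
    n_out = int(round(len(signal) / rate))
    # Fractional read positions in the input, one per output sample
    positions = np.arange(n_out) * rate
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, sample_rate):
    """Frequency in Hz of the largest magnitude bin of the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.argmax(spectrum) * sample_rate / len(signal)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate      # one second of sample times
tone = np.sin(2 * np.pi * 440.0 * t)          # 440 Hz test tone

fast = change_speed(tone, 2.0)                # play back twice as fast

print(len(tone), len(fast))                   # duration is halved
print(dominant_frequency(tone, sample_rate),  # pitch before
      dominant_frequency(fast, sample_rate))  # pitch after: one octave up
```

Doubling the speed halves the number of samples and moves the 440 Hz tone to 880 Hz, which is why duration and pitch cannot be adjusted independently with this method alone.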


Frequency domain


Phase vocoder

One way of stretching the length of a signal without affecting the pitch is to build a phase vocoder after Flanagan, Golden, and Portnoff. The basic steps are: (1) compute the instantaneous frequency/amplitude relationship of the signal using the short-time Fourier transform (STFT), which is the discrete Fourier transform of a short, overlapping and smoothly windowed block of samples; (2) apply some processing to the Fourier transform magnitudes and phases (like resampling the FFT blocks); and (3) perform an inverse STFT by taking the inverse Fourier transform on each chunk and adding the resulting waveform chunks, also called overlap and add (OLA). The phase vocoder handles sinusoid components well, but early implementations introduced considerable smearing on transient ("beat") waveforms at all non-integer compression/expansion rates, which renders the results phasey and diffuse. Recent improvements allow better quality results at all compression/expansion ratios, but a residual smearing effect still remains. The phase vocoder technique can also be used to perform pitch shifting, chorusing, timbre manipulation, harmonizing, and other unusual modifications, all of which can be changed as a function of time.
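The three steps above can be sketched as a basic, unrefined phase vocoder. This is an illustration under stated assumptions, not a reference implementation: it assumes NumPy, a Hann window, a fixed synthesis hop, and no transient handling, so it exhibits the smearing described above. The function name `phase_vocoder_stretch`, its parameters, and the helper `peak_hz` are hypothetical.

```python
import numpy as np

def phase_vocoder_stretch(x, stretch, n_fft=1024, hop_s=256):
    """Time-stretch x by roughly `stretch` (>1 is longer) without changing pitch."""
    hop_a = int(round(hop_s / stretch))       # analysis hop in samples
    window = np.hanning(n_fft)
    # Phase advance expected over one analysis hop at each bin's centre frequency
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop_a / n_fft

    n_frames = 1 + (len(x) - n_fft) // hop_a
    out = np.zeros((n_frames - 1) * hop_s + n_fft)
    norm = np.zeros_like(out)

    prev_phase = None
    synth_phase = None
    for i in range(n_frames):
        # Step 1: STFT frame (windowed block -> magnitudes and phases)
        frame = x[i * hop_a : i * hop_a + n_fft] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)

        # Step 2: re-derive phases so each bin keeps its instantaneous frequency
        if i == 0:
            synth_phase = phase.copy()
        else:
            delta = phase - prev_phase - omega
            delta = (delta + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
            inst_freq = (omega + delta) / hop_a             # radians per sample
            synth_phase = synth_phase + inst_freq * hop_s
        prev_phase = phase

        # Step 3: inverse transform and overlap-add at the synthesis hop
        y = np.fft.irfft(mag * np.exp(1j * synth_phase), n=n_fft) * window
        start = i * hop_s
        out[start : start + n_fft] += y
        norm[start : start + n_fft] += window ** 2

    # Divide by the summed squared windows; the floor fades the outer edges
    return out / np.maximum(norm, 1e-3)

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
stretched = phase_vocoder_stretch(tone, 1.5)

def peak_hz(signal):
    """Frequency in Hz of the largest magnitude bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.argmax(spectrum) * sample_rate / len(signal)

print(len(stretched) / len(tone))             # close to the requested 1.5
print(peak_hz(tone), peak_hz(stretched))      # both near 440 Hz
```

Because the analysis hop is rounded to a whole number of samples, the achieved stretch is only approximately the requested one. On a steady tone the result is clean; on percussive material this sketch would smear attacks, since it makes no attempt to keep phases coherent across bins at transients.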


Sinusoidal spectral modeling

Another method for time stretching relies on a spectral model of the signal. In this method, peaks are identified in frames using the STFT of the signal, and sinusoidal "tracks" are created by connecting peaks in adjacent frames. The tracks are then re-synthesized at a new time scale. This method can yield good results on both polyphonic and percussive material, especially when the signal is separated into sub-bands. However, this method is more computationally demanding than other methods.
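The peak-picking and track-building stage can be sketched as follows. This is a simplified illustration assuming NumPy: peaks are taken as local maxima of a Hann-windowed magnitude spectrum above a relative threshold, and tracks are formed by greedy nearest-frequency matching, which is cruder than the matching used in practical sinusoidal models. The names `spectral_peaks` and `link_tracks` and the `threshold` and `max_jump_hz` parameters are hypothetical.

```python
import numpy as np

def spectral_peaks(frame, sample_rate, threshold=0.1):
    """Frequencies (Hz) of local maxima in the windowed magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    interior = mag[1:-1]
    is_peak = ((interior > mag[:-2]) & (interior > mag[2:])
               & (interior > threshold * mag.max()))
    bins = np.nonzero(is_peak)[0] + 1         # +1 because interior starts at bin 1
    return bins * sample_rate / len(frame)

def link_tracks(prev_freqs, next_freqs, max_jump_hz):
    """Pair each peak with the nearest peak in the next frame, if close enough."""
    links = []
    if len(next_freqs) == 0:
        return links
    for f in prev_freqs:
        nearest = next_freqs[np.argmin(np.abs(next_freqs - f))]
        if abs(nearest - f) <= max_jump_hz:
            links.append((float(f), float(nearest)))
    return links

sample_rate, n = 8000, 1024
t = np.arange(n) / sample_rate
# Two adjacent frames; the lower partial drifts upward by one FFT bin
frame_a = np.sin(2 * np.pi * 500.0 * t) + 0.5 * np.sin(2 * np.pi * 1500.0 * t)
frame_b = np.sin(2 * np.pi * 507.8125 * t) + 0.5 * np.sin(2 * np.pi * 1500.0 * t)

peaks_a = spectral_peaks(frame_a, sample_rate)
peaks_b = spectral_peaks(frame_b, sample_rate)
links = link_tracks(peaks_a, peaks_b, max_jump_hz=20.0)
print(peaks_a, peaks_b, links)
```

Each linked pair is one segment of a sinusoidal track; resynthesis at a new time scale would then interpolate the frequency and amplitude of each track over a longer or shorter interval.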


Time domain


SOLA

Rabiner and Schafer in 1978 put forth an alternate solution that works in the time domain: attempt to find the period (or equivalently the fundamental frequency) of a given section of the wave using some pitch detection algorithm (commonly the peak of the signal's autocorrelation, or sometimes cepstral processing), and crossfade one period into another. This is called time-domain harmonic scaling or the synchronized overlap-add method (SOLA) and performs somewhat faster than the phase vocoder on slower machines, but fails when the autocorrelation mis-estimates the period of a signal with complicated harmonics (such as orchestral pieces). Adobe Audition (formerly Cool Edit Pro) seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo, and between 30 Hz and the lowest bass frequency. This is much more limited in scope than phase-vocoder-based processing, but can be made much less processor intensive for real-time applications. It provides the most coherent results for single-pitched sounds like voice or musically monophonic instrument recordings. High-end commercial audio processing packages either combine the two techniques (for example by separating the signal into sinusoid and transient waveforms), or use other techniques based on the wavelet transform or artificial neural network processing, producing the highest-quality time stretching.
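The core idea of the time-domain approach, estimating a period from the autocorrelation peak and crossfading one period into the next, can be sketched as below. This is a deliberately reduced illustration assuming NumPy: it removes a single period from a steady, strictly periodic test signal, whereas a full SOLA implementation repeats the search and crossfade throughout the signal. The names `estimate_period` and `crossfade` and the lag bounds are hypothetical.

```python
import numpy as np

def estimate_period(segment, min_lag, max_lag):
    """Return the lag (in samples) with the highest autocorrelation."""
    segment = segment - np.mean(segment)
    # Full autocorrelation; keep the non-negative lags only
    acf = np.correlate(segment, segment, mode="full")[len(segment) - 1:]
    return min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))

def crossfade(a, b):
    """Linearly fade block `a` out while fading equal-length block `b` in."""
    ramp = np.linspace(0.0, 1.0, len(a))
    return a * (1.0 - ramp) + b * ramp

sample_rate = 8000
n = np.arange(2048)
# 200 Hz tone with one harmonic: the true period is 40 samples
x = (np.sin(2 * np.pi * 200 * n / sample_rate)
     + 0.5 * np.sin(2 * np.pi * 400 * n / sample_rate))

period = estimate_period(x[:1024], min_lag=20, max_lag=400)

# Shorten the signal by one period: merge two adjacent periods into one
start = 800
a = x[start : start + period]
b = x[start + period : start + 2 * period]
shorter = np.concatenate([x[:start], crossfade(a, b), x[start + 2 * period:]])

print(period, len(x), len(shorter))
```

When the period estimate is right, the two crossfaded blocks are nearly identical and the splice is inaudible. If the autocorrelation locks onto the wrong lag, as can happen with dense harmonic material, the blocks no longer line up and the crossfade produces an audible discontinuity.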


Frame-based approach

In order to preserve an audio signal's pitch when stretching or compressing its duration, many time-scale modification (TSM) procedures follow a frame-based approach. Given an original discrete-time audio signal, this strategy's first step is to split the signal into short "analysis frames" of fixed length. The analysis frames are spaced by a fixed number of samples, called the "analysis hopsize" $H_a \in \mathbb{N}$. To achieve the actual time-scale modification, the analysis frames are then temporally relocated to have a "synthesis hopsize" $H_s \in \mathbb{N}$. This frame relocation results in a modification of the signal's duration by a "stretching factor" of $\alpha = H_s / H_a$. However, simply superimposing the unmodified analysis frames typically results in undesired artifacts such as phase discontinuities or amplitude fluctuations. To prevent these kinds of artifacts, the analysis frames are adapted to form "synthesis frames", prior to the reconstruction of the time-scale modified output signal. The strategy of how to derive the synthesis frames from the analysis frames is a key difference among different TSM procedures.
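The frame relocation itself, without any adaptation of the frames, can be sketched as a plain overlap-add. This is a minimal illustration assuming NumPy and a Hann window; it corresponds to the naive case in which unmodified analysis frames are superimposed, so it shows the duration change but not the artifact suppression that distinguishes real TSM procedures. The name `ola_stretch` and its parameters are hypothetical.

```python
import numpy as np

def ola_stretch(x, frame_len, hop_a, hop_s):
    """Relocate windowed analysis frames from hop_a spacing to hop_s spacing."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop_a
    out = np.zeros((n_frames - 1) * hop_s + frame_len)
    norm = np.zeros_like(out)
    for m in range(n_frames):
        # Analysis frame m starts at m * hop_a in the input ...
        frame = x[m * hop_a : m * hop_a + frame_len] * window
        # ... and is placed at m * hop_s in the output
        out[m * hop_s : m * hop_s + frame_len] += frame
        norm[m * hop_s : m * hop_s + frame_len] += window
    # Compensate for the varying amount of window overlap
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)

hop_a, hop_s = 128, 256
alpha = hop_s / hop_a                         # stretching factor H_s / H_a
y = ola_stretch(x, frame_len=512, hop_a=hop_a, hop_s=hop_s)
same = ola_stretch(x, frame_len=512, hop_a=128, hop_s=128)

print(alpha, len(x), len(y))
```

With equal hop sizes the interior of the signal is reconstructed unchanged, while with $H_s = 2 H_a$ the output is roughly twice as long. On pitched material this naive version would exhibit the phase discontinuities mentioned above, which is what the frame adaptation step in each TSM procedure is designed to prevent.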


Speed hearing and speed talking

For the specific case of speech, time stretching can be performed using PSOLA.
Time-compressed speech
is the representation of verbal text in compressed time. While one might expect speeding up to reduce comprehension, Herb Friedman says that "Experiments have shown that the brain works most efficiently if the information rate through the ears—via speech—is the 'average' reading rate, which is about 200–300 wpm (words per minute), yet the average rate of speech is in the neighborhood of 100–150 wpm." Listening to time-compressed speech is seen as the equivalent of speed reading.


Pitch scaling

These techniques can also be used to transpose an audio sample while holding speed or duration constant. This may be accomplished by time stretching and then resampling back to the original length. Alternatively, the frequency of the sinusoids in a sinusoidal model may be altered directly, and the signal reconstructed at the appropriate time scale. Transposing can be called "frequency scaling" or "pitch shifting", depending on perspective. For example, one could move the pitch of every note up by a perfect fifth, keeping the tempo the same. One can view this transposition as "pitch shifting", "shifting" each note up 7 keys on a piano keyboard, or adding a fixed amount on the Mel scale, or adding a fixed amount in linear pitch space. One can view the same transposition as "frequency scaling", "scaling" (multiplying) the frequency of every note by 3/2. Musical transposition preserves the ratios of the harmonic frequencies that determine the sound's timbre, unlike the "frequency shift" performed by amplitude modulation, which adds a fixed frequency offset to the frequency of every note. (In theory one could perform a literal "pitch scaling" in which the musical pitch space location is scaled, so that a higher note would be shifted by a greater interval in linear pitch space than a lower note, but that is highly unusual, and not musical.) Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable. A process that preserves the formants and character of a voice involves analyzing the signal with a channel vocoder or LPC vocoder plus any of several pitch detection algorithms and then resynthesizing it at a different fundamental frequency. A detailed description of older analog recording techniques for pitch shifting can be found within the Alvin and the Chipmunks entry.
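The difference between frequency scaling and frequency shifting, and the relation between "7 keys" and the ratio 3/2, can be checked with a few lines of arithmetic. The sketch below uses only the Python standard library and assumes twelve-tone equal temperament for the piano keyboard, which the text does not state explicitly; all function names are hypothetical.

```python
import math

def semitones_to_ratio(semitones):
    """Frequency ratio for an interval in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def scale_frequencies(freqs, ratio):
    """Musical transposition: multiply every frequency by the same ratio."""
    return [f * ratio for f in freqs]

def shift_frequencies(freqs, offset_hz):
    """Frequency shift: add the same offset in Hz to every frequency."""
    return [f + offset_hz for f in freqs]

# A fundamental with its second and third harmonics
harmonics = [220.0, 440.0, 660.0]

scaled = scale_frequencies(harmonics, 3 / 2)    # harmonic ratios stay 1:2:3
shifted = shift_frequencies(harmonics, 110.0)   # harmonic ratios are broken

fifth_equal = semitones_to_ratio(7)             # 7 keys up on the keyboard
fifth_just = 3 / 2
cents_apart = 1200 * math.log2(fifth_just / fifth_equal)

print(scaled, shifted)
print(fifth_equal, cents_apart)
```

Scaling keeps the partials at integer multiples of the new fundamental, so the timbre is preserved, whereas shifting moves the same fundamental to 330 Hz but leaves the upper partials inharmonic. Under the equal-temperament assumption, seven semitones differ from the exact 3/2 ratio by about two cents.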


In consumer software

Pitch-corrected audio timestretch is found in every modern web browser as part of the HTML standard for media playback. Similar controls are ubiquitous in media applications and frameworks such as GStreamer and Unity.


See also

* Audio signal processing
* Beatmatching
* Dynamic tonality: real-time changes of tuning and timbre
* Pitch correction
* Scrubbing (audio)
* Sound effects
* Time-compressed speech


External links


* Time Stretching and Pitch Shifting Overview: a comprehensive overview of current time and pitch modification techniques by Stephan Bernsee
* Stephan Bernsee's smbPitchShift C source code: C source code for doing frequency domain pitch manipulation
* pitchshift.js from KievII: a JavaScript pitch shifter based on the smbPitchShift code, from the open source KievII library
* The Phase Vocoder: A Tutorial: a good description of the phase vocoder
* New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects
* A new Approach to Transient Processing in the Phase Vocoder
* How to build a pitch shifter: theory, equations, figures and performances of a real-time guitar pitch shifter running on a DSP chip
* ZTX Time Stretching Library: free and commercial versions of a popular third-party time stretching library for iOS, Linux, Windows and Mac OS X
* Elastique by zplane: commercial cross-platform library, mainly used by DJ and DAW manufacturers
* Voice Synth from Qneo: specialized synthesizer for creative voice sculpting
* TSM toolbox: free MATLAB implementations of various time-scale modification procedures
* PaulStretch: a well-known algorithm for extreme (> 10×) time stretching