Computational Auditory Scene Analysis

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means.Wang, D. L. and Brown, G. J. (Eds.) (2006). ''Computational auditory scene analysis: Principles, algorithms and applications''. IEEE Press/Wiley-Interscience. In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.


Principles

Since CASA serves to model functional parts of the auditory system, it is necessary to view parts of the biological auditory system in terms of known physical models. Consisting of three areas, the outer, middle and inner ear, the auditory periphery acts as a complex transducer that converts sound vibrations into action potentials in the auditory nerve. The outer ear consists of the external ear, the ear canal and the ear drum. The outer ear, like an acoustic funnel, helps to locate the sound source.Warren, R. (1999). ''Auditory Perception: A New Analysis and Synthesis''. New York: Cambridge University Press. The ear canal acts as a resonant tube (like an organ pipe) that amplifies frequencies between 2 and 5.5 kHz, with a maximum amplification of about 11 dB occurring around 4 kHz.Wiener, F. (1947), "On the diffraction of a progressive wave by the human head". ''Journal of the Acoustical Society of America'', 19, 143–146. As the organ of hearing, the cochlea consists of two membranes, Reissner's membrane and the basilar membrane. The basilar membrane responds to an auditory stimulus when the stimulus frequency matches the resonant frequency of a particular region of the membrane. The movement of the basilar membrane displaces the inner hair cells in one direction, which encodes a half-wave rectified signal of action potentials in the spiral ganglion cells. The axons of these cells make up the auditory nerve, encoding the rectified stimulus. The auditory nerve responses are selective for certain frequencies, similar to the basilar membrane. For lower frequencies, the fibers exhibit "phase locking". Neurons in higher auditory pathway centers are tuned to specific stimulus features, such as periodicity, sound intensity, and amplitude and frequency modulation. There are also neuroanatomical associations of ASA in the posterior cortical areas, including the posterior superior temporal lobes and the posterior cingulate. Studies have found that ASA, including its segregation and grouping operations, is impaired in patients with Alzheimer's disease.Goll, J., Kim, L. (2012), "Impairments of auditory scene analysis in Alzheimer's disease", ''Brain'' 135 (1), 190–200.


System Architecture


Cochleagram

As the first stage of CASA processing, the cochleagram creates a time-frequency representation of the input signal. By mimicking the components of the outer and middle ear, the signal is broken up into different frequencies that are naturally selected by the cochlea and hair cells. Because of the frequency selectivity of the basilar membrane, a filter bank is used to model the membrane, with each filter associated with a specific point on the basilar membrane. Since the hair cells produce spike patterns, each filter of the model should also produce a similar spike in its impulse response. The use of a gammatone filter provides an impulse response that is the product of a gamma function and a tone. The output of the gammatone filter can be regarded as a measurement of basilar membrane displacement. Most CASA systems represent the firing rate in the auditory nerve rather than a spike-based signal. To obtain this, the filter bank outputs are half-wave rectified and then compressed by a square root. (Other models, such as automatic gain control, have also been implemented.) The half-wave rectified signal approximates the displacement behavior of the hair cells. Additional models of the hair cells include the Meddis hair cell model, which pairs with the gammatone filter bank by modeling hair cell transduction.Meddis, R., Hewitt, M., Shackleton, T. (1990). "Implementation details of a computational model of the inner hair-cell/auditory nerve synapse". ''Journal of the Acoustical Society of America'' 87(4), 1813–1816. It assumes that there are three reservoirs of transmitter substance within each hair cell and that transmitter is released in proportion to the degree of displacement of the basilar membrane; the release is equated with the probability of a spike being generated in the nerve fiber. This model replicates many of the nerve responses relevant to CASA systems, such as rectification, compression, spontaneous firing, and adaptation.
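A minimal sketch of this front end is given below, using a direct FIR realization of the gammatone filters followed by half-wave rectification and square-root compression. Practical systems usually use more efficient recursive gammatone approximations; the function names, channel count and frequency spacing here are illustrative assumptions, not part of any standard implementation.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg and Moore) of an auditory filter centred at f Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Truncated gammatone impulse response: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))  # crude normalisation

def cochleagram(x, fs, n_channels=32, f_lo=80.0, f_hi=5000.0):
    """Filter x with a bank of gammatone filters, then half-wave rectify and
    square-root compress each channel as a rough stand-in for hair-cell transduction."""
    # centre frequencies approximately log-spaced between f_lo and f_hi
    fcs = np.geomspace(f_lo, f_hi, n_channels)
    channels = []
    for fc in fcs:
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")  # basilar-membrane displacement
        y = np.maximum(y, 0.0) ** 0.5                           # rectification + compression
        channels.append(y)
    return np.array(channels), fcs
```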


Correlogram

The correlogram is an important model of pitch perception because it unifies two schools of pitch theory:
* place theories (emphasizing the role of resolved harmonics)
* temporal theories (emphasizing the role of unresolved harmonics)
The correlogram is generally computed in the time domain by autocorrelating the simulated auditory nerve firing activity at the output of each filter channel. By pooling the autocorrelation across frequency, the positions of peaks in the summary correlogram correspond to the perceived pitch.
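As an illustration, the sketch below computes a short-time correlogram over one analysis frame of the (hypothetical) `channels` array from the cochleagram sketch above, and picks a pitch from the summary correlogram; frame length, lag range and function names are assumptions.

```python
import numpy as np

def correlogram(channels, start, frame_len=1024, max_lag=400):
    """Short-time autocorrelation of each simulated auditory-nerve channel,
    over one analysis frame beginning at sample `start` (frame_len > max_lag assumed)."""
    seg = channels[:, start:start + frame_len]
    n_ch, n = seg.shape
    acg = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        for lag in range(max_lag):
            acg[c, lag] = np.dot(seg[c, :n - lag], seg[c, lag:])
    return acg

def pitch_from_summary(acg, fs, f_min=60.0, f_max=400.0):
    """Pool the autocorrelation across channels and take the lag of the largest
    peak within a plausible pitch range; its reciprocal is the pitch estimate in Hz."""
    summary = acg.sum(axis=0)
    lo, hi = int(fs / f_max), int(fs / f_min)
    lag = lo + np.argmax(summary[lo:hi])
    return fs / lag
```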


Cross-Correlogram

Because the two ears receive audio signals at different times, the direction of a sound source can be determined from the delays between the two ears.Jeffress, L.A. (1948). "A place theory of sound localization". ''Journal of Comparative and Physiological Psychology'', 41, 35–39. By cross-correlating the delays from the left and right channels (of the model), coincident peaks can be attributed to the same localized sound source, regardless of their temporal position in the input signal. The use of an interaural cross-correlation mechanism has been supported by physiological studies, paralleling the arrangement of neurons in the auditory midbrain.Yin, T., Chan, J. (1990). "Interaural time sensitivity in medial superior olive of cat". ''Journal of Neurophysiology'', 64(2), 465–488.
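The sketch below estimates a single interaural time difference by cross-correlating the two ear signals over physiologically plausible lags (roughly plus or minus 1 ms for a human head); a full cross-correlogram would compute this per filter channel and per time frame. Names and defaults are illustrative.

```python
import numpy as np

def itd_estimate(left, right, fs, max_itd_s=0.001):
    """Estimate the interaural time difference (ITD) of a dominant source by
    cross-correlating the left and right ear signals over lags up to max_itd_s."""
    max_lag = int(max_itd_s * fs)
    n = len(left)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(left[lag:n], right[:n - lag])
        else:
            score = np.dot(left[:n + lag], right[-lag:n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs  # seconds; the sign indicates which ear leads
```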


Time-Frequency Masks

To segregate the sound sources, CASA systems mask the cochleagram. The mask, sometimes a Wiener filter, weights the target source regions and suppresses the rest. The physiological motivation behind the mask is auditory masking, in which a sound is rendered inaudible by a louder sound.Moore, B. (2003). ''An Introduction to the Psychology of Hearing'' (5th ed.). Academic Press, London.
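One widely used formulation of such a mask is the ideal binary mask, which keeps a time-frequency unit when the local target-to-interference ratio exceeds a threshold. The sketch below assumes access to the clean target and interference cochleagram energies, which in practice must be estimated; names and the threshold value are illustrative.

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Ideal binary mask: keep a time-frequency unit when the target energy
    exceeds the interference energy by the local criterion lc_db (in dB).
    Both inputs are (n_channels, n_frames) arrays of per-unit energy."""
    eps = 1e-12
    snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (snr_db > lc_db).astype(float)

# mask applied to a mixture cochleagram:
# masked = ideal_binary_mask(target_energy, noise_energy) * mixture_energy
```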


Resynthesis

A resynthesis pathway reconstructs an audio signal from a group of segments. By inverting the cochleagram, high-quality resynthesized speech signals can be obtained.
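A very rough sketch of the idea, assuming the per-channel signals and mask from the earlier sketches: each channel is weighted by its mask and the channels are summed. It omits the phase compensation (for example, time-reversed filtering) that a faithful cochleagram inversion would require.

```python
import numpy as np

def resynthesize(channel_signals, mask, frame_len=1024):
    """Weight each gammatone channel signal by its per-frame mask value and sum
    across channels; channel_signals is (n_channels, n_samples), mask is
    (n_channels, n_frames)."""
    n_channels, n_samples = channel_signals.shape
    out = np.zeros(n_samples)
    for c in range(n_channels):
        m = np.repeat(mask[c], frame_len)                                  # frame rate -> sample rate
        m = np.pad(m, (0, max(0, n_samples - len(m))), mode="edge")[:n_samples]
        out += m * channel_signals[c]
    return out
```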


Applications


Monaural CASA

Monaural sound separation first began with separating voices based on frequency. There were many early developments based on segmenting different speech signals through frequency. Other models built on this process by adding adaptation through state-space models, batch processing, and prediction-driven architecture.Ellis, D. (1996). "Prediction-Driven Computational Auditory Scene Analysis". PhD thesis, MIT Department of Electrical Engineering and Computer Science. The use of CASA has improved the robustness of automatic speech recognition (ASR) and speech separation systems.Li, P., Guan, Y. (2010). "Monaural speech separation based on MASVQ and CASA for robust speech recognition". ''Computer Speech and Language'', 24, 30–44.


Binaural CASA

Since CASA models the human auditory pathways, binaural CASA systems follow the human model more closely by including two spatially separated microphones, providing sound localization, auditory grouping, and robustness to reverberation. With methods similar to cross-correlation, these systems are able to extract the target signal from the two input microphones.Bodden, M. (1993). "Modeling human sound-source localization and the cocktail party effect". ''Acta Acustica'', 1, 43–55.Lyon, R. (1983). "A computational model of binaural localization and separation". ''Proceedings of the International Conference on Acoustics, Speech and Signal Processing'', 1148–1151.


Neural CASA Models

Since the biological auditory system is deeply connected with the actions of neurons, CASA systems have also incorporated neural models within their design. Two different models provide the basis for this area. Von der Malsburg and Schneider proposed a neural network model with oscillators to represent features of different streams (synchronized and desynchronized).Von der Malsburg, C., Schneider, W. (1986). "A neural cocktail-party processor". ''Biological Cybernetics'', 54, 29–40. Wang also presented a model using a network of excitatory units with a global inhibitor, with delay lines to represent the auditory scene within the time-frequency domain.Wang, D. (1994). "Auditory stream segregation based on oscillatory correlation". ''Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing'', 624–632.Wang, D. (1996), "Primitive auditory segregation based on oscillatory correlation". ''Cognitive Science'', 20, 409–456.
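A toy illustration of the oscillatory-correlation idea is sketched below; it is not Wang's actual relaxation-oscillator network. Simple phase oscillators stand in for time-frequency units, excitatory coupling depends on feature similarity, and a weak global repulsion term plays the role of the global inhibitor. All names, constants and the feature choice are assumptions.

```python
import numpy as np

def oscillatory_grouping(features, steps=2000, dt=0.01, couple=2.0, repel=0.5, seed=0):
    """Units with similar features attract each other in phase; a weak global
    term pushes every oscillator away from the mean phase so that unrelated
    groups drift apart. After settling, units of the same source share a phase."""
    rng = np.random.default_rng(seed)
    f = np.asarray(features, dtype=float)        # e.g. log centre frequency per unit
    n = len(f)
    phase = rng.uniform(0.0, 2.0 * np.pi, n)
    weight = couple * np.exp(-np.subtract.outer(f, f) ** 2)   # similarity-based coupling
    for _ in range(steps):
        mean_phase = np.angle(np.mean(np.exp(1j * phase)))
        diff = np.subtract.outer(phase, phase)                # diff[i, j] = phase_i - phase_j
        attract = (weight * np.sin(-diff)).sum(axis=1) / n    # pull toward similar units
        desync = repel * np.sin(phase - mean_phase)           # push groups apart (global "inhibitor")
        phase = (phase + dt * (1.0 + attract + desync)) % (2.0 * np.pi)
    return phase  # clustered phase values indicate grouped units

# e.g. oscillatory_grouping([0.0, 0.1, 0.2, 3.0, 3.1]) tends to settle into two phase clusters
```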


Analysis of Musical Audio Signals

Typical approaches in CASA systems start with segmenting sound sources into individual constituents, in an attempt to mimic the physical auditory system. However, there is evidence that the brain does not necessarily process audio input separately, but rather as a mixture.Bregman, A. (1995). "Constraints on computational models of auditory scene analysis as derived from human perception". ''The Journal of the Acoustical Society of Japan (E)'', 16(3), 133–136. Instead of breaking the audio signal down into individual constituents, the input is described by higher-level descriptors, such as chords, bass and melody, beat structure, and chorus and phrase repetitions. These descriptors run into difficulties in real-world scenarios, with both monaural and binaural signals. Also, the estimation of these descriptors is highly dependent on the cultural context of the musical input. For example, within Western music, the melody and bass influence the identity of the piece, with the core formed by the melody. By distinguishing the frequency responses of melody and bass, a fundamental frequency can be estimated and filtered for distinction.Goto, M. (2004). "A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals". ''Speech Communication'', 43, 311–329. Chord detection can be implemented through pattern recognition, by extracting low-level features describing harmonic content.Zbigniew, R., Wieczorkowska, A. (2010). "Advances in Music Information Retrieval". ''Studies in Computational Intelligence'', 274, 119–142. The techniques used in music scene analysis can also be applied to speech recognition and other environmental sounds.Masuda-Katsuse, I. (2001). "A new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise". ''Proceedings Eurospeech'', 1119–1122. Future work includes a top-down integration of audio signal processing, such as a real-time beat-tracking system, and expanding beyond the signal processing realm by incorporating auditory psychology and physiology.Goto, M. (2001). "An audio-based real-time beat tracking system for music with or without drum sounds". ''Journal of New Music Research'', 30(2), 159–171.
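As one example of pattern recognition over low-level harmonic features, the sketch below matches a 12-bin chroma (pitch-class) vector against major and minor triad templates. It is a simplified stand-in for the methods in the cited work, and all function names and thresholds are illustrative.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_from_spectrum(mag, freqs, fref=440.0):
    """Fold a magnitude spectrum (e.g. from np.fft.rfft) into a 12-bin chroma vector."""
    chroma = np.zeros(12)
    for m, f in zip(mag, freqs):
        if f < 55.0 or f > 4000.0:
            continue
        pitch_class = int(round(12 * np.log2(f / fref))) % 12   # 0 = A at fref
        chroma[(pitch_class + 9) % 12] += m                     # shift so bin 0 is C
    return chroma / (np.linalg.norm(chroma) + 1e-12)

def detect_chord(chroma):
    """Correlate the chroma vector with major and minor triad templates and
    return the best-matching chord label."""
    best, best_score = None, -np.inf
    for root in range(12):
        for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            template = np.zeros(12)
            template[[(root + i) % 12 for i in intervals]] = 1.0
            score = np.dot(chroma, template / np.linalg.norm(template))
            if score > best_score:
                best, best_score = f"{NOTE_NAMES[root]}{quality}", score
    return best
```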


Neural Perceptual Modeling

While many models consider the audio signal as a complex combination of different frequencies, modeling the auditory system can also require consideration of its neural components. Taking a holistic view, in which a stream (of feature-based sounds) corresponds to neuronal activity distributed over many brain areas, the perception of the sound could be mapped and modeled. Two different solutions have been proposed for how auditory features are bound together in the brain. Hierarchical coding proposes that many cells encode all possible combinations of features and objects in the auditory scene.deCharms, R., Merzenich, M. (1996). "Primary cortical representation of sounds by the coordination of action-potential timing". ''Nature'', 381, 610–613.Wang, D. (2005). "The time dimension of scene analysis". ''IEEE Transactions on Neural Networks'', 16(6), 1401–1426. Temporal or oscillatory correlation addresses the binding problem by focusing on the synchrony and desynchrony between neural oscillations to encode the state of binding among the auditory features. These two solutions closely parallel the debate between place coding and temporal coding. In drawing on neural components, another issue arises for CASA systems: the extent to which neural mechanisms should be modeled. Studies of CASA systems have modeled some known mechanisms, such as the bandpass nature of cochlear filtering and random auditory nerve firing patterns; however, these models may not lead to the discovery of new mechanisms, but rather give an understanding of the purpose of the known ones.Bregman, A. (1990). ''Auditory Scene Analysis''. Cambridge: MIT Press.


See also

* Auditory scene analysis
* Blind signal separation
* Cocktail party problem
* Machine vision
* Speech recognition


Further reading

D. F. Rosenthal and H. G. Okuno (1998). ''Computational auditory scene analysis''. Mahwah, NJ: Lawrence Erlbaum.

