HOME

TheInfoList



OR:

Computer audition (CA) or machine listening is the general field of study of
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
s and systems for audio interpretation by machines. Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in ''
Technology Review ''MIT Technology Review'' is a bimonthly magazine wholly owned by the Massachusetts Institute of Technology, and editorially independent of the university. It was founded in 1899 as ''The Technology Review'', and was re-launched without "The" in ...
'', talks about these systems "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents." Inspired by models of human audition, CA deals with questions of representation, transduction, grouping, use of musical knowledge and general sound
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comput ...
for the purpose of performing intelligent operations on audio and music signals by the computer. Technically this requires a combination of methods from the fields of
signal processing Signal processing is an electrical engineering subfield that focuses on analyzing, modifying and synthesizing '' signals'', such as sound, images, and scientific measurements. Signal processing techniques are used to optimize transmissions, ...
, auditory modelling, music perception and
cognition Cognition refers to "the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses". It encompasses all aspects of intellectual functions and processes such as: perception, attention, though ...
,
pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics ...
, and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
, as well as more traditional methods of
artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech ...
for musical knowledge representation.


Applications

Like
computer vision Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the human ...
versus image processing, computer audition versus audio engineering deals with understanding of audio rather than processing. It also differs from problems of speech understanding by machine since it deals with general audio signals, such as natural sounds and musical recordings. Applications of computer audition are widely varying, and include search for sounds,
genre Genre () is any form or type of communication in any mode (written, spoken, digital, artistic, etc.) with socially-agreed-upon conventions developed over time. In popular usage, it normally describes a category of literature, music, or other f ...
recognition, acoustic monitoring, music transcription, score following, audio texture,
music improvisation Musical improvisation (also known as musical extemporization) is the creative activity of immediate ("in the moment") musical composition, which combines performance with communication of emotions and instrumental technique as well as spontaneous ...
, emotion in audio and so on.


Related disciplines

Computer Audition overlaps with the following disciplines: *
Music information retrieval Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. MIR is a small but growing field of research with many real-world applications. Those involved in MIR may have a background in academic musicol ...
: methods for search and analysis of similarity between music signals. *
Auditory scene analysis In perception and psychophysics, auditory scene analysis (ASA) is a proposed model for the basis of auditory perception. This is understood as the process by which the human auditory system organizes sound into perceptually meaningful elements. Th ...
: understanding and description of audio sources and events. * Computational
musicology Musicology (from Greek μουσική ''mousikē'' 'music' and -λογια ''-logia'', 'domain of study') is the scholarly analysis and research-based study of music. Musicology departments traditionally belong to the humanities, although some m ...
and mathematical music theory: use of algorithms that employ musical knowledge for analysis of music data. *
Computer music Computer music is the application of computing technology in music composition, to help human composers create new music or to have computers independently create music, such as with algorithmic composition programs. It includes the theory and ...
: use of computers in creative musical applications. * Machine musicianship: audition driven interactive music systems.


Areas of study

Since audio signals are interpreted by the human ear–brain system, that complex perceptual mechanism should be simulated somehow in software for "machine listening". In other words, to perform on par with humans, the computer should hear and understand audio content much as humans do. Analyzing audio accurately involves several fields: electrical engineering (spectrum analysis, filtering, and audio transforms); artificial intelligence (machine learning and sound classification); psychoacoustics (sound perception); cognitive sciences (neuroscience and artificial intelligence); acoustics (physics of sound production); and music (harmony, rhythm, and timbre). Furthermore, audio transformations such as pitch shifting, time stretching, and sound object filtering, should be perceptually and musically meaningful. For best results, these transformations require perceptual understanding of spectral models, high-level feature extraction, and sound analysis/synthesis. Finally, structuring and coding the content of an audio file (sound and metadata) could benefit from efficient compression schemes, which discard inaudible information in the sound.Machine Listening Course Webpage at MIT
/ref> Computational models of music and sound perception and cognition can lead to a more meaningful representation, a more intuitive digital manipulation and generation of sound and music in musical human-machine interfaces. The study of CA could be roughly divided into the following sub-problems: # Representation: signal and symbolic. This aspect deals with time-frequency representations, both in terms of notes and spectral models, including pattern playback and audio texture. #
Feature extraction In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning a ...
: sound descriptors, segmentation, onset, pitch and
envelope An envelope is a common packaging item, usually made of thin, flat material. It is designed to contain a flat object, such as a letter or card. Traditional envelopes are made from sheets of paper cut to one of three shapes: a rhombus, a ...
detection, chroma, and auditory representations. # Musical knowledge structures: analysis of
tonality Tonality is the arrangement of pitches and/or chords of a musical work in a hierarchy of perceived relations, stabilities, attractions and directionality. In this hierarchy, the single pitch or triadic chord with the greatest stability is ca ...
,
rhythm Rhythm (from Greek , ''rhythmos'', "any regular recurring motion, symmetry") generally means a " movement marked by the regulated succession of strong and weak elements, or of opposite or different conditions". This general meaning of regular re ...
, and
harmonies In music, harmony is the process by which individual sounds are joined together or composed into whole units or compositions. Often, the term harmony refers to simultaneously occurring frequencies, pitches ( tones, notes), or chords. Howev ...
. # Sound similarity: methods for comparison between sounds, sound identification, novelty detection, segmentation, and clustering. # Sequence modeling: matching and alignment between signals and note sequences. # Source separation: methods of grouping of simultaneous sounds, such as multiple pitch detection and time-frequency clustering methods. # Auditory cognition: modeling of emotions, anticipation and familiarity, auditory surprise, and analysis of musical structure. # Multi-modal analysis: finding correspondences between textual, visual, and audio signals.


Representation issues

Computer audition deals with audio signals that can be represented in a variety of fashions, from direct encoding of digital audio in two or more channels to symbolically represented synthesis instructions. Audio signals are usually represented in terms of analogue or
digital Digital usually refers to something using discrete digits, often binary digits. Technology and computing Hardware *Digital electronics, electronic circuits which operate using digital signals ** Digital camera, which captures and stores digital ...
recordings. Digital recordings are samples of acoustic waveform or parameters of audio compression algorithms. One of the unique properties of musical signals is that they often combine different types of representations, such as graphical scores and sequences of performance actions that are encoded as
MIDI MIDI (; Musical Instrument Digital Interface) is a technical standard that describes a communications protocol, digital interface, and electrical connectors that connect a wide variety of electronic musical instruments, computers, and ...
files. Since audio signals usually comprise multiple sound sources, then unlike speech signals that can be efficiently described in terms of specific models (such as source-filter model), it is hard to devise a parametric representation for general audio. Parametric audio representations usually use
filter bank In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input signal into multiple components, each one carrying a single frequency sub-band of the original signal. One application of a filter bank is ...
s or
sinusoid A sine wave, sinusoidal wave, or just sinusoid is a mathematical curve defined in terms of the ''sine'' trigonometric function, of which it is the graph. It is a type of continuous wave and also a smooth periodic function. It occurs often in ...
al models to capture multiple sound parameters, sometimes increasing the representation size in order to capture internal structure in the signal. Additional types of data that are relevant for computer audition are textual descriptions of audio contents, such as annotations, reviews, and visual information in the case of audio-visual recordings.


Features

Description of contents of general audio signals usually requires extraction of features that capture specific aspects of the audio signal. Generally speaking, one could divide the features into signal or mathematical descriptors such as energy, description of spectral shape etc., statistical characterization such as change or novelty detection, special representations that are better adapted to the nature of musical signals or the auditory system, such as logarithmic growth of sensitivity (
bandwidth Bandwidth commonly refers to: * Bandwidth (signal processing) or ''analog bandwidth'', ''frequency bandwidth'', or ''radio bandwidth'', a measure of the width of a frequency range * Bandwidth (computing), the rate of data transfer, bit rate or thr ...
) in frequency or
octave In music, an octave ( la, octavus: eighth) or perfect octave (sometimes called the diapason) is the interval between one musical pitch and another with double its frequency. The octave relationship is a natural phenomenon that has been refer ...
invariance (chroma). Since parametric models in audio usually require very many parameters, the features are used to summarize properties of multiple parameters in a more compact or salient representation.


Musical knowledge

Finding specific musical structures is possible by using musical knowledge as well as supervised and unsupervised machine learning methods. Examples of this include detection of tonality according to distribution of frequencies that correspond to patterns of occurrence of notes in musical scales, distribution of note onset times for detection of beat structure, distribution of energies in different frequencies to detect musical chords and so on.


Sound similarity and sequence modeling

Comparison of sounds can be done by comparison of features with or without reference to time. In some cases an overall similarity can be assessed by close values of features between two sounds. In other cases when temporal structure is important, methods of dynamic time warping need to be applied to "correct" for different temporal scales of acoustic events. Finding repetitions and similar sub-sequences of sonic events is important for tasks such as texture synthesis and machine improvisation.


Source separation

Since one of the basic characteristics of general audio is that it comprises multiple simultaneously sounding sources, such as multiple musical instruments, people talking, machine noises or animal vocalization, the ability to identify and separate individual sources is very desirable. Unfortunately, there are no methods that can solve this problem in a
robust Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
fashion. Existing methods of source separation rely sometimes on correlation between different audio channels in multi-channel recordings. The ability to separate sources from stereo signals requires different techniques than those usually applied in communications where multiple sensors are available. Other source separation methods rely on training or clustering of features in mono recording, such as tracking harmonically related partials for multiple pitch detection. Some methods, before explicit recognition, rely on revealing structures in data without knowing the structures (like recognizing objects in abstract pictures without attributing them meaningful labels) by finding the least complex data representations, for instance describing audio scenes as generated by a few tone patterns and their trajectories (polyphonic voices) and acoustical contours drawn by a tone (chords).


Auditory cognition

Listening to music and general audio is commonly not a task directed activity. People enjoy music for various poorly understood reasons, which are commonly referred to the emotional effect of music due to creation of expectations and their realization or violation. Animals attend to signs of danger in sounds, which could be either specific or general notions of surprising and unexpected change. Generally, this creates a situation where computer audition can not rely solely on detection of specific features or sound properties and has to come up with general methods of adapting to changing auditory environment and monitoring its structure. This consists of analysis of larger repetition and
self-similarity __NOTOC__ In mathematics, a self-similar object is exactly or approximately similar to a part of itself (i.e., the whole has the same shape as one or more of the parts). Many objects in the real world, such as coastlines, are statistically se ...
structures in audio to detect innovation, as well as ability to predict local feature dynamics.


Multi-modal analysis

Among the available data for describing music, there are textual representations, such as liner notes, reviews and criticisms that describe the audio contents in words. In other cases human reactions such as emotional judgements or psycho-physiological measurements might provide an insight into the contents and structure of audio. Computer Audition tries to find relation between these different representations in order to provide this additional understanding of the audio contents.


See also

*
3D sound localization 3D sound localization refers to an acoustic technology that is used to locate the source of a sound in a three-dimensional space. The source location is usually determined by the direction of the incoming sound waves (horizontal and vertical angles ...
*
Audio signal processing Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves— longitudinal waves which travel through air, consist ...
*
List of emerging technologies This is a list of emerging technologies, in-development technical innovations with significant potential in their applications. The criteria for this list is that the technology must: # Exist in some way; purely hypothetical technologies ca ...
* Medical intelligence and language engineering lab *
Music and artificial intelligence Artificial intelligence and music (AIM) is a common subject in the International Computer Music Conference, the Computing Society Conference and the International Joint Conference on Artificial Intelligence. The first International Computer Music ...
*
Sound recognition Sound recognition is a technology, which is based on both traditional pattern recognition theories and audio signal analysis methods. Sound recognition technologies contain preliminary data processing, feature extraction and classification algori ...


External links


UCSD Computer Audition Lab





Department of Electrical Engineering, IIT (Bangalore)

Sound and Music Computing, Aalborg University Copenhagen, Denmark


References

{{Computer audition Artificial intelligence Digital signal processing