Multimedia information retrieval (MMIR or MIR) is a research discipline of computer science that aims at extracting semantic information from multimedia data sources (H. Eidenberger, ''Fundamental Media Understanding'', atpress, 2011, p. 1). Data sources include directly perceivable media such as audio, image and video, indirectly perceivable sources such as text, semantic descriptions and biosignals, as well as non-perceivable sources such as bioinformation and stock prices. The methodology of MMIR can be organized in three groups:

# Methods for the summarization of media content (feature extraction). The result of feature extraction is a description.
# Methods for the filtering of media descriptions (for example, elimination of redundancy).
# Methods for the categorization of media descriptions into classes.
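The three groups can be read as stages of one pipeline. The following is only a minimal sketch under stated assumptions: the function names, the toy signals and the class labels are hypothetical, the description is a plain list of mean, variance and minimum, redundancy elimination is reduced to dropping dimensions that never vary, and categorization is a simple nearest-centroid rule. None of these choices is prescribed by the text.

```python
from statistics import mean, pvariance


def extract(signal):
    """Stage 1, feature extraction: summarize a signal as [mean, variance, minimum]."""
    return [mean(signal), pvariance(signal), min(signal)]


def constant_dimensions(descriptions):
    """Stage 2, filtering: indices of dimensions that are identical in every description."""
    width = len(descriptions[0])
    return [i for i in range(width)
            if pvariance([d[i] for d in descriptions]) == 0]


def drop_dimensions(description, drop):
    """Remove the redundant dimensions found by constant_dimensions."""
    return [v for i, v in enumerate(description) if i not in drop]


def centroid(descriptions):
    """Component-wise mean of a list of descriptions."""
    return [mean(column) for column in zip(*descriptions)]


def categorize(description, centroids):
    """Stage 3, categorization: label of the closest class centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids,
               key=lambda label: squared_distance(description, centroids[label]))


# Hypothetical labeled signals (the ground truth of this toy example).
training = {
    "quiet": [[0, 1, 0, 1], [0, 2, 0, 1]],
    "loud": [[0, 8, 0, 9], [0, 9, 0, 8]],
}

described = {label: [extract(s) for s in signals]
             for label, signals in training.items()}
all_descriptions = [d for ds in described.values() for d in ds]
drop = constant_dimensions(all_descriptions)  # the minimum is 0 in every signal
centroids = {label: centroid([drop_dimensions(d, drop) for d in ds])
             for label, ds in described.items()}

query = drop_dimensions(extract([0, 7, 0, 8]), drop)
print(categorize(query, centroids))
```

In this toy data the minimum is the same for every signal, so the filtering stage removes it before the centroids are computed.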

Feature extraction methods

Feature extraction is motivated by the sheer size of multimedia objects as well as their redundancy and, possibly, noisiness. Generally, two possible goals can be achieved by feature extraction:

* Summarization of media content. In the audio domain, methods for summarization include, for example, mel-frequency cepstral coefficients, zero-crossing rate and short-time energy. In the visual domain, color histograms such as the MPEG-7 Scalable Color Descriptor can be used for summarization.
* Detection of patterns by auto-correlation and/or cross-correlation. Patterns are recurring media chunks that can be detected either by comparing chunks over the media dimensions (time, space, etc.) or by comparing media chunks to templates (e.g. face templates, phrases). Typical methods include linear predictive coding in the audio/biosignal domain, texture description in the visual domain and n-grams in text information retrieval.
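Both goals can be illustrated on a one-dimensional signal. The sketch below assumes a synthetic sine wave with a period of 20 samples as a stand-in for an audio frame. It computes the zero-crossing rate and short-time energy as summarizing features, and uses unnormalized auto-correlation to find the lag at which the signal repeats. The function names and the lag search range are illustrative choices, not taken from a particular library.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def autocorrelation(signal, lag):
    """Unnormalized correlation of the signal with a copy delayed by `lag` samples."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def dominant_period(signal, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest auto-correlation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(signal, lag))


# Synthetic test signal (an assumption for illustration): 10 periods of a sine
# wave with 20 samples per period; the half-sample offset avoids exact zeros.
PERIOD = 20
signal = [math.sin(2 * math.pi * (n + 0.5) / PERIOD) for n in range(200)]

print(zero_crossing_rate(signal))      # sign changes per sample pair
print(short_time_energy(signal))       # close to 0.5 for a unit sine
print(dominant_period(signal, 5, 30))  # recovers the period in samples
```

Because the auto-correlation is not normalized by the number of overlapping samples, it slightly favors short lags, which is why the search range is restricted here.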

Merging and filtering methods

Multimedia information retrieval implies that multiple channels are employed for the understanding of media content. Each of these channels is described by media-specific feature transformations. The resulting descriptions have to be merged into one description per media object. Merging can be performed by simple concatenation if the descriptions are of fixed size. Variable-sized descriptions, as they frequently occur in motion description, have to be normalized to a fixed length first. Frequently used methods for description filtering include factor analysis (e.g. by PCA), singular value decomposition (e.g. as latent semantic indexing in text retrieval) and the extraction and testing of statistical moments. Advanced concepts such as the Kalman filter are used for merging of descriptions.
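The simplest merging case described above, concatenation after length normalization, can be sketched as follows. Linear interpolation is assumed here as the normalization method, which is only one of several reasonable choices. The example values (a three-bin color histogram and a five-step motion track) are made up for illustration.

```python
def resample(values, target_length):
    """Linearly interpolate a variable-length sequence to a fixed length."""
    if target_length < 2 or len(values) < 2:
        raise ValueError("need at least two input values and two output values")
    step = (len(values) - 1) / (target_length - 1)
    result = []
    for k in range(target_length):
        position = k * step
        # Clamp so that left + 1 is always a valid index, also at the last sample.
        left = min(int(position), len(values) - 2)
        fraction = position - left
        result.append(values[left] * (1 - fraction) + values[left + 1] * fraction)
    return result


def merge(descriptions):
    """Concatenate fixed-size descriptions into one description per media object."""
    merged = []
    for description in descriptions:
        merged.extend(description)
    return merged


# Hypothetical channel descriptions of one video clip.
color_histogram = [0.5, 0.25, 0.25]        # already of fixed size
motion_track = [0.0, 2.0, 4.0, 6.0, 8.0]   # length varies from clip to clip

fixed_motion = resample(motion_track, 3)
description = merge([color_histogram, fixed_motion])
print(description)  # [0.5, 0.25, 0.25, 0.0, 4.0, 8.0]
```

After this step every media object has a description of the same length (six values in this example), which is what distance-based categorization methods generally require.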

Categorization methods

Generally, all forms of machine learning can be employed for the categorization of multimedia descriptions, though some methods are more frequently used in one area than another. For example, hidden Markov models are state-of-the-art in speech recognition, while dynamic time warping, a semantically related method, is state-of-the-art in gene sequence alignment. The list of applicable classifiers includes the following:

* Metric approaches (cluster analysis, vector space model, Minkowski distances, dynamic alignment)
* Nearest neighbor methods (k-nearest neighbors algorithm, k-means, self-organizing map)
* Risk minimization (support vector regression, support vector machine, linear discriminant analysis)
* Density-based methods (Bayes nets, Markov processes, mixture models)
* Neural networks (perceptron, associative memories, spiking nets)
* Heuristics (decision trees, random forests, etc.)

The selection of the best classifier for a given problem (test set with descriptions and class labels, so-called ground truth) can be performed automatically, for example, using the Weka Data Miner.
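As one concrete instance from the list, the sketch below combines a Minkowski distance with a k-nearest neighbors classifier and then picks the value of k that performs best on a small ground truth by leave-one-out evaluation. This is a hand-written illustration of the idea of automatic selection, not the procedure used by Weka. The descriptions, the labels and the candidate values of k are invented for the example.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1: city block, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_classify(query, labeled, k=3, p=2):
    """Majority label among the k labeled descriptions closest to the query."""
    neighbors = sorted(labeled, key=lambda item: minkowski(query, item[0], p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(labeled, k, p=2):
    """Classify each item using all other items and report the hit rate."""
    hits = 0
    for i, (description, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        hits += knn_classify(description, rest, k, p) == label
    return hits / len(labeled)


# Hypothetical ground truth: two-dimensional descriptions with class labels.
ground_truth = [
    ([0.10, 0.20], "speech"), ([0.20, 0.10], "speech"), ([0.15, 0.25], "speech"),
    ([0.90, 0.80], "music"), ([0.80, 0.90], "music"), ([0.85, 0.75], "music"),
]

# Select the candidate k with the highest leave-one-out accuracy.
best_k = max([1, 3, 5], key=lambda k: leave_one_out_accuracy(ground_truth, k))
print(best_k, knn_classify([0.7, 0.8], ground_truth, k=best_k))
```

With only three items per class, k = 5 always outvotes the true class of a left-out item, which shows how strongly the result depends on the amount of labeled data.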

Open problems

The quality of MMIR systems depends heavily on the quality of the training data. Discriminative descriptions can be extracted from media sources in various forms, and machine learning provides categorization methods for all types of data. However, a classifier can only be as good as the training data it is given, and providing class labels for large databases requires considerable effort. The future success of MMIR will depend on the provision of such data. The annual TRECVID competition is currently one of the most relevant sources of high-quality ground truth.

Related areas

MMIR provides an overview of methods employed in the areas of information retrieval. Methods of one area are adapted and employed on other types of media. Multimedia content is merged before the classification is performed. MMIR methods are, therefore, usually reused from other areas such as:

* Bioinformation analysis
* Biosignal processing
* Content-based image and video retrieval
* Face recognition
* Audio and music classification (music information retrieval)
* Automatic content recognition
* Speech recognition
* Technical chart analysis
* Video browsing
* Text information retrieval
* Image retrieval
* Learning to rank

The ''International Journal of Multimedia Information Retrieval'' documents the development of MMIR as a research discipline that is independent of these areas. See also the ''Handbook of Multimedia Information Retrieval'' (H. Eidenberger, atpress, 2012) for a complete overview of this research discipline.
