Multimedia information retrieval (MMIR or MIR) is a research discipline of computer science that aims at extracting semantic information from multimedia data sources (H. Eidenberger, ''Fundamental Media Understanding'', atpress, 2011, p. 1). Data sources include directly perceivable media such as audio, image and video, indirectly perceivable sources such as text, semantic descriptions and biosignals, as well as non-perceivable sources such as bioinformation, stock prices, etc. The methodology of MMIR can be organized in three groups:

1. Methods for the summarization of media content (feature extraction). The result of feature extraction is a description.
2. Methods for the filtering of media descriptions (for example, elimination of redundancy).
3. Methods for the categorization of media descriptions into classes.
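The three groups can be read as stages of one pipeline. The following Python sketch is only an illustration under stated assumptions: the function names, the choice of mean, standard deviation and range as features, the constant-dimension filter and the nearest-neighbour rule are all hypothetical stand-ins, not methods prescribed by the source.

```python
import math

def extract(signal):
    """Stage 1, feature extraction: describe a signal as [mean, std, range]."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    return [mean, math.sqrt(var), max(signal) - min(signal)]

def filter_redundant(descriptions, eps=1e-9):
    """Stage 2, filtering: drop dimensions that are constant across all descriptions.

    Returns the filtered descriptions and the indices of the kept dimensions,
    so that the same projection can be applied to a later query.
    """
    dims = range(len(descriptions[0]))
    keep = [d for d in dims
            if max(v[d] for v in descriptions) - min(v[d] for v in descriptions) > eps]
    return [[v[d] for d in keep] for v in descriptions], keep

def categorize(query, labelled):
    """Stage 3, categorization: label of the nearest description (Euclidean distance)."""
    return min(labelled, key=lambda item: math.dist(query, item[0]))[1]

# Toy data (hypothetical): two labelled signals and one query signal.
training = [([0, 1, 0, -1], "quiet"), ([0, 5, 0, -5], "loud")]
descriptions, keep = filter_redundant([extract(s) for s, _ in training])
labelled = list(zip(descriptions, [label for _, label in training]))

query = extract([0, 4, 0, -4])
print(categorize([query[d] for d in keep], labelled))
```

In this toy data every signal has mean zero, so the filtering stage removes that dimension as redundant before the distance is computed.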

Feature extraction methods

Feature extraction is motivated by the sheer size of multimedia objects as well as their redundancy and, possibly, noisiness. Generally, two possible goals can be achieved by feature extraction:

* Summarization of media content. In the audio domain, methods for summarization include, for example, mel-frequency cepstral coefficients, zero-crossing rate and short-time energy. In the visual domain, color histograms such as the MPEG-7 Scalable Color Descriptor can be used for summarization.
* Detection of patterns by auto-correlation and/or cross-correlation. Patterns are recurring media chunks that can be detected either by comparing chunks over the media dimensions (time, space, etc.) or by comparing media chunks to templates (e.g. face templates, phrases). Typical methods include linear predictive coding in the audio/biosignal domain, texture description in the visual domain and n-grams in text information retrieval.
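Both goals can be illustrated on a one-dimensional signal. The sketch below is a minimal pure-Python illustration, not a reference implementation: the function names are hypothetical, frames are assumed to hold at least two samples, and the auto-correlation is left unnormalized, which slightly favours short lags.

```python
def frames(signal, size, hop):
    """Split a signal into frames of `size` samples, advancing by `hop` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (summarization)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of one frame (summarization)."""
    return sum(x * x for x in frame) / len(frame)

def autocorrelation(signal, lag):
    """Unnormalized correlation of the signal with itself shifted by `lag`."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))

def dominant_period(signal, max_lag):
    """Lag in 1..max_lag with the highest auto-correlation (pattern detection)."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(signal, lag))

# A toy signal that repeats the chunk [1, 0, -1, 0] four times.
signal = [1, 0, -1, 0] * 4
for frame in frames(signal, size=8, hop=8):
    print(zero_crossing_rate(frame), short_time_energy(frame))
print(dominant_period(signal, max_lag=6))
```

The per-frame values form a compact description of the signal, while the lag returned by `dominant_period` is the length of the recurring chunk.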

Merging and filtering methods

Multimedia information retrieval implies that multiple channels are employed for the understanding of media content. Each of these channels is described by media-specific feature transformations. The resulting descriptions have to be merged into one description per media object. Merging can be performed by simple concatenation if the descriptions are of fixed size. Variable-sized descriptions, as they frequently occur in motion description, have to be normalized to a fixed length first. Frequently used methods for description filtering include factor analysis (e.g. by PCA), singular value decomposition (e.g. as latent semantic indexing in text retrieval) and the extraction and testing of statistical moments. Advanced concepts such as the Kalman filter are used for merging of descriptions.
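Merging by concatenation can be sketched in a few lines. The source does not say how variable-sized descriptions are normalized, so linear interpolation is used here purely as an assumption; the function names are hypothetical, and `moments` shows the first two statistical moments as one possible compact summary.

```python
def resample(description, length):
    """Normalize a variable-length description to `length` values by linear interpolation."""
    if len(description) == 1 or length == 1:
        return [float(description[0])] * length
    step = (len(description) - 1) / (length - 1)
    out = []
    for k in range(length):
        pos = k * step
        lo = int(pos)
        hi = min(lo + 1, len(description) - 1)
        frac = pos - lo
        out.append(description[lo] * (1 - frac) + description[hi] * frac)
    return out

def moments(description):
    """Mean and variance of a description (first two statistical moments)."""
    n = len(description)
    mean = sum(description) / n
    return [mean, sum((x - mean) ** 2 for x in description) / n]

def merge(fixed, variable, length):
    """Concatenate a fixed-size description with a length-normalized variable one."""
    return list(fixed) + resample(variable, length)

# Hypothetical example: a 2-value color description and a motion track of 5 values.
color = [0.2, 0.8]
motion = [0, 2, 4, 6, 8]
print(merge(color, motion, length=3))
print(moments(motion))
```

After this step every media object is represented by a vector of the same length, which is what most filtering and categorization methods expect.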

Categorization methods

Generally, all forms of machine learning can be employed for the categorization of multimedia descriptions, though some methods are more frequently used in one area than another. For example, hidden Markov models are state-of-the-art in speech recognition, while dynamic time warping, a semantically related method, is state-of-the-art in gene sequence alignment. The list of applicable classifiers includes the following:

* Metric approaches (cluster analysis, vector space model, Minkowski distances, dynamic alignment)
* Nearest neighbor methods (k-nearest neighbors algorithm, k-means, self-organizing map)
* Risk minimization (support vector regression, support vector machine, linear discriminant analysis)
* Density-based methods (Bayes nets, Markov processes, mixture models)
* Neural networks (perceptron, associative memories, spiking nets)
* Heuristics (decision trees, random forests, etc.)

The selection of the best classifier for a given problem (test set with descriptions and class labels, so-called ground truth) can be performed automatically, for example, using the Weka Data Miner.

Models of Multimedia Information Retrieval

Spoken Language Audio Retrieval

Spoken Language Audio Retrieval focuses on audio content containing spoken words. It involves the transcription of spoken content into text using Automatic Speech Recognition (ASR) and indexing the transcriptions for text-based search.

Key Features:
* Techniques: ASR for transcription and text indexing.
* Query Types: Text-based queries.

Applications:
* Searching podcast transcripts.
* Analyzing customer service call logs.
* Finding specific phrases in meeting recordings.

Challenges:
* Errors in ASR can reduce retrieval accuracy.
* Multilingual and accent variability requires robust systems.

Non-Speech Audio Retrieval

Non-Speech Audio Retrieval handles audio content without spoken words, such as music, environmental sounds, or sound effects. This model relies on extracting audio features like pitch, rhythm, and timbre to identify relevant audio.

Key Features:
* Techniques: Acoustic feature extraction (e.g., spectrograms, MFCCs).
* Query Types: Audio samples or textual descriptions.

Applications:
* Music recommendation systems.
* Environmental sound detection (e.g., gunshots, animal calls).
* Sound effect retrieval in media production.

Challenges:
* Difficulty in bridging the semantic gap between user queries and low-level audio features.
* Efficient indexing of large datasets.

Graph Retrieval

Graph Retrieval retrieves information represented as graphs, which consist of nodes (entities) and edges (relationships). It is widely used in social networks, knowledge graphs, and bioinformatics.

Key Features:
* Techniques: Graph matching, adjacency list/matrix storage, and graph databases (e.g., Neo4j).
* Query Types: Subgraphs, patterns, or textual queries.

Applications:
* Social network analysis.
* Searching knowledge graphs.
* Molecular structure retrieval.

Challenges:
* Computationally intensive subgraph matching.
* Scalability for large, complex graphs.
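As a small illustration of adjacency-list storage and subgraph queries, the sketch below stores a directed graph as a dictionary of neighbour sets and finds all occurrences of a query pattern by naive backtracking. It is a toy under stated assumptions: the function names and the example graph are hypothetical, and the search is exponential in the worst case, which is why subgraph matching is listed as a challenge.

```python
def add_edge(graph, source, target):
    """Insert a directed edge into an adjacency-list graph (dict of neighbour sets)."""
    graph.setdefault(source, set()).add(target)
    graph.setdefault(target, set())

def find_matches(pattern, graph):
    """Return every mapping of pattern nodes to distinct graph nodes
    such that each pattern edge is also an edge of the graph."""
    pattern_nodes = list(pattern)
    matches = []

    def extend(mapping):
        if len(mapping) == len(pattern_nodes):
            matches.append(dict(mapping))
            return
        node = pattern_nodes[len(mapping)]
        for candidate in graph:
            if candidate in mapping.values():
                continue  # each graph node may be used only once
            mapping[node] = candidate
            # Check all pattern edges whose two endpoints are already mapped.
            consistent = all(mapping[t] in graph[mapping[s]]
                             for s in mapping for t in pattern[s] if t in mapping)
            if consistent:
                extend(mapping)
            del mapping[node]

    extend({})
    return matches

# Hypothetical social graph: who follows whom.
graph = {}
for source, target in [("alice", "bob"), ("bob", "carol"),
                       ("alice", "carol"), ("carol", "dave")]:
    add_edge(graph, source, target)

# Query pattern: x follows y, y follows z, and x also follows z.
pattern = {}
for source, target in [("x", "y"), ("y", "z"), ("x", "z")]:
    add_edge(pattern, source, target)

print(find_matches(pattern, graph))
```

Graph databases answer such pattern queries with indexes and query planners rather than exhaustive search.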
Imagery Retrieval

Imagery Retrieval retrieves images based on user input, such as textual descriptions or visual samples. It leverages both low-level features and semantic analysis for search.

Key Features:
* Techniques: Content-Based Image Retrieval (CBIR), visual feature extraction, semantic analysis.
* Query Types: Text, sketches, or example images.

Applications:
* Stock image search.
* E-commerce product matching.
* Medical imaging analysis.

Challenges:
* Bridging the semantic gap between user queries and image content.
* Efficient indexing of large-scale image datasets.

Video Retrieval

Video Retrieval is the process of finding specific video content based on user queries. It involves analyzing both the visual and temporal features of videos.

Key Features:
* Techniques: Keyframe extraction, motion pattern analysis, temporal indexing.
* Query Types: Textual descriptions, sample clips, or temporal queries.

Applications:
* Streaming service recommendations.
* Surveillance footage analysis.
* Sports analytics.

Challenges:
* Managing the large file sizes of video content.
* Efficient analysis of temporal sequences and multimodal features.

Comparison of Retrieval Models

| Model | Data Type | Query Types | Applications |
|---|---|---|---|
| Spoken Language Audio | Speech recordings | Text queries | Podcasts, meeting logs, call centers |
| Non-Speech Audio | Music, sound effects | Audio samples or text | Music apps, environmental sounds |
| Graph Retrieval | Graph structures | Subgraphs, patterns | Knowledge graphs, bioinformatics |
| Imagery Retrieval | Images | Text, sketches, or images | E-commerce, medical imaging |
| Video Retrieval | Videos (visual + temporal) | Text, clips, or time queries | Surveillance, sports analysis |

Conclusion

Multimedia Information Retrieval plays a crucial role in organizing and accessing vast multimedia data repositories. The variety of retrieval models ensures that users can effectively interact with and extract insights from complex multimedia datasets. Future advancements in artificial intelligence and machine learning are expected to improve the accuracy and scalability of MIR systems.

Related areas

MMIR provides an overview of methods employed in the areas of information retrieval. Methods of one area are adapted and employed on other types of media. Multimedia content is merged before the classification is performed. MMIR methods are, therefore, usually reused from other areas such as:

* Bioinformation analysis
* Biosignal processing
* Content-based image and video retrieval
* Face recognition
* Audio and music classification (Music information retrieval)
* Automatic content recognition
* Speech recognition
* Technical chart analysis
* Video browsing
* Text information retrieval
* Image retrieval
* Learning to rank

The International Journal of Multimedia Information Retrieval documents the development of MMIR as a research discipline that is independent of these areas. See also the ''Handbook of Multimedia Information Retrieval'' (H. Eidenberger, atpress, 2012) for a complete overview of this research discipline.
