Multimodal Representation Learning

Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis.


Motivation

The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than any single modality alone. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. The resulting unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities, and they facilitate cross-modal translation, where information is converted from one modality to another, as in image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques.


Approaches and methods


Canonical-correlation analysis based methods

Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X \in \mathbb{R}^{n \times p} and Y \in \mathbb{R}^{n \times q} representing different modalities, CCA finds projection vectors w_x \in \mathbb{R}^{p} and w_y \in \mathbb{R}^{q} that maximize the correlation between the projected variables: \rho = \max_{w_x, w_y} \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}} where \Sigma_{xx} and \Sigma_{yy} are the within-modality covariance matrices, and \Sigma_{xy} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA.
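The following minimal sketch illustrates linear CCA on two synthetic modalities that share common latent factors; the data, dimensions, and the use of scikit-learn's CCA estimator are illustrative assumptions, not part of the original formulation.

```python
# Minimal linear CCA sketch on synthetic two-modality data (illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))                                         # shared "semantic" factors
X = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(n, 20))   # modality 1 (e.g., image features)
Y = latent @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(n, 30))   # modality 2 (e.g., text features)

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)   # projections of both modalities into the shared space

# Canonical correlations between paired projected components.
corrs = [np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1] for k in range(2)]
print(corrs)   # close to 1 here, since both views are driven by the same latent factors
```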


Kernel CCA

Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high-dimensional feature spaces using kernel functions. Given kernel functions k_x and k_y with corresponding Gram matrices K_x \in \mathbb{R}^{n \times n} and K_y \in \mathbb{R}^{n \times n}, KCCA seeks coefficient vectors \alpha and \beta that maximize: \rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x^2 \alpha}\,\sqrt{\beta^\top K_y^2 \beta}} To prevent overfitting, regularization terms are typically added, resulting in: \rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top (K_x^2 + \lambda_x K_x) \alpha}\,\sqrt{\beta^\top (K_y^2 + \lambda_y K_y) \beta}} where \lambda_x and \lambda_y are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to the O(n^2) memory required to store the kernel matrices. KCCA was proposed independently by several researchers.
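As an illustration, regularized KCCA can be solved as a generalized eigenvalue problem over the coefficient vectors \alpha and \beta. The sketch below assumes RBF kernels and arbitrary regularization values; it is one possible formulation consistent with the objective above, not a reference implementation.

```python
# Regularized kernel CCA as a generalized eigenvalue problem (illustrative sketch).
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def kcca(X, Y, lam=0.1, gamma=1.0):
    n = X.shape[0]
    # Centered Gram matrices for the two modalities.
    H = np.eye(n) - np.ones((n, n)) / n
    Kx = H @ rbf_kernel(X, gamma=gamma) @ H
    Ky = H @ rbf_kernel(Y, gamma=gamma) @ H
    # Generalized eigenproblem  A v = rho * B v  with v = [alpha; beta].
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + lam * Kx, Z], [Z, Ky @ Ky + lam * Ky]])
    B += 1e-8 * np.eye(2 * n)           # small ridge keeps B positive definite
    rho, v = eigh(A, B)                 # eigenvalues returned in ascending order
    alpha, beta = v[:n, -1], v[n:, -1]  # leading canonical directions
    return rho[-1], alpha, beta
```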


Deep CCA

Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations that maximize the correlation between modalities. DCCA uses separate neural networks f_x and f_y for each modality to transform the original data before applying CCA: \max_{\theta_x, \theta_y, W_x, W_y} \operatorname{corr}\left( W_x^\top f_x(X; \theta_x),\, W_y^\top f_y(Y; \theta_y) \right) where \theta_x and \theta_y represent the parameters of the neural networks, and W_x and W_y are the CCA projection matrices. The correlation objective is computed as: \operatorname{corr}(H_x, H_y) = \left\| T^{-1/2} H_x^\top H_y\, S^{-1/2} \right\|_{\mathrm{tr}} the trace norm (sum of singular values) of the whitened cross-covariance matrix, where H_x = f_x(X) and H_y = f_y(Y) are the network outputs, T = H_x^\top H_x + r_x I, S = H_y^\top H_y + r_y I, and r_x, r_y are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization.
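A rough sketch of the DCCA objective as a trainable loss, assuming a PyTorch implementation with small multilayer perceptrons as f_x and f_y; the network sizes and regularization value are illustrative. The loss is the negative trace norm of the whitened cross-covariance, following the formulas above.

```python
# Sketch of the DCCA correlation objective as a PyTorch loss (illustrative sizes).
import torch
import torch.nn as nn

class DCCA(nn.Module):
    def __init__(self, dim_x, dim_y, out_dim=10, r=1e-3):
        super().__init__()
        # Modality-specific networks f_x and f_y (small MLPs for illustration).
        self.f_x = nn.Sequential(nn.Linear(dim_x, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.f_y = nn.Sequential(nn.Linear(dim_y, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.r = r  # regularization parameters r_x = r_y = r

    def corr_loss(self, Hx, Hy):
        # Center the network outputs (rows are samples).
        Hx = Hx - Hx.mean(dim=0, keepdim=True)
        Hy = Hy - Hy.mean(dim=0, keepdim=True)
        I = torch.eye(Hx.shape[1], device=Hx.device)
        T = Hx.T @ Hx + self.r * I   # T = H_x^T H_x + r_x I
        S = Hy.T @ Hy + self.r * I   # S = H_y^T H_y + r_y I

        def inv_sqrt(M):
            vals, vecs = torch.linalg.eigh(M)
            return vecs @ torch.diag(vals.clamp_min(1e-10).rsqrt()) @ vecs.T

        # Whitened cross-covariance; its singular values are the canonical correlations.
        M = inv_sqrt(T) @ (Hx.T @ Hy) @ inv_sqrt(S)
        return -torch.linalg.matrix_norm(M, ord="nuc")  # maximize total correlation

    def forward(self, x, y):
        return self.corr_loss(self.f_x(x), self.f_y(y))
```

Training then amounts to minimizing this loss over mini-batches of paired samples with any standard optimizer.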


Graph-based methods

Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such family of methods, cross-modal graph neural networks (CMGNNs), extends traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing entities as nodes and their relationships as edges. Other graph-based methods include probabilistic graphical models (PGMs) such as deep belief networks (DBNs) and deep Boltzmann machines (DBMs). These models can learn a joint representation across modalities; for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains, such as the view hierarchy of app screens in human-computer interaction (HCI), can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks.
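As an illustration of the graph-construction step described above, the sketch below builds a joint graph over paired image and text features, with k-nearest-neighbour intra-modal edges and inter-modal edges linking paired items. The pairing rule, the value of k, and the feature inputs are assumptions for illustration rather than part of any specific CMGNN.

```python
# Build a cross-modal graph: intra-modal kNN edges plus inter-modal pairing edges.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def cross_modal_adjacency(img_feats, txt_feats, k=5):
    n = img_feats.shape[0]                       # assume the i-th image is paired with the i-th text
    A_img = kneighbors_graph(img_feats, k, mode="connectivity").toarray()   # intra-modal edges (images)
    A_txt = kneighbors_graph(txt_feats, k, mode="connectivity").toarray()   # intra-modal edges (texts)
    A_cross = np.eye(n)                          # inter-modal edge between each paired (image, text)
    # Block adjacency over 2n nodes: the first n are images, the last n are texts.
    A = np.block([[A_img, A_cross], [A_cross.T, A_txt]])
    return np.maximum(A, A.T)                    # symmetrize so the graph is undirected

# A GNN (for example, one built with PyTorch Geometric) could then pass messages over
# this graph so that intra- and inter-modal neighbours end up with nearby embeddings.
```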


Diffusion maps

Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities.


Multi-view diffusion maps

Multi-view diffusion maps address the challenge of multi-view dimensionality reduction by exploiting the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to use both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model in which a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associated diffusion distances. The spectral decomposition of this kernel enables the discovery of an embedding that better leverages the information from all views. This method has demonstrated utility in various machine learning tasks, including classification, clustering, and manifold learning.
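A rough numerical sketch of the idea for two views, assuming Gaussian affinities and one particular way of combining the within-view kernels into a cross-view kernel; the bandwidth and embedding dimension are illustrative.

```python
# Multi-view diffusion maps sketch: a cross-view kernel over two views of the same objects.
import numpy as np
from scipy.spatial.distance import cdist

def view_kernel(X, eps):
    return np.exp(-cdist(X, X, "sqeuclidean") / eps)   # Gaussian affinity within one view

def multiview_diffusion_embedding(X1, X2, eps=1.0, dim=2):
    K1, K2 = view_kernel(X1, eps), view_kernel(X2, eps)
    n = K1.shape[0]
    # Cross-view kernel: transitions go from a point in one view to a point in the other,
    # passing through both views' affinities (one possible way to combine the relations).
    K_hat = np.block([[np.zeros((n, n)), K1 @ K2],
                      [K2 @ K1, np.zeros((n, n))]])
    P = K_hat / K_hat.sum(axis=1, keepdims=True)        # row-stochastic cross-view operator
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    # Drop the trivial leading eigenvector; the next coordinates embed the 2n
    # view-instances (first n rows: view 1, last n rows: view 2).
    return np.real(vecs[:, order[1:dim + 1]] * vals[order[1:dim + 1]])
```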


Alternating diffusion

Alternating diffusion based methods provide another strategy for multimodal representation learning by focusing on extracting the common underlying sources of variability present across multiple views or sensors. These methods aim to filter out sensor-specific or nuisance components, assuming that the phenomenon of interest is captured by two or more sensors. The core idea involves constructing an alternating diffusion operator by sequentially applying diffusion processes derived from each modality, typically through their product or intersection. This allows the method to capture the structure related to the common hidden variables that drive the observed multimodal data (Katz et al., 2019).
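A minimal sketch of the alternating-diffusion construction for two sensors observing the same n samples, assuming Gaussian affinities and the product-of-operators form; the bandwidth and embedding dimension are illustrative.

```python
# Alternating diffusion sketch: compose per-sensor diffusion operators so that
# only structure shared by both sensors survives the alternation.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_operator(X, eps):
    W = np.exp(-cdist(X, X, "sqeuclidean") / eps)   # Gaussian affinity for one sensor
    return W / W.sum(axis=1, keepdims=True)          # row-stochastic random-walk matrix

def alternating_diffusion_embedding(X1, X2, eps=1.0, dim=2):
    # One alternating step: diffuse with sensor 1, then with sensor 2.
    P = diffusion_operator(X2, eps) @ diffusion_operator(X1, eps)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    # Non-trivial eigenvectors parametrize the common hidden variables;
    # sensor-specific variability is suppressed by the composition.
    return np.real(vecs[:, order[1:dim + 1]])
```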


See also

* Representation learning
* Canonical correlation
* Deep learning
* Multimodal learning
* Nonlinear dimensionality reduction


References

* Katz, Ori; Talmon, Ronen; Lo, Yu-Lun; Wu, Hau-Tieng (January 2019). "Alternating diffusion maps for multimodal data fusion". Information Fusion. 45: 346–360. doi:10.1016/j.inffus.2018.01.007.
