Multimodal Representation Learning

Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis.


Motivation

The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than any single modality alone. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. The resulting unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities, and they facilitate cross-modal translation, where information is converted from one modality to another, as in image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques.


Approaches and methods


Canonical-correlation analysis based methods

Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X \in \mathbb{R}^{n \times p} and Y \in \mathbb{R}^{n \times q} representing different modalities, CCA finds projection vectors w_x \in \mathbb{R}^{p} and w_y \in \mathbb{R}^{q} that maximize the correlation between the projected variables: \rho = \max_{w_x, w_y} \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}} where \Sigma_{xx} and \Sigma_{yy} are the within-modality covariance matrices, and \Sigma_{xy} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA.
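The following minimal sketch illustrates linear CCA on two synthetic modalities that share common latent factors; the data, dimensions, and the use of scikit-learn's CCA estimator are illustrative assumptions, not part of the original formulation.

```python
# Minimal linear CCA sketch on synthetic two-modality data (illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))                                         # shared "semantic" factors
X = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(n, 20))   # modality 1 (e.g., image features)
Y = latent @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(n, 30))   # modality 2 (e.g., text features)

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)   # projections of both modalities into the shared space

# Canonical correlations between paired projected components.
corrs = [np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1] for k in range(2)]
print(corrs)   # close to 1 here, since both views are driven by the same latent factors
```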


Kernel CCA

Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high-dimensional feature spaces using kernel functions. Given kernel functions k_x and k_y with corresponding Gram matrices K_x \in \mathbb{R}^{n \times n} and K_y \in \mathbb{R}^{n \times n}, KCCA seeks coefficient vectors \alpha and \beta that maximize: \rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x^2 \alpha}\,\sqrt{\beta^\top K_y^2 \beta}} To prevent overfitting, regularization terms are typically added, resulting in: \rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top (K_x^2 + \lambda_x K_x) \alpha}\,\sqrt{\beta^\top (K_y^2 + \lambda_y K_y) \beta}} where \lambda_x and \lambda_y are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to the O(n^2) memory required to store the kernel matrices. KCCA was proposed independently by several researchers.
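As an illustration, regularized KCCA can be solved as a generalized eigenvalue problem over the coefficient vectors \alpha and \beta. The sketch below assumes RBF kernels and arbitrary regularization values; it is one possible formulation consistent with the objective above, not a reference implementation.

```python
# Regularized kernel CCA as a generalized eigenvalue problem (illustrative sketch).
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def kcca(X, Y, lam=0.1, gamma=1.0):
    n = X.shape[0]
    # Centered Gram matrices for the two modalities.
    H = np.eye(n) - np.ones((n, n)) / n
    Kx = H @ rbf_kernel(X, gamma=gamma) @ H
    Ky = H @ rbf_kernel(Y, gamma=gamma) @ H
    # Generalized eigenproblem  A v = rho * B v  with v = [alpha; beta].
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + lam * Kx, Z], [Z, Ky @ Ky + lam * Ky]])
    B += 1e-8 * np.eye(2 * n)           # small ridge keeps B positive definite
    rho, v = eigh(A, B)                 # eigenvalues returned in ascending order
    alpha, beta = v[:n, -1], v[n:, -1]  # leading canonical directions
    return rho[-1], alpha, beta
```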


Deep CCA

Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations that maximize the correlation between modalities. DCCA uses separate neural networks f_x and f_y for each modality to transform the original data before applying CCA: \max_{\theta_x, \theta_y, W_x, W_y} \operatorname{corr}\left( W_x^\top f_x(X; \theta_x),\, W_y^\top f_y(Y; \theta_y) \right) where \theta_x and \theta_y represent the parameters of the neural networks, and W_x and W_y are the CCA projection matrices. The correlation objective is computed as: \operatorname{corr}(H_x, H_y) = \left\| T^{-1/2} H_x^\top H_y\, S^{-1/2} \right\|_{\mathrm{tr}} the trace norm (sum of singular values) of the whitened cross-covariance matrix, where H_x = f_x(X) and H_y = f_y(Y) are the network outputs, T = H_x^\top H_x + r_x I, S = H_y^\top H_y + r_y I, and r_x, r_y are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization.
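A rough sketch of the DCCA objective as a trainable loss, assuming a PyTorch implementation with small multilayer perceptrons as f_x and f_y; the network sizes and regularization value are illustrative. The loss is the negative trace norm of the whitened cross-covariance, following the formulas above.

```python
# Sketch of the DCCA correlation objective as a PyTorch loss (illustrative sizes).
import torch
import torch.nn as nn

class DCCA(nn.Module):
    def __init__(self, dim_x, dim_y, out_dim=10, r=1e-3):
        super().__init__()
        # Modality-specific networks f_x and f_y (small MLPs for illustration).
        self.f_x = nn.Sequential(nn.Linear(dim_x, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.f_y = nn.Sequential(nn.Linear(dim_y, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.r = r  # regularization parameters r_x = r_y = r

    def corr_loss(self, Hx, Hy):
        # Center the network outputs (rows are samples).
        Hx = Hx - Hx.mean(dim=0, keepdim=True)
        Hy = Hy - Hy.mean(dim=0, keepdim=True)
        I = torch.eye(Hx.shape[1], device=Hx.device)
        T = Hx.T @ Hx + self.r * I   # T = H_x^T H_x + r_x I
        S = Hy.T @ Hy + self.r * I   # S = H_y^T H_y + r_y I

        def inv_sqrt(M):
            vals, vecs = torch.linalg.eigh(M)
            return vecs @ torch.diag(vals.clamp_min(1e-10).rsqrt()) @ vecs.T

        # Whitened cross-covariance; its singular values are the canonical correlations.
        M = inv_sqrt(T) @ (Hx.T @ Hy) @ inv_sqrt(S)
        return -torch.linalg.matrix_norm(M, ord="nuc")  # maximize total correlation

    def forward(self, x, y):
        return self.corr_loss(self.f_x(x), self.f_y(y))
```

Training then amounts to minimizing this loss over mini-batches of paired samples with any standard optimizer.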


Graph-based methods

Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embeddings that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such family of methods, cross-modal graph neural networks (CMGNNs), extends traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing entities as nodes and their relationships as edges. Other graph-based methods include probabilistic graphical models (PGMs) such as deep belief networks (DBNs) and deep Boltzmann machines (DBMs). These models can learn a joint representation across modalities; for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains, such as the view hierarchy of app screens in human-computer interaction (HCI), can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks.
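As an illustration of the graph-construction step described above, the sketch below builds a joint graph over paired image and text features, with k-nearest-neighbour intra-modal edges and inter-modal edges linking paired items. The pairing rule, the value of k, and the feature inputs are assumptions for illustration rather than part of any specific CMGNN.

```python
# Build a cross-modal graph: intra-modal kNN edges plus inter-modal pairing edges.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def cross_modal_adjacency(img_feats, txt_feats, k=5):
    n = img_feats.shape[0]                       # assume the i-th image is paired with the i-th text
    A_img = kneighbors_graph(img_feats, k, mode="connectivity").toarray()   # intra-modal edges (images)
    A_txt = kneighbors_graph(txt_feats, k, mode="connectivity").toarray()   # intra-modal edges (texts)
    A_cross = np.eye(n)                          # inter-modal edge between each paired (image, text)
    # Block adjacency over 2n nodes: the first n are images, the last n are texts.
    A = np.block([[A_img, A_cross], [A_cross.T, A_txt]])
    return np.maximum(A, A.T)                    # symmetrize so the graph is undirected

# A GNN (for example, one built with PyTorch Geometric) could then pass messages over
# this graph so that intra- and inter-modal neighbours end up with nearby embeddings.
```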


Diffusion maps

Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities.


Multi-view diffusion maps

Multi-view diffusion maps address the challenge of multi-view dimensionality reduction by exploiting the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to use both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model in which a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associated diffusion distances. The spectral decomposition of this kernel enables the discovery of an embedding that better leverages the information from all views. This method has demonstrated utility in various machine learning tasks, including classification, clustering, and manifold learning.
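A rough numerical sketch of the idea for two views, assuming Gaussian affinities and one particular way of combining the within-view kernels into a cross-view kernel; the bandwidth and embedding dimension are illustrative.

```python
# Multi-view diffusion maps sketch: a cross-view kernel over two views of the same objects.
import numpy as np
from scipy.spatial.distance import cdist

def view_kernel(X, eps):
    return np.exp(-cdist(X, X, "sqeuclidean") / eps)   # Gaussian affinity within one view

def multiview_diffusion_embedding(X1, X2, eps=1.0, dim=2):
    K1, K2 = view_kernel(X1, eps), view_kernel(X2, eps)
    n = K1.shape[0]
    # Cross-view kernel: transitions go from a point in one view to a point in the other,
    # passing through both views' affinities (one possible way to combine the relations).
    K_hat = np.block([[np.zeros((n, n)), K1 @ K2],
                      [K2 @ K1, np.zeros((n, n))]])
    P = K_hat / K_hat.sum(axis=1, keepdims=True)        # row-stochastic cross-view operator
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    # Drop the trivial leading eigenvector; the next coordinates embed the 2n
    # view-instances (first n rows: view 1, last n rows: view 2).
    return np.real(vecs[:, order[1:dim + 1]] * vals[order[1:dim + 1]])
```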


Alternating diffusion

Alternating diffusion based methods provide another strategy for multimodal representation learning by focusing on extracting the common underlying sources of variability present across multiple views or sensors. These methods aim to filter out sensor-specific or nuisance components, assuming that the phenomenon of interest is captured by two or more sensors. The core idea involves constructing an alternating diffusion operator by sequentially applying diffusion processes derived from each modality, typically through their product or intersection. This allows the method to capture the structure related to the common hidden variables that drive the observed multimodal data (Katz et al., 2019).
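A minimal sketch of the alternating-diffusion construction for two sensors observing the same n samples, assuming Gaussian affinities and the product-of-operators form; the bandwidth and embedding dimension are illustrative.

```python
# Alternating diffusion sketch: compose per-sensor diffusion operators so that
# only structure shared by both sensors survives the alternation.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_operator(X, eps):
    W = np.exp(-cdist(X, X, "sqeuclidean") / eps)   # Gaussian affinity for one sensor
    return W / W.sum(axis=1, keepdims=True)          # row-stochastic random-walk matrix

def alternating_diffusion_embedding(X1, X2, eps=1.0, dim=2):
    # One alternating step: diffuse with sensor 1, then with sensor 2.
    P = diffusion_operator(X2, eps) @ diffusion_operator(X1, eps)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    # Non-trivial eigenvectors parametrize the common hidden variables;
    # sensor-specific variability is suppressed by the composition.
    return np.real(vecs[:, order[1:dim + 1]])
```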


See also

* Representation learning
* Canonical correlation
* Deep learning
* Multimodal learning
* Nonlinear dimensionality reduction


References

* Katz, Ori; Talmon, Ronen; Lo, Yu-Lun; Wu, Hau-Tieng (January 2019). "Alternating diffusion maps for multimodal data fusion". Information Fusion. 45: 346–360. doi:10.1016/j.inffus.2018.01.007.
