Multimodal learning attempts to model the combination of different modalities of data that often arise in real-world applications. An example of multimodal data is data that combines text (typically represented as discrete word count vectors) with imaging data consisting of pixel intensities and annotation tags. Because these modalities have fundamentally different statistical properties, combining them is non-trivial and requires specialized modelling strategies and algorithms.

Motivation

Many models and algorithms have been developed to retrieve and classify a single type of data, such as images or text. However, data usually comes in several modalities, each carrying different information. For example, an image is commonly captioned to convey information not present in the image itself. Similarly, an image is sometimes a more direct way to describe information that is not obvious from text. As a result, if different words appear with similar images, these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, these images may represent the same object. Thus, when dealing with multimodal data, it is important to use a model that represents the information jointly, so that it can capture the correlation structure between modalities. Moreover, it should be able to recover missing modalities given observed ones (e.g. predicting a possible image object from a text description). The multimodal deep Boltzmann machine model satisfies both requirements.

Background: Boltzmann machine

A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in a Boltzmann machine are divided into two groups: visible units and hidden units. General Boltzmann machines allow connections between any pair of units. However, learning is impractical with general Boltzmann machines because the computation time grows exponentially with the size of the machine. A more efficient architecture is the restricted Boltzmann machine, in which connections are only allowed between a hidden unit and a visible unit; it is described in the next section.

Restricted Boltzmann machine

A restricted Boltzmann machine (RBM) is an undirected graphical model with stochastic visible variables and stochastic hidden variables. Each visible variable is connected to each hidden variable. The energy function of the model is defined as

$$E(\mathbf v,\mathbf h;\theta) = -\sum_{i=1}^D\sum_{j=1}^F W_{ij}v_ih_j - \sum_{i=1}^D b_iv_i - \sum_{j=1}^F a_jh_j$$

where $\theta = \{\mathbf W, \mathbf a, \mathbf b\}$ are the model parameters: $W_{ij}$ represents the symmetric interaction term between visible unit $i$ and hidden unit $j$; $b_i$ and $a_j$ are bias terms. The probability that the model assigns to a visible vector, obtained by summing the joint distribution over the hidden states, is

$$P(\mathbf v;\theta) = \frac{1}{\mathcal{Z}(\theta)}\sum_{\mathbf h}\exp(-E(\mathbf v,\mathbf h;\theta))$$

where $\mathcal{Z}(\theta)$ is a normalizing constant. The conditional distributions over the hidden units $\mathbf h$ and the visible units $\mathbf v$ can be derived as logistic functions of the model parameters:

$$P(\mathbf h \mid \mathbf v;\theta) = \prod_{j=1}^F p(h_j \mid \mathbf v), \quad\text{with}\quad p(h_j=1 \mid \mathbf v) = g\Big(\sum_{i=1}^D W_{ij}v_i + a_j\Big)$$

$$P(\mathbf v \mid \mathbf h;\theta) = \prod_{i=1}^D p(v_i \mid \mathbf h), \quad\text{with}\quad p(v_i=1 \mid \mathbf h) = g\Big(\sum_{j=1}^F W_{ij}h_j + b_i\Big)$$

where $g(x) = \frac{1}{1+\exp(-x)}$ is the logistic function. The derivative of the log-likelihood with respect to the model parameters can be decomposed as the difference between the data-dependent expectation and the model's expectation.
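The energy and the two conditionals above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the list-of-lists layout of `W` (indexed `W[i][j]` for visible unit `i` and hidden unit `j`) and the toy parameter values are assumptions made for the example, not part of any particular implementation.

```python
import math
import random


def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, a, b):
    """E(v, h) = -sum_ij W[i][j] v_i h_j - sum_i b_i v_i - sum_j a_j h_j."""
    interaction = sum(W[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(b[i] * v[i] for i in range(len(v)))
    hidden_bias = sum(a[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias


def p_hidden_given_visible(v, W, a):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [logistic(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def p_visible_given_hidden(h, W, b):
    """p(v_i = 1 | h) for every visible unit i."""
    return [logistic(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def sample_bernoulli(probs, rng):
    """Draw one independent 0/1 sample per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, a, b, rng):
    """One alternating update: sample h given v, then v given h."""
    h = sample_bernoulli(p_hidden_given_visible(v, W, a), rng)
    v_new = sample_bernoulli(p_visible_given_hidden(h, W, b), rng)
    return v_new, h


# Toy model with D = 3 visible and F = 2 hidden units (hypothetical values).
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
a = [0.0, 0.1]       # hidden biases a_j
b = [0.2, -0.1, 0.0]  # visible biases b_i
rng = random.Random(0)
v_new, h = gibbs_step([1, 0, 1], W, a, b, rng)
```

Because no two units in the same layer are connected, each layer can be sampled in a single pass given the other, which is why `gibbs_step` needs only two calls.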

Gaussian-Bernoulli RBM

Gaussian-Bernoulli RBMs are a variant of the restricted Boltzmann machine used for modeling real-valued vectors such as pixel intensities; they are typically used to model image data. The energy of the Gaussian-Bernoulli RBM is defined as

$$E(\mathbf v,\mathbf h;\theta) = \sum_{i=1}^D\frac{(v_i-b_i)^2}{2\sigma_i^2} - \sum_{i=1}^D\sum_{j=1}^F\frac{v_i}{\sigma_i}W_{ij}h_j - \sum_{j=1}^F a_jh_j$$

where $\theta = \{\mathbf a, \mathbf b, \mathbf W, \boldsymbol\sigma\}$ are the model parameters. The joint distribution is defined in the same way as for the restricted Boltzmann machine above. The conditional distributions now become

$$P(\mathbf h \mid \mathbf v;\theta) = \prod_{j=1}^F p(h_j \mid \mathbf v), \quad\text{with}\quad p(h_j=1 \mid \mathbf v) = g\Big(\sum_{i=1}^D W_{ij}\frac{v_i}{\sigma_i} + a_j\Big)$$

$$P(\mathbf v \mid \mathbf h;\theta) = \prod_{i=1}^D p(v_i \mid \mathbf h), \quad\text{with}\quad p(v_i \mid \mathbf h) \sim \mathcal{N}\Big(\sigma_i\sum_{j=1}^F W_{ij}h_j + b_i,\ \sigma_i^2\Big)$$

In a Gaussian-Bernoulli RBM, each visible unit conditioned on the hidden units is therefore modeled as a Gaussian distribution, while the hidden units remain binary.
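A minimal sketch of the two Gaussian-Bernoulli conditionals, assuming the same list-of-lists weight layout `W[i][j]` as an ordinary RBM. The helper names and toy values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
import random


def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_hidden_given_visible(v, W, a, sigma):
    """p(h_j = 1 | v): each real-valued input is scaled by 1 / sigma_i."""
    return [logistic(sum(W[i][j] * v[i] / sigma[i] for i in range(len(v)))
                     + a[j])
            for j in range(len(a))]


def visible_mean_given_hidden(h, W, b, sigma):
    """Mean of the Gaussian p(v_i | h): sigma_i * sum_j W[i][j] h_j + b_i."""
    return [sigma[i] * sum(W[i][j] * h[j] for j in range(len(h))) + b[i]
            for i in range(len(b))]


def sample_visible_given_hidden(h, W, b, sigma, rng):
    """Draw v_i ~ N(mean_i, sigma_i^2) independently for each visible unit."""
    means = visible_mean_given_hidden(h, W, b, sigma)
    return [rng.gauss(m, s) for m, s in zip(means, sigma)]


# Toy model with D = 2 real-valued visible units and F = 2 hidden units.
W = [[0.5, -1.0], [2.0, 0.0]]
a = [0.0, 0.5]        # hidden biases
b = [1.0, -1.0]       # visible biases (Gaussian means when h = 0)
sigma = [2.0, 0.5]    # per-unit standard deviations
rng = random.Random(0)

hidden_probs = p_hidden_given_visible([2.0, 0.5], W, a, sigma)
pixel_sample = sample_visible_given_hidden([1, 1], W, b, sigma, rng)
```

Note that `rng.gauss` takes a standard deviation, so `sigma[i]` is passed directly rather than the variance.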

Replicated Softmax Model

The Replicated Softmax Model is also a variant of the restricted Boltzmann machine and is commonly used to model word count vectors in a document. In a typical text mining problem, let $K$ be the dictionary size and $M$ be the number of words in the document. Let $\mathbf V$ be an $M \times K$ binary matrix with $v_{ik} = 1$ only when the $i^{th}$ word in the document is the $k^{th}$ word in the dictionary, and let $\hat v_k = \sum_{i=1}^M v_{ik}$ denote the count for the $k^{th}$ word in the dictionary. The energy of the state $\{\mathbf V, \mathbf h\}$ for a document that contains $M$ words is defined as

$$E(\mathbf V,\mathbf h) = -\sum_{j=1}^F\sum_{k=1}^K W_{kj}\hat v_kh_j - \sum_{k=1}^K b_k\hat v_k - M\sum_{j=1}^F a_jh_j$$

The conditional distributions are given by

$$p(h_j=1 \mid \mathbf V) = g\Big(Ma_j + \sum_{k=1}^K\hat v_kW_{kj}\Big)$$

$$p(v_{ik} = 1 \mid \mathbf h) = \frac{\exp\big(b_k + \sum_{j=1}^F h_jW_{kj}\big)}{\sum_{q=1}^K\exp\big(b_q + \sum_{j=1}^F h_jW_{qj}\big)}$$
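The following sketch shows how a document could be turned into a count vector and passed through the two Replicated Softmax conditionals. The representation of a document as a list of dictionary indices, the function names and the toy values are assumptions made for illustration.

```python
import math


def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def word_counts(word_ids, K):
    """Count vector v_hat: v_hat[k] is how often dictionary word k occurs."""
    counts = [0] * K
    for k in word_ids:
        counts[k] += 1
    return counts


def p_hidden_given_counts(counts, W, a):
    """p(h_j = 1 | V) = g(M a_j + sum_k v_hat_k W[k][j]), with M = sum(counts)."""
    M = sum(counts)
    return [logistic(M * a[j]
                     + sum(counts[k] * W[k][j] for k in range(len(counts))))
            for j in range(len(a))]


def p_word_given_hidden(h, W, b):
    """Softmax over the K dictionary words, shared by every word position."""
    scores = [b[k] + sum(h[j] * W[k][j] for j in range(len(h)))
              for k in range(len(b))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A 5-word document over a dictionary of K = 4 words (hypothetical values).
K = 4
counts = word_counts([0, 2, 2, 1, 2], K)
W = [[0.1], [0.2], [0.0], [0.5]]  # K x F weights with F = 1 hidden unit
a = [-0.1]                        # hidden bias, scaled by M in the conditional
b = [0.0, 0.0, 0.0, 0.0]          # word biases

hidden_probs = p_hidden_given_counts(counts, W, a)
word_probs = p_word_given_hidden([1], W, b)
```

Scaling the hidden bias by the document length `M` is what lets the same parameters handle documents of different lengths.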

Deep Boltzmann machines

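A deep Boltzmann machine (DBM) has a sequence of layers of hidden units. There are only connections between adjacent hidden layers, as well as between the visible units and the hidden units in the first hidden layer. The energy function of the system adds layer interaction terms to the energy function of the restricted Boltzmann machine. For a model with three hidden layers $\mathbf h = \{\mathbf h^{(1)}, \mathbf h^{(2)}, \mathbf h^{(3)}\}$ containing $F_1$, $F_2$ and $F_3$ units, it is defined by

$$\begin{aligned}
E(\mathbf v,\mathbf h;\theta) = & -\sum_{i=1}^D\sum_{j=1}^{F_1}W_{ij}^{(1)}v_ih_j^{(1)} - \sum_{j=1}^{F_1}\sum_{l=1}^{F_2}W_{jl}^{(2)}h_j^{(1)}h_l^{(2)} \\
& -\sum_{l=1}^{F_2}\sum_{p=1}^{F_3}W_{lp}^{(3)}h_l^{(2)}h_p^{(3)} - \sum_{i=1}^D b_iv_i - \sum_{j=1}^{F_1}b_j^{(1)}h_j^{(1)} - \sum_{l=1}^{F_2}b_l^{(2)}h_l^{(2)} - \sum_{p=1}^{F_3}b_p^{(3)}h_p^{(3)}
\end{aligned}$$

The probability assigned to a visible vector is

$$P(\mathbf v;\theta) = \frac{1}{\mathcal{Z}(\theta)}\sum_{\mathbf h}\exp\big(-E(\mathbf v,\mathbf h^{(1)},\mathbf h^{(2)},\mathbf h^{(3)};\theta)\big)$$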
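As a worked step, assuming a DBM with three hidden layers, the conditional for a unit in the middle layer can be read off the energy by collecting every term that contains $h_l^{(2)}$. The notation here ($W^{(2)}$ for the weights between the first and second hidden layers, $W^{(3)}$ for those between the second and third, $b^{(2)}$ for the second-layer biases, $g$ for the logistic function) is an assumption chosen to match the usual layer-wise convention.

```latex
% Terms of E that involve the single unit h_l^{(2)}:
E_l = -h_l^{(2)}\Big(\sum_{j=1}^{F_1} W_{jl}^{(2)} h_j^{(1)}
      + \sum_{p=1}^{F_3} W_{lp}^{(3)} h_p^{(3)} + b_l^{(2)}\Big)

% Comparing the states h_l^{(2)} = 1 and h_l^{(2)} = 0 gives
p\big(h_l^{(2)} = 1 \mid \mathbf h^{(1)}, \mathbf h^{(3)}\big)
  = g\Big(\sum_{j=1}^{F_1} W_{jl}^{(2)} h_j^{(1)}
      + \sum_{p=1}^{F_3} W_{lp}^{(3)} h_p^{(3)} + b_l^{(2)}\Big)
```

Unlike in a single RBM, a hidden unit in a DBM receives input from both the layer below and the layer above, which is why the hidden layers can no longer be inferred exactly in one pass.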

Multimodal deep Boltzmann machines

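The multimodal deep Boltzmann machine uses an image-text bimodal DBM in which the image pathway is modeled as a Gaussian-Bernoulli DBM and the text pathway as a Replicated Softmax DBM. Each DBM has two hidden layers and one visible layer, and the two DBMs are joined at an additional top hidden layer. Superscripts $m$ and $t$ mark the image and text pathways, $v_k^t$ is the count of the $k^{th}$ dictionary word, $F_1^m, F_2^m$ and $F_1^t, F_2^t$ are the sizes of the hidden layers in each pathway, and $F_3$ is the size of the joint layer $\mathbf h^{(3)}$. The joint distribution over the multimodal inputs is defined as

$$\begin{aligned}
P(\mathbf v^m,\mathbf v^t;\theta) & = \sum_{\mathbf h^{(2m)},\mathbf h^{(2t)},\mathbf h^{(3)}}P(\mathbf h^{(2m)},\mathbf h^{(2t)},\mathbf h^{(3)})\Big(\sum_{\mathbf h^{(1m)}}P(\mathbf v^m,\mathbf h^{(1m)} \mid \mathbf h^{(2m)})\Big)\Big(\sum_{\mathbf h^{(1t)}}P(\mathbf v^t,\mathbf h^{(1t)} \mid \mathbf h^{(2t)})\Big)\\
& = \frac{1}{\mathcal{Z}(M,\theta)}\sum_{\mathbf h}\exp\Big(\sum_{kj}W_{kj}^{(1t)}v_k^th_j^{(1t)} \\
& \quad + \sum_{jl}W_{jl}^{(2t)}h_j^{(1t)}h_l^{(2t)} + \sum_k b_k^tv_k^t + M\sum_j b_j^{(1t)}h_j^{(1t)} + \sum_l b_l^{(2t)}h_l^{(2t)}\\
& \quad - \sum_i\frac{(v_i^m-b_i^m)^2}{2\sigma_i^2} + \sum_{ij}\frac{v_i^m}{\sigma_i}W_{ij}^{(1m)}h_j^{(1m)} \\
& \quad + \sum_{jl}W_{jl}^{(2m)}h_j^{(1m)}h_l^{(2m)} + \sum_j b_j^{(1m)}h_j^{(1m)} + \sum_l b_l^{(2m)}h_l^{(2m)}\\
& \quad + \sum_{lp}W_{lp}^{(3t)}h_l^{(2t)}h_p^{(3)} + \sum_{lp}W_{lp}^{(3m)}h_l^{(2m)}h_p^{(3)} + \sum_p b_p^{(3)}h_p^{(3)}\Big)
\end{aligned}$$

The conditional distributions over the visible and hidden units are

$$p(h_j^{(1m)}=1 \mid \mathbf v^m,\mathbf h^{(2m)}) = g\Big(\sum_{i=1}^D W_{ij}^{(1m)}\frac{v_i^m}{\sigma_i} + \sum_{l=1}^{F_2^m}W_{jl}^{(2m)}h_l^{(2m)} + b_j^{(1m)}\Big)$$

$$p(h_l^{(2m)}=1 \mid \mathbf h^{(1m)},\mathbf h^{(3)}) = g\Big(\sum_{j=1}^{F_1^m}W_{jl}^{(2m)}h_j^{(1m)} + \sum_{p=1}^{F_3}W_{lp}^{(3m)}h_p^{(3)} + b_l^{(2m)}\Big)$$

$$p(h_j^{(1t)}=1 \mid \mathbf v^t,\mathbf h^{(2t)}) = g\Big(\sum_{k=1}^K W_{kj}^{(1t)}v_k^t + \sum_{l=1}^{F_2^t}W_{jl}^{(2t)}h_l^{(2t)} + Mb_j^{(1t)}\Big)$$

$$p(h_l^{(2t)}=1 \mid \mathbf h^{(1t)},\mathbf h^{(3)}) = g\Big(\sum_{j=1}^{F_1^t}W_{jl}^{(2t)}h_j^{(1t)} + \sum_{p=1}^{F_3}W_{lp}^{(3t)}h_p^{(3)} + b_l^{(2t)}\Big)$$

$$p(h_p^{(3)}=1 \mid \mathbf h^{(2m)},\mathbf h^{(2t)}) = g\Big(\sum_{l=1}^{F_2^m}W_{lp}^{(3m)}h_l^{(2m)} + \sum_{l=1}^{F_2^t}W_{lp}^{(3t)}h_l^{(2t)} + b_p^{(3)}\Big)$$

$$p(v_{ik}^t = 1 \mid \mathbf h^{(1t)}) = \frac{\exp\big(\sum_{j=1}^{F_1^t}h_j^{(1t)}W_{kj}^{(1t)} + b_k^t\big)}{\sum_{q=1}^K\exp\big(\sum_{j=1}^{F_1^t}h_j^{(1t)}W_{qj}^{(1t)} + b_q^t\big)}$$

$$p(v_i^m \mid \mathbf h^{(1m)}) \sim \mathcal{N}\Big(\sigma_i\sum_{j=1}^{F_1^m}W_{ij}^{(1m)}h_j^{(1m)} + b_i^m,\ \sigma_i^2\Big)$$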
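The joint layer is where the two pathways meet: each top-level unit sums evidence from the second hidden layer of the image pathway and from that of the text pathway. A minimal sketch of that one conditional, with hypothetical names (`W3m`, `W3t`, `b3`) and toy values:

```python
import math


def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_joint_given_pathways(h2m, h2t, W3m, W3t, b3):
    """p(h3_p = 1 | h2m, h2t) for every unit p of the joint layer.

    W3m[l][p] links image-pathway unit l to joint unit p;
    W3t[l][p] does the same for the text pathway.
    """
    probs = []
    for p in range(len(b3)):
        from_image = sum(W3m[l][p] * h2m[l] for l in range(len(h2m)))
        from_text = sum(W3t[l][p] * h2t[l] for l in range(len(h2t)))
        probs.append(logistic(from_image + from_text + b3[p]))
    return probs


# Toy sizes: 2 image-pathway units, 1 text-pathway unit, 2 joint units.
W3m = [[0.5, -0.5], [1.0, 1.0]]
W3t = [[0.25, 0.5]]
b3 = [0.0, 0.1]

joint_probs = p_joint_given_pathways([1, 0], [1], W3m, W3t, b3)
```

Because both pathways feed the same units, activity in one modality shifts the joint representation, and through it the conditionals of the other modality.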

Inference and learning

Exact maximum likelihood learning in this model is intractable, but approximate learning of DBMs can be carried out using a variational approach: mean-field inference is used to estimate the data-dependent expectations, and a stochastic approximation procedure based on Markov chain Monte Carlo (MCMC) is used to approximate the model's expected sufficient statistics.
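To illustrate the mean-field part only, the sketch below runs fixed-point updates for a small DBM. It makes several simplifying assumptions that are not taken from the text: binary visible units, two hidden layers, a fixed number of iterations instead of a convergence test, and hypothetical names throughout. The MCMC side of learning is not shown.

```python
import math


def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(v, W1, W2, b1, b2, n_iters=10):
    """Approximate posterior means (mu1, mu2) of the two hidden layers.

    W1[i][j] links visible unit i to first-layer unit j;
    W2[j][l] links first-layer unit j to second-layer unit l.
    """
    F1, F2 = len(b1), len(b2)
    mu1 = [0.5] * F1  # start from an uninformative guess
    mu2 = [0.5] * F2
    for _ in range(n_iters):
        # The first hidden layer sees the data below and the current mu2 above.
        mu1 = [logistic(sum(W1[i][j] * v[i] for i in range(len(v)))
                        + sum(W2[j][l] * mu2[l] for l in range(F2))
                        + b1[j])
               for j in range(F1)]
        # The top hidden layer sees only the updated mu1 below it.
        mu2 = [logistic(sum(W2[j][l] * mu1[j] for j in range(F1)) + b2[l])
               for l in range(F2)]
    return mu1, mu2


# Toy DBM: 2 visible units, 2 first-layer units, 1 second-layer unit.
W1 = [[0.3, -0.3], [0.2, 0.1]]
W2 = [[0.4], [-0.6]]
b1 = [0.0, 0.2]
b2 = [0.5]

mu1, mu2 = mean_field([1, 0], W1, W2, b1, b2)
```

The resulting `mu1` and `mu2` stand in for the hidden states when the data-dependent expectations are computed.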

Application

Multimodal deep Boltzmann machines have been used successfully for classification and for retrieving missing data. In classification accuracy, the multimodal deep Boltzmann machine outperforms support vector machines, latent Dirichlet allocation and deep belief networks when the models are tested on data with both image and text modalities or with a single modality. The multimodal deep Boltzmann machine is also able to predict missing modalities given the observed ones with reasonably good precision.

More recently, self-supervised learning has produced more powerful multimodal models; the CLIP and DALL-E models developed by OpenAI are prominent examples. Multimodal deep learning is also used for cancer screening: at least one system under development integrates such different types of data.
