A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.
RBMs were initially invented under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators invented fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning, topic modeling [Ruslan Salakhutdinov and Geoffrey Hinton (2010). Replicated softmax: an undirected topic model. ''Neural Information Processing Systems'' 23.], and even many-body quantum mechanics. They can be trained in either supervised or unsupervised ways, depending on the task.
As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph: a pair of nodes, one from each of the two groups of units (commonly referred to as the "visible" and "hidden" units respectively), may have a symmetric connection between them, and there are no connections between nodes within a group. By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm. [Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. ''Artificial Intelligence and Statistics''.]
Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.
Structure
The standard type of RBM has binary-valued (Boolean) hidden and visible units, and consists of a matrix of weights $W$ of size $m \times n$. Each weight element $w_{i,j}$ of the matrix is associated with the connection between the visible (input) unit $v_i$ and the hidden unit $h_j$. In addition, there are bias weights (offsets) $a_i$ for $v_i$ and $b_j$ for $h_j$. Given the weights and biases, the ''energy'' of a configuration (pair of Boolean vectors) $(v, h)$ is defined as
: $E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{i,j} h_j$
or, in matrix notation,
: $E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h - v^{\mathrm{T}} W h.$
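As a minimal sketch of the energy definition, the following NumPy snippet evaluates the energy of one configuration in both the element-wise and the matrix form, assuming the usual binary-RBM energy $E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h - v^{\mathrm{T}} W h$. The function names and the toy parameter values are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Matrix form: E(v, h) = -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def energy_elementwise(v, h, W, a, b):
    """Same energy written as explicit sums over visible and hidden units."""
    m, n = W.shape
    total = 0.0
    for i in range(m):
        total -= a[i] * v[i]            # visible bias terms
    for j in range(n):
        total -= b[j] * h[j]            # hidden bias terms
    for i in range(m):
        for j in range(n):
            total -= v[i] * W[i, j] * h[j]   # pairwise interaction terms
    return total

# Toy parameters (hypothetical values, chosen only for illustration).
W = np.array([[0.5, -1.0],
              [1.5,  0.0],
              [-0.5, 2.0]])          # m = 3 visible units, n = 2 hidden units
a = np.array([0.1, -0.2, 0.3])       # visible biases
b = np.array([-0.4, 0.6])            # hidden biases
v = np.array([1, 0, 1])              # one visible configuration
h = np.array([1, 1])                 # one hidden configuration

e_matrix = energy(v, h, W, a, b)
e_sum = energy_elementwise(v, h, W, a, b)
print(e_matrix, e_sum)
```

Both forms should agree up to floating-point rounding; the matrix form is the one normally used in practice because it vectorizes over units.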
This energy function is analogous to that of a Hopfield network. As with general Boltzmann machines, the joint probability distribution for the visible and hidden vectors is defined in terms of the energy function as follows, [Geoffrey Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines. UTML TR 2010–003, University of Toronto.]
: $P(v,h) = \frac{1}{Z} e^{-E(v,h)}$
where $Z$ is a partition function defined as the sum of $e^{-E(v,h)}$ over all possible configurations, which can be interpreted as a normalizing constant to ensure that the probabilities sum to 1. The marginal probability of a visible vector is the sum of $P(v,h)$ over all possible hidden layer configurations,
: $P(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$,
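For a very small RBM the partition function can be computed by brute-force enumeration, which makes the definitions concrete. The sketch below assumes binary units and the energy $E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h - v^{\mathrm{T}} W h$; the helper names and the random toy parameters are assumptions for illustration. Enumeration costs $2^{m+n}$ terms, so this is only feasible for a handful of units.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for binary vectors v and h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def all_binary_vectors(size):
    """Every binary vector of the given length, as NumPy arrays."""
    return [np.array(bits) for bits in itertools.product([0, 1], repeat=size)]

def partition_function(W, a, b):
    """Z: sum of exp(-E(v, h)) over all visible and hidden configurations."""
    m, n = W.shape
    return sum(np.exp(-energy(v, h, W, a, b))
               for v in all_binary_vectors(m)
               for h in all_binary_vectors(n))

def joint_probability(v, h, W, a, b, Z):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h, W, a, b)) / Z

def marginal_visible(v, W, a, b, Z):
    """P(v): sum of P(v, h) over all hidden configurations."""
    n = W.shape[1]
    return sum(joint_probability(v, h, W, a, b, Z)
               for h in all_binary_vectors(n))

# Hypothetical toy model: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))
a = rng.normal(scale=0.5, size=3)
b = rng.normal(scale=0.5, size=2)

Z = partition_function(W, a, b)
total = sum(marginal_visible(v, W, a, b, Z) for v in all_binary_vectors(3))
print(Z, total)   # total should be 1 up to rounding
```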
and vice versa. Since the underlying graph structure of the RBM is bipartite (meaning there are no intra-layer connections), the hidden unit activations are mutually independent given the visible unit activations. Conversely, the visible unit activations are mutually independent given the hidden unit activations.
That is, for ''m'' visible units and ''n'' hidden units, the conditional probability of a configuration of the visible units $v$, given a configuration of the hidden units $h$, is
: $P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$.
Conversely, the conditional probability of $h$ given $v$ is
: $P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$.
The individual activation probabilities are given by
: $P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right)$ and $P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right)$
where $\sigma$ denotes the logistic sigmoid.
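Because each layer is conditionally independent given the other, all units of a layer can be sampled in one vectorized step. The sketch below assumes the activation probabilities $P(h_j = 1 \mid v) = \sigma(b_j + \sum_i w_{i,j} v_i)$ and $P(v_i = 1 \mid h) = \sigma(a_i + \sum_j w_{i,j} h_j)$, and shows one round of block Gibbs sampling. The function names and the toy parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Vector of P(h_j = 1 | v) for every hidden unit j."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """Vector of P(v_i = 1 | h) for every visible unit i."""
    return sigmoid(a + W @ h)

def gibbs_step(v, W, a, b, rng):
    """Sample h from P(h | v), then a new v from P(v | h)."""
    ph = p_h_given_v(v, W, b)
    h = (rng.random(ph.shape) < ph).astype(int)       # independent Bernoulli draws
    pv = p_v_given_h(h, W, a)
    v_new = (rng.random(pv.shape) < pv).astype(int)   # independent Bernoulli draws
    return v_new, h

# Toy parameters (hypothetical values, chosen only for illustration).
W = np.array([[0.5, -1.0],
              [1.5,  0.0],
              [-0.5, 2.0]])          # 3 visible units, 2 hidden units
a = np.array([0.1, -0.2, 0.3])
b = np.array([-0.4, 0.6])

rng = np.random.default_rng(0)
v0 = np.array([1, 0, 1])
ph0 = p_h_given_v(v0, W, b)          # sigmoid of [-0.4, 1.6]
v1, h0 = gibbs_step(v0, W, a, b, rng)
print(ph0, h0, v1)
```

Drawing every unit of a layer at once is only valid because of the bipartite restriction; in an unrestricted Boltzmann machine the units within a layer would have to be updated one at a time.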
The visible units of a restricted Boltzmann machine can be multinomial, although the hidden units are Bernoulli. In this case, the logistic function for visible units is replaced by the softmax function
: $P(v_i^k = 1 \mid h) = \frac{\exp\left(a_i^k + \sum_j w_{i,j}^k h_j\right)}{\sum_{k'=1}^{K} \exp\left(a_i^{k'} + \sum_j w_{i,j}^{k'} h_j\right)}$
where ''K'' is the number of discrete values that the visible units can take. Such models are applied in topic modeling and recommender systems.
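A minimal sketch of the softmax visible units follows, assuming one weight $w_{i,j}^k$ and one bias $a_i^k$ per visible unit $i$, category $k$ and hidden unit $j$. The array layout (weights of shape `(m, K, n)`, biases of shape `(m, K)`) and the function name are assumptions made for this illustration.

```python
import numpy as np

def softmax_visible(h, W, a):
    """P(v_i^k = 1 | h) for every visible unit i and category k.

    W has shape (m, K, n), a has shape (m, K), h has shape (n,).
    Returns an (m, K) array whose rows each sum to 1.
    """
    scores = a + W @ h                              # shape (m, K)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum(axis=1, keepdims=True)

# Hypothetical toy model: m = 4 visible units, K = 3 categories, n = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3, 2))
a = rng.normal(scale=0.5, size=(4, 3))
h = np.array([1, 0])

probs = softmax_visible(h, W, a)
print(probs.sum(axis=1))   # each row should sum to 1 up to rounding
```

With ''K'' = 2 and the weights and biases of the second category fixed at zero, this expression reduces to the logistic sigmoid used for binary visible units.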
Relation to other models
Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields. [Asja Fischer and Christian Igel (2014). Training Restricted Boltzmann Machines: An Introduction. ''Pattern Recognition'' 47, pp. 25-39.] Their graphical model corresponds to that of factor analysis.
Training algorithm
Restricted Boltzmann machines are trained to maximize the product of probabilities assigned to some training set $V$ (a matrix, each row of which is treated as a visible vector $v$),
: $\arg\max_{W} \prod_{v \in V} P(v)$
or equivalently, to maximize the expected log probability of a training sample $v$ selected randomly from $V$:
: $\arg\max_{W} \mathbb{E}\left[\log P(v)\right]$
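The equivalence of the two objectives can be checked numerically on a toy model: the logarithm of the product of probabilities equals the number of samples times the mean log probability, so both are maximized by the same parameters. The sketch below assumes binary units and uses the standard closed form $\sum_h e^{-E(v,h)} = e^{a^{\mathrm{T}} v} \prod_j \left(1 + e^{b_j + \sum_i w_{i,j} v_i}\right)$; the partition function is obtained by enumerating all visible vectors, which is only feasible for very small models. Function names, parameters and the training matrix are hypothetical.

```python
import itertools
import numpy as np

def log_unnormalized(v, W, a, b):
    """log sum_h exp(-E(v, h)), in closed form for binary hidden units."""
    return a @ v + np.logaddexp(0.0, b + v @ W).sum()

def log_partition(W, a, b):
    """log Z by enumerating all 2^m visible vectors (tiny models only)."""
    m = W.shape[0]
    terms = [log_unnormalized(np.array(bits), W, a, b)
             for bits in itertools.product([0, 1], repeat=m)]
    return np.logaddexp.reduce(terms)

def log_probabilities(V, W, a, b):
    """log P(v) for every row v of the training matrix V."""
    log_z = log_partition(W, a, b)
    return np.array([log_unnormalized(v, W, a, b) - log_z for v in V])

# Hypothetical toy model and training set (3 visible units, 2 hidden units).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))
a = rng.normal(scale=0.5, size=3)
b = rng.normal(scale=0.5, size=2)
V = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1]])

log_p = log_probabilities(V, W, a, b)
likelihood_product = np.exp(log_p).prod()   # product of P(v) over the training set
mean_log_likelihood = log_p.mean()          # expected log probability of a sample
print(likelihood_product, mean_log_likelihood)
```

In practice the mean log probability is preferred because the raw product underflows quickly as the training set grows.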