A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising–Lenz–Little model) is a stochastic spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model, which is itself a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as a Markov random field.

Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.

They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. They were heavily popularized and promoted by Geoffrey Hinton, Terry Sejnowski and Yann LeCun in cognitive science communities and in machine learning. As a more general class within machine learning, these models are called "energy-based models" (EBM), because Hamiltonians of spin glasses are used as a starting point to define the learning task.

Structure

A Boltzmann machine, like a Sherrington–Kirkpatrick model, is a network of units with a total "energy" (Hamiltonian) defined for the overall network. Its units produce binary results. Boltzmann machine units are stochastic. The global energy $E$ in a Boltzmann machine is identical in form to that of Hopfield networks and Ising models:

$$E = -\left(\sum_{i<j} w_{ij} \, s_i \, s_j + \sum_i \theta_i \, s_i \right)$$

Where:
* $w_{ij}$ is the connection strength between unit $j$ and unit $i$.
* $s_i$ is the state, $s_i \in \{0,1\}$, of unit $i$.
* $\theta_i$ is the bias of unit $i$ in the global energy function. ($-\theta_i$ is the activation threshold for the unit.)

Often the weights $w_{ij}$ are represented as a symmetric matrix $W = [w_{ij}]$ with zeros along the diagonal.
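As a minimal sketch of the energy function above, the following Python snippet evaluates $E$ for a small network. The function name `global_energy` and the 3-unit weights and biases are assumptions made only for illustration; they do not come from any particular model.

```python
from itertools import combinations


def global_energy(weights, biases, states):
    """Return E = -(sum_{i<j} w_ij * s_i * s_j + sum_i theta_i * s_i)."""
    n = len(states)
    # Each unordered pair (i, j) with i < j is counted exactly once.
    pair_term = sum(weights[i][j] * states[i] * states[j]
                    for i, j in combinations(range(n), 2))
    bias_term = sum(biases[i] * states[i] for i in range(n))
    return -(pair_term + bias_term)


# Hypothetical 3-unit network; the numbers are illustrative only.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]

for s in ([0, 0, 0], [1, 1, 0], [1, 0, 1]):
    print(s, global_energy(W, theta, s))
```

Summing over pairs with $i<j$ uses only the upper triangle of the symmetric matrix, so no connection is counted twice.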

Unit state probability

The difference in the global energy that results from a single unit $i$ equaling 0 (off) versus 1 (on), written $\Delta E_i$, assuming a symmetric matrix of weights, is given by:

$$\Delta E_i = \sum_{j>i} w_{ij} \, s_j + \sum_{j<i} w_{ji} \, s_j + \theta_i$$

This can be expressed as the difference of energies of two states:

$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$

Substituting the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state) gives:

$$\Delta E_i = -k_B\,T\ln(p_{i=\text{off}}) - (-k_B\,T\ln(p_{i=\text{on}}))$$

where $k_B$ is the Boltzmann constant and is absorbed into the artificial notion of temperature $T$. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}})$$

$$\frac{\Delta E_i}{T} = \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}})$$

$$\frac{\Delta E_i}{T} = \ln\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right)$$

$$-\frac{\Delta E_i}{T} = \ln\left(\frac{1 - p_{i=\text{on}}}{p_{i=\text{on}}}\right)$$

$$-\frac{\Delta E_i}{T} = \ln\left(\frac{1}{p_{i=\text{on}}} - 1\right)$$

$$\exp\left(-\frac{\Delta E_i}{T}\right) = \frac{1}{p_{i=\text{on}}} - 1$$

Solving for $p_{i=\text{on}}$, the probability that the $i$-th unit is on gives:

$$p_{i=\text{on}} = \frac{1}{1+\exp\left(-\frac{\Delta E_i}{T}\right)}$$

where the scalar $T$ is referred to as the temperature of the system. This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
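The relation between the energy gap and the on-probability can be sketched in a few lines of Python. The helper names `energy_gap` and `p_on`, and the example weights, are hypothetical; the two-branch form of the logistic function is an implementation choice to avoid floating-point overflow, not part of the derivation.

```python
import math


def energy_gap(weights, biases, states, i):
    """Delta E_i = sum_{j != i} w_ij * s_j + theta_i, for symmetric weights."""
    return sum(weights[i][j] * states[j]
               for j in range(len(states)) if j != i) + biases[i]


def p_on(delta_e, temperature=1.0):
    """Logistic probability that a unit is on: 1 / (1 + exp(-delta_e / T))."""
    x = delta_e / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Algebraically identical form that avoids overflow for very negative x.
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical 3-unit network; the numbers are illustrative only.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]
states = [1, 1, 0]

gap = energy_gap(W, theta, states, 2)  # w_20 * 1 + w_21 * 1 + theta_2
print(gap, p_on(gap, temperature=1.0), p_on(gap, temperature=10.0))
```

Raising the temperature pulls the probability toward 0.5 regardless of the sign of the gap, which is why high-temperature dynamics are close to random.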

Equilibrium state

The network runs by repeatedly choosing a unit and resetting its state. After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If the network is started at a high temperature and the temperature is gradually decreased, it reaches thermal equilibrium at a lower temperature. It may then converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.

To train the network so that it converges to global states according to an external distribution over these states, the weights must be set so that the global states with the highest probabilities get the lowest energies. This is done by training.
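The running procedure described above (pick a unit, resample its state, lower the temperature gradually) can be sketched as follows. This is an illustrative sketch only: the function names, the geometric cooling schedule, the number of sweeps per temperature and the example network are all assumptions, not values prescribed by the model.

```python
import math
import random


def p_on(delta_e, temperature):
    """Logistic on-probability, written to avoid overflow."""
    x = delta_e / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gibbs_sweep(weights, biases, states, temperature, rng):
    """Resample len(states) randomly chosen units in place."""
    n = len(states)
    for _ in range(n):
        i = rng.randrange(n)
        gap = sum(weights[i][j] * states[j] for j in range(n) if j != i) + biases[i]
        states[i] = 1 if rng.random() < p_on(gap, temperature) else 0


def anneal(weights, biases, schedule, sweeps_per_temperature=20, seed=0):
    """Run the network through a decreasing temperature schedule."""
    rng = random.Random(seed)
    states = [rng.randint(0, 1) for _ in biases]
    for temperature in schedule:
        for _ in range(sweeps_per_temperature):
            gibbs_sweep(weights, biases, states, temperature, rng)
    return states


# Hypothetical 3-unit network and an assumed geometric cooling schedule.
W = [[0.0, 1.0, -2.0],
     [1.0, 0.0, 0.5],
     [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]
schedule = [4.0 * 0.7 ** k for k in range(10)]

print(anneal(W, theta, schedule))
```

The returned state is a single sample, so it is likely but not guaranteed to be a low-energy configuration; a fixed `seed` only makes the run repeatable.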

Training

The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the training set is a set of binary vectors over the set V. The distribution over the training set is denoted $P^{+}(V)$.

The distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as $P^{-}(V)$.

Our goal is to approximate the "real" distribution $P^{+}(V)$ using the $P^{-}(V)$ produced by the machine. The similarity of the two distributions is measured by the Kullback–Leibler divergence, $G$:

$$G = \sum_{v} P^{+}(v) \ln\left(\frac{P^{+}(v)}{P^{-}(v)}\right)$$

where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$, as promised by the Boltzmann distribution. A gradient descent algorithm over $G$ changes a given weight, $w_{ij}$, by subtracting the partial derivative of $G$ with respect to the weight.

Boltzmann machine training involves two alternating phases. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^{+}$). The other is the "negative" phase where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, but the output nodes are allowed to float. The gradient with respect to a given weight, $w_{ij}$, is given by the equation:

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^{+} - p_{ij}^{-}\right]$$

where:
* $p_{ij}^{+}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium on the positive phase.
* $p_{ij}^{-}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium on the negative phase.
* $R$ denotes the learning rate.
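For a machine small enough to enumerate, the divergence $G$ can be computed exactly. The sketch below assumes a hypothetical two-unit machine with no hidden units, so that the model distribution over visible states is simply the Boltzmann distribution over all global states. All names and numbers are illustrative assumptions.

```python
import math
from itertools import product


def energy(weights, biases, s):
    """Global energy of state s, counting each pair i < j once."""
    n = len(s)
    pair = sum(weights[i][j] * s[i] * s[j]
               for i in range(n) for j in range(i + 1, n))
    return -(pair + sum(biases[i] * s[i] for i in range(n)))


def model_distribution(weights, biases, temperature=1.0):
    """Exact Boltzmann distribution, obtained by enumerating every state."""
    states = list(product((0, 1), repeat=len(biases)))
    unnormalized = [math.exp(-energy(weights, biases, s) / temperature)
                    for s in states]
    z = sum(unnormalized)  # normalizing constant
    return {s: u / z for s, u in zip(states, unnormalized)}


def kl_divergence(p_data, p_model):
    """G = sum_v P+(v) * ln(P+(v) / P-(v)); terms with P+(v) = 0 contribute 0."""
    return sum(p * math.log(p / p_model[v]) for v, p in p_data.items() if p > 0)


# Hypothetical fully visible 2-unit machine and an assumed data distribution.
W = [[0.0, 1.5],
     [1.5, 0.0]]
theta = [-0.5, -0.5]
p_data = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_model = model_distribution(W, theta)
print(kl_divergence(p_data, p_model))
```

Enumeration costs $2^n$ evaluations for $n$ units, so this is only a checking tool for toy cases; it also makes the equilibrium property visible, since $\ln P^{-}(s) + E(s)/T$ is the same constant for every state.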

This result follows from the fact that at thermal equilibrium the probability $P^{-}(s)$ of any global state $s$ when the network is free-running is given by the Boltzmann distribution.

This learning rule is biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (synapse, biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.

Training the biases is similar, but uses only single node activity:

$$\frac{\partial G}{\partial \theta_i} = -\frac{1}{R}\left[p_i^{+} - p_i^{-}\right]$$
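One gradient step of the learning rule can be sketched as below, assuming that equilibrium samples from the positive (clamped) and negative (free-running) phases have already been collected by some sampler. The function names, the `step` parameter (which absorbs the factor $1/R$) and the sample lists are hypothetical.

```python
def activation_statistics(samples):
    """Estimate p_ij as the fraction of samples where units i and j are both on.

    The diagonal entry p_ii is the fraction of samples in which unit i is on,
    because s_i * s_i == s_i for binary states.
    """
    n = len(samples[0])
    m = len(samples)
    return [[sum(s[i] * s[j] for s in samples) / m for j in range(n)]
            for i in range(n)]


def learning_step(weights, biases, positive_samples, negative_samples, step=0.1):
    """Move weights and biases along (p+ - p-), i.e. descend the divergence G."""
    p_pos = activation_statistics(positive_samples)
    p_neg = activation_statistics(negative_samples)
    n = len(biases)
    new_weights = [[0.0 if i == j else
                    weights[i][j] + step * (p_pos[i][j] - p_neg[i][j])
                    for j in range(n)] for i in range(n)]
    new_biases = [biases[i] + step * (p_pos[i][i] - p_neg[i][i])
                  for i in range(n)]
    return new_weights, new_biases


# Made-up samples for a 2-unit machine, for illustration only.
positive = [[1, 1], [1, 1], [0, 0], [1, 1]]
negative = [[1, 0], [0, 1], [0, 0], [1, 1]]
W0 = [[0.0, 0.0], [0.0, 0.0]]
theta0 = [0.0, 0.0]

W1, theta1 = learning_step(W0, theta0, positive, negative)
print(W1, theta1)
```

Each weight update uses only the co-activation statistics of the two units it connects, which is the locality property discussed above.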

Problems

Theoretically the Boltzmann machine is a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, Boltzmann machines experience a serious practical problem, namely that they seem to stop learning correctly when the machine is scaled up to anything larger than a trivial size. This is due to important effects, specifically:
* the time required to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
* connection strengths are more plastic when the connected units have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a random walk until the activities saturate.

Types

Restricted Boltzmann machine

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in a restricted Boltzmann machine (RBM), which does not allow intralayer connections: there are no visible-to-visible and no hidden-to-hidden connections. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added the generative model improves.

An extension to the restricted Boltzmann machine allows using real-valued data rather than binary data.

One example of a practical RBM application is in speech recognition.
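Because an RBM has no connections within a layer, the units of one layer are conditionally independent given the other layer, so a whole layer can be resampled at once with the logistic rule for unit probabilities. The sketch below illustrates this at temperature 1; the function names, the layout `weights[i][j]` (visible unit `i` to hidden unit `j`) and all numbers are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample every hidden unit given the visible layer."""
    probs = [sigmoid(hidden_bias[j] +
                     sum(visible[i] * weights[i][j] for i in range(len(visible))))
             for j in range(len(hidden_bias))]
    return [1 if rng.random() < p else 0 for p in probs], probs


def sample_visible(hidden, weights, visible_bias, rng):
    """Sample every visible unit given the hidden layer."""
    probs = [sigmoid(visible_bias[i] +
                     sum(weights[i][j] * hidden[j] for j in range(len(hidden))))
             for i in range(len(visible_bias))]
    return [1 if rng.random() < p else 0 for p in probs], probs


# Hypothetical RBM with 3 visible and 2 hidden units.
rng = random.Random(0)
W = [[2.0, -1.0],
     [0.5, 0.0],
     [-1.5, 1.0]]
b_visible = [0.0, 0.0, 0.0]
b_hidden = [0.0, 0.0]

v = [1, 0, 1]
h, h_probs = sample_hidden(v, W, b_hidden, rng)
v_next, v_probs = sample_visible(h, W, b_visible, rng)
print(h, v_next)
```

Alternating these two functions gives a block Gibbs sampler, which needs far fewer steps per sweep than resampling one unit at a time in an unrestricted machine.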

Deep Boltzmann machine

A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units $\boldsymbol{\nu} \in \{0,1\}^D$ and layers of hidden units $\boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}$. No connection links units of the same layer (like RBM). For a DBM with three hidden layers, the probability assigned to vector $\boldsymbol{\nu}$ is

$$p(\boldsymbol{\nu}) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^{(1)} + \sum_{jl}W_{jl}^{(2)}h_j^{(1)}h_l^{(2)}+\sum_{lm}W_{lm}^{(3)}h_l^{(2)}h_m^{(3)}},$$

where $\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)}\}$ are the set of hidden units, $Z$ is the normalizing constant, and $\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)}\}$ are the model parameters, representing visible-hidden and hidden-hidden interactions. In a deep belief network (DBN) only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected.

Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited, labeled data to fine-tune the representations built using a large set of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they pursue the inference and training procedure in both directions, bottom-up and top-down, which allows the DBM to better unveil the representations of the input structures.

However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC). This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation.

Deep Boltzmann machines were mentioned by data scientist Prasad Kothari in Forbes in February 2020 in connection with suggesting COVID treatments.

Spike-and-slab RBMs

The need for deep learning with real-valued inputs, as in Gaussian RBMs, led to the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with binary latent variables.

Similar to basic RBMs and their variants, a spike-and-slab RBM is a bipartite graph, while like Gaussian RBMs (GRBMs), the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over a continuous domain; their mixture forms a prior.

An extension of ssRBM called µ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.

In mathematics

In a more general mathematical setting, the Boltzmann distribution is also known as the Gibbs measure. In statistics and machine learning it is called a log-linear model. In deep learning the Boltzmann distribution is used in the sampling distribution of stochastic neural networks such as the Boltzmann machine.

History

The Boltzmann machine is based on the Sherrington–Kirkpatrick spin-glass model, a stochastic Ising model.

The original contribution in applying such energy-based models in cognitive science appeared in papers by Hinton and Sejnowski.

The seminal publication by John Hopfield connected physics and statistical mechanics, mentioning spin glasses.

The idea of applying the Ising model with annealed Gibbs sampling is present in Douglas Hofstadter's Copycat project.

Similar ideas (with a change of sign in the energy function) are found in Paul Smolensky's "Harmony Theory".

The explicit analogy drawn with statistical mechanics in the Boltzmann machine formulation led to the use of terminology borrowed from physics (e.g., "energy" rather than "harmony"), which became standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the adoption of a variety of concepts and methods from statistical mechanics. The various proposals to use simulated annealing for inference were apparently independent.

Ising models became considered to be a special case of Markov random fields, which find widespread application in linguistics, robotics, computer vision and artificial intelligence.


External links

* Scholarpedia article by Hinton about Boltzmann machines
* Talk at Google by Geoffrey Hinton