A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising model), named after Ludwig Boltzmann, is a spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model, that is a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as a Markov random field. Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not been proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems. They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. They were heavily popularized and promoted by Geoffrey Hinton, Terry Sejnowski and Yann LeCun in cognitive sciences communities, particularly in machine learning, as part of "energy-based models" (EBM), because Hamiltonians of spin glasses as energy are used as a starting point to define the learning task.


Structure

A Boltzmann machine, like a Sherrington–Kirkpatrick model, is a network of units with a total "energy" (Hamiltonian) defined for the overall network. Its units produce binary results. Boltzmann machine weights are stochastic. The global energy E in a Boltzmann machine is identical in form to that of Hopfield networks and Ising models:

:E = -\left(\sum_{i<j} w_{ij} \, s_i \, s_j + \sum_i \theta_i \, s_i \right)

Where:
* w_{ij} is the connection strength between unit j and unit i.
* s_i is the state, s_i \in \{0,1\}, of unit i.
* \theta_i is the bias of unit i in the global energy function. (-\theta_i is the activation threshold for the unit.)

Often the weights w_{ij} are represented as a symmetric matrix W = [w_{ij}] with zeros along the diagonal.
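As an illustration (not part of the original formulation), a minimal NumPy sketch of this energy function, assuming a symmetric weight matrix with zero diagonal and binary states; all names here are hypothetical:

import numpy as np

def energy(W, theta, s):
    # E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)
    # np.triu(..., k=1) keeps each pair i<j exactly once.
    pairwise = np.sum(np.triu(W, k=1) * np.outer(s, s))
    return -(pairwise + theta @ s)

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2              # symmetric weights
np.fill_diagonal(W, 0.0)       # no self-connections
theta = rng.normal(size=n)
s = rng.integers(0, 2, size=n) # binary state vector
print(energy(W, theta, s))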


Unit state probability

The difference in the global energy that results from a single unit i equaling 0 (off) versus 1 (on), written \Delta E_i, assuming a symmetric matrix of weights, is given by:

:\Delta E_i = \sum_{j>i} w_{ij} \, s_j + \sum_{j<i} w_{ji} \, s_j + \theta_i

This can be expressed as the difference of energies of two states:

:\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}

Substituting the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state) yields:

:\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - (-k_B T \ln(p_{i=\text{on}})),

where k_B is the Boltzmann constant and is absorbed into the artificial notion of temperature T. Noting that the probabilities of the unit being ''on'' or ''off'' sum to 1 allows for the simplification:

:-\frac{\Delta E_i}{T} = -\ln(p_{i=\text{on}}) + \ln(p_{i=\text{off}}) = \ln\Big(\frac{p_{i=\text{off}}}{p_{i=\text{on}}}\Big) = \ln\big(p_{i=\text{on}}^{-1} - 1\big),

whence the probability that the i-th unit is ''on'' is given by

:p_{i=\text{on}} = \frac{1}{1+\exp\!\big(-\frac{\Delta E_i}{T}\big)},

where the scalar T is referred to as the temperature of the system. This relation is the source of the logistic function found in probability expressions in variants of the Boltzmann machine.
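A hedged sketch of the resulting single-unit Gibbs update, following the logistic form above; it assumes the symmetric, zero-diagonal W from the energy sketch in the Structure section, and the names are illustrative:

import numpy as np

def gibbs_update(W, theta, s, i, T, rng):
    # dE_i = E(s_i = 0) - E(s_i = 1) = sum_j w_ij s_j + theta_i
    # (valid for symmetric W with zero diagonal)
    dE = W[i] @ s + theta[i]
    p_on = 1.0 / (1.0 + np.exp(-dE / T))   # logistic function
    s[i] = 1 if rng.random() < p_on else 0
    return s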


Equilibrium state

The network runs by repeatedly choosing a unit and resetting its state. After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. The network is run beginning from a high temperature, and the temperature is gradually decreased until a thermal equilibrium is reached at a lower temperature. The network may then converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing. To train the network so that it converges to global states according to an external distribution over these states, the weights must be set so that the global states with the highest probabilities receive the lowest energies. This is done by training.
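The annealing procedure can be sketched as repeated Gibbs sweeps under a decreasing temperature schedule. This reuses the gibbs_update sketch above; the schedule values are arbitrary illustrations, not prescribed by the article:

def simulated_annealing(W, theta, s, rng,
                        schedule=(10.0, 5.0, 2.0, 1.0), sweeps_per_T=100):
    n = len(s)
    for T in schedule:                     # gradually lower the temperature
        for _ in range(sweeps_per_T):
            for i in rng.permutation(n):   # resample units in random order
                gibbs_update(W, theta, s, i, T, rng)
    return s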


Training

The units in the Boltzmann machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those that receive information from the 'environment', i.e. the training set is a set of binary vectors over the set V. The distribution over the training set is denoted P^{+}(V).

The distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as P^{-}(V). Our goal is to approximate the "real" distribution P^{+}(V) using the P^{-}(V) produced by the machine. The similarity of the two distributions is measured by the Kullback–Leibler divergence, G:

:G = \sum_{v} P^{+}(v) \ln\left(\frac{P^{+}(v)}{P^{-}(v)}\right)

where the sum is over all the possible states of V. G is a function of the weights, since they determine the energy of a state, and the energy determines P^{-}(v), as promised by the Boltzmann distribution. A gradient descent algorithm over G changes a given weight, w_{ij}, by subtracting the partial derivative of G with respect to the weight.

Boltzmann machine training involves two alternating phases. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to P^{+}). The other is the "negative" phase where the network is allowed to run freely, i.e. only the input nodes have their state determined by external data, but the output nodes are allowed to float. The gradient with respect to a given weight, w_{ij}, is given by the equation:

:\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^{+} - p_{ij}^{-}\right]

where:
* p_{ij}^{+} is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium on the positive phase.
* p_{ij}^{-} is the probability that units ''i'' and ''j'' are both on when the machine is at equilibrium on the negative phase.
* R denotes the learning rate.

This result follows from the fact that at thermal equilibrium the probability P^{-}(s) of any global state s when the network is free-running is given by the Boltzmann distribution.

This learning rule is biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (synapse, biologically) does not need information about anything other than the two neurons it connects. This is more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data; therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.

Training the biases is similar, but uses only single-node activity:

:\frac{\partial G}{\partial \theta_i} = -\frac{1}{R}\left[p_{i}^{+} - p_{i}^{-}\right]
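A minimal sketch of the resulting learning rule, assuming equilibrium samples have already been collected (e.g. via the annealing sketch above) in the clamped and free-running phases; the sampling itself is omitted and all names are illustrative:

import numpy as np

def phase_statistics(samples):
    # Estimate p_ij (both units on) and p_i (unit on) from equilibrium samples.
    S = np.asarray(samples, dtype=float)   # shape: (num_samples, num_units)
    return S.T @ S / len(S), S.mean(axis=0)

def boltzmann_update(W, theta, pos_samples, neg_samples, lr=0.01):
    # dw_ij ~ p_ij^+ - p_ij^-,   dtheta_i ~ p_i^+ - p_i^-
    pos_ss, pos_s = phase_statistics(pos_samples)   # "positive" (clamped) phase
    neg_ss, neg_s = phase_statistics(neg_samples)   # "negative" (free) phase
    W += lr * (pos_ss - neg_ss)
    np.fill_diagonal(W, 0.0)   # keep the zero-diagonal convention
    theta += lr * (pos_s - neg_s)
    return W, theta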


Problems

Theoretically the Boltzmann machine is a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, Boltzmann machines experience a serious practical problem, namely that learning appears to break down when the machine is scaled up to anything larger than a trivial size. This is due to important effects, specifically:
* the time required to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths;
* connection strengths are more plastic when the connected units have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a random walk until the activities saturate.


Types


Restricted Boltzmann machine

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in a restricted Boltzmann machine (RBM), which does not allow intralayer connections, i.e. there are no visible–visible or hidden–hidden connections. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added, the generative model improves. An extension to the restricted Boltzmann machine allows using real-valued data rather than binary data. One example of a practical RBM application is in speech recognition.
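One widely used efficient training scheme for RBMs is contrastive divergence (CD-1), which the article does not describe; the following hedged sketch (binary units, illustrative names) shows why the bipartite structure helps: all hidden units can be sampled in one block given the visible units, and vice versa:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b, c, v0, rng, lr=0.1):
    # W: (n_visible, n_hidden) weights; b: visible biases; c: hidden biases.
    ph0 = sigmoid(v0 @ W + c)                    # positive phase: h given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                  # one block-Gibbs step down
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                    # and back up
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c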


Deep Boltzmann machine

A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units \boldsymbol{\nu} \in \{0,1\}^D and a series of layers of hidden units \boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}. No connection links units of the same layer (like RBM). For a DBM with three hidden layers, the probability assigned to vector \boldsymbol{\nu} is

:p(\boldsymbol{\nu}) = \frac{1}{Z}\sum_h e^{\sum_{ij} W_{ij}^{(1)} \nu_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)}},

where \boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)}\} are the set of hidden units, and \theta = \{W^{(1)}, W^{(2)}, W^{(3)}\} are the model parameters, representing visible–hidden and hidden–hidden interactions.

In a DBN only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected.

Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited labeled data to fine-tune the representations built using a large set of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, they pursue the inference and training procedure in both directions, bottom-up and top-down, which allows the DBM to better unveil the representations of the input structures.

However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC). This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation.
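To make the probability formula above concrete, a brute-force sketch for a toy DBM with two hidden layers and no bias terms; the enumeration is exponential, so this is illustrative only, and all names are hypothetical:

import numpy as np
from itertools import product

def dbm_prob(v, W1, W2):
    # p(v) = (1/Z) * sum_h exp(v' W1 h1 + h1' W2 h2)
    n1, n2 = W1.shape[1], W2.shape[1]

    def sum_over_hidden(vv):
        # Sum the unnormalized probability over all hidden configurations.
        return sum(np.exp(vv @ W1 @ np.array(h1) + np.array(h1) @ W2 @ np.array(h2))
                   for h1 in product([0, 1], repeat=n1)
                   for h2 in product([0, 1], repeat=n2))

    # Partition function Z: also sum over all visible configurations.
    Z = sum(sum_over_hidden(np.array(vv))
            for vv in product([0, 1], repeat=len(v)))
    return sum_over_hidden(np.array(v)) / Z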


Spike-and-slab RBMs

The need for deep learning with real-valued inputs, as in Gaussian RBMs, led to the spike-and-slab RBM (''ss''RBM), which models continuous-valued inputs with binary latent variables. Similar to basic RBMs and their variants, a spike-and-slab RBM is a bipartite graph, while, like Gaussian RBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over a continuous domain; their mixture forms a prior.

An extension of the ''ss''RBM called µ-''ss''RBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.


In mathematics

In a more general mathematical setting, the Boltzmann distribution is also known as the Gibbs measure. In statistics and machine learning it is called a log-linear model. In deep learning the Boltzmann distribution is used in the sampling distribution of stochastic neural networks such as the Boltzmann machine.
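A small sketch connecting these views: enumerating the 2^n states of a tiny network and forming the Gibbs measure p(s) \propto \exp(-E(s)/T), whose log-probabilities are linear in the energies. It reuses the energy sketch from the Structure section, and the names are illustrative:

import numpy as np

def boltzmann_distribution(W, theta, T=1.0):
    n = len(theta)
    states = [np.array([(k >> i) & 1 for i in range(n)]) for k in range(2 ** n)]
    E = np.array([energy(W, theta, s) for s in states])
    logits = -E / T
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return states, p / p.sum()          # normalization plays the role of Z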


History

The Boltzmann machine is based on the Sherrington–Kirkpatrick spin glass model by David Sherrington and Scott Kirkpatrick. The seminal publication by John Hopfield (1982) applied methods of statistical mechanics, mainly the recently developed (1970s) theory of spin glasses, to study associative memory (later named the "Hopfield network"). The original contribution in applying such energy-based models in cognitive science appeared in papers by Geoffrey Hinton and Terry Sejnowski. In a 1995 interview, Hinton stated that in February or March 1983 he was going to give a talk on simulated annealing in Hopfield networks, so he had to design a learning algorithm for the talk, resulting in the Boltzmann machine learning algorithm.

The idea of applying the Ising model with annealed Gibbs sampling was used in Douglas Hofstadter's Copycat project (1984). The explicit analogy drawn with statistical mechanics in the Boltzmann machine formulation led to the use of terminology borrowed from physics (e.g., "energy"), which became standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the adoption of a variety of concepts and methods from statistical mechanics. The various proposals to use simulated annealing for inference were apparently independent. Similar ideas (with a change of sign in the energy function) are found in Paul Smolensky's "Harmony Theory". Ising models can be generalized to Markov random fields, which find widespread application in linguistics, robotics, computer vision and artificial intelligence.

In 2024, Hopfield and Hinton were awarded the Nobel Prize in Physics for their foundational contributions to machine learning, such as the Boltzmann machine.


See also

* Restricted Boltzmann machine
* Helmholtz machine
* Markov random field (MRF)
* Ising model (Lenz–Ising model)
* Hopfield network


Further reading

* Kothari P (2020): https://www.forbes.com/sites/tomtaulli/2020/02/02/coronavirus-can-ai-artificial-intelligence-make-a-difference/?sh=1eca51e55817


External links


* Scholarpedia article by Hinton about Boltzmann machines
* Talk at Google by Geoffrey Hinton