Deep belief network
In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not between units within each layer.

When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors. After this learning step, a DBN can be further trained with supervision to perform classification.

DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as the visible layer for the next. An RBM is an undirected, generative energy-based model with a "visible" input layer, a hidden layer, and connections between but not within layers. This composition leads to a fast, layer-by-layer unsupervised training procedure, in which contrastive divergence is applied to each sub-network in turn, starting from the "lowest" pair of layers (the lowest visible layer is a training set).

The observation that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms. DBNs have been applied in many real-world settings, for example in electroencephalography and drug discovery.
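To make the stacking concrete, the following sketch (Python) wires up a DBN greedily: each RBM is trained without supervision, and its hidden-layer activations become the visible data for the next RBM. The RBM interface used here (make_rbm, train, hidden_probs) is a hypothetical placeholder, not any particular library's API; a concrete CD-1 implementation of a single RBM is sketched in the Training section below.

def train_dbn(data, hidden_layer_sizes, make_rbm):
    """Greedily train a stack of RBMs (hypothetical RBM interface).

    data               -- training vectors, shape (n_examples, n_visible)
    hidden_layer_sizes -- sizes of the successive hidden layers
    make_rbm(nv, nh)   -- constructs one untrained RBM with nv visible and nh hidden units
    """
    rbms = []
    layer_input = data                      # the lowest visible layer is the training set
    n_visible = data.shape[1]
    for n_hidden in hidden_layer_sizes:
        rbm = make_rbm(n_visible, n_hidden)
        rbm.train(layer_input)              # unsupervised, e.g. contrastive divergence
        # This RBM's hidden layer serves as the visible layer of the next one.
        layer_input = rbm.hidden_probs(layer_input)
        n_visible = n_hidden
        rbms.append(rbm)
    return rbms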


Training

The training method for RBMs proposed by Geoffrey Hinton for use with training "Product of Experts" models is called contrastive divergence (CD). CD provides an approximation to the maximum likelihood method that would ideally be applied for learning the weights. In training a single RBM, weight updates are performed with gradient descent via the following equation:

w_{ij}(t+1) = w_{ij}(t) + \eta \frac{\partial \log p(v)}{\partial w_{ij}}

Here p(v) is the probability of a visible vector, which is given by

p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)},

where Z is the partition function (used for normalizing) and E(v,h) is the energy function assigned to the state of the network. A lower energy indicates the network is in a more "desirable" configuration. The gradient \frac{\partial \log p(v)}{\partial w_{ij}} has the simple form \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model}, where \langle\cdots\rangle_p represents an average with respect to distribution p. The issue arises in sampling \langle v_i h_j \rangle_\text{model}, because this requires extended alternating Gibbs sampling.
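To see where this difference of expectations comes from, the standard derivation is sketched below, assuming the usual binary-RBM energy E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j (the text above leaves E unspecified), so that \partial(-E)/\partial w_{ij} = v_i h_j:

\begin{aligned}
\frac{\partial \log p(v)}{\partial w_{ij}}
  &= \frac{\partial}{\partial w_{ij}} \left( \log \sum_h e^{-E(v,h)} - \log Z \right) \\
  &= \sum_h \frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}\, v_i h_j \;-\; \sum_{v',h} \frac{e^{-E(v',h)}}{Z}\, v'_i h_j \\
  &= \sum_h p(h \mid v)\, v_i h_j \;-\; \sum_{v',h} p(v',h)\, v'_i h_j
   = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model}.
\end{aligned}

The first term averages over the hidden units with the visible units clamped to the data; the second averages over the model's joint distribution, and it is this second term that CD approximates by truncated Gibbs sampling.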
CD replaces this step by running alternating Gibbs sampling for n steps (values of n = 1 perform well). After n steps, the data are sampled and that sample is used in place of \langle v_i h_j \rangle_\text{model}. The CD procedure works as follows:

# Initialize the visible units to a training vector.
# Update the hidden units in parallel given the visible units: p(h_j = 1 \mid \textbf{v}) = \sigma(b_j + \sum_i v_i w_{ij}), where \sigma is the sigmoid function and b_j is the bias of h_j.
# Update the visible units in parallel given the hidden units: p(v_i = 1 \mid \textbf{h}) = \sigma(a_i + \sum_j h_j w_{ij}), where a_i is the bias of v_i. This is called the "reconstruction" step.
# Re-update the hidden units in parallel given the reconstructed visible units, using the same equation as in step 2.
# Perform the weight update: \Delta w_{ij} \propto \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{reconstruction}.

Once an RBM is trained, another RBM is "stacked" atop it, taking its input from the final trained layer. The new visible layer is initialized to a training vector, and values for the units in the already-trained layers are assigned using the current weights and biases. The new RBM is then trained with the procedure above. This whole process is repeated until the desired stopping criterion is met.

Although the approximation of CD to maximum likelihood is crude (it does not follow the gradient of any function), it is empirically effective.
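The sketch below implements the five numbered steps above as a single CD-1 update for one binary RBM, in NumPy. The class name, the 0.01-scaled Gaussian weight initialization, and the learning rate are illustrative assumptions; using probabilities rather than binary samples for the reconstruction and for the averages is a common practical simplification, not part of the description above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Minimal binary RBM with a single contrastive-divergence (CD-1) update."""

    def __init__(self, n_visible, n_hidden, learning_rate=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w = 0.01 * self.rng.standard_normal((n_visible, n_hidden))  # weights w_ij
        self.a = np.zeros(n_visible)  # visible biases a_i
        self.b = np.zeros(n_hidden)   # hidden biases b_j
        self.lr = learning_rate

    def cd1_update(self, v0):
        """One CD-1 update for a batch v0 of shape (batch_size, n_visible)."""
        # Steps 1-2: clamp the visible units to the data, then update the hidden
        # units in parallel: p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij).
        ph0 = sigmoid(self.b + v0 @ self.w)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        # Step 3 ("reconstruction"): update the visible units in parallel:
        # p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij). Probabilities are used
        # directly here instead of binary samples -- a common simplification.
        pv1 = sigmoid(self.a + h0 @ self.w.T)
        # Step 4: re-update the hidden units from the reconstructed visible units.
        ph1 = sigmoid(self.b + pv1 @ self.w)
        # Step 5: delta w_ij proportional to <v_i h_j>_data - <v_i h_j>_reconstruction.
        batch_size = v0.shape[0]
        self.w += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch_size
        self.a += self.lr * np.mean(v0 - pv1, axis=0)   # analogous bias updates
        self.b += self.lr * np.mean(ph0 - ph1, axis=0)

Repeatedly calling cd1_update over mini-batches of training vectors trains one RBM; its hidden probabilities then serve as the visible data for the next RBM in the stack, as described above.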


See also

* Bayesian network
* Deep learning
* Convolutional deep belief network
* Energy-based model


External links

* {{cite web |title=Deep Belief Network Example |website=Deeplearning4j Tutorials |url=http://deeplearning4j.org/deepbeliefnetwork.html |access-date=2015-02-22 |archive-url=https://web.archive.org/web/20161003210144/http://deeplearning4j.org/deepbeliefnetwork.html |archive-date=2016-10-03 |url-status=dead}}