Mathematics of Artificial Neural Networks

An artificial neural network (ANN) combines biological principles with advanced statistics to solve problems in domains such as pattern recognition and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways.


Structure


Neuron

A neuron with label j receiving an input p_j(t) from predecessor neurons consists of the following components:
* an ''activation'' a_j(t), the neuron's state, depending on a discrete time parameter,
* an optional ''threshold'' \theta_j, which stays fixed unless changed by learning,
* an ''activation function'' f that computes the new activation at a given time t+1 from a_j(t), \theta_j and the net input p_j(t), giving rise to the relation
:: a_j(t+1) = f(a_j(t), p_j(t), \theta_j),
* and an ''output function'' f_\text{out} computing the output from the activation
:: o_j(t) = f_\text{out}(a_j(t)).
Often the output function is simply the identity function. An ''input neuron'' has no predecessor but serves as input interface for the whole network. Similarly, an ''output neuron'' has no successor and thus serves as output interface of the whole network.
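For concreteness, the update of a single neuron could be sketched as follows; the step-shaped activation and the identity output are illustrative choices, not requirements of the definitions above.

  def activation_function(a_j, p_j, theta_j):
      # f: computes a_j(t+1) from the previous activation a_j(t), the net
      # input p_j(t) and the threshold theta_j. A step unit is used purely
      # for illustration; this particular f happens to ignore a_j(t).
      return 1.0 if p_j >= theta_j else 0.0

  def output_function(a_j):
      # f_out: here simply the identity function.
      return a_j

  a = activation_function(a_j=0.0, p_j=0.7, theta_j=0.5)  # a_j(t+1) = 1.0
  o = output_function(a)                                   # o_j(t+1) = 1.0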


Propagation function

The ''propagation function'' computes the ''input'' p_j(t) to the neuron j from the outputs o_i(t) of the predecessor neurons and typically has the form
: p_j(t) = \sum_i o_i(t) w_{ij}.


Bias

A bias term can be added, changing the form to the following:
: p_j(t) = \sum_i o_i(t) w_{ij} + w_{0j},
where w_{0j} is a bias.
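A minimal sketch of this propagation function with a bias term, assuming the predecessor outputs o_i(t) and the weights w_{ij} into neuron j are given as NumPy arrays:

  import numpy as np

  def propagation(outputs, weights_j, bias_j=0.0):
      # p_j(t) = sum_i o_i(t) * w_ij + w_0j
      return float(np.dot(outputs, weights_j) + bias_j)

  o = np.array([0.5, 1.0, -0.3])          # outputs o_i(t) of the predecessor neurons
  w_j = np.array([0.2, -0.4, 0.1])        # weights w_ij into neuron j
  p_j = propagation(o, w_j, bias_j=0.05)  # net input p_j(t) = -0.33 + 0.05 = -0.28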


Neural networks as functions

Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision), \textstyle f : X \rightarrow Y, or a distribution over \textstyle X or both \textstyle X and \textstyle Y. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a ''class'' of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity).

Mathematically, a neuron's network function \textstyle f(x) is defined as a composition of other functions \textstyle g_i(x), which can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the ''nonlinear weighted sum''
: f(x) = K \left(\sum_i w_i g_i(x)\right),
where \textstyle K (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent, sigmoid function, softmax function, or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions \textstyle g_i as a vector \textstyle g = (g_1, g_2, \ldots, g_n).

Such a decomposition of \textstyle f, with dependencies between variables indicated by arrows, can be interpreted in two ways. The first view is the functional view: the input \textstyle x is transformed into a 3-dimensional vector \textstyle h, which is then transformed into a 2-dimensional vector \textstyle g, which is finally transformed into \textstyle f. This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable \textstyle F = f(G) depends upon the random variable \textstyle G = g(H), which depends upon \textstyle H = h(X), which depends upon the random variable \textstyle X. This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of \textstyle g are independent of each other given their input \textstyle h). This naturally enables a degree of parallelism in the implementation.

Networks such as the one above are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted with \textstyle f shown as dependent upon itself; however, an implied temporal dependence is not shown.
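As a sketch of such a composition, the following builds f from nested nonlinear weighted sums in NumPy, with a 3-dimensional h and a 2-dimensional g as in the description above; tanh stands in for the activation function K and the weights are random placeholders.

  import numpy as np

  def layer(W, b, x, K=np.tanh):
      # One nonlinear weighted sum: K applied to a weighted sum of its inputs.
      return K(W @ x + b)

  rng = np.random.default_rng(0)
  x = rng.normal(size=4)                               # input x
  h = layer(rng.normal(size=(3, 4)), np.zeros(3), x)   # 3-dimensional h(x)
  g = layer(rng.normal(size=(2, 3)), np.zeros(2), h)   # 2-dimensional g(h)
  f = layer(rng.normal(size=(1, 2)), np.zeros(1), g)   # f(g): a feedforward composition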


Backpropagation

Backpropagation training algorithms fall into three categories:
* steepest descent (with variable learning rate and momentum, resilient backpropagation);
* quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one-step secant);
* Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribière update, Powell–Beale restart, scaled conjugate gradient).
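As an illustration of the first family, here is a minimal sketch of steepest descent with a fixed learning rate and a momentum term; the error gradient is passed in as a function, and the quadratic example at the end is only a demonstration.

  import numpy as np

  def descent_step(w, velocity, grad_error, learning_rate=0.01, momentum=0.9):
      # One steepest-descent step: move against the error gradient while
      # retaining a fraction (momentum) of the previous step.
      g = grad_error(w)
      velocity = momentum * velocity - learning_rate * g
      return w + velocity, velocity

  # Demonstration: minimize E(w) = ||w||^2, whose gradient is 2w.
  w = np.array([1.0, -2.0])
  v = np.zeros_like(w)
  for _ in range(200):
      w, v = descent_step(w, v, lambda w: 2 * w)
  # w is now close to the minimizer [0, 0]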


Algorithm

Let N be a network with e connections, m inputs and n outputs. Below, x_1, x_2, \dots denote vectors in \mathbb{R}^m, y_1, y_2, \dots vectors in \mathbb{R}^n, and w_0, w_1, w_2, \ldots vectors in \mathbb{R}^e. These are called ''inputs'', ''outputs'' and ''weights'', respectively. The network corresponds to a function y = f_N(w, x) which, given a weight w, maps an input x to an output y.

In supervised learning, a sequence of ''training examples'' (x_1, y_1), \dots, (x_p, y_p) produces a sequence of weights w_0, w_1, \dots, w_p, starting from some initial weight w_0, usually chosen at random. These weights are computed in turn: w_i is computed using only (x_i, y_i, w_{i-1}), for i = 1, \dots, p. The output of the algorithm is then w_p, giving a new function x \mapsto f_N(w_p, x). The computation is the same in each step, hence only the case i = 1 is described. w_1 is calculated from (x_1, y_1, w_0) by considering a variable weight w and applying gradient descent to the function w \mapsto E(f_N(w, x_1), y_1) to find a local minimum, starting at w = w_0. This makes w_1 the minimizing weight found by gradient descent.
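A minimal sketch of this scheme, under the assumption that a routine grad_E is available that returns the gradient of w \mapsto E(f_N(w, x_i), y_i) (for instance obtained by backpropagation); the linear example at the end is purely illustrative.

  import numpy as np

  def train_sequence(grad_E, w0, examples, learning_rate=0.1, inner_steps=50):
      # Produces the weight sequence w_0, w_1, ..., w_p: each w_i is obtained
      # from (x_i, y_i, w_{i-1}) by gradient descent on w -> E(f_N(w, x_i), y_i),
      # starting at w_{i-1}.
      w = np.asarray(w0, dtype=float)
      for x, y in examples:
          for _ in range(inner_steps):
              w = w - learning_rate * grad_E(w, x, y)
      return w

  # Illustration with a linear "network" f_N(w, x) = w @ x and E(y, y') = |y - y'|^2,
  # whose gradient with respect to w is 2 * (w @ x - y) * x.
  grad_E = lambda w, x, y: 2 * (w @ x - y) * x
  w_p = train_sequence(grad_E, w0=np.zeros(2),
                       examples=[(np.array([1.0, 0.0]), 1.0),
                                 (np.array([0.0, 1.0]), -1.0)])
  # w_p is close to [1, -1]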


Learning pseudocode

To implement the algorithm above, explicit formulas are required for the gradient of the function w \mapsto E(f_N(w, x), y), where the error function is E(y, y') = |y - y'|^2. The learning algorithm can be divided into two phases: propagation and weight update.
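Concretely, for this choice of E the gradient follows from the chain rule: writing \hat y = f_N(w, x),
: \nabla_w E(f_N(w, x), y) = 2\,(\hat y - y)^\mathsf{T} \nabla_w f_N(w, x),
i.e. the output residual multiplied by the Jacobian of the network function with respect to the weights; the propagation and weight-update phases below evaluate this product layer by layer.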


Propagation

Propagation involves the following steps:
* Propagation forward through the network to generate the output value(s).
* Calculation of the cost (error term).
* Propagation of the output activations back through the network, using the training pattern target, to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons.


Weight update

For each weight:
* Multiply the weight's output delta and input activation to find the gradient of the weight.
* Subtract a ratio (percentage) of the weight's gradient from the weight.

The ''learning rate'' is the ratio (percentage) that influences the speed and quality of learning. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurate the training. The sign of the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight. Therefore, the weight must be updated in the opposite direction, "descending" the gradient. Learning is repeated (on new batches) until the network performs adequately.
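A minimal sketch of this per-weight rule, assuming the output delta of neuron j and the activation of its input neuron i have already been computed:

  def update_weight(w_ij, activation_i, delta_j, learning_rate):
      # The gradient of the error with respect to w_ij is the input activation
      # times the output delta; the weight moves against ("descends") it.
      gradient = activation_i * delta_j
      return w_ij - learning_rate * gradient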


Pseudocode

Pseudocode for a stochastic gradient descent algorithm for training a three-layer network (one hidden layer):

  initialize network weights (often small random values)
  do
      for each training example named ex do
          prediction = neural-net-output(network, ex)  ''// forward pass''
          actual = teacher-output(ex)
          compute error (prediction - actual) at the output units
          compute \Delta w_h for all weights from hidden layer to output layer  ''// backward pass''
          compute \Delta w_i for all weights from input layer to hidden layer   ''// backward pass continued''
          update network weights  ''// input layer not modified by error estimate''
  until error rate becomes acceptably low
  return the network

The lines labeled "backward pass" can be implemented using the backpropagation algorithm, which calculates the gradient of the error of the network with respect to the network's modifiable weights (Werbos 1994).
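The pseudocode above could be fleshed out as follows, as a minimal sketch in Python/NumPy of a three-layer network with sigmoid units trained on squared error; the function names, the XOR data and the hyperparameters are illustrative choices, not part of the text above, and a fixed number of epochs stands in for the "acceptably low error" stopping test.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def forward(W1, W2, x):
      # Forward pass through one hidden layer; biases are folded into the
      # weight matrices by appending a constant 1 to each layer's input.
      h = sigmoid(W1 @ np.append(x, 1.0))   # hidden activations
      y = sigmoid(W2 @ np.append(h, 1.0))   # output activations
      return h, y

  def train(examples, n_in, n_hidden, n_out, learning_rate=1.0, epochs=10000, seed=0):
      rng = np.random.default_rng(seed)
      # initialize network weights (often small random values)
      W1 = rng.normal(scale=0.5, size=(n_hidden, n_in + 1))
      W2 = rng.normal(scale=0.5, size=(n_out, n_hidden + 1))
      for _ in range(epochs):
          for x, target in examples:                 # for each training example
              h, y = forward(W1, W2, x)              # forward pass
              # backward pass: deltas of output and hidden units
              # (sigmoid derivative is s * (1 - s); the constant factor of the
              # squared error is absorbed into the learning rate)
              delta_out = (y - target) * y * (1.0 - y)
              delta_hid = (W2[:, :-1].T @ delta_out) * h * (1.0 - h)
              # weight update: gradient = input activation times output delta
              W2 -= learning_rate * np.outer(delta_out, np.append(h, 1.0))
              W1 -= learning_rate * np.outer(delta_hid, np.append(x, 1.0))
      return W1, W2

  # Illustration: learn XOR.
  data = [(np.array([0.0, 0.0]), np.array([0.0])),
          (np.array([0.0, 1.0]), np.array([1.0])),
          (np.array([1.0, 0.0]), np.array([1.0])),
          (np.array([1.0, 1.0]), np.array([0.0]))]
  W1, W2 = train(data, n_in=2, n_hidden=4, n_out=1)
  for x, t in data:
      print(x, forward(W1, W2, x)[1])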


References

* Werbos, Paul J. (1994). ''The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting''. New York, NY: John Wiley & Sons.