Feedforward neural network

Feedforward refers to the recognition-inference architecture of neural networks. Artificial neural network architectures are based on inputs multiplied by weights to obtain outputs (inputs-to-output): feedforward. Recurrent neural networks, or neural networks with loops, allow information from later processing stages to feed back to earlier stages for sequence processing. However, at every stage of inference a feedforward multiplication remains the core, essential for backpropagation (Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In David E. Rumelhart, James L. McClelland, and the PDP Research Group (eds.), ''Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations''. MIT Press, 1986) or backpropagation through time. Thus neural networks cannot contain feedback like negative feedback or positive feedback, where the outputs feed back to the ''very same'' inputs and modify them, because this forms an infinite loop which cannot be rewound in time to generate an error signal through backpropagation. This issue and nomenclature appear to be a point of confusion between some computer scientists and scientists in other fields studying brain networks.
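
As an illustration of the inputs-multiplied-by-weights view of feedforward inference, a minimal sketch follows; the layer sizes, weight values, and use of NumPy are arbitrary choices for illustration only.

```python
import numpy as np

# A single feedforward step: the outputs are a weighted sum of the inputs,
# passed through an activation function; nothing feeds back to the inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=3)          # example input vector (3 features)
W = rng.normal(size=(2, 3))     # weights of one layer: 2 outputs, 3 inputs
b = np.zeros(2)                 # biases

y = np.tanh(W @ x + b)          # inputs-to-output, with no loops
print(y)
```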


Mathematical foundations


Activation function

The two historically common activation functions are both sigmoids, and are described by

:y(v_i) = \tanh(v_i) ~~ \textrm{and} ~~ y(v_i) = (1+e^{-v_i})^{-1}.

The first is a hyperbolic tangent that ranges from −1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here y_i is the output of the ith node (neuron) and v_i is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models). In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids.
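
The activation functions named above can be written down directly; the sketch below is illustrative only and uses NumPy as an assumed dependency.

```python
import numpy as np

def tanh(v):
    """Hyperbolic tangent sigmoid, range (-1, 1)."""
    return np.tanh(v)

def logistic(v):
    """Logistic sigmoid, range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    """Rectified linear unit, common in deep learning."""
    return np.maximum(0.0, v)

def softplus(v):
    """Smooth approximation of the rectifier."""
    return np.log1p(np.exp(v))

v = np.linspace(-3, 3, 7)
print(tanh(v), logistic(v), relu(v), softplus(v), sep="\n")
```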


Learning

Learning occurs by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation.

We can represent the degree of error in an output node j for the nth data point (training example) by e_j(n)=d_j(n)-y_j(n), where d_j(n) is the desired target value for the nth data point at node j, and y_j(n) is the value produced at node j when the nth data point is given as an input.

The node weights can then be adjusted based on corrections that minimize the error in the entire output for the nth data point, given by

:\mathcal{E}(n)=\frac{1}{2}\sum_j e_j^2(n).

Using gradient descent, the change in each weight w_{ij} is

:\Delta w_{ij}(n) = -\eta\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} y_i(n)

where y_i(n) is the output of the previous neuron i, and \eta is the ''learning rate'', which is selected to ensure that the weights quickly converge to a response, without oscillations. In the previous expression, \frac{\partial\mathcal{E}(n)}{\partial v_j(n)} denotes the partial derivative of the error \mathcal{E}(n) with respect to the weighted sum v_j(n) of the input connections of neuron j.

The derivative to be calculated depends on the induced local field v_j, which itself varies. It is easy to prove that for an output node this derivative can be simplified to

:-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = e_j(n)\phi^\prime (v_j(n))

where \phi^\prime is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is

:-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = \phi^\prime (v_j(n))\sum_k -\frac{\partial\mathcal{E}(n)}{\partial v_k(n)} w_{kj}(n).

This depends on the change in weights of the kth nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.
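
The update rules above can be sketched for a small two-layer network with a logistic activation. This is a minimal sketch, not the canonical implementation: the task (XOR), layer sizes, learning rate, epoch count, and the inclusion of bias terms (treated as weights from a constant input of 1) are assumptions made for illustration, and convergence depends on the random initialization.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def logistic_prime(v):
    y = logistic(v)
    return y * (1.0 - y)

rng = np.random.default_rng(0)

# Toy supervised task (XOR): 2 inputs -> 3 hidden units -> 1 output.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])          # desired targets d_j(n)

W1 = rng.normal(scale=1.0, size=(2, 3))         # input -> hidden weights
b1 = np.zeros(3)                                # biases (weights from a constant input of 1)
W2 = rng.normal(scale=1.0, size=(3, 1))         # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                                       # learning rate

for epoch in range(10000):
    for x, d in zip(X, D):
        # Forward pass: weighted sums v_j and activations y_j.
        v1 = x @ W1 + b1
        y1 = logistic(v1)
        v2 = y1 @ W2 + b2
        y2 = logistic(v2)

        # Output node:  -dE/dv_j = e_j(n) * phi'(v_j(n))
        e = d - y2
        delta2 = e * logistic_prime(v2)

        # Hidden nodes: -dE/dv_j = phi'(v_j(n)) * sum_k delta_k * w_kj(n)
        delta1 = logistic_prime(v1) * (delta2 @ W2.T)

        # Gradient-descent weight changes: Delta w_ij(n) = eta * delta_j(n) * y_i(n)
        W2 += eta * np.outer(y1, delta2)
        b2 += eta * delta2
        W1 += eta * np.outer(x, delta1)
        b1 += eta * delta1

# Outputs should move toward the targets D (convergence depends on initialization).
print(logistic(logistic(X @ W1 + b1) @ W2 + b2).round(2))
```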


History


Timeline

* Circa 1800, Legendre (1805) and Gauss (1795) created the simplest feedforward network, consisting of a single weight layer with linear activation functions. It was trained by the least squares method for minimising mean squared error, also known as linear regression. Legendre and Gauss used it for the prediction of planetary movement from training data. (Merriman, Mansfield. ''A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes''. Vol. 4. Academy, 1877.)
* In 1943, Warren McCulloch and Walter Pitts proposed the binary artificial neuron as a logical model of biological neural networks.
* In 1958, Frank Rosenblatt proposed the multilayered perceptron model, consisting of an input layer, a hidden layer with randomized weights that did not learn, and an output layer with learnable connections. R. D. Joseph (1960) mentions an even earlier perceptron-like device: "Farley and Clark of MIT Lincoln Laboratory actually preceded Rosenblatt in the development of a perceptron-like device." However, "they dropped the subject."
* In 1960, Joseph also discussed multilayer perceptrons with an adaptive hidden layer. Rosenblatt (1962) cited and adopted these ideas, also crediting work by H. D. Block and B. W. Knight. Unfortunately, these early efforts did not lead to a working learning algorithm for hidden units, i.e., deep learning.
* In 1965, Alexey Grigorevich Ivakhnenko and Valentin Lapa published the Group Method of Data Handling, the first working deep learning algorithm, a method to train arbitrarily deep neural networks. It is based on layer-by-layer training through regression analysis. Superfluous hidden units are pruned using a separate validation set. Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units or "gates." It was used to train an eight-layer neural net in 1971.
* In 1967, Shun'ichi Amari reported the first multilayered neural network trained by stochastic gradient descent, which was able to classify non-linearly separable pattern classes. Amari's student Saito conducted the computer experiments, using a five-layered feedforward network with two learning layers.
* In 1970, Seppo Linnainmaa published the modern form of backpropagation in his master's thesis. G. M. Ostrovski et al. republished it in 1971. (Ostrovski, G. M., Volin, Y. M., and Boris, W. W. (1971). On the computation of derivatives. Wiss. Z. Tech. Hochschule for Chemistry, 13:382–384.) Paul Werbos applied backpropagation to neural networks in 1982 (his 1974 PhD thesis, reprinted in a 1994 book, did not yet describe the algorithm). In 1986, David E. Rumelhart et al. popularised backpropagation but did not cite the original work.
* In 2003, interest in backpropagation networks returned due to the successes of deep learning being applied to language modelling by Yoshua Bengio with co-authors.


Linear regression


Perceptron

If using a threshold, i.e. a linear threshold activation function, the resulting ''linear threshold unit'' is called a ''perceptron''. (Often the term is used to denote just one of these units.) Multiple parallel non-linear units are able to approximate any continuous function from a compact interval of the real numbers into the interval [−1, 1], despite the limited computational power of a single unit with a linear threshold function.

Perceptrons can be trained by a simple learning algorithm that is usually called the ''delta rule''. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent.
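
A minimal sketch of a single linear threshold unit trained with a delta-rule-style update follows; the task (logical AND), learning rate, and epoch count are arbitrary choices for illustration.

```python
import numpy as np

def threshold(v):
    """Linear threshold activation: output 1 if the weighted sum exceeds 0, else 0."""
    return np.where(v > 0, 1, 0)

# Linearly separable toy data: the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])          # desired outputs

w = np.zeros(2)                     # weights
b = 0.0                             # bias (threshold offset)
eta = 0.1                           # learning rate

for epoch in range(20):
    for x, target in zip(X, d):
        y = threshold(w @ x + b)    # calculated output
        error = target - y          # error between sample and calculated output
        w += eta * error * x        # adjust weights in proportion to the error
        b += eta * error

print(threshold(X @ w + b))         # expected: [0 0 0 1]
```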


Multilayer perceptron

A multilayer perceptron (MLP) is a misnomer for a modern feedforward artificial neural network consisting of fully connected neurons (hence the synonym sometimes used, ''fully connected network'' (''FCN'')), often with a nonlinear kind of activation function, organized in at least three layers, and notable for being able to distinguish data that is not linearly separable. (Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. ''Mathematics of Control, Signals, and Systems'', 2(4), 303–314.)
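
A forward pass through a minimal three-layer fully connected network might look like the sketch below; the layer sizes, random weights, and choice of tanh for the hidden layer are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, W, b, activation):
    """One fully connected layer: weighted sum followed by an activation."""
    return activation(x @ W + b)

x = rng.normal(size=4)                          # input layer (4 features)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)   # hidden -> output

h = dense(x, W1, b1, np.tanh)                   # hidden layer, nonlinear activation
y = dense(h, W2, b2, lambda v: v)               # output layer (identity here)
print(y)
```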


Other feedforward networks

Examples of other feedforward networks include convolutional neural networks and radial basis function networks, which use a different activation function.
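
For instance, a radial basis function network replaces the weighted-sum-plus-sigmoid unit with distances to centres passed through a Gaussian; the sketch below is illustrative only, with the centres, width, and output weights chosen arbitrarily.

```python
import numpy as np

def gaussian_rbf(x, centres, width):
    """Radial basis activations: one Gaussian bump per hidden-unit centre."""
    dists = np.linalg.norm(x - centres, axis=1)
    return np.exp(-(dists ** 2) / (2 * width ** 2))

centres = np.array([[0.0, 0.0], [1.0, 1.0]])     # hidden-unit centres
w_out = np.array([0.7, -0.3])                    # linear output weights

x = np.array([0.2, 0.1])
y = w_out @ gaussian_rbf(x, centres, width=0.5)  # output: linear combination of RBFs
print(y)
```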


See also

* Hopfield network
* Feed-forward
* Backpropagation
* Rprop


References


External links


* Feedforward neural networks tutorial
* Feedforward Neural Network: Example
* Feedforward Neural Networks: An Introduction