In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the linear perceptron in neural networks. However, only ''nonlinear'' activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities.
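To make this concrete, the following Python sketch (with hand-chosen, purely illustrative weights) shows a tiny two-unit ReLU network computing XOR, a function that no single linear unit can represent:

```python
# A minimal sketch: a two-hidden-unit ReLU network computing XOR.
# Weights are hand-chosen for illustration, not learned.
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)          # hidden unit 1
    h2 = relu(x1 + x2 - 1.0)    # hidden unit 2
    return h1 - 2.0 * h2        # output equals XOR(x1, x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0
```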


Classification of activation functions

The most common activation functions can be divided into three categories: ridge functions, radial functions and fold functions.

An activation function f is saturating if \lim_{|v| \to \infty} |\nabla f(v)| = 0. It is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions, because they are less prone to the vanishing gradient problem.
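As an illustration of this distinction, the following sketch (illustrative, standard-library only) compares the gradient of the saturating logistic sigmoid with that of the non-saturating ReLU as the input grows:

```python
# The sigmoid's gradient vanishes for large |v| (saturating), while the
# ReLU's gradient stays at 1 for v > 0 (non-saturating).
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1.0 - s)            # tends to 0 as |v| grows

def relu_grad(v):
    return 1.0 if v > 0 else 0.0    # constant 1 for positive inputs

for v in (1.0, 5.0, 20.0):
    print(f"v={v:5.1f}  sigmoid'={sigmoid_grad(v):.2e}  relu'={relu_grad(v):.0f}")
```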


Ridge activation functions

Ridge functions are multivariate functions acting on a linear combination of the input variables. Commonly used examples include the following (see the sketch at the end of this section):
* Linear activation: \phi(\mathbf v) = a + \mathbf v'\mathbf b,
* ReLU activation: \phi(\mathbf v) = \max(0, a + \mathbf v'\mathbf b),
* Heaviside activation: \phi(\mathbf v) = 1_{a + \mathbf v'\mathbf b > 0},
* Logistic activation: \phi(\mathbf v) = \left(1 + \exp(-a - \mathbf v'\mathbf b)\right)^{-1}.

In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell. In its simplest form, this function is binary: either the neuron is firing or it is not. The function looks like \phi(\mathbf v) = U(a + \mathbf v'\mathbf b), where U is the Heaviside step function.

A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form \phi(\mathbf v) = a + \mathbf v'\mathbf b. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.
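A minimal sketch of the ridge activations listed above, assuming NumPy and arbitrary illustrative parameter values:

```python
# Each ridge activation acts on the same linear combination a + v'b.
import numpy as np

def linear(v, a, b):
    return a + v @ b

def relu(v, a, b):
    return max(0.0, a + v @ b)

def heaviside(v, a, b):
    return 1.0 if (a + v @ b) > 0 else 0.0

def logistic(v, a, b):
    return 1.0 / (1.0 + np.exp(-(a + v @ b)))

v = np.array([0.5, -1.0, 2.0])   # input vector (illustrative)
b = np.array([1.0, 0.5, 0.25])   # weight vector (illustrative)
a = -0.1                         # bias (illustrative)
for f in (linear, relu, heaviside, logistic):
    print(f.__name__, f(v, a, b))
```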


Radial activation functions

A special class of activation functions known as radial basis functions (RBFs) are used in RBF networks, which are extremely efficient as universal function approximators. These activation functions can take many forms, but they are usually found as one of the following:
* Gaussian: \phi(\mathbf v) = \exp\left(-\frac{\|\mathbf v - \mathbf c\|^2}{2\sigma^2}\right),
* Multiquadratic: \phi(\mathbf v) = \sqrt{\|\mathbf v - \mathbf c\|^2 + a^2},
* Inverse multiquadratic: \phi(\mathbf v) = \left(\|\mathbf v - \mathbf c\|^2 + a^2\right)^{-1/2},
* Polyharmonic splines,

where \mathbf c is the vector representing the function ''center'', and a and \sigma are parameters affecting the spread of the radius.
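These radial activations can be sketched directly from their definitions; the center and spread values below are arbitrary illustrative choices:

```python
# Each radial activation depends only on the distance ||v - c|| from a
# center c, plus a spread parameter (sigma or a).
import numpy as np

def gaussian(v, c, sigma):
    return np.exp(-np.sum((v - c) ** 2) / (2.0 * sigma ** 2))

def multiquadratic(v, c, a):
    return np.sqrt(np.sum((v - c) ** 2) + a ** 2)

def inverse_multiquadratic(v, c, a):
    return 1.0 / np.sqrt(np.sum((v - c) ** 2) + a ** 2)

v = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])   # center of the radial function (illustrative)
print(gaussian(v, c, sigma=1.0))
print(multiquadratic(v, c, a=0.5))
print(inverse_multiquadratic(v, c, a=0.5))
```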


Folding activation functions

Folding activation functions are extensively used in the pooling layers of convolutional neural networks, and in the output layers of multiclass classification networks. These activations perform aggregation over the inputs, such as taking the mean, minimum or maximum. In multiclass classification the softmax activation is often used; see the sketch below.
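A minimal sketch of these folding activations, including a softmax that subtracts the maximum before exponentiating (a standard numerical-stability detail, not stated in the text above):

```python
# Folding activations aggregate a whole vector of inputs rather than
# mapping each input elementwise.
import numpy as np

def mean_pool(x):
    return np.mean(x)

def max_pool(x):
    return np.max(x)

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract max for numerical stability
    return z / np.sum(z)

x = np.array([1.0, 2.0, 3.0])
print(mean_pool(x))    # 2.0
print(max_pool(x))     # 3.0
print(softmax(x))      # sums to 1: a probability distribution
```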


Comparison of activation functions

There are numerous activation functions. Hinton et al.'s seminal 2012 paper on automatic speech recognition uses a logistic sigmoid activation function. The 2012 AlexNet computer vision architecture uses the ReLU activation function, as did the 2015 computer vision architecture ResNet. The 2018 language processing model BERT uses a smooth version of the ReLU, the GELU.

Aside from their empirical performance, activation functions also have different mathematical properties:

; Nonlinear: When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator. This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property: when multiple layers use the identity activation function, the entire network is equivalent to a single-layer model (see the sketch below).
; Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights; in the latter case, smaller learning rates are typically necessary.
; Continuously differentiable: This property is desirable for enabling gradient-based optimization methods. ReLU is not continuously differentiable and has some issues with gradient-based optimization, but optimization is still possible. The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it.

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
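The claim that identity-activation layers collapse into a single layer can be checked directly: the composition of two affine maps is another affine map. A sketch with arbitrary random weights:

```python
# Two layers with identity activations, W2 @ (W1 @ x + b1) + b2, equal
# one layer with weights W2 @ W1 and bias W2 @ b1 + b2.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2         # identity activation in between
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # the equivalent single layer

print(np.allclose(two_layer, one_layer))    # True
```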


Table of activation functions

The following table compares the properties of several activation functions that are functions of one ''fold'' x from the previous layer or layers:

Name | Equation \phi(x) | Derivative \phi'(x) | Range
Identity | x | 1 | (-\infty, \infty)
Binary step | 0 \text{ if } x < 0, \; 1 \text{ if } x \ge 0 | 0 \text{ for } x \neq 0 | \{0, 1\}
Logistic (sigmoid) | \sigma(x) = \left(1 + e^{-x}\right)^{-1} | \sigma(x)\left(1 - \sigma(x)\right) | (0, 1)
Hyperbolic tangent | \tanh(x) | 1 - \tanh^2(x) | (-1, 1)
ReLU | \max(0, x) | 0 \text{ if } x < 0, \; 1 \text{ if } x > 0 | [0, \infty)
Softplus | \ln\left(1 + e^x\right) | \left(1 + e^{-x}\right)^{-1} | (0, \infty)

The following table lists activation functions that are not functions of a single ''fold'' \mathbf x from the previous layer or layers:

Name | Equation | Derivative | Range
Softmax | \phi_i(\mathbf x) = \frac{e^{x_i}}{\sum_j e^{x_j}} | \frac{\partial \phi_i}{\partial x_j} = \phi_i\left(\delta_{ij} - \phi_j\right) | (0, 1)
Maxout | \phi(\mathbf x) = \max_i x_i | \frac{\partial \phi}{\partial x_j} = \begin{cases} 1 & j = \arg\max_i x_i \\ 0 & \text{otherwise} \end{cases} | (-\infty, \infty)

Here, \delta_{ij} is the Kronecker delta. For instance, j could be iterating through the number of kernels of the previous neural network layer while i iterates through the number of kernels of the current layer.
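The softmax derivative in the table can be verified numerically; the following sketch compares the analytic Jacobian \phi_i(\delta_{ij} - \phi_j) with a finite-difference estimate:

```python
# The Kronecker-delta form of the softmax Jacobian, written as
# diag(p) - outer(p, p), checked against central finite differences.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softmax_jacobian(x):
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)   # entry (i, j): p_i * (delta_ij - p_j)

x = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numeric = np.empty((3, 3))
for j in range(3):
    dx = np.zeros(3)
    dx[j] = eps
    numeric[:, j] = (softmax(x + dx) - softmax(x - dx)) / (2 * eps)

print(np.allclose(softmax_jacobian(x), numeric, atol=1e-8))  # True
```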


See also

* Logistic function
* Rectifier (neural networks)
* Stability (learning theory)
* Softmax function

