Activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the linear perceptron in neural networks. However, only ''nonlinear'' activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities.
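As a minimal sketch (assuming NumPy; the weight names W1, W2 are illustrative, not from the article), the following shows why the nonlinearity matters: stacking layers with an identity activation collapses to a single linear map, while inserting a ReLU between them does not.

<syntaxhighlight lang="python">
# Sketch: without a nonlinearity, stacked layers collapse to one linear map,
# so the nonlinear activation is what gives depth its expressive power.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

linear_stack = W2 @ (W1 @ x)                 # equals (W2 @ W1) @ x: still linear
collapsed    = (W2 @ W1) @ x
with_relu    = W2 @ np.maximum(0.0, W1 @ x)  # ReLU in between: no longer linear

print(np.allclose(linear_stack, collapsed))  # True
print(with_relu)
</syntaxhighlight>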


Classification of activation functions

The most common activation functions can be divided into three categories: ridge functions, radial functions and fold functions. An activation function f is saturating if \lim_{|v|\to\infty} |\nabla f(v)| = 0. It is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions because they do not suffer from the vanishing gradient problem.
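A hedged numerical sketch of this distinction (assuming NumPy; not part of the article): the gradient of the saturating logistic sigmoid shrinks toward zero as the input grows, while the ReLU gradient stays at 1 for positive inputs.

<syntaxhighlight lang="python">
# Sketch: saturating (sigmoid) vs non-saturating (ReLU) gradients.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1.0 - s)

v = np.array([1.0, 10.0, 100.0])
print(sigmoid_grad(v))            # shrinks toward 0 as |v| grows: saturating
print(np.where(v > 0, 1.0, 0.0))  # ReLU gradient stays 1 for v > 0
</syntaxhighlight>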


Ridge activation functions

Ridge functions are multivariate functions acting on a linear combination of the input variables. Often used examples include:
* Linear activation: \phi(\mathbf v) = a + \mathbf v'\mathbf b,
* ReLU activation: \phi(\mathbf v) = \max(0, a + \mathbf v'\mathbf b),
* Heaviside activation: \phi(\mathbf v) = 1_{a + \mathbf v'\mathbf b > 0},
* Logistic activation: \phi(\mathbf v) = (1 + \exp(-a - \mathbf v'\mathbf b))^{-1}.
In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell. In its simplest form, this function is binary, that is, either the neuron is firing or not. The function looks like \phi(\mathbf v) = U(a + \mathbf v'\mathbf b), where U is the Heaviside step function. A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form \phi(\mathbf v) = a + \mathbf v'\mathbf b. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.
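A minimal sketch of the ridge activations listed above, applied to the scalar pre-activation a + \mathbf v'\mathbf b (assuming NumPy; the parameter names a, b, v mirror the formulas and are illustrative):

<syntaxhighlight lang="python">
# Sketch of the ridge activations above; a, b, v are illustrative values.
import numpy as np

def pre_activation(v, a, b):
    return a + v @ b                      # a + v'b

def linear(z):
    return z                              # identity

def relu(z):
    return np.maximum(0.0, z)             # max(0, z)

def heaviside(z):
    return (z > 0).astype(float)          # indicator 1_{z > 0}

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))       # (1 + exp(-z))^{-1}

v = np.array([0.5, -1.2, 2.0])
a, b = 0.1, np.array([0.3, 0.8, -0.5])
z = pre_activation(v, a, b)
print(linear(z), relu(z), heaviside(z), logistic(z))
</syntaxhighlight>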


Radial activation functions

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks, which are extremely efficient as universal function approximators. These activation functions can take many forms, but they are usually found as one of the following functions:
* Gaussian: \phi(\mathbf v) = \exp\left(-\frac{\|\mathbf v - \mathbf c\|^2}{2\sigma^2}\right),
* Multiquadratics: \phi(\mathbf v) = \sqrt{\|\mathbf v - \mathbf c\|^2 + a^2},
* Inverse multiquadratics: \phi(\mathbf v) = \left(\|\mathbf v - \mathbf c\|^2 + a^2\right)^{-1/2},
* Polyharmonic splines,
where \mathbf c is the vector representing the function ''center'', and a and \sigma are parameters affecting the spread of the radius.
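A hedged sketch of the first three radial activations above as functions of the distance from the center \mathbf c (assuming NumPy; the values of c, sigma and a are illustrative):

<syntaxhighlight lang="python">
# Sketch of the radial activations above; c, sigma, a are illustrative values.
import numpy as np

def gaussian(v, c, sigma):
    return np.exp(-np.sum((v - c) ** 2) / (2.0 * sigma ** 2))

def multiquadratic(v, c, a):
    return np.sqrt(np.sum((v - c) ** 2) + a ** 2)

def inverse_multiquadratic(v, c, a):
    return 1.0 / multiquadratic(v, c, a)

v = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])        # function center
print(gaussian(v, c, sigma=1.0),
      multiquadratic(v, c, a=1.0),
      inverse_multiquadratic(v, c, a=1.0))
</syntaxhighlight>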


Folding activation functions

Folding activation functions are extensively used in the pooling layers in convolutional neural networks, and in output layers of multiclass classification networks. These activations perform aggregation over the inputs, such as taking the mean, minimum or maximum. In multiclass classification the softmax activation is often used.
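A minimal sketch of folding activations (assuming NumPy; not taken from the article): each one maps a vector of inputs to an aggregate value, and softmax maps it to a probability vector.

<syntaxhighlight lang="python">
# Sketch of folding activations: mean/min/max aggregation and softmax.
import numpy as np

def softmax(x):
    z = x - np.max(x)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1.0, 3.0, -2.0, 0.5])
print(np.mean(x), np.min(x), np.max(x))   # mean, minimum and maximum folds
print(softmax(x))                         # sums to 1; used for multiclass output
</syntaxhighlight>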


Comparison of activation functions

There are numerous activation functions. Hinton et al.'s seminal 2012 paper on automatic speech recognition uses a logistic sigmoid activation function. The seminal 2012 AlexNet computer vision architecture uses the ReLU activation function, as did the seminal 2015 computer vision architecture ResNet. The seminal 2018 language processing model BERT uses a smooth version of the ReLU, the GELU.

Aside from their empirical performance, activation functions also have different mathematical properties:
; Nonlinear: If the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
; Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
; Continuously differentiable: This property is desirable for enabling gradient-based optimization methods (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but it can still be used). The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
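As an illustrative sketch (assuming NumPy and the widely used tanh approximation of the GELU; not part of the article), the following compares ReLU with its smooth variant GELU and shows that the softplus output is strictly positive, which is why it is sometimes used to parameterize variances.

<syntaxhighlight lang="python">
# Sketch: ReLU, a smooth variant (GELU, tanh approximation) and softplus.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Common tanh approximation of GELU(x) = x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softplus(x):
    return np.log1p(np.exp(x))    # log(1 + e^x) > 0 for every x

x = np.linspace(-3, 3, 7)
print(relu(x))
print(gelu(x))                    # close to ReLU but smooth near 0
print(softplus(x))                # strictly positive everywhere
</syntaxhighlight>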


Table of activation functions

The following table compares the properties of several activation functions that are functions of one fold from the previous layer or layers:

The following table lists activation functions that are not functions of a single fold from the previous layer or layers:

: Here, \delta_{ij} is the Kronecker delta.
: For instance, j could be iterating through the number of kernels of the previous neural network layer while i iterates through the number of kernels of the current layer.
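One place the Kronecker delta appears is in entrywise derivatives of multivariate activations such as the softmax, where \partial\phi_i/\partial x_j = \phi_i(\delta_{ij} - \phi_j). The following is a hedged sketch (assuming NumPy) that checks this identity numerically against a finite difference; it is not a reproduction of the table.

<syntaxhighlight lang="python">
# Sketch: the Kronecker delta delta_ij in the softmax Jacobian,
# d softmax_i / d x_j = softmax_i * (delta_ij - softmax_j).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.2, -1.0, 3.0])
s = softmax(x)
jacobian = np.diag(s) - np.outer(s, s)     # s_i * (delta_ij - s_j)

# Finite-difference check of one entry (i = 0, j = 2)
eps = 1e-6
x_pert = x.copy()
x_pert[2] += eps
print(jacobian[0, 2], (softmax(x_pert)[0] - s[0]) / eps)  # approximately equal
</syntaxhighlight>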


See also

* Logistic function
* Rectifier (neural networks)
* Stability (learning theory)
* Softmax function

