Activation function

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the linear perceptron in neural networks. However, only ''nonlinear'' activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities.
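As a minimal sketch (assuming NumPy; the weight names W1, W2 are illustrative, not from the article), the following shows why the nonlinearity matters: stacking layers with an identity activation collapses to a single linear map, while inserting a ReLU between them does not.

<syntaxhighlight lang="python">
# Sketch: without a nonlinearity, stacked layers collapse to one linear map,
# so the nonlinear activation is what gives depth its expressive power.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x = rng.standard_normal(3)

linear_stack = W2 @ (W1 @ x)                 # equals (W2 @ W1) @ x: still linear
collapsed    = (W2 @ W1) @ x
with_relu    = W2 @ np.maximum(0.0, W1 @ x)  # ReLU in between: no longer linear

print(np.allclose(linear_stack, collapsed))  # True
print(with_relu)
</syntaxhighlight>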


Classification of activation functions

The most common activation functions can be divided into three categories: ridge functions, radial functions and fold functions. An activation function f is saturating if \lim_{|v|\to\infty} |\nabla f(v)| = 0. It is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions because they do not suffer from the vanishing gradient problem.
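A hedged numerical sketch of this distinction (assuming NumPy; not part of the article): the gradient of the saturating logistic sigmoid shrinks toward zero as the input grows, while the ReLU gradient stays at 1 for positive inputs.

<syntaxhighlight lang="python">
# Sketch: saturating (sigmoid) vs non-saturating (ReLU) gradients.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1.0 - s)

v = np.array([1.0, 10.0, 100.0])
print(sigmoid_grad(v))            # shrinks toward 0 as |v| grows: saturating
print(np.where(v > 0, 1.0, 0.0))  # ReLU gradient stays 1 for v > 0
</syntaxhighlight>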


Ridge activation functions

Ridge functions are multivariate functions acting on a linear combination of the input variables. Often used examples include:
* Linear activation: \phi(\mathbf v) = a + \mathbf v'\mathbf b,
* ReLU activation: \phi(\mathbf v) = \max(0, a + \mathbf v'\mathbf b),
* Heaviside activation: \phi(\mathbf v) = 1_{a + \mathbf v'\mathbf b > 0},
* Logistic activation: \phi(\mathbf v) = (1 + \exp(-a - \mathbf v'\mathbf b))^{-1}.
In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell. In its simplest form, this function is binary, that is, either the neuron is firing or not. The function looks like \phi(\mathbf v) = U(a + \mathbf v'\mathbf b), where U is the Heaviside step function. A line of positive slope may be used to reflect the increase in firing rate that occurs as input current increases. Such a function would be of the form \phi(\mathbf v) = a + \mathbf v'\mathbf b. Neurons also cannot fire faster than a certain rate, motivating sigmoid activation functions whose range is a finite interval.
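A minimal sketch of the ridge activations listed above, applied to the scalar pre-activation a + \mathbf v'\mathbf b (assuming NumPy; the parameter names a, b, v mirror the formulas and are illustrative):

<syntaxhighlight lang="python">
# Sketch of the ridge activations above; a, b, v are illustrative values.
import numpy as np

def pre_activation(v, a, b):
    return a + v @ b                      # a + v'b

def linear(z):
    return z                              # identity

def relu(z):
    return np.maximum(0.0, z)             # max(0, z)

def heaviside(z):
    return (z > 0).astype(float)          # indicator 1_{z > 0}

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))       # (1 + exp(-z))^{-1}

v = np.array([0.5, -1.2, 2.0])
a, b = 0.1, np.array([0.3, 0.8, -0.5])
z = pre_activation(v, a, b)
print(linear(z), relu(z), heaviside(z), logistic(z))
</syntaxhighlight>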


Radial activation functions

A special class of activation functions known as radial basis functions (RBFs) is used in RBF networks, which are extremely efficient as universal function approximators. These activation functions can take many forms, but they are usually found as one of the following functions:
* Gaussian: \phi(\mathbf v) = \exp\left(-\frac{\|\mathbf v - \mathbf c\|^2}{2\sigma^2}\right),
* Multiquadratics: \phi(\mathbf v) = \sqrt{\|\mathbf v - \mathbf c\|^2 + a^2},
* Inverse multiquadratics: \phi(\mathbf v) = \left(\|\mathbf v - \mathbf c\|^2 + a^2\right)^{-1/2},
* Polyharmonic splines,
where \mathbf c is the vector representing the function ''center'', and a and \sigma are parameters affecting the spread of the radius.
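A hedged sketch of the first three radial activations above as functions of the distance from the center \mathbf c (assuming NumPy; the values of c, sigma and a are illustrative):

<syntaxhighlight lang="python">
# Sketch of the radial activations above; c, sigma, a are illustrative values.
import numpy as np

def gaussian(v, c, sigma):
    return np.exp(-np.sum((v - c) ** 2) / (2.0 * sigma ** 2))

def multiquadratic(v, c, a):
    return np.sqrt(np.sum((v - c) ** 2) + a ** 2)

def inverse_multiquadratic(v, c, a):
    return 1.0 / multiquadratic(v, c, a)

v = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])        # function center
print(gaussian(v, c, sigma=1.0),
      multiquadratic(v, c, a=1.0),
      inverse_multiquadratic(v, c, a=1.0))
</syntaxhighlight>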


Folding activation functions

Folding activation functions are extensively used in the pooling layers in convolutional neural networks, and in output layers of multiclass classification networks. These activations perform aggregation over the inputs, such as taking the mean, minimum or maximum. In multiclass classification the softmax activation is often used.
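A minimal sketch of folding activations (assuming NumPy; not taken from the article): each one maps a vector of inputs to an aggregate value, and softmax maps it to a probability vector.

<syntaxhighlight lang="python">
# Sketch of folding activations: mean/min/max aggregation and softmax.
import numpy as np

def softmax(x):
    z = x - np.max(x)             # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1.0, 3.0, -2.0, 0.5])
print(np.mean(x), np.min(x), np.max(x))   # mean, minimum and maximum folds
print(softmax(x))                         # sums to 1; used for multiclass output
</syntaxhighlight>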


Comparison of activation functions

There are numerous activation functions. Hinton et al.'s seminal 2012 paper on automatic speech recognition uses a logistic sigmoid activation function. The seminal 2012 AlexNet computer vision architecture uses the ReLU activation function, as did the seminal 2015 computer vision architecture ResNet. The seminal 2018 language processing model BERT uses a smooth version of the ReLU, the GELU.

Aside from their empirical performance, activation functions also have different mathematical properties:
; Nonlinear: If the activation function is non-linear, then a two-layer neural network can be proven to be a universal function approximator. This is known as the Universal Approximation Theorem. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
; Range: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
; Continuously differentiable: This property is desirable for enabling gradient-based optimization methods (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but it can still be used). The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.

These properties do not decisively influence performance, nor are they the only mathematical properties that may be useful. For instance, the strictly positive range of the softplus makes it suitable for predicting variances in variational autoencoders.
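As an illustrative sketch (assuming NumPy and the widely used tanh approximation of the GELU; not part of the article), the following compares ReLU with its smooth variant GELU and shows that the softplus output is strictly positive, which is why it is sometimes used to parameterize variances.

<syntaxhighlight lang="python">
# Sketch: ReLU, a smooth variant (GELU, tanh approximation) and softplus.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Common tanh approximation of GELU(x) = x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softplus(x):
    return np.log1p(np.exp(x))    # log(1 + e^x) > 0 for every x

x = np.linspace(-3, 3, 7)
print(relu(x))
print(gelu(x))                    # close to ReLU but smooth near 0
print(softplus(x))                # strictly positive everywhere
</syntaxhighlight>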


Table of activation functions

The following table compares the properties of several activation functions that are functions of one fold from the previous layer or layers:

The following table lists activation functions that are not functions of a single fold from the previous layer or layers:

: Here, \delta_{ij} is the Kronecker delta.
: For instance, j could be iterating through the number of kernels of the previous neural network layer while i iterates through the number of kernels of the current layer.
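One place the Kronecker delta appears is in entrywise derivatives of multivariate activations such as the softmax, where \partial\phi_i/\partial x_j = \phi_i(\delta_{ij} - \phi_j). The following is a hedged sketch (assuming NumPy) that checks this identity numerically against a finite difference; it is not a reproduction of the table.

<syntaxhighlight lang="python">
# Sketch: the Kronecker delta delta_ij in the softmax Jacobian,
# d softmax_i / d x_j = softmax_i * (delta_ij - softmax_j).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.2, -1.0, 3.0])
s = softmax(x)
jacobian = np.diag(s) - np.outer(s, s)     # s_i * (delta_ij - s_j)

# Finite-difference check of one entry (i = 0, j = 2)
eps = 1e-6
x_pert = x.copy()
x_pert[2] += eps
print(jacobian[0, 2], (softmax(x_pert)[0] - s[0]) / eps)  # approximately equal
</syntaxhighlight>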


See also

* Logistic function
* Rectifier (neural networks)
* Stability (learning theory)
* Softmax function

