In the mathematical theory of artificial neural networks, universal approximation theorems are theorems of the following form: Given a family of neural networks, for each function ''f'' from a certain function space, there exists a sequence of neural networks ''φ''₁, ''φ''₂, … from the family, such that ''φ''ₙ → ''f'' according to some criterion. That is, the family of neural networks is dense in the function space.
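In symbols, the density statement reads as follows (a minimal formalization with generic placeholders; ''d'' denotes whatever criterion of closeness the particular theorem uses):

```latex
% \mathcal{F}: the set of functions computed by networks in the family,
% (S, d): the function space together with its criterion of closeness.
\forall f \in S,\ \forall \varepsilon > 0,\ \exists \varphi \in \mathcal{F}
\quad\text{such that}\quad d(f, \varphi) < \varepsilon .
```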
The most popular version states that feedforward networks with non-polynomial activation functions are dense in the space of continuous functions between two Euclidean spaces, with respect to the compact convergence topology.
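For the classical arbitrary-width, one-hidden-layer form of this statement, the density claim can be written concretely as below; this is a standard textbook-style formulation with notation chosen here for illustration rather than taken from a particular source:

```latex
% K \subset \mathbb{R}^n compact, f : K \to \mathbb{R} continuous,
% \sigma a fixed non-polynomial activation function.
\forall \varepsilon > 0,\ \exists N \in \mathbb{N},\ a_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n :\quad
\sup_{x \in K}\, \Bigl|\, f(x) - \sum_{i=1}^{N} a_i\, \sigma\!\bigl(w_i^{\mathsf{T}} x + b_i\bigr) \Bigr| < \varepsilon .
```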
Universal approximation theorems are existence theorems: They simply state that there ''exists'' such a sequence, and do not provide any way to actually find it. They also do not guarantee that any particular method, such as backpropagation, will actually find such a sequence. Any method for searching the space of neural networks, including backpropagation, might find a converging sequence or not (for example, backpropagation might get stuck in a local optimum).
Universal approximation theorems are limit theorems: They simply state that for any target function ''f'' and any criterion of closeness ''ε'' > 0, if there are ''enough'' neurons in a neural network, then there exists a neural network with that many neurons that does approximate ''f'' to within ''ε''. There is no guarantee that any fixed finite size, say 10,000 neurons, is enough.
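The point can be illustrated by an explicit construction rather than by training: the following minimal, hand-built sketch (not taken from any of the cited works) builds a one-hidden-layer network of ''n'' step-activation units that reproduces a piecewise-constant interpolant of sin on [0, 2π]. Increasing ''n'' drives the worst-case error toward zero, but no single finite ''n'' works for every target accuracy.

```python
import numpy as np

def f(x):
    return np.sin(x)

def step_network(x, n, lo=0.0, hi=2 * np.pi):
    """One-hidden-layer network with n Heaviside-step units, built by hand.

    Unit i switches on at knot t_i and contributes the increment
    f(t_i) - f(t_{i-1}); the output is the piecewise-constant
    interpolant of f on a uniform grid of n knots.
    """
    knots = np.linspace(lo, hi, n + 1)
    out = np.full_like(x, f(knots[0]))
    for i in range(1, n + 1):
        out += (f(knots[i]) - f(knots[i - 1])) * (x >= knots[i]).astype(float)
    return out

xs = np.linspace(0.0, 2 * np.pi, 10_001)
for n in (10, 100, 1000):
    err = np.max(np.abs(f(xs) - step_network(xs, n)))
    print(f"n = {n:5d}  sup-error = {err:.4f}")  # error shrinks as n grows
```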
Setup
Artificial neural networks
are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued
vectors to real-valued
vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.
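As a concrete illustration of "combinations of multiple simple mathematical functions", the sketch below (generic notation, not tied to any cited construction) evaluates a fully connected feedforward network as an alternating composition of affine maps and a fixed activation function; the layer widths and the choice of activation determine which function space the architecture can approximate.

```python
import numpy as np

def mlp(x, weights, biases, activation=np.tanh):
    """Evaluate a feedforward network: alternate affine maps and activations.

    weights[i] has shape (width_{i+1}, width_i); biases[i] has shape (width_{i+1},).
    The activation is applied after every layer except the last (linear output).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)          # simple functions composed layer by layer
    return weights[-1] @ h + biases[-1]    # affine read-out

# A one-hidden-layer, width-3 network from R^2 to R^1 with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
print(mlp(np.array([0.5, -1.0]), weights, biases))
```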
Most universal approximation theorems are in one of two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("''arbitrary width''" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("''arbitrary depth''" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("''bounded depth and bounded width''" case).
History
Arbitrary width
The first examples were the ''arbitrary width'' case.
George Cybenko in 1989 proved it for sigmoid activation functions.
Kurt Hornik, Maxwell Stinchcombe, and
Halbert White showed in 1989 that multilayer
feed-forward networks with as few as one hidden layer are universal approximators.
Hornik also showed in 1991
that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993
and later Allan Pinkus in 1999
showed that the universal approximation property is equivalent to having a nonpolynomial activation function.
Arbitrary depth
The ''arbitrary depth'' case was also studied by a number of authors such as Gustaf Gripenberg in 2003,
Dmitry Yarotsky, Zhou Lu ''et al'' in 2017,
Boris Hanin and Mark Sellke in 2018
who focused on neural networks with ReLU activation function. In 2020, Patrick Kidger and Terry Lyons
extended those results to neural networks with ''general activation functions'' such as tanh or GeLU.
One special case of arbitrary depth is that in which each composition component comes from a finite set of mappings. In 2024, Cai
constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by composing a sequence from the vocabulary. This is similar to the concept of compositionality in linguistics: the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.
Bounded depth and bounded width
The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.
They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with a bounded number of units in the hidden layers are universal approximators.
In 2018, Guliyev and Ismailov
constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers. In 2018, they also constructed single hidden layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions.
In 2022, Shen ''et al.'' obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.
Quantitative bounds
The question of minimal possible width for universality was first studied in 2021: Park et al. obtained the minimum width required for the universal approximation of ''Lp'' functions using feed-forward neural networks with ReLU as activation functions.
Similar results that can be directly applied to
residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using
control-theoretic arguments. In 2023, Cai obtained the optimal minimum width bound for the universal approximation.
For the arbitrary depth case, Leonie Papon and Anastasis Kratsios derived explicit depth estimates depending on the regularity of the target function and of the activation function.
Kolmogorov network
The
Kolmogorov–Arnold representation theorem is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem.
Robert Hecht-Nielsen showed that a three-layer neural network can approximate any continuous multivariate function. This was extended to the discontinuous case by Vugar Ismailov. In 2024, Ziming Liu and co-authors showed a practical application.
Reservoir computing and quantum reservoir computing
In reservoir computing, a sparse recurrent neural network with fixed weights, equipped with fading memory and the echo state property, is followed by a trainable output layer. Its universality has been demonstrated separately for networks of rate neurons and of spiking neurons, respectively. In 2024, the framework was generalized and extended to quantum reservoir computing, where the reservoir is based on qubits defined over Hilbert spaces.
Variants
There are also universal approximation results for discontinuous activation functions, noncompact domains, certifiable networks, random neural networks, and alternative network architectures and topologies.
The universal approximation property of width-bounded networks has been studied as a ''dual'' of classical universal approximation results on depth-bounded networks. For input dimension dx and output dimension dy, the minimum width required for the universal approximation of ''Lp'' functions is exactly max{dx + 1, dy} (for a ReLU network). More generally this also holds if ''both'' ReLU and a threshold activation function are used.
Universal function approximation on graphs (or rather on
graph isomorphism classes) by popular
graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.
In 2020, a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying method whose runtime scales with the numbers of nodes and edges of the graph and which performed at state of the art on a collection of benchmarks.
There are also a variety of results between
non-Euclidean spaces
and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the
convolutional neural network
(CNN) architecture,
radial basis functions, or neural networks with specific properties.
Arbitrary-width case
A spate of papers in the 1980s–1990s, from George Cybenko, Kurt Hornik, and others, established several universal approximation theorems for arbitrary width and bounded depth; several reviews survey these results. The most often quoted of these states that a one-hidden-layer network with a continuous non-polynomial activation function can approximate any continuous function on a compact set to arbitrary accuracy.
Also, certain non-continuous activation functions can be used to approximate a sigmoid function, which then allows the above theorem to apply to those functions. For example, the
step function works. In particular, this shows that a perceptron network with a single infinitely wide hidden layer can approximate arbitrary functions.
Such a function can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
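A short calculation makes the last step concrete (an illustrative remark, not part of the original argument): on any bounded range the identity map can be carried exactly by a ReLU layer, and approximately by a sigmoid-type layer with small input scaling, so additional layers can simply pass the first layer's output through.

```latex
% ReLU: exact on [-M, \infty)
x = \operatorname{ReLU}(x + M) - M \qquad \text{for } x \ge -M;
% sigmoid-type \sigma, differentiable at 0 with \sigma'(0) \neq 0:
\frac{\sigma(\varepsilon x) - \sigma(0)}{\varepsilon\,\sigma'(0)} \longrightarrow x
\quad \text{as } \varepsilon \to 0, \text{ uniformly on bounded sets.}
```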
The above proof has not specified how one might use a ramp function to approximate arbitrary continuous functions. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the Dirac delta function, then use those to approximate arbitrary continuous functions. The original proofs, such as the one by Cybenko, use methods from functional analysis, including the Hahn–Banach theorem and the Riesz–Markov–Kakutani representation theorem. Cybenko first published the theorem in a technical report in 1988, then as a paper in 1989.
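For the first step of this sketch, a one-dimensional flat bump can be written explicitly as a one-hidden-layer combination of four ramp units (an illustrative construction with generic parameters a < b and rise width δ > 0, not a formula from the cited proofs); combining such bumps coordinate-wise then yields the localized multivariate bumps referred to above.

```latex
% ramp(t) = max(0, t); B is 0 outside [a-\delta, b+\delta], 1 on [a, b],
% and linear on the two transition intervals.
B(x) \;=\; \frac{1}{\delta}\Bigl[\operatorname{ramp}\!\bigl(x-(a-\delta)\bigr) - \operatorname{ramp}(x-a)\Bigr]
\;-\; \frac{1}{\delta}\Bigl[\operatorname{ramp}(x-b) - \operatorname{ramp}\!\bigl(x-(b+\delta)\bigr)\Bigr].
```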
Notice also that the neural network is only required to approximate within a compact set. The proof does not describe how the function would be extrapolated outside of that region.
The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization:
Arbitrary-depth case
The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017.
They showed that networks of width ''n'' + 4 with ReLU activation functions can approximate any Lebesgue-integrable function on ''n''-dimensional input space with respect to ''L1'' distance if network depth is allowed to grow. It was also shown that if the width was less than or equal to ''n'', this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper
it was shown that ReLU networks with width ''n'' + 1 were sufficient to approximate any continuous function of ''n''-dimensional input variables. The following refinement specifies the optimal minimum width for which such an approximation is possible.
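In symbols, the width-(''n'' + 4) result above says roughly the following (an informal rendering with generic notation, not the authors' exact statement):

```latex
% f \in L^1(\mathbb{R}^n), \varepsilon > 0; F_\varepsilon denotes a fully connected
% ReLU network of width at most n + 4 and sufficiently large (unbounded) depth.
\forall f \in L^1(\mathbb{R}^n),\ \forall \varepsilon > 0,\ \exists F_\varepsilon :\qquad
\int_{\mathbb{R}^n} \bigl|\, f(x) - F_\varepsilon(x) \,\bigr|\, \mathrm{d}x \;<\; \varepsilon .
```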
Together, these central results yield the following universal approximation theorem for networks with bounded width (earlier work gave the first result of this kind).
Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.
Bounded depth and bounded width case
The first result on the approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus.
Their remarkable result revealed that such networks can be universal approximators, and that two hidden layers are enough to achieve this property.
This is an existence result. It says that activation functions providing the universal approximation property for bounded depth, bounded width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly. The algorithm and the corresponding computer code are given in the cited work.
The theoretical result can be formulated as follows.
Here "''σ'' is ''λ''-strictly increasing on some set ''X''" means that there exists a strictly increasing function ''u'': ''X'' → ℝ such that |''σ''(''x'') − ''u''(''x'')| ≤ ''λ'' for all ''x'' ∈ ''X''. Clearly, a ''λ''-increasing function behaves like a usual increasing function as ''λ'' gets small.
In the "''depth-width''" terminology, the above theorem says that for certain activation functions depth-
width-
networks are universal approximators for univariate functions and depth-
width-
networks are universal approximators for
-variable functions (
).
See also
* Kolmogorov–Arnold representation theorem
* Representer theorem
* No free lunch theorem
* Stone–Weierstrass theorem
* Fourier series
References