Universal Approximator

In the mathematical theory of artificial neural networks, universal approximation theorems are results that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, and the approximation is with respect to the compact convergence topology. However, there are also a variety of results between non-Euclidean spaces and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture, radial basis functions, or neural networks with specific properties.

Most universal approximation theorems can be parsed into two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons (the ''arbitrary width'' case), and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons (the ''arbitrary depth'' case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer (the ''bounded depth and bounded width'' case).

Universal approximation theorems imply that neural networks can ''represent'' a wide variety of interesting functions when given appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible.
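The gap between "such weights exist" and "here is how to find them" can be made concrete numerically. The following is a minimal sketch, not drawn from any of the theorems below: it uses a tanh activation, random hidden weights, and a least-squares readout (all illustrative assumptions), and shows the sup-error on a grid typically shrinking as the width grows.

```python
# Minimal sketch: a one-hidden-layer tanh network g(x) = C @ tanh(A x + b)
# fitted to f(x) = sin(2 pi x) on [0, 1]. Hidden weights are random and only
# the readout C is solved by least squares -- an illustrative construction,
# not the one whose existence the theorems assert.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 500)
f = np.sin(2 * np.pi * x)

for k in (2, 8, 32, 128):                        # hidden-layer width
    A = rng.normal(scale=10.0, size=(k, 1))      # random hidden weights
    b = rng.normal(scale=10.0, size=(k, 1))      # random hidden biases
    H = np.tanh(A @ x[None, :] + b)              # hidden activations, (k, 500)
    C, *_ = np.linalg.lstsq(H.T, f, rcond=None)  # least-squares readout
    print(f"width {k:4d}: sup-error on grid ~ {np.max(np.abs(f - C @ H)):.4f}")
```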


History

One of the first versions of the ''arbitrary width'' case was proven by George Cybenko in 1989 for sigmoid activation functions. Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed in 1989 that multilayer feed-forward networks with as few as one hidden layer are universal approximators. Hornik also showed in 1991 that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al.'' in 1993 and later Allan Pinkus in 1999 showed that the universal approximation property is equivalent to having a nonpolynomial activation function.

The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003, Dmitry Yarotsky, Zhou Lu ''et al.'' in 2017, Boris Hanin and Mark Sellke in 2018, and Patrick Kidger and Terry Lyons in 2020. The result on the minimal width per layer was refined in 2020 for residual networks.

The ''bounded depth and bounded width'' case was first studied by Maiorov and Pinkus in 1999. They showed that there exists an analytic sigmoidal activation function such that two-hidden-layer neural networks with a bounded number of units in the hidden layers are universal approximators. Using algorithmic and computer programming techniques, Guliyev and Ismailov constructed a smooth sigmoidal activation function providing the universal approximation property for two-hidden-layer feedforward networks with fewer units in the hidden layers. It was constructively proved in a 2018 paper that single-hidden-layer networks with bounded width are still universal approximators for univariate functions, but this property no longer holds for multivariable functions.

Several extensions of the theorem exist, such as to discontinuous activation functions, noncompact domains, certifiable networks, random neural networks, and alternative network architectures and topologies.


Arbitrary-width case

The classical form of the universal approximation theorem for arbitrary width and bounded depth is as follows. It extends the classical results of George Cybenko and Kurt Hornik.

Universal approximation theorem: Let C(X, Y) denote the set of continuous functions from X to Y. Let \sigma \in C(\mathbb{R}, \mathbb{R}). Note that (\sigma \circ x)_i = \sigma(x_i), so \sigma \circ x denotes \sigma applied to each component of x. Then \sigma is not polynomial if and only if for every n \in \mathbb{N}, m \in \mathbb{N}, compact K \subseteq \mathbb{R}^n, f \in C(K, \mathbb{R}^m), and \varepsilon > 0, there exist k \in \mathbb{N}, A \in \mathbb{R}^{k \times n}, b \in \mathbb{R}^k, and C \in \mathbb{R}^{m \times k} such that

\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon

where g(x) = C \cdot ( \sigma \circ (A \cdot x + b) ).
Such an f can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
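In code, the function g of the theorem is a single matrix-vector product, a componentwise activation, and a second matrix. A minimal sketch, with \sigma = tanh standing in for an arbitrary nonpolynomial activation, and with arbitrary illustrative shapes and weights:

```python
# Sketch of g(x) = C . (sigma o (A . x + b)) from the theorem; sigma = tanh
# stands in for any nonpolynomial activation, and n, m, k and the weights
# are arbitrary illustrative choices (the theorem asserts suitable ones exist).
import numpy as np

n, m, k = 3, 2, 16                       # input dim, output dim, hidden width
rng = np.random.default_rng(1)
A = rng.normal(size=(k, n))              # A in R^{k x n}
b = rng.normal(size=k)                   # b in R^k
C = rng.normal(size=(m, k))              # C in R^{m x k}

def g(x):
    """Evaluate g(x) = C @ sigma(A @ x + b), sigma applied componentwise."""
    return C @ np.tanh(A @ x + b)

print(g(np.ones(n)))                     # a point in R^m
```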


Arbitrary-depth case

The 'dual' versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary-depth case by Zhou Lu et al. in 2017. They showed that networks of width ''n'' + 4 with ReLU activation functions can approximate any Lebesgue-integrable function on ''n''-dimensional input space with respect to the L^1 distance if network depth is allowed to grow. It was also shown that if the width is less than or equal to ''n'', this general expressive power to approximate any Lebesgue-integrable function is lost. In the same paper it was shown that ReLU networks of width ''n'' + 1 are sufficient to approximate any continuous function of ''n''-dimensional input variables. The following refinement specifies the optimal minimum width for which such an approximation is possible.
Universal approximation theorem ''(L^p distance, ReLU activation, arbitrary depth, minimal width).'' For any Bochner–Lebesgue p-integrable function f : \mathbb{R}^n \rightarrow \mathbb{R}^m and any \epsilon > 0, there exists a fully connected ReLU network F of width exactly d_m = \max\{n + 1, m\} satisfying

\int_{\mathbb{R}^n} \left\| f(x) - F(x) \right\|^p \, \mathrm{d}x < \epsilon.

Moreover, there exists a function f \in L^p(\mathbb{R}^n, \mathbb{R}^m) and some \epsilon > 0 for which there is no fully connected ReLU network of width less than d_m = \max\{n + 1, m\} satisfying the above approximation bound.

''Quantitative refinement:'' In the case where \mathcal{X} = [0,1]^d, D = 1, and \sigma is the ReLU activation function, the exact depth and width for a ReLU network to achieve \varepsilon error are also known. If, moreover, the target function f is smooth, then the required number of layers and their width can be exponentially smaller. Even if f is not smooth, the curse of dimensionality can be broken if f admits additional "compositional structure".
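The networks in this regime are narrow but deep. A minimal sketch of a fully connected ReLU network of fixed width d_m = max(n + 1, m) and arbitrary depth; the random weights are placeholders, since the theorem asserts only that some choice of weights achieves the bound:

```python
# Sketch of a fully connected ReLU network of fixed width d_m = max(n+1, m)
# and arbitrary depth L. The random weights are placeholders -- the theorem
# asserts that *some* weights achieve the L^p bound, not these.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def narrow_deep_net(x, n=3, m=2, L=10, seed=0):
    rng = np.random.default_rng(seed)
    w = max(n + 1, m)                    # minimal width d_m from the theorem
    h = relu(rng.normal(size=(w, n)) @ x + rng.normal(size=w))
    for _ in range(L - 1):               # hidden layers, all of width w
        h = relu(rng.normal(size=(w, w)) @ h + rng.normal(size=w))
    return rng.normal(size=(m, w)) @ h   # linear output layer into R^m

print(narrow_deep_net(np.ones(3)))
```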
Together, these results yield the following universal approximation theorem for networks with bounded width.
Universal approximation theorem (uniform non-affine activation, arbitrary depth, constrained width). Let \mathcal{X} be a compact subset of \mathbb{R}^d. Let \sigma : \mathbb{R} \to \mathbb{R} be any non-affine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let \mathcal{N}_{d,D}^{\sigma} denote the space of feed-forward neural networks with d input neurons, D output neurons, and an arbitrary number of hidden layers each with d + D + 2 neurons, such that every hidden neuron has activation function \sigma and every output neuron has the identity as its activation function, with input layer \phi and output layer \rho. Then given any \varepsilon > 0 and any f \in C(\mathcal{X}, \mathbb{R}^D), there exists \hat{f} \in \mathcal{N}_{d,D}^{\sigma} such that

\sup_{x \in \mathcal{X}} \left\| \hat{f}(x) - f(x) \right\| < \varepsilon.

In other words, \mathcal{N}_{d,D}^{\sigma} is dense in C(\mathcal{X}; \mathbb{R}^D) with respect to the topology of uniform convergence.

''Quantitative refinement:'' The number of layers and the width of each layer required to approximate f to \varepsilon precision are known; moreover, the result holds true when \mathcal{X} and \mathbb{R}^D are replaced with any non-positively curved Riemannian manifold.
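The architecture class \mathcal{N}_{d,D}^{\sigma} is easy to instantiate. A minimal sketch with \sigma = tanh, which satisfies the hypothesis (non-affine, continuously differentiable, nonzero derivative), hidden width exactly d + D + 2, and the identity on the output layer; the weights are again placeholders:

```python
# Sketch of a member of N_{d,D}^{sigma}: hidden layers of width d + D + 2 with
# activation sigma (tanh meets the hypothesis), identity activation on the
# output layer. Random weights are placeholders; the theorem asserts suitable
# weights exist for any target f and epsilon.
import numpy as np

def bounded_width_net(x, d=4, D=2, depth=6, seed=0):
    rng = np.random.default_rng(seed)
    w = d + D + 2                                 # the constrained width
    h = np.tanh(rng.normal(size=(w, d)) @ x + rng.normal(size=w))
    for _ in range(depth - 1):
        h = np.tanh(rng.normal(size=(w, w)) @ h + rng.normal(size=w))
    return rng.normal(size=(D, w)) @ h            # identity output activation

print(bounded_width_net(np.ones(4)))
```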
Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.


Bounded depth and bounded width case

The first result on the approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus. They showed that such networks can be universal approximators, and that two hidden layers are enough to achieve this property.
Universal approximation theorem: There exists an activation function \sigma which is analytic, strictly increasing and sigmoidal and has the following property: for any f \in C([0,1]^d) and \varepsilon > 0 there exist constants d_i, c_{ij}, \theta_{ij}, \gamma_i, and vectors \mathbf{w}^{ij} \in \mathbb{R}^d for which

\left\vert f(\mathbf{x}) - \sum_{i=1}^{6d+3} d_i \, \sigma\!\left( \sum_{j=1}^{3d} c_{ij} \, \sigma(\mathbf{w}^{ij} \cdot \mathbf{x} - \theta_{ij}) - \gamma_i \right) \right\vert < \varepsilon

for all \mathbf{x} = (x_1, \ldots, x_d) \in [0,1]^d.
This is an existence result: it says that activation functions providing the universal approximation property for bounded-depth, bounded-width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly; an algorithm and the corresponding computer code have been published. The theoretical result can be formulated as follows.
Universal approximation theorem: Let [a, b] be a finite segment of the real line, s = b - a, and \lambda be any positive number. Then one can algorithmically construct a computable sigmoidal activation function \sigma \colon \mathbb{R} \to \mathbb{R}, which is infinitely differentiable, strictly increasing on (-\infty, s), \lambda-strictly increasing on [s, +\infty), and satisfies the following properties:

1) For any f \in C[a, b] and \varepsilon > 0 there exist numbers c_1, c_2, \theta_1, and \theta_2 such that for all x \in [a, b],

\left| f(x) - c_1 \sigma(x - \theta_1) - c_2 \sigma(x - \theta_2) \right| < \varepsilon.

2) For any continuous function F on the d-dimensional box [a, b]^d and \varepsilon > 0, there exist constants e_p, c_{pq}, \theta_{pq}, and \zeta_p such that the inequality

\left| F(\mathbf{x}) - \sum_{p=1}^{2d+2} e_p \, \sigma\!\left( \sum_{q=1}^{d} c_{pq} \, \sigma(\mathbf{w}^{q} \cdot \mathbf{x} - \theta_{pq}) - \zeta_p \right) \right| < \varepsilon

holds for all \mathbf{x} = (x_1, \ldots, x_d) \in [a, b]^d. Here the weights \mathbf{w}^{q}, q = 1, \ldots, d, are fixed as follows:

\mathbf{w}^{1} = (1, 0, \ldots, 0), \quad \mathbf{w}^{2} = (0, 1, \ldots, 0), \quad \ldots, \quad \mathbf{w}^{d} = (0, 0, \ldots, 1).

In addition, all the coefficients e_p, except one, are equal.
Here "\sigma \colon \mathbb{R} \to \mathbb{R} is \lambda-strictly increasing on some set X" means that there exists a strictly increasing function u \colon X \to \mathbb{R} such that |\sigma(x) - u(x)| \le \lambda for all x \in X. Clearly, a \lambda-increasing function behaves like a usual increasing function as \lambda gets small. In the "''depth-width''" terminology, the above theorem says that for certain activation functions, depth-2 width-2 networks are universal approximators for univariate functions, and depth-3 width-(2d+2) networks are universal approximators for d-variable functions (d > 1).
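The fixed-weight structure of part 2) is straightforward to write down. A minimal sketch of the depth-3, width-(2d + 2) form, with a generic logistic sigmoid standing in for the specially constructed \sigma (which this code does not reproduce) and placeholder coefficients:

```python
# Sketch of the depth-3, width-(2d+2) form of part 2). Since the inner weights
# w^q are the fixed standard basis vectors, each inner sigma acts on a single
# coordinate x_q. A generic logistic sigmoid stands in for the specially
# constructed activation of the theorem; all coefficients are placeholders.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))      # placeholder, NOT the constructed sigma

def two_hidden_layer_net(x, e, c, theta, zeta):
    """x: (d,); e, zeta: (2d+2,); c, theta: (2d+2, d)."""
    inner = sigma(x[None, :] - theta)    # sigma(w^q . x - theta_pq) = sigma(x_q - theta_pq)
    return np.sum(e * sigma(np.sum(c * inner, axis=1) - zeta))

d = 3
p = 2 * d + 2
rng = np.random.default_rng(2)
print(two_hidden_layer_net(rng.random(d), rng.normal(size=p),
                           rng.normal(size=(p, d)), rng.normal(size=(p, d)),
                           rng.normal(size=p)))
```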


Graph input

Achieving useful universal function approximation on graphs (or rather on graph isomorphism classes) has been a longstanding problem. The popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test. In 2020, a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying O(#edges × #nodes)-runtime method that performed at state of the art on a collection of benchmarks.
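For intuition, the injectivity requirement concerns how a node's multiset of neighbor features is aggregated; summation is the standard injective choice (as in GIN-style networks, which match the Weisfeiler–Leman test's discriminative power). A minimal sketch of one round of sum-aggregation message passing; the single-layer update and the features are illustrative placeholders, not the method from the paper:

```python
# One round of sum-aggregation message passing: each node combines its own
# feature with the SUM over neighbor features (an injective multiset
# aggregation, as in GIN), then applies a shared update network (here a
# single ReLU layer as an illustrative placeholder).
import numpy as np

def message_passing_round(adj, feats, W, b):
    """adj: (n, n) 0/1 adjacency matrix; feats: (n, f) node features."""
    aggregated = feats + adj @ feats          # own feature + sum of neighbors
    return np.maximum(aggregated @ W + b, 0.0)

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)      # a 3-node star graph
feats = np.eye(3)                             # one-hot initial node features
rng = np.random.default_rng(3)
print(message_passing_round(adj, feats, rng.normal(size=(3, 4)),
                            rng.normal(size=4)))
```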


See also

* Kolmogorov–Arnold representation theorem
* Representer theorem
* No free lunch theorem
* Stone–Weierstrass theorem
* Fourier series

