Universal approximation theorem

Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.

In the mathematical theory of artificial neural networks, universal approximation theorems are results [1] [2] that characterize which classes of functions neural networks can, in principle, approximate. Specifically, given an algorithm that generates networks within a class of functions, the theorems establish the density of the generated functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, with the approximation measured in the compact convergence topology. It must be stressed that while some functions can be approximated arbitrarily well on a given region, the proofs do not apply outside that region; that is, the approximating networks do not extrapolate beyond it. This holds for all non-periodic activation functions, i.e. those used in practice and assumed in most proofs. In recent years, neocortical pyramidal neurons with an oscillating activation function that can individually learn the XOR function have been discovered in the human brain, and oscillating activation functions have been explored and shown to outperform popular activation functions on a variety of benchmarks. [3]

However, there are also a variety of results between non-Euclidean spaces [4] and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture, [5] [6] radial basis functions, [7] or neural networks with specific properties. [8] [9] Most universal approximation theorems can be parsed into two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("arbitrary width" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer ("bounded depth and bounded width" case).

Universal approximation theorems imply that neural networks can represent a wide variety of interesting functions with appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible. To find suitable weights, neural networks are trained, and the training may or may not converge to correct weights (e.g. it can get stuck in a local optimum). If the network is too small for the dimensionality of the input data, the universal approximation theorems do not apply and the network will not be able to learn the target function. Depth matters as well: the classical result that a single hidden layer suffices assumes arbitrary width, whereas with bounded width a single hidden layer is universal only for univariate functions, so in general such a network is too shallow. The width of a network is likewise an important hyperparameter. The choice of activation function also matters: some results and proofs assume, for example, ReLU or sigmoid activations, while others, such as a linear activation (or any polynomial), are known not to work.

Neural networks with an unbounded (non-polynomial) activation function have the universal approximation property. [10]

The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. For input dimension dx and output dimension dy the minimum width required for the universal approximation of the Lp functions is exactly max{dx + 1, dy} (for a ReLU network). More generally this also holds if both ReLU and a threshold activation function are used. [11]
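
As a concrete illustration of this formula (a minimal sketch; the helper name is hypothetical and not from the cited paper), the bound can be evaluated directly:

```python
def min_relu_width(dx: int, dy: int) -> int:
    """Minimum width for universal L^p approximation by deep ReLU networks,
    per the bound quoted above: max{dx + 1, dy}."""
    return max(dx + 1, dy)

# A map from R^3 to R^2 needs width 4; a map from R^1 to R^5 needs width 5.
assert min_relu_width(3, 2) == 4
assert min_relu_width(1, 5) == 5
```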

History

One of the first versions of the arbitrary width case was proven by George Cybenko in 1989 for sigmoid activation functions. [12] Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed in 1989 that multilayer feed-forward networks with as few as one hidden layer are universal approximators. [1] Hornik also showed in 1991 [13] that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno et al. in 1993 [14] and later Allan Pinkus in 1999 [15] showed that the universal approximation property is equivalent to having a nonpolynomial activation function. In 2022, Shen Zuowei, Haizhao Yang, and Shijun Zhang [16] obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.

The arbitrary depth case was also studied by a number of authors, such as Gustaf Gripenberg in 2003, [17] Dmitry Yarotsky, [18] Zhou Lu et al. in 2017, [19] and Boris Hanin and Mark Sellke in 2018, [20] who focused on neural networks with ReLU activation function. In 2020, Patrick Kidger and Terry Lyons [21] extended those results to neural networks with general activation functions such as tanh, GeLU, or Swish, and in 2022, their result was made quantitative by Leonie Papon and Anastasis Kratsios, [22] who derived explicit depth estimates depending on the regularity of the target function and of the activation function.

The question of minimal possible width for universality was first studied in 2021: Park et al. obtained the minimum width required for the universal approximation of Lp functions using feed-forward neural networks with ReLU as activation functions. [11] Similar results that can be directly applied to residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using control-theoretic arguments. [23] [24] In 2023, Cai [25] obtained the optimal minimum width bound for the universal approximation.

The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999. [26] They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with a bounded number of units in the hidden layers are universal approximators. Using algorithmic and computer programming techniques, Guliyev and Ismailov constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers. [27] It was constructively proved in a 2018 paper [28] that single hidden layer networks with bounded width are still universal approximators for univariate functions, but this property is no longer true for multivariable functions.

Several extensions of the theorem exist, such as to discontinuous activation functions, [14] noncompact domains, [21] certifiable networks, [29] random neural networks, [30] and alternative network architectures and topologies. [21] [31]

In 2023 it was shown that a three-layer neural network can approximate any function (continuous and discontinuous); [32] however, the publication came without efficient learning algorithms for approximating discontinuous functions.

Arbitrary-width case

A spate of papers in the 1980s and 1990s, by George Cybenko, Kurt Hornik and others, established several universal approximation theorems for arbitrary width and bounded depth. [33] [12] [34] [13] See [35] [36] [15] for reviews. The following is the most often quoted:

Universal approximation theorem: Let $C(X, \mathbb{R}^m)$ denote the set of continuous functions from a subset $X$ of a Euclidean space $\mathbb{R}^n$ to the Euclidean space $\mathbb{R}^m$. Let $\sigma \in C(\mathbb{R}, \mathbb{R})$. Note that $(\sigma \circ x)_i = \sigma(x_i)$, so $\sigma \circ x$ denotes $\sigma$ applied to each component of $x$.

Then $\sigma$ is not polynomial if and only if for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact $K \subseteq \mathbb{R}^n$, $f \in C(K, \mathbb{R}^m)$, $\varepsilon > 0$, there exist $k \in \mathbb{N}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^{k}$, $C \in \mathbb{R}^{m \times k}$ such that

$$\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon,$$

where

$$g(x) = C \cdot ( \sigma \circ (A \cdot x + b) ).$$

Also, certain non-continuous activation functions can be used to approximate a sigmoid function, which then allows the above theorem to apply to those functions. For example, the step function works. In particular, this shows that a perceptron network with a single infinitely wide hidden layer can approximate arbitrary functions.

Such an $f$ can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
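
The theorem only asserts that suitable weights exist; it does not say how to find them. The following NumPy sketch (an illustration under assumed choices: tanh activation, random first-layer weights, a least-squares fit of the output layer) builds a network of the stated form $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$ approximating a continuous function on a compact interval, with the error typically shrinking as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh                                  # any non-polynomial activation
f = lambda x: np.sin(3 * x) + 0.5 * x ** 2       # target in C([-1, 1], R)
x = np.linspace(-1.0, 1.0, 400)[:, None]         # grid on the compact set K = [-1, 1]

for k in (5, 50, 500):                           # k = number of hidden neurons (width)
    A = rng.normal(size=(k, 1)) * 5.0            # random first-layer weights
    b = rng.normal(size=k) * 5.0                 # random biases
    H = sigma(x @ A.T + b)                       # hidden activations, shape (400, k)
    C, *_ = np.linalg.lstsq(H, f(x), rcond=None) # fit only the output layer C
    err = np.max(np.abs(H @ C - f(x)))           # sup-norm error on the grid
    print(f"width {k:4d}: sup-norm error ~ {err:.4f}")
```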

Proof sketch

It suffices to prove the case where $m = 1$, since uniform convergence in $\mathbb{R}^m$ is just uniform convergence in each coordinate.

Let $\mathcal{F}_\sigma$ be the set of all one-hidden-layer neural networks constructed with activation function $\sigma$. Let $C_0(\mathbb{R}^n, \mathbb{R})$ be the set of all continuous functions $\mathbb{R}^n \to \mathbb{R}$ with compact support.

If the function $\sigma$ is a polynomial of degree $d$, then $\mathcal{F}_\sigma$ is contained in the closed subspace of all polynomials of degree $d$, so its closure is also contained in it, which is not all of $C_0(\mathbb{R}^n, \mathbb{R})$.

Otherwise, we show that the closure of $\mathcal{F}_\sigma$ is all of $C_0(\mathbb{R}^n, \mathbb{R})$. Suppose we can construct arbitrarily good approximations of the ramp function

$$\operatorname{ramp}(x) = \begin{cases} 0 & x \leq 0 \\ x & 0 \leq x \leq 1 \\ 1 & x \geq 1 \end{cases}$$

then ramp functions can be combined to construct any compactly supported continuous function to arbitrary precision. It therefore remains to approximate the ramp function.

Any of the activation functions commonly used in machine learning can be used to approximate the ramp function, either directly or by first approximating the ReLU and then the ramp function.

if is "squashing", that is, it has limits , then one can first affinely scale down its x-axis so that its graph looks like a step-function with two sharp "overshoots", then make a linear sum of enough of them to make a "staircase" approximation of the ramp function. With more steps of the staircase, the overshoots smooth out and we get arbitrarily good approximation of the ramp function.

The case where $\sigma$ is a generic non-polynomial function is harder, and the reader is directed to [15].

The above proof has not specified how one might use a ramp function to approximate arbitrary functions in $C_0(\mathbb{R}^n, \mathbb{R})$. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the Dirac delta function, then use those to approximate arbitrary functions in $C_0(\mathbb{R}^n, \mathbb{R})$. [37] The original proofs, such as the one by Cybenko, use methods from functional analysis, including the Hahn–Banach and Riesz representation theorems.
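
For intuition only, here is a small sketch (not the construction used in the cited proofs) of how bump functions can be assembled from ramps: the difference of two shifted ramps gives a flat one-dimensional bump, and taking the minimum over coordinates "intersects" such bumps into a compactly supported multidimensional bump:

```python
import numpy as np

ramp = lambda t: np.clip(t, 0.0, 1.0)            # the ramp function from the proof sketch

def bump_1d(t, center=0.0, width=0.5, slope=10.0):
    # Difference of two shifted, steepened ramps: ~1 near `center`, 0 far away.
    return ramp(slope * (t - center + width)) - ramp(slope * (t - center - width))

def bump_2d(x, y):
    # "Intersecting" two axis-aligned flat bumps via a minimum yields a
    # compactly supported 2-D bump; narrow bumps mimic a Dirac delta.
    return np.minimum(bump_1d(x), bump_1d(y))

print(bump_1d(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))   # ~ [0, 0, 1, 0, 0]
print(bump_2d(0.0, 0.0), bump_2d(1.5, 0.0))             # ~ 1.0 and 0.0
```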

The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization: [34]

Universal approximation theorem for pi-sigma networks  With any nonconstant activation function, a one-hidden-layer pi-sigma network is a universal approximator.

Arbitrary-depth case

The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017. [19] They showed that networks of width n + 4 with ReLU activation functions can approximate any Lebesgue-integrable function on n-dimensional input space with respect to distance if network depth is allowed to grow. It was also shown that if the width was less than or equal to n, this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper [19] it was shown that ReLU networks with width n + 1 were sufficient to approximate any continuous function of n-dimensional input variables. [38] The following refinement, specifies the optimal minimum width for which such an approximation is possible and is due to. [39]

Universal approximation theorem (Lp distance, ReLU activation, arbitrary depth, minimal width): For any Bochner–Lebesgue p-integrable function $f : \mathbb{R}^n \to \mathbb{R}^m$ and any $\varepsilon > 0$, there exists a fully connected ReLU network $F$ of width exactly $d_m = \max\{n + 1, m\}$, satisfying

$$\int_{\mathbb{R}^n} \| f(x) - F(x) \|^p \, \mathrm{d}x < \varepsilon.$$

Moreover, there exists a function $f \in L^p(\mathbb{R}^n, \mathbb{R}^m)$ and some $\varepsilon > 0$, for which there is no fully connected ReLU network of width less than $d_m = \max\{n + 1, m\}$ satisfying the above approximation bound.

Remark: If the activation is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is $\max\{n, m, 2\}$. [25]

Quantitative refinement: In the case where $f : [0, 1]^n \to \mathbb{R}$ (i.e. $m = 1$) and $\sigma$ is the ReLU activation function, the exact depth and width for a ReLU network to achieve $\varepsilon$ error are also known. [40] If, moreover, the target function $f$ is smooth, then the required number of layers and their width can be exponentially smaller. [41] Even if $f$ is not smooth, the curse of dimensionality can be broken if $f$ admits additional "compositional structure". [42] [43]

The central result of [21] yields the following universal approximation theorem for networks with bounded width (see also [17] for the first result of this kind).

Universal approximation theorem (Uniform non-affine activation, arbitrary depth, constrained width): Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^n$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be any non-affine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let $\mathcal{N}$ denote the space of feed-forward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers each with $n + m + 2$ neurons, such that every hidden neuron has activation function $\sigma$ and every output neuron has the identity as its activation function. Then given any $\varepsilon > 0$ and any $f \in C(\mathcal{X}, \mathbb{R}^m)$, there exists $\hat{f} \in \mathcal{N}$ such that

$$\sup_{x \in \mathcal{X}} \left\| \hat{f}(x) - f(x) \right\| < \varepsilon.$$

In other words, $\mathcal{N}$ is dense in $C(\mathcal{X}; \mathbb{R}^m)$ with respect to the topology of uniform convergence.

Quantitative refinement: The number of layers and the width of each layer required to approximate $f$ to precision $\varepsilon$ are known; [22] moreover, the result holds true when $\mathcal{X}$ and $\mathbb{R}^m$ are replaced with any non-positively curved Riemannian manifold.

Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions. [19] [20] [44]

Bounded depth and bounded width case

The first result on approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus. [26] Their remarkable result revealed that such networks can be universal approximators and that two hidden layers suffice to achieve this property.

Universal approximation theorem: [26] There exists an activation function $\sigma$ which is analytic, strictly increasing and sigmoidal and has the following property: For any $f \in C[0, 1]^d$ and $\varepsilon > 0$ there exist constants $d_i$, $c_{ij}$, $\theta_{ij}$, $\gamma_i$ and vectors $\mathbf{w}^{ij} \in \mathbb{R}^d$ for which

$$\left| f(\mathbf{x}) - \sum_{i=1}^{6d+3} d_i \, \sigma\!\left( \sum_{j=1}^{3d} c_{ij} \, \sigma(\mathbf{w}^{ij} \cdot \mathbf{x} - \theta_{ij}) - \gamma_i \right) \right| < \varepsilon$$

for all $\mathbf{x} = (x_1, \ldots, x_d) \in [0, 1]^d$.

This is an existence result. It says that activation functions providing the universal approximation property for bounded depth, bounded width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly. For the algorithm and the corresponding computer code see [27]. The theoretical result can be formulated as follows.

Universal approximation theorem: [27] [28] Let $[a, b]$ be a finite segment of the real line, $s = b - a$, and $\lambda$ be any positive number. Then one can algorithmically construct a computable sigmoidal activation function $\sigma : \mathbb{R} \to \mathbb{R}$, which is infinitely differentiable, strictly increasing on $(-\infty, s)$, $\lambda$-strictly increasing on $[s, +\infty)$, and satisfies the following properties:

  1. For any $f \in C[a, b]$ and $\varepsilon > 0$ there exist numbers $c_1, c_2, \theta_1$ and $\theta_2$ such that for all $x \in [a, b]$
     $$| f(x) - c_1 \sigma(x - \theta_1) - c_2 \sigma(x - \theta_2) | < \varepsilon$$
  2. For any continuous function $F$ on the $d$-dimensional box $[a, b]^d$ and $\varepsilon > 0$, there exist constants $e_p$, $c_{pq}$, $\theta_q$ and $\zeta_p$ such that the inequality
     $$\left| F(\mathbf{x}) - \sum_{p=1}^{2d+2} e_p \, \sigma\!\left( \sum_{q=1}^{d} c_{pq} \, \sigma(\mathbf{w}^{q} \cdot \mathbf{x} - \theta_q) - \zeta_p \right) \right| < \varepsilon$$
     holds for all $\mathbf{x} = (x_1, \ldots, x_d) \in [a, b]^d$. Here the weights $\mathbf{w}^{q}$, $q = 1, \ldots, d$, are fixed as follows:
     $$\mathbf{w}^{1} = (1, 0, \ldots, 0), \quad \mathbf{w}^{2} = (0, 1, \ldots, 0), \quad \ldots, \quad \mathbf{w}^{d} = (0, 0, \ldots, 1).$$
     In addition, all the coefficients $e_p$, except one, are equal.

Here "$\sigma$ is $\lambda$-strictly increasing on some set $X$" means that there exists a strictly increasing function $u : X \to \mathbb{R}$ such that $| \sigma(x) - u(x) | \leq \lambda$ for all $x \in X$. Clearly, a $\lambda$-increasing function behaves like a usual increasing function as $\lambda$ gets small. In the "depth-width" terminology, the above theorem says that for certain activation functions depth-$2$ width-$2$ networks are universal approximators for univariate functions and depth-$3$ width-$(2d + 2)$ networks are universal approximators for $d$-variable functions ($d > 1$).
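
As a structural illustration (a sketch only: the constants below are random placeholders, and the logistic sigmoid merely stands in for the specially constructed activation $\sigma$), the two-hidden-layer network in property 2 has $d$ first-layer units with weights fixed to the unit coordinate vectors and $2d + 2$ second-layer units:

```python
import numpy as np

def tlfn(x, c, theta, e, zeta, sigma=lambda t: 0.5 * (1.0 + np.tanh(0.5 * t))):
    """Evaluate a two-hidden-layer network of the shape in property 2:
    first layer of d units (fixed unit-vector weights), second layer of 2d + 2 units."""
    first = sigma(x - theta)            # w^q . x = x_q, since w^q is the q-th unit vector
    second = sigma(c @ first - zeta)    # c has shape (2d + 2, d)
    return e @ second                   # e has shape (2d + 2,)

rng = np.random.default_rng(0)
d = 4
x = rng.uniform(size=d)                 # a point in the box [0, 1]^d
value = tlfn(x,
             c=rng.normal(size=(2 * d + 2, d)),      # placeholder constants, not the
             theta=rng.normal(size=d),               # ones whose existence the theorem
             e=rng.normal(size=2 * d + 2),           # guarantees
             zeta=rng.normal(size=2 * d + 2))
print(value)
```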

Graph input

Achieving useful universal function approximation on graphs (or rather on graph isomorphism classes) has been a longstanding problem. The popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test. [45] In 2020, [46] a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying method whose runtime is polynomial in $|V|$ and $|E|$ and which performed at state of the art on a collection of benchmarks (where $V$ and $E$ are the sets of nodes and edges of the graph, respectively).

References

  1. Hornik, Kurt; Stinchcombe, Maxwell; White, Halbert (January 1989). "Multilayer feedforward networks are universal approximators". Neural Networks. 2 (5): 359–366. doi:10.1016/0893-6080(89)90020-8.
  2. Balázs Csanád Csáji (2001). Approximation with Artificial Neural Networks. Faculty of Sciences, Eötvös Loránd University, Hungary.
  3. Gidon, Albert; Zolnik, Timothy Adam; Fidzinski, Pawel; Bolduan, Felix; Papoutsi, Athanasia; Poirazi, Panayiota; Holtkamp, Martin; Vida, Imre; Larkum, Matthew Evan (3 January 2020). "Dendritic action potentials and computation in human layer 2/3 cortical neurons". Science. 367 (6473): 83–87. Bibcode:2020Sci...367...83G. doi:10.1126/science.aax6239. PMID 31896716.
  4. Kratsios, Anastasis; Bilokopytov, Eugene (2020). Non-Euclidean Universal Approximation. Advances in Neural Information Processing Systems. Vol. 33. Curran Associates.
  5. Zhou, Ding-Xuan (2020). "Universality of deep convolutional neural networks". Applied and Computational Harmonic Analysis. 48 (2): 787–794. arXiv:1805.10769. doi:10.1016/j.acha.2019.06.004. S2CID 44113176.
  6. Heinecke, Andreas; Ho, Jinn; Hwang, Wen-Liang (2020). "Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets". IEEE Signal Processing Letters. 27: 1175–1179. Bibcode:2020ISPL...27.1175H. doi:10.1109/LSP.2020.3005051. S2CID 220669183.
  7. Park, J.; Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks". Neural Computation. 3 (2): 246–257. doi:10.1162/neco.1991.3.2.246. PMID 31167308. S2CID 34868087.
  8. Yarotsky, Dmitry (2021). "Universal Approximations of Invariant Maps by Neural Networks". Constructive Approximation. 55: 407–474. arXiv:1804.10306. doi:10.1007/s00365-021-09546-1. S2CID 13745401.
  9. Zakwan, Muhammad; d'Angelo, Massimiliano; Ferrari-Trecate, Giancarlo (2023). "Universal Approximation Property of Hamiltonian Deep Neural Networks". IEEE Control Systems Letters: 1. arXiv:2303.12147. doi:10.1109/LCSYS.2023.3288350. S2CID 257663609.
  10. Sonoda, Sho; Murata, Noboru (September 2017). "Neural Network with Unbounded Activation Functions is Universal Approximator". Applied and Computational Harmonic Analysis. 43 (2): 233–268. arXiv:1505.03654. doi:10.1016/j.acha.2015.12.005. S2CID 12149203.
  11. Park, Sejun; Yun, Chulhee; Lee, Jaeho; Shin, Jinwoo (2021). Minimum Width for Universal Approximation. International Conference on Learning Representations. arXiv:2006.08859.
  12. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function". Mathematics of Control, Signals, and Systems. 2 (4): 303–314. CiteSeerX 10.1.1.441.7873. doi:10.1007/BF02551274. S2CID 3958369.
  13. Hornik, Kurt (1991). "Approximation capabilities of multilayer feedforward networks". Neural Networks. 4 (2): 251–257. doi:10.1016/0893-6080(91)90009-T. S2CID 7343126.
  14. Leshno, Moshe; Lin, Vladimir Ya.; Pinkus, Allan; Schocken, Shimon (January 1993). "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". Neural Networks. 6 (6): 861–867. doi:10.1016/S0893-6080(05)80131-5. S2CID 206089312.
  15. Pinkus, Allan (January 1999). "Approximation theory of the MLP model in neural networks". Acta Numerica. 8: 143–195. Bibcode:1999AcNum...8..143P. doi:10.1017/S0962492900002919. S2CID 16800260.
  16. Shen, Zuowei; Yang, Haizhao; Zhang, Shijun (January 2022). "Optimal approximation rate of ReLU networks in terms of width and depth". Journal de Mathématiques Pures et Appliquées. 157: 101–135. arXiv:2103.00502. doi:10.1016/j.matpur.2021.07.009. S2CID 232075797.
  17. Gripenberg, Gustaf (June 2003). "Approximation by neural networks with a bounded number of nodes at each level". Journal of Approximation Theory. 122 (2): 260–266. doi:10.1016/S0021-9045(03)00078-9.
  18. Yarotsky, Dmitry (October 2017). "Error bounds for approximations with deep ReLU networks". Neural Networks. 94: 103–114. arXiv:1610.01145. doi:10.1016/j.neunet.2017.07.002. PMID 28756334. S2CID 426133.
  19. Lu, Zhou; Pu, Hongming; Wang, Feicheng; Hu, Zhiqiang; Wang, Liwei (2017). "The Expressive Power of Neural Networks: A View from the Width". Advances in Neural Information Processing Systems. 30. Curran Associates: 6231–6239. arXiv:1709.02540.
  20. Hanin, Boris; Sellke, Mark (2018). "Approximating Continuous Functions by ReLU Nets of Minimal Width". arXiv:1710.11278 [stat.ML].
  21. Kidger, Patrick; Lyons, Terry (July 2020). Universal Approximation with Deep Narrow Networks. Conference on Learning Theory. arXiv:1905.08539.
  22. Kratsios, Anastasis; Papon, Léonie (2022). "Universal Approximation Theorems for Differentiable Geometric Deep Learning". Journal of Machine Learning Research. 23 (196): 1–73. arXiv:2101.05390.
  23. Tabuada, Paulo; Gharesifard, Bahman (2021). Universal approximation power of deep residual neural networks via nonlinear control theory. International Conference on Learning Representations. arXiv:2007.06007.
  24. Tabuada, Paulo; Gharesifard, Bahman (May 2023). "Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control". IEEE Transactions on Automatic Control. 68 (5): 2715–2728. doi:10.1109/TAC.2022.3190051. S2CID 250512115. (Erratum: doi:10.1109/TAC.2024.3390099)
  25. Cai, Yongqiang (2023-02-01). "Achieve the Minimum Width of Neural Networks for Universal Approximation". ICLR. arXiv:2209.11395.
  26. Maiorov, Vitaly; Pinkus, Allan (April 1999). "Lower bounds for approximation by MLP neural networks". Neurocomputing. 25 (1–3): 81–91. doi:10.1016/S0925-2312(98)00111-8.
  27. Guliyev, Namig; Ismailov, Vugar (November 2018). "Approximation capability of two hidden layer feedforward neural networks with fixed weights". Neurocomputing. 316: 262–269. arXiv:2101.09181. doi:10.1016/j.neucom.2018.07.075. S2CID 52285996.
  28. Guliyev, Namig; Ismailov, Vugar (February 2018). "On the approximation by single hidden layer feedforward neural networks with fixed weights". Neural Networks. 98: 296–304. arXiv:1708.06219. doi:10.1016/j.neunet.2017.12.007. PMID 29301110. S2CID 4932839.
  29. Baader, Maximilian; Mirman, Matthew; Vechev, Martin (2020). Universal Approximation with Certified Networks. ICLR.
  30. Gelenbe, Erol; Mao, Zhi Hong; Li, Yan D. (1999). "Function approximation with spiked random networks". IEEE Transactions on Neural Networks. 10 (1): 3–9. doi:10.1109/72.737488. PMID 18252498.
  31. Lin, Hongzhou; Jegelka, Stefanie (2018). ResNet with one-neuron hidden layers is a Universal Approximator. Advances in Neural Information Processing Systems. Vol. 30. Curran Associates. pp. 6169–6178.
  32. Ismailov, Vugar E. (July 2023). "A three layer neural network can represent any multivariate function". Journal of Mathematical Analysis and Applications. 523 (1): 127096. arXiv:2012.03016. doi:10.1016/j.jmaa.2023.127096. S2CID 265100963.
  33. Funahashi, Ken-Ichi (January 1989). "On the approximate realization of continuous mappings by neural networks". Neural Networks. 2 (3): 183–192. doi:10.1016/0893-6080(89)90003-8.
  34. Hornik, Kurt; Stinchcombe, Maxwell; White, Halbert (January 1989). "Multilayer feedforward networks are universal approximators". Neural Networks. 2 (5): 359–366. doi:10.1016/0893-6080(89)90020-8.
  35. Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation. Volume 2. Prentice Hall. ISBN 0-13-273350-1.
  36. Hassoun, M. (1995). Fundamentals of Artificial Neural Networks. MIT Press. p. 48.
  37. Nielsen, Michael A. (2015). Neural Networks and Deep Learning.
  38. Hanin, B. (2018). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv:1710.11278.
  39. Park, Sejun; Yun, Chulhee; Lee, Jaeho; Shin, Jinwoo (2020-09-28). "Minimum Width for Universal Approximation". ICLR. arXiv:2006.08859.
  40. Shen, Zuowei; Yang, Haizhao; Zhang, Shijun (January 2022). "Optimal approximation rate of ReLU networks in terms of width and depth". Journal de Mathématiques Pures et Appliquées. 157: 101–135. arXiv:2103.00502. doi:10.1016/j.matpur.2021.07.009. S2CID 232075797.
  41. Lu, Jianfeng; Shen, Zuowei; Yang, Haizhao; Zhang, Shijun (January 2021). "Deep Network Approximation for Smooth Functions". SIAM Journal on Mathematical Analysis. 53 (5): 5465–5506. arXiv:2001.03040. doi:10.1137/20M134695X. S2CID 210116459.
  42. Juditsky, Anatoli B.; Lepski, Oleg V.; Tsybakov, Alexandre B. (2009-06-01). "Nonparametric estimation of composite functions". The Annals of Statistics. 37 (3). doi:10.1214/08-aos611. ISSN 0090-5364. S2CID 2471890.
  43. Poggio, Tomaso; Mhaskar, Hrushikesh; Rosasco, Lorenzo; Miranda, Brando; Liao, Qianli (2017-03-14). "Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review". International Journal of Automation and Computing. 14 (5): 503–519. arXiv:1611.00740. doi:10.1007/s11633-017-1054-2. ISSN 1476-8186. S2CID 15562587.
  44. Johnson, Jesse (2019). Deep, Skinny Neural Networks are not Universal Approximators. International Conference on Learning Representations.
  45. Xu, Keyulu; Hu, Weihua; Leskovec, Jure; Jegelka, Stefanie (2019). How Powerful are Graph Neural Networks?. International Conference on Learning Representations.
  46. Brüel-Gabrielsson, Rickard (2020). Universal Function Approximation on Graphs. Advances in Neural Information Processing Systems. Vol. 33. Curran Associates.