Frequency principle/spectral bias

The frequency principle/spectral bias is a phenomenon observed in the study of artificial neural networks (ANNs), specifically deep neural networks (DNNs). It describes the tendency of deep neural networks to fit target functions from low to high frequencies during the training process.

This phenomenon is referred to as the frequency principle (F-Principle) by Zhi-Qin John Xu et al. [1] [2] or spectral bias by Nasim Rahaman et al. [3] The F-Principle can be robustly observed in DNNs, regardless of overparametrization. A key mechanism of the F-Principle is that the regularity of the activation function translates into the decay rate of the loss function in the frequency domain.

The discovery of the frequency principle has inspired the design of DNNs that can quickly learn high-frequency functions. This has applications in scientific computing, image classification, and point cloud fitting problems. Furthermore, it provides a means to comprehend phenomena in practical applications and has inspired numerous studies on deep learning from the frequency perspective. [4]

Main results (informal)

Experimental results

Fig 1: Illustration of the frequency principle in one dimension. The abscissa is the frequency and the ordinate is the amplitude of the corresponding frequency component. The red dashed line is the DFT of the one-dimensional target function; the blue solid line is the DFT of the DNN output.

In one-dimensional problems, the Discrete Fourier Transform (DFT) of the target function and of the DNN output can be computed. Fig. 1 shows that the DNN output (blue) fits the low-frequency components before the high-frequency ones.
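
This behavior can be reproduced with a short experiment. The following is a minimal sketch, assuming a generic setup: the network width, optimizer, learning rate, target function and step counts are illustrative choices, not taken from the cited papers.

```python
# Minimal sketch (illustrative settings): train a small fully connected network
# on a 1-D target and watch which Fourier components converge first.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1-D target with a low- and a high-frequency component.
x = torch.linspace(-1, 1, 201).unsqueeze(1)
y = torch.sin(np.pi * x) + 0.5 * torch.sin(10 * np.pi * x)

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(),
                    nn.Linear(200, 200), nn.Tanh(),
                    nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def spectrum(v):
    """Amplitude of the DFT of a sampled 1-D function."""
    return np.abs(np.fft.rfft(v.detach().numpy().ravel()))

target_spec = spectrum(y)
for step in range(5001):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        out_spec = spectrum(net(x))
        # Relative error of the leading frequency components; the low-frequency
        # entries typically converge first, consistent with the F-Principle.
        rel_err = np.abs(out_spec - target_spec) / (target_spec + 1e-8)
        print(step, rel_err[:12].round(2))
```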

Fig 2: Illustration of the frequency principle in two dimensions. The first subfigure shows the ground-truth image of the camera man; the following three subfigures show the DNN outputs at training steps 80, 2000 and 58000.

In two-dimensional problems, Fig. 2 shows a DNN fitting an image of the camera man. The DNN starts from a coarse image and produces more detail as training progresses, i.e., it learns from low to high frequencies, which is analogous to how the biological brain remembers an image. This low-frequency preference of DNNs can be exploited for image restoration tasks such as inpainting; however, the insufficient learning of high-frequency structures must be taken into account. Algorithms developed to address this limitation are introduced in the Applications section.

In high-dimensional problems, one can use a projection method to visualize the frequency convergence in one particular direction, or use a Gaussian filter to roughly separate the convergence of the low-frequency and high-frequency parts. [4]
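
The filtering idea can be sketched in a one-dimensional analogue as follows (the helper names and the smoothing width sigma are choices made here; in high dimensions the filter is applied to the sampled data rather than to values on a regular grid):

```python
# Sketch of the Gaussian-filter check: split target and network output into a
# low-frequency part (Gaussian smoothing) and a high-frequency part (residual)
# and track the relative error of each part during training.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def low_high_split(values, sigma=5.0):
    """Low-frequency part via Gaussian smoothing; the residual is taken as
    the high-frequency part."""
    low = gaussian_filter1d(values, sigma=sigma)
    return low, values - low

def frequency_errors(target, output, sigma=5.0):
    """Relative errors of the low- and high-frequency parts of the fit."""
    t_low, t_high = low_high_split(target, sigma)
    o_low, o_high = low_high_split(output, sigma)
    err_low = np.linalg.norm(o_low - t_low) / np.linalg.norm(t_low)
    err_high = np.linalg.norm(o_high - t_high) / np.linalg.norm(t_high)
    return err_low, err_high

# Tracking (err_low, err_high) over training typically shows err_low
# dropping well before err_high.
```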

Theoretical results

Under the following assumptions, i.e., (i) certain regularity of the target function, the sample distribution function and the activation function, and (ii) a bounded training trajectory with loss convergence, Luo et al. [5] prove that the change of the high-frequency part of the loss relative to the total loss decays with the separating frequency according to a certain power, which is determined by the regularity assumption. A key step of the proof is that composite functions retain a certain regularity, which yields decay in the frequency domain, so the result applies to general network structures with multiple layers. While this characterization of the F-Principle is very general, it is too coarse-grained to differentiate the effects of network structure or of special properties of DNNs; it provides a qualitative understanding rather than a quantitative characterization of such differences.
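
Written schematically (with notation introduced here for illustration; see [5] for the precise time-averaged quantities, norms and constants), the estimate has the form

\[
\frac{\bigl|\Delta L_{\ge \eta}\bigr|}{\bigl|\Delta L\bigr|} \le C\,\eta^{-m},
\]

where \(L\) is the total loss, \(L_{\ge \eta}\) is the part of the loss carried by frequencies of magnitude at least \(\eta\), \(\Delta\) denotes the change along training, and the exponent \(m > 0\) is determined by the regularity assumptions.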

There is a continuous framework [6] for studying machine learning which suggests that the gradient flows of neural networks are well-behaved flows that obey the F-Principle. This is because they are governed by integral equations, which have higher regularity, and the increased regularity leads to faster decay in the Fourier domain.

Applications

Algorithms designed to overcome the difficulty of learning high frequencies

Phase shift DNN: PhaseDNN [7] shifts the high-frequency components of the data downward to a low-frequency spectrum for learning, and then shifts the learned representation back to the original high frequencies.
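
The spectral shift at the heart of this approach can be illustrated as follows (a sketch of the shifting step only, not the full algorithm of [7]; the signal and the frequency band are illustrative choices):

```python
# Multiplying the data by exp(-i*w0*x) moves content near frequency w0 down to
# frequency 0, where a network with spectral bias learns it quickly; multiplying
# the learned fit by exp(+i*w0*x) shifts it back.
import numpy as np

x = np.linspace(0, 1, 1000)
w0 = 2 * np.pi * 40                      # band to be shifted down (40 Hz)
signal = np.cos(w0 * x)                  # high-frequency target

shifted = signal * np.exp(-1j * w0 * x)  # part of the spectrum now sits near 0

freqs = np.fft.fftfreq(x.size, d=x[1] - x[0])
def energy_below(sig, cutoff=5.0):
    spec = np.abs(np.fft.fft(sig)) ** 2
    return spec[np.abs(freqs) < cutoff].sum() / spec.sum()

print(energy_below(signal))   # close to 0: the energy sits near 40 Hz
print(energy_below(shifted))  # about 0.5: half the energy has moved near 0 Hz

recovered = shifted * np.exp(1j * w0 * x)  # shift back after learning
print(np.allclose(recovered.real, signal))
```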

Adaptive activation functions: Adaptive activation functions [8] replace the activation function \(\sigma(x)\) by \(\sigma(\mu a x)\), where \(\mu\) is a fixed scale factor with \(\mu \ge 1\) and \(a\) is a trainable variable shared by all neurons.
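
A minimal sketch of such an adaptive activation in PyTorch, assuming a tanh base activation and an illustrative scale factor and initialization (not the exact settings of [8]):

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """tanh(mu * a * x) with a fixed scale mu and a trainable, shared slope a."""
    def __init__(self, mu=10.0):
        super().__init__()
        self.mu = mu                                   # fixed scale factor, mu >= 1
        self.a = nn.Parameter(torch.tensor(1.0 / mu))  # trainable, shared by all neurons

    def forward(self, x):
        return torch.tanh(self.mu * self.a * x)

# The adaptive activation is used in place of a plain nn.Tanh().
net = nn.Sequential(nn.Linear(1, 50), AdaptiveTanh(), nn.Linear(50, 1))
```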

Fig 3: Illustration of two MscaleDNN structures.

Multi-scale DNN: To alleviate the difficulty of learning high frequencies in high-dimensional problems, the multi-scale DNN (MscaleDNN) method [9] considers frequency conversion only in the radial direction. The conversion in frequency space can be achieved by a scaling, which is equivalent to an inverse scaling in the spatial domain.

For the first kind, an MscaleDNN takes the following form

\[
f_{\theta}(x) = W^{[L-1]} \sigma\circ\Bigl(\cdots\Bigl(W^{[1]} \sigma\circ\bigl(K \odot (W^{[0]} x) + b^{[0]}\bigr) + b^{[1]}\Bigr)\cdots\Bigr) + b^{[L-1]},
\]

where \(x \in \mathbb{R}^{d}\), \(W^{[l]} \in \mathbb{R}^{m_{l+1} \times m_{l}}\), \(m_{l}\) is the neuron number of the \(l\)-th hidden layer, \(m_{0} = d\), \(b^{[l]} \in \mathbb{R}^{m_{l+1}}\), \(\sigma\) is a scalar function and \(\circ\) means entry-wise operation, \(\odot\) is the Hadamard product, and

\[
K = (a_{1}, a_{1}, \ldots, a_{1}, a_{2}, a_{2}, \ldots, a_{2}, \ldots, a_{N}, a_{N}, \ldots, a_{N})^{T},
\]

where \(a_{i} = i\) or \(a_{i} = 2^{i-1}\). This structure is called Multi-scale DNN-1 (MscaleDNN-1).

The second kind of MscaleDNN, denoted MscaleDNN-2 in Fig. 3, is a sum of subnetworks, in which each scaled input goes through a separate subnetwork. In MscaleDNN-2, the weight matrices from \(W^{[1]}\) to \(W^{[L-1]}\) are block diagonal. Again, the scale coefficient is \(a_{i} = i\) or \(a_{i} = 2^{i-1}\).
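
A minimal sketch of the MscaleDNN-2 idea, with illustrative widths, scales and activation (not the authors' implementation):

```python
import torch
import torch.nn as nn

class MscaleDNN2(nn.Module):
    """Sum of subnetworks; each subnetwork receives the input scaled by a_i."""
    def __init__(self, dim_in, width=64, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.subnets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim_in, width), nn.Tanh(),
                nn.Linear(width, width), nn.Tanh(),
                nn.Linear(width, 1),
            )
            for _ in scales
        )

    def forward(self, x):
        # Scaling the input by a_i shifts the frequencies seen by the
        # subnetwork, so larger a_i targets higher-frequency content.
        return sum(net(a * x) for a, net in zip(self.scales, self.subnets))

model = MscaleDNN2(dim_in=2)
y = model(torch.rand(16, 2))
```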

Fourier feature network: Fourier feature networks [10] map the input \(x\) to \(\bigl(\cos(2\pi B x), \sin(2\pi B x)\bigr)\) for image reconstruction tasks, where \(B\) is a random frequency matrix; this mapping is then used as the input to the neural network. An extended Fourier feature network [11] was developed for PDE problems, where the entries of \(B\) are selected from different ranges. Ben Mildenhall et al. successfully applied this multiscale Fourier feature input in neural radiance fields for view synthesis. [12]
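
A minimal sketch of the Fourier feature mapping, with an illustrative feature count and scale (the helper name and parameters are choices made here):

```python
# The raw coordinates are replaced by (cos(2*pi*B x), sin(2*pi*B x)) and the
# mapped features are fed to an ordinary fully connected network.
import numpy as np

rng = np.random.default_rng(0)
d, num_features, sigma = 2, 64, 10.0
B = rng.normal(0.0, sigma, size=(d, num_features))  # fixed random frequency matrix

def fourier_features(v):
    """v: array of shape (n, d) -> features of shape (n, 2*num_features)."""
    proj = 2.0 * np.pi * v @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

coords = rng.random((128, 2))        # e.g. normalized pixel coordinates
features = fourier_features(coords)  # network input instead of the raw coords
```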

Multi-stage neural network: Multi-stage neural networks (MSNN) [13] use a superposition of DNNs, in which successive networks are trained to fit the residuals left by the previous networks, boosting approximation accuracy. MSNNs have been applied to both regression problems and physics-informed neural networks, effectively addressing spectral bias and achieving accuracy close to double-precision machine precision.
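
A minimal sketch of the residual-fitting idea, with an illustrative target and training loop (not the authors' exact scheme):

```python
import math
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                         nn.Linear(64, 64), nn.Tanh(),
                         nn.Linear(64, 1))

def train(net, x, y, steps=2000, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

x = torch.linspace(-1, 1, 400).unsqueeze(1)
y = torch.sin(2 * math.pi * x) + 0.1 * torch.sin(25 * math.pi * x)

stages, residual = [], y.clone()
for _ in range(3):
    net = train(make_net(), x, residual)
    with torch.no_grad():
        residual = residual - net(x)     # the next stage fits what is left
    stages.append(net)

with torch.no_grad():
    prediction = sum(net(x) for net in stages)
    print(((prediction - y) ** 2).mean().item())
```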

Frequency perspective for understanding experimental phenomena

Compression phase: The F-Principle explains the compression phase in the information plane. [14] [1] The entropy or information quantifies the number of possible output values: more possible output values lead to a higher entropy. In learning a discretized function, the DNN first fits its continuous low-frequency components, a state of large entropy. As the network gradually captures the high-frequency components, the DNN output tends to become discretized and the entropy decreases. Thus, the compression phase appears in the information plane.

Increasing complexity: The F-Principle also explains why the complexity of the DNN output increases during training.

Strength and limitation: The F-Principle points out that deep neural networks are good at learning low-frequency functions but have difficulty learning high-frequency functions.

Early-stopping trick: As noise is often dominated by high frequencies, a neural network with spectral bias can avoid learning high-frequency noise by stopping training early.

References

  1. Xu, Zhi-Qin John; Zhang, Yaoyu; Xiao, Yanyang (2019). "Training Behavior of Deep Neural Network in Frequency Domain". In Tom Gedeon; Kok Wai Wong; Minho Lee (eds.). Neural Information Processing. Vol. 11953. Cham: Springer International Publishing. pp. 264–274. arXiv:1807.01251. doi:10.1007/978-3-030-36708-4_22. ISBN 978-3-030-36707-7. S2CID 49562099.
  2. Xu, Zhi-Qin John; Zhang, Yaoyu; Luo, Tao; Xiao, Yanyang; Ma, Zheng (2020). "Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks". Communications in Computational Physics. 28 (5): 1746–1767. arXiv:1901.06523. Bibcode:2020CCoPh..28.1746X. doi:10.4208/cicp.OA-2020-0085. ISSN 1815-2406. S2CID 58981616.
  3. Rahaman, Nasim; Baratin, Aristide; Arpit, Devansh; Draxler, Felix; Lin, Min; Hamprecht, Fred; Bengio, Yoshua; Courville, Aaron (2019-05-24). "On the Spectral Bias of Neural Networks". Proceedings of the 36th International Conference on Machine Learning. PMLR. pp. 5301–5310. Retrieved 2023-07-14.
  4. Xu, Zhi-Qin John; Zhang, Yaoyu; Luo, Tao (2022-10-18). "Overview frequency principle/spectral bias in deep learning". arXiv:2201.07395 [cs.LG].
  5. Luo, Tao; Ma, Zheng; Xu, Zhi-Qin John; Zhang, Yaoyu (2021). "Theory of the Frequency Principle for General Deep Neural Networks". CSIAM Transactions on Applied Mathematics. 2 (3): 484–507. arXiv:1906.09235. doi:10.4208/csiam-am.SO-2020-0005. ISSN 2708-0560. S2CID 195317121. Retrieved 2023-07-14.
  6. E, Weinan; Ma, Chao; Wu, Lei (2020-11-01). "Machine learning from a continuous viewpoint, I". Science China Mathematics. 63 (11): 2233–2266. arXiv:1912.12777. doi:10.1007/s11425-020-1773-8. ISSN 1869-1862. S2CID 209515941. Retrieved 2023-07-14.
  7. Cai, Wei; Li, Xiaoguang; Liu, Lizuo (2020). "A Phase Shift Deep Neural Network for High Frequency Approximation and Wave Problems". SIAM Journal on Scientific Computing. 42 (5): A3285–A3312. arXiv:1909.11759. Bibcode:2020SJSC...42A3285C. doi:10.1137/19M1310050. ISSN 1064-8275. S2CID 209376162. Retrieved 2023-07-16.
  8. Jagtap, Ameya D.; Kawaguchi, Kenji; Karniadakis, George Em (2020-03-01). "Adaptive activation functions accelerate convergence in deep and physics-informed neural networks". Journal of Computational Physics. 404: 109136. arXiv:1906.01170. Bibcode:2020JCoPh.40409136J. doi:10.1016/j.jcp.2019.109136. ISSN 0021-9991. S2CID 174797885. Retrieved 2023-07-16.
  9. Liu, Ziqi; Cai, Wei; Xu, Zhi-Qin John (2020). "Multi-Scale Deep Neural Network (MscaleDNN) for Solving Poisson-Boltzmann Equation in Complex Domains". Communications in Computational Physics. 28 (5): 1970–2001. arXiv:2007.11207. Bibcode:2020CCoPh..28.1970L. doi:10.4208/cicp.OA-2020-0179. ISSN 1815-2406. S2CID 220686331. Retrieved 2023-07-16.
  10. Tancik, Matthew; Srinivasan, Pratul P.; Mildenhall, Ben; Fridovich-Keil, Sara; Raghavan, Nithin; Singhal, Utkarsh; Ramamoorthi, Ravi; Barron, Jonathan T.; Ng, Ren (2020-06-18). "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains". arXiv:2006.10739 [cs.CV].
  11. Wang, Sifan; Wang, Hanwen; Perdikaris, Paris (2021-10-01). "On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks". Computer Methods in Applied Mechanics and Engineering. 384: 113938. arXiv:2012.10047. Bibcode:2021CMAME.384k3938W. doi:10.1016/j.cma.2021.113938. ISSN 0045-7825. S2CID 229331851. Retrieved 2023-07-17.
  12. Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020-08-03). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". arXiv:2003.08934 [cs.CV].
  13. Wang, Yongji; Lai, Ching-Yao (2024-05-01). "Multi-stage neural networks: Function approximator of machine precision". Journal of Computational Physics. 504: 112865. arXiv:2307.08934. doi:10.1016/j.jcp.2024.112865. ISSN 0021-9991.
  14. Shwartz-Ziv, Ravid; Tishby, Naftali (2017-04-29). "Opening the Black Box of Deep Neural Networks via Information". arXiv:1703.00810 [cs.LG].