Deep backward stochastic differential equation method

Figure: the neural network architecture of the deep BSDE method.

The deep backward stochastic differential equation method is a numerical method that combines deep learning with backward stochastic differential equations (BSDEs). It is particularly useful for solving high-dimensional problems in financial derivatives pricing and risk management. By leveraging the powerful function-approximation capabilities of deep neural networks, deep BSDE addresses the computational challenges faced by traditional numerical methods in high-dimensional settings. [1]


History

Backward stochastic differential equations

BSDEs were first introduced by Étienne Pardoux and Shige Peng in 1990, who established the existence and uniqueness theory for their solutions. They have since become essential tools in stochastic control and financial mathematics, with wide applications in option pricing, risk measurement, and dynamic hedging. [2]

Deep learning

Introduction to Deep Learning

Deep learning is a machine learning method based on multilayer neural networks. Its core concepts can be traced back to the neural computing models of the 1940s. In the 1980s, the proposal of the backpropagation algorithm made the training of multilayer neural networks practical. In 2006, the deep belief networks proposed by Geoffrey Hinton and others rekindled interest in deep learning. Since then, deep learning has made groundbreaking advances in image processing, speech recognition, natural language processing, and other fields. [3]

Limitations of traditional numerical methods

Traditional numerical methods for solving stochastic differential equations [4] include the Euler–Maruyama method, the Milstein method, stochastic Runge–Kutta methods, and methods based on different representations of iterated stochastic integrals. [5] [6]

As financial problems become more complex, however, traditional numerical methods for BSDEs (such as the Monte Carlo method and the finite difference method) show limitations, notably high computational complexity and the curse of dimensionality. [1]

  1. In high-dimensional scenarios, the Monte Carlo method requires numerous simulation paths to ensure accuracy, resulting in lengthy computation times. In particular, for nonlinear BSDEs, the convergence rate is slow, making it challenging to handle complex financial derivative pricing problems. [7] [8]
    (Figure: Monte Carlo method applied to approximating the value of π.)
  2. For the finite difference method, the number of grid points grows exponentially with the dimension, leading to prohibitive computational and storage demands. The method is generally suitable for simple boundary conditions and low-dimensional BSDEs, but it is far less effective in complex, high-dimensional situations. [9]

Deep BSDE method

The combination of deep learning with BSDEs, known as deep BSDE, was proposed by Han, Jentzen, and E in 2018 as a solution to the high-dimensional challenges faced by traditional numerical methods. The Deep BSDE approach leverages the powerful nonlinear fitting capabilities of deep learning, approximating the solution of BSDEs by constructing neural networks. The specific idea is to represent the solution of a BSDE as the output of a neural network and train the network to approximate the solution. [1]

Model

Mathematical method

Backward stochastic differential equations (BSDEs) are a powerful mathematical tool extensively applied in fields such as stochastic control and financial mathematics. Unlike traditional stochastic differential equations (SDEs), which are solved forward in time, BSDEs are solved backward, starting from a future terminal time and moving back to the present. This characteristic makes BSDEs particularly suitable for problems involving terminal conditions and uncertainty. [2]

A backward stochastic differential equation (BSDE) can be formulated as: [10]

$$Y_t = \xi + \int_t^T f(s, Y_s, Z_s)\, ds - \int_t^T Z_s\, dW_s, \qquad t \in [0, T].$$

In this equation:

  • $\xi$ is the terminal condition, a random variable specified at the terminal time $T$;
  • $f$ is the driver (or generator) of the BSDE;
  • $W_s$ is a standard Brownian motion;
  • $(Y_t, Z_t)$ is the pair of unknown processes.

The goal is to find adapted processes $Y_t$ and $Z_t$ that satisfy this equation. Traditional numerical methods struggle with BSDEs due to the curse of dimensionality, which makes computations in high-dimensional spaces extremely challenging. [1]
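As a concrete illustration (a standard textbook example, not drawn from the cited sources), the price of a European claim with payoff $g(S_T)$ in a Black–Scholes market solves a linear BSDE whose driver encodes discounting and the market price of risk $\theta = (\mu - r)/\sigma$:

$$Y_t = g(S_T) + \int_t^T \bigl(-r\,Y_s - \theta\, Z_s\bigr)\, ds - \int_t^T Z_s\, dW_s,$$

so that $Y_0$ is the option price and $Z_t$ determines the hedging portfolio.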

Methodology overview

Source: [1]

1. Semilinear parabolic PDEs

We consider a general class of semilinear parabolic PDEs of the form

$$\frac{\partial u}{\partial t}(t,x) + \frac{1}{2}\operatorname{Tr}\!\bigl(\sigma\sigma^{\mathsf T}(t,x)\,(\operatorname{Hess}_x u)(t,x)\bigr) + \nabla u(t,x)\cdot\mu(t,x) + f\bigl(t, x, u(t,x), \sigma^{\mathsf T}(t,x)\,\nabla u(t,x)\bigr) = 0,$$

with terminal condition $u(T, x) = g(x)$. In this equation:

  • $g(x)$ is the terminal condition specified at time $T$;
  • $t$ and $x$ represent the time and the $d$-dimensional space variable, respectively;
  • $\sigma$ is a known matrix-valued function, $\sigma^{\mathsf T}$ denotes its transpose, and $\operatorname{Hess}_x u$ denotes the Hessian of $u$ with respect to $x$;
  • $\mu$ is a known vector-valued function, and $f$ is a known nonlinear function.
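For example (a standard special case, not drawn from the cited references), the linear Black–Scholes equation for a basket of $d$ uncorrelated assets with identical volatility $\bar\sigma$ fits this template with $\mu(t,x) = r x$, $\sigma(t,x) = \bar\sigma \operatorname{diag}(x)$ and $f(t,x,y,z) = -r y$:

$$\frac{\partial u}{\partial t} + \frac{\bar\sigma^2}{2}\sum_{i=1}^{d} x_i^2 \frac{\partial^2 u}{\partial x_i^2} + r \sum_{i=1}^{d} x_i \frac{\partial u}{\partial x_i} - r u = 0, \qquad u(T,x) = g(x).$$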

2. Stochastic process representation

Let $W_t$ be a $d$-dimensional Brownian motion and let $X_t$ be a $d$-dimensional stochastic process which satisfies

$$X_t = X_0 + \int_0^t \mu(s, X_s)\, ds + \int_0^t \sigma(s, X_s)\, dW_s.$$

3. Backward stochastic differential equation (BSDE)

Then the solution $u$ of the PDE satisfies the following BSDE:

$$u(t, X_t) - u(0, X_0) = -\int_0^t f\bigl(s, X_s, u(s, X_s), \sigma^{\mathsf T}(s, X_s)\,\nabla u(s, X_s)\bigr)\, ds + \int_0^t \bigl[\nabla u(s, X_s)\bigr]^{\mathsf T} \sigma(s, X_s)\, dW_s.$$

4. Temporal discretization

Discretize the time interval $[0, T]$ into $N$ steps $0 = t_0 < t_1 < \cdots < t_N = T$, and approximate the forward process and the solution by

$$X_{t_{n+1}} \approx X_{t_n} + \mu(t_n, X_{t_n})\,\Delta t_n + \sigma(t_n, X_{t_n})\,\Delta W_n,$$

$$u(t_{n+1}, X_{t_{n+1}}) \approx u(t_n, X_{t_n}) - f\bigl(t_n, X_{t_n}, u(t_n, X_{t_n}), \sigma^{\mathsf T}(t_n, X_{t_n})\,\nabla u(t_n, X_{t_n})\bigr)\,\Delta t_n + \bigl[\nabla u(t_n, X_{t_n})\bigr]^{\mathsf T} \sigma(t_n, X_{t_n})\,\Delta W_n,$$

where $\Delta t_n = t_{n+1} - t_n$ and $\Delta W_n = W_{t_{n+1}} - W_{t_n}$.
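The forward half of this discretization can be simulated directly. The following Python sketch is illustrative only; the function names, the diagonal-diffusion simplification, and the geometric-Brownian-motion example are assumptions for the sake of a self-contained example, not taken from the cited sources.

import numpy as np

def simulate_paths(x0, mu, sigma, T, N, n_paths, seed=0):
    """Euler-Maruyama paths of the forward SDE and the Brownian increments.

    Minimal sketch: mu(t, x) and sigma(t, x) are callables returning the drift
    and diffusion, applied elementwise (diagonal diffusion assumed).
    """
    rng = np.random.default_rng(seed)
    d = len(x0)
    dt = T / N
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, N, d))   # Brownian increments
    X = np.empty((n_paths, N + 1, d))
    X[:, 0] = x0
    for n in range(N):
        t = n * dt
        X[:, n + 1] = X[:, n] + mu(t, X[:, n]) * dt + sigma(t, X[:, n]) * dW[:, n]
    return X, dW

# Example: geometric Brownian motion in d = 100 dimensions
X, dW = simulate_paths(x0=np.full(100, 1.0),
                       mu=lambda t, x: 0.05 * x,
                       sigma=lambda t, x: 0.2 * x,
                       T=1.0, N=20, n_paths=256)

These sampled paths and increments are exactly the input data used to train the network in step 6 below.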

5. Neural network approximation

Use a multilayer feedforward neural network to approximate the spatial gradient term:

$$\sigma^{\mathsf T}(t_n, X_{t_n})\,\nabla u(t_n, X_{t_n}) \approx \bigl(\sigma^{\mathsf T}\nabla u\bigr)(t_n, X_{t_n}; \theta_n),$$

for $n = 1, \ldots, N - 1$, where $\theta_n$ denotes the parameters of the neural network approximating $x \mapsto \sigma^{\mathsf T}(t, x)\,\nabla u(t, x)$ at $t = t_n$.
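One possible choice for such a sub-network is a small fully connected network. The sketch below is an assumption for illustration (the layer widths and the ReLU activation are not prescribed by the method); it maps the state $X_{t_n}$ to an approximation of $\sigma^{\mathsf T}\nabla u$ at $t_n$:

import torch
import torch.nn as nn

def make_subnet(d, hidden=110):
    """One sub-network per time step t_n: maps x in R^d to an
    approximation of sigma^T(t_n, x) grad u(t_n, x) in R^d."""
    return nn.Sequential(
        nn.Linear(d, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, d),
    )

subnet = make_subnet(d=100)
z = subnet(torch.randn(256, 100))   # batch of 256 states -> batch of gradient approximations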

6. Training the neural network

Stack all sub-networks of the approximation step into a single deep neural network that, given a path, rolls the discretized BSDE forward from $t_0$ to $t_N$. Train the network using the sampled paths $\{X_{t_n}\}_{0 \le n \le N}$ and $\{W_{t_n}\}_{0 \le n \le N}$ as input data, minimizing the loss function

$$\ell(\theta) = \mathbb{E}\Bigl[\,\bigl|\,g(X_{t_N}) - \hat{u}\bigl(\{X_{t_n}\}_{0 \le n \le N}, \{W_{t_n}\}_{0 \le n \le N}\bigr)\bigr|^2\Bigr],$$

where $\hat{u}$ is the network's approximation of $u(t_N, X_{t_N})$.

Neural network architecture

Source: [1]

Deep learning encompasses a class of machine learning techniques that have transformed numerous fields by enabling the modeling and interpretation of intricate data structures. These methods are distinguished by their hierarchical architectures, comprising multiple layers of interconnected nodes (neurons), which allow deep neural networks to autonomously learn abstract representations of data. This makes them particularly effective in tasks such as image recognition, natural language processing, and financial modeling. The core of the deep BSDE method lies in designing an appropriate neural network structure (such as a fully connected network or a recurrent neural network) and selecting an effective optimization algorithm. [3]

The choice of network architecture, the number of layers, and the number of neurons per layer are crucial hyperparameters that significantly impact the performance of the deep BSDE method. The method constructs neural networks to approximate the solution $u$ and its spatial gradient $\nabla u$, and trains them using stochastic gradient descent or related optimization algorithms. [1]

The figure illustrates the network architecture for the deep BSDE method. Here $\nabla u(t_n, X_{t_n})$ denotes the variable approximated directly by sub-networks, while $u(t_n, X_{t_n})$ and $X_{t_n}$ denote the variables computed iteratively in the network. There are three types of connections in this network: [1]

i) $X_{t_n} \rightarrow h_1^n \rightarrow h_2^n \rightarrow \cdots \rightarrow \nabla u(t_n, X_{t_n})$ is the multilayer feedforward neural network approximating the spatial gradient at time $t = t_n$. The weights of this sub-network are the parameters being optimized.

ii) $\bigl(u(t_n, X_{t_n}), \nabla u(t_n, X_{t_n}), W_{t_{n+1}} - W_{t_n}\bigr) \rightarrow u(t_{n+1}, X_{t_{n+1}})$ is the forward iteration that ultimately provides the final output of the network as an approximation of $u(t_N, X_{t_N})$, characterized by the discretized BSDE recursion in step 4 above. There are no parameters to optimize in this type of connection.

iii) $\bigl(X_{t_n}, W_{t_{n+1}} - W_{t_n}\bigr) \rightarrow X_{t_{n+1}}$ is the shortcut connecting blocks at different times, characterized by the Euler–Maruyama update in step 4 above. There are also no parameters to optimize in this type of connection.

Algorithms

Figure: gradient descent vs. (Hamiltonian) Monte Carlo.

Adam optimizer

This function implements the Adam [11] algorithm for minimizing the target function $\ell(\theta)$.

Function: ADAM($\alpha$, $\beta_1$, $\beta_2$, $\varepsilon$, $\ell$, $\theta_0$) is
    // Step 1: Initialize parameters
    $m_0 \leftarrow 0$    // Initialize the first moment vector
    $v_0 \leftarrow 0$    // Initialize the second moment vector
    $t \leftarrow 0$    // Initialize timestep
    // Step 2: Optimization loop
    while $\theta_t$ has not converged do
        $t \leftarrow t + 1$
        $g_t \leftarrow \nabla_\theta \ell(\theta_{t-1})$    // Compute gradient of $\ell$ at timestep $t$
        $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$    // Update biased first moment estimate
        $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$    // Update biased second raw moment estimate
        $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$    // Compute bias-corrected first moment estimate
        $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$    // Compute bias-corrected second moment estimate
        $\theta_t \leftarrow \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$    // Update parameters
    return $\theta_t$
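The same update can be written compactly in Python. The sketch below is a minimal NumPy illustration (the function name and the toy objective are our own; default hyperparameters follow Kingma and Ba [11]):

import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal Adam sketch: grad(theta) returns the gradient of the objective."""
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)   # first moment estimate
    v = np.zeros_like(theta)   # second raw moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # biased first moment
        v = beta2 * v + (1 - beta2) * g * g      # biased second raw moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta

# Example: minimize f(theta) = ||theta - 3||^2
theta_star = adam(lambda th: 2 * (th - 3.0), np.zeros(5))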

Backpropagation algorithm

This function implements the backpropagation algorithm for training a multi-layer feedforward neural network.

Function: BackPropagation(training set $D$, learning rate $\eta$) is
    // Step 1: Random initialization of all connection weights and thresholds
    // Step 2: Optimization loop
    repeat until termination condition is met:
        for each training example $(x, y)$ in $D$:
            compute the network output $\hat{y}$    // Compute output
            // Compute gradients (sigmoid activations and squared-error loss assumed)
            for each output neuron $j$:
                $g_j \leftarrow \hat{y}_j (1 - \hat{y}_j)(y_j - \hat{y}_j)$    // Gradient of output neuron
            for each hidden neuron $h$:
                $e_h \leftarrow b_h (1 - b_h) \sum_j w_{hj}\, g_j$    // Gradient of hidden neuron
            // Update weights
            for each hidden-to-output weight $w_{hj}$:
                $w_{hj} \leftarrow w_{hj} + \eta\, g_j b_h$    // Update rule for weight
            for each input-to-hidden weight $v_{ih}$:
                $v_{ih} \leftarrow v_{ih} + \eta\, e_h x_i$    // Update rule for weight
            // Update thresholds
            for each output threshold $\theta_j$:
                $\theta_j \leftarrow \theta_j - \eta\, g_j$    // Update rule for parameter
            for each hidden threshold $\gamma_h$:
                $\gamma_h \leftarrow \gamma_h - \eta\, e_h$    // Update rule for parameter
    // Step 3: Construct the trained multi-layer feedforward neural network
    return trained neural network

Here $b_h$ denotes the activation of hidden neuron $h$.
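For concreteness, the following Python sketch trains a one-hidden-layer sigmoid network with exactly these per-example updates. It is a minimal illustration under the sigmoid/squared-error assumption stated above; the shapes, variable names, and hyperparameters are our own choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, Y, n_hidden=8, eta=0.1, epochs=1000, seed=0):
    """Backpropagation sketch for a one-hidden-layer sigmoid network,
    trained with per-example gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    V = rng.normal(scale=0.5, size=(n_in, n_hidden))    # input-to-hidden weights
    W = rng.normal(scale=0.5, size=(n_hidden, n_out))   # hidden-to-output weights
    gamma = np.zeros(n_hidden)                           # hidden thresholds
    theta = np.zeros(n_out)                              # output thresholds
    for _ in range(epochs):
        for x, y in zip(X, Y):
            b = sigmoid(x @ V - gamma)              # hidden activations
            y_hat = sigmoid(b @ W - theta)          # network output
            g = y_hat * (1 - y_hat) * (y - y_hat)   # output-layer gradient terms
            e = b * (1 - b) * (W @ g)               # hidden-layer gradient terms
            W += eta * np.outer(b, g)               # update hidden-to-output weights
            theta -= eta * g                        # update output thresholds
            V += eta * np.outer(x, e)               # update input-to-hidden weights
            gamma -= eta * e                        # update hidden thresholds
    return V, gamma, W, theta

# Example: learn XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
params = backprop_train(X, Y, epochs=5000)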

Numerical solution for optimal investment portfolio

Source: [1]

This function calculates the optimal investment portfolio using the specified parameters and stochastic processes.

function OptimalInvestment($\{\Delta W_n\}$, $x_0$, maxstep) is
    // Step 1: Initialization of $u_0 \approx u(0, x_0)$, $\nabla u_0$ and the sub-network parameters $\theta = \{\theta_n\}$
    for $k \leftarrow 0$ to maxstep do
        $X_{t_0} \leftarrow x_0$, $\hat{u}_{t_0} \leftarrow u_0$    // Parameter initialization
        for $n \leftarrow 0$ to $N - 1$ do
            $X_{t_{n+1}} \leftarrow X_{t_n} + \mu(t_n, X_{t_n})\,\Delta t_n + \sigma(t_n, X_{t_n})\,\Delta W_n$
            $\hat{u}_{t_{n+1}} \leftarrow \hat{u}_{t_n} - f\bigl(t_n, X_{t_n}, \hat{u}_{t_n}, Z_{t_n}\bigr)\,\Delta t_n + Z_{t_n}^{\mathsf T}\,\Delta W_n$, where $Z_{t_n} = (\sigma^{\mathsf T}\nabla u)(t_n, X_{t_n}; \theta_n)$    // Update feedforward neural network unit
        // Step 2: Compute loss function $L(\theta) = \mathbb{E}\bigl[|g(X_{t_N}) - \hat{u}_{t_N}|^2\bigr]$
        // Step 3: Update parameters $(\theta, u_0, \nabla u_0)$ using ADAM optimization
    // Step 4: Return terminal state
    return $\hat{u}_{t_N}$
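For concreteness, the sketch below shows how this training loop might be written in PyTorch. It is a minimal, illustrative implementation under simplifying assumptions (arithmetic Brownian forward paths, a toy linear driver and payoff, one small sub-network per time step; names such as DeepBSDE are our own), not the reference implementation of the cited paper.

import torch
import torch.nn as nn

class DeepBSDE(nn.Module):
    """Learns u(0, x0) and sigma^T grad u at each time step by rolling the
    discretized BSDE forward and matching the terminal condition g(X_T)."""

    def __init__(self, d, N, hidden=110):
        super().__init__()
        self.N = N
        self.u0 = nn.Parameter(torch.rand(1))               # approximation of u(0, x0)
        self.z0 = nn.Parameter(torch.rand(1, d) * 0.1)      # sigma^T grad u at t_0
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, d))
            for _ in range(N - 1)                            # one sub-network per interior time step
        ])

    def forward(self, X, dW, f, dt):
        """X: (batch, N+1, d) forward paths, dW: (batch, N, d) Brownian increments."""
        batch = X.shape[0]
        y = self.u0.expand(batch, 1)                         # u(t_0, X_{t_0})
        z = self.z0.expand(batch, -1)                        # sigma^T grad u at t_0
        for n in range(self.N):
            t = n * dt
            # discretized BSDE step: y_{n+1} = y_n - f dt + z . dW_n
            y = y - f(t, X[:, n], y, z) * dt + (z * dW[:, n]).sum(dim=1, keepdim=True)
            if n < self.N - 1:
                z = self.subnets[n](X[:, n + 1])             # gradient approximation at t_{n+1}
        return y                                             # approximation of u(t_N, X_{t_N})

# --- illustrative usage --------------------------------------------------
d, N, T = 10, 20, 1.0
dt = T / N
model = DeepBSDE(d, N)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

g = lambda x: torch.clamp(x.max(dim=1, keepdim=True).values - 1.0, min=0.0)  # toy payoff
f = lambda t, x, y, z: -0.05 * y                                             # toy linear driver

for step in range(200):
    # simulate fresh forward paths (here: simple arithmetic Brownian motion)
    dW = torch.randn(256, N, d) * dt ** 0.5
    X = torch.ones(256, N + 1, d)
    X[:, 1:] = 1.0 + 0.2 * dW.cumsum(dim=1)
    loss = ((g(X[:, -1]) - model(X, dW, f, dt)) ** 2).mean()  # match terminal condition
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(model.u0))   # trained approximation of u(0, x0)

In practice the forward paths would be generated from the problem's actual drift and diffusion (as in the Euler–Maruyama sketch earlier), and the learned value model.u0 is the quantity of interest, e.g. the price of the derivative or the value of the optimal portfolio at time 0.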

Application

Figure: the dynamically changing loss function during training.

Deep BSDE is widely used in the fields of financial derivatives pricing, risk management, and asset allocation. It is particularly suitable for problems such as high-dimensional option pricing, risk measurement, and dynamic asset allocation, where the dimensionality of the underlying state rules out grid-based methods.

Advantages and disadvantages

Advantages

Sources: [1] [12]

  1. High-dimensional capability: Compared to traditional numerical methods, deep BSDE performs exceptionally well in high-dimensional problems.
  2. Flexibility: The incorporation of deep neural networks allows this method to adapt to various types of BSDEs and financial models.
  3. Parallel computing: Deep learning frameworks support GPU acceleration, significantly improving computational efficiency.

Disadvantages

Sources: [1] [12]

  1. Training time: Training deep neural networks typically requires substantial data and computational resources.
  2. Parameter sensitivity: The choice of neural network architecture and hyperparameters greatly impacts the results, often requiring experience and trial-and-error.

See also

  • Laplace's equation
  • Navier–Stokes equations
  • Maximum likelihood estimation
  • Gaussian process
  • Green's function
  • Hamilton–Jacobi equation
  • Killing vector field
  • Helmholtz equation
  • Ornstein–Uhlenbeck process
  • Covariance matrix adaptation evolution strategy (CMA-ES)
  • Stochastic approximation
  • Vanishing gradient problem
  • Stochastic gradient Langevin dynamics
  • Variational autoencoder
  • Neural tangent kernel
  • Forward problem of electrocardiology
  • Flow-based generative model
  • Innovation method
  • Diffusion model
  • Neural operators

References

  1. Han, J.; Jentzen, A.; E, W. (2018). "Solving high-dimensional partial differential equations using deep learning". Proceedings of the National Academy of Sciences. 115 (34): 8505–8510. doi:10.1073/pnas.1718942115. PMC 6112690. PMID 30082389.
  2. Pardoux, E.; Peng, S. (1990). "Adapted solution of a backward stochastic differential equation". Systems & Control Letters. 14 (1): 55–61. doi:10.1016/0167-6911(90)90082-6.
  3. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
  4. Kloeden, P. E.; Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer, Berlin, Heidelberg. doi:10.1007/978-3-662-12616-5.
  5. Kuznetsov, D. F. (2023). "Strong approximation of iterated Itô and Stratonovich stochastic integrals: Method of generalized multiple Fourier series. Application to numerical integration of Itô SDEs and semilinear SPDEs". Differentsial'nye Uravneniya i Protsessy Upravleniya, no. 1. doi:10.21638/11701/spbu35.2023.110.
  6. Rybakov, K. A. (2023). "Spectral representations of iterated stochastic integrals and their application for modeling nonlinear stochastic dynamics". Mathematics. 11, 4047. doi:10.3390/math11194047.
  7. "Real Options with Monte Carlo Simulation". Archived from the original on 2010-03-18. Retrieved 2010-09-24.
  8. "Monte Carlo Simulation". Palisade Corporation. 2010. Retrieved 2010-09-24.
  9. Grossmann, Christian; Roos, Hans-G.; Stynes, Martin (2007). Numerical Treatment of Partial Differential Equations. Springer Science & Business Media. p. 23. ISBN 978-3-540-71584-9.
  10. Ma, Jin; Yong, Jiongmin (2007). Forward-Backward Stochastic Differential Equations and their Applications. Lecture Notes in Mathematics. Vol. 1702. Springer Berlin, Heidelberg. doi:10.1007/978-3-540-48831-6. ISBN 978-3-540-65960-0.
  11. Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980 [cs.LG].
  12. Beck, C.; E, W.; Jentzen, A. (2019). "Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations". Journal of Nonlinear Science. 29 (4): 1563–1619. doi:10.1007/s00332-018-9525-3.

Further reading