Stein's method

Last updated August 13, 2023

Stein's method is a general method in probability theory for obtaining bounds on the distance between two probability distributions with respect to a probability metric. It was introduced by Charles Stein, who first published it in 1972,^[1] to obtain a bound between the distribution of a sum of an $m$-dependent sequence of random variables and a standard normal distribution in the Kolmogorov (uniform) metric, and hence to prove not only a central limit theorem but also bounds on the rate of convergence in that metric.

History

At the end of the 1960s, unsatisfied with the by-then known proofs of a specific central limit theorem, Charles Stein developed a new way of proving the theorem for his statistics lecture.^[2] His seminal paper was presented in 1970 at the sixth Berkeley Symposium and published in the corresponding proceedings.^[1]

Later, his Ph.D. student Louis Chen Hsiao Yun modified the method so as to obtain approximation results for the Poisson distribution;^[3] therefore the Stein method applied to the problem of Poisson approximation is often referred to as the Stein–Chen method.

Probably the most important contributions are the monograph by Stein (1986), where he presents his view of the method and the concept of auxiliary randomisation, in particular using exchangeable pairs, and the articles by Barbour (1988) and Götze (1991), who introduced the so-called generator interpretation, which made it possible to easily adapt the method to many other probability distributions. An important contribution was also an article by Bolthausen (1984) on the so-called combinatorial central limit theorem.

In the 1990s the method was adapted to a variety of distributions, such as Gaussian processes by Barbour (1990), the binomial distribution by Ehm (1991), Poisson processes by Barbour and Brown (1992), the Gamma distribution by Luk (1994), and many others.

The method gained further popularity in the machine learning community in the mid 2010s, following the development of computable Stein discrepancies and the diverse applications and algorithms based on them.

The basic approach

Probability metrics

Stein's method is a way to bound the distance between two probability distributions using a specific probability metric.

Let the metric be given in the form

(1.1)\quad d(P,Q)=\sup _{h\in {\mathcal {H}}}\left|\int h\,dP-\int h\,dQ\right|=\sup _{h\in {\mathcal {H}}}\left|Eh(W)-Eh(Y)\right|

Here, $P$ and $Q$ are probability measures on a measurable space ${\mathcal {X}}$ , $W$ and $Y$ are random variables with distribution $P$ and $Q$ respectively, $E$ is the usual expectation operator and ${\mathcal {H}}$ is a set of functions from ${\mathcal {X}}$ to the set of real numbers. Set ${\mathcal {H}}$ has to be large enough, so that the above definition indeed yields a metric.

Important examples are the total variation metric, where we let ${\mathcal {H}}$ consist of all the indicator functions of measurable sets, the Kolmogorov (uniform) metric for probability measures on the real numbers, where we consider all the half-line indicator functions, and the Lipschitz (first order Wasserstein; Kantorovich) metric, where the underlying space is itself a metric space and we take the set ${\mathcal {H}}$ to be all Lipschitz-continuous functions with Lipschitz-constant 1. However, note that not every metric can be represented in the form (1.1).
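As a small illustration of a metric of the form (1.1), the following sketch computes the Kolmogorov distance between a standardized binomial distribution and the standard normal distribution. The function names and the choice of the binomial example are assumptions made for illustration; only the Python standard library is used.

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardized_binomial(n: int, p: float):
    """Atoms and probabilities of (S - n p) / sqrt(n p (1 - p)), S ~ Bin(n, p)."""
    sd = math.sqrt(n * p * (1.0 - p))
    atoms = [(k - n * p) / sd for k in range(n + 1)]
    probs = [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    return atoms, probs

def kolmogorov_distance(atoms, probs) -> float:
    """sup_x |F(x) - Phi(x)| for a discrete F with increasing atoms.

    F is constant between atoms and Phi is increasing, so the supremum
    over half-line indicators is attained at a jump (left limit or value).
    """
    dist, cdf = 0.0, 0.0
    for x, pr in zip(atoms, probs):
        phi = normal_cdf(x)
        dist = max(dist, abs(cdf - phi))  # left limit F(x-)
        cdf += pr
        dist = max(dist, abs(cdf - phi))  # value F(x)
    return dist

if __name__ == "__main__":
    for n in (10, 100, 1000):
        atoms, probs = standardized_binomial(n, 0.5)
        print(n, kolmogorov_distance(atoms, probs))
```

Here ${\mathcal {H}}$ is the set of half-line indicators, so each $h$ corresponds to one point $x$ at which the two distribution functions are compared.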

In what follows $P$ is a complicated distribution (e.g., the distribution of a sum of dependent random variables), which we want to approximate by a much simpler and tractable distribution $Q$ (e.g., the standard normal distribution).

The Stein operator

We assume now that the distribution $Q$ is a fixed distribution; in what follows we shall in particular consider the case where $Q$ is the standard normal distribution, which serves as a classical example.

First of all, we need an operator ${\mathcal {A}}$ , which acts on functions $f$ from ${\mathcal {X}}$ to the set of real numbers and 'characterizes' distribution $Q$ in the sense that the following equivalence holds:

(2.1)\quad E(({\mathcal {A}}f)(Y))=0{\text{ for all }}f\quad \iff \quad Y{\text{ has distribution }}Q.

We call such an operator the Stein operator.

For the standard normal distribution, Stein's lemma yields such an operator:

(2.2)\quad E\left(f'(Y)-Yf(Y)\right)=0{\text{ for all }}f\in C_{b}^{1}\quad \iff \quad Y{\text{ has standard normal distribution.}}

Thus, we can take

(2.3)\quad ({\mathcal {A}}f)(x)=f'(x)-xf(x).
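The characterization (2.2) can be checked numerically for a single test function. The sketch below, under the assumption that $f=\sin$ is an acceptable member of $C_{b}^{1}$, evaluates $E[({\mathcal {A}}f)(Y)]$ by quadrature for a standard normal $Y$ and for a uniform variable with the same mean and variance. The helper names are illustrative.

```python
import math

def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def stein_operator(f, df):
    """Return x -> (A f)(x) = f'(x) - x f(x), as in (2.3)."""
    return lambda x: df(x) - x * f(x)

Af = stein_operator(math.sin, math.cos)

# Y ~ N(0, 1): the mass outside [-10, 10] is negligible.
stein_normal = simpson(lambda x: Af(x) * normal_pdf(x), -10.0, 10.0)

# U ~ Uniform(-sqrt(3), sqrt(3)): mean 0 and variance 1, but not normal.
a = math.sqrt(3.0)
stein_uniform = simpson(lambda x: Af(x) / (2.0 * a), -a, a)

if __name__ == "__main__":
    print("normal :", stein_normal)   # close to 0
    print("uniform:", stein_uniform)  # equals cos(sqrt(3)) analytically
```

For the uniform variable a direct calculation gives $E[\cos U]-E[U\sin U]=\cos {\sqrt {3}}\neq 0$, so this single $f$ already distinguishes it from the standard normal distribution.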

There are in general infinitely many such operators, and which one to choose remains an open question. However, for many distributions there seems to be a particularly good one, like (2.3) for the normal distribution.

There are different ways to find Stein operators.^[4]

The Stein equation

$P$ is close to $Q$ with respect to $d$ if the difference of expectations in (1.1) is close to 0. We hope now that the operator ${\mathcal {A}}$ exhibits the same behavior: if $P=Q$ then $E({\mathcal {A}}f)(W)=0$ , and hopefully if $P\approx Q$ we have $E({\mathcal {A}}f)(W)\approx 0$ .

It is usually possible to define a function $f=f_{h}$ such that

(3.1)\quad ({\mathcal {A}}f)(x)=h(x)-E[h(Y)]\qquad {\text{ for all }}x.

We call (3.1) the Stein equation. Replacing $x$ by $W$ and taking expectation with respect to $W$ , we get

(3.2)\quad E({\mathcal {A}}f)(W)=E[h(W)]-E[h(Y)].

Now all the effort is worthwhile only if the left-hand side of (3.2) is easier to bound than the right hand side. This is, surprisingly, often the case.

If $Q$ is the standard normal distribution and we use (2.3), then the corresponding Stein equation is

(3.3)\quad f'(x)-xf(x)=h(x)-E[h(Y)]\qquad {\text{for all }}x.

If the probability distribution $Q$ has an absolutely continuous (with respect to the Lebesgue measure) density $q$, then a Stein operator is given by^[4]

(3.4)\quad ({\mathcal {A}}f)(x)=f'(x)+f(x)q'(x)/q(x).
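The operator (3.4) only needs the ratio $q'/q$, the derivative of $\log q$. The following sketch approximates that ratio by a central difference and checks that $E[({\mathcal {A}}f)(Y)]\approx 0$ when $Y$ has density $q$. The normal density with mean 1 and standard deviation 2, the test function $f=\tanh$, and all function names are assumptions chosen for illustration.

```python
import math

def simpson(g, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def score(log_q, x: float, eps: float = 1e-5) -> float:
    """Central-difference approximation of q'(x)/q(x) = (log q)'(x)."""
    return (log_q(x + eps) - log_q(x - eps)) / (2.0 * eps)

def density_stein_operator(log_q, f, df):
    """Return x -> f'(x) + f(x) q'(x)/q(x), the operator in (3.4)."""
    return lambda x: df(x) + f(x) * score(log_q, x)

MU, SIGMA = 1.0, 2.0  # illustrative parameters, not from the text

def log_q(x: float) -> float:
    """Log-density of N(MU, SIGMA^2)."""
    return -0.5 * ((x - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))

def f(x: float) -> float:
    return math.tanh(x)

def df(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2

Af = density_stein_operator(log_q, f, df)
expectation = simpson(lambda x: Af(x) * math.exp(log_q(x)),
                      MU - 12.0 * SIGMA, MU + 12.0 * SIGMA)

if __name__ == "__main__":
    print("score at x = 3:", score(log_q, 3.0))  # analytically -(3 - MU) / SIGMA**2
    print("E[(A f)(Y)]   :", expectation)
```

For the standard normal density the ratio is $q'(x)/q(x)=-x$, so (3.4) reduces to (2.3).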

Solving the Stein equation

Analytic methods. Equation (3.3) can be easily solved explicitly:

(4.1)\quad f(x)=e^{x^{2}/2}\int _{-\infty }^{x}[h(s)-Eh(Y)]e^{-s^{2}/2}\,ds.
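Formula (4.1) can be evaluated by quadrature and substituted back into (3.3). The sketch below does this; it is a numerical check under stated assumptions, not a reference implementation. For $x>0$ it uses the equivalent form $f(x)=-e^{x^{2}/2}\int _{x}^{\infty }[h(s)-Eh(Y)]e^{-s^{2}/2}\,ds$, which holds because the integrand integrates to zero over the real line. The truncation at $\pm 12$ and the function names are illustrative choices.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def simpson(g, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def normal_mean(h) -> float:
    """E h(Y) for Y ~ N(0, 1), by quadrature on [-12, 12]."""
    return simpson(lambda s: h(s) * math.exp(-0.5 * s * s), -12.0, 12.0) / SQRT_2PI

def stein_solution(h):
    """Return (f, E h(Y)) with f given by (4.1)."""
    mean_h = normal_mean(h)

    def integrand(s: float) -> float:
        return (h(s) - mean_h) * math.exp(-0.5 * s * s)

    def f(x: float) -> float:
        if x <= 0.0:
            return math.exp(0.5 * x * x) * simpson(integrand, -12.0, x)
        # Upper-tail form: avoids multiplying a near-cancelling integral by e^{x^2/2}.
        return -math.exp(0.5 * x * x) * simpson(integrand, x, 12.0)

    return f, mean_h

def residual(h, x: float, eps: float = 1e-4) -> float:
    """f'(x) - x f(x) - (h(x) - E h(Y)), with f' from a central difference."""
    f, mean_h = stein_solution(h)
    df = (f(x + eps) - f(x - eps)) / (2.0 * eps)
    return df - x * f(x) - (h(x) - mean_h)

if __name__ == "__main__":
    for x in (-1.5, 0.5, 2.0):
        print(x, residual(math.cos, x))
```

A convenient sanity check is $h(x)=x$, for which (4.1) gives the constant solution $f\equiv -1$.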

Generator method. If ${\mathcal {A}}$ is the generator of a Markov process $(Z_{t})_{t\geq 0}$ (see Barbour (1988), Götze (1991)), then the solution to (3.1) is

(4.2)\quad f(x)=-\int _{0}^{\infty }[E^{x}h(Z_{t})-Eh(Y)]\,dt,

where $E^{x}$ denotes expectation with respect to the process $Z$ being started in $x$ . However, one still has to prove that the solution (4.2) exists for all desired functions $h\in {\mathcal {H}}$ .
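For the standard normal distribution, formula (4.2) can be illustrated with the Ornstein–Uhlenbeck process. The sketch below assumes the generator $({\mathcal {A}}g)(x)=g''(x)-xg'(x)$, which is related to (2.3) through $f=g'$; this particular choice, the truncation of the time integral, and the function names are assumptions, not something fixed by the text above. Started in $x$, the process satisfies $Z_{t}=xe^{-t}+{\sqrt {1-e^{-2t}}}\,N$ in distribution, with $N$ standard normal.

```python
import math

def simpson(g, a: float, b: float, n: int) -> float:
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def gauss_mean(h, m: float, s: float, n: int = 200) -> float:
    """E h(m + s N) for N ~ N(0, 1), by Simpson quadrature on [-8, 8]."""
    g = lambda z: h(m + s * z) * math.exp(-0.5 * z * z)
    return simpson(g, -8.0, 8.0, n) / math.sqrt(2.0 * math.pi)

def generator_solution(h, x: float, horizon: float = 15.0, n_t: int = 600) -> float:
    """g(x) = -int_0^infty [E^x h(Z_t) - E h(Y)] dt, truncated at `horizon`."""
    mean_h = gauss_mean(h, 0.0, 1.0)

    def integrand(t: float) -> float:
        m = x * math.exp(-t)                       # conditional mean of Z_t
        s = math.sqrt(1.0 - math.exp(-2.0 * t))    # conditional std of Z_t
        return gauss_mean(h, m, s) - mean_h

    return -simpson(integrand, 0.0, horizon, n_t)

if __name__ == "__main__":
    # For h(x) = x^2 the integral can be done by hand: g(x) = -(x^2 - 1) / 2.
    for x in (0.0, 1.0, 2.0):
        print(x, generator_solution(lambda z: z * z, x), -(x * x - 1.0) / 2.0)
```

With $h(x)=x^{2}$ one gets $g(x)=-(x^{2}-1)/2$, hence $f=g'=-x$, and indeed $f'(x)-xf(x)=x^{2}-1=h(x)-E[h(Y)]$.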

Properties of the solution to the Stein equation

Usually, one tries to give bounds on $f$ and its derivatives (or differences) in terms of $h$ and its derivatives (or differences), that is, inequalities of the form

(5.1)\quad \|D^{k}f\|\leq C_{k,l}\|D^{l}h\|,

for some specific $k,l=0,1,2,\dots$ (typically $k\geq l$ or $k\geq l-1$ , respectively, depending on the form of the Stein operator), where often $\|\cdot \|$ is the supremum norm. Here, $D^{k}$ denotes the differential operator, but in discrete settings it usually refers to a difference operator. The constants $C_{k,l}$ may contain the parameters of the distribution $Q$ . If there are any, they are often referred to as Stein factors.

In the case of (4.1) one can prove for the supremum norm that

(5.2)\quad \|f\|_{\infty }\leq \min \left\{{\sqrt {\pi /2}}\|h\|_{\infty },2\|h'\|_{\infty }\right\},\quad \|f'\|_{\infty }\leq \min\{2\|h\|_{\infty },4\|h'\|_{\infty }\},\quad \|f''\|_{\infty }\leq 2\|h'\|_{\infty },

where the last bound is of course only applicable if $h$ is differentiable (or at least Lipschitz-continuous, which, for example, is not the case if we regard the total variation metric or the Kolmogorov metric!). As the standard normal distribution has no extra parameters, in this specific case the constants are free of additional parameters.
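The bounds in (5.2) that involve $\|h'\|_{\infty }$ can be checked numerically for a concrete Lipschitz function. The sketch below takes $h=\sin$ (Lipschitz constant 1 and $E[h(Y)]=0$ by symmetry), evaluates the solution (4.1) on a grid, and obtains $f'$ and $f''$ from the Stein equation (3.3) and its derivative. The grid, the truncation at $\pm 12$ and the helper names are illustrative assumptions; a grid maximum only gives a lower estimate of each supremum.

```python
import math

def simpson(g, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def stein_solution(h, mean_h: float):
    """f from (4.1); for x > 0 the equivalent upper-tail integral is used."""
    w = lambda s: (h(s) - mean_h) * math.exp(-0.5 * s * s)

    def f(x: float) -> float:
        if x <= 0.0:
            return math.exp(0.5 * x * x) * simpson(w, -12.0, x)
        return -math.exp(0.5 * x * x) * simpson(w, x, 12.0)

    return f

h, dh = math.sin, math.cos
mean_h = 0.0  # E sin(Y) = 0 because sin is odd
f = stein_solution(h, mean_h)

sup_f = sup_df = sup_d2f = 0.0
for i in range(-16, 17):
    x = 0.25 * i
    fx = f(x)
    dfx = x * fx + h(x) - mean_h   # f' read off from the Stein equation (3.3)
    d2fx = fx + x * dfx + dh(x)    # f'' obtained by differentiating (3.3)
    sup_f = max(sup_f, abs(fx))
    sup_df = max(sup_df, abs(dfx))
    sup_d2f = max(sup_d2f, abs(d2fx))

if __name__ == "__main__":
    print("max |f|   on grid:", sup_f, "(bound 2)")
    print("max |f'|  on grid:", sup_df, "(bound 4)")
    print("max |f''| on grid:", sup_d2f, "(bound 2)")
```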

If we have bounds in the general form (5.1), we usually are able to treat many probability metrics together. One can often start with the next step below, if bounds of the form (5.1) are already available (which is the case for many distributions).

An abstract approximation theorem

We are now in a position to bound the left-hand side of (3.2). As this step heavily depends on the form of the Stein operator, we directly regard the case of the standard normal distribution.

At this point we could directly plug in random variable $W$ , which we want to approximate, and try to find upper bounds. However, it is often fruitful to formulate a more general theorem. Consider here the case of local dependence.

Assume that $W=\sum _{i=1}^{n}X_{i}$ is a sum of random variables such that the mean is $E[W]=0$ and the variance is $\operatorname {var} [W]=1$ . Assume that, for every $i=1,\dots ,n$ , there is a set $A_{i}\subset \{1,2,\dots ,n\}$ , such that $X_{i}$ is independent of all the random variables $X_{j}$ with $j\not \in A_{i}$ . We call this set the 'neighborhood' of $X_{i}$ . Likewise let $B_{i}\subset \{1,2,\dots ,n\}$ be a set such that all $X_{j}$ with $j\in A_{i}$ are independent of all $X_{k}$ , $k\not \in B_{i}$ . We can think of $B_{i}$ as the neighbors in the neighborhood of $X_{i}$ , a second-order neighborhood, so to speak. For a set $A\subset \{1,2,\dots ,n\}$ define now the sum $X_{A}:=\sum _{j\in A}X_{j}$ .

Using Taylor expansion, it is possible to prove that

(6.1)\quad \left|E(f'(W)-Wf(W))\right|\leq \|f''\|_{\infty }\sum _{i=1}^{n}\left({\frac {1}{2}}E|X_{i}X_{A_{i}}^{2}|+E|X_{i}X_{A_{i}}X_{B_{i}\setminus A_{i}}|+E|X_{i}X_{A_{i}}|E|X_{B_{i}}|\right)

Note that, if we follow this line of argument, we can bound (1.1) only for functions where $\|h'\|_{\infty }$ is bounded because of the third inequality of (5.2) (and in fact, if $h$ has discontinuities, so will $f''$ ). To obtain a bound similar to (6.1) which contains only the expressions $\|f\|_{\infty }$ and $\|f'\|_{\infty }$ , the argument is much more involved and the result is not as simple as (6.1); however, it can be done.

Theorem A. If $W$ is as described above, we have for the Lipschitz metric $d_{W}$ that

(6.2)\quad d_{W}({\mathcal {L}}(W),N(0,1))\leq 2\sum _{i=1}^{n}\left({\frac {1}{2}}E|X_{i}X_{A_{i}}^{2}|+E|X_{i}X_{A_{i}}X_{B_{i}\setminus A_{i}}|+E|X_{i}X_{A_{i}}|E|X_{B_{i}}|\right).

Proof. Recall that the Lipschitz metric is of the form (1.1) where the functions $h$ are Lipschitz-continuous with Lipschitz-constant 1, thus $\|h'\|\leq 1$ . Combining this with (6.1) and the last bound in (5.2) proves the theorem.

Thus, roughly speaking, we have proved that, to calculate the Lipschitz-distance between a $W$ with local dependence structure and a standard normal distribution, we only need to know the third moments of $X_{i}$ and the size of the neighborhoods $A_{i}$ and $B_{i}$ .

Application of the theorem

We can treat the case of sums of independent and identically distributed random variables with Theorem A.

Assume that $EX_{i}=0$ , $\operatorname {var} X_{i}=1$ and $W=n^{-1/2}\sum X_{i}$ . Theorem A applies to the summands $n^{-1/2}X_{i}$ , and we can take $A_{i}=B_{i}=\{i\}$ . From Theorem A we obtain that

(7.1)\quad d_{W}({\mathcal {L}}(W),N(0,1))\leq {\frac {5E|X_{1}|^{3}}{n^{1/2}}}.
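To see how such a bound compares with the actual distance, the sketch below treats Rademacher summands ($X_{i}=\pm 1$ with probability $1/2$ each), an example chosen here for illustration. With $A_{i}=B_{i}=\{i\}$ and summands $n^{-1/2}X_{i}$, the right-hand side of (6.2) evaluates to $(E|X_{1}|^{3}+2E|X_{1}|)/n^{1/2}$, which does not exceed (7.1) because $E|X_{1}|\leq 1\leq E|X_{1}|^{3}$. The Lipschitz distance on the real line is computed as $\int |F_{W}(x)-\Phi (x)|\,dx$ with a midpoint rule between atoms; function names and quadrature parameters are assumptions.

```python
import math

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def wasserstein_to_normal(n: int, m: int = 200) -> float:
    """d_W between W = n^{-1/2} * (sum of n Rademacher signs) and N(0, 1)."""
    sd = math.sqrt(n)
    atoms = [(2 * k - n) / sd for k in range(n + 1)]
    probs = [math.comb(n, k) / 2**n for k in range(n + 1)]
    # Integrate over [first atom - 6, last atom + 6]; the normal tails beyond are negligible.
    edges = [atoms[0] - 6.0] + atoms + [atoms[-1] + 6.0]
    total, cdf = 0.0, 0.0
    for j in range(len(edges) - 1):
        if j >= 1:
            cdf += probs[j - 1]  # F_W jumps at edges[j] = atoms[j - 1]
        a, b = edges[j], edges[j + 1]
        step = (b - a) / m
        total += step * sum(abs(cdf - normal_cdf(a + (i + 0.5) * step)) for i in range(m))
    return total

def theorem_a_iid_bound(n: int, abs_m1: float, abs_m3: float) -> float:
    """Right-hand side of (6.2) for i.i.d. X_i with mean 0, variance 1.

    abs_m1 = E|X_1| and abs_m3 = E|X_1|^3; the summands are X_i / sqrt(n).
    """
    first = 0.5 * abs_m3 / n**1.5           # (1/2) E|X_i X_{A_i}^2|
    second = 0.0                            # B_i \ A_i is empty
    third = (1.0 / n) * (abs_m1 / math.sqrt(n))  # E|X_i X_{A_i}| E|X_{B_i}|
    return 2.0 * n * (first + second + third)

if __name__ == "__main__":
    for n in (1, 4, 25, 100):
        print(n, wasserstein_to_normal(n),
              theorem_a_iid_bound(n, 1.0, 1.0), 5.0 / math.sqrt(n))
```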

For sums of random variables, another approach related to Stein's method is known as the zero bias transform.

Connections to other methods

Lindeberg's device. Lindeberg (1922) introduced a device, where the difference $Eh(X_{1}+\cdots +X_{n})-Eh(Y_{1}+\cdots +Y_{n})$ is represented as a sum of step-by-step differences.

Tikhomirov's method. Clearly the approach via (1.1) and (3.1) does not involve characteristic functions. However, Tikhomirov (1980) presented a proof of a central limit theorem based on characteristic functions and a differential operator similar to (2.3). The basic observation is that the characteristic function $\psi (t)$ of the standard normal distribution satisfies the differential equation $\psi '(t)+t\psi (t)=0$ for all $t$ . Thus, if the characteristic function $\psi _{W}(t)$ of $W$ is such that $\psi '_{W}(t)+t\psi _{W}(t)\approx 0$ we expect that $\psi _{W}(t)\approx \psi (t)$ and hence that $W$ is close to the normal distribution. Tikhomirov states in his paper that he was inspired by Stein's seminal paper.
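The basic observation can be illustrated with a case where the characteristic function is known in closed form. The sketch below assumes $W=n^{-1/2}\sum X_{i}$ with Rademacher summands, so that $\psi _{W}(t)=\cos(t/{\sqrt {n}})^{n}$, and evaluates the residual $\psi '_{W}(t)+t\psi _{W}(t)$. The example and the function names are illustrative and are not taken from Tikhomirov's paper.

```python
import math

def cf_residual_rademacher(n: int, t: float) -> float:
    """psi_W'(t) + t psi_W(t) for psi_W(t) = cos(t / sqrt(n)) ** n."""
    u = t / math.sqrt(n)
    psi = math.cos(u) ** n
    dpsi = -math.sqrt(n) * math.cos(u) ** (n - 1) * math.sin(u)
    return dpsi + t * psi

def cf_residual_normal(t: float, eps: float = 1e-6) -> float:
    """psi'(t) + t psi(t) for psi(t) = exp(-t^2 / 2), derivative by central difference."""
    psi = lambda s: math.exp(-0.5 * s * s)
    dpsi = (psi(t + eps) - psi(t - eps)) / (2.0 * eps)
    return dpsi + t * psi(t)

if __name__ == "__main__":
    print("normal:", cf_residual_normal(1.0))
    for n in (10, 100, 1000):
        print(n, cf_residual_rademacher(n, 1.0))
```

A Taylor expansion shows that for fixed $t$ the residual in this example is of order $t^{3}/n$, so it vanishes as $n$ grows.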

Notes

[1] Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables". Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2. Vol. 6. University of California Press. pp. 583–602. MR 0402873. Zbl 0278.60026.
[2] Charles Stein: The Invariant, the Direct and the "Pretentious". Archived 2007-07-05 at the Wayback Machine. Interview given in 2003 in Singapore.
[3] Chen, L.H.Y. (1975). "Poisson approximation for dependent trials". Annals of Probability. 3 (3): 534–545. doi:10.1214/aop/1176996359. JSTOR 2959474. MR 0428387. Zbl 0335.60016.
[4] Novak, S.Y. (2011). Extreme Value Methods with Applications to Finance. Monographs on Statistics and Applied Probability. Vol. 122. CRC Press. Ch. 12. ISBN 978-1-43983-574-6.

References

Barbour, A. D. (1988). "Stein's method and Poisson process convergence". Journal of Applied Probability. 25: 175–184. doi:10.2307/3214155. JSTOR 3214155. S2CID 121759039.
Barbour, A. D. (1990). "Stein's method for diffusion approximations". Probability Theory and Related Fields. 84 (3): 297–322. doi:10.1007/BF01197887. S2CID 123057547.
Barbour, A. D. & Brown, T. C. (1992). "Stein's method and point process approximation". Stochastic Processes and Their Applications. 43 (1): 9–31. doi:10.1016/0304-4149(92)90073-Y.
Bolthausen, E. (1984). "An estimate of the remainder in a combinatorial central limit theorem". Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete. 66 (3): 379–386. doi:10.1007/BF00533704. S2CID 121725342.
Ehm, W. (1991). "Binomial approximation to the Poisson binomial distribution". Statistics & Probability Letters. 11 (1): 7–16. doi:10.1016/0167-7152(91)90170-V.
Götze, F. (1991). "On the rate of convergence in the multivariate CLT". The Annals of Probability. 19 (2): 724–739. doi:10.1214/aop/1176990448.
Lindeberg, J. W. (1922). "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung". Mathematische Zeitschrift. 15 (1): 211–225. doi:10.1007/BF01494395. S2CID 119730242.
Luk, H. M. (1994). Stein's method for the gamma distribution and related statistical applications. Dissertation.
Novak, S. Y. (2011). Extreme value methods with applications to finance. Monographs on Statistics and Applied Probability. Vol. 122. CRC Press. ISBN 978-1-43983-574-6.
Stein, C. (1986). Approximate computation of expectations. Lecture Notes-Monograph Series. Vol. 7. Institute of Mathematical Statistics. ISBN 0-940600-08-0.
Tikhomirov, A. N. (1980). "Convergence rate in the central limit theorem for weakly dependent random variables". Teoriya Veroyatnostei i ee Primeneniya. 25: 800–818. English translation in Tikhomirov, A. N. (1981). "On the Convergence Rate in the Central Limit Theorem for Weakly Dependent Random Variables". Theory of Probability & Its Applications. 25 (4): 790–809. doi:10.1137/1125092.

Literature

The following text is advanced, and gives a comprehensive overview of the normal case:

Chen, L.H.Y., Goldstein, L., and Shao, Q.M. (2011). Normal approximation by Stein's method. www.springer.com. ISBN 978-3-642-15006-7.

Another advanced book, but having some introductory character, is

Barbour, A.D. and Chen, L.H.Y., eds. (2005). An introduction to Stein's method. Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore. Vol. 4. Singapore University Press. ISBN 981-256-280-X.

A standard reference is the book by Stein,

Stein, C. (1986). Approximate computation of expectations. Institute of Mathematical Statistics Lecture Notes, Monograph Series, 7. Hayward, Calif.: Institute of Mathematical Statistics. ISBN 0-940600-08-0.

which contains a lot of interesting material, but may be a little hard to understand at first reading.

Despite the method's age, few standard introductory books about Stein's method are available. The following recent textbook has a chapter (Chapter 2) devoted to introducing Stein's method:

Ross, Sheldon & Peköz, Erol (2007). A second course in probability. ISBN 978-0-9795704-0-7.

Although the book

Barbour, A. D. and Holst, L. and Janson, S. (1992). Poisson approximation. Oxford Studies in Probability. Vol. 2. The Clarendon Press Oxford University Press. ISBN 0-19-852235-5.

is largely about Poisson approximation, it nevertheless contains a lot of information about the generator approach, in particular in the context of Poisson process approximation.

The following textbook has a chapter (Chapter 10) devoted to introducing Stein's method of Poisson approximation:

Sheldon M. Ross (1995). Stochastic Processes. Wiley. ISBN 978-0471120629.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.