Computational statistics

Students working in the Statistics Machine Room of the London School of Economics in 1964

Computational statistics, or statistical computing, is the intersection of statistics and computer science, and refers to the statistical methods that are enabled by using computational methods. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics. The area is developing rapidly, leading to calls for a broader concept of computing to be taught as part of general statistical education. [1]


As in traditional statistics, the goal is to transform raw data into knowledge, [2] but the focus lies on computer-intensive statistical methods, such as those required for very large sample sizes and non-homogeneous data sets. [2]

The terms 'computational statistics' and 'statistical computing' are often used interchangeably, although Carlo Lauro (a former president of the International Association for Statistical Computing) proposed making a distinction, defining 'statistical computing' as "the application of computer science to statistics", and 'computational statistics' as "aiming at the design of algorithm for implementing statistical methods on computers, including the ones unthinkable before the computer age (e.g. bootstrap, simulation), as well as to cope with analytically intractable problems" [ sic ]. [3]

The term 'computational statistics' may also be used to refer to computationally intensive statistical methods, including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation, artificial neural networks and generalized additive models.

History

Though computational statistics is widely used today, it has a relatively short history of acceptance in the statistics community. For the most part, the founders of the field of statistics relied on mathematics and asymptotic approximations in the development of computational statistical methodology. [4]

In the statistical field, the first use of the term "computer" comes in an article in the Journal of the American Statistical Association archives by Robert P. Porter in 1891. The article discusses the use of Herman Hollerith's machine in the 11th Census of the United States.[ citation needed ] Hollerith's machine, also called a tabulating machine, was an electromechanical machine designed to assist in summarizing information stored on punched cards. It was invented by Herman Hollerith (February 29, 1860 – November 17, 1929), an American businessman, inventor, and statistician. His punched card tabulating machine was patented in 1884 and was later used in the 1890 Census of the United States. The advantages of the technology were immediately apparent: the 1880 Census, covering about 50 million people, took over seven years to tabulate, whereas the 1890 Census, covering over 62 million people, took less than a year. This marks the beginning of the era of mechanized computational statistics and semiautomatic data processing systems.

In 1908, William Sealy Gosset performed his now well-known Monte Carlo simulation, which led to the discovery of the Student's t-distribution. [5] With the help of computational methods, he also produced plots of empirical distributions overlaid on the corresponding theoretical distributions. The computer has revolutionized simulation and has made the replication of Gosset's experiment little more than an exercise. [6] [7]

Later, researchers put forward computational methods for generating pseudo-random deviates, methods for converting uniform deviates into other distributional forms using the inverse cumulative distribution function or acceptance-rejection methods, and state-space methodology for Markov chain Monte Carlo. [8] One of the first efforts to generate random digits in a fully automated way was undertaken by the RAND Corporation in 1947. The tables produced were published as a book in 1955, and also as a series of punch cards.
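
The conversion step described above can be illustrated with a short sketch. The Python snippet below (the function names, the exponential target, and the linear-density target are illustrative assumptions, not taken from the source) shows inverse-CDF sampling and a simple acceptance-rejection step, both starting from uniform pseudo-random deviates.

```python
import numpy as np

rng = np.random.default_rng(0)  # pseudo-random uniform deviate generator

# Inverse-CDF (inverse transform) sampling: if U ~ Uniform(0, 1), then
# F^{-1}(U) has distribution F.  Here F is the Exponential(rate) CDF,
# whose inverse is -log(1 - u) / rate.
def exponential_via_inverse_cdf(n, rate=1.0):
    u = rng.uniform(size=n)
    return -np.log1p(-u) / rate

# Acceptance-rejection sampling: draw candidates from an easy proposal
# (here Uniform(0, 1)) and accept each with probability f(x) / (c * g(x)),
# where c bounds f/g.  Target density: f(x) = 2x on [0, 1]; proposal g = 1, c = 2.
def rejection_sample_linear_density(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform()
        if rng.uniform() < (2.0 * x) / 2.0:  # accept with probability f(x)/(c*g(x)) = x
            samples.append(x)
    return np.array(samples)

draws = exponential_via_inverse_cdf(10_000, rate=2.0)
print(draws.mean())  # should be close to 1/rate = 0.5
```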

By the mid-1950s, several articles and patents for random number generation devices had been proposed. [9] The development of these devices was motivated by the need for random digits to perform simulations and other fundamental components of statistical analysis. One of the most well known of these devices is ERNIE, which produces the random numbers that determine the winners of Premium Bonds, a lottery bond issued in the United Kingdom. In 1958, John Tukey's jackknife was developed as a method to reduce the bias of parameter estimates in samples under nonstandard conditions, [10] and it requires computers for practical implementation. By this point, computers had made many tedious statistical studies feasible. [11]
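
The jackknife mentioned above can be sketched in a few lines of Python; the following snippet (the function names and the variance example are illustrative assumptions) computes leave-one-out estimates and the jackknife bias-corrected estimate for an arbitrary statistic.

```python
import numpy as np

def jackknife_bias_corrected(data, statistic):
    """Jackknife estimate of bias and the bias-corrected statistic."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    # Leave-one-out estimates: recompute the statistic n times, each time
    # omitting a single observation.
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - theta_hat)
    return theta_hat - bias, bias

# Example: the plug-in (biased) variance estimator; the jackknife removes
# its O(1/n) bias, recovering the unbiased sample variance.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
corrected, bias = jackknife_bias_corrected(x, lambda a: a.var())
print(corrected, bias)
```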

Methods

Maximum likelihood estimation

Maximum likelihood estimation is used to estimate the parameters of an assumed probability distribution, given some observed data. It is achieved by maximizing a likelihood function so that the observed data is most probable under the assumed statistical model.
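
As a concrete illustration (an assumed example, not a method prescribed by the source), the sketch below fits the rate of an exponential model by numerically maximizing the log-likelihood; in this simple case the result can be checked against the closed-form MLE, which is the reciprocal of the sample mean.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # true rate = 1/scale = 0.5

def negative_log_likelihood(log_rate):
    # Optimize on the log scale so the rate stays positive.
    rate = np.exp(log_rate[0])
    # Exponential log-likelihood: n*log(rate) - rate * sum(x)
    return -(len(data) * np.log(rate) - rate * data.sum())

result = minimize(negative_log_likelihood, x0=[0.0], method="Nelder-Mead")
mle_rate = np.exp(result.x[0])
print(mle_rate, 1.0 / data.mean())  # numerical MLE vs. closed-form MLE
```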

Monte Carlo method

Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
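
For example, a basic Monte Carlo estimate of a one-dimensional integral (an illustrative sketch; the integrand is an assumption) averages the integrand over uniform random draws, with a standard error that shrinks like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate the integral of exp(-x**2) over [0, 1] by averaging the
# integrand at uniformly sampled points.
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)
values = np.exp(-x**2)
estimate = values.mean()
std_error = values.std(ddof=1) / np.sqrt(n)
print(estimate, std_error)  # true value is about 0.7468
```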

Markov chain Monte Carlo

The Markov chain Monte Carlo method creates samples from a continuous random variable whose probability density is proportional to a known function. These samples can be used to evaluate an integral over that variable, for example its expected value or variance. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
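
A minimal random-walk Metropolis sampler (an illustrative sketch; the target density, step size, and function names are assumptions) shows the idea: propose a local move, then accept it with a probability that depends only on the ratio of the unnormalized densities, so the unknown normalizing constant cancels.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_density(x):
    # Target known only up to a constant: a standard normal shape.
    return np.exp(-0.5 * x**2)

def random_walk_metropolis(n_steps, step_size=1.0, x0=0.0):
    samples = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        proposal = x + step_size * rng.normal()
        # Accept with probability min(1, f(proposal) / f(current)).
        if rng.uniform() < unnormalized_density(proposal) / unnormalized_density(x):
            x = proposal
        samples[i] = x
    return samples

chain = random_walk_metropolis(50_000)
print(chain.mean(), chain.var())  # should approach 0 and 1 for this target
```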

Applications

Computational statistics journals

Associations

See also


References

  1. Nolan, D. & Temple Lang, D. (2010). "Computing in the Statistics Curricula", The American Statistician 64 (2), pp.97-107.
  2. Wegman, Edward J. (1988). "Computational Statistics: A New Agenda for Statistical Theory and Practice". Journal of the Washington Academy of Sciences, vol. 78, no. 4, pp. 310–322. JSTOR
  3. Lauro, Carlo (1996), "Computational statistics or statistical computing, is that the question?", Computational Statistics & Data Analysis, 23 (1): 191–193, doi:10.1016/0167-9473(96)88920-1
  4. Watnik, Mitchell (2011). "Early Computational Statistics". Journal of Computational and Graphical Statistics. 20 (4): 811–817. doi:10.1198/jcgs.2011.204b. ISSN   1061-8600. S2CID   120111510.
  5. "Student" [William Sealy Gosset] (1908). "The probable error of a mean" (PDF). Biometrika . 6 (1): 1–25. doi:10.1093/biomet/6.1.1. hdl:10338.dmlcz/143545. JSTOR   2331554.
  6. Trahan, Travis John (2019-10-03). "Recent Advances in Monte Carlo Methods at Los Alamos National Laboratory". doi:10.2172/1569710. OSTI 1569710.
  7. Metropolis, Nicholas; Ulam, S. (1949). "The Monte Carlo Method". Journal of the American Statistical Association. 44 (247): 335–341. doi:10.1080/01621459.1949.10483310. ISSN   0162-1459. PMID   18139350.
  8. Robert, Christian; Casella, George (2011-02-01). "A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data". Statistical Science. 26 (1). doi: 10.1214/10-sts351 . ISSN   0883-4237. S2CID   2806098.
  9. Pierre L'Ecuyer (2017). "History of uniform random number generation" (PDF). 2017 Winter Simulation Conference (WSC). pp. 202–230. doi:10.1109/WSC.2017.8247790. ISBN   978-1-5386-3428-8. S2CID   4567651.
  10. Quenouille, M. H. (1956). "Notes on Bias in Estimation". Biometrika. 43 (3–4): 353–360. doi:10.1093/biomet/43.3-4.353. ISSN 0006-3444.
  11. Teichroew, Daniel (1965). "A History of Distribution Sampling Prior to the Era of the Computer and its Relevance to Simulation". Journal of the American Statistical Association. 60 (309): 27–49. doi:10.1080/01621459.1965.10480773. ISSN   0162-1459.

Further reading

Articles

Books

Associations

Journals