Square root biased sampling

Square root biased sampling is a sampling method proposed by William H. Press, a computer scientist and computational biologist, for use in airport screenings. It is the mathematically optimal compromise between simple random sampling and strong profiling: given fixed screening resources, it minimizes the expected number of screenings needed to find a rare malfeasor.[1][2]

Using this method, if a group is P times as likely as the average to be a security risk, then persons from that group will be √P times as likely to undergo additional screening.[1] For example, if someone from a profiled group is nine times more likely than the average person to be a security risk, then when using square root biased sampling, people from the profiled group would be screened three times more often than the average person.
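As a rough illustration (not taken from Press's paper; the group labels and risk ratios below are hypothetical), the following Python sketch converts assumed relative risk factors into square-root biased screening probabilities:

```python
import math

def sqrt_biased_probabilities(risk_ratios):
    """Map each entry's assumed relative risk to a screening probability
    proportional to the square root of that risk (normalized to sum to 1).
    Each entry is treated as a single representative individual."""
    roots = {group: math.sqrt(r) for group, r in risk_ratios.items()}
    total = sum(roots.values())
    return {group: w / total for group, w in roots.items()}

# Hypothetical example: one person judged nine times as likely as the
# baseline person to be a security risk. The square-root weight is 3x the
# baseline weight, so that person is screened 3 times as often.
risk = {"baseline": 1.0, "profiled": 9.0}
print(sqrt_biased_probabilities(risk))
# {'baseline': 0.25, 'profiled': 0.75}
```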

History

Press developed square root biased sampling as a way to sample long sequences of DNA.[3] It had also been developed independently by Ruben Abagyan, a professor at the Scripps Research Institute (TSRI) in La Jolla, California, for use in a different biological context.[4][5] An even earlier discovery was by Martin L. Shooman, who used square root biased sampling in a test apportionment model for software reliability.[6]

Press's later proposal to use square root biased sampling for airport security was published in 2009.[1] There, he argued that the method would make more efficient use of the limited resources available for screening than current practice, which can lead to the same persons being screened again and again.[2][3] However, the method presupposes that those doing the screening have accurate statistical information on who is more likely to be a security risk, which is not necessarily the case.[7]
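To make the efficiency argument concrete, here is a small sketch under the simple with-replacement screening model analyzed in Press's paper: if each secondary screening selects individual i with probability q_i, and p_i is the prior probability that i is the malfeasor, the expected number of screenings until the malfeasor is caught is the sum of p_i/q_i. The population size and risk values below are invented for illustration.

```python
import math

# Hypothetical priors: 100 individuals, most low-risk, a few judged
# nine times as risky. p[i] is the normalized probability that
# individual i is the malfeasor.
raw_risk = [1.0] * 90 + [9.0] * 10
total_risk = sum(raw_risk)
p = [r / total_risk for r in raw_risk]

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def expected_screenings(p, q):
    """Expected number of with-replacement screenings until the malfeasor
    is selected, when person i is screened with probability q[i] each round."""
    return sum(pi / qi for pi, qi in zip(p, q))

policies = {
    "uniform sampling (q constant)": normalize([1.0] * len(p)),
    "strong profiling (q ~ p)": normalize(p),
    "square-root biased (q ~ sqrt(p))": normalize([math.sqrt(pi) for pi in p]),
}

for name, q in policies.items():
    print(f"{name}: {expected_screenings(p, q):.1f} expected screenings")
# uniform sampling (q constant): 100.0 expected screenings
# strong profiling (q ~ p): 100.0 expected screenings
# square-root biased (q ~ sqrt(p)): 80.0 expected screenings
```

In this toy setting, strong profiling performs no better than uniform sampling, while the square-root rule strictly reduces the expected number of screenings, which is the point of Press's title.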


References

  1. Press, William H. (February 10, 2009). "Strong profiling is not mathematically optimal for discovering rare malfeasors". Proceedings of the National Academy of Sciences. 106 (6): 1716–1719. Bibcode:2009PNAS..106.1716P. doi:10.1073/pnas.0813202106. PMC 2634801. PMID 19188610.
  2. "Square root bias and airport security screening". Homeland Security Newswire. 2009-02-03. Retrieved 2009-11-28.
  3. "Researcher Proposes Statistical Method to Enhance Secondary Security Screenings". University of Texas at Austin News. 2009-02-03. Retrieved 2009-11-28.
  4. Abagyan, R.; Totrov, M. (1999). "Ab initio folding of peptides by the optimal-bias Monte Carlo minimization procedure". Journal of Computational Physics. 151: 402–421.
  5. Zhou, Y.; Abagyan, R. (2002). "Efficient stochastic global optimization for protein structure prediction". In Thorpe, M. F.; Duxbury, P. M. (eds.), Rigidity Theory and Applications. New York: Springer.
  6. Shooman, M. L. (1991). "A micro software reliability model for prediction and test apportionment". Proceedings 1991 International Symposium on Software Reliability Engineering. IEEE. pp. 52–59.
  7. Press, William (December 2010). "To catch a terrorist: can ethnic profiling work?". Significance. p. 164.

Derivation: https://www.researchgate.net/publication/309809428_An_optimal_sampling_application_of_Cauchy's_inequality
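The linked derivation rests on the standard Cauchy–Schwarz argument; a sketch (notation chosen here, not copied from the source) is:

```latex
% Each secondary screening selects individual $i$ with probability $q_i$
% (with replacement); $p_i$ is the prior probability that $i$ is the malfeasor.
% The expected number of screenings until the malfeasor is selected is
% $\sum_i p_i / q_i$, which we minimize subject to $\sum_i q_i = 1$.
\begin{align}
  \sum_i \frac{p_i}{q_i}
    = \Big(\sum_i \frac{p_i}{q_i}\Big)\Big(\sum_i q_i\Big)
    \ge \Big(\sum_i \sqrt{p_i}\Big)^{2}
    && \text{(Cauchy--Schwarz)} \\
  \text{with equality iff } q_i \propto \sqrt{p_i},
    \quad\text{i.e.}\quad q_i = \frac{\sqrt{p_i}}{\sum_j \sqrt{p_j}}.
\end{align}
```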