Square root biased sampling

Square root biased sampling is a sampling method proposed by William H. Press, a computer scientist and computational biologist, for use in airport screenings. It is the mathematically optimal compromise between simple random sampling and strong profiling that most quickly finds a rare malfeasor, given fixed screening resources. [1] [2]

Using this method, if a group is P times as likely as the average to be a security risk, then persons from that group will be √P times as likely to undergo additional screening. [1] For example, if someone from a profiled group is nine times more likely than the average person to be a security risk, then when using square root biased sampling, people from the profiled group would be screened three times more often than the average person.
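As a numerical illustration, the short Python sketch below compares simple random sampling, strong profiling, and square root biased sampling on an invented population of 100 people, 10 of whom are judged nine times as likely as the others to be the malfeasor. It uses a simplified cost model, taking the expected number of secondary screenings under a memoryless screening policy q to be the sum of pᵢ/qᵢ over the population; this cost model, the population size, and the risk ratio are illustrative assumptions, not figures taken from Press's paper.

    import numpy as np

    def expected_screenings(p, q):
        # Simplified cost model (an assumption for illustration): with prior
        # probabilities p that person i is the malfeasor, and a memoryless
        # policy that screens person i with probability proportional to q[i],
        # the expected number of secondary screenings before the malfeasor
        # is caught is sum(p / q) after normalizing both vectors.
        p = np.asarray(p, dtype=float); p = p / p.sum()
        q = np.asarray(q, dtype=float); q = q / q.sum()
        return float(np.sum(p / q))

    # Hypothetical population: 10 people judged 9x as risky as the other 90.
    prior = np.array([9.0] * 10 + [1.0] * 90)

    strategies = {
        "simple random sampling": np.ones_like(prior),  # everyone equally likely
        "strong profiling": prior,                      # screen in proportion to risk
        "square root biased sampling": np.sqrt(prior),  # screen in proportion to sqrt(risk)
    }

    for name, q in strategies.items():
        print(f"{name:>28s}: expected screenings = {expected_screenings(prior, q):.1f}")

Under these assumptions, simple random sampling and strong profiling both give an expected 100 screenings, while square root biased sampling gives 80; members of the nine-times-riskier group are screened three times as often as everyone else, matching the example above.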

History

Press developed square root biased sampling as a way to sample long sequences of DNA. [3] It had also been developed independently by Ruben Abagyan, a professor at The Scripps Research Institute (TSRI) in La Jolla, California, for use in a different biological context. [4] [5] The method had appeared even earlier in work by Martin L. Shooman, who used square root biased sampling in a test apportionment model for software reliability. [6]

Press's later proposal to use square root biased sampling for airport security was published in 2009. [1] There, he argued that the method would make more efficient use of the limited resources available for screening than current practice, which can lead to screening the same persons repeatedly. [2] [3] However, use of this method presupposes that those doing the screening have accurate statistical information on who is more likely to be a security risk, which is not necessarily the case. [7]

References

  1. Press, William H. (February 10, 2009). "Strong profiling is not mathematically optimal for discovering rare malfeasors". Proceedings of the National Academy of Sciences. 106 (6): 1716–1719. doi:10.1073/pnas.0813202106. PMC 2634801. PMID 19188610.
  2. "Square root bias and airport security screening". Homeland Security Newswire. 2009-02-03. Retrieved 2009-11-28.
  3. "Researcher Proposes Statistical Method to Enhance Secondary Security Screenings". University of Texas at Austin News. 2009-02-03. Retrieved 2009-11-28.
  4. Abagyan RA, Totrov M (1999). "Ab initio folding of peptides by the optimal-bias Monte Carlo minimization procedure". Journal of Computational Physics. 151: 402–421.
  5. Zhou Y, Abagyan R (2002). "Efficient stochastic global optimization for protein structure prediction". In Rigidity Theory and Applications, eds. Thorpe MF, Duxbury PM (New York: Springer).
  6. Shooman, M.L. (1991). "A micro software reliability model for prediction and test apportionment". Proceedings 1991 International Symposium on Software Reliability Engineering (IEEE), pp. 52–59.
  7. Press, William (December 2010). "To catch a terrorist: can ethnic profiling work?". Significance, p. 164.

Derivation: https://www.researchgate.net/publication/309809428_An_optimal_sampling_application_of_Cauchy's_inequality
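Under the same simplified cost model used in the sketch above (expected screenings = Σᵢ pᵢ/qᵢ, an assumption adopted for illustration), the square-root rule follows from Cauchy's inequality, in line with the derivation linked above. With Σᵢ qᵢ = 1, the Cauchy–Schwarz inequality gives

    \left( \sum_i \sqrt{p_i} \right)^2
      = \left( \sum_i \sqrt{\frac{p_i}{q_i}} \, \sqrt{q_i} \right)^2
      \le \left( \sum_i \frac{p_i}{q_i} \right) \left( \sum_i q_i \right)
      = \sum_i \frac{p_i}{q_i},

with equality exactly when √(pᵢ/qᵢ) is proportional to √(qᵢ), i.e. when qᵢ ∝ √(pᵢ). The expected number of screenings is therefore minimized, at (Σᵢ √pᵢ)², by square root biased sampling.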