Peirce's criterion

In robust statistics, Peirce's criterion is a rule for eliminating outliers from data sets, which was devised by Benjamin Peirce.

[Figure: Outliers removed by Peirce's criterion]

The problem of outliers

In data sets containing real-numbered measurements, the suspected outliers are the measured values that appear to lie outside the cluster formed by most of the other data values. Such outliers can greatly distort an estimate of location based on the arithmetic mean, because the mean is very sensitive to the inclusion of even a single extreme value; in statistical terminology, the arithmetic mean is not robust.

In the presence of outliers, the statistician has two options. First, the statistician may remove the suspected outliers from the data set and then use the arithmetic mean to estimate the location parameter. Second, the statistician may use a robust statistic of location, such as the median.

Peirce's criterion is a statistical procedure for eliminating outliers.

Uses of Peirce's criterion

The statistician and historian of statistics Stephen M. Stigler wrote the following about Benjamin Peirce: [1]

"In 1852 he published the first significance test designed to tell an investigator whether an outlier should be rejected (Peirce 1852, 1878). The test, based on a likelihood ratio type of argument, had the distinction of producing an international debate on the wisdom of such actions (Anscombe, 1960, Rider, 1933, Stigler, 1973a)."

Peirce's criterion is derived from a statistical analysis of the Gaussian distribution. Unlike some other criteria for removing outliers, Peirce's method can be applied to identify two or more outliers.

"It is proposed to determine in a series of observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations." [2]

Hawkins [3] provides a formula for the criterion.

Peirce's criterion was used for decades at the United States Coast Survey. [4]

"From 1852 to 1867 he served as the director of the longitude determinations of the U. S. Coast Survey and from 1867 to 1874 as superintendent of the Survey. During these years his test was consistently employed by all the clerks of this, the most active and mathematically inclined statistical organization of the era." [1]

Peirce's criterion was discussed in William Chauvenet's book. [2]

Applications

An application for Peirce's criterion is removing poor data points from observation pairs in order to perform a regression between the two observations (e.g., a linear regression). Peirce's criterion does not depend on the observed values themselves, only on characteristics of the data set (such as the number of observations, suspected outliers, and model unknowns), which makes it a highly repeatable procedure that can be calculated independently of other processes. This feature makes Peirce's criterion ideal for identifying outliers in computer applications, because it can be written as a callable function.

Previous attempts

In 1855, B. A. Gould attempted to make Peirce's criterion easier to apply by creating tables of values computed from Peirce's equations. [5] A disconnect still exists between Gould's algorithm and the practical application of Peirce's criterion.

In 2003, S. M. Ross (University of New Haven) re-presented Gould's algorithm (now called "Peirce's method") with a new example data set and work-through of the algorithm. This methodology still relies on using look-up tables, which have been updated in this work (Peirce's criterion table). [6]

In 2008, the Danish geologist K. Thomsen attempted to write pseudo-code for the criterion. [7] While this code provided some framework for Gould's algorithm, users were unsuccessful in reproducing the values reported by either Peirce or Gould.

In 2012, C. Dardis released the R package "Peirce", implementing several methodologies (Peirce's criterion and the Chauvenet method) with comparisons of outlier removals. Dardis and fellow contributor Simon Muller successfully implemented Thomsen's pseudo-code into a function called "findx". The code is presented in the R implementation section below. References for the R package are available online, [8] as is an unpublished review of the R package results. [9]

In 2013, a re-examination of Gould's algorithm and the use of advanced Python programming modules (i.e., numpy and scipy) made it possible to calculate the squared-error threshold values for identifying outliers.

Python implementation

In order to use Peirce's criterion, one must first understand the input and return values. Regression analysis (or the fitting of curves to data) results in residual errors (or the difference between the fitted curve and the observation points). Therefore, each observation point has a residual error associated with a fitted curve. By taking the square (i.e., residual error raised to the power of two), residual errors are expressed as positive values. If the squared error is too large (i.e., due to a poor observation) it can cause problems with the regression parameters (e.g., slope and intercept for a linear curve) retrieved from the curve fitting.
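
As a minimal sketch of where these squared residual errors come from, consider an ordinary linear fit with numpy.polyfit; the data and variable names here are purely illustrative, not taken from the article's code or sources.

import numpy

# Illustrative observation pairs (x, y); the last point is deliberately poor.
x = numpy.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([0.1, 1.1, 1.9, 3.2, 3.9, 9.0])

# Fit a straight line (m = 2 model unknowns: slope and intercept):
slope, intercept = numpy.polyfit(x, y, 1)

# Residual errors and their squares:
residuals = y - (slope * x + intercept)
squared_errors = residuals ** 2
mean_squared_error = squared_errors.mean()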

It was Peirce's idea to statistically identify what constituted an error as "too large" and therefore being identified as an "outlier" which could be removed from the observations to improve the fit between the observations and a curve. K. Thomsen identified that three parameters were needed to perform the calculation: the number of observation pairs (N), the number of outliers to be removed (n), and the number of regression parameters (e.g., coefficients) used in the curve-fitting to get the residuals (m). The end result of this process is to calculate a threshold value (of squared error) whereby observations with a squared error smaller than this threshold should be kept and observations with a squared error larger than this value should be removed (i.e., as an outlier).

Because Peirce's criterion does not take observations, fitting parameters, or residual errors as an input, the output must be re-associated with the data. Taking the average of all the squared errors (i.e., the mean-squared error) and multiplying it by the threshold squared error (i.e., the output of this function) will result in the data-specific threshold value used to identify outliers.
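
Continuing the illustrative sketch above, and assuming the peirce_dev function given in the Python code section below, the re-association described here is simply a multiplication and a comparison:

# Squared-error threshold ratio from Peirce's criterion; here N = 6 observation
# pairs, n = 1 suspected outlier, m = 2 model unknowns (slope and intercept).
x2 = peirce_dev(6, 1, 2)

# Data-specific threshold and outlier identification:
threshold = x2 * mean_squared_error
outliers = squared_errors > threshold   # boolean mask over the observations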

The following Python code returns x-squared values for a given N (first column) and n (top row) in Table 1 (m = 1) and Table 2 (m = 2) of Gould 1855. [5] Due to the Newton-method of iteration, look-up tables, such as N versus log Q (Table III in Gould, 1855) and x versus log R (Table III in Peirce, 1852 and Table IV in Gould, 1855) are no longer necessary.

Python code

#!/usr/bin/env python3
import numpy
import scipy.special


def peirce_dev(N: int, n: int, m: int) -> float:
    """Peirce's criterion

    Returns the squared threshold error deviation for outlier identification
    using Peirce's criterion based on Gould's methodology.

    Arguments:
        - int, total number of observations (N)
        - int, number of outliers to be removed (n)
        - int, number of model unknowns (m)
    Returns:
        float, squared error threshold (x2)
    """
    # Assign floats to input variables:
    N = float(N)
    n = float(n)
    m = float(m)

    # Check number of observations:
    if N > 1:
        # Calculate Q (Nth root of Gould's equation B):
        Q = (n ** (n / N) * (N - n) ** ((N - n) / N)) / N
        #
        # Initialize R values (as floats):
        r_new = 1.0
        r_old = 0.0  # <- Necessary to prompt while loop
        #
        # Start iteration to converge on R:
        while abs(r_new - r_old) > (N * 2.0e-16):
            # Calculate Lamda
            # (1/(N-n)th root of Gould's equation A'):
            ldiv = r_new ** n
            if ldiv == 0:
                ldiv = 1.0e-6
            Lamda = ((Q ** N) / (ldiv)) ** (1.0 / (N - n))
            # Calculate x-squared (Gould's equation C):
            x2 = 1.0 + (N - m - n) / n * (1.0 - Lamda ** 2.0)
            # If x2 goes negative, return 0:
            if x2 < 0:
                x2 = 0.0
                r_old = r_new
            else:
                # Use x-squared to update R (Gould's equation D):
                r_old = r_new
                r_new = numpy.exp((x2 - 1) / 2.0) * scipy.special.erfc(
                    numpy.sqrt(x2) / numpy.sqrt(2.0)
                )
    else:
        x2 = 0.0
    return x2

Java code

import org.apache.commons.math3.special.Erf;

public class PierceCriterion {

  /**
   * Peirce's criterion
   * <p>
   * Returns the squared threshold error deviation for outlier identification
   * using Peirce's criterion based on Gould's methodology.
   * <p>
   * Arguments:
   *   - int, total number of observations (N)
   *   - int, number of outliers to be removed (n)
   *   - int, number of model unknowns (m)
   * Returns:
   *   float, squared error threshold (x2)
   **/
  public static final double peirce_dev(double N, double n, double m) {
    // Check number of observations:
    double x2 = 0.0;
    if (N > 1) {
      // Calculate Q (Nth root of Gould's equation B):
      double Q = (Math.pow(n, (n / N)) * Math.pow((N - n), ((N - n) / N))) / N;

      // Initialize R values (as floats):
      double r_new = 1.0;
      double r_old = 0.0;  // <- Necessary to prompt while loop

      // Start iteration to converge on R:
      while (Math.abs(r_new - r_old) > (N * 2.0e-16)) {
        // Calculate Lamda
        // (1/(N-n)th root of Gould's equation A'):
        double ldiv = Math.pow(r_new, n);
        if (ldiv == 0) {
          ldiv = 1.0e-6;
        }
        double Lamda = Math.pow((Math.pow(Q, N) / (ldiv)), (1.0 / (N - n)));
        // Calculate x-squared (Gould's equation C):
        x2 = 1.0 + (N - m - n) / n * (1.0 - Math.pow(Lamda, 2.0));
        // If x2 goes negative, return 0:
        if (x2 < 0) {
          x2 = 0.0;
          r_old = r_new;
        } else {
          // Use x-squared to update R (Gould's equation D):
          r_old = r_new;
          r_new = Math.exp((x2 - 1) / 2.0) * Erf.erfc(Math.sqrt(x2) / Math.sqrt(2.0));
        }
      }
    } else {
      x2 = 0.0;
    }
    return x2;
  }
}

R implementation

Thomsen's code has been successfully written into the following function call, "findx", by C. Dardis and S. Muller in 2012, which returns the maximum error deviation, x. To complement the Python code presented in the previous section, the R equivalent of "peirce_dev" is also presented here, which returns the squared maximum error deviation, x². These two functions return equivalent values: either square the value returned by the "findx" function or take the square root of the value returned by the "peirce_dev" function. Differences occur with error handling. For example, the "findx" function returns NaN for invalid data, while "peirce_dev" returns 0 (which allows computations to continue without additional NA handling). Also, the "findx" function does not support any error handling when the number of potential outliers approaches the number of observations (it throws a missing-value error and a NaN warning).

Just as with the Python version, the squared error (i.e., x²) returned by the "peirce_dev" function must be multiplied by the mean-squared error of the model fit to get the squared-delta value (i.e., Δ²). Use Δ² to compare the squared-error values of the model fit: any observation pairs with a squared error greater than Δ² are considered outliers and can be removed from the model. An iterator should be written to test increasing values of n until the number of outliers identified (comparing Δ² to the model-fit squared errors) is less than the number assumed (i.e., Peirce's n), as sketched below.
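
A minimal Python sketch of that iteration follows, assuming the peirce_dev function from the Python code section above and an array of model-fit squared errors; the loop structure and function name are illustrative assumptions, not code from the cited sources.

def find_peirce_outliers(squared_errors, m):
    # Test increasing values of n until the number of outliers identified
    # is less than the number assumed (Peirce's n), as described above.
    N = len(squared_errors)
    mse = sum(squared_errors) / N
    delta2 = float("inf")                    # no threshold until the loop runs
    n = 1
    while n < N - m:
        delta2 = peirce_dev(N, n, m) * mse   # data-specific squared threshold
        found = sum(1 for e in squared_errors if e > delta2)
        if found < n:                        # stopping rule from the text
            break
        n += 1
    return [i for i, e in enumerate(squared_errors) if e > delta2]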

R code

findx <- function(N, k, m) {
  # method by K. Thomsen (2008)
  # written by C. Dardis and S. Muller (2012)
  # Available online: https://r-forge.r-project.org/R/?group_id=1473
  #
  # Variable definitions:
  # N :: number of observations
  # k :: number of potential outliers to be removed
  # m :: number of unknown quantities
  #
  # Requires the complementary error function, erfc:
  erfc <- function(x) 2 * pnorm(x * sqrt(2), lower.tail = FALSE)
  #
  x <- 1
  if ((N - m - k) <= 0) {
    return(NaN)
  } else {
    x <- min(x, sqrt((N - m) / k) - 1e-10)
    #
    # Log of Gould's equation B:
    LnQN <- k * log(k) + (N - k) * log(N - k) - N * log(N)
    #
    # Gould's equation D:
    R1 <- exp((x^2 - 1) / 2) * erfc(x / sqrt(2))
    #
    # Gould's equation A' solved for R w/ Lambda substitution:
    R2 <- exp((LnQN - 0.5 * (N - k) * log((N - m - k * x^2) / (N - m - k))) / k)
    #
    # Equate the two R equations:
    R1d <- x * R1 - sqrt(2 / pi / exp(1))
    R2d <- x * (N - k) / (N - m - k * x^2) * R2
    #
    # Update x:
    oldx <- x
    x <- oldx - (R1 - R2) / (R1d - R2d)
    #
    # Loop until convergence:
    while (abs(x - oldx) >= N * 2e-16) {
      R1 <- exp((x^2 - 1) / 2) * erfc(x / sqrt(2))
      R2 <- exp((LnQN - 0.5 * (N - k) * log((N - m - k * x^2) / (N - m - k))) / k)
      R1d <- x * R1 - sqrt(2 / pi / exp(1))
      R2d <- x * (N - k) / (N - m - k * x^2) * R2
      oldx <- x
      x <- oldx - (R1 - R2) / (R1d - R2d)
    }
  }
  return(x)
}
peirce_dev <- function(N, n, m) {
  # N :: total number of observations
  # n :: number of outliers to be removed
  # m :: number of model unknowns (e.g., regression parameters)
  #
  # Check number of observations:
  if (N > 1) {
    # Calculate Q (Nth root of Gould's equation B):
    Q = (n^(n / N) * (N - n)^((N - n) / N)) / N
    #
    # Initialize R values:
    Rnew = 1.0
    Rold = 0.0  # <- Necessary to prompt while loop
    #
    while (abs(Rnew - Rold) > (N * 2.0e-16)) {
      # Calculate Lamda (1/(N-n)th root of Gould's equation A'):
      ldiv = Rnew^n
      if (ldiv == 0) {
        ldiv = 1.0e-6
      }
      Lamda = ((Q^N) / (ldiv))^(1.0 / (N - n))
      #
      # Calculate x-squared (Gould's equation C):
      x2 = 1.0 + (N - m - n) / n * (1.0 - Lamda^2.0)
      #
      # If x2 goes negative, set equal to zero:
      if (x2 < 0) {
        x2 = 0
        Rold = Rnew
      } else {
        # Use x-squared to update R (Gould's equation D):
        # NOTE: the error function (erfc) is replaced with pnorm (base R):
        # http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html
        Rold = Rnew
        Rnew = exp((x2 - 1) / 2.0) *
          (2 * pnorm(sqrt(x2) / sqrt(2) * sqrt(2), lower.tail = FALSE))
      }
    }
  } else {
    x2 = 0
  }
  x2
}

Notes

  1. S.M. Stigler, "Mathematical statistics in the early states," The Annals of Statistics, vol. 6, no. 2, p. 246, 1978. Available online: https://www.jstor.org/stable/2958876
  2. Quoted in the editorial note on page 516 of the Collected Writings of Peirce (1982 edition). The quotation cites A Manual of Astronomy (2:558) by Chauvenet.
  3. D.M. Hawkins (1980). "Brief early history in outlier rejection," Identification of Outliers (Monographs on Applied Probability and Statistics). Chapman & Hall, page 10.
  4. Peirce (1878)
  5. Gould, B.A., "On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application", Astronomical Journal, issue 83, vol. 4, no. 11, pp. 81–87, 1855. DOI: 10.1086/100480.
  6. Ross, S.M., "Peirce's criterion for the elimination of suspect experimental data", Journal of Engineering Technology, vol. 2, no. 2, pp. 1–12, 2003.
  7. Thomsen, K., "Topic: Computing tables for use with Peirce's Criterion - in 1855 and 2008", The Math Forum @ Drexel, posted 5 October 2008. Accessed 15 July 2013.
  8. C. Dardis, "Package: Peirce," R-forge, accessed online: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/Peirce/Peirce-manual.pdf?root=peirce
  9. C. Dardis, "Peirce's criterion for the rejection of non-normal outliers; defining the range of applicability," Journal of Statistical Software (unpublished). Available online: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/Peirce/PeirceSub.pdf?root=peirce
