Deviation (statistics)

Last updated
Plot of standard deviation of a random distribution Deviation from Mean of a Random Distribution.jpg
Plot of standard deviation of a random distribution

In mathematics and statistics, deviation serves as a measure to quantify the disparity between an observed value of a variable and another designated value, frequently the mean of that variable. Deviations with respect to the sample mean and the population mean (or "true value") are called errors and residuals, respectively. The sign of the deviation reports the direction of that difference: the deviation is positive when the observed value exceeds the reference value. The absolute value of the deviation indicates the size or magnitude of the difference. In a given sample, there are as many deviations as sample points. Summary statistics can be derived from a set of deviations, such as the standard deviation and the mean absolute deviation , measures of dispersion, and the mean signed deviation , a measure of bias. [1]

Contents

The deviation of each data point is calculated by subtracting the mean of the data set from the individual data point. Mathematically, the deviation d of a data point x in a data set is given by

This calculation represents the "distance" of a data point from the mean and provides information about how much individual values vary from the average. Positive deviations indicate values above the mean, while negative deviations indicate values below the mean. [1]

The sum of squared deviations is a key component in the calculation of variance, another measure of the spread or dispersion of a data set. Variance is calculated by averaging the squared deviations. Deviation is a fundamental concept in understanding the distribution and variability of data points in statistical analysis. [1]

Types

A deviation that is a difference between an observed value and the true value of a quantity of interest (where true value denotes the Expected Value, such as the population mean) is an error. [2]

Signed deviations

A deviation that is a difference between an observed value and the true value of a quantity of interest (such as the population mean) is an error.

A deviation that is the difference between the observed value and an estimate of the true value (e.g. the sample mean) is a residual. These concepts are applicable for data at the interval and ratio levels of measurement. [3]

Unsigned or absolute deviation

where


The average absolute deviation (AAD) in statistics is a measure of the dispersion or spread of a set of data points around a central value, usually the mean or median. It is calculated by taking the average of the absolute differences between each data point and the chosen central value. AAD provides a measure of the typical magnitude of deviations from the central value in a dataset, giving insights into the overall variability of the data. [5]

Least absolute deviation (LAD) is a statistical method used in regression analysis to estimate the coefficients of a linear model. Unlike the more common least squares method, which minimizes the sum of squared vertical distances (residuals) between the observed and predicted values, the LAD method minimizes the sum of the absolute vertical distances.

In the context of linear regression, if (x1,y1), (x2,y2), ... are the data points, and a and b are the coefficients to be estimated for the linear model

the least absolute deviation estimates (a and b) are obtained by minimizing the sum.

The LAD method is less sensitive to outliers compared to the least squares method, making it a robust regression technique in the presence of skewed or heavy-tailed residual distributions. [6]

Summary statistics

Mean signed deviation

For an unbiased estimator, the average of the signed deviations across the entire set of all observations from the unobserved population parameter value averages zero over an arbitrarily large number of samples. However, by construction the average of signed deviations of values from the sample mean value is always zero, though the average signed deviation from another measure of central tendency, such as the sample median, need not be zero.

Mean Signed Deviation is a statistical measure used to assess the average deviation of a set of values from a central point, usually the mean. It is calculated by taking the arithmetic mean of the signed differences between each data point and the mean of the dataset.

The term "signed" indicates that the deviations are considered with their respective signs, meaning whether they are above or below the mean. Positive deviations (above the mean) and negative deviations (below the mean) are included in the calculation. The mean signed deviation provides a measure of the average distance and direction of data points from the mean, offering insights into the overall trend and distribution of the data. [3]

Dispersion

Statistics of the distribution of deviations are used as measures of statistical dispersion.

A distribution with different standard deviations reflects varying degrees of dispersion among its data points. The first standard deviation from the mean in a normal distribution encompasses approximately 68% of the data. The second standard deviation from the mean in a normal distribution encompasses a larger portion of the data, covering approximately 95% of the observations. Standard deviation graph.jpg
A distribution with different standard deviations reflects varying degrees of dispersion among its data points. The first standard deviation from the mean in a normal distribution encompasses approximately 68% of the data. The second standard deviation from the mean in a normal distribution encompasses a larger portion of the data, covering approximately 95% of the observations.

Normalization

Deviations, which measure the difference between observed values and some reference point, inherently carry units corresponding to the measurement scale used. For example, if lengths are being measured, deviations would be expressed in units like meters or feet. To make deviations unitless and facilitate comparisons across different datasets, one can nondimensionalize.

One common method involves dividing deviations by a measure of scale(statistical dispersion), with the population standard deviation used for standardizing or the sample standard deviation for studentizing (e.g., Studentized residual).

Another approach to nondimensionalization focuses on scaling by location rather than dispersion. The percent deviation offers an illustration of this method, calculated as the difference between the observed value and the accepted value, divided by the accepted value, and then multiplied by 100%. By scaling the deviation based on the accepted value, this technique allows for expressing deviations in percentage terms, providing a clear perspective on the relative difference between the observed and accepted values. Both methods of nondimensionalization serve the purpose of making deviations comparable and interpretable beyond the specific measurement units. [10]

Examples

In one example, a series of measurements of the speed are taken of sound in a particular medium. The accepted or expected value for the speed of sound in this medium, based on theoretical calculations, is 343 meters per second.

Now, during an experiment, multiple measurements are taken by different researchers. Researcher A measures the speed of sound as 340 meters per second, resulting in a deviation of −3 meters per second from the expected value. Researcher B, on the other hand, measures the speed as 345 meters per second, resulting in a deviation of +2 meters per second.

In this scientific context, deviation helps quantify how individual measurements differ from the theoretically predicted or accepted value. It provides insights into the accuracy and precision of experimental results, allowing researchers to assess the reliability of their data and potentially identify factors contributing to discrepancies.

In another example, suppose a chemical reaction is expected to yield 100 grams of a specific compound based on stoichiometry. However, in an actual laboratory experiment, several trials are conducted with different conditions.

In Trial 1, the actual yield is measured to be 95 grams, resulting in a deviation of −5 grams from the expected yield. In Trial 2, the actual yield is measured to be 102 grams, resulting in a deviation of +2 grams. These deviations from the expected value provide valuable information about the efficiency and reproducibility of the chemical reaction under different conditions.

Scientists can analyze these deviations to optimize reaction conditions, identify potential sources of error, and improve the overall yield and reliability of the process. The concept of deviation is crucial in assessing the accuracy of experimental results and making informed decisions to enhance the outcomes of scientific experiments.

See also

Related Research Articles

In mathematics and statistics, the arithmetic mean, arithmetic average, or just the mean or average is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results from an experiment, an observational study, or a survey. The term "arithmetic mean" is preferred in some mathematics and statistics contexts because it helps distinguish it from other types of means, such as geometric and harmonic.

In statistics, a central tendency is a central or typical value for a probability distribution.

<span class="mw-page-title-main">Interquartile range</span> Measure of statistical dispersion

In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q3 − Q1.

<span class="mw-page-title-main">Median</span> Middle quantile of a data set or probability distribution

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value when data is orderly arranged. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values (outlier), and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of an income distribution because increases in the largest incomes alone have no effect on median while the average of the distribution is influenced. For this reason, the median is of central importance in robust statistics.

<span class="mw-page-title-main">Standard deviation</span> In statistics, a measure of variation

In statistics, the standard deviation is a measure of the amount of variation of a random variable expected about its mean. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

<span class="mw-page-title-main">Outlier</span> Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can be an indication of exciting possibility, but can also cause serious problems in statistical analyses.

The average absolute deviation (AAD) of a data set is the average of the absolute deviations from a central point. It is a summary statistic of statistical dispersion or variability. In the general form, the central point can be a mean, median, mode, or the result of any other measure of central tendency or any reference value related to the given data set. AAD includes the mean absolute deviation and the median absolute deviation.

In statistics and optimization, errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "true value". The error of an observation is the deviation of the observed value from the true value of a quantity of interest. The residual is the difference between the observed value and the estimated value of the quantity of interest. The distinction is most important in regression analysis, where the concepts are sometimes called the regression errors and regression residuals and where they lead to the concept of studentized residuals. In econometrics, "errors" are also called disturbances.

In probability theory and statistics, the coefficient of variation (CV), also known as Normalized Root-Mean-Square Deviation (NRMSD), Percent RMS, and relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation to the mean , and often expressed as a percentage ("%RSD"). The CV or RSD is widely used in analytical chemistry to express the precision and repeatability of an assay. It is also commonly used in fields such as engineering or physics when doing quality assurance studies and ANOVA gauge R&R, by economists and investors in economic models, and in psychology/neuroscience.

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.

In statistics, the mid-range or mid-extreme is a measure of central tendency of a sample defined as the arithmetic mean of the maximum and minimum values of the data set:

Robust statistics are statistics which maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

The mean absolute difference (univariate) is a measure of statistical dispersion equal to the average absolute difference of two independent values drawn from a probability distribution. A related statistic is the relative mean absolute difference, which is the mean absolute difference divided by the arithmetic mean, and equal to twice the Gini coefficient. The mean absolute difference is also known as the absolute mean difference and the Gini mean difference (GMD). The mean absolute difference is sometimes denoted by Δ or as MD.

In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. MAE is calculated as the sum of absolute errors divided by the sample size:

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is either one of two closely related and frequently used measures of the differences between true or predicted values on the one hand and observed values or an estimator on the other.

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample.

In probability theory and statistics, the index of dispersion, dispersion index,coefficient of dispersion,relative variance, or variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution: it is a measure used to quantify whether a set of observed occurrences are clustered or dispersed compared to a standard statistical model.

<span class="mw-page-title-main">L-estimator</span>

In statistics, an L-estimator is an estimator which is a linear combination of order statistics of the measurements. This can be as little as a single point, as in the median, or as many as all points, as in the mean.

In statistics, robust measures of scale are methods that quantify the statistical dispersion in a sample of numerical data while resisting outliers. The most common such robust statistics are the interquartile range (IQR) and the median absolute deviation (MAD). These are contrasted with conventional or non-robust measures of scale, such as sample standard deviation, which are greatly influenced by outliers.

<span class="mw-page-title-main">Statistical dispersion</span> Statistical property quantifying how much a collection of data is spread out

In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range. For instance, when the variance of data in a set is large, the data is widely scattered. On the other hand, when the variance is small, the data in the set is clustered.

References

  1. 1 2 3 Lee, Dong Kyu; In, Junyong; Lee, Sangseok (2015). "Standard deviation and standard error of the mean". Korean Journal of Anesthesiology. 68 (3): 220. doi: 10.4097/kjae.2015.68.3.220 . ISSN   2005-6419. PMC   4452664 .
  2. Livingston, Edward H. (June 2004). "The mean and standard deviation: what does it all mean?". Journal of Surgical Research. 119 (2): 117–123. doi:10.1016/j.jss.2004.02.008. ISSN   0022-4804.
  3. 1 2 Dodge, Yadolah, ed. (2003-08-07). The Oxford Dictionary Of Statistical Terms. Oxford University Press, Oxford. ISBN   978-0-19-850994-3.
  4. Konno, Hiroshi; Koshizuka, Tomoyuki (2005-10-01). "Mean-absolute deviation model". IIE Transactions. 37 (10): 893–900. doi:10.1080/07408170591007786. ISSN   0740-817X.
  5. Pham-Gia, T.; Hung, T. L. (2001-10-01). "The mean and median absolute deviations". Mathematical and Computer Modelling. 34 (7): 921–936. doi:10.1016/S0895-7177(01)00109-1. ISSN   0895-7177.
  6. Chen, Kani; Ying, Zhiliang (1996-04-01). "A counterexample to a conjecture concerning the Hall-Wellner band". The Annals of Statistics. 24 (2). doi: 10.1214/aos/1032894456 . ISSN   0090-5364.
  7. "2. Mean and standard deviation | The BMJ". The BMJ | The BMJ: leading general medical journal. Research. Education. Comment. 2020-10-28. Retrieved 2022-11-02.
  8. 1 2 Pham-Gia, T.; Hung, T. L. (2001-10-01). "The mean and median absolute deviations". Mathematical and Computer Modelling. 34 (7): 921–936. doi:10.1016/S0895-7177(01)00109-1. ISSN   0895-7177.
  9. Jones, Alan R. (2018-10-09). Probability, Statistics and Other Frightening Stuff. Routledge. p. 73. ISBN   978-1-351-66138-6.
  10. Freedman, David; Pisani, Robert; Purves, Roger (2007). Statistics (4 ed.). New York: Norton. ISBN   978-0-393-93043-6.