Estimation


The exact number of candies in this jar cannot be determined by looking at it, because most of the candies are not visible. It can be estimated by assuming that the density of the unseen candies is the same as that of the visible candies.

Estimation (or estimating) is the process of finding an estimate or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is derived from the best information available. [1] Typically, estimation involves "using the value of a statistic derived from a sample to estimate the value of a corresponding population parameter". [2] The sample provides information that can be projected, through various formal or informal processes, to determine a range most likely to describe the missing information. An estimate that turns out to be incorrect will be an overestimate if the estimate exceeds the actual result [3] and an underestimate if the estimate falls short of the actual result. [4]


The confidence in an estimate is quantified as a confidence interval, a range within which the true value is expected to lie with a stated probability. Human estimators systematically suffer from overconfidence, believing that their estimates are more accurate than they actually are. [5]
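
As a rough illustration (not drawn from the article's sources), the following Python sketch computes an approximate 95% confidence interval for a sample mean using the normal approximation; the data values are invented for demonstration.

    import math
    import statistics

    # Hypothetical sample of eight measurements (values invented for illustration).
    sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]

    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

    # Approximate 95% confidence interval using the normal critical value 1.96.
    low, high = mean - 1.96 * sem, mean + 1.96 * sem
    print(f"estimate: {mean:.2f}, 95% confidence interval: ({low:.2f}, {high:.2f})")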

How estimation is done

Estimation is often done by sampling, which is counting a small number of examples of something and projecting that number onto a larger population. [1] An example of estimation would be determining how many candies of a given size are in a glass jar. Because the distribution of candies inside the jar may vary, the observer can count the number of candies visible through the glass, consider the size of the jar, and presume that a similar distribution can be found in the parts that cannot be seen, thereby making an estimate of the total number of candies that could be in the jar if that presumption were true. Estimates can similarly be generated by projecting results from polls or surveys onto the entire population.
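
To make the projection step concrete, here is a minimal sketch in which all the numbers are assumptions chosen for illustration: the visible density is scaled up to the whole jar, and a sample result is scaled up to a population.

    # Assumed observations (hypothetical numbers, not from the article).
    visible_candies = 50            # candies counted through the glass
    visible_fraction_of_jar = 0.05  # the visible layer holds about 5% of the jar's volume

    # Project the observed density onto the unseen parts of the jar.
    estimated_total = visible_candies / visible_fraction_of_jar
    print(f"estimated candies in the jar: {estimated_total:.0f}")

    # The same projection applies to polls: scale a sample result up to the population.
    in_favor, sample_size, population = 420, 1_000, 250_000
    print(f"projected supporters in the population: {population * in_favor / sample_size:.0f}")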

In making an estimate, it is often most useful to generate a range of possible outcomes that is precise enough to be useful but not so precise that it is likely to be inaccurate. [2] For example, in trying to guess the number of candies in the jar, if fifty were visible, and the total volume of the jar seemed to be about twenty times as large as the volume containing the visible candies, then one might simply project that there were a thousand candies in the jar. Such a projection, intended to pick the single value that is believed to be closest to the actual value, is called a point estimate. [2] However, a point estimate is likely to be incorrect, because the sample size (in this case, the number of candies that are visible) is too small to be sure that it does not contain anomalies that differ from the population as a whole. [2] A corresponding concept is an interval estimate, which captures a much larger range of possibilities but may be too broad to be useful. [2] For example, if one were asked to estimate the percentage of people who like candy, it would clearly be correct that the number falls between zero and one hundred percent. [2] Such an estimate would provide no guidance, however, to somebody who is trying to determine how many candies to buy for a party to be attended by a hundred people.
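
The distinction can be sketched in code. Under assumed numbers (the survey below is hypothetical), a point estimate of the proportion of people who like candy can be paired with an interval narrow enough to guide a purchase for a hundred guests, whereas the trivially true interval of zero to one hundred percent gives no guidance.

    import math

    # Hypothetical survey: 18 of 25 people sampled say they like candy.
    likes, n = 18, 25
    p_hat = likes / n  # point estimate of the proportion

    # Rough 95% interval using the normal approximation for a proportion.
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    low, high = max(0.0, p_hat - margin), min(1.0, p_hat + margin)

    guests = 100
    print(f"point estimate: {p_hat:.0%} -> plan for about {round(guests * p_hat)} candy eaters")
    print(f"useful interval: {low:.0%} to {high:.0%} -> plan for {math.floor(guests * low)} to {math.ceil(guests * high)}")
    print("useless interval: 0% to 100% -> plan for 0 to 100 (no guidance)")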

Uses of estimation

In mathematics, approximation describes the process of finding estimates in the form of upper or lower bounds for a quantity that cannot readily be evaluated precisely, and approximation theory deals with finding simpler functions that are close to some complicated function and that can provide useful estimates. In statistics, an estimator is the formal name for the rule by which an estimate is calculated from data, and estimation theory deals with finding estimates with good properties. This process is used in signal processing to approximate an unobserved signal on the basis of an observed signal containing noise. For estimation of yet-to-be-observed quantities, forecasting and prediction are applied. A Fermi problem, in physics, is an estimation problem that typically involves making justified guesses about quantities that seem impossible to compute given the limited information available.
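
As a hedged illustration of the Fermi style, a back-of-the-envelope estimate multiplies a chain of rough factors; every figure below is a deliberately coarse assumption rather than sourced data.

    # Classic Fermi question: roughly how many piano tuners work in a large city?
    population = 3_000_000                    # assumed city population
    people_per_household = 2.5
    households_with_piano = 1 / 20            # assume one household in twenty owns a piano
    tunings_per_piano_per_year = 1
    tunings_per_tuner_per_year = 4 * 5 * 50   # ~4 a day, 5 days a week, 50 weeks a year

    pianos = population / people_per_household * households_with_piano
    tuners = pianos * tunings_per_piano_per_year / tunings_per_tuner_per_year
    print(f"order-of-magnitude estimate: ~{round(tuners)} piano tuners")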

Estimation is important in business and economics because too many variables exist to figure out how large-scale activities will develop. Estimation in project planning can be particularly significant, because plans for the distribution of labor and purchases of raw materials must be made despite the inability to know every possible problem that may come up. A certain amount of resources will be available for carrying out a particular project, making it important to obtain or generate a cost estimate as one of the vital elements of entering into the project. [6] The U.S. Government Accountability Office defines a cost estimate as "the summation of individual cost elements, using established methods and valid data, to estimate the future costs of a program, based on what is known today", and reports that "realistic cost estimating was imperative when making wise decisions in acquiring new systems". [7] Furthermore, project plans must not underestimate the needs of the project, which can result in delays while unmet needs are fulfilled, nor greatly overestimate them, or else the unneeded resources may go to waste.
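
Following the idea of a cost estimate as a summation of individual cost elements, a minimal sketch (the line items and dollar figures are invented) sums low, likely, and high values per element so the resulting range signals both underestimation and overestimation risk.

    # Hypothetical cost elements with (low, most likely, high) values in dollars.
    cost_elements = {
        "labor":         (40_000, 55_000, 80_000),
        "raw_materials": (15_000, 20_000, 30_000),
        "equipment":     ( 8_000, 10_000, 18_000),
    }

    low    = sum(lo for lo, _, _ in cost_elements.values())
    likely = sum(ml for _, ml, _ in cost_elements.values())
    high   = sum(hi for _, _, hi in cost_elements.values())

    print(f"cost estimate: likely ${likely:,}, range ${low:,} to ${high:,}")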

An informal estimate when little information is available is called a guesstimate because the inquiry becomes closer to purely guessing the answer. The "estimated" sign, ℮, is used to designate that package contents are close to the nominal contents.


Related Research Articles

<span class="mw-page-title-main">Cluster sampling</span> Sampling methodology in statistics

In statistics, cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population. It is often used in marketing research.

<span class="mw-page-title-main">Estimator</span> Rule for calculating an estimate of a given quantity based on observed data

In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished. For example, the sample mean is a commonly used estimator of the population mean.

<span class="mw-page-title-main">Median</span> Middle quantile of a data set or probability distribution

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic feature of the median in describing data compared to the mean is that it is not skewed by a small proportion of extremely large or small values, and therefore provides a better representation of the center. Median income, for example, may be a better way to describe the center of the income distribution because increases in the largest incomes alone have no effect on the median. For this reason, the median is of central importance in robust statistics.
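
A small sketch with made-up income figures shows the robustness described above: inflating the largest income moves the mean substantially but leaves the median unchanged.

    import statistics

    # Hypothetical incomes; the same list with the top value drastically inflated.
    incomes  = [28_000, 34_000, 41_000, 52_000, 75_000]
    inflated = incomes[:-1] + [7_500_000]

    print(statistics.mean(incomes), statistics.median(incomes))    # 46000.0 41000
    print(statistics.mean(inflated), statistics.median(inflated))  # mean jumps, median stays 41000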

<span class="mw-page-title-main">Standard deviation</span> In statistics, a measure of variation

In statistics, the standard deviation is a measure of the amount of variation of a random variable expected about its mean. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is commonly used in the determination of what constitutes an outlier and what does not.

In statistics, point estimation involves the use of sample data to calculate a single value which is to serve as a "best guess" or "best estimate" of an unknown population parameter. More formally, it is the application of a point estimator to the data to obtain a point estimate.

Statistical bias, in the mathematical field of statistics, is a systematic tendency in which the methods used to gather data and generate statistics present an inaccurate, skewed or biased depiction of reality. Statistical bias exists in numerous stages of the data collection and analysis process, including: the source of the data, the methods used to collect the data, the estimator chosen, and the methods used to analyze the data. Data analysts can take various measures at each stage of the process to reduce the impact of statistical bias in their work. Understanding the source of statistical bias can help to assess whether the observed results are close to actuality. Issues of statistical bias have been argued to be closely linked to issues of statistical validity.

In physics or engineering education, a Fermi problem, also known as an order-of-magnitude problem, is an estimation problem designed to teach dimensional analysis or approximation of extreme scientific calculations, and such a problem is usually a back-of-the-envelope calculation. The estimation technique is named after physicist Enrico Fermi, who was known for his ability to make good approximate calculations with little or no actual data. Fermi problems typically involve making justified guesses about quantities and their variance or lower and upper bounds. In some cases, order-of-magnitude estimates can also be derived using dimensional analysis.

<span class="mw-page-title-main">Standard error</span> Statistical property

The standard error (SE) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the statistic is the sample mean, it is called the standard error of the mean (SEM). The standard error is a key ingredient in producing confidence intervals.

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements. In estimation theory, two approaches are generally considered: a probabilistic approach, which assumes the measured data are random with a probability distribution that depends on the parameters of interest, and a set-membership approach, which assumes the measured data vector belongs to a set that depends on the parameter vector.

Sample size determination or estimation is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. In complex studies, different sample sizes may be allocated, such as in stratified surveys or experimental designs with multiple treatment groups. In a census, data is sought for an entire population, hence the intended sample size is equal to the population. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.

In statistics, resampling is the creation of new samples based on one observed sample. Resampling methods are:

  1. Permutation tests
  2. Bootstrapping
  3. Cross validation

A cost estimate is the approximation of the cost of a program, project, or operation. The cost estimate is the product of the cost estimating process. The cost estimate has a single total value and may have identifiable component values.

Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
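
A minimal bootstrap sketch (the data below are invented) resamples with replacement many times to approximate the sampling distribution of the mean and read off a percentile interval.

    import random
    import statistics

    random.seed(0)
    sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]  # hypothetical data

    # Resample with replacement many times and record the statistic of interest.
    boot_means = sorted(
        statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(10_000)
    )

    # 95% percentile interval from the bootstrap distribution.
    low, high = boot_means[249], boot_means[9_749]
    print(f"bootstrap 95% interval for the mean: ({low:.2f}, {high:.2f})")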

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

In statistics and in particular statistical theory, unbiased estimation of a standard deviation is the calculation from a statistical sample of an estimated value of the standard deviation of a population of values, in such a way that the expected value of the calculation equals the true value. Except in some important situations, the task has little relevance to applications of statistics, since its need is avoided by standard procedures such as the use of significance tests and confidence intervals, or by using Bayesian analysis.

In statistical signal processing, the goal of spectral density estimation (SDE) or simply spectral estimation is to estimate the spectral density of a signal from a sequence of time samples of the signal. Intuitively speaking, the spectral density characterizes the frequency content of the signal. One purpose of estimating the spectral density is to detect any periodicities in the data, by observing peaks at the frequencies corresponding to these periodicities.

The wisdom of the crowd is the collective opinion of a diverse and independent group of individuals rather than that of a single expert. This process, while not new to the Information Age, has been pushed into the mainstream spotlight by social information sites such as Quora, Reddit, Stack Exchange, Wikipedia, Yahoo! Answers, and other web resources which rely on collective human knowledge. An explanation for this phenomenon is that there is idiosyncratic noise associated with each individual judgment, and taking the average over a large number of responses will go some way toward canceling the effect of this noise.
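
A toy simulation (purely illustrative, with an assumed true value and noise model) shows the averaging-out effect described above: each guess is noisy, but the crowd average lands close to the truth.

    import random
    import statistics

    random.seed(1)
    true_value = 1000  # e.g. the actual number of candies in a jar

    # Each individual guess is the truth plus idiosyncratic noise.
    guesses = [true_value + random.gauss(0, 300) for _ in range(500)]

    print(f"typical individual error: ~{statistics.stdev(guesses):.0f}")
    print(f"crowd average: {statistics.mean(guesses):.0f} (close to {true_value})")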

In various science/engineering applications, such as independent component analysis, image analysis, genetic analysis, speech recognition, manifold learning, and time delay estimation it is useful to estimate the differential entropy of a system or process, given some observations.

In statistics, the mean signed difference (MSD), also known as mean signed deviation and mean signed error, is a sample statistic that summarises how well a set of estimates match the quantities that they are supposed to estimate. It is one of a number of statistics that can be used to assess an estimation procedure, and it would often be used in conjunction with a sample version of the mean square error.
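
As a small illustration with invented estimates and true values, the mean signed difference averages the signed errors, so errors of opposite sign partly cancel, unlike the mean squared error.

    estimates = [10.2, 9.5, 11.0, 10.8]
    actuals   = [10.0, 10.0, 10.0, 10.0]

    errors = [e - a for e, a in zip(estimates, actuals)]
    msd = sum(errors) / len(errors)                       # mean signed difference
    mse = sum(err ** 2 for err in errors) / len(errors)   # mean squared error

    print(f"MSD = {msd:+.3f}, MSE = {mse:.3f}")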

In computing, the count–min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.
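
A minimal count–min sketch in Python (an illustrative sketch, not the authors' reference implementation) uses several seeded hash functions over a small table of counters; updates increment one counter per row, and queries take the minimum across rows, so counts can only be overestimated.

    import random

    class CountMinSketch:
        def __init__(self, width=272, depth=5, seed=0):
            rng = random.Random(seed)
            self.width = width
            self.seeds = [rng.random() for _ in range(depth)]  # one hash seed per row
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            return (hash((seed, item)) % self.width for seed in self.seeds)

        def add(self, item, count=1):
            for row, col in zip(self.table, self._cells(item)):
                row[col] += count

        def query(self, item):
            # The minimum over rows bounds the true count from above (overcounts only).
            return min(row[col] for row, col in zip(self.table, self._cells(item)))

    cms = CountMinSketch()
    for word in "to be or not to be".split():
        cms.add(word)
    print(cms.query("to"), cms.query("be"), cms.query("missing"))  # 2 2 0 (barring collisions)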

References

  1. C. Lon Enloe, Elizabeth Garnett, Jonathan Miles, Physical Science: What the Technology Professional Needs to Know (2000), p. 47.
  2. Raymond A. Kent, "Estimation", Data Construction and Data Analysis for Survey Research (2001), p. 157.
  3. James Tate, John Schoonbeck, Reviewing Mathematics (2003), page 27: "An overestimate is an estimate you know is greater than the exact answer".
  4. James Tate, John Schoonbeck, Reviewing Mathematics (2003), page 27: "An underestimate is an estimate you know is less than the exact answer".
  5. Alpert, Marc; Raiffa, Howard (1982). "A progress report on the training of probability assessors". In Kahneman, Daniel; Slovic, Paul; Tversky, Amos (eds.). Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press. pp. 294–305. ISBN   978-0-521-28414-1.
  6. A Guide to the Project Management Body of Knowledge (PMBOK Guide) Third Edition, An American National Standard, ANSI/PMI 99-001-2004, Project Management Institute, Inc, 2004, ISBN   1-930699-45-X.
  7. GAO Cost Estimating and Assessment Guide, Best Practices for Developing and Managing Capital Program Costs, GAO-09-3SP, United States Government Accountability Office, March 2009, Preface p. i.