Cumulative frequency analysis


Figure: Cumulative frequency distribution, adapted cumulative probability distribution, and confidence intervals.

Cumulative frequency analysis is the analysis of the frequency of occurrence of values of a phenomenon less than a reference value. The phenomenon may be time- or space-dependent. Cumulative frequency is also called frequency of non-exceedance.


Cumulative frequency analysis is performed to obtain insight into how often a certain phenomenon (feature) is below a certain value. This may help in describing or explaining a situation in which the phenomenon is involved, or in planning interventions, for example in flood protection. [1]

This statistical technique can be used to estimate how likely an event such as a flood is to occur again in the future, based on how often it occurred in the past. It can be adapted to account for changing conditions, for example climate change causing wetter winters and drier summers.

Principles

Definitions

Frequency analysis [2] is the analysis of how often, or how frequently, an observed phenomenon occurs in a certain range.

Frequency analysis applies to a record of length N of observed data X1, X2, X3, …, XN on a variable phenomenon X. The record may be time-dependent (e.g. rainfall measured in one spot) or space-dependent (e.g. crop yields in an area) or otherwise.

The cumulative frequency MXr of a reference value Xr is the frequency by which the observed values of X are less than or equal to Xr.

The relative cumulative frequency Fc can be calculated from:

Fc = MXr / N

where N is the number of data

Briefly this expression can be noted as:

Fc = M / N

When Xr = Xmin, where Xmin is the unique minimum value observed, it is found that Fc = 1/N, because M = 1. On the other hand, when Xr = Xmax, where Xmax is the unique maximum value observed, it is found that Fc = 1, because M = N. Hence, Fc = 1 signifies that all data are less than or equal to Xr.

In percentage the equation reads:

Fc (%) = 100 M / N
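
As an illustration of these definitions, the following sketch (with a hypothetical data record and a hypothetical reference value Xr) computes the cumulative frequency M, the relative cumulative frequency Fc, and its percentage form:

```python
import numpy as np

# Hypothetical record of N observations (e.g. annual rainfall in mm)
X = np.array([642, 701, 578, 830, 615, 755, 690, 580, 720, 660])
N = len(X)

Xr = 700                   # hypothetical reference value
M = np.sum(X <= Xr)        # cumulative frequency: number of observations <= Xr
Fc = M / N                 # relative cumulative frequency
print(f"M = {M}, Fc = {Fc:.2f}, Fc(%) = {100 * M / N:.0f}%")
```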

Probability estimate

From cumulative frequency

The cumulative probability Pc of X to be smaller than or equal to Xr can be estimated in several ways on the basis of the cumulative frequency M.

One way is to use the relative cumulative frequency Fc as an estimate.

Another way is to take into account the possibility that in rare cases X may assume values larger than the observed maximum Xmax. This can be done by dividing the cumulative frequency M by N+1 instead of N. The estimate then becomes:

Pc = M / (N+1)

There exist also other proposals for the denominator (see plotting positions).

By ranking technique

Figure: Ranked cumulative probabilities.

The estimation of probability is made easier by ranking the data.

When the observed data of X are arranged in ascending order (X1 ≤ X2 ≤ X3 ≤ ⋯ ≤ XN, the minimum first and the maximum last), and Ri is the rank number of the observation Xi, where the subscript i indicates the serial number in the range of ascending data, then the cumulative probability may be estimated by:

Pc = Ri / (N + 1)

When, on the other hand, the observed data from X are arranged in descending order, the maximum first and the minimum last, and Rj is the rank number of the observation Xj, the cumulative probability may be estimated by:

Pc = 1 − Rj / (N + 1)
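
The ranking procedure can be sketched as follows (hypothetical data; the probability estimate uses the R/(N + 1) plotting position introduced above, and the descending-order formula is checked against the ascending-order one):

```python
import numpy as np

# Hypothetical observations
X = np.array([642, 701, 578, 830, 615, 755, 690, 580, 720, 660])
N = len(X)

# Ascending order: rank 1 is the minimum, rank N the maximum
X_sorted = np.sort(X)
R = np.arange(1, N + 1)          # rank numbers Ri
Pc = R / (N + 1)                 # estimated cumulative probability

for x, p in zip(X_sorted, Pc):
    print(f"X = {x:4d}   Pc = {p:.3f}")

# Descending order (rank 1 is the maximum) gives the same estimates via Pc = 1 - Rj/(N + 1)
Pc_desc = 1 - np.arange(1, N + 1) / (N + 1)
assert np.allclose(np.sort(Pc_desc), Pc)
```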

Fitting of probability distributions

Continuous distributions

Figure: Different cumulative normal probability distributions with their parameters.

To present the cumulative frequency distribution as a continuous mathematical equation instead of a discrete set of data, one may try to fit the cumulative frequency distribution to a known cumulative probability distribution. [2] [3]
If successful, the known equation is enough to report the frequency distribution and a table of data will not be required. Further, the equation helps interpolation and extrapolation. However, care should be taken when extrapolating a cumulative frequency distribution, because this may be a source of errors. One possible error is that the frequency distribution no longer follows the selected probability distribution beyond the range of the observed data.

Any equation that gives the value 1 when integrated from a lower limit to an upper limit agreeing well with the data range can be used as a probability distribution for fitting. A sample of probability distributions that may be used can be found in probability distributions.

Probability distributions can be fitted by several methods, [2] for example the parametric method, in which the parameters (such as the mean and standard deviation) are estimated directly from the data, and the regression method, in which the parameters are obtained from a linear regression of transformed data on plotting positions.

Application of both types of methods often shows that a number of distributions fit the data well and do not yield significantly different results, while the differences between them may be small compared to the width of the confidence interval. [2] This illustrates that it may be difficult to determine which distribution gives better results. For example, approximately normally distributed data sets can be fitted to a large number of different probability distributions, [4] while negatively skewed data sets can be fitted to square normal and mirrored Gumbel distributions. [5]
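
The comparison of candidate distributions can be sketched as follows. The data are hypothetical, the fit uses maximum likelihood as implemented in scipy.stats (only one of the possible fitting methods), and the goodness of fit is summarized by a simple root-mean-square difference from the ranked plotting positions:

```python
import numpy as np
from scipy import stats

# Hypothetical record (e.g. annual maximum discharges)
X = np.array([642, 701, 578, 830, 615, 755, 690, 580, 720, 660,
              805, 633, 597, 744, 668, 711, 590, 625, 780, 655])
N = len(X)

# Empirical cumulative probabilities from ranking (plotting position R/(N + 1))
X_sorted = np.sort(X)
Pc_emp = np.arange(1, N + 1) / (N + 1)

# Fit two candidate distributions by maximum likelihood and compare
for dist in (stats.norm, stats.gumbel_r):
    params = dist.fit(X_sorted)
    Pc_fit = dist.cdf(X_sorted, *params)
    rmse = np.sqrt(np.mean((Pc_fit - Pc_emp) ** 2))
    print(f"{dist.name:10s} parameters = {np.round(params, 1)}  RMSE = {rmse:.3f}")
```

Often both candidates yield a similarly small error, illustrating the point that several distributions may describe the same record almost equally well.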

Figure: Cumulative frequency distribution with a discontinuity.

Discontinuous distributions

Sometimes it is possible to fit one type of probability distribution to the lower part of the data range and another type to the higher part, separated by a breakpoint, whereby the overall fit is improved.

The figure gives an example of a useful introduction of such a discontinuous distribution for rainfall data in northern Peru, where the climate is subject to the behavior of the Pacific Ocean current El Niño. When El Niño extends to the south of Ecuador and enters the ocean along the coast of Peru, the climate in northern Peru becomes tropical and wet. When El Niño does not reach Peru, the climate is semi-arid. For this reason, the higher rainfalls follow a different frequency distribution than the lower rainfalls. [6]
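
A minimal sketch of such a discontinuous fit is given below. The data, the breakpoint and the distributions chosen for the two branches are all hypothetical, and the branches are simply weighted by the fraction of data on either side of the breakpoint; other combination schemes are possible:

```python
import numpy as np
from scipy import stats

# Hypothetical annual rainfall record (mm) with two climatic regimes
rain = np.sort(np.array([45, 60, 30, 55, 70, 40, 65, 50, 35, 58,
                         480, 530, 610, 450, 570]))
N = len(rain)

breakpoint = 200.0                    # hypothetical breakpoint between regimes
lower = rain[rain <= breakpoint]
upper = rain[rain > breakpoint]
p_lower = len(lower) / N              # weight of the lower branch

# Fit a separate distribution to each branch (the choices are illustrative)
loc1, scale1 = stats.expon.fit(lower)
loc2, scale2 = stats.norm.fit(upper)

def cumulative_probability(x):
    """Composite cumulative probability with a discontinuity at the breakpoint."""
    x = np.asarray(x, dtype=float)
    low = p_lower * stats.expon.cdf(x, loc1, scale1)
    high = p_lower + (1 - p_lower) * stats.norm.cdf(x, loc2, scale2)
    return np.where(x <= breakpoint, low, high)

print(cumulative_probability([50, 150, 500]))
```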

Prediction

Uncertainty

When a cumulative frequency distribution is derived from a record of data, it can be questioned whether it can be used for predictions. [7] For example, given a distribution of river discharges for the years 1950–2000, can this distribution be used to predict how often a certain river discharge will be exceeded in the years 2000–2050? The answer is yes, provided that the environmental conditions do not change. If the environmental conditions do change, such as alterations in the infrastructure of the river's watershed or in the rainfall pattern due to climatic changes, the prediction on the basis of the historical record is subject to a systematic error. Even when there is no systematic error, there may be a random error, because by chance the observed discharges during 1950–2000 may have been higher or lower than normal, while the discharges from 2000 to 2050 may by chance be lower or higher than normal. Issues around this have been explored in the book The Black Swan.

Confidence intervals

Figure: Binomial distributions for Pc = 0.1 (blue), 0.5 (green) and 0.8 (red) in a sample of size N = 20. The distribution is symmetrical only when Pc = 0.5.
Figure: 90% binomial confidence belts on a log scale.

Probability theory can help to estimate the range in which the random error may be. In the case of cumulative frequency there are only two possibilities: a certain reference value X is exceeded or it is not exceeded. The sum of frequency of exceedance and cumulative frequency is 1 or 100%. Therefore, the binomial distribution can be used in estimating the range of the random error.

For large N, the binomial distribution can be approximated by a normal distribution, and its standard deviation Sd can be calculated as follows:

Sd = √( Pc (1 − Pc) / N )

where Pc is the cumulative probability and N is the number of data. It is seen that the standard deviation Sd decreases as the number of observations N increases.

The determination of the confidence interval of Pc makes use of Student's t-distribution (t). The value of t depends on the number of data and the confidence level of the estimate of the confidence interval. Then, the lower (L) and upper (U) confidence limits of Pc in a symmetrical distribution are found from:

L = Pc − t·Sd
U = Pc + t·Sd

This is known as the Wald interval. [8] However, the binomial distribution is only symmetrical around the mean when Pc = 0.5; it becomes asymmetrical and more and more skewed as Pc approaches 0 or 1. Therefore, by approximation, Pc and 1−Pc can be used as weight factors in the assignment of t·Sd to L and U:

L = Pc − 2·Pc·t·Sd
U = Pc + 2·(1−Pc)·t·Sd

where it can be seen that for Pc = 0.5 these expressions reduce to the previous ones.

Example
N = 25, Pc = 0.8, Sd = 0.08, confidence level is 90%, t = 1.71, L = 0.58, U = 0.85
Thus, with 90% confidence, it is found that 0.58 < Pc < 0.85
Still, there is a 10% chance that Pc < 0.58 or Pc > 0.85
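
The numerical example above can be reproduced with the following sketch; the degrees of freedom for Student's t are assumed to be N − 1, which yields the values quoted in the example:

```python
import numpy as np
from scipy import stats

N = 25          # number of observations
Pc = 0.8        # estimated cumulative probability
conf = 0.90     # confidence level

Sd = np.sqrt(Pc * (1 - Pc) / N)                  # binomial standard deviation
t = stats.t.ppf(1 - (1 - conf) / 2, df=N - 1)    # two-sided Student t value

# Asymmetric (weighted) confidence limits
L = Pc - 2 * Pc * t * Sd
U = Pc + 2 * (1 - Pc) * t * Sd
print(f"Sd = {Sd:.2f}, t = {t:.2f}, L = {L:.2f}, U = {U:.2f}")
# -> Sd = 0.08, t = 1.71, L = 0.58, U = 0.85
```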


Return period

Figure: Return periods and confidence belt. The curve of the return periods increases exponentially.

The cumulative probability Pc can also be called probability of non-exceedance. The probability of exceedance Pe (also called survival function) is found from:

Pe = 1 − Pc

The return period T is defined as:

T = 1/Pe

and indicates the expected number of observations that have to be made before the value of the variable under study again exceeds the value used for T.
The upper (TU) and lower (TL) confidence limits of return periods can be found respectively as:

TU = 1/(1−U)
TL = 1/(1−L)

For extreme values of the variable under study, U is close to 1 and small changes in U produce large changes in TU. Hence, the estimated return period of extreme values is subject to a large random error. Moreover, the confidence intervals found hold for long-term prediction. For predictions over a shorter run, the confidence intervals U−L and TU−TL may actually be wider. Together with the limited certainty (less than 100%) used in the t-test, this explains why, for example, a 100-year rainfall might occur twice in 10 years.
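
Continuing the numerical example from the confidence-interval section (same assumed values of N, Pc and the confidence level), the return period and its confidence limits can be computed as in the following sketch:

```python
import numpy as np
from scipy import stats

N, Pc, conf = 25, 0.8, 0.90

Sd = np.sqrt(Pc * (1 - Pc) / N)
t = stats.t.ppf(1 - (1 - conf) / 2, df=N - 1)
L = Pc - 2 * Pc * t * Sd
U = Pc + 2 * (1 - Pc) * t * Sd

Pe = 1 - Pc         # probability of exceedance
T = 1 / Pe          # return period
TL = 1 / (1 - L)    # lower confidence limit of the return period
TU = 1 / (1 - U)    # upper confidence limit of the return period
print(f"T = {T:.1f}, TL = {TL:.1f}, TU = {TU:.1f}")
# -> T = 5.0, TL = 2.4, TU = 6.9: the interval is markedly asymmetric around T
```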

Figure: Nine return-period curves of 50-year samples from a theoretical 1000-year record (base line).

The strict notion of return period actually has a meaning only when it concerns a time-dependent phenomenon, like point rainfall. The return period then corresponds to the expected waiting time until the exceedance occurs again. The return period has the same dimension as the time for which each observation is representative. For example, when the observations concern daily rainfalls, the return period is expressed in days, and for yearly rainfalls it is in years.

Need for confidence belts

The figure shows the variation that may occur when obtaining samples of a variate that follows a certain probability distribution. The data were provided by Benson. [1]

The confidence belt around an experimental cumulative frequency or return period curve gives an impression of the region in which the true distribution may be found.

Also, it clarifies that the experimentally found best-fitting probability distribution may deviate from the true distribution.

Histogram

Figure: Histogram derived from the adapted cumulative probability distribution.
Figure: Histogram and probability density function, derived from the cumulative probability distribution, for a logistic distribution.

The observed data can be arranged in classes or groups with serial number k. Each group has a lower limit (Lk) and an upper limit (Uk). When the class (k) contains mk data and the total number of data is N, then the relative class or group frequency is found from:

Fg(Lk < X ≤ Uk) = mk / N

or briefly:

Fgk = mk / N

or in percentage:

Fgk (%) = 100 mk / N

The presentation of all class frequencies gives a frequency distribution, or histogram. Histograms, even when made from the same record, are different for different class limits.

The histogram can also be derived from the fitted cumulative probability distribution:

Pgk = Pc(Uk) − Pc(Lk)

There may be a difference between Fgk and Pgk due to the deviations of the observed data from the fitted distribution (see blue figure).
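
A sketch of this comparison between observed class frequencies Fgk and class probabilities Pgk from a fitted distribution is given below. The data, the class limits and the choice of a normal distribution are hypothetical; note that np.histogram uses half-open classes, which differs slightly from the (Lk, Uk] convention used above:

```python
import numpy as np
from scipy import stats

# Hypothetical data and a fitted normal distribution
X = np.array([642, 701, 578, 830, 615, 755, 690, 580, 720, 660,
              805, 633, 597, 744, 668, 711, 590, 625, 780, 655])
loc, scale = stats.norm.fit(X)

# Class (group) limits Lk, Uk
edges = np.array([550, 625, 700, 775, 850])

# Observed relative class frequencies Fgk = mk / N
mk, _ = np.histogram(X, bins=edges)
Fgk = mk / len(X)

# Class probabilities from the fitted distribution: Pgk = Pc(Uk) - Pc(Lk)
Pgk = np.diff(stats.norm.cdf(edges, loc, scale))

for k in range(len(mk)):
    print(f"class ({edges[k]}, {edges[k+1]}]: Fgk = {Fgk[k]:.2f}, Pgk = {Pgk[k]:.2f}")
```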

Often it is desired to combine the histogram with a probability density function as depicted in the black and white picture.

See also

CumFreq
Probability distribution fitting
Binomial proportion confidence interval

References

  1. Benson, M.A. 1960. Characteristics of frequency curves based on a theoretical 1000-year record. In: T. Dalrymple (ed.), Flood Frequency Analysis. U.S. Geological Survey Water Supply Paper 1543-A, pp. 51–71.
  2. Frequency and Regression Analysis. Chapter 6 in: H.P. Ritzema (ed.), 1994. Drainage Principles and Applications, Publ. 16, pp. 175–224. International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. ISBN 90-70754-33-9.
  3. David Vose, Fitting distributions to data.
  4. Example of an approximately normally distributed data set to which a large number of different probability distributions can be fitted.
  5. Left (negatively) skewed frequency histograms can be fitted to square normal or mirrored Gumbel probability functions.
  6. CumFreq, a program for cumulative frequency analysis with confidence bands, return periods, and a discontinuity option.
  7. Silvia Masciocchi, 2012. Statistical Methods in Particle Physics, Lecture 11, Winter Semester 2012/13, GSI Darmstadt.
  8. Wald, A.; Wolfowitz, J. (1939). "Confidence limits for continuous distribution functions". The Annals of Mathematical Statistics. 10 (2): 105–118. doi:10.1214/aoms/1177732209.
  9. Ghosh, B.K. (1979). "A comparison of some approximate confidence intervals for the binomial parameter". Journal of the American Statistical Association. 74 (368): 894–900. doi:10.1080/01621459.1979.10481051.
  10. Blyth, C.R.; Still, H.A. (1983). "Binomial confidence intervals". Journal of the American Statistical Association. 78 (381): 108–116. doi:10.1080/01621459.1983.10477938.
  11. Agresti, A.; Caffo, B. (2000). "Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures". The American Statistician. 54 (4): 280–288. doi:10.1080/00031305.2000.10474560. S2CID 18880883.
  12. Wilson, E.B. (1927). "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association. 22 (158): 209–212. doi:10.1080/01621459.1927.10502953.
  13. Hogg, R.V. (2001). Probability and Statistical Inference (6th ed.). Upper Saddle River, NJ: Prentice Hall.