Probability plot correlation coefficient plot

The probability plot correlation coefficient (PPCC) plot is a graphical technique for identifying the shape parameter for a distributional family that best describes the data set. This technique is appropriate for families, such as the Weibull, that are defined by a single shape parameter together with location and scale parameters. It is not appropriate, or even possible, for distributions such as the normal that are defined only by location and scale parameters.

Many statistical analyses are based on distributional assumptions about the population from which the data have been obtained. However, distributional families can have radically different shapes depending on the value of the shape parameter. Therefore, finding a reasonable choice for the shape parameter is a necessary step in the analysis. In many analyses, finding a good distributional model for the data is the primary focus of the analysis.

In short, the technique is to plot the probability plot correlation coefficients for different values of the shape parameter and choose the value that yields the best fit.

Definition

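The PPCC plot is formed by:

  1. Vertical axis: probability plot correlation coefficient
  2. Horizontal axis: value of the shape parameter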

That is, for a series of values of the shape parameter, the correlation coefficient is computed for the probability plot associated with each value. These correlation coefficients are plotted against their corresponding shape parameters. The maximum correlation coefficient corresponds to the optimal value of the shape parameter. For better precision, two iterations of the PPCC plot can be generated: the first finds the right neighborhood and the second fine-tunes the estimate.
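The scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the Weibull family, Filliben's approximation to the uniform order statistic medians as plotting positions, and a simple evenly spaced grid. The function names (`weibull_ppcc`, `ppcc_scan`), the grid ranges, and the synthetic sample are all hypothetical choices made for the example.

```python
import math
import random


def uniform_order_medians(n):
    """Filliben's approximation to the medians of uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def weibull_ppcc(data, shape):
    """Correlation of the Weibull probability plot for one shape value."""
    ordered = sorted(data)
    probs = uniform_order_medians(len(ordered))
    # Standard Weibull quantile function (location 0, scale 1).
    quantiles = [(-math.log(1.0 - p)) ** (1.0 / shape) for p in probs]
    return pearson_r(quantiles, ordered)


def ppcc_scan(data, lo, hi, steps=50):
    """Return (shape, ppcc) pairs on an evenly spaced grid from lo to hi."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return [(s, weibull_ppcc(data, s)) for s in grid]


# Synthetic sample (assumption for the demo): Weibull with scale 1, shape 2.
rng = random.Random(0)
sample = [rng.weibullvariate(1.0, 2.0) for _ in range(500)]

# First pass finds the neighborhood, second pass refines the estimate.
coarse = ppcc_scan(sample, 0.5, 10.0)
best_coarse = max(coarse, key=lambda pair: pair[1])[0]
fine = ppcc_scan(sample, max(best_coarse - 0.5, 0.05), best_coarse + 0.5)
best_shape, best_ppcc = max(fine, key=lambda pair: pair[1])
print(f"shape ~ {best_shape:.2f}, PPCC = {best_ppcc:.4f}")
```

Plotting the `(shape, ppcc)` pairs returned by `ppcc_scan` gives the PPCC plot itself; the sketch only locates its maximum.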

The PPCC plot is used first to find a good value of the shape parameter. The probability plot is then generated to find estimates of the location and scale parameters and, in addition, to provide a graphical assessment of the adequacy of the distributional fit.
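Once a shape value has been chosen, the location and scale estimates can be read off the probability plot as the intercept and slope of a line fitted to the ordered data against the theoretical quantiles. The following sketch assumes a Weibull family, Filliben plotting positions, and an ordinary least-squares line; the function name `weibull_probability_plot_fit` and the synthetic data are hypothetical and chosen only for illustration.

```python
import math


def uniform_order_medians(n):
    """Filliben's approximation to the medians of uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def weibull_probability_plot_fit(data, shape):
    """Fit ordered data = location + scale * quantile by least squares.

    Returns (location, scale, r), where r is the probability plot
    correlation coefficient for the given shape value.
    """
    ordered = sorted(data)
    n = len(ordered)
    quantiles = [(-math.log(1.0 - p)) ** (1.0 / shape)
                 for p in uniform_order_medians(n)]
    mean_q = sum(quantiles) / n
    mean_y = sum(ordered) / n
    sqq = sum((q - mean_q) ** 2 for q in quantiles)
    syy = sum((y - mean_y) ** 2 for y in ordered)
    sqy = sum((q - mean_q) * (y - mean_y) for q, y in zip(quantiles, ordered))
    scale = sqy / sqq                    # slope of the fitted line
    location = mean_y - scale * mean_q   # intercept of the fitted line
    r = sqy / math.sqrt(sqq * syy)
    return location, scale, r


# Synthetic data lying exactly on a Weibull line
# (assumed for the demo: shape 1.5, location 10, scale 3).
demo = [10.0 + 3.0 * (-math.log(1.0 - p)) ** (1.0 / 1.5)
        for p in uniform_order_medians(50)]
location, scale, r = weibull_probability_plot_fit(demo, shape=1.5)
print(f"location = {location:.3f}, scale = {scale:.3f}, r = {r:.4f}")
```

With real data the points will scatter around the line, and the size of `r` together with a visual check of the plot indicates how adequate the fit is.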

The PPCC plot answers the following questions:

  1. What is the best-fit member within a distributional family?
  2. Does the best-fit member provide a good fit (in terms of generating a probability plot with a high correlation coefficient)?
  3. Does this distributional family provide a good fit compared to other distributions?
  4. How sensitive is the choice of the shape parameter?

Comparing distributions

In addition to finding a good choice for estimating the shape parameter of a given distribution, the PPCC plot can be useful in deciding which distributional family is most appropriate. For example, given a set of reliability data, one might generate PPCC plots for the Weibull, lognormal, gamma, and inverse Gaussian distributions, and possibly others, on a single page. This one page would show the best value for the shape parameter for several distributions and would additionally indicate which of these distributional families provides the best fit (as measured by the maximum probability plot correlation coefficient). That is, if the maximum PPCC value for the Weibull is 0.99 and only 0.94 for the lognormal, then one could reasonably conclude that the Weibull family is the better choice.
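A comparison of this kind can be sketched by computing the maximum PPCC for each candidate family over a grid of shape values. The example below is illustrative only: it covers just the Weibull and lognormal families (both have closed-form or standard-library quantile functions), the shape grids are arbitrary, and the names `max_ppcc`, `compare_families`, and `FAMILIES` are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the lognormal quantile


def uniform_order_medians(n):
    """Filliben's approximation to the medians of uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def weibull_quantile(p, shape):
    return (-math.log(1.0 - p)) ** (1.0 / shape)


def lognormal_quantile(p, shape):
    # Here the shape parameter is the standard deviation of log(X).
    return math.exp(shape * STD_NORMAL.inv_cdf(p))


def max_ppcc(data, quantile, shapes):
    """Return (best_shape, best_ppcc) over the candidate shape values."""
    ordered = sorted(data)
    probs = uniform_order_medians(len(ordered))
    scored = [(s, pearson_r([quantile(p, s) for p in probs], ordered))
              for s in shapes]
    return max(scored, key=lambda pair: pair[1])


# Candidate families and shape grids (arbitrary choices for the sketch).
FAMILIES = {
    "Weibull": (weibull_quantile, [0.5 + 0.1 * k for k in range(46)]),
    "lognormal": (lognormal_quantile, [0.05 + 0.05 * k for k in range(40)]),
}


def compare_families(data):
    return {name: max_ppcc(data, q, shapes)
            for name, (q, shapes) in FAMILIES.items()}


# Synthetic "reliability" sample (assumption: Weibull, scale 100, shape 1.5).
rng = random.Random(1)
sample = [rng.weibullvariate(100.0, 1.5) for _ in range(200)]
for name, (shape, ppcc) in compare_families(sample).items():
    print(f"{name:10s} best shape = {shape:.2f}, max PPCC = {ppcc:.4f}")
```

The family with the larger maximum PPCC fits the probability plot more closely, although a small difference between the two values is not by itself decisive.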

When comparing distributional models, one should not simply choose the one with the maximum PPCC value. In many cases, several distributional fits provide comparable PPCC values. For example, a lognormal and Weibull may both fit a given set of reliability data quite well. Typically, one would consider the complexity of the distribution. That is, a simpler distribution with a marginally smaller PPCC value may be preferred over a more complex distribution. Likewise, there may be theoretical justification in terms of the underlying scientific model for preferring a distribution with a marginally smaller PPCC value in some cases. In other cases, one may not need to know if the distributional model is optimal, only that it is adequate for the purpose at hand. That is, one may be able to use techniques designed for normally distributed data even if other distributions fit the data somewhat better.

Tukey-lambda PPCC plot for symmetric distributions

The Tukey lambda PPCC plot, with shape parameter λ, is particularly useful for symmetric distributions. It indicates whether a distribution is short- or long-tailed, and it can further indicate several common distributions. Specifically,

  1. λ = −1: distribution is approximately Cauchy
  2. λ = 0: distribution is exactly logistic
  3. λ = 0.14: distribution is approximately normal
  4. λ = 0.5: distribution is U-shaped
  5. λ = 1: distribution is exactly uniform(−1, 1)

If the Tukey lambda PPCC plot gives a maximum value near 0.14, one can reasonably conclude that the normal distribution is a good model for the data. If the maximum value is less than 0.14, a long-tailed distribution such as the double exponential or logistic would be a better choice. If the maximum value is near −1, this implies a very long-tailed distribution, such as the Cauchy. If the maximum value is greater than 0.14, this implies a short-tailed distribution such as the beta or uniform.
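As a rough illustration of these rules, the sketch below computes the Tukey lambda PPCC over a grid of λ values and reports the maximizing value. It relies on the standard quantile function of the Tukey lambda distribution, Q(p) = (p^λ − (1 − p)^λ)/λ, with the logistic limit ln(p/(1 − p)) at λ = 0. The grid range, the plotting positions (Filliben's approximation), and the function names are assumptions made for the example.

```python
import math
import random


def uniform_order_medians(n):
    """Filliben's approximation to the medians of uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def tukey_lambda_quantile(p, lam):
    """Q(p) = (p**lam - (1 - p)**lam) / lam, with the logistic limit at lam = 0."""
    if abs(lam) < 1e-12:
        return math.log(p / (1.0 - p))
    return (p ** lam - (1.0 - p) ** lam) / lam


def tukey_ppcc(data, lam):
    """Probability plot correlation coefficient for one value of lambda."""
    ordered = sorted(data)
    probs = uniform_order_medians(len(ordered))
    return pearson_r([tukey_lambda_quantile(p, lam) for p in probs], ordered)


def best_lambda(data, lo=-2.0, hi=2.0, steps=80):
    """Grid value of lambda with the largest PPCC."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda lam: tukey_ppcc(data, lam))


# Synthetic sample (assumption for the demo): standard normal data.
rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
lam = best_lambda(sample)
print(f"best lambda = {lam:.2f}, PPCC = {tukey_ppcc(sample, lam):.4f}")
```

For a sample like this one, a maximizing λ in the neighborhood of 0.14 would point toward the normal distribution, while a clearly smaller or larger value would point toward longer or shorter tails respectively.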

The Tukey-lambda PPCC plot is used to suggest an appropriate distribution. One should follow up with PPCC and probability plots of the appropriate alternatives.

References

This article incorporates public domain material from the National Institute of Standards and Technology website https://www.nist.gov.