Zero-inflated model

Last updated April 14, 2024

In statistics, a zero-inflated model is a statistical model based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations.

Introduction to Zero-Inflated Models

Zero-inflated models are commonly used in the analysis of count data, such as the number of visits a patient makes to the emergency room in one year, or the number of fish caught in one day in one lake.^[1] Count data can take values of 0, 1, 2, … (non-negative integer values).^[2] Other examples of count data are the number of hits recorded by a Geiger counter in one minute, patient days in the hospital, goals scored in a soccer game,^[3] and the number of episodes of hypoglycemia per year for a patient with diabetes.^[4]

For statistical analysis, the distribution of the counts is often represented using a Poisson distribution or a negative binomial distribution. Hilbe ^[3] notes that "Poisson regression is traditionally conceived of as the basic count model upon which a variety of other count models are based." In a Poisson model, "… the random variable $y$ is the count response and parameter $\lambda$ (lambda) is the mean. Often, $\lambda$ is also called the rate or intensity parameter… In statistical literature, $\lambda$ is also expressed as $\mu$ (mu) when referring to Poisson and traditional negative binomial models."

In some data, the number of zeros is greater than would be expected under a Poisson distribution or a negative binomial distribution. Data with such an excess of zero counts are described as zero-inflated.^[4]
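As a rough illustration of what "more zeros than expected" means, the following minimal sketch compares the observed proportion of zeros with $e^{-m}$, the zero probability of a Poisson distribution whose mean equals the sample mean $m$. The function name `poisson_zero_excess` and the sample values are hypothetical and are not taken from the cited sources.

```python
import math

def poisson_zero_excess(counts):
    """Return (observed zero proportion, zero proportion expected under Poisson)."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)  # Pr(Y = 0) for a Poisson with the sample mean
    return observed, expected

# Hypothetical sample: 6 zeros among 10 observations.
sample = [0, 0, 0, 0, 0, 0, 2, 3, 4, 6]
observed, expected = poisson_zero_excess(sample)
print(f"observed zeros: {observed:.2f}, Poisson-expected zeros: {expected:.2f}")
```

An observed proportion well above the expected one suggests zero inflation, although this comparison is only a descriptive check and not a formal test.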

[Figure: example histograms of zero-inflated Poisson distributions with mean $\mu$ of 5 or 10 and proportion of zero inflation $\pi$ of 0.2 or 0.5, based on the R program ZeroInflPoiDistPlots.R from Bilder and Loughin.^[1]]

Examples of Zero-inflated count data

Fish counts.^[1] "… suppose we recorded the number of fish caught on various lakes in 4-hour fishing trips to Minnesota. Some lakes in Minnesota are too shallow for fish to survive the winter, so fishing in those lakes will yield no catch. On the other hand, even on a lake where fish are plentiful, we may or may not catch any fish due to conditions or our own competence. Thus, the number of fish caught will be zero if the lake does not support fish, and will be zero, one or more if it does."
Number of wisdom teeth extracted.^[5] The number of wisdom teeth that a person has had extracted can range from 0 to 4. Some individuals, about one-third of the population, do not have any wisdom teeth. For these individuals, the number of wisdom teeth extracted will always be zero. For other individuals, the number extracted will be between 0 and 4, where a 0 indicates that the subject has not yet, and may never, have any of their 4 wisdom teeth extracted.
Publications by PhD candidates.^[6] Long examined the number of publications by 915 doctoral candidates in biochemistry in the last three years of their PhD studies. The proportion of candidates with zero publications exceeded the proportion predicted by a Poisson model. "Long ^[6] argued that the PhD candidates might fall into two distinct groups: 'publishers' (perhaps striving for an academic career) and 'non-publishers' (seeking other career paths). One reasonable form of explanation is that the observed zero counts reflect a mixture of the two latent classes – those who simply have not yet published and those who will likely never publish."^[7]

Zero-inflated data as a mixture of two distributions

As the examples above show, zero-inflated data can arise as a mixture of two distributions. The first distribution generates zeros. The second distribution, which may be a Poisson distribution, a negative binomial distribution or another count distribution, generates counts, some of which may be zeros.^[7]

In the statistical literature, different authors use different names to distinguish the zeros from the two distributions. Some authors describe zeros generated by the first (binary) distribution as "structural" and zeros generated by the second (count) distribution as "random".^[7] Other authors use the terminology "immune" and "susceptible" for the binary and count zeros, respectively.^[1]
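The two sources of zeros can be made concrete with a small simulation. The sketch below assumes a Poisson count component; the parameter values, the function names `sample_poisson` and `sample_zip`, and the use of Knuth's multiplication method for Poisson sampling are illustrative choices rather than anything prescribed by the cited sources.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_zip(rng, pi, lam):
    """Draw (count, is_structural_zero) from a zero-inflated Poisson mixture."""
    if rng.random() < pi:
        return 0, True                      # zero from the binary distribution
    return sample_poisson(rng, lam), False  # count, which may itself be zero

rng = random.Random(0)                      # fixed seed for reproducibility
draws = [sample_zip(rng, pi=0.3, lam=2.0) for _ in range(10_000)]
structural = sum(1 for _, is_structural in draws if is_structural)
random_zeros = sum(1 for y, is_structural in draws if y == 0 and not is_structural)
print(f"structural zeros: {structural}, random zeros: {random_zeros}")
```

In real data only the total number of zeros is observed; the label distinguishing structural from random zeros is latent, which is why the simulation has to track it explicitly.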

Zero-inflated Poisson

One well-known zero-inflated model is Diane Lambert's zero-inflated Poisson model, which describes counts of a random event per unit time that contain an excess of zeros.^[8] For example, the number of insurance claims within a population for a certain type of risk would be zero-inflated by those people who have not taken out insurance against the risk and thus are unable to claim. The zero-inflated Poisson (ZIP) model mixes two zero-generating processes. The first process generates only zeros. The second process is governed by a Poisson distribution that generates counts, some of which may be zero. The mixture distribution is described as follows:

\Pr(Y=0)=\pi +(1-\pi )e^{-\lambda }

\Pr(Y=y_{i})=(1-\pi ){\frac {\lambda ^{y_{i}}e^{-\lambda }}{y_{i}!}},\qquad y_{i}=1,2,3,...

where the outcome variable $y_{i}$ takes non-negative integer values, $\lambda$ is the expected count of the Poisson component, and $\pi$ is the probability of an extra zero.

The mean is $(1-\pi )\lambda$ and the variance is $\lambda (1-\pi )(1+\pi \lambda )$ .
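The mass function and the two moment formulas can be checked numerically. The following sketch evaluates the ZIP probabilities for the illustrative values $\pi = 0.2$ and $\lambda = 5$ and sums over a truncated support; the function name `zip_pmf` and the truncation point of 100 are assumptions made for this example.

```python
import math

def zip_pmf(y, pi, lam):
    """Probability mass function of the zero-inflated Poisson distribution."""
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

pi, lam = 0.2, 5.0
support = range(100)  # the tail beyond 100 is negligible for lam = 5
probs = [zip_pmf(y, pi, lam) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

# Compare with the closed forms (1 - pi) * lam and lam * (1 - pi) * (1 + pi * lam).
print(total, mean, (1 - pi) * lam, variance, lam * (1 - pi) * (1 + pi * lam))
```

Because the variance exceeds the mean whenever $\pi > 0$, a ZIP distribution is overdispersed relative to a Poisson distribution with the same mean.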

Estimators of ZIP parameters

The method of moments estimators are given by^[9]

{\hat {\lambda }}_{mo}={\frac {s^{2}+m^{2}}{m}}-1,

{\hat {\pi }}_{mo}={\frac {s^{2}-m}{s^{2}+m^{2}-m}},

where $m$ is the sample mean and $s^{2}$ is the sample variance.
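A minimal sketch of these estimators is given below. It assumes that $s^{2}$ is the sample variance with divisor $n-1$, which is what `statistics.variance` computes; the source does not state which divisor is intended. The function name and the data are hypothetical.

```python
from statistics import fmean, variance  # variance() divides by n - 1

def zip_moment_estimates(counts):
    """Method of moments estimates (lambda_hat, pi_hat) for a ZIP sample."""
    m = fmean(counts)
    s2 = variance(counts)
    lam_hat = (s2 + m * m) / m - 1
    pi_hat = (s2 - m) / (s2 + m * m - m)
    return lam_hat, pi_hat

# Hypothetical sample with an excess of zeros.
sample = [0, 0, 0, 0, 0, 0, 2, 3, 4, 6]
lam_hat, pi_hat = zip_moment_estimates(sample)
print(f"lambda_hat = {lam_hat:.4f}, pi_hat = {pi_hat:.4f}")
```

By construction the estimates reproduce the sample mean, $(1-\hat{\pi})\hat{\lambda} = m$. When $s^{2} \leq m$ the estimate of $\pi$ is zero or negative, which indicates that the sample shows no zero inflation relative to a Poisson distribution.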

The maximum likelihood estimator^[10] can be found by solving the following equation

m(1-e^{-{\hat {\lambda }}_{ml}})={\hat {\lambda }}_{ml}\left(1-{\frac {n_{0}}{n}}\right),

where ${\frac {n_{0}}{n}}$ is the observed proportion of zeros.

A closed form solution of this equation is given by^[11]

{\hat {\lambda }}_{ml}=W_{0}(-se^{-s})+s

with $W_{0}$ being the main branch of Lambert's W-function^[12] and

s={\frac {m}{1-{\frac {n_{0}}{n}}}}.

Alternatively, the equation can be solved by iteration.^[13]

The maximum likelihood estimator for $\pi$ is given by

{\hat {\pi }}_{ml}=1-{\frac {m}{{\hat {\lambda }}_{ml}}}.
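The iterative route can be sketched with a plain fixed-point iteration. Dividing the likelihood equation by $1-n_{0}/n$ gives $\hat{\lambda} = s\,(1-e^{-\hat{\lambda}})$ with $s$ as defined above, and this map can be iterated from the starting value $s$. This particular scheme is an illustrative choice and not necessarily the iteration used in the cited reference; the function name, tolerance, and data are hypothetical.

```python
import math

def zip_mle(counts, tol=1e-12, max_iter=10_000):
    """Maximum likelihood estimates (lambda_hat, pi_hat) by fixed-point iteration."""
    n = len(counts)
    n0 = sum(1 for c in counts if c == 0)
    if n0 == n:
        raise ValueError("all observations are zero")
    m = sum(counts) / n
    s = m / (1 - n0 / n)  # mean of the non-zero observations
    if s <= 1:
        raise ValueError("no positive solution: non-zero counts must average above 1")
    lam = s               # starting value
    for _ in range(max_iter):
        new_lam = s * (1 - math.exp(-lam))
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, 1 - m / lam

# Hypothetical sample with an excess of zeros.
sample = [0, 0, 0, 0, 0, 0, 2, 3, 4, 6]
lam_ml, pi_ml = zip_mle(sample)
print(f"lambda_ml = {lam_ml:.4f}, pi_ml = {pi_ml:.4f}")
```

The result should agree with the closed form $W_{0}(-se^{-s})+s$ evaluated by any Lambert W implementation. The estimate of $\pi$ can come out negative when the sample contains fewer zeros than a Poisson distribution would predict.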

Related models

In 1994, Greene considered the zero-inflated negative binomial (ZINB) model.^[14] Daniel B. Hall adapted Lambert's methodology to an upper-bounded count situation, thereby obtaining a zero-inflated binomial (ZIB) model.^[15]

Discrete pseudo compound Poisson model

If the count data $Y$ is such that the probability of zero is larger than the probability of nonzero, namely

\Pr(Y=0)>0.5

then the discrete data $Y$ obey a discrete pseudo compound Poisson distribution.^[16]

In fact, let $G(z)=\sum \limits _{n=0}^{\infty }P(Y=n)z^{n}$ be the probability generating function of $Y$, and write $p_{n}=\Pr(Y=n)$. If $p_{0}>0.5$ , then for $|z|\leq 1$ we have $|G(z)|\geqslant p_{0}-\sum \limits _{i=1}^{\infty }p_{i}=2p_{0}-1>0$ . It then follows from the Wiener–Lévy theorem^[17] that $G(z)$ is the probability generating function of a discrete pseudo compound Poisson distribution.
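The lower bound on $|G(z)|$ can be checked numerically for a specific distribution. The sketch below uses a zero-inflated Poisson variable, whose probability generating function $\pi +(1-\pi )e^{\lambda (z-1)}$ follows from the mixture form, and evaluates it on a grid of points on the unit circle. The parameter values and the grid size are arbitrary choices for illustration.

```python
import cmath
import math

def zip_pgf(z, pi, lam):
    """Probability generating function of a zero-inflated Poisson variable."""
    return pi + (1 - pi) * cmath.exp(lam * (z - 1))

pi, lam = 0.6, 2.0                 # hypothetical values giving Pr(Y = 0) > 0.5
p0 = zip_pgf(0, pi, lam).real      # G(0) = Pr(Y = 0)
lower_bound = 2 * p0 - 1

# Evaluate |G(z)| at 360 evenly spaced points on the unit circle |z| = 1.
min_modulus = min(
    abs(zip_pgf(cmath.exp(2j * math.pi * k / 360), pi, lam)) for k in range(360)
)
print(f"p0 = {p0:.4f}, bound = {lower_bound:.4f}, min |G(z)| = {min_modulus:.4f}")
```

A grid evaluation only illustrates the inequality at the sampled points; the general statement rests on the triangle-inequality argument given above.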

We say that a discrete random variable $Y$ satisfying the probability generating function characterization

G_{Y}(z)=\sum \limits _{n=0}^{\infty }P(Y=n)z^{n}=\exp \left(\sum _{k=1}^{\infty }\alpha _{k}\lambda (z^{k}-1)\right),\quad (|z|\leq 1)

has a discrete pseudo compound Poisson distribution with parameters

(\lambda _{1},\lambda _{2},\ldots )=(\alpha _{1}\lambda ,\alpha _{2}\lambda ,\ldots )\in \mathbb {R} ^{\infty }\left(\sum _{k=1}^{\infty }\alpha _{k}=1,\sum \limits _{k=1}^{\infty }|\alpha _{k}|<\infty ,\alpha _{k}\in \mathbb {R} ,\lambda >0\right).

When all the $\alpha _{k}$ are non-negative, it is the discrete compound Poisson distribution (non-Poisson case), which has the overdispersion property.

Software

Zero-inflated models can be fitted with the R packages pscl, glmmTMB and brms.

References

[1] Bilder, Christopher; Loughin, Thomas (2015), Analysis of Categorical Data with R (First ed.), CRC Press / Chapman & Hall, ISBN 978-1439855676
[2] Hilbe, Joseph M. (2014), Modeling Count Data (First ed.), Cambridge University Press, ISBN 978-1107611252
[3] Hilbe, Joseph M. (2007), Negative Binomial Regression (Second ed.), Cambridge University Press, ISBN 978-0521198158
[4] Lachin, John M. (2011), Biostatistical Methods: The Assessment of Relative Risks (Second ed.), Wiley, ISBN 978-0470508220
[5] "Biostatistics II. 1.3 - Zero-inflated Models". YouTube. Retrieved July 1, 2022.
[6] Long, J. Scott (1997), Regression Models for Categorical and Limited Dependent Variables (First ed.), Sage Publications, ISBN 978-0803973749
[7] Friendly, Michael; Meyer, David (2016), Discrete Data Analysis with R (First ed.), CRC Press / Chapman & Hall, ISBN 978-1498725835
[8] Lambert, Diane (1992). "Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing". Technometrics. 34 (1): 1–14. doi:10.2307/1269547. JSTOR 1269547.
[9] Beckett, Sadie; Jee, Joshua; Ncube, Thalepo; Washington, Quintel; Singh, Anshuman; Pal, Nabendu (2014). "Zero-inflated Poisson (ZIP) distribution: parameter estimation and applications to model data from natural calamities". Involve. 7 (6): 751–767. doi:10.2140/involve.2014.7.751.
[10] Johnson, Norman L.; Kotz, Samuel; Kemp, Adrienne W. (1992). Univariate Discrete Distributions (2nd ed.). Wiley. pp. 312–314. ISBN 978-0-471-54897-3.
[11] Dencks, Stefanie; Piepenbrock, Marion; Schmitz, Georg (2020). "Assessing Vessel Reconstruction in Ultrasound Localization Microscopy by Maximum-Likelihood Estimation of a Zero-Inflated Poisson Model". IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control. doi:10.1109/TUFFC.2020.2980063.
[12] Corless, R. M.; Gonnet, G. H.; Hare, D. E. G.; Jeffrey, D. J.; Knuth, D. E. (1996). "On the Lambert W Function". Advances in Computational Mathematics. 5 (1): 329–359. arXiv:1809.07369. doi:10.1007/BF02124750.
[13] Böhning, Dankmar; Dietz, Ekkehart; Schlattmann, Peter; Mendonca, Lisette; Kirchner, Ursula (1999). "The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology". Journal of the Royal Statistical Society, Series A. 162 (2): 195–209. doi:10.1111/1467-985x.00130.
[14] Greene, William H. (1994). "Some Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models". Working Paper EC-94-10: Department of Economics, New York University. SSRN 1293115.
[15] Hall, Daniel B. (2000). "Zero-Inflated Poisson and Binomial Regression with Random Effects: A Case Study". Biometrics. 56 (4): 1030–1039. doi:10.1111/j.0006-341X.2000.01030.x.
[16] Huiming, Zhang; Yunxiao Liu; Bo Li (2014). "Notes on discrete compound Poisson model with applications to risk theory". Insurance: Mathematics and Economics. 59: 325–336. doi:10.1016/j.insmatheco.2014.09.012.
[17] Zygmund, A. (2002). Trigonometric Series. Cambridge: Cambridge University Press. p. 245.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
