Mark and recapture

Mark and recapture is a method commonly used in ecology to estimate an animal population's size where it is impractical to count every individual. [1] A portion of the population is captured, marked, and released. Later, another portion is captured, and the number of marked individuals within the sample is counted. Since the proportion of marked individuals within the second sample should equal the proportion of marked individuals in the whole population, an estimate of the total population size can be obtained by dividing the number of marked individuals by the proportion of marked individuals in the second sample. The method assumes, whether or not this holds in practice, that the probability of capture is the same for all individuals. [2] Other names for this method, or closely related methods, include capture-recapture, capture-mark-recapture, mark-recapture, sight-resight, mark-release-recapture, multiple systems estimation, band recovery, the Petersen method, [3] and the Lincoln method.

Another major application for these methods is in epidemiology, [4] where they are used to estimate the completeness of ascertainment of disease registers. Typical applications include estimating the number of people needing particular services (e.g. services for children with learning disabilities, services for medically frail elderly living in the community), or with particular conditions (e.g. illegal drug addicts, people infected with HIV, etc.). [5]

A biologist marking a Chittenango ovate amber snail (Novisuccinea chittenangoensis) to monitor the population.

Typically a researcher visits a study area and uses traps to capture a group of individuals alive. Each of these individuals is marked with a unique identifier (e.g., a numbered tag or band), and then is released unharmed back into the environment. A mark-recapture method was first used for ecological study in 1896 by C. G. Johannes Petersen to estimate plaice (Pleuronectes platessa) populations. [2]

Sufficient time should be allowed to pass for the marked individuals to redistribute themselves among the unmarked population. [2]

Next, the researcher returns and captures another sample of individuals. Some individuals in this second sample will have been marked during the initial visit and are now known as recaptures. [6] Other organisms captured during the second visit will not have been captured during the first visit to the study area. These unmarked animals are usually given a tag or band during the second visit and then are released. [2]

Population size can be estimated from as few as two visits to the study area. Commonly, more than two visits are made, particularly if estimates of survival or movement are desired. Regardless of the total number of visits, the researcher simply records the date of each capture of each individual. The "capture histories" generated are analyzed mathematically to estimate population size, survival, or movement. [2]

When capturing and marking organisms, ecologists need to consider the welfare of the organisms. If the chosen identifier harms the organism, its behavior may become irregular.

Notation

Let

N = Number of animals in the population
n = Number of animals marked on the first visit
K = Number of animals captured on the second visit
k = Number of recaptured animals that were marked

A biologist wants to estimate the size of a population of turtles in a lake. She captures 10 turtles on her first visit to the lake, and marks their backs with paint. A week later she returns to the lake and captures 15 turtles. Five of these 15 turtles have paint on their backs, indicating that they are recaptured animals. This example is (n, K, k) = (10, 15, 5). The problem is to estimate N.

Lincoln–Petersen estimator

The Lincoln–Petersen method [7] (also known as the Petersen–Lincoln index [2] or Lincoln index) can be used to estimate population size if only two visits are made to the study area. This method assumes that the study population is "closed". In other words, the two visits to the study area are close enough in time so that no individuals die, are born, or move into or out of the study area between visits. The model also assumes that no marks fall off animals between visits to the field site by the researcher, and that the researcher correctly records all marks.

Given those conditions, the estimated population size is:

N̂ = nK / k

Derivation

It is assumed [8] that all individuals have the same probability of being captured in the second sample, regardless of whether they were previously captured in the first sample (with only two samples, this assumption cannot be tested directly).

This implies that, in the second sample, the proportion of marked individuals that are caught (k/K) should equal the proportion of the total population that is marked (n/N). For example, if half of the marked individuals were recaptured, it would be assumed that half of the total population was included in the second sample.

In symbols,

k / K = n / N

A rearrangement of this gives

N̂ = nK / k,

the formula used for the Lincoln–Petersen method. [8]

Sample calculation

In the example (n, K, k) = (10, 15, 5), the Lincoln–Petersen method estimates that there are N̂ = (10 × 15) / 5 = 30 turtles in the lake.
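The calculation is simple to script. A minimal Python sketch (the function name and layout are ours, not taken from any particular package):

```python
def lincoln_petersen(n, K, k):
    # n: animals marked on the first visit
    # K: animals captured on the second visit
    # k: marked individuals among the second capture
    return n * K / k

print(lincoln_petersen(10, 15, 5))  # 30.0
```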

Chapman estimator

The Lincoln–Petersen estimator is asymptotically unbiased as sample size approaches infinity, but is biased at small sample sizes. [9] An alternative, less biased estimator of population size is given by the Chapman estimator: [9]

N̂_C = ((K + 1)(n + 1)) / (k + 1) − 1

Sample calculation

The example (n, K, k) = (10, 15, 5) gives

N̂_C = ((15 + 1)(10 + 1)) / (5 + 1) − 1 = 176/6 − 1 ≈ 28.3

Note that the answer provided by this equation must be truncated, not rounded. Thus, the Chapman method estimates 28 turtles in the lake.
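A matching Python sketch for the Chapman estimator, with the truncation applied explicitly:

```python
import math

def chapman(n, K, k):
    # Chapman's less biased estimator; the result is truncated
    # (floored), not rounded.
    return math.floor((K + 1) * (n + 1) / (k + 1) - 1)

print(chapman(10, 15, 5))  # 28
```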

Chapman's estimate was offered as one conjecture from a range of possible estimators: "In practice, the whole number immediately less than (K+1)(n+1)/(k+1) or even Kn/(k+1) will be the estimate. The above form is more convenient for mathematical purposes." [9] (see footnote, page 144). Chapman also found that the estimator could have considerable negative bias for small Kn/N [9] (page 146), but was unconcerned because the estimated standard deviations were large for these cases.

Confidence interval

An approximate 100(1 − α)% confidence interval for the population size N can be obtained from a transformed logit interval for the recapture proportion, back-transformed to the population scale, where z_{α/2} denotes the 1 − α/2 quantile of a standard normal random variable. [10]

The example (n, K, k) = (10, 15, 5) gives the estimate N ≈ 30 with a 95% confidence interval of 22 to 65.

It has been shown that this confidence interval has actual coverage probabilities that are close to the nominal level even for small populations and extreme capture probabilities (near to 0 or 1), in which cases other confidence intervals fail to achieve the nominal coverage levels. [10]
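For illustration, a plain logit-based interval can be scripted as below. This is a simpler construction than the transformed logit interval of the cited reference [10], so its bounds for the turtle example (about 17 to 69) differ from the 22 to 65 quoted above:

```python
import math

def logit_interval(n, K, k, z=1.96):
    # Normal-theory interval on the log odds of the recapture
    # proportion p = k/K, back-transformed to the population scale
    # via N = n/p. A rough sketch only; Sadinle (2009) refines this.
    log_odds = math.log(k / (K - k))
    se = math.sqrt(1.0 / k + 1.0 / (K - k))
    p_hi = 1.0 / (1.0 + math.exp(-(log_odds + z * se)))
    p_lo = 1.0 / (1.0 + math.exp(-(log_odds - z * se)))
    return n / p_hi, n / p_lo  # a larger p implies a smaller N

print(logit_interval(10, 15, 5))  # roughly (16.8, 68.5)
```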

Bayesian estimate

The mean value ± standard deviation is

N ≈ μ ± σ

where

μ = ((n − 1)(K − 1)) / (k − 2)   for k > 2
σ² = ((n − 1)(K − 1)(n − k + 1)(K − k + 1)) / ((k − 2)²(k − 3))   for k > 3

A derivation is found here: Talk:Mark and recapture#Statistical treatment.

The example (n, K, k) = (10, 15, 5) gives the estimate N ≈ 42 ± 21.5
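A Python sketch that reproduces the figures above from these formulas (valid only for k > 3):

```python
import math

def bayes_mean_sd(n, K, k):
    # Posterior mean and standard deviation under the derivation
    # referenced above; the mean requires k > 2 and the standard
    # deviation requires k > 3.
    mu = (n - 1) * (K - 1) / (k - 2)
    var = ((n - 1) * (K - 1) * (n - k + 1) * (K - k + 1)
           / ((k - 2) ** 2 * (k - 3)))
    return mu, math.sqrt(var)

print(bayes_mean_sd(10, 15, 5))  # (42.0, about 21.5)
```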

Capture probability

A bank vole (Myodes glareolus) in a capture-release small mammal population study for London Wildlife Trust at Gunnersbury Triangle local nature reserve

The capture probability refers to the probability of detecting an individual animal or person of interest, [11] and has been used in both ecology and epidemiology for detecting animal and human diseases, respectively. [12]

The capture probability is often defined as a two-variable model, in which f is defined as the fraction of a finite resource devoted to detecting the animal or person of interest from a high-risk sector of an animal or human population, and q is the frequency with which the problem (e.g., an animal disease) occurs in the high-risk versus the low-risk sector. [13] For example, an application of the model in the 1920s was to detect typhoid carriers in London, who were either arriving from zones with high rates of the disease (probability q that a passenger with the disease came from such an area, where q > 0.5), or from zones with low rates (probability 1 − q). [14] It was posited that only 5 out of 100 of the travelers could be detected, and 10 out of 100 were from the high-risk area. Then the capture probability P was defined as:

P = (5/100) f q + (10/100)(1 − f)(1 − q)

where the first term refers to the probability of detection (capture probability) in the high-risk zone, and the latter term refers to the probability of detection in the low-risk zone. Importantly, the formula can be re-written as a linear equation in terms of f:

P = f [(5/100) q − (10/100)(1 − q)] + (10/100)(1 − q)

Because this is a linear function of f, whenever the slope of this line (the coefficient of f) is positive, all of the detection resource should be devoted to the high-risk population (f should be set to 1 to maximize the capture probability), whereas whenever the slope is negative, all of the detection resource should be devoted to the low-risk population (f should be set to 0). Solving for the values of q at which the slope is positive determines the values for which f should be set to 1 to maximize the capture probability:

(5/100) q − (10/100)(1 − q) > 0

which simplifies to:

q > 2/3

This is an example of linear optimization. [13] In more complex cases, where more than one resource f is devoted to more than two areas, multivariate optimization is often used, through the simplex algorithm or its derivatives.
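A small Python sketch of the two-sector model and its endpoint (bang-bang) solution, using the 5/100 and 10/100 detection rates posited in the example above:

```python
def capture_probability(f, q):
    # P = (5/100) f q + (10/100)(1 - f)(1 - q): detection rate of
    # 5/100 in the high-risk sector and 10/100 in the low-risk sector.
    return 0.05 * f * q + 0.10 * (1 - f) * (1 - q)

def optimal_f(q):
    # The slope of P in f is 0.05*q - 0.10*(1 - q), positive exactly
    # when q > 2/3, so the optimum always sits at an endpoint.
    slope = 0.05 * q - 0.10 * (1 - q)
    return 1.0 if slope > 0 else 0.0

print(optimal_f(0.8))  # 1.0: devote all resources to the high-risk sector
print(optimal_f(0.5))  # 0.0: devote all resources to the low-risk sector
```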

More than two visits

The literature on the analysis of capture-recapture studies has blossomed since the early 1990s. There are very elaborate statistical models available for the analysis of these experiments. [15] A simple model that easily accommodates the three-source, or three-visit, study is to fit a Poisson regression model. Sophisticated mark-recapture models can be fit with several packages for the open-source R programming language. These include "Spatially Explicit Capture-Recapture (secr)", [16] "Loglinear Models for Capture-Recapture Experiments (Rcapture)", [17] and "Mark-Recapture Distance Sampling (mrds)". [18] Such models can also be fit with specialized programs such as MARK [19] or E-SURGE. [20]
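As an illustration of the loglinear approach that packages such as Rcapture implement, the following Python sketch fits the main-effects (independence) model to hypothetical counts from a three-visit study and predicts the never-captured cell:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical counts for the 7 observable capture patterns of a
# three-visit study (1 = captured on that visit); pattern (0,0,0)
# is never observed.
patterns = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                     [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
counts = np.array([30, 25, 28, 10, 9, 11, 4])

# Fit the independence loglinear model by Poisson regression, then
# predict the unobservable empty pattern from the fitted intercept.
X = sm.add_constant(patterns)
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
never_seen = np.exp(fit.params[0])
print(counts.sum() + never_seen)  # estimated population size
```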

Other related methods that are often used include the Jolly–Seber model (used in open populations and for multiple census estimates) and Schnabel estimators [21] (an extension of the Lincoln–Petersen method to more than two visits for closed populations). These are described in detail by Sutherland. [22]
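A minimal Python sketch of the Schnabel calculation on hypothetical counts (some presentations add 1 to the denominator, a Chapman-style correction):

```python
def schnabel(catches, recaptures):
    # catches[t]: animals captured on visit t
    # recaptures[t]: how many of those already carried a mark
    marked_at_large = 0  # marked animals at large before visit t
    numerator = 0
    total_recaptures = 0
    for caught, recaught in zip(catches, recaptures):
        numerator += caught * marked_at_large
        total_recaptures += recaught
        marked_at_large += caught - recaught  # newly marked this visit
    return numerator / total_recaptures

# Hypothetical three-visit study: 10, 15 and 12 captures with
# 0, 5 and 6 recaptures respectively.
print(schnabel([10, 15, 12], [0, 5, 6]))  # about 35.5
```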

Integrated approaches

Modelling mark-recapture data is trending towards a more integrative approach, [23] which combines mark-recapture data with population dynamics models and other types of data. The integrated approach is more computationally demanding, but it extracts more information from the data, improving parameter and uncertainty estimates. [24]


References

  1. "Mark-Recapture".
  2. 1 2 3 4 5 6 Southwood, T. R. E.; Henderson, P. (2000). Ecological Methods (3rd ed.). Oxford: Blackwell Science.
  3. Krebs, Charles J. (2009). Ecology (6th ed.). Pearson Benjamin Cummings. p. 119. ISBN   978-0-321-50743-3.
  4. Chao, A.; Tsay, P. K.; Lin, S. H.; Shau, W. Y.; Chao, D. Y. (2001). "The applications of capture-recapture models to epidemiological data". Statistics in Medicine . 20 (20): 3123–3157. doi:10.1002/sim.996. PMID   11590637. S2CID   78437.
  5. Allen; et al. (2019). "Estimating the Number of People Who Inject Drugs in A Rural County in Appalachia". American Journal of Public Health. 109 (3): 445–450. doi:10.2105/AJPH.2018.304873. PMC   6366498 . PMID   30676803.
  6. "Recapture Definition & Meaning - Merriam-Webster". 21 August 2023.
  7. Seber, G. A. F. (1982). The Estimation of Animal Abundance and Related Parameters. Caldwel, New Jersey: Blackburn Press. ISBN   1-930665-55-5.
  8. 1 2 Charles J. Krebs (1999). Ecological Methodology (2nd ed.). Benjamin/Cummings. ISBN   9780321021731.
  9. 1 2 3 4 Chapman, D.G. (1951). Some properties of the hypergeometric distribution with applications to zoological sample censuses. UC Publications in Statistics. University of California Press.
  10. Sadinle, Mauricio (2009-10-01). "Transformed Logit Confidence Intervals for Small Populations in Single Capture–Recapture Estimation". Communications in Statistics - Simulation and Computation. 38 (9): 1909–1924. doi:10.1080/03610910903168595. ISSN   0361-0918. S2CID   205556773.
  11. Drenner, Ray (1978). "Capture probability: the role of zooplankter escape in the selective feeding of planktivorous fish". Journal of the Fisheries Board of Canada. 35 (10): 1370–1373. doi:10.1139/f78-215.
  12. MacKenzie, Darryl (2002). "How should detection probability be incorporated into estimates of relative abundance?". Ecology. 83 (9): 2387–2393. doi:10.1890/0012-9658(2002)083[2387:hsdpbi]2.0.co;2.
  13. 1 2 Bolker, Benjamin (2008). Ecological Models and Data in R. Princeton University Press. ISBN   9781400840908.
  14. Unknown (1921). "The Health of London". Hosp Health Rev. 1 (3): 71–2. PMC   5518027 . PMID   29418259.
  15. McCrea, R.S. and Morgan, B.J.T. (2014) "Analysis of capture-recapture data" . Retrieved 19 Nov 2014. "Chapman and Hall/CRC Press" . Retrieved 19 Nov 2014.
  16. Efford, Murray (2016-09-02). "Spatially Explicit Capture-Recapture (secr)". Comprehensive R Archive Network (CRAN). Retrieved 2016-09-02.
  17. Rivest, Louis-Paul; Baillargeon, Sophie (2014-09-01). "Loglinear Models for Capture-Recapture Experiments (Rcapture)". Comprehensive R Archive Network (CRAN). Retrieved 2016-09-02.
  18. Laake, Jeff; Borchers, David; Thomas, Len; Miller, David; Bishop, Jon (2015-08-17). "Mark-Recapture Distance Sampling (mrds)". Comprehensive R Archive Network (CRAN).
  19. "Program MARK". Archived from the original on 21 February 2006. Retrieved 29 May 2013.
  20. "Logiciels". Archived from the original on 2009-07-24.
  21. Schnabel, Z. E. (1938). "The Estimation of the Total Fish Population of a Lake". American Mathematical Monthly . 45 (6): 348–352. doi:10.2307/2304025. JSTOR   2304025.
  22. William J. Sutherland, ed. (1996). Ecological Census Techniques: A Handbook. Cambridge University Press. ISBN   0-521-47815-4.
  23. Maunder M.N. (2003) Paradigm shifts in fisheries stock assessment: from integrated analysis to Bayesian analysis and back again. Natural Resource Modeling 16:465–475
  24. Maunder, M.N. (2001) Integrated Tagging and Catch-at-Age Analysis (ITCAAN). In Spatial Processes and Management of Fish Populations, edited by G.H. Kruse, N. Bez, A. Booth, M.W. Dorn, S. Hills, R.N. Lipcius, D. Pelletier, C. Roy, S.J. Smith, and D. Witherell, Alaska Sea Grant College Program Report No. AK-SG-01-02, University of Alaska Fairbanks, pp. 123–146.
