Distance sampling

Distance sampling is a widely used group of closely related methods for estimating the density and/or abundance of populations. The main methods are based on line transects or point transects. [1] [2] In this method of sampling, the data collected are the distances of the objects being surveyed from these randomly placed lines or points, and the objective is to estimate the average density of the objects within a region. [3]

Basic line transect methodology

Basic distance sampling survey approach using line transects. A field observer detects an object and records distance r and angle θ to the transect line. This allows calculation of object distance to the transect (x). All x from the survey are used to model how detectability decreases with distance from the transect, which allows estimation of total population density in the surveyed area.

A common approach to distance sampling is the use of line transects. The observer traverses a straight line (placed randomly or following some planned distribution). Whenever they observe an object of interest (e.g., an animal of the type being surveyed), they record the distance from their current position to the object (r), as well as the angle of the detection to the transect line (θ). The distance of the object to the transect can then be calculated as x = r * sin(θ). These distances x are the detection distances that will be analyzed in further modeling.
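A minimal sketch of this conversion (the function name and the use of degrees are illustrative choices, not part of any particular software package):

```python
import math

def perpendicular_distance(r, theta_degrees):
    """Perpendicular distance x of a detected object from the transect line,
    given the radial distance r and the sighting angle theta (in degrees)
    between the transect line and the line of sight: x = r * sin(theta)."""
    return r * math.sin(math.radians(theta_degrees))

# Example: an object sighted 100 m away at 30 degrees off the line
# lies perpendicular_distance(100, 30) = 50.0 m from the transect.
```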

Objects are detected out to a pre-determined maximum detection distance w. Not all objects within w will be detected, but a fundamental assumption is that all objects at zero distance (i.e., on the line itself) are detected. Overall detection probability is thus expected to be 1 on the line and to decrease with increasing distance from the line. The distribution of the observed distances is used to estimate a "detection function" that describes the probability of detecting an object at a given distance. Provided that various basic assumptions hold, this function allows the estimation of the average probability P of detecting an object, given that it is within distance w of the line. Object density can then be estimated as D = n / (P * a), where n is the number of objects detected and a is the size of the region covered (the total transect length L multiplied by 2w).
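As an illustration, the estimator can be evaluated directly once n, P, L and w are available; the sketch below uses made-up numbers and assumes consistent units throughout:

```python
def density_estimate(n, p_hat, transect_length, w):
    """Estimated density D = n / (P * a), where the covered area
    a = 2 * w * L is a strip of half-width w on each side of a transect
    of total length L.  Units of w and transect_length must match."""
    a = 2 * w * transect_length
    return n / (p_hat * a)

# Example: 120 detections, estimated average detection probability 0.6,
# 50 km of transect, truncation distance 0.2 km:
# density_estimate(120, 0.6, 50, 0.2) -> 10.0 objects per square km
```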

In summary, modeling how detectability drops off with increasing distance from the transect allows estimating how many objects there are in total in the area of interest, based on the number that were actually observed. [2]

The survey methodology for point transects is slightly different. In this case, the observer remains stationary, the survey ends not when the end of the transect is reached but after a pre-determined time, and measured distances to the observer are used directly without conversion to transverse distances. Detection function types and fitting are also different to some degree. [2]
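Applying the same logic with the covered area being k circular plots of radius w rather than a strip gives the following sketch (function and parameter names are illustrative):

```python
import math

def point_transect_density(n, p_hat, k_points, w):
    """Estimated density for point transects: D = n / (P * a), where the
    covered area a = k * pi * w**2 is the combined area of k circular plots
    of radius w, and P is the average detection probability within w."""
    a = k_points * math.pi * w ** 2
    return n / (p_hat * a)
```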

Detection function

Half-normal detection function (red line) fitted to PDF of detection data. Data have been collated into distance bands (either collected as such, or combined after collection to improve model fitting). Detection probability decreases with distance from center line (y = 0).

The drop-off of detectability with increasing distance from the transect line is modeled using a detection function g(y) (here y is distance from the line). This function is fitted to the distribution of the observed detection distances, represented as a probability density function (PDF) and typically visualized as a histogram of the collected distances. The detection function describes the probability that an object at distance y will be detected by an observer on the center line, with detections on the line itself (y = 0) assumed to be certain (P = 1).
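One simple way to obtain such an empirical PDF is to collate the perpendicular distances into equal-width distance bands and normalise the counts; a minimal sketch (the number of bands is an arbitrary choice, and at least one detection within w is assumed):

```python
def binned_pdf(distances, w, n_bins=10):
    """Collate perpendicular distances into n_bins equal-width bands on
    [0, w) and normalise the counts so that the bar areas sum to 1,
    giving an empirical PDF to which g(y) can be fitted."""
    width = w / n_bins
    counts = [0] * n_bins
    for y in distances:
        if 0 <= y < w:
            counts[int(y / width)] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts]
```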

By preference, g(y) is a robust function that can represent data with unclear or weakly defined distribution characteristics, as is frequently the case in field data. Several types of functions are commonly used, depending on the general shape of the detection data's PDF:

Detection function       Form
Uniform                  1/w
Half-normal              exp(-y²/(2σ²))
Hazard-rate              1 - exp(-(y/σ)^(-b))
Negative exponential     exp(-ay)

Here w is the overall detection truncation distance and a, b and σ are function-specific parameters. The half-normal and hazard-rate functions are generally considered to be most likely to represent field data that was collected under well-controlled conditions. Detection probability appearing to increase or remain constant with distance from the transect line may indicate problems with data collection or survey design. [2]
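As an illustration of the fitting step, the sketch below estimates σ for a half-normal detection function by maximum likelihood (via a deliberately crude grid search) and then derives the average detection probability P used in the density estimator above. Dedicated distance-sampling software handles model fitting, selection and variance estimation far more rigorously; this is only a toy version under the half-normal assumption.

```python
import math

def halfnormal_negloglik(sigma, distances, w):
    """Negative log-likelihood of perpendicular distances under a
    half-normal detection function g(y) = exp(-y^2 / (2 sigma^2))
    truncated at w.  The density of observed distances is f(y) = g(y) / mu,
    where mu is the integral of g from 0 to w."""
    mu = sigma * math.sqrt(math.pi / 2) * math.erf(w / (sigma * math.sqrt(2)))
    return sum((y ** 2) / (2 * sigma ** 2) + math.log(mu) for y in distances)

def fit_halfnormal_sigma(distances, w):
    """Crude grid search for the maximum-likelihood sigma."""
    grid = [w * k / 1000 for k in range(1, 3001)]
    return min(grid, key=lambda s: halfnormal_negloglik(s, distances, w))

def average_detection_probability(sigma, w):
    """P = (1/w) * integral_0^w g(y) dy for the fitted half-normal."""
    mu = sigma * math.sqrt(math.pi / 2) * math.erf(w / (sigma * math.sqrt(2)))
    return mu / w
```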

Covariates

Series expansions

A frequently used method to improve the fit of the detection function to the data is the use of series expansions. Here, the function is split into a "key" part (of the type covered above) and a "series" part; i.e., g(y) = key(y)[1 + series(y)]. The series generally takes the form of a polynomial (e.g. a Hermite polynomial) and is intended to add flexibility to the form of the key function, allowing it to fit more closely to the data PDF. While this can improve the precision of density/abundance estimates, its use is only defensible if the data set is of sufficient size and quality to represent a reliable estimate of detection distance distribution. Otherwise there is a risk of overfitting the data and allowing non-representative characteristics of the data set to bias the fitting process. [2] [4]
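As a sketch of the idea, the code below pairs a half-normal key with cosine adjustment terms (one common choice; a Hermite polynomial series is handled analogously). The coefficients are placeholders rather than fitted values, and the rescaling that fitting software applies so that g(0) = 1 is omitted.

```python
import math

def key_halfnormal(y, sigma):
    """Half-normal key function."""
    return math.exp(-(y ** 2) / (2 * sigma ** 2))

def adjusted_detection(y, sigma, w, cosine_coeffs):
    """g(y) = key(y) * [1 + series(y)], here with a cosine series
    series(y) = sum_j a_j * cos(j * pi * y / w) over adjustment
    orders j = 2, 3, ...  The result is not rescaled to g(0) = 1."""
    series = sum(a * math.cos((j + 2) * math.pi * y / w)
                 for j, a in enumerate(cosine_coeffs))
    return key_halfnormal(y, sigma) * (1 + series)

# Example: adjusted_detection(50.0, 80.0, 200.0, [0.1])
```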

Assumptions and sources of bias

Since distance sampling is a comparatively complex survey method, the reliability of model results depends on meeting a number of basic assumptions. The most fundamental ones are listed below. Data derived from surveys that violate one or more of these assumptions can frequently, but not always, be corrected to some extent before or during analysis. [1] [2]

Basic assumptions of distance sampling
Assumption: All animals on the transect line itself are detected (i.e., P(0) = 1).
Violation: This can often be assumed in terrestrial surveys, but may be problematic in shipboard surveys. Violation may result in strong bias of model estimates.
Prevention/post-hoc correction: In dual observer surveys, one observer may be tasked to "guard the centerline". Post-hoc fixes are sometimes possible but can be complex. [1] It is thus worth avoiding any violations of this assumption.

Assumption: Animals are randomly and evenly distributed throughout the surveyed area.
Violation: The main sources of bias are (a) clustered populations (flocks etc.) whose individual detections are treated as independent, (b) transects that are not placed independently of gradients of density (roads, watercourses etc.), and (c) transects that are too close together.
Prevention/post-hoc correction: (a) record not individuals but clusters plus cluster size, then incorporate the estimation of cluster size into the detection function; (b) place transects either randomly, or across known gradients of density; (c) make sure that the maximum detection range (w) does not overlap between transects.

Assumption: Animals do not move before detection.
Violation: Resulting bias is negligible if movement is random. Movement in response to the observer (avoidance/attraction) will incur a negative/positive bias in detectability.
Prevention/post-hoc correction: Avoidance behaviour is common and may be difficult to prevent in the field. An effective post-hoc remedy is the averaging-out of data by partitioning detections into intervals, and by using detection functions with a shoulder (e.g., hazard-rate).
Data example: An indication of avoidance behaviour in the data is that detections initially increase, rather than decrease, with added distance to the transect line.

Assumption: Measurements (angles and distances) are exact.
Violation: Random errors are negligible, but systematic errors may introduce bias. This often happens with rounding of angles or distances to preferred ("round") values, resulting in heaping at particular values. Rounding of angles to zero is particularly common.
Prevention/post-hoc correction: Avoid dead reckoning in the field by using range finders and angle boards. Post-hoc smoothing of data by partitioning into detection intervals is effective in addressing minor biases.
Data example: An indication of angle rounding to zero in the data is that there are more detections than expected in the very first data interval.

Software implementations

References

  1. Buckland, S. T.; Anderson, D. R.; Burnham, K. P.; Laake, J. L. (1993). Distance Sampling: Estimating Abundance of Biological Populations. London: Chapman and Hall. ISBN 0-412-42660-9.
  2. Buckland, S. T.; Anderson, D. R.; Burnham, K. P.; Laake, J. L.; Borchers, D. L.; Thomas, L. (2001). Introduction to Distance Sampling: Estimating Abundance of Biological Populations. Oxford: Oxford University Press.
  3. Everitt, B. S. (2002). The Cambridge Dictionary of Statistics, 2nd ed. Cambridge University Press. ISBN 0-521-81099-X (entry for distance sampling).
  4. Buckland, S. T. (2004). Advanced Distance Sampling. Oxford: Oxford University Press.

Further reading