The unseen species problem in ecology deals with the estimation of the number of species represented in an ecosystem that were not observed in samples. More specifically, it concerns how many new species would be discovered if more samples were taken in an ecosystem. The study of the unseen species problem was started in the early 1940s by Alexander Steven Corbet. He spent two years in British Malaya trapping butterflies and was curious how many new species he would discover if he spent another two years trapping. Many different estimation methods have been developed to determine how many new species would be discovered given more samples.
The unseen species problem also applies more broadly, as the estimators can be used to estimate the number of new elements of a set that were not found in previous samples. An example of this is determining how many words William Shakespeare knew based on all of his written works. [1]
The unseen species problem can be broken down mathematically as follows: if $n$ independent samples $X_1, X_2, \ldots, X_n$ are taken, and then $m$ more independent samples $X_{n+1}, \ldots, X_{n+m}$ are taken, the number of unseen species that will be discovered by the additional samples is given by

$$U := \left|\{X_{n+1}, \ldots, X_{n+m}\} \setminus \{X_1, \ldots, X_n\}\right|,$$

with $\{X_{n+1}, \ldots, X_{n+m}\}$ being the second set of samples.
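For concreteness, a minimal Python sketch of this definition, treating each sample as a species label (the data and names below are invented for illustration):

```python
def unseen_species(first_samples, second_samples):
    """Count species that appear in the second set of samples but were
    never observed in the first set (a set difference of species labels)."""
    return len(set(second_samples) - set(first_samples))

# Invented species labels, purely for illustration
first = ["A", "B", "B", "C", "A"]       # X_1, ..., X_n
second = ["B", "D", "E", "A", "E"]      # X_{n+1}, ..., X_{n+m}
print(unseen_species(first, second))    # 2  (species D and E are new)
```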
In the early 1940s, Alexander Steven Corbet spent two years in British Malaya trapping butterflies. [2] He kept track of how many species he observed and how many members of each species were captured. For example, there were 74 different species of which he captured only two individuals each.
When Corbet returned to the United Kingdom, he approached biostatistician Ronald Fisher and asked how many new species of butterflies he could expect to catch if he went trapping for another two years; [3] in essence, Corbet was asking how many species he observed zero times.
Fisher responded with a simple estimation: for an additional two years of trapping, Corbet could expect to capture 75 new species. He did this using a simple summation, with data provided by Orlitsky [3] in the table from the example below:

$$U = \sum_{i=1}^{\infty} (-1)^{i+1} \varphi_i = 118 - 74 + 44 - 24 + \cdots = 75$$

Here $\varphi_i$ corresponds to the number of individual species that were observed $i$ times. Fisher's sum was later confirmed by the Good–Toulmin estimator. [2]
To estimate the number of unseen species, let $t$ be the number of future samples ($m$) divided by the number of past samples ($n$), or $t = m/n$. Let $\varphi_i$ be the number of individual species observed $i$ times (for example, if there were 74 species of butterflies with 2 observed members throughout the samples, then $\varphi_2 = 74$).
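As a concrete sketch of how these quantities can be computed, the following Python snippet (with invented sample data and helper names) derives the prevalences $\varphi_i$ from a list of observed species labels:

```python
from collections import Counter

def prevalences(samples):
    """Return a dict mapping i -> number of species observed exactly i times."""
    species_counts = Counter(samples)        # species -> number of observations
    return dict(Counter(species_counts.values()))

# Invented observations, purely for illustration
samples = ["A", "B", "B", "C", "A", "D", "B"]
print(prevalences(samples))   # {2: 1, 3: 1, 1: 2}: one species seen twice, one three times, two once

n = len(samples)              # past samples
m = 14                        # hypothetical number of future samples
t = m / n                     # t = 2.0
```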
The Good–Toulmin (GT) estimator was developed by Good and Toulmin in 1953. [4] The estimate of the unseen species based on the Good–Toulmin estimator is given by

$$U^{\mathrm{GT}} := -\sum_{i=1}^{\infty} (-t)^i \varphi_i .$$

The Good–Toulmin estimator has been shown to be a good estimate for values of $t \le 1$. The Good–Toulmin estimator also approximately satisfies

$$\mathbb{E}\!\left[\left(U^{\mathrm{GT}} - U\right)^{2}\right] \lesssim n t^{2} .$$

This means that $U^{\mathrm{GT}}$ estimates $U$ to within $\varepsilon n t$ as long as $n \gg 1/\varepsilon^{2}$.
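A short Python sketch of this estimator, taking the prevalences $\varphi_i$ as a dictionary (the function name and the example counts are illustrative, not from the original papers):

```python
def good_toulmin(phi, t):
    """Good-Toulmin estimate U^GT = -sum_i (-t)^i * phi_i of the number of unseen species.

    phi: dict mapping i -> number of species observed exactly i times
    t:   ratio m/n of future samples to past samples
    """
    return -sum(((-t) ** i) * count for i, count in phi.items())

# Invented prevalences: 5 species seen once, 2 seen twice, 1 seen three times
phi = {1: 5, 2: 2, 3: 1}
print(good_toulmin(phi, t=1.0))   # 5 - 2 + 1 = 4.0 expected new species
```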
However, for $t > 1$, the Good–Toulmin estimator fails to capture accurate results. This is because, for $t > 1$, the magnitude of the terms $(-t)^i \varphi_i$ grows exponentially in $i$, so $U^{\mathrm{GT}}$ can grow super-linearly in $t$, while $U$ can grow at most linearly with $t$ (at most $m = nt$ of the additional samples can belong to new species). Therefore, when $t > 1$, $U^{\mathrm{GT}}$ grows faster than $U$ and does not approximate the true value. [3]
To compensate for this, Efron and Thisted in 1976 [1] showed that a truncated Euler transform can also be a usable estimate (the "ET" estimate):

$$U^{\mathrm{ET}} := \sum_{i=1}^{\infty} u_i^{\mathrm{ET}} \cdot \varphi_i ,$$

with

$$u_i^{\mathrm{ET}} := -(-t)^i \cdot \mathbb{P}\!\left(\mathrm{Bin}\!\left(k, \frac{1}{1+t}\right) \ge i\right),$$

where

$$\mathbb{P}\!\left(\mathrm{Bin}(k, q) \ge i\right) = \sum_{j=i}^{k} \binom{k}{j} q^{j} (1-q)^{k-j},$$

and where $k$ is the location chosen to truncate the Euler transform.
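A Python sketch of the ET estimate as written above, with the binomial tail probability computed directly (the helper names and the example values of $k$ and $t$ are arbitrary):

```python
from math import comb

def binom_tail(k, q, i):
    """P(Bin(k, q) >= i), the binomial tail probability."""
    return sum(comb(k, j) * q**j * (1 - q)**(k - j) for j in range(i, k + 1))

def efron_thisted(phi, t, k):
    """Truncated-Euler-transform ("ET") estimate of the number of unseen species.

    phi: dict mapping i -> number of species observed exactly i times
    t:   ratio m/n of future to past samples
    k:   location chosen to truncate the Euler transform
    """
    q = 1.0 / (1.0 + t)
    return sum(-((-t) ** i) * binom_tail(k, q, i) * count for i, count in phi.items())

# Invented prevalences and an arbitrary truncation point
phi = {1: 5, 2: 2, 3: 1}
print(efron_thisted(phi, t=2.0, k=5))
```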
Similar to the approach by Efron and Thisted, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu developed the smooth Good–Toulmin estimator. They realized that the Good–Toulmin estimator failed because of the exponential growth of its coefficients, and not because of its bias. [3] Therefore, they estimated the number of unseen species by truncating the series:

$$U^{\ell} := -\sum_{i=1}^{\ell} (-t)^i \varphi_i .$$

Orlitsky, Suresh, and Wu also noted that, for $t > 1$, the driving term in the truncated sum is the final ($i = \ell$) term whenever $\varphi_\ell > 0$, regardless of which value of $\ell$ is chosen. [2] To solve this, they selected a random nonnegative integer $L$, truncated the series at $L$, and then took the average over the distribution of $L$. [3] The resulting estimator is

$$U^{L} := \mathbb{E}_{L}\!\left[-\sum_{i=1}^{L} (-t)^i \varphi_i\right].$$

This method was chosen because the bias of $U^{\ell}$ shifts signs due to the $(-t)^i$ coefficient; averaging over a distribution of $L$ therefore reduces the bias. This means that the estimator can be written as a linear combination of the prevalences: [2]

$$U^{L} = -\sum_{i=1}^{\infty} (-t)^i \, \mathbb{P}(L \ge i) \, \varphi_i .$$

Depending on the distribution of $L$ chosen, the results will vary. With this method, estimates can be made for $t \propto \log n$, and this is the best possible. [3]
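The linear-combination form above can be sketched in Python for a generic truncation distribution; here a Poisson-distributed $L$ is used purely as an illustration, without the specific parameter choices analysed by Orlitsky, Suresh, and Wu:

```python
from math import exp

def poisson_tail(lam, i):
    """P(L >= i) for L ~ Poisson(lam), via 1 minus the CDF up to i - 1."""
    pmf, cdf = exp(-lam), 0.0
    for j in range(i):           # accumulate P(L = 0), ..., P(L = i - 1)
        cdf += pmf
        pmf *= lam / (j + 1)
    return max(0.0, 1.0 - cdf)

def smoothed_good_toulmin(phi, t, tail):
    """Smoothed Good-Toulmin estimate U^L = -sum_i (-t)^i * P(L >= i) * phi_i.

    phi:  dict mapping i -> number of species observed exactly i times
    t:    ratio m/n of future to past samples
    tail: function i -> P(L >= i) for the chosen truncation distribution
    """
    return sum(-((-t) ** i) * tail(i) * count for i, count in phi.items())

# Invented prevalences; L ~ Poisson(2) chosen only as an example
phi = {1: 5, 2: 2, 3: 1}
print(smoothed_good_toulmin(phi, t=2.0, tail=lambda i: poisson_tail(2.0, i)))
```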
The species discovery curve can also be used. This curve relates the number of species found in an area as a function of the time spent sampling. These curves can also be created by using estimators (such as the Good–Toulmin estimator) and plotting the number of unseen species at each value of $t$. [5]
A species discovery curve is always increasing, as there is never a sample that could decrease the number of discovered species. Furthermore, the species discovery curve is also decelerating – the more samples taken, the fewer unseen species are expected to be discovered. The species discovery curve will also never asymptote, as it is assumed that although the discovery rate might become infinitely slow, it will never actually stop. [5] Two common models for a species discovery curve are the logarithmic and the exponential function.
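As a rough illustration of the two model families just mentioned, the sketch below defines a logarithmic and a (negative) exponential accumulation curve; the parameterizations and parameter values are assumptions chosen only to show the shapes, not forms taken from any particular study:

```python
import math

def logarithmic_curve(t, a, b):
    """Logarithmic accumulation model: keeps growing, ever more slowly, without bound."""
    return a * math.log(1.0 + b * t)

def exponential_curve(t, s_max, k):
    """Negative-exponential accumulation model: levels off toward s_max as t grows."""
    return s_max * (1.0 - math.exp(-k * t))

# Made-up parameters, purely to compare the two shapes
for t in [1, 2, 4, 8, 16]:
    print(t,
          round(logarithmic_curve(t, a=50, b=1.0), 1),
          round(exponential_curve(t, s_max=120, k=0.3), 1))
```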
As an example, consider the data Corbet provided Fisher in the 1940s. [3] Using the Good–Toulmin model, the number of unseen species is found using

$$U^{\mathrm{GT}} = -\sum_{i=1}^{\infty} (-t)^i \varphi_i .$$

This can then be used to create a relationship between $t$ and $U^{\mathrm{GT}}$.
Number of observed members, $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of species, $\varphi_i$ | 118 | 74 | 44 | 24 | 29 | 22 | 20 | 19 | 20 | 15 | 12 | 14 | 6 | 12 | 6 |
This relationship is shown in the plot below.
From the plot, it is seen that at $t = 1$, which was the value of $t$ that Corbet brought to Fisher, the resulting estimate of $U$ is 75, matching what Fisher found. This plot also acts as a species discovery curve for this ecosystem and defines how many new species will be discovered as $t$ increases (and more samples are taken).
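The example can be checked numerically; the short Python sketch below recomputes the Good–Toulmin estimate from the table above over a few values of $t$ (the grid of $t$ values is arbitrary):

```python
def good_toulmin(phi, t):
    """Good-Toulmin estimate U^GT = -sum_i (-t)^i * phi_i."""
    return -sum(((-t) ** i) * count for i, count in phi.items())

# Corbet's butterfly data: phi_i for i = 1, ..., 15, taken from the table above
corbet_phi = {i + 1: c for i, c in enumerate(
    [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6])}

# Species discovery curve: estimated number of new species at each t <= 1
for t in [0.25, 0.5, 0.75, 1.0]:
    print(f"t = {t:4}: U^GT = {good_toulmin(corbet_phi, t):.1f}")
# At t = 1 the estimate is 75.0, matching the answer Fisher gave Corbet.
```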
There are numerous uses for the predictive algorithm. Because the estimators are accurate for $t \le 1$, they allow scientists to extrapolate accurately the results of polling people by a factor of 2: the number of unique answers that would appear can be predicted from the number of people that gave each answer. The method can also be used to determine the extent of someone's knowledge, as in the following example.
Based on research of Shakespeare's known works done by Thisted and Efron, there are 884,647 total words. [1] The research also tallied how many different words appear each number of times, including the words that appear more than 100 times; in total, the number of distinct words was found to be 31,534. [1] Applying the Good–Toulmin model, one can estimate how many previously unseen words would appear if an equal number of works by Shakespeare were discovered; the goal is then to derive $U$ as $t \to \infty$. Thisted and Efron estimate that $U \ge 35{,}000$, meaning that Shakespeare most likely knew over twice as many words as he actually used in all of his writings. [1]