Unseen species problem

The unseen species problem arises commonly in ecology and concerns estimating the number of species represented in an ecosystem that were not observed in samples. More specifically, it asks how many new species would be discovered if more samples were taken in an ecosystem. The study of the unseen species problem was started in the early 1940s by Alexander Steven Corbet. He spent 2 years in British Malaya trapping butterflies and was curious how many new species he would discover if he spent another 2 years trapping. Many different estimation methods have been developed to determine how many new species would be discovered given more samples. The unseen species problem also applies more broadly, as the estimators can be used to estimate any new elements of a set not previously found in samples. An example of this is determining how many words William Shakespeare knew based on all of his written works.

Definition

The unseen species problem can be broken down mathematically as follows: If $n$ independent samples are taken, $X^n = X_1, X_2, \ldots, X_n$, and then if $m$ additional independent samples were taken, $X_{n+1}^{n+m} = X_{n+1}, X_{n+2}, \ldots, X_{n+m}$, the number of unseen species that will be discovered by the additional samples is given by

$$U \equiv U\left(X^n, X_{n+1}^{n+m}\right),$$

with $X_{n+1}^{n+m}$ being the second set of samples.
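When both sets of samples are actually available, $U$ can be computed directly by counting. The short Python sketch below only illustrates the quantity being estimated (it is not an estimator from the cited papers), and the sample labels are made up:

```python
# A minimal sketch (not from the cited papers): counting unseen species directly
# when both the original and the additional samples are available. The sample
# values below are made up for illustration.

def count_unseen(first_samples, additional_samples):
    """Number of distinct species in the additional samples that never
    appeared in the first samples -- the quantity U defined above."""
    seen = set(first_samples)
    return len(set(additional_samples) - seen)

first = ["A", "B", "A", "C", "B"]          # n = 5 original observations
extra = ["B", "D", "E", "A", "D"]          # m = 5 additional observations
print(count_unseen(first, extra))          # -> 2  (species D and E are new)
```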

History

In the early 1940s Alexander Steven Corbet spent 2 years in British Malaya trapping butterflies. [1] He kept track of how many species he observed, and how many members of each species were captured. For example, there were 74 different species of which he captured exactly 2 individuals each.

When Corbet returned to the United Kingdom, he approached biostatistician Ronald Fisher and asked how many new species of butterflies he could expect to catch if he went trapping for another two years; [2] in essence, Corbet was asking how many species he observed zero times.

Fisher responded with a simple estimation: for an additional 2 years of trapping, Corbet could expect to capture 75 new species. He did this using a simple summation (data provided by Orlitsky [2] in the table from the example below):

$$\sum_{i} (-1)^{i+1} \varphi_i = 118 - 74 + 44 - 24 + 29 - 22 + 20 - 19 + 20 - 15 + 12 - 14 + 6 - 12 + 6 = 75$$

Here $\varphi_i$ corresponds to the number of individual species that were observed $i$ times. Fisher's sum was later confirmed by Good–Toulmin. [1]

Estimators

To estimate the number of unseen species, let $t$ be the number of future samples ($m$) divided by the number of past samples ($n$), or $t = m/n$. Let $\varphi_i$ be the number of individual species observed $i$ times (for example, if there were 74 species of butterflies with 2 observed members throughout the samples, then $\varphi_2 = 74$).
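As a concrete illustration, the following Python sketch computes $t$ and the prevalences $\varphi_i$ from a list of raw observations; the species labels and the planned number of future samples are made up:

```python
# A minimal sketch (illustrative only): computing t and the prevalences
# phi_i from a list of observed individuals. Species labels are made up.

from collections import Counter

def prevalences(samples):
    """phi[i] = number of species observed exactly i times in the samples."""
    species_counts = Counter(samples)          # how often each species appears
    return Counter(species_counts.values())    # how many species appear i times

observations = ["A", "A", "B", "C", "C", "C", "D"]   # n = 7 past observations
m = 14                                               # planned future observations
t = m / len(observations)                            # t = m / n = 2.0

phi = prevalences(observations)
print(t)         # -> 2.0
print(phi[1])    # -> 2  (species B and D were each seen once)
print(phi[2])    # -> 1  (species A was seen twice)
print(phi[3])    # -> 1  (species C was seen three times)
```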

The Good–Toulmin estimator

The Good–Toulmin (GT) estimator was developed by Good and Toulmin in 1956. [3] The estimate of the unseen species based on the Good–Toulmin estimator is given by

$$\hat U^{\mathrm{GT}} = -\sum_{i \ge 1} (-t)^i \varphi_i.$$

The Good–Toulmin estimator has been shown to be a good estimate for values of $t \le 1$. The Good–Toulmin estimator also approximately satisfies

$$\mathbb{E}\!\left[\left(\hat U^{\mathrm{GT}} - U\right)^2\right] \lesssim n t^2.$$

This means that $\hat U^{\mathrm{GT}}$ estimates $U$ to within an error on the order of $\sqrt{n}\,t$, a vanishing fraction of the largest possible value $U \le nt$, as long as $t \le 1$. [2]

However, for $t > 1$, the Good–Toulmin estimator fails to give accurate results. This is because the magnitude of the term $(-t)^i \varphi_i$ grows geometrically with $i$ when $t > 1$: a change of 1 in a single prevalence $\varphi_i$ changes the estimate by $t^i$, so $\hat U^{\mathrm{GT}}$ can grow super-linearly in $n$, whereas $U$ is at most the number of new samples $m = nt$ and can therefore grow at most linearly with $n$. Therefore, when $t > 1$, $\hat U^{\mathrm{GT}}$ grows faster than $U$ and does not approximate the true value. [2]
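A minimal Python sketch of the Good–Toulmin formula above is given below; the prevalence values are invented purely to show the computation, including how the alternating terms behave when $t > 1$:

```python
# A minimal sketch of the Good-Toulmin estimator as written above:
# U_GT = -sum_i (-t)^i * phi_i, where phi is a dict {i: phi_i}.
# The toy prevalence values are made up for illustration.

def good_toulmin(phi, t):
    """Good-Toulmin estimate of the number of unseen species."""
    return -sum(((-t) ** i) * count for i, count in phi.items())

phi = {1: 10, 2: 4, 3: 1}        # 10 singletons, 4 doubletons, 1 tripleton
print(good_toulmin(phi, 1.0))    # -> 7.0, a sensible estimate for t <= 1
print(good_toulmin(phi, 3.0))    # -> 21.0 here, but for t > 1 the alternating
                                 #    terms can blow up and the estimate degrades
```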

To compensate for this breakdown for $t > 1$, Efron and Thisted in 1976 [4] showed that a truncated Euler transform can also be a usable estimate (the "ET" estimate):

$$\hat U^{\mathrm{ET}} = \sum_{i \ge 1} u_i^{\mathrm{ET}} \varphi_i,$$

with

$$u_i^{\mathrm{ET}} = -(-t)^i \cdot \mathbb{P}\!\left(\mathrm{Bin}\!\left(k, \frac{1}{1+t}\right) \ge i\right),$$

where $\mathrm{Bin}(k, q)$ denotes a binomial random variable with $k$ trials and success probability $q$, and $k$ is the location chosen to truncate the Euler transform.
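The following Python sketch implements the ET weighting described above, using the binomial tail probability as the smoothing factor; the truncation point $k$ and the toy prevalences are arbitrary choices for illustration:

```python
# A minimal sketch of the truncated Euler-transform ("ET") estimator described
# above, using the binomial tail P(Bin(k, 1/(1+t)) >= i) as the smoothing
# weight. Written against the standard library only; the toy data are made up.

from math import comb

def binom_tail(k, q, i):
    """P(Bin(k, q) >= i) for a binomial with k trials, success probability q."""
    return sum(comb(k, j) * q**j * (1 - q) ** (k - j) for j in range(i, k + 1))

def efron_thisted(phi, t, k):
    """ET estimate: sum_i -(-t)^i * P(Bin(k, 1/(1+t)) >= i) * phi_i."""
    q = 1.0 / (1.0 + t)
    return sum(-((-t) ** i) * binom_tail(k, q, i) * count
               for i, count in phi.items())

phi = {1: 10, 2: 4, 3: 1}              # made-up prevalences
print(efron_thisted(phi, t=2.0, k=5))  # smoother behaviour than plain GT at t > 1
```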

The smoothed Good–Toulmin estimator

Similar to the approach by Efron and Thisted, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu developed the smoothed Good–Toulmin estimator. They realized that the Good–Toulmin estimator failed because of its exponential growth for $t > 1$, not because of its bias. [2] Therefore, they estimated the number of unseen species by truncating the series at a point $\ell$:

$$\hat U^{\ell} = -\sum_{i=1}^{\ell} (-t)^i \varphi_i.$$

Orlitsky, Suresh, and Wu also noted that for $t > 1$, the driving term in the truncated sum is the final term, regardless of which value of $\ell$ is chosen. [1] To solve this, they selected a random nonnegative integer $L$, truncated the series at $L$, and then took the average over the distribution of $L$. [2] The resulting estimator is

$$\hat U^{L} = \mathbb{E}_L\!\left[-\sum_{i=1}^{L} (-t)^i \varphi_i\right].$$

This method was chosen because the bias of the truncated estimator shifts sign due to the $(-t)^i$ coefficient; averaging over a distribution of $L$ therefore reduces the bias. This means that the estimator can be written as a linear combination of the prevalences: [1]

$$\hat U^{L} = -\sum_{i \ge 1} (-t)^i \, \mathbb{P}(L \ge i) \, \varphi_i.$$

Depending on the distribution of $L$ chosen, the results will vary. With this method, accurate estimates can be made for $t$ proportional to $\log n$, and this is the best possible. [2]
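As an illustration of the smoothing idea, the Python sketch below weights each prevalence by a tail probability $\mathbb{P}(L \ge i)$; the Poisson distribution and its mean used here are only placeholders, not the specific choices analysed by Orlitsky, Suresh, and Wu:

```python
# A minimal sketch of the smoothed Good-Toulmin idea described above: truncate
# at a random nonnegative integer L and average, which is the same as weighting
# each prevalence by the tail probability P(L >= i). The choice of distribution
# for L below (a Poisson with a made-up mean) is only for illustration; the
# cited papers analyse specific choices.

from math import exp, factorial

def poisson_tail(lam, i):
    """P(L >= i) for L ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(i))

def smoothed_good_toulmin(phi, t, tail):
    """U_SGT = -sum_i (-t)^i * P(L >= i) * phi_i, with tail(i) = P(L >= i)."""
    return -sum(((-t) ** i) * tail(i) * count for i, count in phi.items())

phi = {1: 10, 2: 4, 3: 1}                              # made-up prevalences
estimate = smoothed_good_toulmin(phi, t=3.0, tail=lambda i: poisson_tail(2.0, i))
print(estimate)
```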

Species discovery curve

The species discovery curve can also be used. This curve relates the number of species found in an area to the time spent sampling. These curves can also be created by using estimators (such as the Good–Toulmin estimator) and plotting the number of unseen species at each value of $t$. [5]

A species discovery curve is always increasing, as there is never a sample that could decrease the number of discovered species. Furthermore, the species discovery curve is also decelerating: the more samples taken, the fewer unseen species are expected to be discovered. The species discovery curve will also never asymptote, as it is assumed that although the discovery rate might become infinitely slow, it will never actually stop. [5] Two common models for a species discovery curve are the logarithmic and the exponential function.

Example – Corbet's butterflies

As an example, consider the data Corbet provided to Fisher in the 1940s. [2] Using the Good–Toulmin model, the number of unseen species is found using

$$\hat U^{\mathrm{GT}} = -\sum_{i \ge 1} (-t)^i \varphi_i.$$

This can then be used to create a relationship between $t$ and $\hat U^{\mathrm{GT}}$.

Data provided to Fisher by Corbet [2]

Number of observed members, i:     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Number of species, φ_i:          118   74   44   24   29   22   20   19   20   15   12   14    6   12    6

This relationship is shown in the plot below.

Figure: Number of unseen species as a function of t, the ratio of new samples to previous samples.

From the plot, it is seen that at $t = 1$, which was the value of $t$ that Corbet brought to Fisher, the resulting estimate of $U$ is 75, matching what Fisher found. This plot also acts as a species discovery curve for this ecosystem and defines how many new species will be discovered as $t$ increases (and more samples are taken).
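The estimate can be reproduced directly from the table above. The Python sketch below evaluates the Good–Toulmin sum on Corbet's prevalences, recovering Fisher's answer of 75 at $t = 1$ and tracing out points on the species discovery curve for smaller $t$:

```python
# A minimal sketch reproducing the example above from Corbet's table: the
# Good-Toulmin estimate at t = 1 recovers Fisher's answer of 75, and sweeping
# t traces out the species discovery curve described above.

corbet_phi = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]

def good_toulmin(phi_values, t):
    """U_GT = -sum_i (-t)^i * phi_i, with phi_values[0] = phi_1, etc."""
    return -sum(((-t) ** i) * count for i, count in enumerate(phi_values, start=1))

print(good_toulmin(corbet_phi, 1.0))             # -> 75.0, Fisher's estimate
for t in (0.25, 0.5, 0.75, 1.0):                 # points on the discovery curve
    print(t, round(good_toulmin(corbet_phi, t), 1))
```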

Other uses

There are numerous uses for this kind of predictive estimator. Because the estimators are accurate, they allow scientists to extrapolate the results of polling people by a factor of 2: the number of distinct new answers that would appear can be predicted from how many people have already given each answer. The method can also be used to estimate the extent of someone's knowledge. A prime example is determining how many unique words Shakespeare knew based on the written works we have today.[ citation needed ]

Example – How many words did Shakespeare know?

Based on research of Shakespeare's known works done by Thisted and Efron, there are 884,647 total words. [4] The research also found a total of 846 different word types that appear more than 100 times each, and the total number of distinct words was found to be 31,534. [4] Applying the Good–Toulmin model, if an equal quantity of works by Shakespeare were discovered (that is, $t = 1$), then it is estimated that approximately 11,430 previously unused words would be found. The goal would be to derive the estimate as $t \to \infty$. Thisted and Efron estimate a lower bound of roughly 35,000 additional unseen words, meaning that Shakespeare most likely knew over twice as many words as he actually used in all of his writings. [4]


References

  1. Orlitsky, Alon; Suresh, Ananda Theertha; Wu, Yihong (2016-11-22). "Optimal prediction of the number of unseen species". Proceedings of the National Academy of Sciences. 113 (47): 13283–13288. doi:10.1073/pnas.1607774113. PMC 5127330. PMID 27830649.
  2. Orlitsky, Alon; Suresh, Ananda Theertha; Wu, Yihong (2015-11-23). "Estimating the number of unseen species: A bird in the hand is worth log n in the bush". arXiv:1511.07428 [math.ST].
  3. Good, I. J.; Toulmin, G. H. (1956). "The number of new species, and the increase in population coverage, when a sample is increased". Biometrika. 43 (1–2): 45–63. doi:10.1093/biomet/43.1-2.45. ISSN 0006-3444.
  4. Efron, Bradley; Thisted, Ronald (1976). "Estimating the number of unseen species: How many words did Shakespeare know?". Biometrika. 63 (3): 435–447. doi:10.2307/2335721. JSTOR 2335721.
  5. Bebber, D. P.; Marriott, F. H. C.; Gaston, K. J.; Harris, S. A.; Scotland, R. W. (7 July 2007). "Predicting unknown species numbers using discovery curves". Proceedings of the Royal Society B: Biological Sciences. 274 (1618): 1651–1658. doi:10.1098/rspb.2007.0464. PMC 2169286. PMID 17456460.