Sulston score

The Sulston score is an equation used in DNA mapping to numerically assess the likelihood that a given "fingerprint" similarity between two DNA clones is merely a result of chance. Used as such, it is a test of statistical significance. That is, low values imply that similarity is significant, suggesting that two DNA clones overlap one another and that the given similarity is not just a chance event. The name is an eponym that refers to John Sulston by virtue of his being the lead author of the paper that first proposed the equation's use. [1]

The overlap problem in mapping

Each clone in a DNA mapping project has a "fingerprint", i.e. a set of DNA fragment lengths inferred from (1) enzymatically digesting the clone, (2) separating these fragments on a gel, and (3) estimating their lengths based on gel location. For each pairwise clone comparison, one can establish how many lengths from each set match up. Cases having at least one match indicate that the clones might overlap, because matches may represent the same DNA. However, the underlying sequences for each match are not known. Consequently, two fragments whose lengths match may still represent different sequences. In other words, matches do not conclusively indicate overlaps. The problem is instead one of using matches to probabilistically classify overlap status.
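
As an illustration of the matching step, the following minimal Python sketch pairs fragment lengths from two clones when they agree within a gel measurement tolerance. It is not taken from the original mapping software; the greedy pairing strategy, the function name, and the tolerance value are assumptions chosen purely for illustration.

    # Minimal sketch (not from the original mapping software): greedily pair
    # fragment lengths from two clone fingerprints that agree within a gel
    # measurement tolerance.
    def count_matches(lengths_a, lengths_b, tolerance=3.0):
        """Count fragment-length matches between two clone fingerprints.

        lengths_a, lengths_b : estimated fragment lengths (illustrative units)
        tolerance            : maximum difference treated as a "match"
                               (an assumed, illustrative value)
        """
        unused = sorted(lengths_b)
        matches = 0
        for length in sorted(lengths_a):
            # Pair off the first still-unused fragment of the other clone
            # that lies within the tolerance window.
            for i, other in enumerate(unused):
                if abs(length - other) <= tolerance:
                    matches += 1
                    del unused[i]
                    break
        return matches

    # Example: two hypothetical fingerprints sharing three near-identical lengths.
    print(count_matches([1200, 3400, 5100, 7800], [1202, 3398, 5103, 9900]))  # -> 3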

Mathematical scores in overlap assessment

Biologists have used a variety of means (often in combination) to discern clone overlaps in DNA mapping projects. While many of these are biological, e.g. looking for shared markers, others are essentially mathematical, usually adopting probabilistic and/or statistical approaches.

Sulston score exposition

The Sulston score is rooted in the concepts of Bernoulli and binomial processes, as follows. Consider two clones, α and β, having m and n measured fragment lengths, respectively, where m ≥ n. That is, clone α has at least as many fragments as clone β, but usually more. The Sulston score is the probability that at least h fragment lengths on clone β will be matched by any combination of lengths on α. Intuitively, we see that, at most, there can be n matches. Thus, for a given comparison between two clones, one can measure the statistical significance of a match of h fragments, i.e. how likely it is that this match occurred simply as a result of random chance. Very low values indicate a significant match that is highly unlikely to have arisen by pure chance, while higher values suggest that the given match could be just a coincidence.
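
A hedged numerical sketch of this calculation follows. Treating each of the smaller clone's fragments as an independent Bernoulli trial, the score is the binomial tail probability of observing at least h matches among n fragments. The per-fragment chance-match probability p is supplied directly here rather than derived from gel parameters, and the function and variable names are illustrative only.

    from math import comb

    # Minimal sketch of a Sulston-style binomial tail probability.
    # Assumptions: n is the number of fragments on the smaller clone, h is the
    # observed number of matches, and p is the probability that a single
    # fragment finds at least one chance length-match on the other clone
    # (in practice p would be derived from the gel tolerance and the number
    # of fragments on the larger clone; here it is given directly).
    def sulston_like_score(n, h, p):
        """P(at least h of n fragments match by chance), with the fragments
        treated as independent Bernoulli trials of success probability p."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(h, n + 1))

    # Example: 20 fragments, 15 observed matches, a 10% chance match per fragment.
    print(sulston_like_score(20, 15, 0.10))  # very small value => likely a real overlap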

Mathematical refinement

In a 2005 paper,[2] Michael Wendl gave an example showing that the assumption of independent trials is not valid. So, although the traditional Sulston score does indeed represent a probability distribution, it is not actually the distribution characteristic of the fingerprint problem. Wendl went on to give the general solution for this problem in terms of the Bell polynomials, showing that the traditional score overpredicts P-values by orders of magnitude. (P-values are very small in this problem, so we are talking, for example, about probabilities on the order of 10⁻¹⁴ versus 10⁻¹², the latter Sulston value being two orders of magnitude too high.) This solution provides a basis for determining when a problem has sufficient information content to be treated by the probabilistic approach, and it is also a general solution of the birthday problem of two types.

A disadvantage of the exact solution is that its evaluation is computationally intensive and, in fact, is not feasible for comparing large clones. [2] Some fast approximations for this problem have been proposed. [3]

References

  1. Sulston J, Mallett F, Staden R, Durbin R, Horsnell T, Coulson A (Mar 1988). "Software for genome mapping by fingerprinting techniques". Comput Appl Biosci. 4 (1): 125–32. doi:10.1093/bioinformatics/4.1.125. PMID 2838135.
  2. Wendl MC (Apr 2005). "Probabilistic assessment of clone overlaps in DNA fingerprint mapping via a priori models". J. Comput. Biol. 12 (3): 283–97. doi:10.1089/cmb.2005.12.283. PMID 15857243.
  3. Wendl MC (2007). "Algebraic correction methods for computational assessment of clone overlaps in DNA fingerprint mapping". BMC Bioinformatics. 8: 127. doi:10.1186/1471-2105-8-127. PMC 1868038. PMID 17442113.