ProbOnto | |
---|---|
Keywords | Statistics, Probability distribution |
Objective | Design, implement and maintain knowledge base and ontology of probability distributions. |
Duration | 2015 – |
Website | probonto |
ProbOnto is a knowledge base and ontology of probability distributions. [1] [2] ProbOnto 2.5 (released on January 16, 2017) contains over 150 uni- and multivariate distributions and alternative parameterizations, more than 220 relationships and re-parameterization formulas, supporting also the encoding of empirical and univariate mixture distributions.
ProbOnto was initially designed to facilitate the encoding of nonlinear-mixed effect models and their annotation in Pharmacometrics Markup Language (PharmML) [3] [4] developed by DDMoRe, [5] [6] an Innovative Medicines Initiative project. However, ProbOnto, due to its generic structure can be applied in other platforms and modeling tools for encoding and annotation of diverse models applicable to discrete (e.g. count, categorical and time-to-event) and continuous data.
The knowledge base stores for each distribution:
ProbOnto stores in Version 2.5 over 220 relationships between univariate distributions with re-parameterizations as a special case, see figure. While this form of relationships is often neglected in literature, and the authors concentrate one a particular form for each distribution, they are crucial from the interoperability point of view. ProbOnto focuses on this aspect and features more than 15 distributions with alternative parameterizations.
Many distributions are defined with mathematically equivalent but algebraically different formulas. This leads to issues when exchanging models between software tools. [7] The following examples illustrate that.
Normal distribution can be defined in at least three ways
The following formulas can be used to re-calculate the three different forms of the normal distribution (we use abbreviations i.e. instead of etc.)
In the case of the log-normal distribution there are more options. This is due to the fact that it can be parameterized in terms of parameters on the natural and log scale, see figure.
The available forms in ProbOnto 2.0 are
ProbOnto knowledge base stores such re-parameterization formulas to allow for a correct translation of models between tools.
Consider the situation when one would like to run a model using two different optimal design tools, e.g. PFIM [15] and PopED. [16] The former supports the LN2, the latter LN7 parameterization, respectively. Therefore, the re-parameterization is required, otherwise the two tools would produce different results.
For the transition following formulas hold .
For the transition following formulas hold .
All remaining re-parameterisation formulas can be found in the specification document on the project website. [2]
The knowledge base is built from a simple ontological model. At its core, a probability distribution is an instance of the class thereof, a specialization of the class of mathematical objects. A distribution relates to a number of other individuals, which are instances of various categories in the ontology. For example, these are parameters and related functions associated with a given probability distribution. This strategy allows for the rich representation of attributes and relationships between domain objects. The ontology can be seen as a conceptual schema in the domain of mathematics and has been implemented as a PowerLoom knowledge base. [17] An OWL version is generated programmatically using the Jena API. [18]
Output for ProbOnto are provided as supplementary materials and published on or linked from the probonto.org website. The OWL version of ProbOnto is available via Ontology Lookup Service (OLS) [19] to facilitate simple searching and visualization of the content. In addition the OLS API provides methods to programmatically access ProbOnto and to integrate it into applications. ProbOnto is also registered on the BioSharing portal. [20]
A PharmML interface is provided in form of a generic XML schema for the definition of the distributions and their parameters. Defining functions, such as probability density function (PDF), probability mass function (PMF), hazard function (HF) and survival function (SF), can be accessed via methods provided in the PharmML schema.
This example shows how the zero-inflated Poisson distribution is encoded by using its codename and declaring that of its parameters (‘rate’ and ‘probabilityOfZero’). Model parameters Lambda and P0 are assigned to the parameter code names.
<Distribution><po:ProbOntoname="ZeroInflatedPoisson1"><po:Parametername="rate"><ct:Assign><ct:SymbRefsymbIdRef="Lambda"/></ct:Assign></po:Parameter><po:Parametername="probabilityOfZero"><ct:Assign><ct:SymbRefsymbIdRef="P0"/></ct:Assign></po:Parameter></po:ProbOnto></Distribution>
To specify any given distribution unambiguously using ProbOnto, it is sufficient to declare its code name and the code names of its parameters. More examples and a detailed specification can be found on the project website. [2]
In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is The parameter is the mean or expectation of the distribution, while the parameter is the variance. The standard deviation of the distribution is . A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution. A random variable that is log-normally distributed takes only positive real values. It is a convenient and useful model for measurements in the natural sciences, engineering, as well as medicine, economics and other fields. It can be applied to diverse quantities such as energies, concentrations, lengths, prices of financial instruments, and other metrics, while acknowledging the inherent uncertainty in all measurements.
In probability and statistics, Student's t distribution is a continuous probability distribution that generalizes the standard normal distribution. Like the latter, it is symmetric around zero and bell-shaped.
In statistical mechanics and information theory, the Fokker–Planck equation is a partial differential equation that describes the time evolution of the probability density function of the velocity of a particle under the influence of drag forces and random forces, as in Brownian motion. The equation can be generalized to other observables as well. The Fokker-Planck equation has multiple applications in information theory, graph theory, data science, finance, economics etc.
In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.
In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of gaussians, or to solve the multiple linear regression problem.
Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
In nuclear physics, the chiral model, introduced by Feza Gürsey in 1960, is a phenomenological model describing effective interactions of mesons in the chiral limit (where the masses of the quarks go to zero), but without necessarily mentioning quarks at all. It is a nonlinear sigma model with the principal homogeneous space of a Lie group as its target manifold. When the model was originally introduced, this Lie group was the SU(N), where N is the number of quark flavors. The Riemannian metric of the target manifold is given by a positive constant multiplied by the Killing form acting upon the Maurer–Cartan form of SU(N).
In probability and statistics, a circular distribution or polar distribution is a probability distribution of a random variable whose values are angles, usually taken to be in the range [0, 2π). A circular distribution is often a continuous probability distribution, and hence has a probability density, but such distributions can also be discrete, in which case they are called circular lattice distributions. Circular distributions can be used even when the variables concerned are not explicitly angles: the main consideration is that there is not usually any real distinction between events occurring at the opposite ends of the range, and the division of the range could notionally be made at any point.
The scaled inverse chi-squared distribution, where is the scale parameter, equals the univariate inverse Wishart distribution with degrees of freedom .
In probability theory, calculation of the sum of normally distributed random variables is an instance of the arithmetic of random variables.
Oblate spheroidal coordinates are a three-dimensional orthogonal coordinate system that results from rotating the two-dimensional elliptic coordinate system about the non-focal axis of the ellipse, i.e., the symmetry axis that separates the foci. Thus, the two foci are transformed into a ring of radius in the x-y plane. Oblate spheroidal coordinates can also be considered as a limiting case of ellipsoidal coordinates in which the two largest semi-axes are equal in length.
The folded normal distribution is a probability distribution related to the normal distribution. Given a normally distributed random variable X with mean μ and variance σ2, the random variable Y = |X| has a folded normal distribution. Such a case may be encountered if only the magnitude of some variable is recorded, but not its sign. The distribution is called "folded" because probability mass to the left of x = 0 is folded over by taking the absolute value. In the physics of heat conduction, the folded normal distribution is a fundamental solution of the heat equation on the half space; it corresponds to having a perfect insulator on a hyperplane through the origin.
Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.
In probability theory and statistics, the normal-gamma distribution is a bivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and precision.
In probability theory and statistics, the half-normal distribution is a special case of the folded normal distribution.
In probability theory and statistics, the normal-inverse-gamma distribution is a four-parameter family of multivariate continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and variance.
In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and t is the standard logistic function, then X = t(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log (X/(1-X)) is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).
In the mathematical theory of probability, multivariate Laplace distributions are extensions of the Laplace distribution and the asymmetric Laplace distribution to multiple variables. The marginal distributions of symmetric multivariate Laplace distribution variables are Laplace distributions. The marginal distributions of asymmetric multivariate Laplace distribution variables are asymmetric Laplace distributions.