ProbOnto

Keywords	Statistics, Probability distribution
Objective	Design, implement and maintain knowledge base and ontology of probability distributions.
Duration	2015 –
Website	probonto.org

Last updated April 30, 2024

ProbOnto is a knowledge base and ontology of probability distributions.^[1]^[2] ProbOnto 2.5 (released on January 16, 2017) contains over 150 uni- and multivariate distributions and alternative parameterizations and more than 220 relationships and re-parameterization formulas; it also supports the encoding of empirical and univariate mixture distributions.

Introduction

ProbOnto was initially designed to facilitate the encoding of nonlinear mixed-effect models and their annotation in Pharmacometrics Markup Language (PharmML)^[3]^[4] developed by DDMoRe,^[5]^[6] an Innovative Medicines Initiative project. However, owing to its generic structure, ProbOnto can be applied in other platforms and modeling tools for the encoding and annotation of diverse models applicable to discrete (e.g. count, categorical and time-to-event) and continuous data.

Knowledge base

Figure (ProbOnto2.5.jpg): Overview of supported distributions in ProbOnto, version 2.5, and relationships between univariate probability distributions.

The knowledge base stores for each distribution:

Probability density or mass functions and where available cumulative distribution, hazard and survival functions.
Related quantities such as mean, median, mode and variance.
Parameter and support/range definitions and distribution type.
LaTeX and R code for mathematical functions.
Model definition and references.

Relationships

ProbOnto stores in Version 2.5 over 220 relationships between univariate distributions, with re-parameterizations as a special case (see figure). Although this kind of relationship is often neglected in the literature, where authors typically concentrate on one particular form of each distribution, such relationships are crucial for interoperability. ProbOnto focuses on this aspect and features more than 15 distributions with alternative parameterizations.

Alternative parameterizations

Many distributions are defined with mathematically equivalent but algebraically different formulas. This leads to issues when exchanging models between software tools.^[7] The following examples illustrate that.

Normal distribution

The normal distribution can be defined in at least three ways:

Normal1(μ,σ) with mean, μ, and standard deviation, σ ^[8]

$P(x;{\boldsymbol {\mu }},{\boldsymbol {\sigma }})={\frac {1}{\sigma {\sqrt {2\pi }}}}\exp {\Big [}-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}{\Big ]}$

Normal2(μ,υ) with mean, μ, and variance, υ = σ^2 ^[9]

$P(x;{\boldsymbol {\mu }},{\boldsymbol {v}})={\frac {1}{{\sqrt {v}}{\sqrt {2\pi }}}}\exp {\Big [}-{\frac {(x-\mu )^{2}}{2v}}{\Big ]}$

Normal3(μ,τ) with mean, μ, and precision, τ = 1/υ = 1/σ^2 ^[10]^[11]

$P(x;{\boldsymbol {\mu }},{\boldsymbol {\tau }})={\sqrt {\frac {\tau }{2\pi }}}\exp {\Big [}-{\frac {\tau }{2}}(x-\mu )^{2}{\Big ]}$

Re-parameterization formulas

The following formulas convert between the three forms of the normal distribution (abbreviated here, e.g. $N1$ for $Normal1$):

$N1(\mu ,\sigma )\rightarrow N2(\mu ,v):v=\sigma ^{2}{\mbox{ and }}N2(\mu ,v)\rightarrow N1(\mu ,\sigma ):\sigma ={\sqrt {v}};$

$N1(\mu ,\sigma )\rightarrow N3(\mu ,\tau ):\tau =1/\sigma ^{2}{\mbox{ and }}N3(\mu ,\tau )\rightarrow N1(\mu ,\sigma ):\sigma =1/{\sqrt {\tau }};$

$N2(\mu ,v)\rightarrow N3(\mu ,\tau ):\tau =1/v{\mbox{ and }}N3(\mu ,\tau )\rightarrow N2(\mu ,v):v=1/\tau .$
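The conversions above can be sketched in a few lines of Python. This is only an illustration of the formulas as stated in this section, not code taken from ProbOnto; the function names are hypothetical. The density functions follow the three definitions given earlier, so the final check confirms that the three forms describe the same distribution.

```python
import math

# Hypothetical helper names; the formulas are the ones listed above.
def n1_to_n2(mu, sigma):
    """Normal1(mu, sigma) -> Normal2(mu, v), with v = sigma^2."""
    return mu, sigma ** 2

def n2_to_n1(mu, v):
    """Normal2(mu, v) -> Normal1(mu, sigma), with sigma = sqrt(v)."""
    return mu, math.sqrt(v)

def n1_to_n3(mu, sigma):
    """Normal1(mu, sigma) -> Normal3(mu, tau), with tau = 1 / sigma^2."""
    return mu, 1.0 / sigma ** 2

def n3_to_n1(mu, tau):
    """Normal3(mu, tau) -> Normal1(mu, sigma), with sigma = 1 / sqrt(tau)."""
    return mu, 1.0 / math.sqrt(tau)

def n2_to_n3(mu, v):
    """Normal2(mu, v) -> Normal3(mu, tau), with tau = 1 / v."""
    return mu, 1.0 / v

def n3_to_n2(mu, tau):
    """Normal3(mu, tau) -> Normal2(mu, v), with v = 1 / tau."""
    return mu, 1.0 / tau

def pdf_n1(x, mu, sigma):
    """Density in the mean / standard deviation form."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def pdf_n2(x, mu, v):
    """Density in the mean / variance form."""
    return math.exp(-(x - mu) ** 2 / (2 * v)) / (math.sqrt(v) * math.sqrt(2 * math.pi))

def pdf_n3(x, mu, tau):
    """Density in the mean / precision form."""
    return math.sqrt(tau / (2 * math.pi)) * math.exp(-tau / 2 * (x - mu) ** 2)

if __name__ == "__main__":
    mu, sigma, x = 1.0, 2.0, 0.3
    # All three values should agree up to floating-point rounding.
    print(pdf_n1(x, mu, sigma))
    print(pdf_n2(x, *n1_to_n2(mu, sigma)))
    print(pdf_n3(x, *n1_to_n3(mu, sigma)))
```

Passing a standard deviation where a tool expects a variance or a precision gives a valid but different distribution, which is why an explicit conversion step matters when models move between tools.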

Log-normal distribution

For the log-normal distribution there are more options, because it can be parameterized in terms of parameters on the natural scale and on the log scale (see figure).

Figure (LNrelationships.png): Support of different parameterizations of the log-normal distribution in various tools and their connections; see text for examples. Tools visualised are Matlab (supports LN1), MCSim (LN6), Monolix (LN2 & LN3), PFIM (LN2 & LN3), Phoenix NLME (LN1, LN3 & LN6), PopED (LN7), R (LN1), Simcyp Simulator (LN1), Simulx (LN1) and WinBUGS (LN5).

The available forms in ProbOnto 2.0 are:

LogNormal1(μ,σ) with mean, μ, and standard deviation, σ, both on the log-scale^[8]

$P(x;{\boldsymbol {\mu }},{\boldsymbol {\sigma }})={\frac {1}{x\sigma {\sqrt {2\pi }}}}\exp {\Big [}{\frac {-(\log x-\mu )^{2}}{2\sigma ^{2}}}{\Big ]}$

LogNormal2(μ,υ) with mean, μ, and variance, υ, both on the log-scale

$P(x;{\boldsymbol {\mu }},{\boldsymbol {v}})={\frac {1}{x{\sqrt {v}}{\sqrt {2\pi }}}}\exp {\Big [}{\frac {-(\log x-\mu )^{2}}{2v}}{\Big ]}$

LogNormal3(m,σ) with median, m, on the natural scale and standard deviation, σ, on the log-scale^[8]

$P(x;{\boldsymbol {m}},{\boldsymbol {\sigma }})={\frac {1}{x\sigma {\sqrt {2\pi }}}}\exp {\Big [}{\frac {-[\log(x/m)]^{2}}{2\sigma ^{2}}}{\Big ]}$

LogNormal4(m,cv) with median, m, and coefficient of variation, cv, both on the natural scale

$P(x;{\boldsymbol {m}},{\boldsymbol {cv}})={\frac {1}{x{\sqrt {\log(cv^{2}+1)}}{\sqrt {2\pi }}}}\exp {\Big [}{\frac {-[\log(x/m)]^{2}}{2\log(cv^{2}+1)}}{\Big ]}$

LogNormal5(μ,τ) with mean, μ, and precision, τ, both on the log-scale^[12]

$P(x;{\boldsymbol {\mu }},{\boldsymbol {\tau }})={\sqrt {\frac {\tau }{2\pi }}}{\frac {1}{x}}\exp {\Big [}{-{\frac {\tau }{2}}(\log x-\mu )^{2}}{\Big ]}$

LogNormal6(m,σ_g) with median, m, and geometric standard deviation, σ_g, both on the natural scale^[13]

$P(x;{\boldsymbol {m}},{\boldsymbol {\sigma _{g}}})={\frac {1}{x\log(\sigma _{g}){\sqrt {2\pi }}}}\exp {\Big [}{\frac {-[\log(x/m)]^{2}}{2\log ^{2}(\sigma _{g})}}{\Big ]}$

LogNormal7(μ_N,σ_N) with mean, μ_N, and standard deviation, σ_N, both on the natural scale^[14]

$P(x;{\boldsymbol {\mu _{N}}},{\boldsymbol {\sigma _{N}}})={\frac {1}{x{\sqrt {2\pi \log {\Big (}1+\sigma _{N}^{2}/\mu _{N}^{2}{\Big )}}}}}\exp {\Bigg (}{\frac {-{\Big [}\log(x)-\log {\Big (}{\frac {\mu _{N}}{\sqrt {1+\sigma _{N}^{2}/\mu _{N}^{2}}}}{\Big )}{\Big ]}^{2}}{2\log {\Big (}1+\sigma _{N}^{2}/\mu _{N}^{2}{\Big )}}}{\Bigg )}$

The ProbOnto knowledge base stores such re-parameterization formulas to allow for the correct translation of models between tools.
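As an illustration, each of the seven log-normal forms can be mapped onto LogNormal1(μ,σ) by comparing its density with the LogNormal1 density. The Python sketch below does this; the function names and the string labels "LN1" to "LN7" are hypothetical choices for this example and are not part of ProbOnto, and the relations are read off the densities listed above rather than copied from the specification.

```python
import math

def to_ln1(form, a, b):
    """Map the two parameters (a, b) of a log-normal form to LN1's (mu, sigma).

    The relations follow from matching each density above with LogNormal1.
    """
    if form == "LN1":   # (mu, sigma), both on the log scale
        return a, b
    if form == "LN2":   # (mu, v) with v = sigma^2
        return a, math.sqrt(b)
    if form == "LN3":   # (m, sigma) with median m = exp(mu)
        return math.log(a), b
    if form == "LN4":   # (m, cv) with sigma^2 = log(cv^2 + 1)
        return math.log(a), math.sqrt(math.log(b ** 2 + 1))
    if form == "LN5":   # (mu, tau) with tau = 1 / sigma^2
        return a, 1.0 / math.sqrt(b)
    if form == "LN6":   # (m, sigma_g) with sigma = log(sigma_g)
        return math.log(a), math.log(b)
    if form == "LN7":   # (mu_N, sigma_N), mean and sd on the natural scale
        s2 = math.log(1 + b ** 2 / a ** 2)
        return math.log(a) - s2 / 2, math.sqrt(s2)
    raise ValueError(f"unknown log-normal form: {form}")

def ln1_pdf(x, mu, sigma):
    """LogNormal1 density for x > 0."""
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * math.sqrt(2 * math.pi))

if __name__ == "__main__":
    # A precision of 4 on the log scale corresponds to sigma = 0.5.
    mu, sigma = to_ln1("LN5", 0.0, 4.0)
    print(mu, sigma, ln1_pdf(1.0, mu, sigma))
```

Routing every form through one reference parameterization keeps the number of conversion formulas linear in the number of forms, at the cost of one extra step when translating between two non-reference forms.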

Examples for re-parameterization

Consider the situation in which one would like to run a model using two different optimal design tools, e.g. PFIM^[15] and PopED.^[16] The former supports the LN2 parameterization and the latter LN7. Re-parameterization is therefore required; otherwise the two tools would produce different results.

For the transition $LN2(\mu ,v)\rightarrow LN7(\mu _{N},\sigma _{N})$, the following formulas hold: $\mu _{N}=\exp(\mu +v/2){\text{ and }}\sigma _{N}=\exp(\mu +v/2){\sqrt {\exp(v)-1}}$.

For the transition $LN7(\mu _{N},\sigma _{N})\rightarrow LN2(\mu ,v)$, the following formulas hold: $\mu =\log {\Big (}\mu _{N}/{\sqrt {1+\sigma _{N}^{2}/\mu _{N}^{2}}}{\Big )}{\text{ and }}v=\log(1+\sigma _{N}^{2}/\mu _{N}^{2})$.
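The two transitions can be checked numerically with a short round trip. The Python sketch below implements only the formulas stated above; the function names are hypothetical and the parameter values are arbitrary illustration values, not taken from PFIM or PopED.

```python
import math

def ln2_to_ln7(mu, v):
    """LN2(mu, v) on the log scale -> LN7(mu_N, sigma_N) on the natural scale."""
    mu_n = math.exp(mu + v / 2)
    sigma_n = mu_n * math.sqrt(math.exp(v) - 1)
    return mu_n, sigma_n

def ln7_to_ln2(mu_n, sigma_n):
    """LN7(mu_N, sigma_N) on the natural scale -> LN2(mu, v) on the log scale."""
    ratio = 1 + sigma_n ** 2 / mu_n ** 2
    mu = math.log(mu_n / math.sqrt(ratio))
    v = math.log(ratio)
    return mu, v

if __name__ == "__main__":
    mu, v = 0.5, 0.2                      # arbitrary illustration values
    mu_n, sigma_n = ln2_to_ln7(mu, v)
    print(mu_n, sigma_n)
    # Converting back should recover (mu, v) up to rounding error.
    print(ln7_to_ln2(mu_n, sigma_n))
```

Entering the log-scale values (μ, υ) directly into a tool that expects (μ_N, σ_N) would define a different distribution, which is the source of the diverging results mentioned above.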

All remaining re-parameterization formulas can be found in the specification document on the project website.^[2]

Ontology

The knowledge base is built from a simple ontological model. At its core, a probability distribution is an instance of the class of probability distributions, which is a specialization of the class of mathematical objects. A distribution relates to a number of other individuals, which are instances of various categories in the ontology, such as the parameters and related functions associated with a given probability distribution. This strategy allows for a rich representation of attributes and relationships between domain objects. The ontology can be seen as a conceptual schema in the domain of mathematics and has been implemented as a PowerLoom knowledge base.^[17] An OWL version is generated programmatically using the Jena API.^[18]

Outputs of ProbOnto are provided as supplementary materials and published on or linked from the probonto.org website. The OWL version of ProbOnto is available via the Ontology Lookup Service (OLS)^[19] to facilitate simple searching and visualization of the content. In addition, the OLS API provides methods to programmatically access ProbOnto and to integrate it into applications. ProbOnto is also registered on the BioSharing portal.^[20]

ProbOnto in PharmML

A PharmML interface is provided in the form of a generic XML schema for the definition of the distributions and their parameters. Defining functions, such as the probability density function (PDF), probability mass function (PMF), hazard function (HF) and survival function (SF), can be accessed via methods provided in the PharmML schema.

Use example

This example shows how the zero-inflated Poisson distribution is encoded by using its code name and declaring the code names of its parameters (‘rate’ and ‘probabilityOfZero’). The model parameters Lambda and P0 are assigned to the parameter code names.

<Distribution>
    <po:ProbOnto name="ZeroInflatedPoisson1">
        <po:Parameter name="rate">
            <ct:Assign>
                <ct:SymbRef symbIdRef="Lambda"/>
            </ct:Assign>
        </po:Parameter>
        <po:Parameter name="probabilityOfZero">
            <ct:Assign>
                <ct:SymbRef symbIdRef="P0"/>
            </ct:Assign>
        </po:Parameter>
    </po:ProbOnto>
</Distribution>

To specify any given distribution unambiguously using ProbOnto, it is sufficient to declare its code name and the code names of its parameters. More examples and a detailed specification can be found on the project website.^[2]
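A fragment of this shape could be generated programmatically. The following Python sketch builds it with the standard-library `xml.etree.ElementTree` module. The prefixes `po` and `ct` and the element and attribute names are taken from the example above, but the namespace URIs are placeholders invented for this sketch (the real PharmML URIs are not given here), and the function name `encode_distribution` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder namespace URIs, not the real PharmML ones.
PO_NS = "http://example.org/placeholder/probonto"
CT_NS = "http://example.org/placeholder/commontypes"

ET.register_namespace("po", PO_NS)
ET.register_namespace("ct", CT_NS)

def encode_distribution(code_name, parameters):
    """Build a <Distribution> element for a ProbOnto code name.

    `parameters` maps each parameter code name to the identifier of the
    model symbol assigned to it.
    """
    dist = ET.Element("Distribution")
    po = ET.SubElement(dist, f"{{{PO_NS}}}ProbOnto", {"name": code_name})
    for param_name, symbol_id in parameters.items():
        param = ET.SubElement(po, f"{{{PO_NS}}}Parameter", {"name": param_name})
        assign = ET.SubElement(param, f"{{{CT_NS}}}Assign")
        ET.SubElement(assign, f"{{{CT_NS}}}SymbRef", {"symbIdRef": symbol_id})
    return dist

element = encode_distribution(
    "ZeroInflatedPoisson1",
    {"rate": "Lambda", "probabilityOfZero": "P0"},
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

The serialized output declares the two placeholder namespaces on the elements that use them; in a real PharmML document these declarations would come from the enclosing document and its schema.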

References

[1] Swat, MJ; Grenon, P; Wimalaratne, S (2016). "ProbOnto: ontology and knowledge base of probability distributions". Bioinformatics. 32: 2719. doi:10.1093/bioinformatics/btw170. PMC 5013898. PMID 27153608.
[2] Main project website, URL: http://probonto.org
[3] Swat MJ. et al. (2015). Pharmacometrics Markup Language (PharmML): Opening New Perspectives for Model Exchange in Drug Development. CPT Pharmacometrics Syst Pharmacol, 4(6):316-9.
[4] PharmML website, URL: http://pharmml.org
[5] DDMoRe project website, URL: http://ddmore.eu
[6] ProbOnto description on the DDMoRe website, URL: http://ddmore.eu/probonto
[7] LeBauer DS et al. Translating probability density functions: From R to BUGS and back again, R Journal, 2013
[8] Forbes et al. Probability Distributions (2011), John Wiley & Sons, Inc.
[9] Wolfram Mathworld, URL: http://mathworld.wolfram.com/NormalDistribution.html
[10] 'LaplacesDemon' R package, URL: http://search.r-project.org/library/LaplacesDemon/html/dist.Normal.Precision.html
[11] Cyert RM, DeGroot MH (1987). Bayesian Analysis and Uncertainty in Economic Theory. Rowman & Littlefield.
[12] Lunn, D. (2012). The BUGS book: a practical introduction to Bayesian analysis. Texts in statistical science. CRC Press.
[13] Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341-352.
[14] Nyberg J. et al. (2012) PopED - An extended, parallelized, population optimal design tool. Comput Methods Programs Biomed.; 108(2):789-805. doi: 10.1016/j.cmpb.2012.05.005
[15] Retout S, Duffull S, Mentré F (2001) Development and implementation of the population Fisher information matrix for the evaluation of population pharmacokinetic designs. Comp Meth Pro Biomed 65:141–151
[16] The PopED Development Team (2016). PopED Manual, Release version 2.13. Technical report, Uppsala University.
[17] MacGregor R. et al. (1997) PowerLoom Manual. ISI, University of Southern California, Marina del Rey.
[18] McBride B. (2001) Jena: Implementing the RDF model and syntax specification. In: SemWeb.
[19] ProbOnto on Ontology Lookup Service, URL: http://www.ebi.ac.uk/ols/ontologies/probonto
[20] ProbOnto on BioSharing, the database of biological databases, URL: https://biosharing.org/biodbcore-000772

External links

Official website
Leemis chart
Ultimate Univariate Probability Distribution Explorer – most likely the largest, free collection of univariate distributions and their features.
UncertML

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
