ProbOnto

Keywords: Statistics, Probability distribution
Objective: Design, implement and maintain a knowledge base and ontology of probability distributions.
Duration: 2015 –
Website: probonto.org

ProbOnto is a knowledge base and ontology of probability distributions. [1] [2] ProbOnto 2.5 (released on January 16, 2017) contains over 150 uni- and multivariate distributions and alternative parameterizations, and more than 220 relationships and re-parameterization formulas. It also supports the encoding of empirical and univariate mixture distributions.

Introduction

ProbOnto was initially designed to facilitate the encoding of nonlinear mixed-effect models and their annotation in Pharmacometrics Markup Language (PharmML) [3] [4] developed by DDMoRe, [5] [6] an Innovative Medicines Initiative project. However, owing to its generic structure, ProbOnto can be applied in other platforms and modeling tools for the encoding and annotation of diverse models applicable to discrete (e.g. count, categorical and time-to-event) and continuous data.

Knowledge base

Overview of supported distributions in ProbOnto, version 2.5, and relationships between univariate probability distributions.

The knowledge base stores for each distribution:

Relationships

In version 2.5, ProbOnto stores over 220 relationships between univariate distributions, with re-parameterizations as a special case (see figure). Relationships of this kind are often neglected in the literature, where authors tend to concentrate on one particular form of each distribution, yet they are crucial from the interoperability point of view. ProbOnto focuses on this aspect and features more than 15 distributions with alternative parameterizations.

Alternative parameterizations

Many distributions are defined with mathematically equivalent but algebraically different formulas. This leads to issues when exchanging models between software tools. [7] The following examples illustrate that.

Normal distribution

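The normal distribution can be defined in at least three ways:

  • Normal1(μ,σ) with mean, μ, and standard deviation, σ [8]
  • Normal2(μ,υ) with mean, μ, and variance, υ = σ^2 [9] or
  • Normal3(μ,τ) with mean, μ, and precision, τ = 1/υ = 1/σ^2 [10] [11]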


Re-parameterization formulas

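The following formulas can be used to convert between the three forms of the normal distribution, which share the mean μ and differ in using the standard deviation σ (N1), the variance υ (N2) or the precision τ (N3). Abbreviated names are used, e.g. N2 instead of Normal2.

  • N1 → N2: υ = σ^2
  • N1 → N3: τ = 1/σ^2
  • N2 → N1: σ = √υ
  • N2 → N3: τ = 1/υ
  • N3 → N1: σ = 1/√τ
  • N3 → N2: υ = 1/τ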
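A minimal Python sketch of why the choice of parameterization matters when a model moves between tools. It relies only on the standard relations between standard deviation σ, variance υ = σ^2 and precision τ = 1/σ^2; the function names are hypothetical and are not part of ProbOnto or of any tool mentioned here.

```python
import math


def n1_to_n2(mu, sigma):
    """(mean, standard deviation) -> (mean, variance)."""
    return mu, sigma ** 2


def n2_to_n3(mu, var):
    """(mean, variance) -> (mean, precision)."""
    return mu, 1.0 / var


def n3_to_n1(mu, tau):
    """(mean, precision) -> (mean, standard deviation)."""
    return mu, 1.0 / math.sqrt(tau)


# The same number describes three different spreads, depending on
# whether the receiving tool reads it as sd, variance or precision.
value = 4.0
sd_if_sd = value
sd_if_variance = math.sqrt(value)
sd_if_precision = n3_to_n1(0.0, value)[1]
print(sd_if_sd, sd_if_variance, sd_if_precision)  # 4.0 2.0 0.5
```

Passing the second parameter unchanged from one tool to another would therefore silently alter the model unless the appropriate conversion is applied.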

Log-normal distribution

In the case of the log-normal distribution there are more options. This is due to the fact that it can be parameterized in terms of parameters on the natural and log scale, see figure.

Overview of parameterizations of the log-normal distributions.
Support of different parameterizations of the log-normal distributions in various tools and their connections, see text for examples. Tools visualised are Matlab (supports LN1), MCSim (LN6), Monolix (LN2 & LN3), PFIM (LN2 & LN3), Phoenix NLME (LN1, LN3 & LN6), PopED (LN7), R (programming language) (LN1), Simcyp Simulator (LN1), Simulx (LN1) and WinBUGS (LN5).

The available forms in ProbOnto 2.0 are

  • LogNormal1(μ,σ) with mean, μ, and standard deviation, σ, both on the log-scale [8]
  • LogNormal2(μ,υ) with mean, μ, and variance, υ, both on the log-scale
  • LogNormal3(m,σ) with median, m, on the natural scale and standard deviation, σ, on the log-scale [8]
  • LogNormal5(μ,τ) with mean, μ, and precision, τ, both on the log-scale [12]
  • LogNormal7(μN,σN) with mean, μN, and standard deviation, σN, both on the natural scale [14]

The ProbOnto knowledge base stores such re-parameterization formulas to allow for a correct translation of models between tools.
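As an illustration of how such a translation could work, the sketch below converts between the forms whose second parameter lives on the log scale (LN1, LN2, LN3 and LN5) by routing every conversion through LN1. The relations used follow from the definitions in the list above (υ = σ^2, τ = 1/σ^2, m = exp(μ)); the function names and the hub-through-LN1 design are assumptions made for this example, not ProbOnto's implementation.

```python
import math


def to_ln1(form, a, b):
    """Convert parameters (a, b) of the given form to LN1 (mu, sigma)."""
    if form == "LN1":   # (mu, sigma)
        return a, b
    if form == "LN2":   # (mu, variance on the log scale)
        return a, math.sqrt(b)
    if form == "LN3":   # (median on the natural scale, sigma)
        return math.log(a), b
    if form == "LN5":   # (mu, precision on the log scale)
        return a, 1.0 / math.sqrt(b)
    raise ValueError(f"unsupported form: {form}")


def from_ln1(form, mu, sigma):
    """Convert LN1 (mu, sigma) to the parameters of the given form."""
    if form == "LN1":
        return mu, sigma
    if form == "LN2":
        return mu, sigma ** 2
    if form == "LN3":
        return math.exp(mu), sigma
    if form == "LN5":
        return mu, 1.0 / sigma ** 2
    raise ValueError(f"unsupported form: {form}")


def translate(source, target, a, b):
    """Translate parameters from one form to another via LN1."""
    return from_ln1(target, *to_ln1(source, a, b))


# A precision of 4 on the log scale corresponds to a variance of 0.25.
print(translate("LN5", "LN2", 0.0, 4.0))  # (0.0, 0.25)
```

Using one form as a hub keeps the number of formulas linear in the number of parameterizations, at the cost of one extra step per translation.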

Examples for re-parameterization

Consider the situation when one would like to run a model using two different optimal design tools, e.g. PFIM [15] and PopED. [16] The former supports the LN2 parameterization and the latter LN7. A re-parameterization is therefore required, otherwise the two tools would produce different results.

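For the transition LN2(μ,υ) → LN7(μN,σN) the following formulas hold: μN = exp(μ + υ/2) and σN = exp(μ + υ/2)·√(exp(υ) − 1).

For the transition LN7(μN,σN) → LN2(μ,υ) the following formulas hold: μ = ln(μN / √(1 + σN^2/μN^2)) and υ = ln(1 + σN^2/μN^2).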

All remaining re-parameterization formulas can be found in the specification document on the project website. [2]
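The PFIM/PopED case can be sketched in Python using the standard expressions for the mean and standard deviation of a log-normal variable. Here LN2 is taken as (μ, υ) on the log scale and LN7 as (μN, σN) on the natural scale, as in the list of forms above; the function names are hypothetical.

```python
import math


def ln2_to_ln7(mu, var):
    """Log-scale mean and variance -> natural-scale mean and sd."""
    mean_n = math.exp(mu + var / 2.0)
    sd_n = mean_n * math.sqrt(math.exp(var) - 1.0)
    return mean_n, sd_n


def ln7_to_ln2(mean_n, sd_n):
    """Natural-scale mean and sd -> log-scale mean and variance."""
    ratio = 1.0 + (sd_n / mean_n) ** 2   # equals exp(var)
    mu = math.log(mean_n / math.sqrt(ratio))
    var = math.log(ratio)
    return mu, var


# Round trip: converting to LN7 and back recovers the LN2 parameters
# up to floating-point error.
mu, var = ln7_to_ln2(*ln2_to_ln7(1.0, 0.25))
print(math.isclose(mu, 1.0), math.isclose(var, 0.25))
```

The round trip is a convenient sanity check when implementing any pair of re-parameterization formulas.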

Ontology

The knowledge base is built from a simple ontological model. At its core, a probability distribution is an instance of the class thereof, a specialization of the class of mathematical objects. A distribution relates to a number of other individuals, which are instances of various categories in the ontology. For example, these are parameters and related functions associated with a given probability distribution. This strategy allows for the rich representation of attributes and relationships between domain objects. The ontology can be seen as a conceptual schema in the domain of mathematics and has been implemented as a PowerLoom knowledge base. [17] An OWL version is generated programmatically using the Jena API. [18]

Outputs of ProbOnto are provided as supplementary materials and are published on, or linked from, the probonto.org website. The OWL version of ProbOnto is available via the Ontology Lookup Service (OLS) [19] to facilitate simple searching and visualization of the content. In addition, the OLS API provides methods to access ProbOnto programmatically and to integrate it into applications. ProbOnto is also registered on the BioSharing portal. [20]

ProbOnto in PharmML

A PharmML interface is provided in the form of a generic XML schema for the definition of the distributions and their parameters. Defining functions, such as the probability density function (PDF), probability mass function (PMF), hazard function (HF) and survival function (SF), can be accessed via methods provided in the PharmML schema.

Use example

This example shows how the zero-inflated Poisson distribution is encoded by using its code name and declaring the code names of its parameters (‘rate’ and ‘probabilityOfZero’). The model parameters Lambda and P0 are assigned to the parameter code names.

<Distribution>
    <po:ProbOnto name="ZeroInflatedPoisson1">
        <po:Parameter name="rate">
            <ct:Assign>
                <ct:SymbRef symbIdRef="Lambda"/>
            </ct:Assign>
        </po:Parameter>
        <po:Parameter name="probabilityOfZero">
            <ct:Assign>
                <ct:SymbRef symbIdRef="P0"/>
            </ct:Assign>
        </po:Parameter>
    </po:ProbOnto>
</Distribution>

To specify any given distribution unambiguously using ProbOnto, it is sufficient to declare its code name and the code names of its parameters. More examples and a detailed specification can be found on the project website. [2]
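Because only a code name and a mapping of parameter code names to model symbols are needed, such a fragment can be generated mechanically. The following Python sketch builds the element structure shown in the use example with the standard library. The namespace URIs bound to the `po` and `ct` prefixes are placeholders invented for this illustration; the real URIs are defined by the PharmML schema and are not given in this article. The helper name `distribution_element` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions, not the real PharmML values).
PO = "urn:example:probonto"
CT = "urn:example:commontypes"
ET.register_namespace("po", PO)
ET.register_namespace("ct", CT)


def distribution_element(code_name, parameters):
    """Build a <Distribution> element for a ProbOnto code name.

    `parameters` maps each parameter code name to the identifier of
    the model symbol assigned to it.
    """
    dist = ET.Element("Distribution")
    po = ET.SubElement(dist, f"{{{PO}}}ProbOnto", {"name": code_name})
    for param_code, symbol in parameters.items():
        param = ET.SubElement(po, f"{{{PO}}}Parameter", {"name": param_code})
        assign = ET.SubElement(param, f"{{{CT}}}Assign")
        ET.SubElement(assign, f"{{{CT}}}SymbRef", {"symbIdRef": symbol})
    return dist


elem = distribution_element(
    "ZeroInflatedPoisson1",
    {"rate": "Lambda", "probabilityOfZero": "P0"},
)
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```

The serialized output additionally carries the namespace declarations on the root element, which the excerpt above omits.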

References

  1. Swat, MJ; Grenon, P; Wimalaratne, S (2016). "ProbOnto: ontology and knowledge base of probability distributions". Bioinformatics. 32: 2719. doi:10.1093/bioinformatics/btw170. PMC 5013898. PMID 27153608.
  2. Main project website, URL: http://probonto.org
  3. Swat MJ. et al. (2015). Pharmacometrics Markup Language (PharmML): Opening New Perspectives for Model Exchange in Drug Development. CPT Pharmacometrics Syst Pharmacol, 4(6):316-9.
  4. PharmML website, URL: http://pharmml.org
  5. DDMoRe project website, URL: http://ddmore.eu
  6. ProbOnto description on the DDMoRe website, URL: http://ddmore.eu/probonto
  7. LeBauer DS et al. Translating probability density functions: From R to BUGS and back again, R Journal, 2013
  8. Forbes et al. Probability Distributions (2011), John Wiley & Sons, Inc.
  9. Wolfram Mathworld, URL: http://mathworld.wolfram.com/NormalDistribution.html
  10. 'LaplacesDemon' R package, URL: http://search.r-project.org/library/LaplacesDemon/html/dist.Normal.Precision.html
  11. Cyert RM, DeGroot MH. Bayesian Analysis and Uncertainty in Economic Theory (1987), Rowman & Littlefield
  12. Lunn, D. (2012). The BUGS book: a practical introduction to Bayesian analysis. Texts in statistical science. CRC Press.
  13. Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341-352.
  14. Nyberg J. et al. (2012) PopED - An extended, parallelized, population optimal design tool. Comput Methods Programs Biomed.; 108(2):789-805. doi: 10.1016/j.cmpb.2012.05.005
  15. Retout S, Duffull S, Mentré F (2001) Development and implementation of the population Fisher information matrix for the evaluation of population pharmacokinetic designs. Comp Meth Pro Biomed 65:141–151
  16. The PopED Development Team (2016). PopED Manual, Release version 2.13. Technical report, Uppsala University.
  17. MacGregor R. et al. (1997) Powerloom Manual. ISI, University of Southern California, Marina del Rey.
  18. McBride B. (2001) Jena: Implementing the RDF model and syntax specification. In: SemWeb.
  19. ProbOnto on Ontology Lookup Service, URL: http://www.ebi.ac.uk/ols/ontologies/probonto
  20. ProbOnto on BioSharing, the database of biological databases, URL: https://biosharing.org/biodbcore-000772