Monotone likelihood ratio

Last updated August 26, 2023

Figure: A monotonic likelihood ratio in distributions $f(x)$ and $g(x)$. The ratio of the density functions is monotone in the argument $x$, so $f(x)/g(x)$ satisfies the monotone likelihood ratio property.

Intuition
Example: Working hard or slacking off
Families of distributions satisfying MLR
List of families
Hypothesis testing
Example: Effort and output
Relation to other statistical properties
Exponential families
Uniformly most powerful tests: The Karlin–Rubin theorem
Median unbiased estimation
Lifetime analysis: Survival analysis and reliability
Uses
Economics
References

In statistics, the monotone likelihood ratio property is a property of the ratio of two probability density functions (PDFs). Formally, distributions $f(x)$ and $g(x)$ bear the property if

$${\text{for every }}x_{1}>x_{0},\quad {\frac {f(x_{1})}{g(x_{1})}}\geq {\frac {f(x_{0})}{g(x_{0})}}$$

that is, if the ratio is nondecreasing in the argument $x$.

If the functions are differentiable, the property may be stated as

$${\frac {\partial }{\partial x}}\left({\frac {f(x)}{g(x)}}\right)\geq 0$$

For two distributions that satisfy the definition with respect to some argument x, we say they "have the MLRP in x." For a family of distributions that all satisfy the definition with respect to some statistic T(X), we say they "have the MLR in T(X)."
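
The definition can be checked numerically on a grid. The following is a minimal sketch rather than a definitive test: the helper `has_mlrp_on_grid`, the grid, and the choice of densities (two unit-variance normals, and two Cauchy densities as a counterexample) are assumptions made for illustration only.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, loc):
    """Density of a Cauchy distribution with location loc and unit scale."""
    return 1.0 / (math.pi * (1.0 + (x - loc) ** 2))

def has_mlrp_on_grid(f, g, xs, tol=1e-12):
    """Return True if f(x)/g(x) is non-decreasing along the sorted grid xs.

    This only inspects the grid points, so it can refute the property
    but cannot prove it for every real x.
    """
    ratios = [f(x) / g(x) for x in sorted(xs)]
    return all(later >= earlier - tol for earlier, later in zip(ratios, ratios[1:]))

# Hypothetical grid on [-5, 5] with step 0.1.
grid = [i / 10.0 for i in range(-50, 51)]

# Normal with mean 1 against normal with mean 0: the ratio is exp(x - 1/2).
normal_ok = has_mlrp_on_grid(lambda x: normal_pdf(x, 1.0),
                             lambda x: normal_pdf(x, 0.0), grid)

# Cauchy with location 1 against location 0: the ratio is not monotone.
cauchy_ok = has_mlrp_on_grid(lambda x: cauchy_pdf(x, 1.0),
                             lambda x: cauchy_pdf(x, 0.0), grid)

print(normal_ok, cauchy_ok)
```

The Cauchy pair illustrates that shifting a distribution to the right is not enough on its own for the ratio to be monotone.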

Intuition

The MLRP is used to represent a data-generating process that enjoys a straightforward relationship between the magnitude of some observed variable and the distribution it draws from. If $f(x)$ satisfies the MLRP with respect to $g(x)$, the higher the observed value $x$, the more likely it was drawn from distribution $f$ rather than $g$. As usual for monotonic relationships, the likelihood ratio's monotonicity comes in handy in statistics, particularly when using maximum-likelihood estimation. Also, distribution families with MLR have a number of well-behaved stochastic properties, such as first-order stochastic dominance and monotone hazard rates. Unfortunately, as is also usual, the strength of this assumption comes at the price of realism: many processes in the world do not exhibit a monotonic correspondence between input and output.

Example: Working hard or slacking off

Suppose you are working on a project, and you can either work hard or slack off. Call your choice of effort $e$ and the quality of the resulting project $q$ . If the MLRP holds for the distribution of q conditional on your effort $e$ , the higher the quality the more likely you worked hard. Conversely, the lower the quality the more likely you slacked off.

Choose effort $e\in \{H,L\}$ where H means high, L means low
Observe $q$ drawn from $f(q\mid e)$ . By Bayes' law with a uniform prior,
$\Pr[e=H\mid q]={\frac {f(q\mid H)}{f(q\mid H)+f(q\mid L)}}$
Suppose $f(q\mid e)$ satisfies the MLRP. Rearranging, the probability the worker worked hard is

$${\frac {1}{1+f(q\mid L)/f(q\mid H)}}$$

which, thanks to the MLRP, is monotonically increasing in $q$ (because $f(q\mid L)/f(q\mid H)$ is decreasing in $q$). Hence if some employer is doing a "performance review" he can infer his employee's behavior from the merits of his work.
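
The Bayes update in this example can be sketched numerically. The densities below are assumptions chosen purely for illustration (quality is taken to be normal with mean 1 under high effort and mean 0 under low effort, both with unit variance); the source does not specify $f(q\mid e)$.

```python
import math

def normal_pdf(q, mean, sd=1.0):
    """Density of a normal distribution, used here as a stand-in for f(q | e)."""
    z = (q - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_high_effort(q, mean_high=1.0, mean_low=0.0):
    """Pr[e = H | q] under a uniform prior over {H, L}.

    The means are hypothetical; any pair of densities with the MLRP
    would give a posterior that is monotone in q.
    """
    f_high = normal_pdf(q, mean_high)
    f_low = normal_pdf(q, mean_low)
    return f_high / (f_high + f_low)

qualities = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
posteriors = [posterior_high_effort(q) for q in qualities]

for q, p in zip(qualities, posteriors):
    print(f"q = {q:5.1f}  Pr[e = H | q] = {p:.4f}")
```

Under these assumed densities the posterior rises with $q$ and equals one half at the midpoint $q = 0.5$, where the two densities coincide.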

Families of distributions satisfying MLR

Statistical models often assume that data are generated by a distribution from some family of distributions and seek to determine that distribution. This task is simplified if the family has the monotone likelihood ratio property (MLRP).

A family of density functions $\{f_{\theta }(x)\}_{\theta \in \Theta }$ indexed by a parameter $\theta$ taking values in an ordered set $\Theta$ is said to have a monotone likelihood ratio (MLR) in the statistic $T(X)$ if for any $\theta _{1}<\theta _{2}$ ,

$${\frac {f_{\theta _{2}}(X=x_{1},x_{2},x_{3},\dots )}{f_{\theta _{1}}(X=x_{1},x_{2},x_{3},\dots )}}$$

is a non-decreasing function of $T(X)$.

Then we say the family of distributions "has MLR in $T(X)$ ".

List of families

Family	Statistic $T(X)$ in which $f_{\theta }(X)$ has the MLR
Exponential $[\lambda ]$	$\sum x_{i}$ (sum of the observations)
Binomial $[n,p]$	$\sum x_{i}$ (sum of the observations)
Poisson $[\lambda ]$	$\sum x_{i}$ (sum of the observations)
Normal $[\mu ,\sigma ]$	$\sum x_{i}$ (sum of the observations), if $\sigma$ is known
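
As a worked illustration of the Poisson row, assume $n$ independent observations $x_{1},\dots ,x_{n}$ and two rates $\lambda _{1}<\lambda _{2}$ (the symbols $n$, $\lambda _{1}$ and $\lambda _{2}$ are introduced here only for this sketch). The likelihood ratio can then be written as

$$\frac{\prod _{i=1}^{n}e^{-\lambda _{2}}\lambda _{2}^{x_{i}}/x_{i}!}{\prod _{i=1}^{n}e^{-\lambda _{1}}\lambda _{1}^{x_{i}}/x_{i}!}=e^{-n(\lambda _{2}-\lambda _{1})}\left({\frac {\lambda _{2}}{\lambda _{1}}}\right)^{\sum _{i}x_{i}}$$

which depends on the data only through $\sum x_{i}$ and is increasing in it because $\lambda _{2}/\lambda _{1}>1$.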

Hypothesis testing

If the family of random variables has the MLRP in $T(X)$ , a uniformly most powerful test can easily be determined for the hypothesis $H_{0}:\theta \leq \theta _{0}$ versus $H_{1}:\theta >\theta _{0}$ .

Example: Effort and output

Let $e$ be an input into a stochastic technology – worker's effort, for instance – and $y$ its output, the likelihood of which is described by a probability density function $f(y;e)$. Then the monotone likelihood ratio property (MLRP) of the family $f$ is expressed as follows: for any $e_{1},e_{2}$, the fact that $e_{2}>e_{1}$ implies that the ratio $f(y;e_{2})/f(y;e_{1})$ is increasing in $y$.

Relation to other statistical properties

Monotone likelihood ratios are used in several areas of statistical theory, including point estimation and hypothesis testing, as well as in probability models.

Exponential families

One-parameter exponential families have monotone likelihood ratios. In particular, the one-dimensional exponential family of probability density functions or probability mass functions with

$$f_{\theta }(x)=c(\theta )h(x)\exp(\pi (\theta )T(x))$$

has a monotone non-decreasing likelihood ratio in the sufficient statistic $T(x)$, provided that $\pi (\theta )$ is non-decreasing.
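
To see why, take any $\theta _{1}<\theta _{2}$ and form the ratio at a point $x$ where $h(x)>0$ (a sketch of the standard argument, using only the symbols defined above):

$$\frac{f_{\theta _{2}}(x)}{f_{\theta _{1}}(x)}=\frac{c(\theta _{2})}{c(\theta _{1})}\exp {\bigl (}(\pi (\theta _{2})-\pi (\theta _{1}))\,T(x){\bigr )}$$

The factor $h(x)$ cancels, and since $\pi (\theta _{2})-\pi (\theta _{1})\geq 0$ when $\pi$ is non-decreasing, the right-hand side is a non-decreasing function of $T(x)$.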

Uniformly most powerful tests: The Karlin–Rubin theorem

Monotone likelihood ratios are used to construct uniformly most powerful tests, according to the Karlin–Rubin theorem.^[1] Consider a scalar measurement having a probability density function parameterized by a scalar parameter $\theta$, and define the likelihood ratio $\ell (x)=f_{\theta _{1}}(x)/f_{\theta _{0}}(x)$. If $\ell (x)$ is monotone non-decreasing in $x$ for any pair $\theta _{1}\geq \theta _{0}$ (meaning that the greater $x$ is, the more likely $H_{1}$ is), then the threshold test:

$$\varphi (x)={\begin{cases}1&{\text{if }}x>x_{0}\\0&{\text{if }}x<x_{0}\end{cases}}$$

where $x_{0}$ is chosen so that $\operatorname {E} _{\theta _{0}}\varphi (X)=\alpha$, is the UMP test of size $\alpha$ for testing $H_{0}:\theta \leq \theta _{0}{\text{ vs. }}H_{1}:\theta >\theta _{0}.$

Note that exactly the same test is also UMP for testing $H_{0}:\theta =\theta _{0}{\text{ vs. }}H_{1}:\theta >\theta _{0}.$
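
The threshold test can be sketched for one concrete model. The example below assumes a single observation $X\sim N(\theta ,\sigma ^{2})$ with known $\sigma$ (a family listed above as having MLR); the function names and the numerical values of $\theta _{0}$, $\sigma$ and $\alpha$ are hypothetical choices for illustration, not part of the theorem.

```python
from statistics import NormalDist

def karlin_rubin_threshold(theta0, sigma, alpha):
    """Cutoff x0 such that P_{theta0}(X > x0) = alpha when X ~ N(theta0, sigma^2)."""
    return NormalDist(mu=theta0, sigma=sigma).inv_cdf(1.0 - alpha)

def power(theta, x0, sigma):
    """Rejection probability P_theta(X > x0) of the threshold test."""
    return 1.0 - NormalDist(mu=theta, sigma=sigma).cdf(x0)

# Hypothetical settings: test H0: theta <= 0 against H1: theta > 0 at size 0.05.
theta0, sigma, alpha = 0.0, 1.0, 0.05

x0 = karlin_rubin_threshold(theta0, sigma, alpha)
size_at_null = power(theta0, x0, sigma)

# The rejection probability is evaluated at several parameter values.
thetas = (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0)
powers = [power(t, x0, sigma) for t in thetas]

print(f"x0 = {x0:.4f}, size at theta0 = {size_at_null:.4f}")
for t, p in zip(thetas, powers):
    print(f"theta = {t:4.1f}  P(reject) = {p:.4f}")
```

In this sketch the rejection probability is increasing in $\theta$, so it stays at or below $\alpha$ over the whole null region $\theta \leq \theta _{0}$ and reaches $\alpha$ exactly at $\theta _{0}$.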

Median unbiased estimation

Monotone likelihood ratios are used to construct median-unbiased estimators, using methods specified by Johann Pfanzagl and others.^[2]^[3] One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation, but for a larger class of loss functions.^[3]^: 713

Lifetime analysis: Survival analysis and reliability

If a family of distributions $f_{\theta }(x)$ has the monotone likelihood ratio property in $T(X)$, then:

the family has monotone decreasing hazard rates in $\theta$ (but not necessarily in $T(X)$);
the family exhibits first-order (and hence second-order) stochastic dominance in $x$, and the best Bayesian update of $\theta$ is increasing in $T(X)$.

But not conversely: neither monotone hazard rates nor stochastic dominance imply the MLRP.

Proofs

Let distribution family $f_{\theta }$ satisfy MLR in x, so that for $\theta _{1}>\theta _{0}$ and $x_{1}>x_{0}$ :

$${\frac {f_{\theta _{1}}(x_{1})}{f_{\theta _{0}}(x_{1})}}\geq {\frac {f_{\theta _{1}}(x_{0})}{f_{\theta _{0}}(x_{0})}},$$

or equivalently:

$$f_{\theta _{1}}(x_{1})f_{\theta _{0}}(x_{0})\geq f_{\theta _{1}}(x_{0})f_{\theta _{0}}(x_{1}).$$

Integrating this expression twice, we obtain:

1. To $x_{1}$ with respect to $x_{0}$:

$${\begin{aligned}&\int _{\min _{x\in X}}^{x_{1}}f_{\theta _{1}}(x_{1})f_{\theta _{0}}(x_{0})\,dx_{0}\\[6pt]\geq {}&\int _{\min _{x\in X}}^{x_{1}}f_{\theta _{1}}(x_{0})f_{\theta _{0}}(x_{1})\,dx_{0}\end{aligned}}$$

integrate and rearrange to obtain

$${\frac {f_{\theta _{1}}(x)}{f_{\theta _{0}}(x)}}\geq {\frac {F_{\theta _{1}}(x)}{F_{\theta _{0}}(x)}}$$

2. From $x_{0}$ with respect to $x_{1}$:

$${\begin{aligned}&\int _{x_{0}}^{\max _{x\in X}}f_{\theta _{1}}(x_{1})f_{\theta _{0}}(x_{0})\,dx_{1}\\[6pt]\geq {}&\int _{x_{0}}^{\max _{x\in X}}f_{\theta _{1}}(x_{0})f_{\theta _{0}}(x_{1})\,dx_{1}\end{aligned}}$$

integrate and rearrange to obtain

$${\frac {1-F_{\theta _{1}}(x)}{1-F_{\theta _{0}}(x)}}\geq {\frac {f_{\theta _{1}}(x)}{f_{\theta _{0}}(x)}}$$

First-order stochastic dominance

Combine the two inequalities above to get first-order dominance:

$$F_{\theta _{1}}(x)\leq F_{\theta _{0}}(x)\ \forall x$$
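
The combination step can be spelled out as follows, assuming $0<F_{\theta _{0}}(x)<1$ so that the cross-multiplication is valid. Chaining the two integrated inequalities through the common term $f_{\theta _{1}}(x)/f_{\theta _{0}}(x)$ gives

$${\frac {1-F_{\theta _{1}}(x)}{1-F_{\theta _{0}}(x)}}\geq {\frac {F_{\theta _{1}}(x)}{F_{\theta _{0}}(x)}}\;\Longrightarrow \;F_{\theta _{0}}(x)-F_{\theta _{0}}(x)F_{\theta _{1}}(x)\geq F_{\theta _{1}}(x)-F_{\theta _{0}}(x)F_{\theta _{1}}(x)\;\Longrightarrow \;F_{\theta _{0}}(x)\geq F_{\theta _{1}}(x)$$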

Monotone hazard rate

Use only the second inequality above to get a monotone hazard rate:

$${\frac {f_{\theta _{1}}(x)}{1-F_{\theta _{1}}(x)}}\leq {\frac {f_{\theta _{0}}(x)}{1-F_{\theta _{0}}(x)}}\ \forall x$$
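
Both consequences can be checked numerically for one assumed family. The sketch below uses unit-variance normal distributions with means $\theta _{0}=0$ and $\theta _{1}=1$ and a finite grid; these choices, and the helper `hazard`, are illustrative assumptions, and a grid check is not a proof.

```python
from statistics import NormalDist

# Hypothetical members of an MLR family: N(0, 1) and N(1, 1).
low = NormalDist(mu=0.0, sigma=1.0)   # plays the role of theta_0
high = NormalDist(mu=1.0, sigma=1.0)  # plays the role of theta_1 > theta_0

def hazard(dist, x):
    """Hazard rate f(x) / (1 - F(x)) of the given distribution at x."""
    return dist.pdf(x) / (1.0 - dist.cdf(x))

# Grid on [-4, 4]; kept away from the far right tail, where 1 - F(x)
# becomes too small for this naive formula to be numerically reliable.
grid = [i / 10.0 for i in range(-40, 41)]

# First-order stochastic dominance: F_theta1(x) <= F_theta0(x) for all x.
fosd_holds = all(high.cdf(x) <= low.cdf(x) for x in grid)

# Hazard-rate ordering: the hazard under theta_1 is no larger than under theta_0.
hazard_ordered = all(hazard(high, x) <= hazard(low, x) for x in grid)

print(fosd_holds, hazard_ordered)
```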

Uses

Economics

The MLR is an important condition on the type distribution of agents in mechanism design and economics of information, where Paul Milgrom defined "favorableness" of signals (in terms of stochastic dominance) as a consequence of MLR.^[4] Most solutions to mechanism design models assume type distributions that satisfy the MLR to take advantage of solution methods that may be easier to apply and interpret.

References

[1] Casella, G.; Berger, R. L. (2008). Statistical Inference. Brooks/Cole. ISBN 0-495-39187-5. (Theorem 8.3.17)
[2] Pfanzagl, Johann (1979). "On optimal median unbiased estimators in the presence of nuisance parameters". Annals of Statistics. 7 (1): 187–193. doi:10.1214/aos/1176344563.
[3] Brown, L. D.; Cohen, Arthur; Strawderman, W. E. (1976). "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications". Annals of Statistics. 4 (4): 712–722. doi:10.1214/aos/1176343543.
[4] Milgrom, P. R. (1981). "Good News and Bad News: Representation Theorems and Applications". The Bell Journal of Economics. 12 (2): 380–391. https://doi.org/10.2307/3003562

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
