Admissible decision rule

Last updated September 09, 2021

In statistical decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always "better" than it^[1] (or at least sometimes better and never worse), in the precise sense of "better" defined below. This concept is analogous to Pareto efficiency.

Definition

Define sets $\Theta$, ${\mathcal {X}}$ and ${\mathcal {A}}$, where $\Theta$ is the set of states of nature, ${\mathcal {X}}$ the set of possible observations, and ${\mathcal {A}}$ the set of actions that may be taken. An observation $x\in {\mathcal {X}}$ is distributed as $F(x\mid \theta )$ and therefore provides evidence about the state of nature $\theta \in \Theta$. A decision rule is a function $\delta :{\mathcal {X}}\rightarrow {\mathcal {A}}$, where upon observing $x\in {\mathcal {X}}$, we choose to take action $\delta (x)\in {\mathcal {A}}$.

Also define a loss function $L:\Theta \times {\mathcal {A}}\rightarrow \mathbb {R}$, which specifies the loss we would incur by taking action $a\in {\mathcal {A}}$ when the true state of nature is $\theta \in \Theta$. Usually we will take this action after observing data $x\in {\mathcal {X}}$, so that the loss will be $L(\theta ,\delta (x))$. (It is possible, though unconventional, to recast the following definitions in terms of a utility function, which is the negative of the loss.)

Define the risk function as the expectation

$$R(\theta ,\delta )=\operatorname {E} _{F(x\mid \theta )}[L(\theta ,\delta (x))].$$

Whether a decision rule $\delta$ has low risk depends on the true state of nature $\theta$. A decision rule $\delta ^{*}$ dominates a decision rule $\delta$ if and only if $R(\theta ,\delta ^{*})\leq R(\theta ,\delta )$ for all $\theta$, and the inequality is strict for some $\theta$.

A decision rule is admissible (with respect to the loss function) if and only if no other rule dominates it; otherwise it is inadmissible. Thus an admissible decision rule is a maximal element with respect to the partial order defined by dominance. An inadmissible rule is not preferred (except for reasons of simplicity or computational efficiency), since by definition there is some other rule that will achieve equal or lower risk for all $\theta$. But just because a rule $\delta$ is admissible does not mean it is a good rule to use. Being admissible means only that no other single rule is at least as good for every $\theta$ and strictly better for some $\theta$; other admissible rules might still achieve lower risk for most $\theta$ that occur in practice. (The Bayes risk discussed below is a way of explicitly considering which $\theta$ occur in practice.)
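The definitions of risk, dominance and admissibility can be checked by brute force when every set is finite. The following is a minimal sketch under stated assumptions: the two states, two observations, two actions, the probability table `P` and the 0-1 loss are all hypothetical values chosen for illustration, and admissibility is only tested against the other deterministic rules in the list.

```python
from itertools import product

# Hypothetical toy problem: two states, two observations, two actions.
THETAS = ["t0", "t1"]
XS = [0, 1]
ACTIONS = ["a0", "a1"]

# P[theta][x]: assumed probability of observing x when the state is theta.
P = {"t0": {0: 0.8, 1: 0.2}, "t1": {0: 0.3, 1: 0.7}}

def loss(theta, action):
    """0-1 loss: action a_i is correct exactly when the state is t_i."""
    return 0.0 if theta[1:] == action[1:] else 1.0

def risk(theta, rule):
    """R(theta, delta) = sum over x of P(x | theta) * L(theta, delta(x))."""
    return sum(P[theta][x] * loss(theta, rule[x]) for x in XS)

def dominates(rule_a, rule_b, tol=1e-12):
    """rule_a is never worse than rule_b and strictly better for some theta."""
    pairs = [(risk(t, rule_a), risk(t, rule_b)) for t in THETAS]
    return (all(a <= b + tol for a, b in pairs)
            and any(a < b - tol for a, b in pairs))

# Every deterministic rule assigns one action to each observation.
RULES = [dict(zip(XS, acts)) for acts in product(ACTIONS, repeat=len(XS))]

def admissible(rules):
    """Rules that no other rule in the given list dominates."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]

for rule in RULES:
    risks = [round(risk(t, rule), 3) for t in THETAS]
    status = "admissible" if rule in admissible(RULES) else "inadmissible"
    print(rule, risks, status)
```

In this toy problem the rule that always does the opposite of what the observation suggests is dominated, while the two constant rules remain admissible even though each has the worst possible risk in one state. This illustrates that admissibility alone does not make a rule a good one.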

Bayes rules and generalized Bayes rules

Bayes rules

Let $\pi (\theta )$ be a probability distribution on the states of nature. From a Bayesian point of view, we would regard it as a prior distribution. That is, it is our believed probability distribution on the states of nature, prior to observing data. For a frequentist, it is merely a function on $\Theta$ with no such special interpretation. The Bayes risk of the decision rule $\delta$ with respect to $\pi (\theta )$ is the expectation

$$r(\pi ,\delta )=\operatorname {E} _{\pi (\theta )}[R(\theta ,\delta )].$$

A decision rule $\delta$ that minimizes $r(\pi ,\delta )$ is called a Bayes rule with respect to $\pi (\theta )$. There may be more than one such Bayes rule. If the Bayes risk is infinite for all $\delta$, then no Bayes rule is defined.
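For a finite problem, a Bayes rule can be found by enumerating all deterministic rules and keeping those with the smallest Bayes risk. The sketch below assumes a hypothetical two-state problem; the probability table `P`, the 0-1 loss and both priors are illustrative values, not part of the definitions above.

```python
from itertools import product

# Hypothetical toy problem (all numbers are assumptions for illustration).
THETAS = ["t0", "t1"]
XS = [0, 1]
ACTIONS = ["a0", "a1"]
P = {"t0": {0: 0.8, 1: 0.2}, "t1": {0: 0.3, 1: 0.7}}  # P[theta][x]

def loss(theta, action):
    """0-1 loss: action a_i is correct exactly when the state is t_i."""
    return 0.0 if theta[1:] == action[1:] else 1.0

def risk(theta, rule):
    """R(theta, delta): expected loss over observations x given theta."""
    return sum(P[theta][x] * loss(theta, rule[x]) for x in XS)

def bayes_risk(prior, rule):
    """r(pi, delta) = sum over theta of pi(theta) * R(theta, delta)."""
    return sum(prior[t] * risk(t, rule) for t in THETAS)

def bayes_rules(prior, tol=1e-12):
    """All deterministic rules attaining the minimum Bayes risk."""
    rules = [dict(zip(XS, acts)) for acts in product(ACTIONS, repeat=len(XS))]
    best = min(bayes_risk(prior, r) for r in rules)
    return [r for r in rules if bayes_risk(prior, r) <= best + tol]

for prior in ({"t0": 0.5, "t1": 0.5}, {"t0": 0.99, "t1": 0.01}):
    print(prior, bayes_rules(prior))
```

With the even prior the minimizer follows the observation, whereas the heavily skewed prior makes the constant rule optimal. The function returns a list because ties are possible, in which case more than one Bayes rule exists.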

Generalized Bayes rules

In the Bayesian approach to decision theory, the observed $x$ is considered fixed. Whereas the frequentist approach (i.e., risk) averages over possible samples $x\in {\mathcal {X}}$, the Bayesian would fix the observed sample $x$ and average over hypotheses $\theta \in \Theta$. Thus, the Bayesian approach is to consider for our observed $x$ the expected loss

$$\rho (\pi ,\delta \mid x)=\operatorname {E} _{\pi (\theta \mid x)}[L(\theta ,\delta (x))],$$

where the expectation is over the posterior of $\theta$ given $x$ (obtained from $\pi (\theta )$ and $F(x\mid \theta )$ using Bayes' theorem).

Having made explicit the expected loss for each given $x$ separately, we can define a decision rule $\delta$ by specifying for each $x$ an action $\delta (x)$ that minimizes the expected loss. This is known as a generalized Bayes rule with respect to $\pi (\theta )$. There may be more than one generalized Bayes rule, since there may be multiple choices of $\delta (x)$ that achieve the same expected loss.
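The construction of a generalized Bayes rule, one observation at a time, can be sketched numerically. Everything specific in the example is an assumption made for illustration: a binomial observation with `N = 5` trials, a uniform prior on a coarse grid of success probabilities, squared-error loss, and a grid of candidate actions.

```python
from math import comb

# Hypothetical setup: theta is a success probability on a coarse grid,
# x is the number of successes in N trials, and loss is squared error.
N = 5
THETAS = [0.1, 0.3, 0.5, 0.7, 0.9]
PRIOR = {t: 1.0 / len(THETAS) for t in THETAS}  # uniform prior (assumed)
ACTIONS = [i / 100 for i in range(101)]         # candidate estimates

def likelihood(x, theta):
    """Binomial probability of x successes in N trials."""
    return comb(N, x) * theta**x * (1 - theta) ** (N - x)

def posterior(x):
    """pi(theta | x), obtained from the prior and likelihood by Bayes' theorem."""
    weights = {t: PRIOR[t] * likelihood(x, t) for t in THETAS}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def expected_loss(action, post):
    """Posterior expected squared-error loss of taking the given action."""
    return sum(p * (t - action) ** 2 for t, p in post.items())

def generalized_bayes_rule():
    """For each x separately, pick an action minimizing the expected loss."""
    rule = {}
    for x in range(N + 1):
        post = posterior(x)
        rule[x] = min(ACTIONS, key=lambda a: expected_loss(a, post))
    return rule

rule = generalized_bayes_rule()
for x, action in rule.items():
    post_mean = sum(t * p for t, p in posterior(x).items())
    print(x, action, round(post_mean, 3))
```

Under squared-error loss the minimizing action is the posterior mean, so each chosen action should be the grid point closest to that mean. Other loss functions would generally lead to other actions.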

At first, this may appear rather different from the Bayes rule approach of the previous section, not a generalization. However, notice that the Bayes risk already averages over $\Theta$ in Bayesian fashion, and the Bayes risk may be recovered as the expectation over ${\mathcal {X}}$ of the expected loss (where $\theta \sim \pi$ and $x\sim F(\cdot \mid \theta )$). Roughly speaking, $\delta$ minimizes this expectation of expected loss (i.e., is a Bayes rule) if and only if it minimizes the expected loss for each $x\in {\mathcal {X}}$ separately (i.e., is a generalized Bayes rule).
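The recovery of the Bayes risk from the expected loss can be spelled out as follows. This is a sketch under assumptions: $F(x\mid \theta )$ has a density $f(x\mid \theta )$, the prior is proper with density $\pi (\theta )$, and the loss is non-negative so that the order of integration may be exchanged. The symbols $f$ and $m(x)=\int _{\Theta }f(x\mid \theta )\,\pi (\theta )\,d\theta$ (the marginal density of $x$) are notation introduced here for the illustration.

```latex
\begin{aligned}
r(\pi ,\delta )
  &= \int _{\Theta } R(\theta ,\delta )\,\pi (\theta )\,d\theta \\
  &= \int _{\Theta }\int _{\mathcal {X}} L(\theta ,\delta (x))\,f(x\mid \theta )\,dx\;\pi (\theta )\,d\theta \\
  &= \int _{\mathcal {X}}\left[\int _{\Theta } L(\theta ,\delta (x))\,\pi (\theta \mid x)\,d\theta \right] m(x)\,dx \\
  &= \int _{\mathcal {X}} \rho (\pi ,\delta \mid x)\,m(x)\,dx .
\end{aligned}
```

The third line uses Bayes' theorem in the form $f(x\mid \theta )\,\pi (\theta )=\pi (\theta \mid x)\,m(x)$. Because $m(x)\geq 0$, choosing $\delta (x)$ to minimize the bracketed term for each $x$ also minimizes the whole integral.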

Then why is the notion of generalized Bayes rule an improvement? It is indeed equivalent to the notion of Bayes rule when a Bayes rule exists and all $x$ have positive probability. However, no Bayes rule exists if the Bayes risk is infinite (for all $\delta$). In this case it is still useful to define a generalized Bayes rule $\delta$, which at least chooses a minimum-expected-loss action $\delta (x)$ for those $x$ for which a finite-expected-loss action does exist. In addition, a generalized Bayes rule may be desirable because it must choose a minimum-expected-loss action $\delta (x)$ for every $x$, whereas a Bayes rule would be allowed to deviate from this policy on a set $X\subseteq {\mathcal {X}}$ of measure 0 without affecting the Bayes risk.

More important, it is sometimes convenient to use an improper prior $\pi (\theta )$. In this case, the Bayes risk is not even well-defined, nor is there any well-defined distribution over $x$. However, the posterior $\pi (\theta \mid x)$, and hence the expected loss, may be well-defined for each $x$, so that it is still possible to define a generalized Bayes rule.
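A standard illustration of this situation is the following worked example. The model and the loss are assumptions chosen here: a single observation $x\sim N(\theta ,1)$, the flat improper prior $\pi (\theta )\propto 1$ on the real line, and squared-error loss $L(\theta ,a)=(\theta -a)^{2}$. The symbol $a$ stands for the action $\delta (x)$ taken at the observed $x$.

```latex
\begin{aligned}
\pi (\theta \mid x)
  &\propto \exp \left(-\tfrac {1}{2}(x-\theta )^{2}\right)\cdot 1
  \quad \Longrightarrow \quad \theta \mid x\sim N(x,1), \\
\rho (\pi ,a\mid x)
  &= \operatorname {E} _{\pi (\theta \mid x)}\left[(\theta -a)^{2}\right]
   = 1+(x-a)^{2}, \\
\delta (x)
  &= \arg \min _{a}\,\rho (\pi ,a\mid x) = x .
\end{aligned}
```

The prior does not integrate to a finite value, so the Bayes risk is undefined. The posterior is nevertheless a proper normal distribution for every $x$, and the generalized Bayes rule is simply $\delta (x)=x$.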

Admissibility of (generalized) Bayes rules

According to the complete class theorems, under mild conditions every admissible rule is a (generalized) Bayes rule (with respect to some prior $\pi (\theta )$, possibly an improper one, that favors distributions $\theta$ where that rule achieves low risk). Thus, in frequentist decision theory it is sufficient to consider only (generalized) Bayes rules.

Conversely, while Bayes rules with respect to proper priors are virtually always admissible, generalized Bayes rules corresponding to improper priors need not yield admissible procedures. Stein's example is one such famous situation.

Examples

The James–Stein estimator is a nonlinear estimator of the mean of Gaussian random vectors (of dimension three or more) which can be shown to dominate, or outperform, the ordinary least squares technique with respect to a mean-square error loss function.^[2] Thus least squares estimation is not an admissible estimation procedure in this context. Other standard estimates associated with the normal distribution are also inadmissible: for example, the sample estimate of the variance when the population mean and variance are unknown.^[3]
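The dominance of the James–Stein estimator can be illustrated by simulation. The sketch below makes several assumptions: a single observation $X\sim N(\theta ,I)$ with known unit variance, dimension 5, a hypothetical true mean `theta`, and the basic form of the estimator that shrinks towards the origin. A Monte Carlo comparison at one value of $\theta$ only illustrates the result and does not prove it.

```python
import random

def james_stein(x):
    """Basic James-Stein estimate from one observation x ~ N(theta, I).

    Shrinks x towards the origin; intended for dimension 3 or more.
    """
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]

def squared_error(estimate, theta):
    """Sum of squared errors between an estimate and the true mean."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))

def simulate_risks(theta, trials=20000, seed=0):
    """Monte Carlo risk of the unshrunk estimate X and of James-Stein."""
    rng = random.Random(seed)
    total_ls = 0.0
    total_js = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        total_ls += squared_error(x, theta)  # least squares estimate is x
        total_js += squared_error(james_stein(x), theta)
    return total_ls / trials, total_js / trials

theta = [1.0, -0.5, 0.0, 2.0, 0.5]  # hypothetical true mean, dimension 5
risk_ls, risk_js = simulate_risks(theta)
print("least squares risk:", round(risk_ls, 3))
print("James-Stein risk:  ", round(risk_js, 3))
```

The exact risk of the unshrunk estimate equals the dimension (here 5) for every $\theta$, and the simulated James–Stein risk should come out lower. The gap is largest when $\theta$ lies near the origin and shrinks as $\theta$ moves away from it.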

Notes

[1] Dodge, Y. (2003). The Oxford Dictionary of Statistical Terms. OUP. ISBN 0-19-920613-9 (entry for admissible decision function).
[2] Cox & Hinkley 1974, Section 11.8.
[3] Cox & Hinkley 1974, Exercise 11.7.

References

Cox, D. R.; Hinkley, D. V. (1974). Theoretical Statistics. Wiley. ISBN 0-412-12420-3.
Berger, James O. (1980). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer-Verlag. ISBN 0-387-96098-8.
DeGroot, Morris (2004) [1st. pub. 1970]. Optimal Statistical Decisions. Wiley Classics Library. ISBN 0-471-68029-X.
Robert, Christian P. (1994). The Bayesian Choice. Springer-Verlag. ISBN 3-540-94296-3.

