Randomised decision rule

In statistical decision theory, a randomised decision rule or mixed decision rule is a decision rule that associates probabilities with deterministic decision rules. In finite decision problems, randomised decision rules define a risk set which is the convex hull of the risk points of the nonrandomised decision rules.

As nonrandomised alternatives always exist to randomised Bayes rules, randomisation is not needed in Bayesian statistics, although frequentist statistical theory sometimes requires the use of randomised rules to satisfy optimality conditions such as minimax, most notably when deriving confidence intervals and hypothesis tests about discrete probability distributions.

A statistical test making use of a randomized decision rule is called a randomized test.

Definition and interpretation

Let $\{\delta_1, \delta_2, \ldots, \delta_k\}$ be a set of non-randomised decision rules with associated probabilities $p_1, p_2, \ldots, p_k$ (summing to 1). Then the randomised decision rule $\delta^*$ is defined as $\delta^* = \sum_{i=1}^{k} p_i \delta_i$ and its associated risk function is $R(\theta, \delta^*) = \sum_{i=1}^{k} p_i R(\theta, \delta_i)$. [1] This rule can be treated as a random experiment in which the decision rules $\delta_1, \delta_2, \ldots, \delta_k$ are selected with probabilities $p_1, p_2, \ldots, p_k$ respectively. [2]
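For illustration, here is a minimal numerical sketch (with hypothetical risk values) showing that the risk of a mixture of decision rules is the corresponding convex combination of the component risks:

```python
# Minimal sketch with hypothetical risk values: the risk of the randomised rule
# delta* = p_1*delta_1 + ... + p_k*delta_k is the same mixture of the component risks.
import numpy as np

# Rows: risk functions R(theta, delta_i) of three nonrandomised rules,
# tabulated over two parameter values theta_1 and theta_2.
risks = np.array([[1.0, 5.0],
                  [2.0, 2.0],
                  [4.0, 1.0]])
p = np.array([0.5, 0.3, 0.2])   # mixing probabilities, summing to 1

risk_of_mixture = p @ risks     # R(theta, delta*) = sum_i p_i * R(theta, delta_i)
print(risk_of_mixture)          # one entry per parameter value, here [1.9, 3.3]
```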

Alternatively, a randomised decision rule $\delta^*$ may assign probabilities directly to elements of the action space $\mathcal{A}$ for each member of the sample space. More formally, $\delta^*(x, A)$ denotes the probability that an action in $A \subseteq \mathcal{A}$ is chosen when $x$ is observed. Under this approach, its loss function is also defined directly, as the expected loss under this distribution: $L(\theta, \delta^*(x)) = \int_{\mathcal{A}} L(\theta, a)\, \delta^*(x, \mathrm{d}a)$. [3]

The introduction of randomised decision rules thus creates a larger decision space from which the statistician may choose a decision. As non-randomised decision rules are a special case of randomised decision rules in which one decision or action has probability 1, the original decision space $\mathcal{D}$ is a proper subset of the new decision space $\mathcal{D}^*$. [4]

Selection of randomised decision rules

Figure: The extreme points of the risk set, denoted by empty circles, correspond to nonrandomised decision rules, whereas the thick lines denote the admissible decision rules.

As with nonrandomised decision rules, randomised decision rules may satisfy favourable properties such as admissibility, minimaxity and Bayes optimality. This is illustrated in the case of a finite decision problem, i.e. a problem where the parameter space is a finite set of, say, $k$ elements. The risk set, henceforth denoted as $S$, is the set of all vectors in $\mathbb{R}^k$ in which each entry is the value of the risk function associated with a randomised decision rule under a certain parameter: it contains all vectors of the form $(R(\theta_1, \delta^*), \ldots, R(\theta_k, \delta^*))$. Note that, by the definition of the randomised decision rule, the risk set is the convex hull of the risk points $(R(\theta_1, \delta), \ldots, R(\theta_k, \delta))$ of the nonrandomised decision rules $\delta$. [5]

In the case where the parameter space has only two elements $\theta_1$ and $\theta_2$, the risk set is a subset of $\mathbb{R}^2$, so it may be drawn with respect to the coordinate axes $R_1$ and $R_2$ corresponding to the risks under $\theta_1$ and $\theta_2$ respectively. [6] An example is shown in the figure above.
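As a concrete numerical sketch (hypothetical risk values), the risk set of such a two-parameter problem can be obtained as the convex hull of the risk points of the nonrandomised rules:

```python
# Minimal sketch with hypothetical risk values: the risk set is the convex hull
# of the risk points (R(theta_1, delta), R(theta_2, delta)) of the nonrandomised rules.
import numpy as np
from scipy.spatial import ConvexHull

risk_points = np.array([[1.0, 5.0],
                        [2.0, 2.0],
                        [4.0, 1.0],
                        [5.0, 4.0],
                        [3.0, 3.0]])   # equal mixture of the four rules above (not extreme)

hull = ConvexHull(risk_points)
print(risk_points[hull.vertices])      # extreme points of the risk set; the mixture is excluded
```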

Admissibility

An admissible decision rule is one that is not dominated by any other decision rule, i.e. there is no decision rule that has risk equal to or lower than its risk for all parameters and strictly lower risk for some parameter. In a finite decision problem with two parameters, an admissible decision rule has a risk point $(R_1, R_2)$ such that no other point $(R_1', R_2')$ of the risk set satisfies $R_1' \le R_1$ and $R_2' \le R_2$ with at least one inequality strict. Thus the left side of the lower boundary of the risk set is the set of risk points of the admissible decision rules. [6] [7]
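A small sketch (hypothetical risk values) of checking which of a list of nonrandomised rules are admissible, i.e. not dominated by another listed rule:

```python
# Minimal sketch with hypothetical risk values: a risk point is dominated if some
# other point is at least as good in every coordinate and strictly better in one.
import numpy as np

risk_points = np.array([[1.0, 5.0],
                        [2.0, 2.0],
                        [4.0, 1.0],
                        [5.0, 4.0]])

def dominated(i, points):
    p = points[i]
    return any(np.all(q <= p) and np.any(q < p)
               for j, q in enumerate(points) if j != i)

admissible = [tuple(p) for i, p in enumerate(risk_points) if not dominated(i, risk_points)]
print(admissible)   # (5.0, 4.0) is dominated by (2.0, 2.0), so it is excluded
```

Note that this only checks domination among the listed nonrandomised rules; over the full risk set, admissibility corresponds to the lower boundary described above.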

Minimax

A minimax decision rule is one that minimises the supremum risk $\sup_{\theta} R(\theta, \delta^*)$ among all decision rules in $\mathcal{D}^*$. Sometimes, a randomised decision rule may perform better than all nonrandomised decision rules in this regard. [1]

In a finite decision problem with two possible parameters, the minimax rule can be found by considering the family of squares $Q_c = \{(R_1, R_2) : R_1 \le c,\ R_2 \le c\}$. [8] The value of $c$ for the smallest such square that touches the risk set $S$ is the minimax risk, and the corresponding point or points on the risk set give the minimax rule.

If the risk set intersects the line $R_1 = R_2$, then the admissible decision rule lying on that line is minimax. If $R_1 < R_2$ or $R_1 > R_2$ holds for every point of the risk set, then the minimax rule can either be an extreme point (i.e. a nonrandomised decision rule) or lie on a line segment connecting two extreme points (nonrandomised decision rules), i.e. be a randomised decision rule. [9] [6]
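The geometric argument above can be mimicked numerically; the following sketch (hypothetical risk values) finds the minimax mixture of two nonrandomised rules whose risk points straddle the line $R_1 = R_2$:

```python
# Minimal sketch with hypothetical risk values: among mixtures of two rules,
# the minimax rule minimises the larger of the two coordinate risks.
import numpy as np
from scipy.optimize import minimize_scalar

d1 = np.array([1.0, 5.0])   # risk point of rule delta_1
d2 = np.array([4.0, 1.0])   # risk point of rule delta_2

def max_risk(t):
    mixture = t * d1 + (1 - t) * d2   # risk point of t*delta_1 + (1-t)*delta_2
    return mixture.max()

res = minimize_scalar(max_risk, bounds=(0.0, 1.0), method="bounded")
print(res.x, res.fun)   # mixing probability ~3/7 and minimax risk ~19/7
```

Here the optimum is attained at an interior mixing probability, i.e. by a genuinely randomised rule, which is the situation where randomisation improves on both of the underlying nonrandomised rules.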

Bayes

A randomised Bayes rule is one that has infimum Bayes risk among all decision rules. In the special case where the parameter space has two elements, the line $\pi_1 R_1 + \pi_2 R_2 = c$, where $\pi_1$ and $\pi_2$ denote the prior probabilities of $\theta_1$ and $\theta_2$ respectively, is a family of points with Bayes risk $c$. The minimum Bayes risk for the decision problem is therefore the smallest $c$ such that the line touches the risk set. [10] [11] Such a line may either touch only one extreme point of the risk set, i.e. correspond to a nonrandomised decision rule, or overlap with an entire side of the risk set, i.e. correspond to two nonrandomised decision rules and the randomised decision rules combining the two.
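A short sketch (hypothetical risk values and prior) of locating a Bayes rule: the prior-weighted risk $\pi_1 R_1 + \pi_2 R_2$ is linear in the risk point, so it suffices to evaluate it at the extreme points, i.e. at the nonrandomised rules:

```python
# Minimal sketch with hypothetical values: a Bayes rule for the prior (pi1, pi2)
# is a risk point minimising the prior-weighted risk pi1*R1 + pi2*R2.
import numpy as np

risk_points = np.array([[1.0, 5.0],
                        [2.0, 2.0],
                        [4.0, 1.0]])
prior = np.array([0.3, 0.7])

bayes_risks = risk_points @ prior
best = int(np.argmin(bayes_risks))
print(best, bayes_risks[best])   # index of a Bayes rule and its Bayes risk
```

Because a linear function attains its minimum over a convex set at an extreme point, checking only the nonrandomised rules is enough, which anticipates the point made below that randomisation is never needed for Bayes rules.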

As different priors result in different slopes, the set of all rules that are Bayes with respect to some prior is the same as the set of admissible rules. [12]

Note that no situation is possible where a nonrandomised Bayes rule does not exist but a randomised Bayes rule does: the existence of a randomised Bayes rule implies the existence of a nonrandomised Bayes rule. This is also true in the general case, even with an infinite parameter space, infinite Bayes risk, and regardless of whether the infimum Bayes risk can be attained. [3] [12] This supports the intuitive notion that the statistician need not utilise randomisation to arrive at statistical decisions. [4]

In practice

As randomised Bayes rules always have nonrandomised alternatives, they are unnecessary in Bayesian statistics. However, in frequentist statistics, randomised rules are theoretically necessary under certain situations, [13] and were thought to be useful in practice when they were first invented: Egon Pearson forecast that they 'will not meet with strong objection'. [14] However, few statisticians actually implement them nowadays. [14] [15]

Randomised test

Randomized tests should not be confused with permutation tests. [16]

In the usual formulation of the likelihood ratio test, the null hypothesis is rejected whenever the likelihood ratio $\Lambda$ is smaller than some constant $K$, and accepted otherwise. However, this is sometimes problematic when $\Lambda$ is discrete under the null hypothesis, so that $\Lambda = K$ occurs with positive probability and the test cannot attain exactly the desired significance level.

A solution is to define a test function $\phi(x)$, whose value is the probability with which the null hypothesis is rejected: [17] [18]

$$\phi(x) = \begin{cases} 1 & \text{if } \Lambda(x) < K, \\ \gamma & \text{if } \Lambda(x) = K, \\ 0 & \text{if } \Lambda(x) > K, \end{cases}$$

where $\gamma$ is chosen so that the test has exactly the desired significance level.

This can be interpreted as flipping a biased coin with a probability $\gamma$ of returning heads whenever $\Lambda(x) = K$, and rejecting the null hypothesis if a head turns up. [15]

A generalised form of the Neyman–Pearson lemma states that this test has maximum power among all tests at the same significance level $\alpha$, that such a test must exist for any significance level $\alpha$, and that the test is unique under normal situations. [19]

As an example, consider the case where the underlying distribution is Bernoulli with probability $p$, and we would like to test the null hypothesis $H_0 : p \le p_0$ against the alternative hypothesis $H_1 : p > p_0$. It is natural to choose some $c$ such that $\Pr_{p_0}(T > c) \le \alpha$, and reject the null whenever $T > c$, where $T$ is the test statistic (the number of successes). However, to take into account cases where $T = c$, we define the test function:

$$\phi(x) = \begin{cases} 1 & \text{if } T > c, \\ \gamma & \text{if } T = c, \\ 0 & \text{if } T < c, \end{cases}$$

where $\gamma$ is chosen such that $\Pr_{p_0}(T > c) + \gamma \Pr_{p_0}(T = c) = \alpha$.
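The construction above can be carried out explicitly; the following sketch (hypothetical $n$, $p_0$ and $\alpha$) computes the critical value $c$ and the randomisation probability $\gamma$ for a sample of $n$ Bernoulli observations:

```python
# Minimal sketch with hypothetical n, p0 and alpha: randomised test of
# H0: p <= p0 against H1: p > p0 based on T = number of successes.
from scipy.stats import binom

def randomised_binomial_test(n, p0, alpha):
    # Smallest c with P(T > c | p0) <= alpha.
    c = 0
    while binom.sf(c, n, p0) > alpha:
        c += 1
    # gamma chosen so that P(T > c | p0) + gamma * P(T = c | p0) = alpha exactly.
    gamma = (alpha - binom.sf(c, n, p0)) / binom.pmf(c, n, p0)
    return c, gamma

c, gamma = randomised_binomial_test(n=20, p0=0.5, alpha=0.05)
print(c, gamma)   # reject if T > c; if T == c, reject with probability gamma
```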

Randomised confidence intervals

An analogous problem arises in the construction of confidence intervals. For instance, the Clopper–Pearson interval is always conservative because of the discrete nature of the binomial distribution. An alternative is to find the lower and upper confidence limits $p_L$ and $p_U$ by solving the following equations: [14]

$$\Pr(X > x \mid p_L) + U \cdot \Pr(X = x \mid p_L) = \alpha/2,$$
$$\Pr(X < x \mid p_U) + (1 - U) \cdot \Pr(X = x \mid p_U) = \alpha/2,$$

where $U$ is a uniform random variable on (0, 1) and $x$ is the observed number of successes.
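Assuming the form of the equations given above, the limits can be found numerically by root finding; the following sketch (hypothetical $x$, $n$ and $\alpha$, valid for $0 < x < n$) uses the binomial tail probabilities directly:

```python
# Minimal sketch, assuming the equations above: randomised confidence limits
# for a binomial proportion, found by root finding. Requires 0 < x < n.
from scipy.stats import binom, uniform
from scipy.optimize import brentq

def randomised_ci(x, n, alpha, u):
    # Lower limit: P(X > x | p) + u * P(X = x | p) = alpha / 2
    lower = brentq(lambda p: binom.sf(x, n, p) + u * binom.pmf(x, n, p) - alpha / 2,
                   1e-10, 1 - 1e-10)
    # Upper limit: P(X < x | p) + (1 - u) * P(X = x | p) = alpha / 2
    upper = brentq(lambda p: binom.cdf(x - 1, n, p) + (1 - u) * binom.pmf(x, n, p) - alpha / 2,
                   1e-10, 1 - 1e-10)
    return lower, upper

u = uniform.rvs()   # auxiliary U ~ Uniform(0, 1)
print(randomised_ci(x=7, n=20, alpha=0.05, u=u))
```

Unlike the Clopper–Pearson interval, the resulting interval depends on the auxiliary draw $U$ as well as on the data, which is what allows it to be exact rather than conservative.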

Footnotes

  1. Young and Smith, p. 11
  2. Bickel and Doksum, p. 28
  3. Parmigiani, p. 132
  4. DeGroot, pp. 128–129
  5. Bickel and Doksum, p. 29
  6. Young and Smith, p. 12
  7. Bickel and Doksum, p. 32
  8. Bickel and Doksum, p. 30
  9. Young and Smith, pp. 14–16
  10. Young and Smith, p. 13
  11. Bickel and Doksum, pp. 29–30
  12. Bickel and Doksum, p. 31
  13. Robert, p. 66
  14. Agresti and Gottard, p. 367
  15. Bickel and Doksum, p. 224
  16. Onghena, Patrick (2017-10-30), "Randomization Tests or Permutation Tests? A Historical and Terminological Clarification", in Berger, Vance W. (ed.), Randomization, Masking, and Allocation Concealment (1st ed.), Boca Raton, FL: Chapman and Hall/CRC, pp. 209–228, doi:10.1201/9781315305110-14, ISBN 978-1-315-30511-0, retrieved 2021-10-08.
  17. Young and Smith, p. 68
  18. Robert, p. 243
  19. Young and Smith, p. 68

Bibliography
