Optimistic knowledge gradient

Last updated May 09, 2024

In statistics The optimistic knowledge gradient^[1] is a approximation policy proposed by Xi Chen, Qihang Lin and Dengyong Zhou in 2013. This policy is created to solve the challenge of computationally intractable of large size of optimal computing budget allocation problem in binary/multi-class crowd labeling where each label from the crowd has a certain cost.^[2]

Motivation

The optimal computing budget allocation problem is formulated as a Bayesian Markov decision process ^[3](MDP) and is solved by using the dynamic programming (DP) algorithm where the Optimistic knowledge gradient policy is used to solve the computationally intractable of the dynamic programming ^[4] (DP) algorithm.

Consider a budget allocation issue in crowdsourcing. The particular crowdsourcing problem we considering is crowd labeling. Crowd labeling is a large amount of labeling tasks which are hard to solve by machine, turn out to easy to solve by human beings, then we just outsourced to an unidentified group of random people in a distributed environment.

Methodology

We want to finish this labeling tasks rely on the power of the crowd hopefully. For example, suppose we want to identify a picture according to the people in a picture is adult or not, this is a Bernoulli labeling problem, and all of us can do in one or two seconds, this is an easy task for human being. However, if we have tens of thousands picture like this, then this is no longer the easy task any more. That's why we need to rely on crowdsourcing framework to make this fast. Crowdsourcing framework of this consists of two steps. Step one, we just dynamically acquire from the crowd for items. On the other sides, this is dynamic procedure. We don't just send out this picture to everyone and we focus every response, instead, we do this in quantity. We are going to decide which picture we send it in the next, and which worker we are going to hire in the crowd in the next. According to his or her historical labeling results. And each picture can be sent to multiple workers and every worker can also work on different pictures. Then after we collect enough number of labels for different picture, we go to the second steps where we want to infer true label of each picture based on the collected labels. So there are multiple ways we can do inference. For instance, the simplest we can do this is just majority vote. The problem is that no free lunch, we have to pays for worker for each label he or she provides and we only have a limited project budget. So the question is how to spend the limited budget in a smart way.

Challenges

Before showing the mathematic model, the paper mentions what kinds of challenges we are facing.

Challenge 1

First of all, the items have a different level of difficulty to compute the label, in a previous example, some picture are easy to classify. In this case, you will usually see very consistent labels from the crowd. However, if some pictures are ambiguous, people may disagree with each other resulting in highly inconsistent labelling. So we may allocate more resources on this ambiguous task.

Challenge 2

And another difficulty we often have is that the worker are not perfect, sometimes this worker are not responsible, they just provide the random label, therefore, of course, we would not spend our budget on this no reliable workers. Now the problem is both the difficulty of the pictures and the reliability of the worker we completely unknown at the beginning. We can only estimate them during the procedure. Therefore, we are naturally facing to exploration and exploitation, and our goal is to give a reasonable good policy to spend money to the right way–maximize the overall accuracy of final inferred labels.

Mathematical model

For the mathematical model, we have the K items, $i=\{1,2,\ldots ,k\}$ , and total budget is T and we assume each label cost 1 so we are going to have T labels eventually. We assume each items has true label $Z_{i}$ which positive or negative, this binomial cases and we can extended to multiple class, labeling cases, this a singular idea. And the positive set $H^{*}$ is defined as the set of items whose true label is positive. And $\theta _{i}$ also defined a soft-label, $\theta _{i}$ for each item which number between 0 and 1, and we define $\theta _{i}$ as underlying probability of being labeled as positive by a member randomly picked from a group of perfect workers.

In this first case, we assume for every worker is perfect, it means they all reliable, but being perfect doesn’t means this worker gives the same answer or right answer. It just means they will try their best to figure out the best answer in their mind, and suppose everyone is perfect worker, just randomly picked one of them, and with $\theta _{i}$ probability, we going to get a guy who believe this one is positive. That is how we explain $\theta _{i}$ . So we are assume a label $Y_{i}$ is drawn from Bernoulli( $\theta _{i}$ ), and $\theta _{i}$ must be consistent with the true label, which means $\theta _{i}$ is greater or equal to 0.5 if and only if this item is positive with a true positive label. So our goal is to learn H*, the set of positive items. In other word, we want to make an inferred positive set H based on collected labels to maximize:

\sum _{i=1}^{k}({\textbf {1}}_{(i\in H)}{\textbf {1}}_{(i\in H^{\star })}+{\textbf {1}}_{(i\notin H)}{\textbf {1}}_{(i\notin H^{\star })})

It can also be written as:

|H\cap H^{\star }|+|H^{c}\cap H^{\star c}|

step1: Bayesian decision process

Before show the Bayesian framework, the paper use an example to mention why we choose Bayesian instead of frequency approach, such that we can propose some posterior of prior distribution on the soft-label $\theta _{i}$ . We assume each $\theta _{i}$ is drawn from a known Beta prior:

\theta _{i}\sim \mathrm {Beta} (a_{i}^{o},b_{i}^{o})

And the matrix:

s^{o}=\left\langle (a_{i}^{o},b_{i}^{o})\right\rangle _{i=1}^{k}\in {\textbf {R}}^{k\times 2}

So we know that the Bernoulli conjugate of beta, so once we get a new label for item i, we going to update posterior distribution, the beta distribution by:

\theta _{i}\sim \mathrm {Beta} (a_{i}^{t},b_{i}^{t})

y_{i}\mid \theta _{i}\sim \mathrm {Bernoulli} (\theta _{i})

\theta _{i}\mid y_{i}=1\sim \mathrm {Beta} (a_{i}^{t}+1,b_{i}^{t})

\theta _{i}\mid y_{i}=-1\sim \mathrm {Beta} (a_{i}^{t}+1,b_{i}^{t})

Depending on the label is positive or negative.

Here is the whole procedure in the high level, we have T stage, $0\leq t\leq T-1$ . And in current stage we look at matrix S, which summarized the posterior distribution information for all the $\theta _{i}$

s^{t}=\left\langle (a_{i}^{t},b_{i}^{t})\right\rangle _{i=1}^{k}\in {\textbf {R}}^{k\times 2}

We are going to make a decision, choose the next item to label $i_{t}$ , $i_{t}\in \{1,2,\ldots ,k\}$ .

And depending what the label is positive or negative, we add a matrix to getting a label:

\theta _{i}\sim \mathrm {Beta} (a_{i}^{t},b_{i}^{t})

y_{i}\mid \theta _{i}\sim \mathrm {Bernoulli} (\theta _{i})

\theta _{i}\mid y_{i}=1\sim \mathrm {Beta} (a_{i}^{t}+1,b_{i}^{t})

\theta _{i}\mid y_{i}=-1\sim \mathrm {Beta} (a_{i}^{t}+1,b_{i}^{t})

Above all, this is the whole framework.

step2: Inference on positive set

When the t labels are collected, we can make an inference about the positive set H_t based on posterior distribution given by S_t

{\begin{aligned}H_{t}&=\operatorname {argmax} \limits _{H\subset \{1,2,\ldots ,k\}}E\left(\sum _{i=1}^{k}({\textbf {1}}(i\in H){\textbf {1}}(i\in H^{\star })+{\textbf {1}}(i\notin H){\textbf {1}}{(i\notin H^{\star })})\mid S^{\star }\right)\\&=\operatorname {argmax} \limits _{H\subset \{1,2,\ldots ,k\}}\sum _{i=1}^{k}({\textbf {1}}(i\in H)\Pr(i\in H^{\star }\mid S^{t})+{\textbf {1}}(i\notin H)\Pr(i\notin H^{\star }\mid S^{t}))\\&=\{i:\Pr(i\in H^{\star }\mid S^{t})\geq 0.5\}\end{aligned}}

So here become the Bernoulli selection problem, we just take to look at the probability of being positive or being negative conditional $S_{t}$ to see is greater than 0.5 or not, if it is greater than 0.5, then we prove this item into the current infer positive set $H_{t}$ so this is a cost form for current optimal solution $H_{t}$ based on the information in $S_{t}$ .

After know what is optimal solution, then the paper show what is the optimal value. Plug $t$ in the optimal function,

h(x)=\max(x,1-x)

This function is just a single function which choose the larger one between the conditional probability of being positive and being negative. Once we get one more label for item i, we take a difference between this value, before and after we get a new label, we can see this conditional probability can actually simplify as follows:

{\begin{aligned}R(s^{t},i_{t},y_{i_{t}})&=\sum _{i=1}^{k}h(\Pr {(i\in H^{\star }\mid s^{t+1})})-\sum _{i=1}^{k}h(\Pr(i\in H^{\star }\mid s^{t}))\\&=\sum _{i=1}^{k}h(\Pr {(a_{i}^{t+1,b_{i}^{t+1}})})-\sum _{i=1}^{k}h(\Pr(a_{i}^{t},b_{i}^{t})).\end{aligned}}

The positive item being positive only depends on the beta posterior, therefore, if only the function of parameter of beta distribution function are a and b, as

h(\Pr(a_{i_{t}}^{t+1},b_{i_{t}}^{t+1}))-h(\Pr(a_{i_{t}}^{t},b_{i_{t}}^{t}))

One more label for this particular item, we double change the posterior function, so all of this items can be cancel except 1, so this is the change for whole accuracy and we defined as stage-wise reward: improvement the inference accuracy by one more sample. Of course this label have two positive value, we’ve get positive label or negative label, take average for this two, get expect reward. We just choose item to be label such that the expect reward is maximized using Knowledge Gradient:

{\begin{aligned}i_{t}&=\operatorname {argmax} \limits _{i\in \{1,2,\ldots ,k\}}E(R(s^{t},i,y_{i})\mid s^{t})\\&=\operatorname {argmax} \limits _{i\in \{1,2,\ldots ,k\}}\left({\frac {a_{i}^{t}}{a_{i}^{t}+b_{i}^{t}}}R(s^{t},i,1)+{\frac {b_{i}^{t}}{a_{i}^{t}+b_{i}^{t}}}R(s^{t},i,-1)\right)\end{aligned}}

They are multiple items, let us know how do we break the ties. If we break the tie deterministically, which means we choose the smallest index. We are going to have a problem because this is not consistent which means the positive stage $H_{t}$ does not converge to the true positive stage $H^{*}$ .

So we can also try to break the ties randomly, it works, however, we will see the performance is almost like uniform sampling, is the best reward. The writer’s policy is kinds of more greedy, instead of choosing the average in stage once reward, we can actually calculate the larger one, the max of the two stage possible reward, so Optimistic Knowledge Gradient:

i_{t}=\operatorname {argmax} \limits _{i\in \{1,\ldots ,k\}}(R^{+}(S^{t},i))=\max(R(S^{t},i,1),R(S^{t},i,-1))

And we know under optimistic knowledge gradient, the final inference accuracy converge to 100%. Above is based on every worker is perfect, however, in practice, workers are not always responsible. So if in imperfect workers, we assume K items, $1\leq i\leq k$ .

\theta _{i}\in (0,1)\sim \mathrm {Bet} a(a_{i}^{o},b_{i}^{o})

The probability of item $i$ being labeled as positive by a perfect worker. M workers, $1\leq j\leq M$ , $\rho _{j}\in (0,1)\sim \mathrm {Beta} (c_{j}^{o},d_{j}^{o})$ The probability of worker $j$ giving the same label as a perfect worker. Distribution of the label $Z_{ij}$ from worker $j$ to item $i$ :

\Pr(Z_{ij}=1\mid \theta _{i},\rho _{j})=\Pr(Z_{ij}=1\mid Y_{i}=1)\Pr(Y_{i}=1)+\Pr(Z_{ij}=1\mid Y_{i}=-1)\Pr(Y_{i}=-1)=\rho _{j}\theta _{i}t(1-\rho _{j})(1-\theta _{i})

And the action space is that

\Pr(Z_{ij}=1\mid \theta _{i},\rho _{j})=Pr(Z_{ij}=1\mid Y_{i}=1)\Pr(Y_{i}=1)+\Pr(Z_{ij}=1\mid Y_{i}=-1)\Pr(Y_{i}=-1)=\rho _{j}\theta _{i}t(1-\rho _{j})(1-\theta _{i})=\rho _{j}\theta _{i}t(1-\rho _{j})(1-\theta _{i}),

where $\qquad \qquad (i,j)\in \{1,2,\ldots ,k\}\times \{1,2,\ldots ,M\}$ , label matrix: $Z_{ij}\in \{-1,1\}$

It is difficult to calculate, so we can use Variational Bayesian methods ^[5] of $\Pr(i\in H^{\star }\mid S^{t})$

Related Research Articles

A centripetal force is a force that makes a body follow a curved path. The direction of the centripetal force is always orthogonal to the motion of the body and towards the fixed point of the instantaneous center of curvature of the path. Isaac Newton described it as "a force by which bodies are drawn or impelled, or in any way tend, towards a point as to a centre". In Newtonian mechanics, gravity provides the centripetal force causing astronomical orbits.

The Fresnel equations describe the reflection and transmission of light when incident on an interface between different optical media. They were deduced by French engineer and physicist Augustin-Jean Fresnel who was the first to understand that light is a transverse wave, when no one realized that the waves were electric and magnetic fields. For the first time, polarization could be understood quantitatively, as Fresnel's equations correctly predicted the differing behaviour of waves of the s and p polarizations incident upon a material interface.

The likelihood function is the joint probability mass of observed data viewed as a function of the parameters of a statistical model. Intuitively, the likelihood function $is the probability of observing data assuming is the actual parameter.$

<span class="mw-page-title-main">Bremsstrahlung</span> Electromagnetic radiation due to deceleration of charged particles

In particle physics, bremsstrahlung is electromagnetic radiation produced by the deceleration of a charged particle when deflected by another charged particle, typically an electron by an atomic nucleus. The moving particle loses kinetic energy, which is converted into radiation, thus satisfying the law of conservation of energy. The term is also used to refer to the process of producing the radiation. Bremsstrahlung has a continuous spectrum, which becomes more intense and whose peak intensity shifts toward higher frequencies as the change of the energy of the decelerated particles increases.

In mathematical analysis, Hölder's inequality, named after Otto Hölder, is a fundamental inequality between integrals and an indispensable tool for the study of $L p$ spaces.

In probability theory and statistics, the gamma distribution is a versatile two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use:

With a shape parameter $k$ and a scale parameter $θ$
With a shape parameter $and an inverse scale parameter, called a rate parameter.$

In mathematics, the beta function, also called the Euler integral of the first kind, is a special function that is closely related to the gamma function and to binomial coefficients. It is defined by the integral

<span class="mw-page-title-main">Halbach array</span> Special arrangement of permanent magnets

A Halbach array is a special arrangement of permanent magnets that augments the magnetic field on one side of the array while cancelling the field to near zero on the other side. This is achieved by having a spatially rotating pattern of magnetisation.

In multivariable calculus, the implicit function theorem is a tool that allows relations to be converted to functions of several real variables. It does so by representing the relation as the graph of a function. There may not be a single function whose graph can represent the entire relation, but there may be such a function on a restriction of the domain of the relation. The implicit function theorem gives a sufficient condition to ensure that there is such a function.

In linear algebra, a rotation matrix is a transformation matrix that is used to perform a rotation in Euclidean space. For example, using the convention below, the matrix

In mathematics, the Riesz–Thorin theorem, often referred to as the Riesz–Thorin interpolation theorem or the Riesz–Thorin convexity theorem, is a result about interpolation of operators. It is named after Marcel Riesz and his student G. Olof Thorin.

The Pythagorean trigonometric identity, also called simply the Pythagorean identity, is an identity expressing the Pythagorean theorem in terms of trigonometric functions. Along with the sum-of-angles formulae, it is one of the basic relations between the sine and cosine functions.

In quantum field theory, the Lehmann–Symanzik–Zimmermann (LSZ) reduction formula is a method to calculate S-matrix elements from the time-ordered correlation functions of a quantum field theory. It is a step of the path that starts from the Lagrangian of some quantum field theory and leads to prediction of measurable quantities. It is named after the three German physicists Harry Lehmann, Kurt Symanzik and Wolfhart Zimmermann.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network for modeling automatically extracted topics in textual corpora. The LDA is an example of a Bayesian topic model. In this, observations are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

In general relativity, a point mass deflects a light ray with impact parameter $by an angle approximately equal to$

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.

In mathematics, the Kodaira–Spencer map, introduced by Kunihiko Kodaira and Donald C. Spencer, is a map associated to a deformation of a scheme or complex manifold X, taking a tangent space of a point of the deformation space to the first cohomology group of the sheaf of vector fields on X.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

The hyperbolastic functions, also known as hyperbolastic growth models, are mathematical functions that are used in medical statistical modeling. These models were originally developed to capture the growth dynamics of multicellular tumor spheres, and were introduced in 2005 by Mohammad Tabatabai, David Williams, and Zoran Bursac. The precision of hyperbolastic functions in modeling real world problems is somewhat due to their flexibility in their point of inflection. These functions can be used in a wide variety of modeling problems such as tumor growth, stem cell proliferation, pharma kinetics, cancer growth, sigmoid activation function in neural networks, and epidemiological disease progression or regression.

References

↑ Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling Xi Chen, Qihang Lin, Dengyong Zhou; 16(Jan):1−46, 2015.
↑ Proceedings of the 30-th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR:W&CP volume 28. Xi Chen, Qihang Lin, Dengyong Zhou
↑
- Learning to Solve Markovian Decision Processes by Satinder P. Singh
↑ An Introduction to Dynamic Programming
↑
- Variational-Bayes Repository A repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling Xi Chen, Qihang Lin, Dengyong Zhou; 16(Jan):1−46, 2015.

[2] Proceedings of the 30-th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR:W&CP volume 28. Xi Chen, Qihang Lin, Dengyong Zhou

[3] 
Learning to Solve Markovian Decision Processes by Satinder P. Singh

[mwxQ] Learning to Solve Markovian Decision Processes by Satinder P. Singh

[4] An Introduction to Dynamic Programming

[5] 
Variational-Bayes Repository A repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning

[mwzg] Variational-Bayes Repository A repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning

[1]

[2]

[3]

[4]

[5]