Optimistic knowledge gradient

Last updated

In statistics, the optimistic knowledge gradient [1] is a smart decision-making strategy developed by Xi Chen, Qihang Lin and Dengyong Zhou in 2013 to help solve complex problems in crowdsourced data labeling (a form of optimal computing budget allocation problem). In crowdsourcing, multiple people are asked to label or classify data, but each labeling attempt comes with a cost. [2]

Contents

The main challenge is figuring out the most efficient way to allocate resources when you want to get the most accurate labels without spending too much money. Imagine you're running a project where you need to classify thousands of images, and each person you ask to label an image charges a fee. The optimistic knowledge gradient helps you determine the most cost-effective way to get the most reliable labels by strategically choosing which items to have labeled and by whom.

This approach is particularly useful in machine learning and data science, where getting accurate labeled data is crucial but can be expensive. By using mathematical techniques, the method tries to maximize the information gained while minimizing the overall cost of labeling.

Motivation

The optimal computing budget allocation problem is formulated as a Bayesian Markov decision process [3] (MDP) and is solved by using the dynamic programming (DP) algorithm where the Optimistic knowledge gradient policy is used to solve the computationally intractable of the dynamic programming [4] (DP) algorithm.

Consider a budget allocation issue in crowdsourcing. The particular crowdsourcing problem we considering is crowd labeling. Crowd labeling is a large amount of labeling tasks which are hard to solve by machine, turn out to easy to solve by human beings, then we just outsourced to an unidentified group of random people in a distributed environment.

Methodology

We want to finish this labeling tasks rely on the power of the crowd hopefully. For example, suppose we want to identify a picture according to the people in a picture is adult or not, this is a Bernoulli labeling problem, and all of us can do in one or two seconds, this is an easy task for human being. However, if we have tens of thousands picture like this, then this is no longer the easy task any more. That's why we need to rely on crowdsourcing framework to make this fast. Crowdsourcing framework of this consists of two steps. Step one, we just dynamically acquire from the crowd for items. On the other sides, this is dynamic procedure. We don't just send out this picture to everyone and we focus every response, instead, we do this in quantity. We are going to decide which picture we send it in the next, and which worker we are going to hire in the crowd in the next. According to his or her historical labeling results. And each picture can be sent to multiple workers and every worker can also work on different pictures. Then after we collect enough number of labels for different picture, we go to the second steps where we want to infer true label of each picture based on the collected labels. So there are multiple ways we can do inference. For instance, the simplest we can do this is just majority vote. The problem is that no free lunch, we have to pays for worker for each label he or she provides and we only have a limited project budget. So the question is how to spend the limited budget in a smart way.

Challenges

Before showing the mathematic model, the paper mentions what kinds of challenges we are facing.

Challenge 1

First of all, the items have a different level of difficulty to compute the label, in a previous example, some picture are easy to classify. In this case, you will usually see very consistent labels from the crowd. However, if some pictures are ambiguous, people may disagree with each other resulting in highly inconsistent labelling. So we may allocate more resources on this ambiguous task.

Challenge 2

And another difficulty we often have is that the worker are not perfect, sometimes this worker are not responsible, they just provide the random label, therefore, of course, we would not spend our budget on this no reliable workers. Now the problem is both the difficulty of the pictures and the reliability of the worker we completely unknown at the beginning. We can only estimate them during the procedure. Therefore, we are naturally facing to exploration and exploitation, and our goal is to give a reasonable good policy to spend money to the right way–maximize the overall accuracy of final inferred labels.

Mathematical model

For the mathematical model, we have the K items, , and total budget is T and we assume each label cost 1 so we are going to have T labels eventually. We assume each items has true label which positive or negative, this binomial cases and we can extended to multiple class, labeling cases, this a singular idea. And the positive set is defined as the set of items whose true label is positive. And also defined a soft-label, for each item which number between 0 and 1, and we define as underlying probability of being labeled as positive by a member randomly picked from a group of perfect workers.

In this first case, we assume for every worker is perfect, it means they all reliable, but being perfect doesn’t means this worker gives the same answer or right answer. It just means they will try their best to figure out the best answer in their mind, and suppose everyone is perfect worker, just randomly picked one of them, and with probability, we going to get a guy who believe this one is positive. That is how we explain . So we are assume a label is drawn from Bernoulli(), and must be consistent with the true label, which means is greater or equal to 0.5 if and only if this item is positive with a true positive label. So our goal is to learn H*, the set of positive items. In other word, we want to make an inferred positive set H based on collected labels to maximize:

It can also be written as:

step1: Bayesian decision process

Before show the Bayesian framework, the paper use an example to mention why we choose Bayesian instead of frequency approach, such that we can propose some posterior of prior distribution on the soft-label . We assume each is drawn from a known Beta prior:

And the matrix:

So we know that the Bernoulli conjugate of beta, so once we get a new label for item i, we going to update posterior distribution, the beta distribution by:

Depending on the label is positive or negative.

Here is the whole procedure in the high level, we have T stage, . And in current stage we look at matrix S, which summarized the posterior distribution information for all the

We are going to make a decision, choose the next item to label , .

And depending what the label is positive or negative, we add a matrix to getting a label:

Above all, this is the whole framework.

step2: Inference on positive set

When the t labels are collected, we can make an inference about the positive set Ht based on posterior distribution given by St

So here become the Bernoulli selection problem, we just take to look at the probability of being positive or being negative conditional to see is greater than 0.5 or not, if it is greater than 0.5, then we prove this item into the current infer positive set so this is a cost form for current optimal solution based on the information in .

After know what is optimal solution, then the paper show what is the optimal value. Plug in the optimal function,

This function is just a single function which choose the larger one between the conditional probability of being positive and being negative. Once we get one more label for item i, we take a difference between this value, before and after we get a new label, we can see this conditional probability can actually simplify as follows:

The positive item being positive only depends on the beta posterior, therefore, if only the function of parameter of beta distribution function are a and b, as

One more label for this particular item, we double change the posterior function, so all of this items can be cancel except 1, so this is the change for whole accuracy and we defined as stage-wise reward: improvement the inference accuracy by one more sample. Of course this label have two positive value, we’ve get positive label or negative label, take average for this two, get expect reward. We just choose item to be label such that the expect reward is maximized using Knowledge Gradient:

They are multiple items, let us know how do we break the ties. If we break the tie deterministically, which means we choose the smallest index. We are going to have a problem because this is not consistent which means the positive stage does not converge to the true positive stage .

So we can also try to break the ties randomly, it works, however, we will see the performance is almost like uniform sampling, is the best reward. The writer’s policy is kinds of more greedy, instead of choosing the average in stage once reward, we can actually calculate the larger one, the max of the two stage possible reward, so Optimistic Knowledge Gradient:

And we know under optimistic knowledge gradient, the final inference accuracy converge to 100%. Above is based on every worker is perfect, however, in practice, workers are not always responsible. So if in imperfect workers, we assume K items, .

The probability of item being labeled as positive by a perfect worker. M workers, , The probability of worker giving the same label as a perfect worker. Distribution of the label from worker to item :

And the action space is that

where , label matrix:

It is difficult to calculate, so we can use Variational Bayesian methods [5] of

Related Research Articles

<span class="mw-page-title-main">Centripetal force</span> Force directed to the center of rotation

A centripetal force is a force that makes a body follow a curved path. The direction of the centripetal force is always orthogonal to the motion of the body and towards the fixed point of the instantaneous center of curvature of the path. Isaac Newton described it as "a force by which bodies are drawn or impelled, or in any way tend, towards a point as to a centre". In Newtonian mechanics, gravity provides the centripetal force causing astronomical orbits.

A likelihood function measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters.

<span class="mw-page-title-main">Bremsstrahlung</span> Electromagnetic radiation due to deceleration of charged particles

In particle physics, bremsstrahlung is electromagnetic radiation produced by the deceleration of a charged particle when deflected by another charged particle, typically an electron by an atomic nucleus. The moving particle loses kinetic energy, which is converted into radiation, thus satisfying the law of conservation of energy. The term is also used to refer to the process of producing the radiation. Bremsstrahlung has a continuous spectrum, which becomes more intense and whose peak intensity shifts toward higher frequencies as the change of the energy of the decelerated particles increases.

In statistics, sufficiency is a property of a statistic computed on a sample dataset in relation to a parametric model of the dataset. A sufficient statistic contains all of the information that the dataset provides about the model parameters. It is closely related to the concepts of an ancillary statistic which contains no information about the model parameters, and of a complete statistic which only contains information about the parameters and no ancillary information.

In mathematical analysis, Hölder's inequality, named after Otto Hölder, is a fundamental inequality between integrals and an indispensable tool for the study of Lp spaces.

<span class="mw-page-title-main">Beta function</span> Mathematical function

In mathematics, the beta function, also called the Euler integral of the first kind, is a special function that is closely related to the gamma function and to binomial coefficients. It is defined by the integral

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability distributions. It can be used to calculate the distance between probability distributions.

<span class="mw-page-title-main">Halbach array</span> Special arrangement of permanent magnets

A Halbach array is a special arrangement of permanent magnets that augments the magnetic field on one side of the array while cancelling the field to near zero on the other side. This is achieved by having a spatially rotating pattern of magnetisation.

In multivariable calculus, the implicit function theorem is a tool that allows relations to be converted to functions of several real variables. It does so by representing the relation as the graph of a function. There may not be a single function whose graph can represent the entire relation, but there may be such a function on a restriction of the domain of the relation. The implicit function theorem gives a sufficient condition to ensure that there is such a function.

In linear algebra, a rotation matrix is a transformation matrix that is used to perform a rotation in Euclidean space. For example, using the convention below, the matrix

In mathematics, the Riesz–Thorin theorem, often referred to as the Riesz–Thorin interpolation theorem or the Riesz–Thorin convexity theorem, is a result about interpolation of operators. It is named after Marcel Riesz and his student G. Olof Thorin.

The Pythagorean trigonometric identity, also called simply the Pythagorean identity, is an identity expressing the Pythagorean theorem in terms of trigonometric functions. Along with the sum-of-angles formulae, it is one of the basic relations between the sine and cosine functions.

<span class="mw-page-title-main">LSZ reduction formula</span> Connection between correlation functions and the S-matrix

In quantum field theory, the Lehmann–Symanzik–Zimmermann (LSZ) reduction formula is a method to calculate S-matrix elements from the time-ordered correlation functions of a quantum field theory. It is a step of the path that starts from the Lagrangian of some quantum field theory and leads to prediction of measurable quantities. It is named after the three German physicists Harry Lehmann, Kurt Symanzik and Wolfhart Zimmermann.

In the theory of three-dimensional rotation, Rodrigues' rotation formula, named after Olinde Rodrigues, is an efficient algorithm for rotating a vector in space, given an axis and angle of rotation. By extension, this can be used to transform all three basis vectors to compute a rotation matrix in SO(3), the group of all rotation matrices, from an axis–angle representation. In terms of Lie theory, the Rodrigues' formula provides an algorithm to compute the exponential map from the Lie algebra so(3) to its Lie group SO(3).

In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network for modeling automatically extracted topics in textual corpora. The LDA is an example of a Bayesian topic model. In this, observations are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

There are several equivalent ways for defining trigonometric functions, and the proofs of the trigonometric identities between them depend on the chosen definition. The oldest and most elementary definitions are based on the geometry of right triangles and the ratio between their sides. The proofs given in this article use these definitions, and thus apply to non-negative angles not greater than a right angle. For greater and negative angles, see Trigonometric functions.

<span class="mw-page-title-main">Gravitational lensing formalism</span>

In general relativity, a point mass deflects a light ray with impact parameter by an angle approximately equal to

In mathematics, the Kodaira–Spencer map, introduced by Kunihiko Kodaira and Donald C. Spencer, is a map associated to a deformation of a scheme or complex manifold X, taking a tangent space of a point of the deformation space to the first cohomology group of the sheaf of vector fields on X.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

<span class="mw-page-title-main">Hyperbolastic functions</span> Mathematical functions

The hyperbolastic functions, also known as hyperbolastic growth models, are mathematical functions that are used in medical statistical modeling. These models were originally developed to capture the growth dynamics of multicellular tumor spheres, and were introduced in 2005 by Mohammad Tabatabai, David Williams, and Zoran Bursac. The precision of hyperbolastic functions in modeling real world problems is somewhat due to their flexibility in their point of inflection. These functions can be used in a wide variety of modeling problems such as tumor growth, stem cell proliferation, pharma kinetics, cancer growth, sigmoid activation function in neural networks, and epidemiological disease progression or regression.

References