Gittins index

The Gittins index is a measure of the reward that can be achieved through a given stochastic process with certain properties, namely: the process has an ultimate termination state and, at each intermediate state, there is an option of terminating. Upon terminating at a given state, the reward achieved is the sum of the expected rewards associated with every state from the actual terminating state to the ultimate terminal state, inclusive. The index is a real scalar.

Terminology

To illustrate the theory we can take two examples from a developing sector, such as electricity generating technologies: wind power and wave power. If we are presented with the two technologies when they are both merely proposed ideas, we cannot say which will be better in the long run, as we have no data yet on which to base our judgments. [1] It would be easy to say that wave power would be too problematic to develop, as it seems easier to put up many wind turbines than to make long floating generators, tow them out to sea and lay the necessary cables.

If we were to make a judgment call at that early stage of development, we could be condemning one technology to being put on the shelf while the other is developed and put into operation. If instead we develop both technologies, we can make a judgment call on each by comparing their progress at a set time interval, such as every three months, and base the decisions about investment in the next stage on those results. [1]

In his 1979 paper Bandit Processes and Dynamic Allocation Indices, John C. Gittins suggests a solution for problems such as this. He takes two basic problems, the "scheduling problem" and the "multi-armed bandit" problem, [2] and shows how they can be solved using dynamic allocation indices. He first considers the scheduling problem, reduced to a machine which has to perform jobs within a set time period, every hour or day for example. The machine is given a reward value for finishing or not finishing within the time period, and a probability of finishing is estimated for each job. The problem is "to decide which job to process next at each stage so as to maximize the total expected reward." [1] He then moves on to the multi-armed bandit problem, where each pull on a "one-armed bandit" lever earns a reward if the pull is successful and zero reward if it is unsuccessful. The sequence of successes forms a Bernoulli process with an unknown probability of success. There are multiple bandits, and the distribution of successful pulls is unknown and different for each machine. Gittins states that the problem here is "to decide which arm to pull next at each stage so as to maximize the total expected reward from an infinite sequence of pulls." [1]

Gittins says that "Both the problems described above involve a sequence of decisions, each of which is based on more information than its predecessors, and both these problems may be tackled by dynamic allocation indices." [2]

Definition

In applied mathematics, the "Gittins index" is a real scalar value associated with the state of a stochastic process that has a reward function and a probability of termination. It is a measure of the reward that can be achieved by the process evolving from that state onward, given the probability that it will terminate in the future. The "index policy" induced by the Gittins index, consisting of choosing at any time the stochastic process with the currently highest Gittins index, is the solution of some stopping problems such as the dynamic allocation problem, where a decision-maker has to maximize the total reward by distributing a limited amount of effort among a number of competing projects, each returning a stochastic reward. If the projects are independent of each other and only one project at a time may evolve, the problem is called the multi-armed bandit problem (a type of stochastic scheduling problem) and the Gittins index policy is optimal. If multiple projects can evolve simultaneously, the problem is called the restless bandit problem; there the Gittins index policy is a known good heuristic, but no optimal solution is known in general. In fact, the general problem is NP-hard and it is generally accepted that no computationally feasible general solution can be found.
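As a purely illustrative sketch (not code from any of the cited sources), the index policy can be written as a loop that, at each decision epoch, activates the project whose current state has the highest index. The `Project` interface below and the assumption that per-state indices are already available are hypothetical simplifications.

```python
# Minimal sketch of a Gittins index policy: at every decision epoch,
# activate the project whose current state has the highest index.
# The `index` callables are assumed given (e.g. precomputed per state);
# this interface is illustrative, not an API from the literature.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Project:
    state: object
    index: Callable[[object], float]          # state -> Gittins index (assumed known)
    step: Callable[[object], Tuple[float, object]]  # state -> (reward, next_state), stochastic


def run_index_policy(projects: List[Project], horizon: int) -> float:
    """Activate one project per epoch, always the one with the largest current index."""
    total_reward = 0.0
    for _ in range(horizon):
        best = max(projects, key=lambda p: p.index(p.state))
        reward, next_state = best.step(best.state)
        best.state = next_state
        total_reward += reward
    return total_reward
```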

History

Questions about optimal stopping policies in the context of clinical trials had been open since the 1940s, and in the 1960s a few authors analyzed simple models leading to optimal index policies, [3] but it was only in the 1970s that Gittins and his collaborators demonstrated in a Markovian framework that the optimal solution of the general case is an index policy whose "dynamic allocation index" is computable in principle for every state of each project as a function of the single project's dynamics. [2] [4] In parallel to Gittins, Martin Weitzman established the same result in the economics literature. [5]

Soon after the seminal paper of Gittins, Peter Whittle [6] demonstrated that the index emerges as a Lagrange multiplier from a dynamic programming formulation of the problem called the retirement process, and conjectured that the same index would be a good heuristic in a more general setup named the restless bandit.

The question of how to actually calculate the index for Markov chains was first addressed by Varaiya and his collaborators, [7] with an algorithm that computes the indexes from the largest down to the smallest, and by Chen and Katehakis, [8] who showed that standard linear programming could be used to calculate the index of a state without requiring its calculation for all states with higher index values. L. C. M. Kallenberg [9] provided a parametric LP implementation to compute the indices for all states of a Markov chain. Further, Katehakis and Veinott [10] demonstrated that the index is the expected reward of a Markov decision process constructed over the Markov chain, known as restart in state, and can be calculated exactly by solving that problem with the policy iteration algorithm, or approximately with the value iteration algorithm. This approach also has the advantage of calculating the index for one specific state without having to calculate all the greater indexes, and it is valid under more general state space conditions.

A faster algorithm for the calculation of all indices was obtained in 2004 by Sonin [11] as a consequence of his elimination algorithm for the optimal stopping of a Markov chain. In this algorithm the termination probability of the process may depend on the current state rather than being a fixed factor. A faster algorithm was proposed in 2007 by Niño-Mora [12] by exploiting the structure of a parametric simplex to reduce the computational effort of the pivot steps, thereby achieving the same complexity as the Gaussian elimination algorithm. Cowan and Katehakis (2014) [13] provide a solution to the problem with potentially non-Markovian, uncountable-state-space reward processes, under frameworks in which either the discount factors may be non-uniform and vary over time, or the periods of activation of each bandit may not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices.

Mathematical definition

Dynamic allocation index

The classical definition by Gittins et al. is:

$$\nu(i) = \sup_{\tau>0} \frac{\left\langle \sum_{t=0}^{\tau-1} \beta^{t}\, R[Z(t)] \right\rangle_{Z(0)=i}}{\left\langle \sum_{t=0}^{\tau-1} \beta^{t} \right\rangle_{Z(0)=i}}$$

where $Z(\cdot)$ is a stochastic process, $R(i)$ is the utility (also called reward) associated to the discrete state $i$, $\beta < 1$ is the probability that the stochastic process does not terminate, $\tau$ ranges over the stopping times of the process, and $\langle\cdot\rangle_c$ is the conditional expectation operator given $c$:

$$\langle X \rangle_c \doteq \int_{\chi} X \, \mathrm{d}\mathbb{P}(X \mid c)$$

with $\chi$ being the domain of $X$.
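For a small finite-state chain, the supremum over stopping times can be restricted to exit times of "continuation sets" containing the initial state, so the index can be computed by brute force. The following Python sketch (an illustration using NumPy under that assumption; the function name and interface are my own, not taken from the cited sources) evaluates the ratio of expected discounted reward to expected discounted time for every continuation set and returns the maximum.

```python
import itertools

import numpy as np


def gittins_index_bruteforce(P, R, beta, i):
    """Gittins index of state i for a finite chain with transition matrix P,
    reward vector R and survival (discount) probability beta < 1.

    For each continuation set S containing i, the stopping time "first exit
    from S" yields an expected discounted reward r_S and an expected
    discounted time d_S, both obtained by solving (I - beta * P_SS) x = b on
    S; the index is the maximum of r_S(i) / d_S(i) over all such sets.
    """
    n = len(R)
    others = [j for j in range(n) if j != i]
    best = -np.inf
    for k in range(len(others) + 1):
        for subset in itertools.combinations(others, k):
            S = [i] + list(subset)                    # state i is always element 0 of S
            P_SS = P[np.ix_(S, S)]
            A = np.eye(len(S)) - beta * P_SS
            r = np.linalg.solve(A, R[S])              # expected discounted reward before exit
            d = np.linalg.solve(A, np.ones(len(S)))   # expected discounted time before exit
            best = max(best, r[0] / d[0])
    return best


# Tiny usage example (a toy two-state chain):
# P = np.array([[0.5, 0.5], [0.0, 1.0]]); R = np.array([1.0, 0.0])
# print(gittins_index_bruteforce(P, R, beta=0.9, i=0))   # -> 1.0
```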

Retirement process formulation

The dynamic programming formulation in terms of the retirement process, given by Whittle, is:

$$w(i) = \inf \left\{ k : v(i,k) = k \right\}$$

where $v(i,k)$ is the value function

$$v(i,k) = \sup_{\tau>0} \left\langle \sum_{t=0}^{\tau-1} \beta^{t}\, R[Z(t)] + \beta^{\tau} k \right\rangle_{Z(0)=i}$$

with the same notation as above. It holds that

$$\nu(i) = (1-\beta)\, w(i).$$
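As a brief aside (not part of the original article), here is a derivation sketch of why this relation holds, written against the notation reconstructed above; it simply rearranges the indifference condition at the threshold retirement reward and applies the geometric-series identity.

```latex
% Sketch of why \nu(i) = (1-\beta)\, w(i); write \langle\cdot\rangle_i for \langle\cdot\rangle_{Z(0)=i}.
% At the indifference threshold k = w(i), continuing optimally is worth exactly the retirement reward k:
\begin{align*}
\sup_{\tau>0}\Big\langle \sum_{t=0}^{\tau-1}\beta^{t} R[Z(t)] + \beta^{\tau} k \Big\rangle_i = k
\;&\Longleftrightarrow\;
\sup_{\tau>0}\Big[\Big\langle \sum_{t=0}^{\tau-1}\beta^{t} R[Z(t)]\Big\rangle_i
 - k\,(1-\beta)\Big\langle \sum_{t=0}^{\tau-1}\beta^{t}\Big\rangle_i\Big] = 0
\\[2pt]
&\Longleftrightarrow\;
\sup_{\tau>0}
 \frac{\big\langle \sum_{t=0}^{\tau-1}\beta^{t} R[Z(t)]\big\rangle_i}
      {\big\langle \sum_{t=0}^{\tau-1}\beta^{t}\big\rangle_i} = (1-\beta)\,k .
\end{align*}
% The first step uses 1-\beta^{\tau} = (1-\beta)\sum_{t=0}^{\tau-1}\beta^{t};
% the second uses that the expected discounted time is strictly positive.
% Evaluating at k = w(i) gives \nu(i) = (1-\beta)\, w(i).
```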

Restart-in-state formulation

If $Z(\cdot)$ is a Markov chain with rewards, the interpretation of Katehakis and Veinott (1987) associates to every state the action of restarting from one arbitrary state $i$, thereby constructing a Markov decision process $M_i$.

The Gittins index of that state $i$ is the highest total discounted reward which can be achieved on $M_i$ if one can always choose to continue or restart from that state $i$:

$$h(i) = \sup_{\pi} \left\langle \sum_{t=0}^{\infty} \beta^{t}\, R[Z^{\pi}(t)] \right\rangle_{Z^{\pi}(0)=i}$$

where $\pi$ indicates a policy over $M_i$. It holds that

$$\nu(i) = (1-\beta)\, h(i).$$
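Because the restart-in-state construction yields an ordinary Markov decision process, its value, and hence the index, can be approximated by value iteration, as mentioned in the History section. The following Python sketch is a minimal illustration of that idea under the notation above; the function name and interface are assumptions of this example, not an implementation from Katehakis and Veinott.

```python
import numpy as np


def gittins_index_restart(P, R, beta, i, tol=1e-10, max_iter=100_000):
    """Approximate the Gittins index of state i via the restart-in-state MDP.

    In the MDP M_i, every state j offers two actions:
      * continue: collect R(j) and move according to row P[j];
      * restart:  behave as if in state i, i.e. collect R(i) and move per P[i].
    Value iteration converges to the value h, and the index is
    nu(i) = (1 - beta) * h(i).
    """
    V = np.zeros(len(R))
    for _ in range(max_iter):
        continue_value = R + beta * (P @ V)        # value of continuing from each state
        restart_value = R[i] + beta * (P[i] @ V)   # value of restarting from state i (scalar)
        V_new = np.maximum(continue_value, restart_value)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[i]


# Usage sketch (same toy two-state chain as above):
# P = np.array([[0.5, 0.5], [0.0, 1.0]]); R = np.array([1.0, 0.0])
# print(gittins_index_restart(P, R, beta=0.9, i=0))   # -> approximately 1.0
```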

Generalized index

If the probability of survival $\beta(i)$ depends on the state $i$, a generalization introduced by Sonin [11] (2008) defines the Gittins index $\alpha(i)$ as the maximum discounted total reward per chance of termination:

$$\alpha(i) = \sup_{\tau>0} \frac{R^{\tau}(i)}{Q^{\tau}(i)}$$

where $R^{\tau}(i)$ is the expected total reward collected before the stopping time $\tau$ (discounting being effected through the state-dependent survival probabilities) and $Q^{\tau}(i)$ is the probability that the process terminates before $\tau$, both conditional on $Z(0)=i$.

If the constant survival probability $\beta$ is replaced by the state-dependent $\beta(i)$ in the definitions of $\nu(i)$, $w(i)$ and $h(i)$, the generalized index coincides with the retirement and restart values, $\alpha(i) = w(i) = h(i)$, while in the constant-$\beta$ case $\nu(i) = (1-\beta)\,\alpha(i)$; this observation leads Sonin [11] to conclude that $\alpha(i)$, and not $\nu(i)$, is the "true meaning" of the Gittins index.

Queueing theory

In queueing theory, the Gittins index is used to determine the optimal scheduling of jobs, e.g., in an M/G/1 queue. The mean completion time of jobs under a Gittins index schedule can be determined using the SOAP approach. [14] Note that the dynamics of the queue are intrinsically Markovian, with stochasticity due to the arrival and service processes. This is in contrast to most of the work in the learning literature, where stochasticity is explicitly accounted for through a noise term.
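To illustrate one common way the index is written for single-server scheduling (a sketch of a standard formulation, not the SOAP notation of [14]): a job that has already received some amount of service is ranked by the supremum, over an additional service quota, of the probability that it completes within that quota divided by the expected additional service it would absorb, and the server runs the job with the highest rank. The Python sketch below computes this quantity for a discrete service-time distribution; names and conventions are illustrative assumptions.

```python
import numpy as np


def gittins_rank(service_pmf, attained):
    """Gittins rank of a job with a discrete service-time distribution and
    `attained` units of service already received: the supremum over quotas of
    P(complete within quota) / E(extra service used within quota).

    service_pmf[k] = P(S = k + 1) for integer service times 1, 2, ...
    """
    pmf = np.asarray(service_pmf, dtype=float)
    if attained >= len(pmf) or pmf[attained:].sum() == 0.0:
        return 0.0                                   # the job cannot still be in service
    # Distribution of the remaining service S - attained, given S > attained.
    tail = pmf[attained:] / pmf[attained:].sum()
    best = 0.0
    completed = 0.0
    for delta in range(1, len(tail) + 1):
        completed += tail[delta - 1]                 # P(remaining service <= delta)
        # Expected service actually used if the job is granted `delta` more units.
        expected_service = sum(min(k + 1, delta) * tail[k] for k in range(len(tail)))
        best = max(best, completed / expected_service)
    return best


# Usage sketch: a job with service time uniform on {1, ..., 4} and 1 unit already served.
# print(gittins_rank([0.25, 0.25, 0.25, 0.25], attained=1))
```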

Fractional problems

While conventional Gittins indices induce a policy to optimize the accrual of a reward, a common problem setting consists of optimizing the ratio of accrued rewards. For example, this is the case for systems that maximize bandwidth (data over time) or minimize power consumption (energy over time).

This class of problems is different from the optimization of a semi-Markov reward process, because the latter might select states with a disproportionate sojourn time just to accrue a higher reward. Instead, it corresponds to the class of linear-fractional Markov reward optimization problems.

However, a detrimental aspect of such ratio optimizations is that, once the achieved ratio in some state is high, the optimization might select states leading to a low ratio because they bear a high probability of termination, so that the process is likely to terminate before the ratio drops significantly. A problem setting to prevent such early terminations consists of defining the optimization as maximization of the future ratio seen by each state. An indexation is conjectured to exist for this problem, to be computable as a simple variation of existing restart-in-state or state elimination algorithms, and it has been evaluated to work well in practice. [15]

Notes

  1. Cowan, Robin (July 1991). "Tortoises and Hares: Choice among technologies of unknown merit". The Economic Journal. 101 (407): 801–814. doi:10.2307/2233856. JSTOR 2233856.
  2. Gittins, J. C. (1979). "Bandit Processes and Dynamic Allocation Indices". Journal of the Royal Statistical Society. Series B (Methodological). 41 (2): 148–177. doi:10.1111/j.2517-6161.1979.tb01068.x. JSTOR 2985029. S2CID 17724147.
  3. Mitten, L. (1960). "An Analytic Solution to the Least Cost Testing Sequence Problem". Journal of Industrial Engineering. 11 (1): 17.
  4. Gittins, J. C.; Jones, D. M. (1979). "A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem". Biometrika. 66 (3): 561–565. doi:10.2307/2335176. JSTOR 2335176.
  5. Weitzman, Martin L. (1979). "Optimal Search for the Best Alternative". Econometrica. 47 (3): 641–654. doi:10.2307/1910412. hdl:1721.1/31303. JSTOR 1910412. S2CID 32530881.
  6. Whittle, Peter (1980). "Multi-armed Bandits and the Gittins Index". Journal of the Royal Statistical Society, Series B. 42 (2): 143–149.
  7. Varaiya, P.; Walrand, J.; Buyukkoc, C. (May 1985). "Extensions of the multiarmed bandit problem: The discounted case". IEEE Transactions on Automatic Control. 30 (5): 426–439. doi:10.1109/TAC.1985.1103989.
  8. Chen, Yih Ren; Katehakis, Michael N. (1986). "Linear programming for finite state multi-armed bandit problems". Mathematics of Operations Research. 11 (1): 180–183. doi:10.1287/moor.11.1.180.
  9. Kallenberg, Lodewijk C. M. (1986). "A Note on M. N. Katehakis' and Y.-R. Chen's Computation of the Gittins Index". Mathematics of Operations Research. 11 (1): 184–186. doi:10.1287/moor.11.1.184.
  10. Katehakis, Michael N.; Veinott, Arthur F. Jr. (1987). "The multi-armed bandit problem: decomposition and computation". Mathematics of Operations Research. 12 (2): 262–268. doi:10.1287/moor.12.2.262. JSTOR 3689689. S2CID 656323.
  11. Sonin, I. (2008). "A generalized Gittins index for a Markov chain and its recursive calculation". Statistics and Probability Letters. 78 (12): 1526–1533. doi:10.1016/j.spl.2008.01.049.
  12. Niño-Mora, José (2007). "A (2/3)^n Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain". INFORMS Journal on Computing. 19 (4): 596–606. CiteSeerX 10.1.1.77.5127. doi:10.1287/ijoc.1060.0206. S2CID 122785013.
  13. Cowan, Wesley; Katehakis, Michael N. (January 2015). "Multi-armed bandits under general depreciation and commitment". Probability in the Engineering and Informational Sciences. 29 (1): 51–76. doi:10.1017/S0269964814000217.
  14. Scully, Ziv; Harchol-Balter, Mor; Scheller-Wolf, Alan (2018). "SOAP: One Clean Analysis of All Age-Based Scheduling Policies". Proceedings of the ACM on Measurement and Analysis of Computing Systems. 2 (1). ACM: 16. doi:10.1145/3179419. S2CID 216145213.
  15. Di Gregorio, Lorenzo; Frascolla, Valerio (October 1, 2019). Handover Optimality in Heterogeneous Networks. 5G World Forum. arXiv:1908.09991v2.
