Stochastic dynamic programming

Originally introduced by Richard E. Bellman in (Bellman 1957), stochastic dynamic programming is a technique for modelling and solving problems of decision making under uncertainty. Closely related to stochastic programming and dynamic programming, stochastic dynamic programming represents the problem under scrutiny in the form of a Bellman equation. The aim is to compute a policy prescribing how to act optimally in the face of uncertainty.

A motivating example: Gambling game

A gambler has $2. She is allowed to play a game of chance 4 times, and her goal is to maximize her probability of ending up with at least $6. If the gambler bets $b on a play of the game, then with probability 0.4 she wins the game, recoups the initial bet, and increases her capital position by $b; with probability 0.6, she loses the bet amount $b; all plays are pairwise independent. On any play of the game, the gambler may not bet more money than she has available at the beginning of that play. [1]

Stochastic dynamic programming can be employed to model this problem and determine a betting strategy that, for instance, maximizes the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon.

Note that if there is no limit to the number of games that can be played, the problem becomes a variant of the well known St. Petersburg paradox.

[Figure: Gambling game optimal policy] An optimal betting strategy that maximizes the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon; $b_t(\$x)$ represents the bet amount for game $t$ when the gambler has $\$x$ at the beginning of that play. If the decision maker follows this policy, with probability 0.1984 she will attain a wealth of at least $6.

Formal background

Consider a discrete system defined on $n$ stages in which each stage $t = 1, \ldots, n$ is characterized by

- an initial state $s_t \in S_t$, where $S_t$ is the set of feasible states at the beginning of stage $t$;
- a decision variable $x_t \in X_t$, where $X_t$ is the set of feasible actions at stage $t$ (note that $X_t$ may be a function of the initial state $s_t$);
- an immediate cost/reward function $p_t(s_t, x_t)$, representing the cost/reward at stage $t$ if $s_t$ is the initial state and $x_t$ the action selected;
- a state transition function $g_t(s_t, x_t)$ that leads the system towards state $s_{t+1} = g_t(s_t, x_t)$.

Let $f_t(s_t)$ represent the optimal cost/reward obtained by following an optimal policy over stages $t, t+1, \ldots, n$. Without loss of generality, in what follows we will consider a reward maximisation setting. In deterministic dynamic programming one usually deals with functional equations taking the following structure

$$f_t(s_t) = \max_{x_t \in X_t} \{ p_t(s_t, x_t) + f_{t+1}(s_{t+1}) \},$$

where $s_{t+1} = g_t(s_t, x_t)$ and the boundary condition of the system is

$$f_n(s_n) = \max_{x_n \in X_n} p_n(s_n, x_n).$$

The aim is to determine the set of optimal actions that maximise $f_1(s_1)$. Given the current state $s_t$ and the current action $x_t$, we know with certainty the reward secured during the current stage and, thanks to the state transition function $g_t$, the future state towards which the system transitions.
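To make the recursion concrete, what follows is a minimal Python sketch of this deterministic functional equation; it is an illustration under the notation above, where the callables p (immediate reward), g (state transition) and X (feasible action set) are placeholders to be supplied by the modeller, not part of any library.

def f(t, s, n, p, g, X):
    """Optimal reward f_t(s) over stages t, ..., n (deterministic case)."""
    if t == n:  # boundary condition: f_n(s_n) = max over x_n of p_n(s_n, x_n)
        return max(p(n, s, x) for x in X(n, s))
    # functional equation: immediate reward plus optimal reward-to-go
    return max(p(t, s, x) + f(t + 1, g(t, s, x), n, p, g, X) for x in X(t, s))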

In practice, however, even if we know the state of the system at the beginning of the current stage as well as the decision taken, the state of the system at the beginning of the next stage and the current period reward are often random variables that can be observed only at the end of the current stage.

Stochastic dynamic programming deals with problems in which the current period reward and/or the next period state are random, i.e. with multi-stage stochastic systems. The decision maker's goal is to maximise expected (discounted) reward over a given planning horizon.

In their most general form, stochastic dynamic programs deal with functional equations taking the following structure

$$f_t(s_t) = \max_{x_t \in X_t(s_t)} \left\{ \mathbb{E} \left[ p_t(s_t, x_t) + \alpha f_{t+1}(s_{t+1}) \right] \right\},$$

where

- $f_t(s_t)$ is the maximum expected reward that can be attained during stages $t, t+1, \ldots, n$, given state $s_t$ at the beginning of stage $t$;
- $x_t$ belongs to the set $X_t(s_t)$ of feasible actions at stage $t$ given initial state $s_t$;
- $\alpha$ is the discount factor;
- $s_{t+1} = g_t(s_t, x_t)$ is the state of the system at the beginning of stage $t+1$; the expectation is taken with respect to the random elements affecting the current-stage reward and the state transition.

Markov decision processes represent a special class of stochastic dynamic programs in which the underlying stochastic process is a stationary process that features the Markov property.
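The stochastic functional equation above can be sketched in the same style as the deterministic one. This is a minimal illustration, not a library API: it assumes the stage-$t$ randomness is described by a finite list pmf[t] of pairs (r, q), a realised outcome and its probability, and that the reward p and transition g depend on the realised outcome r.

def f(t, s, n, p, g, X, pmf, alpha=1.0):
    """Maximum expected discounted reward over stages t, ..., n from state s."""
    def expected(x):
        # expectation over the stage-t outcomes (r, q): realisation and probability
        return sum(
            q * (p(t, s, x, r)
                 + (alpha * f(t + 1, g(t, s, x, r), n, p, g, X, pmf, alpha)
                    if t < n else 0.0))
            for r, q in pmf[t]
        )
    # functional equation: select a feasible action maximising the expected reward
    return max(expected(x) for x in X(t, s))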

Gambling game as a stochastic dynamic program

The gambling game can be formulated as a stochastic dynamic program as follows: there are $n = 4$ games (i.e. stages) in the planning horizon;

- the state $s$ in period $t$ represents the initial wealth at the beginning of period $t$;
- the action given state $s$ in period $t$ is the bet amount $b$, with $0 \le b \le s$;
- the state transition is random: the next-period state is $s + b$ with probability 0.4 (a win) and $s - b$ with probability 0.6 (a loss).

Let $f_t(s)$ be the probability that, by the end of game 4, the gambler has at least $6, given that she has $\$s$ at the beginning of game $t$.

To derive the functional equation, define $b_t(s)$ as a bet that attains $f_t(s)$; then at the beginning of game $t = 4$

- if $s_4 < 3$, it is impossible to attain the goal: $f_4(s_4) = 0$;
- if $s_4 \ge 6$, the goal is attained: $f_4(s_4) = 1$;
- if $3 \le s_4 \le 5$, the gambler should bet $6 - s_4$, so that $f_4(s_4) = 0.4$.

For $t < 4$ the functional equation is

$$f_t(s_t) = \max_{b_t} \{ 0.4 f_{t+1}(s_t + b_t) + 0.6 f_{t+1}(s_t - b_t) \},$$

where $b_t$ ranges in $0, \ldots, s_t$; the aim is to find $f_1(2)$.
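As a sketch, this functional equation can be transcribed directly into Python; the complete implementation given later in this article adds memoization and also recovers the optimal bets.

def f(t, s):
    """Probability of holding at least $6 after game 4, with wealth $s at game t."""
    if t == 4:  # boundary conditions derived above
        if s >= 6:
            return 1.0                  # goal already attained
        return 0.4 if s >= 3 else 0.0   # bet $(6 - s) if feasible, else hopeless
    # functional equation: pick the bet maximising the success probability
    return max(0.4 * f(t + 1, s + b) + 0.6 * f(t + 1, s - b)
               for b in range(0, s + 1))

print(f(1, 2))  # prints approximately 0.1984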

Given the functional equation, an optimal betting policy can be obtained via forward recursion or backward recursion algorithms, as outlined below.

Solution methods

Stochastic dynamic programs can be solved to optimality by using backward recursion or forward recursion algorithms. Memoization is typically employed to enhance performance. However, like its deterministic counterpart, stochastic dynamic programming suffers from the curse of dimensionality. For this reason, approximate solution methods are typically employed in practical applications.
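For instance, the direct recursion sketched earlier can be memoized with Python's standard-library functools.lru_cache; this illustrates the idea, as opposed to the hand-rolled memoize class that appears in the full implementation below.

from functools import lru_cache

@lru_cache(maxsize=None)  # memoization: each state (t, s) is evaluated once
def f(t, s):
    if t == 4:  # boundary conditions
        return 1.0 if s >= 6 else (0.4 if s >= 3 else 0.0)
    return max(0.4 * f(t + 1, s + b) + 0.6 * f(t + 1, s - b)
               for b in range(0, s + 1))

print(f(1, 2))  # still approximately 0.1984, with far fewer recursive calls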

Backward recursion

Given a bounded state space, backward recursion (Bertsekas 2000) begins by tabulating $f_n(s_n)$ for every possible state $s_n$ belonging to the final stage $n$. Once these values are tabulated, together with the associated optimal state-dependent actions $x_n(s_n)$, it is possible to move to stage $n - 1$ and tabulate $f_{n-1}(s_{n-1})$ for all possible states belonging to stage $n - 1$. The process continues by considering, in a backward fashion, all remaining stages up to the first one. Once this tabulation process is complete, $f_1(s_1)$, the value of an optimal policy given initial state $s_1$, as well as the associated optimal action $x_1(s_1)$, can be easily retrieved from the table. Since the computation proceeds in a backward fashion, it is clear that backward recursion may lead to the computation of a large number of states that are not necessary for the computation of $f_1(s_1)$.

Example: Gambling game
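A minimal backward recursion sketch for this instance follows; the state-space bound and the array names f and best are choices of this illustration. Wealth is clamped at the $6 target: once the gambler holds at least $6 she can bet $0 in the remaining games, so $f_t(s) = 1$ for any $s \ge 6$ and clamping transitions at 6 is harmless.

TARGET, GAMES, P_WIN = 6, 4, 0.4

# f[t][s]: probability of reaching the target from wealth s at stage t;
# row GAMES + 1 stores the boundary values after the last game.
f = [[0.0] * (TARGET + 1) for _ in range(GAMES + 2)]
best = [[0] * (TARGET + 1) for _ in range(GAMES + 1)]  # an optimal bet b_t(s)

f[GAMES + 1] = [1.0 if s >= TARGET else 0.0 for s in range(TARGET + 1)]

for t in range(GAMES, 0, -1):          # tabulate stages n, n-1, ..., 1
    for s in range(TARGET + 1):        # every wealth level in the clamped space
        values = [P_WIN * f[t + 1][min(s + b, TARGET)]
                  + (1 - P_WIN) * f[t + 1][s - b]
                  for b in range(s + 1)]
        f[t][s] = max(values)
        best[t][s] = values.index(f[t][s])

print(f[1][2], best[1][2])  # approximately 0.1984; an optimal first bet ($1 or $2)

Note that, in contrast to forward recursion, this tabulation covers every wealth level at every stage, including states such as $f_2(5)$ that can never be reached from the initial state $s_1 = \$2$.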

Forward recursion

Given the initial state $s_1$ of the system at the beginning of period 1, forward recursion (Bertsekas 2000) computes $f_1(s_1)$ by progressively expanding the functional equation (forward pass). This involves recursive calls for all $f_{t+1}(s_{t+1})$ that are necessary for computing a given $f_t(s_t)$. The value of an optimal policy and its structure are then retrieved via a backward pass in which these suspended recursive calls are resolved. A key difference from backward recursion is the fact that $f_t$ is computed only for states that are relevant for the computation of $f_1(s_1)$. Memoization is employed to avoid recomputation of states that have already been considered.

Example: Gambling game

We shall illustrate forward recursion in the context of the gambling game instance previously discussed. We begin the forward pass by considering

$$f_1(2) = \max_{b_1 \in \{0, 1, 2\}} \{ 0.4 f_2(2 + b_1) + 0.6 f_2(2 - b_1) \}.$$

At this point we have not yet computed $f_2(0), f_2(1), f_2(2), f_2(3), f_2(4)$, which are needed to compute $f_1(2)$; we proceed and compute these items. Note that $f_2(2 + 0) = f_2(2 - 0) = f_2(2)$, therefore one can leverage memoization and perform the necessary computations only once.

Computation of $f_2(0), \ldots, f_2(4)$

We have now computed $f_2(s)$ for all values of $s$ that are needed to compute $f_1(2)$. However, this has led to additional suspended recursions involving $f_3(0), \ldots, f_3(8)$. We proceed and compute these values.

Computation of $f_3(0), \ldots, f_3(8)$

Since stage 4 is the last stage in our system, the values $f_4(\cdot)$ represent boundary conditions that are easily computed as follows.

Boundary conditions

$$f_4(s) = \begin{cases} 1 & s \ge 6 \\ 0.4 & 3 \le s \le 5 \ (\text{bet } 6 - s) \\ 0 & s < 3 \end{cases}$$

At this point it is possible to proceed and recover the optimal policy and its value via a backward pass involving, at first, stage 3

Backward pass involving $f_3(\cdot)$

and, then, stage 2.

Backward pass involving $f_2(\cdot)$

We finally recover the value of an optimal policy: $f_1(2) = 0.1984$.

This is the optimal policy that has been previously illustrated. Note that there are multiple optimal policies leading to the same optimal value $f_1(2) = 0.1984$; for instance, in the first game one may either bet $1 or $2.

Python implementation. The following is a complete Python implementation of this example.

from typing import List, Tuple
import functools


class memoize:
    def __init__(self, func):
        self.func = func
        self.memoized = {}
        self.method_cache = {}

    def __call__(self, *args):
        return self.cache_get(self.memoized, args, lambda: self.func(*args))

    def __get__(self, obj, objtype):
        return self.cache_get(
            self.method_cache,
            obj,
            lambda: self.__class__(functools.partial(self.func, obj)),
        )

    def cache_get(self, cache, key, func):
        try:
            return cache[key]
        except KeyError:
            cache[key] = func()
            return cache[key]

    def reset(self):
        self.memoized = {}
        self.method_cache = {}


class State:
    """the state of the gambler's ruin problem"""

    def __init__(self, t: int, wealth: float):
        """state constructor

        Arguments:
            t {int} -- time period
            wealth {float} -- initial wealth
        """
        self.t, self.wealth = t, wealth

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

    def __str__(self):
        return str(self.t) + " " + str(self.wealth)

    def __hash__(self):
        return hash(str(self))


class GamblersRuin:
    def __init__(
        self,
        bettingHorizon: int,
        targetWealth: float,
        pmf: List[List[Tuple[int, float]]],
    ):
        """the gambler's ruin problem

        Arguments:
            bettingHorizon {int} -- betting horizon
            targetWealth {float} -- target wealth
            pmf {List[List[Tuple[int, float]]]} -- probability mass function
        """

        # initialize instance variables
        self.bettingHorizon, self.targetWealth, self.pmf = (
            bettingHorizon,
            targetWealth,
            pmf,
        )

        # lambdas
        self.ag = lambda s: [
            i for i in range(0, min(self.targetWealth // 2, s.wealth) + 1)
        ]  # action generator
        self.st = lambda s, a, r: State(
            s.t + 1, s.wealth - a + a * r
        )  # state transition
        self.iv = (
            lambda s, a, r: 1 if s.wealth - a + a * r >= self.targetWealth else 0
        )  # immediate value function

        self.cache_actions = {}  # cache with optimal state/action pairs

    def f(self, wealth: float) -> float:
        s = State(0, wealth)
        return self._f(s)

    def q(self, t: int, wealth: float) -> float:
        s = State(t, wealth)
        return self.cache_actions[str(s)]

    @memoize
    def _f(self, s: State) -> float:
        # Forward recursion
        values = [
            sum(
                [
                    p[1]
                    * (
                        self._f(self.st(s, a, p[0]))
                        if s.t < self.bettingHorizon - 1
                        else self.iv(s, a, p[0])
                    )  # value function
                    for p in self.pmf[s.t]
                ]
            )  # bet realisations
            for a in self.ag(s)
        ]  # actions

        v = max(values)
        try:
            self.cache_actions[str(s)] = self.ag(s)[values.index(v)]  # store best action
        except ValueError:
            self.cache_actions[str(s)] = None
            print("Error in retrieving best action")
        return v  # return expected total cost


instance = {
    "bettingHorizon": 4,
    "targetWealth": 6,
    "pmf": [[(0, 0.6), (2, 0.4)] for i in range(0, 4)],
}
gr, initial_wealth = GamblersRuin(**instance), 2

# f_1(x) is gambler's probability of attaining $targetWealth at the end of bettingHorizon
print("f_1(" + str(initial_wealth) + "): " + str(gr.f(initial_wealth)))

# Recover optimal action for period 2 when initial wealth at the beginning of period 2 is $1.
t, initial_wealth = 1, 1
print("b_" + str(t + 1) + "(" + str(initial_wealth) + "): " + str(gr.q(t, initial_wealth)))

Java implementation. GamblersRuin.java is a standalone Java 8 implementation of the above example.

Approximate dynamic programming

An introduction to approximate dynamic programming is provided by (Powell 2009).



References

  1. This problem is adapted from W. L. Winston, Operations Research: Applications and Algorithms (7th Edition), Duxbury Press, 2003, chap. 19, example 3.

Bellman, R. (1957), Dynamic Programming, Princeton University Press.
Bertsekas, D. P. (2000), Dynamic Programming and Optimal Control (2nd ed.), Athena Scientific.
Powell, W. B. (2009), "What you should know about approximate dynamic programming", Naval Research Logistics, 56(1), 239–249.