Stochastic programming

In the field of mathematical optimization, stochastic programming is a framework for modeling optimization problems that involve uncertainty. A stochastic program is an optimization problem in which some or all problem parameters are uncertain, but follow known probability distributions. [1] [2] This framework contrasts with deterministic optimization, in which all problem parameters are assumed to be known exactly. The goal of stochastic programming is to find a decision which both optimizes some criterion chosen by the decision maker and appropriately accounts for the uncertainty of the problem parameters. Because many real-world decisions involve uncertainty, stochastic programming has found applications in areas ranging from finance to transportation to energy optimization. [3] [4]

Two-stage problems

The basic idea of two-stage stochastic programming is that (optimal) decisions should be based on data available at the time the decisions are made and cannot depend on future observations. The two-stage formulation is widely used in stochastic programming. The general formulation of a two-stage stochastic programming problem is given by:

$$\min_{x\in X}\; \{\, g(x) = f(x) + E_{\xi}[Q(x,\xi)] \,\}$$

where $Q(x,\xi)$ is the optimal value of the second-stage problem

$$\min_{y}\; \{\, q(y,\xi) \;:\; T(\xi)x + W(\xi)y = h(\xi) \,\}.$$

The classical two-stage linear stochastic programming problem can be formulated as

$$\min_{x\in\mathbb{R}^n}\; g(x) = c^{\mathsf T} x + E_{\xi}[Q(x,\xi)] \quad \text{subject to}\quad Ax = b,\; x \ge 0,$$

where $Q(x,\xi)$ is the optimal value of the second-stage problem

$$\min_{y\in\mathbb{R}^m}\; q(\xi)^{\mathsf T} y \quad \text{subject to}\quad T(\xi)x + W(\xi)y = h(\xi),\; y \ge 0.$$

In this formulation $x\in\mathbb{R}^n$ is the first-stage decision variable vector, $y\in\mathbb{R}^m$ is the second-stage decision variable vector, and $\xi = (q, T, W, h)$ contains the data of the second-stage problem. At the first stage we have to make a "here-and-now" decision $x$ before the realization of the uncertain data $\xi$, viewed as a random vector, is known. At the second stage, after a realization of $\xi$ becomes available, we optimize our behavior by solving an appropriate optimization problem.

At the first stage we optimize (minimize in the above formulation) the cost $c^{\mathsf T}x$ of the first-stage decision plus the expected cost of the (optimal) second-stage decision. We can view the second-stage problem simply as an optimization problem which describes our supposedly optimal behavior when the uncertain data $\xi$ is revealed, or we can consider its solution as a recourse action, where the term $W(\xi)y$ compensates for a possible inconsistency of the system $T(\xi)x \le h(\xi)$ and $q(\xi)^{\mathsf T} y$ is the cost of this recourse action.

The considered two-stage problem is linear because the objective functions and the constraints are linear. Conceptually this is not essential, and one can consider more general two-stage stochastic programs. For example, one could add integrality constraints to the first-stage problem so that its feasible set is discrete, and non-linear objectives and constraints could also be incorporated if needed. [5]

Distributional assumption

The formulation of the above two-stage problem assumes that the second-stage data $\xi$ is modeled as a random vector with a known probability distribution. This assumption is justified in many situations. For example, the distribution of $\xi$ could be inferred from historical data if one assumes that the distribution does not significantly change over the considered period of time. Also, the empirical distribution of the sample could be used as an approximation to the distribution of the future values of $\xi$. If one has a prior model for $\xi$, one could obtain the posterior distribution by a Bayesian update.

Discretization

To solve the two-stage stochastic problem numerically, one often needs to assume that the random vector $\xi$ has a finite number of possible realizations, called scenarios, say $\xi_1,\dots,\xi_K$, with respective probability masses $p_1,\dots,p_K$. Then the expectation in the first-stage problem's objective function can be written as the summation:

$$E[Q(x,\xi)] = \sum_{k=1}^{K} p_k Q(x,\xi_k)$$

and, moreover, the two-stage problem can be formulated as one large linear programming problem (this is called the deterministic equivalent of the original problem, see section § Deterministic equivalent of a stochastic problem).

When $\xi$ has an infinite (or very large) number of possible realizations, the standard approach is to represent this distribution by a manageable number of scenarios. This approach raises three questions, namely:

  1. How to construct scenarios, see § Scenario construction;
  2. How to solve the deterministic equivalent. Optimizers such as CPLEX and GLPK can solve large linear/nonlinear problems. The NEOS Server, [6] hosted at the University of Wisconsin–Madison, allows free access to many modern solvers. The structure of a deterministic equivalent is particularly amenable to decomposition methods, [7] such as Benders' decomposition or scenario decomposition;
  3. How to measure quality of the obtained solution with respect to the "true" optimum.

These questions are not independent. For example, the number of scenarios constructed will affect both the tractability of the deterministic equivalent and the quality of the obtained solutions.

Stochastic linear programming

A stochastic linear program is a specific instance of the classical two-stage stochastic program. A stochastic LP is built from a collection of multi-period linear programs (LPs), each having the same structure but somewhat different data. The two-period LP representing the $k$-th scenario may be regarded as having the following form:

$$\begin{array}{ll} \min & f^{\mathsf T} x + g^{\mathsf T} y + h_k^{\mathsf T} z_k \\ \text{subject to} & Tx + Uy = r \\ & V_k y + W_k z_k = s_k \\ & x \ge 0,\; y \ge 0,\; z_k \ge 0 \end{array}$$

The vectors $x$ and $y$ contain the first-period variables, whose values must be chosen immediately. The vector $z_k$ contains all of the variables for subsequent periods. The constraints $Tx + Uy = r$ involve only first-period variables and are the same in every scenario. The other constraints involve variables of later periods and differ in some respects from scenario to scenario, reflecting uncertainty about the future.

Note that solving the two-period LP is equivalent to assuming that the $k$-th scenario occurs in the second period with no uncertainty. In order to incorporate uncertainties in the second stage, one should assign probabilities to the different scenarios and solve the corresponding deterministic equivalent.

Deterministic equivalent of a stochastic problem

With a finite number of scenarios, two-stage stochastic linear programs can be modelled as large linear programming problems. This formulation is often called the deterministic equivalent linear program, or abbreviated to deterministic equivalent. (Strictly speaking a deterministic equivalent is any mathematical program that can be used to compute the optimal first-stage decision, so these will exist for continuous probability distributions as well, when one can represent the second-stage cost in some closed form.) For example, to form the deterministic equivalent to the above stochastic linear program, we assign a probability $p_k$ to each scenario $k = 1,\dots,K$. Then we can minimize the expected value of the objective, subject to the constraints from all scenarios:

$$\begin{array}{ll} \min & f^{\mathsf T} x + g^{\mathsf T} y + p_1 h_1^{\mathsf T} z_1 + \dots + p_K h_K^{\mathsf T} z_K \\ \text{subject to} & Tx + Uy = r \\ & V_k y + W_k z_k = s_k, \quad k = 1,\dots,K \\ & x \ge 0,\; y \ge 0,\; z_k \ge 0, \quad k = 1,\dots,K \end{array}$$

We have a different vector $z_k$ of later-period variables for each scenario $k$. The first-period variables $x$ and $y$ are the same in every scenario, however, because we must make a decision for the first period before we know which scenario will be realized. As a result, the constraints involving just $x$ and $y$ need only be specified once, while the remaining constraints must be given separately for each scenario.
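
To make the construction concrete, the following is a minimal sketch in Python that assembles and solves this deterministic equivalent with SciPy's `linprog` for two scenarios and one variable per block; all costs, matrices, and probabilities are invented for the illustration.

```python
import numpy as np
from scipy.optimize import linprog

# A minimal deterministic-equivalent sketch (all data hypothetical):
# min  f'x + g'y + sum_k p_k * h_k' z_k
# s.t. T x + U y = r            (first period, shared across scenarios)
#      V_k y + W_k z_k = s_k    (one block per scenario k)
#      x, y, z_k >= 0

f, g = 1.0, 2.0            # first-period costs
p = [0.6, 0.4]             # scenario probabilities
h = [3.0, 1.0]             # second-period costs per scenario
T, U, r = 1.0, 1.0, 10.0   # first-period constraint: x + y = 10
V = [1.0, 1.0]
W = [1.0, 2.0]
s = [8.0, 12.0]            # scenario constraints: V_k*y + W_k*z_k = s_k

K = len(p)
# Variable order: [x, y, z_1, ..., z_K]
c = np.array([f, g] + [p[k] * h[k] for k in range(K)])

A_eq = np.zeros((1 + K, 2 + K))
b_eq = np.zeros(1 + K)
A_eq[0, 0], A_eq[0, 1], b_eq[0] = T, U, r
for k in range(K):
    A_eq[1 + k, 1] = V[k]        # y appears in every scenario block
    A_eq[1 + k, 2 + k] = W[k]    # z_k appears only in its own block
    b_eq[1 + k] = s[k]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (2 + K))
print(res.x, res.fun)
```

Note how the scenario blocks share the columns of $x$ and $y$: this block-angular structure is exactly what decomposition methods exploit.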

Scenario construction

In practice it might be possible to construct scenarios by eliciting experts' opinions on the future. The number of constructed scenarios should be relatively modest so that the obtained deterministic equivalent can be solved with reasonable computational effort. It is often claimed that a solution that is optimal for only a few scenarios provides more adaptable plans than one that assumes a single scenario only. In some cases such a claim can be verified by a simulation. In theory, there are measures which guarantee that an obtained solution solves the original problem with reasonable accuracy. Typically in applications only the first-stage optimal solution has a practical value, since almost always a "true" realization of the random data will be different from the set of constructed (generated) scenarios.

Suppose $\xi$ contains $d$ independent random components, each of which has three possible realizations (for example, future realizations of each random parameter are classified as low, medium and high); then the total number of scenarios is $K = 3^d$. For instance, $d = 20$ already yields $3^{20} \approx 3.5\times 10^{9}$ scenarios. Such exponential growth of the number of scenarios makes model development using expert opinion very difficult even for reasonable size $d$. The situation becomes even worse if some random components of $\xi$ have continuous distributions.

Monte Carlo sampling and sample average approximation (SAA) method

A common approach to reduce the scenario set to a manageable size is by using Monte Carlo simulation. Suppose the total number of scenarios is very large or even infinite. Suppose further that we can generate a sample $\xi^1, \xi^2, \dots, \xi^N$ of $N$ replications of the random vector $\xi$. Usually the sample is assumed to be independent and identically distributed (i.i.d. sample). Given a sample, the expectation function $q(x) = E[Q(x,\xi)]$ is approximated by the sample average

$$\hat{q}_N(x) = \frac{1}{N} \sum_{j=1}^{N} Q(x,\xi^j)$$

and consequently the first-stage problem is given by

$$\min_{x\in\mathbb{R}^n}\; c^{\mathsf T} x + \frac{1}{N} \sum_{j=1}^{N} Q(x,\xi^j) \quad \text{subject to}\quad Ax = b,\; x \ge 0.$$

This formulation is known as the Sample Average Approximation method. The SAA problem is a function of the considered sample and in that sense is random. For a given sample the SAA problem is of the same form as a two-stage stochastic linear programming problem with the scenarios ., , each taken with the same probability .
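
A sketch of the SAA method on a toy problem follows, assuming a newsvendor-style recourse function $Q$ with a closed form; the cost parameters, demand distribution, and function names (`Q`, `g_hat`) are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical newsvendor-style instance: order x now at unit cost c;
# after demand xi is revealed, pay b per unit short and h per unit over.
c, b, h = 1.0, 4.0, 0.5

def Q(x, xi):
    """Second-stage (recourse) cost for decision x and realization xi."""
    return b * np.maximum(xi - x, 0.0) + h * np.maximum(x - xi, 0.0)

N = 10_000
sample = rng.lognormal(mean=3.0, sigma=0.4, size=N)  # i.i.d. scenarios xi^1..xi^N

def g_hat(x):
    """SAA objective: first-stage cost + sample average of recourse costs,
    i.e. each sampled scenario implicitly carries probability 1/N."""
    return c * x + Q(x, sample).mean()

# Minimize the SAA objective over a grid of candidate first-stage decisions.
grid = np.linspace(0.0, sample.max(), 2000)
values = np.array([g_hat(x) for x in grid])
x_saa = grid[values.argmin()]
print(f"SAA solution x ≈ {x_saa:.2f}, estimated cost ≈ {values.min():.2f}")
```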

Statistical inference

Consider the following stochastic programming problem

$$\min_{x\in X}\; \{\, g(x) = f(x) + E[Q(x,\xi)] \,\}$$

Here $X$ is a nonempty closed subset of $\mathbb{R}^n$, $\xi$ is a random vector whose probability distribution $P$ is supported on a set $\Xi \subset \mathbb{R}^d$, and $Q \colon X \times \Xi \to \mathbb{R}$. In the framework of two-stage stochastic programming, $Q(x,\xi)$ is given by the optimal value of the corresponding second-stage problem.

Assume that $g(x)$ is well defined and finite valued for all $x \in X$. This implies that for every $x \in X$ the value $Q(x,\xi)$ is finite almost surely.

Suppose that we have a sample $\xi^1,\dots,\xi^N$ of $N$ realizations of the random vector $\xi$. This random sample can be viewed as historical data of $N$ observations of $\xi$, or it can be generated by Monte Carlo sampling techniques. Then we can formulate a corresponding sample average approximation

$$\min_{x\in X}\; \Big\{\, \hat{g}_N(x) = f(x) + \frac{1}{N} \sum_{j=1}^{N} Q(x,\xi^j) \,\Big\}.$$

By the law of large numbers we have that, under some regularity conditions, $\frac{1}{N} \sum_{j=1}^{N} Q(x,\xi^j)$ converges pointwise with probability 1 to $E[Q(x,\xi)]$ as $N \to \infty$. Moreover, under mild additional conditions the convergence is uniform. We also have $E[\hat{g}_N(x)] = g(x)$, i.e., $\hat{g}_N(x)$ is an unbiased estimator of $g(x)$. Therefore, it is natural to expect that the optimal value and optimal solutions of the SAA problem converge to their counterparts of the true problem as $N \to \infty$.

Consistency of SAA estimators

Suppose the feasible set $X$ of the SAA problem is fixed, i.e., it is independent of the sample. Let $\vartheta^*$ and $S^*$ be the optimal value and the set of optimal solutions, respectively, of the true problem, and let $\hat{\vartheta}_N$ and $\hat{S}_N$ be the optimal value and the set of optimal solutions, respectively, of the SAA problem.

  1. Let $g \colon X \to \mathbb{R}$ be a function and $\hat{g}_N \colon X \to \mathbb{R}$ a sequence of (deterministic) real valued functions. The following two properties are equivalent:
    • for any $\bar{x} \in X$ and any sequence $\{x_N\} \subset X$ converging to $\bar{x}$ it follows that $\hat{g}_N(x_N)$ converges to $g(\bar{x})$
    • the function $g(\cdot)$ is continuous on $X$ and $\hat{g}_N(\cdot)$ converges to $g(\cdot)$ uniformly on any compact subset of $X$
  2. If the objective of the SAA problem $\hat{g}_N(x)$ converges to the true problem's objective $g(x)$ with probability 1, as $N \to \infty$, uniformly on the feasible set $X$, then $\hat{\vartheta}_N$ converges to $\vartheta^*$ with probability 1 as $N \to \infty$.
  3. Suppose that there exists a compact set $C \subset \mathbb{R}^n$ such that
    • the set $S^*$ of optimal solutions of the true problem is nonempty and is contained in $C$
    • the function $g(x)$ is finite valued and continuous on $C$
    • the sequence of functions $\hat{g}_N(x)$ converges to $g(x)$ with probability 1, as $N \to \infty$, uniformly in $x \in C$
    • for $N$ large enough the set $\hat{S}_N$ is nonempty and $\hat{S}_N \subset C$ with probability 1
then $\hat{\vartheta}_N \to \vartheta^*$ and $\mathbb{D}(\hat{S}_N, S^*) \to 0$ with probability 1 as $N \to \infty$. Note that $\mathbb{D}(A,B)$ denotes the deviation of set $A$ from set $B$, defined as

$$\mathbb{D}(A,B) := \sup_{x \in A} \Big\{ \inf_{x' \in B} \| x - x' \| \Big\}$$
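
For finite sets of candidate points, the deviation $\mathbb{D}(A,B)$ can be computed directly from its definition; a small sketch:

```python
import numpy as np

def deviation(A, B):
    """Deviation D(A, B) = sup_{x in A} inf_{x' in B} ||x - x'||
    for finite sets of points, one point per row."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    # Pairwise distances: dists[i, j] = ||A[i] - B[j]||
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return dists.min(axis=1).max()

# D({0}, {0, 1}) = 0, while D({0, 1}, {0}) = 1: the deviation is not symmetric.
print(deviation([[0.0]], [[0.0], [1.0]]), deviation([[0.0], [1.0]], [[0.0]]))
```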

In some situations the feasible set of the SAA problem is estimated; then the corresponding SAA problem takes the form

$$\min_{x \in X_N}\; \hat{g}_N(x)$$

where $X_N$ is a subset of $\mathbb{R}^n$ depending on the sample and is therefore random. Nevertheless, consistency results for SAA estimators can still be derived under some additional assumptions:

  1. Suppose that there exists a compact set $C \subset \mathbb{R}^n$ such that
    • the set $S^*$ of optimal solutions of the true problem is nonempty and is contained in $C$
    • the function $g(x)$ is finite valued and continuous on $C$
    • the sequence of functions $\hat{g}_N(x)$ converges to $g(x)$ with probability 1, as $N \to \infty$, uniformly in $x \in C$
    • for $N$ large enough the set $\hat{S}_N$ is nonempty and $\hat{S}_N \subset C$ with probability 1
    • if $x_N \in X_N$ and $x_N$ converges with probability 1 to a point $x$, then $x \in X$
    • for some point $x \in S^*$ there exists a sequence $x_N \in X_N$ such that $x_N \to x$ with probability 1.
then $\hat{\vartheta}_N \to \vartheta^*$ and $\mathbb{D}(\hat{S}_N, S^*) \to 0$ with probability 1 as $N \to \infty$.

Asymptotics of the SAA optimal value

Suppose the sample is i.i.d. and fix a point $x \in X$. Then the sample average estimator $\hat{g}_N(x)$ of $g(x)$ is unbiased and has variance $\frac{1}{N}\sigma^2(x)$, where $\sigma^2(x) := \operatorname{Var}[Q(x,\xi)]$ is supposed to be finite. Moreover, by the central limit theorem we have that

$$\sqrt{N}\, \big[ \hat{g}_N(x) - g(x) \big] \xrightarrow{\;\mathcal{D}\;} Y_x,$$

where $\xrightarrow{\;\mathcal{D}\;}$ denotes convergence in distribution and $Y_x$ has a normal distribution with mean $0$ and variance $\sigma^2(x)$, written as $\mathcal{N}(0, \sigma^2(x))$.

In other words, $\hat{g}_N(x)$ has an asymptotically normal distribution, i.e., for large $N$, $\hat{g}_N(x)$ has approximately normal distribution with mean $g(x)$ and variance $\frac{1}{N}\sigma^2(x)$. This leads to the following (approximate) $100(1-\alpha)\%$ confidence interval for $g(x)$:

$$\left[ \hat{g}_N(x) - z_{\alpha/2}\, \frac{\hat{\sigma}(x)}{\sqrt{N}},\;\; \hat{g}_N(x) + z_{\alpha/2}\, \frac{\hat{\sigma}(x)}{\sqrt{N}} \right],$$

where $z_{\alpha/2} := \Phi^{-1}(1-\alpha/2)$ (here $\Phi(\cdot)$ denotes the cdf of the standard normal distribution) and

$$\hat{\sigma}^2(x) := \frac{1}{N-1} \sum_{j=1}^{N} \Big[ Q(x,\xi^j) - \frac{1}{N} \sum_{i=1}^{N} Q(x,\xi^i) \Big]^2$$

is the sample variance estimate of $\sigma^2(x)$. That is, the error of estimation of $g(x)$ is (stochastically) of order $O(N^{-1/2})$.
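
The interval above is straightforward to compute from a sample; the following sketch estimates $\hat{g}_N(x)$ and its confidence interval at a fixed $x$, reusing the hypothetical newsvendor-style costs from the SAA sketch above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Sketch of the (approximate) confidence interval for g(x) at a fixed x
# (all cost parameters and the demand distribution are hypothetical).
c, b, h, x = 1.0, 4.0, 0.5, 20.0
N, alpha = 10_000, 0.05

xi = rng.lognormal(mean=3.0, sigma=0.4, size=N)
costs = c * x + b * np.maximum(xi - x, 0.0) + h * np.maximum(x - xi, 0.0)

g_N = costs.mean()                       # sample average estimator of g(x)
sigma_hat = costs.std(ddof=1)            # sample estimate of sigma(x), N-1 denominator
z = norm.ppf(1 - alpha / 2)              # z_{alpha/2} from the standard normal cdf
half_width = z * sigma_hat / np.sqrt(N)  # O(N^{-1/2}) error of estimation

print(f"g_N(x) = {g_N:.3f} ± {half_width:.3f}  (95% CI)")
```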

Applications and examples

Biological applications

Stochastic dynamic programming is frequently used to model animal behaviour in such fields as behavioural ecology. [8] [9] Empirical tests of models of optimal foraging, life-history transitions such as fledging in birds and egg laying in parasitoid wasps have shown the value of this modelling technique in explaining the evolution of behavioural decision making. These models are typically many-staged, rather than two-staged.

Economic applications

Stochastic dynamic programming is a useful tool in understanding decision making under uncertainty. The accumulation of capital stock under uncertainty is one example; it is often used by resource economists to analyze bioeconomic problems [10] in which the uncertainty enters through factors such as weather.

Example: multistage portfolio optimization

The following is an example from finance of multi-stage stochastic programming. Suppose that at time $t = 0$ we have initial capital $W_0$ to invest in $n$ assets. Suppose further that we are allowed to rebalance our portfolio at times $t = 1, \dots, T-1$ but without injecting additional cash into it. At each period $t$ we make a decision about redistributing the current wealth $W_t$ among the $n$ assets. Let $x_0 = (x_{10}, \dots, x_{n0})$ be the initial amounts invested in the $n$ assets. We require that each $x_{i0}$ is nonnegative and that the balance equation $\sum_{i=1}^{n} x_{i0} = W_0$ should hold.

Consider the total returns $\xi_t = (\xi_{1t}, \dots, \xi_{nt})$ for each period $t = 1, \dots, T$. This forms a vector-valued random process $\xi_1, \dots, \xi_T$. At time period $t = 1$, we can rebalance the portfolio by specifying the amounts $x_1 = (x_{11}, \dots, x_{n1})$ invested in the respective assets. At that time the returns in the first period have been realized, so it is reasonable to use this information in the rebalancing decision. Thus, the second-stage decisions, at time $t = 1$, are actually functions of the realization of the random vector $\xi_1$, i.e., $x_1 = x_1(\xi_1)$. Similarly, at time $t$ the decision $x_t = x_t(\xi_{[t]})$ is a function of the available information given by $\xi_{[t]} = (\xi_1, \dots, \xi_t)$, the history of the random process up to time $t$. A sequence of functions $x_t = x_t(\xi_{[t]})$, $t = 0, \dots, T-1$, with $x_0$ being constant, defines an implementable policy of the decision process. It is said that such a policy is feasible if it satisfies the model constraints with probability 1, i.e., the nonnegativity constraints $x_{it}(\xi_{[t]}) \ge 0$, $i = 1, \dots, n$, $t = 0, \dots, T-1$, and the balance of wealth constraints,

$$\sum_{i=1}^{n} x_{it}(\xi_{[t]}) = W_t,$$

where in period $t$ the wealth $W_t$ is given by

$$W_t = \sum_{i=1}^{n} \xi_{it}\, x_{i,t-1}(\xi_{[t-1]}),$$

which depends on the realization of the random process and the decisions up to time $t$.
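
The wealth recursion can be simulated directly; the sketch below evaluates a simple, hypothetical fixed-mix policy, which rebalances to constant proportions each period and therefore satisfies the nonnegativity and balance constraints by construction. The return distribution is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate the wealth recursion W_{t+1} = sum_i xi_{i,t+1} * x_{i,t} under a
# fixed-mix policy: after each period, wealth is redistributed in constant
# proportions (an example of an implementable policy).
n, T, W0 = 3, 5, 100.0
weights = np.array([0.5, 0.3, 0.2])   # fixed proportions, summing to 1

W = W0
for t in range(1, T + 1):
    x_prev = weights * W                     # x_{t-1}: balance constraint holds
    xi = rng.lognormal(0.01, 0.1, size=n)    # total returns over period t
    W = float(xi @ x_prev)                   # realized wealth W_t
    print(f"t={t}: W_t = {W:.2f}")
```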

Suppose the objective is to maximize the expected utility of this wealth at the last period, that is, to consider the problem

$$\max\; E[\, U(W_T) \,].$$

This is a multistage stochastic programming problem, where stages are numbered from $t = 0$ to $t = T-1$. Optimization is performed over all implementable and feasible policies. To complete the problem description one also needs to define the probability distribution of the random process $\xi_1, \dots, \xi_T$. This can be done in various ways. For example, one can construct a particular scenario tree defining the time evolution of the process. If at every stage the random return of each asset is allowed to have two continuations, independent of other assets, then the total number of scenarios is $2^{nT}$.

In order to write dynamic programming equations, consider the above multistage problem backward in time. At the last stage $t = T-1$, a realization $\xi_{[T-1]} = (\xi_1, \dots, \xi_{T-1})$ of the random process is known and $x_{T-2}$ has been chosen. Therefore, one needs to solve the following problem

$$\begin{array}{ll} \max_{x_{T-1}} & E[\, U(W_T) \mid \xi_{[T-1]} \,] \\ \text{subject to} & W_T = \sum_{i=1}^{n} \xi_{iT}\, x_{i,T-1} \\ & \sum_{i=1}^{n} x_{i,T-1} = W_{T-1},\quad x_{T-1} \ge 0, \end{array}$$

where $E[\, U(W_T) \mid \xi_{[T-1]} \,]$ denotes the conditional expectation of $U(W_T)$ given $\xi_{[T-1]}$. The optimal value of the above problem depends on $W_{T-1}$ and $\xi_{[T-1]}$ and is denoted $Q_{T-1}(W_{T-1}, \xi_{[T-1]})$.

Similarly, at stages $t = T-2, \dots, 1$, one should solve the problem

$$\begin{array}{ll} \max_{x_t} & E[\, Q_{t+1}(W_{t+1}, \xi_{[t+1]}) \mid \xi_{[t]} \,] \\ \text{subject to} & W_{t+1} = \sum_{i=1}^{n} \xi_{i,t+1}\, x_{it} \\ & \sum_{i=1}^{n} x_{it} = W_t,\quad x_t \ge 0, \end{array}$$

whose optimal value is denoted by $Q_t(W_t, \xi_{[t]})$. Finally, at stage $t = 0$, one solves the problem

$$\begin{array}{ll} \max_{x_0} & E[\, Q_1(W_1, \xi_1) \,] \\ \text{subject to} & W_1 = \sum_{i=1}^{n} \xi_{i1}\, x_{i0} \\ & \sum_{i=1}^{n} x_{i0} = W_0,\quad x_0 \ge 0. \end{array}$$

Stagewise independent random process

For a general distribution of the process $\xi_t$, it may be hard to solve these dynamic programming equations. The situation simplifies dramatically if the process $\xi_t$ is stagewise independent, i.e., $\xi_t$ is (stochastically) independent of $\xi_1, \dots, \xi_{t-1}$ for $t = 2, \dots, T$. In this case, the corresponding conditional expectations become unconditional expectations, and the function $Q_t(W_t)$, $t = 1, \dots, T-1$, does not depend on $\xi_{[t]}$. That is, $Q_{T-1}(W_{T-1})$ is the optimal value of the problem

$$\begin{array}{ll} \max_{x_{T-1}} & E[\, U(W_T) \,] \\ \text{subject to} & W_T = \sum_{i=1}^{n} \xi_{iT}\, x_{i,T-1} \\ & \sum_{i=1}^{n} x_{i,T-1} = W_{T-1},\quad x_{T-1} \ge 0, \end{array}$$

and $Q_t(W_t)$ is the optimal value of

$$\begin{array}{ll} \max_{x_t} & E[\, Q_{t+1}(W_{t+1}) \,] \\ \text{subject to} & W_{t+1} = \sum_{i=1}^{n} \xi_{i,t+1}\, x_{it} \\ & \sum_{i=1}^{n} x_{it} = W_t,\quad x_t \ge 0 \end{array}$$

for $t = T-2, \dots, 1$.
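
Under stagewise independence, the value functions $Q_t(\cdot)$ depend only on wealth and can be approximated numerically by backward induction on a wealth grid. The following is a rough sketch with two assets, log utility, and Monte Carlo samples of the returns; the distribution, grids, and sample sizes are all hypothetical choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Backward induction sketch for the stagewise independent case, n = 2 assets.
T = 3
returns = rng.lognormal(0.05, 0.2, size=(500, 2))   # Monte Carlo sample of xi
W_grid = np.linspace(10.0, 300.0, 60)               # wealth grid
alphas = np.linspace(0.0, 1.0, 21)                  # fraction put in asset 1

def U(W):
    return np.log(W)   # log utility

# Q_next holds Q_{t+1} evaluated on W_grid; start from the terminal utility U.
Q_next = U(W_grid)
for t in range(T - 1, -1, -1):
    Q_curr = np.empty_like(W_grid)
    for i, W in enumerate(W_grid):
        best = -np.inf
        for a in alphas:
            # Next-period wealth for each sampled return vector.
            W_next = W * (a * returns[:, 0] + (1 - a) * returns[:, 1])
            # Unconditional expectation: interpolate Q_{t+1} on the wealth grid
            # (np.interp clamps at the grid endpoints).
            val = np.interp(W_next, W_grid, Q_next).mean()
            best = max(best, val)
        Q_curr[i] = best
    Q_next = Q_curr   # Q_next now holds Q_t

print("Stage-0 value function on the wealth grid:", Q_next[:5], "...")
```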

Software tools

Modelling languages

All discrete stochastic programming problems can be represented with any algebraic modeling language, manually implementing explicit or implicit non-anticipativity to make sure the resulting model respects the structure of the information made available at each stage. An instance of an SP problem generated by a general modelling language tends to grow quite large (linearly in the number of scenarios), and its matrix loses the structure that is intrinsic to this class of problems, which could otherwise be exploited at solution time by specific decomposition algorithms. Extensions to modelling languages specifically designed for SP are starting to appear; see, for example, SAMPL (a set of extensions to AMPL) and EMP SP (the stochastic programming extensions of GAMS).

Both can generate the SMPS instance-level format, which conveys the structure of the problem to the solver in a non-redundant form.
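
To illustrate what explicit non-anticipativity looks like, the sketch below writes a tiny two-scenario problem in scenario-split form: each scenario gets its own first-stage copy, and equality constraints link the copies so that the first-stage decision cannot anticipate which scenario occurs. All data are invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Scenario-splitting sketch with explicit non-anticipativity (hypothetical data):
# min  sum_k p_k * (c * x_k + q_k * y_k)
# s.t. x_k + y_k >= d_k        (per-scenario recourse constraint)
#      x_k - x_{k+1} = 0       (non-anticipativity: all first-stage copies agree)
#      x_k, y_k >= 0
p = [0.6, 0.4]                 # scenario probabilities
c = 1.0                        # first-stage unit cost
q = [3.0, 1.0]                 # second-stage unit costs
d = [8.0, 12.0]                # scenario "demands"

K = len(p)
# Variable order: [x_1, ..., x_K, y_1, ..., y_K]
obj = np.array([p[k] * c for k in range(K)] + [p[k] * q[k] for k in range(K)])

# x_k + y_k >= d_k  ->  -x_k - y_k <= -d_k
A_ub = np.zeros((K, 2 * K)); b_ub = np.zeros(K)
for k in range(K):
    A_ub[k, k] = A_ub[k, K + k] = -1.0
    b_ub[k] = -d[k]

# Non-anticipativity: x_k - x_{k+1} = 0 for k = 1..K-1
A_eq = np.zeros((K - 1, 2 * K)); b_eq = np.zeros(K - 1)
for k in range(K - 1):
    A_eq[k, k], A_eq[k, k + 1] = 1.0, -1.0

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (2 * K))
print(res.x, res.fun)
```

Keeping the non-anticipativity constraints explicit, rather than substituting a single shared variable, preserves the per-scenario block structure that decomposition algorithms exploit.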

References

  1. Shapiro, Alexander; Dentcheva, Darinka; Ruszczyński, Andrzej (2009). Lectures on Stochastic Programming: Modeling and Theory. MPS/SIAM Series on Optimization. Vol. 9. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); Mathematical Programming Society (MPS). pp. xvi+436. ISBN 978-0-89871-687-0. MR 2562798.
  2. Birge, John R.; Louveaux, François (2011). Introduction to Stochastic Programming. Springer Series in Operations Research and Financial Engineering. doi:10.1007/978-1-4614-0237-4. ISBN 978-1-4614-0236-7. ISSN 1431-8598.
  3. Wallace, Stein W.; Ziemba, William T., eds. (2005). Applications of Stochastic Programming. MPS-SIAM Book Series on Optimization. Vol. 5.
  4. Applications of stochastic programming are described at the website of the Stochastic Programming Community.
  5. Shapiro, Alexander; Philpott, Andy. A Tutorial on Stochastic Programming (PDF).
  6. "NEOS Server for Optimization".
  7. Ruszczyński, Andrzej; Shapiro, Alexander (2003). Stochastic Programming. Handbooks in Operations Research and Management Science. Vol. 10. Philadelphia: Elsevier. p. 700. ISBN 978-0444508546.
  8. Mangel, M.; Clark, C. W. (1988). Dynamic Modeling in Behavioral Ecology. Princeton University Press. ISBN 0-691-08506-4.
  9. Houston, A. I.; McNamara, J. M. (1999). Models of Adaptive Behaviour: An Approach Based on State. Cambridge University Press. ISBN 0-521-65539-0.
  10. Howitt, R.; Msangi, S.; Reynaud, A.; Knapp, K. (2002). "Using Polynomial Approximations to Solve Stochastic Dynamic Programming Problems: or A 'Betty Crocker' Approach to SDP". University of California, Davis, Department of Agricultural and Resource Economics Working Paper.
