Exponential smoothing

Exponential smoothing or exponential moving average (EMA) is a rule of thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.


Exponential smoothing is one of many window functions commonly applied to smooth data in signal processing, acting as low-pass filters to remove high-frequency noise. This method is preceded by Poisson's use of recursive exponential window functions in convolutions from the 19th century, as well as Kolmogorov and Zurbenko's use of recursive moving averages from their studies of turbulence in the 1940s.

The raw data sequence is often represented by $\{x_t\}$ beginning at time $t = 0$, and the output of the exponential smoothing algorithm is commonly written as $\{s_t\}$, which may be regarded as a best estimate of what the next value of $x$ will be. When the sequence of observations begins at time $t = 0$, the simplest form of exponential smoothing is given by the formulas: [1]

$$s_0 = x_0$$
$$s_t = \alpha x_t + (1 - \alpha) s_{t-1}, \quad t > 0$$

where $\alpha$ is the smoothing factor, and $0 \le \alpha \le 1$.

Basic (simple) exponential smoothing

The use of the exponential window function is first attributed to Poisson [2] as an extension of a numerical analysis technique from the 17th century, and later adopted by the signal processing community in the 1940s. Here, exponential smoothing is the application of the exponential, or Poisson, window function. Exponential smoothing was first suggested in the statistical literature without citation to previous work by Robert Goodell Brown in 1956, [3] and then expanded by Charles C. Holt in 1957. [4] The formulation below, which is the one commonly used, is attributed to Brown and is known as "Brown’s simple exponential smoothing". [5] All the methods of Holt, Winters and Brown may be seen as a simple application of recursive filtering, first found in the 1940s [2] to convert finite impulse response (FIR) filters to infinite impulse response filters.

The simplest form of exponential smoothing is given by the formula:

$$s_t = \alpha x_t + (1 - \alpha) s_{t-1} = s_{t-1} + \alpha (x_t - s_{t-1})$$

where $\alpha$ is the smoothing factor, and $0 \le \alpha \le 1$. In other words, the smoothed statistic $s_t$ is a simple weighted average of the current observation $x_t$ and the previous smoothed statistic $s_{t-1}$. Simple exponential smoothing is easily applied, and it produces a smoothed statistic as soon as two observations are available. The term smoothing factor applied to $\alpha$ here is something of a misnomer, as larger values of $\alpha$ actually reduce the level of smoothing, and in the limiting case with $\alpha = 1$ the output series is just the current observation. Values of $\alpha$ close to 1 have less of a smoothing effect and give greater weight to recent changes in the data, while values of $\alpha$ closer to 0 have a greater smoothing effect and are less responsive to recent changes. In the limiting case with $\alpha = 0$, the output series is flat: a constant equal to the observation at the beginning of the smoothing process, $s_0 = x_0$.
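The recursion can be sketched in a few lines of Python. This is a minimal illustration, assuming the update $s_t = \alpha x_t + (1 - \alpha) s_{t-1}$ with the initialization $s_0 = x_0$; the function name `exponential_smoothing` and the sample data are hypothetical and chosen only for the example.

```python
def exponential_smoothing(xs, alpha):
    """Smooth xs with s_t = alpha * x_t + (1 - alpha) * s_{t-1}, s_0 = x_0."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not xs:
        return []
    smoothed = [xs[0]]  # s_0 = x_0 (assumed initialization)
    for x in xs[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


data = [3.0, 5.0, 9.0, 20.0]  # made-up observations
print(exponential_smoothing(data, 0.5))  # [3.0, 4.0, 6.5, 13.25]
print(exponential_smoothing(data, 1.0))  # alpha = 1 reproduces the input
print(exponential_smoothing(data, 0.0))  # alpha = 0 stays at x_0
```

The last two calls show the limiting cases discussed above: with a smoothing factor of 1 the output equals the observations, and with 0 it never leaves the initial value.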

There is no formally correct procedure for choosing $\alpha$. Sometimes the statistician's judgment is used to choose an appropriate factor. Alternatively, a statistical technique may be used to optimize the value of $\alpha$. For example, the method of least squares might be used to determine the value of $\alpha$ for which the sum of the quantities $(s_t - x_{t+1})^2$ is minimized. [6]

Unlike some other smoothing methods, such as the simple moving average, this technique does not require any minimum number of observations to be made before it begins to produce results. In practice, however, a "good average" will not be achieved until several samples have been averaged together; for example, a constant signal will take approximately $3/\alpha$ stages to reach 95% of the actual value. To accurately reconstruct the original signal without information loss, all stages of the exponential moving average must also be available, because older samples decay in weight exponentially. This is in contrast to a simple moving average, in which some samples can be skipped without as much loss of information due to the constant weighting of samples within the average. If a known number of samples will be missed, one can adjust a weighted average for this as well, by giving equal weight to the new sample and all those to be skipped.
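The figure of roughly $3/\alpha$ stages can be checked with a short derivation. This is a sketch under an assumption not stated in the text: the smoothed value starts at zero and the input then holds a constant value $c$ (with the initialization $s_0 = x_0$, a constant signal would be matched from the first sample). Unrolling the recursion for $n$ stages gives

$$s_n = c\left[1 - (1 - \alpha)^n\right]$$

$$1 - (1 - \alpha)^n = 0.95 \;\Longrightarrow\; n = \frac{\ln 0.05}{\ln(1 - \alpha)} \approx \frac{3}{\alpha} \quad (\alpha \ll 1)$$

since $\ln 0.05 \approx -3.0$ and $\ln(1 - \alpha) \approx -\alpha$ for small $\alpha$. For example, $\alpha = 0.1$ gives $n \approx 28.4$, close to $3/\alpha = 30$.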

This simple form of exponential smoothing is also known as an exponentially weighted moving average (EWMA). Technically it can also be classified as an autoregressive integrated moving average (ARIMA) (0,1,1) model with no constant term. [7]

Time constant

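The time constant of an exponential moving average is the amount of time for the smoothed response of a unit step function to reach $1 - 1/e \approx 63.2\%$ of the original signal. The relationship between this time constant, $\tau$, and the smoothing factor, $\alpha$, is given by the formula:

$$\alpha = 1 - e^{-\Delta T / \tau}, \quad \text{thus} \quad \tau = -\frac{\Delta T}{\ln(1 - \alpha)}$$

where $\Delta T$ is the sampling time interval of the discrete time implementation. If the sampling time is fast compared to the time constant ($\Delta T \ll \tau$) then, by using the Taylor expansion of the exponential function,

$$\alpha \approx \frac{\Delta T}{\tau}, \quad \text{thus} \quad \tau \approx \frac{\Delta T}{\alpha}$$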
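The conversion between time constant and smoothing factor can be sketched as follows. The sketch assumes the relationship $\alpha = 1 - e^{-\Delta T/\tau}$; the function names and the numeric values are hypothetical, picked only to show that the approximation $\alpha \approx \Delta T/\tau$ holds when the sampling interval is short.

```python
import math


def alpha_from_time_constant(tau, dt):
    """Smoothing factor for time constant tau and sampling interval dt."""
    if tau <= 0 or dt <= 0:
        raise ValueError("tau and dt must be positive")
    return 1.0 - math.exp(-dt / tau)


def time_constant_from_alpha(alpha, dt):
    """Inverse relation: tau = -dt / ln(1 - alpha), valid for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -dt / math.log(1.0 - alpha)


dt = 0.01   # sampling interval (illustrative value)
tau = 1.0   # time constant (illustrative value)
alpha = alpha_from_time_constant(tau, dt)
print(alpha)                                # close to dt / tau = 0.01
print(time_constant_from_alpha(alpha, dt))  # recovers tau up to rounding
```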

Choosing the initial smoothed value

Note that in the definition above, $s_0$ is being initialized to $x_0$. Because exponential smoothing requires that at each stage we have the previous forecast, it is not obvious how to get the method started. We could assume that the initial forecast is equal to the initial value of demand; however, this approach has a serious drawback. Exponential smoothing puts substantial weight on past observations, so the initial value of demand will have an unreasonably large effect on early forecasts. This problem can be overcome by allowing the process to evolve for a reasonable number of periods (10 or more) and using the average of the demand during those periods as the initial forecast. There are many other ways of setting this initial value, but it is important to note that the smaller the value of $\alpha$, the more sensitive the forecast will be to the selection of this initial smoother value $s_0$. [8] [9]


For every exponential smoothing method we also need to choose the value for the smoothing parameters. For simple exponential smoothing, there is only one smoothing parameter (α), but for the methods that follow there is usually more than one smoothing parameter.

There are cases where the smoothing parameters may be chosen in a subjective manner – the forecaster specifies the value of the smoothing parameters based on previous experience. However, a more robust and objective way to obtain values for the unknown parameters included in any exponential smoothing method is to estimate them from the observed data.

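The unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the sum of squared errors (SSE). The errors are specified as $e_t = y_t - \hat{y}_{t \mid t-1}$ for $t = 1, \ldots, T$ (the one-step-ahead within-sample forecast errors), where $y_t$ is the observation at time $t$ and $\hat{y}_{t \mid t-1}$ is its one-step-ahead forecast. Hence we find the values of the unknown parameters and the initial values that minimize

$$\text{SSE} = \sum_{t=1}^{T} \left(y_t - \hat{y}_{t \mid t-1}\right)^2 = \sum_{t=1}^{T} e_t^2$$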

Unlike the regression case (where we have formulae to directly compute the regression coefficients which minimize the SSE) this involves a non-linear minimization problem and we need to use an optimization tool to perform this.
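As a rough illustration of estimating the smoothing factor from data, the sketch below minimizes the sum of squared one-step-ahead errors for simple exponential smoothing. It makes several assumptions that are not from the text: the forecast of each observation is taken to be the previous smoothed value, the initial value is fixed at the first observation rather than estimated, and a plain grid search stands in for a proper non-linear optimizer. The function names and the series are hypothetical.

```python
def sse_one_step(xs, alpha):
    """Sum of squared one-step-ahead errors, forecasting x_t by s_{t-1}."""
    level = xs[0]  # s_0 = x_0, held fixed (assumption)
    total = 0.0
    for x in xs[1:]:
        error = x - level          # one-step-ahead forecast error
        total += error * error
        level = alpha * x + (1.0 - alpha) * level
    return total


def best_alpha(xs, steps=100):
    """Grid search over alpha in [0, 1]; a stand-in for a real optimizer."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda a: sse_one_step(xs, a))


series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]  # made-up data
alpha = best_alpha(series)
print(alpha, sse_one_step(series, alpha))
```

A grid search is easy to follow but coarse; a numerical optimizer would normally be used once more than one parameter or the initial values are estimated as well.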

"Exponential" naming

The name 'exponential smoothing' is attributed to the use of the exponential window function during convolution. It is no longer attributed to Holt, Winters & Brown.

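By direct substitution of the defining equation for simple exponential smoothing back into itself we find that

$$\begin{aligned}
s_t &= \alpha x_t + (1 - \alpha) s_{t-1} \\
&= \alpha x_t + \alpha (1 - \alpha) x_{t-1} + (1 - \alpha)^2 s_{t-2} \\
&= \alpha \left[x_t + (1 - \alpha) x_{t-1} + (1 - \alpha)^2 x_{t-2} + \cdots + (1 - \alpha)^{t-1} x_1\right] + (1 - \alpha)^t x_0
\end{aligned}$$

In other words, as time passes the smoothed statistic $s_t$ becomes the weighted average of a greater and greater number of the past observations $x_{t-1}, \ldots, x_{t-n}$, and the weights assigned to previous observations are proportional to the terms of the geometric progression

$$1, (1 - \alpha), (1 - \alpha)^2, \ldots, (1 - \alpha)^n, \ldots$$

A geometric progression is the discrete version of an exponential function, so this is where the name for this smoothing method originated according to statistics lore.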
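The equivalence between the recursion and the geometrically weighted sum can be checked numerically. The sketch below assumes the initialization $s_0 = x_0$, so the oldest observation carries the leftover weight $(1 - \alpha)^t$; the helper name `ewma_weights` and the data are hypothetical.

```python
def ewma_weights(alpha, t):
    """Weights on x_t, x_{t-1}, ..., x_1, x_0 implied by s_0 = x_0."""
    weights = [alpha * (1.0 - alpha) ** k for k in range(t)]  # x_t down to x_1
    weights.append((1.0 - alpha) ** t)                        # leftover on x_0
    return weights


alpha = 0.3
xs = [2.0, 4.0, 3.0, 7.0, 6.0]  # made-up observations x_0 .. x_4
t = len(xs) - 1

# Recursive form.
s = xs[0]
for x in xs[1:]:
    s = alpha * x + (1.0 - alpha) * s

# Expanded form: weights applied to the observations, newest first.
w = ewma_weights(alpha, t)
expanded = sum(wk * xk for wk, xk in zip(w, reversed(xs)))

print(abs(s - expanded) < 1e-12)  # True: both forms agree
print(abs(sum(w) - 1.0) < 1e-12)  # True: the weights sum to one
```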

Comparison with moving average

Exponential smoothing and moving average share the defect of introducing a lag relative to the input data. While this can be corrected by shifting the result by half the window length for a symmetrical kernel, such as a moving average or Gaussian, it is unclear how appropriate this would be for exponential smoothing. They also both have roughly the same distribution of forecast error when α = 2/(k + 1). They differ in that exponential smoothing takes into account all past data, whereas moving average only takes into account k past data points. Computationally speaking, they also differ in that moving average requires the past k data points, or the data point at lag k + 1 plus the most recent forecast value, to be kept, whereas exponential smoothing only needs the most recent forecast value to be kept. [11]

In the signal processing literature, the use of non-causal (symmetric) filters is commonplace, and the exponential window function is broadly used in this fashion, but a different terminology is used: exponential smoothing is equivalent to a first-order infinite-impulse response (IIR) filter and moving average is equivalent to a finite impulse response filter with equal weighting factors.

Double exponential smoothing (Holt linear)

Simple exponential smoothing does not do well when there is a trend in the data. [1] In such situations, several methods were devised under the name "double exponential smoothing" or "second-order exponential smoothing", so called because an exponential filter is applied recursively twice. This nomenclature is similar to quadruple exponential smoothing, which also references its recursion depth. [12] The basic idea behind double exponential smoothing is to introduce a term to take into account the possibility of a series exhibiting some form of trend. This slope component is itself updated via exponential smoothing.

One method works as follows: [13]

Again, the raw data sequence of observations is represented by $x_t$, beginning at time $t = 0$. We use $s_t$ to represent the smoothed value for time $t$, and $b_t$ is our best estimate of the trend at time $t$. The output of the algorithm is now written as $F_{t+m}$, an estimate of the value of $x_{t+m}$, $m > 0$, based on the raw data up to time $t$. Double exponential smoothing is given by the formulas

$$s_0 = x_0$$
$$b_0 = x_1 - x_0$$

And for $t > 0$ by

$$s_t = \alpha x_t + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$

where $\alpha$ ($0 \le \alpha \le 1$) is the data smoothing factor, and $\beta$ ($0 \le \beta \le 1$) is the trend smoothing factor.

A forecast beyond $x_t$ is given by the approximation:

$$F_{t+m} = s_t + m \cdot b_t$$

Setting the initial value $b_0$ is a matter of preference. An option other than the one listed above is $(x_n - x_0)/n$ for some $n$.

Note that $F_0$ is undefined (there is no estimation for time 0), and according to the definition $F_1 = s_0 + b_0$, which is well defined, thus further values can be evaluated.
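The level and trend updates can be sketched in Python as follows. This is a minimal illustration, assuming the initialization $s_0 = x_0$, $b_0 = x_1 - x_0$, the updates $s_t = \alpha x_t + (1 - \alpha)(s_{t-1} + b_{t-1})$ and $b_t = \beta(s_t - s_{t-1}) + (1 - \beta) b_{t-1}$, and the forecast $F_{t+m} = s_t + m b_t$. The function names and data are hypothetical.

```python
def holt_linear(xs, alpha, beta):
    """Return (levels, trends) with s_0 = x_0 and b_0 = x_1 - x_0."""
    if len(xs) < 2:
        raise ValueError("need at least two observations to initialise the trend")
    levels = [xs[0]]
    trends = [xs[1] - xs[0]]
    for x in xs[1:]:
        prev_level, prev_trend = levels[-1], trends[-1]
        level = alpha * x + (1.0 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1.0 - beta) * prev_trend
        levels.append(level)
        trends.append(trend)
    return levels, trends


def holt_forecast(levels, trends, m):
    """Forecast m steps past the last observation: s_t + m * b_t."""
    return levels[-1] + m * trends[-1]


xs = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up series with a constant slope of 2
levels, trends = holt_linear(xs, alpha=0.5, beta=0.5)
print(holt_forecast(levels, trends, 1))  # 11.0
```

On a perfectly linear series the level tracks the data and the trend stays at the true slope, so the one-step forecast continues the line.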

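A second method, referred to as either Brown's linear exponential smoothing (LES) or Brown's double exponential smoothing, works as follows. [14]

$$s'_0 = x_0$$
$$s''_0 = x_0$$
$$s'_t = \alpha x_t + (1 - \alpha) s'_{t-1}$$
$$s''_t = \alpha s'_t + (1 - \alpha) s''_{t-1}$$
$$F_{t+m} = a_t + m b_t$$

where $a_t$, the estimated level at time $t$, and $b_t$, the estimated trend at time $t$, are:

$$a_t = 2 s'_t - s''_t$$
$$b_t = \frac{\alpha}{1 - \alpha}\left(s'_t - s''_t\right)$$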
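A sketch of Brown's method in Python follows. It assumes the commonly given form: two chained simple smoothers $s'_t$ and $s''_t$, both started at $x_0$, with level $a_t = 2s'_t - s''_t$, trend $b_t = \frac{\alpha}{1 - \alpha}(s'_t - s''_t)$ and forecast $F_{t+m} = a_t + m b_t$. The function names and data are hypothetical.

```python
def brown_les(xs, alpha):
    """Return (level, trend) after processing xs with Brown's double smoothing."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not xs:
        raise ValueError("need at least one observation")
    single = double = xs[0]  # s'_0 = s''_0 = x_0 (assumed initialization)
    for x in xs[1:]:
        single = alpha * x + (1.0 - alpha) * single       # first smoothing pass
        double = alpha * single + (1.0 - alpha) * double  # smoothing of the smoothed series
    level = 2.0 * single - double
    trend = alpha / (1.0 - alpha) * (single - double)
    return level, trend


def brown_forecast(level, trend, m):
    """Forecast m steps ahead: a_t + m * b_t."""
    return level + m * trend


level, trend = brown_les([2.0, 4.0, 6.0, 8.0, 10.0], alpha=0.5)  # made-up data
print(brown_forecast(level, trend, 1))
```

Unlike the Holt method, this variant uses a single smoothing factor for both level and trend.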

Triple exponential smoothing (Holt Winters)

Triple exponential smoothing applies exponential smoothing three times, and is commonly used when there are three high-frequency signals to be removed from a time series under study. Seasonality can be of two types, 'multiplicative' and 'additive', much as addition and multiplication are basic operations in mathematics.

If every December we sell 10,000 more apartments than we do in November, the seasonality is additive in nature. However, if we sell 10% more apartments in the summer months than we do in the winter months, the seasonality is multiplicative in nature. Multiplicative seasonality can be represented as a constant factor, not an absolute amount. [15]

Triple exponential smoothing was first suggested by Holt's student, Peter Winters, in 1960 after reading a signal processing book from the 1940s on exponential smoothing. [16] Holt's novel idea was to repeat filtering an odd number of times greater than 1 and less than 5, which was popular with scholars of previous eras. [16] While recursive filtering had been used previously, it was applied twice and four times to coincide with the Hadamard conjecture, while triple application required more than double the operations of singular convolution. The use of a triple application is considered a rule of thumb technique, rather than one based on theoretical foundations, and has often been over-emphasized by practitioners.

Suppose we have a sequence of observations $x_t$, beginning at time $t = 0$ with a cycle of seasonal change of length $L$.

The method calculates a trend line for the data as well as seasonal indices that weight the values in the trend line based on where that time point falls in the cycle of length $L$.

Let $s_t$ represent the smoothed value of the constant part for time $t$, $b_t$ the sequence of best estimates of the linear trend that are superimposed on the seasonal changes, and $c_t$ the sequence of seasonal correction factors. We wish to estimate $c_t$ at every time $t \bmod L$ in the cycle that the observations take on. As a rule of thumb, a minimum of two full seasons (or $2L$ periods) of historical data is needed to initialize a set of seasonal factors.

The output of the algorithm is again written as $F_{t+m}$, an estimate of the value of $x_{t+m}$, $m > 0$, based on the raw data up to time $t$. Triple exponential smoothing with multiplicative seasonality is given by the formulas [1]

$$s_0 = x_0$$
$$s_t = \alpha \frac{x_t}{c_{t-L}} + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$
$$c_t = \gamma \frac{x_t}{s_t} + (1 - \gamma) c_{t-L}$$
$$F_{t+m} = (s_t + m b_t)\, c_{t-L+1+(m-1) \bmod L}$$

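where $\alpha$ ($0 \le \alpha \le 1$) is the data smoothing factor, $\beta$ ($0 \le \beta \le 1$) is the trend smoothing factor, and $\gamma$ ($0 \le \gamma \le 1$) is the seasonal change smoothing factor.

The general formula for the initial trend estimate $b_0$ is:

$$b_0 = \frac{1}{L}\left(\frac{x_{L+1} - x_1}{L} + \frac{x_{L+2} - x_2}{L} + \cdots + \frac{x_{L+L} - x_L}{L}\right)$$

Setting the initial estimates for the seasonal indices $c_i$ for $i = 1, 2, \ldots, L$ is a bit more involved. If $N$ is the number of complete cycles present in your data, then:

$$c_i = \frac{1}{N} \sum_{j=1}^{N} \frac{x_{L(j-1)+i}}{A_j} \quad \text{for } i = 1, 2, \ldots, L$$

where

$$A_j = \frac{\sum_{k=1}^{L} x_{L(j-1)+k}}{L} \quad \text{for } j = 1, 2, \ldots, N$$

Note that $A_j$ is the average value of $x$ in the $j$-th cycle of your data.

Triple exponential smoothing with additive seasonality is given by:

$$s_0 = x_0$$
$$s_t = \alpha (x_t - c_{t-L}) + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$
$$c_t = \gamma (x_t - s_{t-1} - b_{t-1}) + (1 - \gamma) c_{t-L}$$
$$F_{t+m} = s_t + m b_t + c_{t-L+1+(m-1) \bmod L}$$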
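The multiplicative variant can be sketched in Python as below. Several choices here are assumptions made for the illustration rather than part of the method as described: arrays are zero-based, the level starts at the first observation, the initial trend averages the per-period change between the first two cycles, and the initial seasonal indices average each phase's ratio to its cycle mean over all complete cycles. The function names and the sample series are hypothetical.

```python
def initial_trend(xs, L):
    """Average per-period change between the first and second cycles."""
    return sum((xs[L + i] - xs[i]) / L for i in range(L)) / L


def initial_seasonal_indices(xs, L):
    """Average ratio of each phase to its cycle mean over complete cycles."""
    n_cycles = len(xs) // L
    cycle_means = [sum(xs[L * j:L * (j + 1)]) / L for j in range(n_cycles)]
    return [
        sum(xs[L * j + i] / cycle_means[j] for j in range(n_cycles)) / n_cycles
        for i in range(L)
    ]


def holt_winters_multiplicative(xs, L, alpha, beta, gamma, horizon):
    """Return forecasts for 1..horizon steps past the end of xs."""
    if len(xs) < 2 * L:
        raise ValueError("need at least two full seasons of data")
    seasonals = initial_seasonal_indices(xs, L)  # latest index for each phase
    level = xs[0]                                # s_0 = x_0
    trend = initial_trend(xs, L)                 # b_0
    for t in range(1, len(xs)):
        phase = t % L
        prev_level = level
        # Deseasonalise the observation with the index from one cycle ago.
        level = alpha * xs[t] / seasonals[phase] + (1.0 - alpha) * (prev_level + trend)
        # The right-hand side still holds the previous trend b_{t-1}.
        trend = beta * (level - prev_level) + (1.0 - beta) * trend
        seasonals[phase] = gamma * xs[t] / level + (1.0 - gamma) * seasonals[phase]
    n = len(xs)
    return [
        (level + m * trend) * seasonals[(n - 1 + m) % L]
        for m in range(1, horizon + 1)
    ]


# Made-up series: constant level 10 with seasonal factors 1.0, 1.5, 0.5.
xs = [10.0, 15.0, 5.0] * 4
forecast = holt_winters_multiplicative(xs, L=3, alpha=0.5, beta=0.5, gamma=0.5, horizon=3)
print(forecast)  # [10.0, 15.0, 5.0]
```

The additive variant follows the same structure, with the divisions by the seasonal index replaced by subtractions and the final multiplication replaced by an addition.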


References

  1. "NIST/SEMATECH e-Handbook of Statistical Methods". NIST. Retrieved 23 May 2010.
  2. Oppenheim, Alan V.; Schafer, Ronald W. (1975). Digital Signal Processing. Prentice Hall. p. 5. ISBN 0-13-214635-5.
  3. Brown, Robert G. (1956). Exponential Smoothing for Predicting Demand. Cambridge, Massachusetts: Arthur D. Little Inc. p. 15.
  4. Holt, Charles C. (1957). "Forecasting Trends and Seasonal by Exponentially Weighted Averages". Office of Naval Research Memorandum. 52. Reprinted in Holt, Charles C. (January–March 2004). "Forecasting Trends and Seasonal by Exponentially Weighted Averages". International Journal of Forecasting. 20 (1): 5–10. doi:10.1016/j.ijforecast.2003.09.015.
  5. Brown, Robert Goodell (1963). Smoothing Forecasting and Prediction of Discrete Time Series. Englewood Cliffs, NJ: Prentice-Hall.
  6. "NIST/SEMATECH e-Handbook of Statistical Methods, Single Exponential Smoothing". NIST. Retrieved 5 July 2017.
  7. Nau, Robert. "Averaging and Exponential Smoothing Models". Retrieved 26 July 2010.
  8. Nahmias. Production and Operations Analysis. 2009.
  9. Čisar, P.; Čisar, S. M. (2011). "Optimization methods of EWMA statistics". Acta Polytechnica Hungarica. 8 (5): 73–87. p. 78.
  10. "7.1 Simple exponential smoothing". Forecasting: Principles and Practice.
  11. Nahmias, Steven (3 March 2008). Production and Operations Analysis (6th ed.). ISBN 978-0-07-337785-8.
  12. "Model: Second-Order Exponential Smoothing". SAP AG. Retrieved 23 January 2013.
  13. "Double Exponential Smoothing". itl.nist.gov. Retrieved 25 September 2011.
  14. "Averaging and Exponential Smoothing Models". duke.edu. Retrieved 25 September 2011.
  15. Kalehar, Prajakta S. "Time series Forecasting using Holt–Winters Exponential Smoothing" (PDF). Retrieved 23 June 2014.
  16. Winters, P. R. (April 1960). "Forecasting Sales by Exponentially Weighted Moving Averages". Management Science. 6 (3): 324–342. doi:10.1287/mnsc.6.3.324.
  17. "R: Holt–Winters Filtering". stat.ethz.ch. Retrieved 5 June 2016.
  18. "ets {forecast} | inside-R | A Community Site for R". inside-r.org. Archived from the original on 16 July 2016. Retrieved 5 June 2016.
  19. "Comparing HoltWinters() and ets()". Hyndsight. 29 May 2011. Retrieved 5 June 2016.
  20. tssmooth in Stata manual.
  21. "LibreOffice 5.2: Release Notes – the Document Foundation Wiki".
  22. "Excel 2016 Forecasting Functions | Real Statistics Using Excel".
