Backtesting

Backtesting is the testing of a predictive model on historical data. It is a type of retrodiction and a special case of cross-validation applied to previous time periods. In quantitative finance, backtesting is an important step before deploying algorithmic strategies in live markets.

Financial analysis

In the economic and financial field, backtesting seeks to estimate the performance of a strategy or model if it had been employed during a past period. This requires simulating past conditions with sufficient detail, so one limitation of backtesting is the need for detailed historical data. A second limitation is the inability to model strategies that would themselves have affected historical prices. Finally, backtesting, like other modeling, is limited by potential overfitting: it is often possible to find a strategy that would have worked well in the past but will not work well in the future. [1] Despite these limitations, backtesting provides information not available when models and strategies are tested on synthetic data.

Historically, backtesting was performed only by large institutions and professional money managers, owing to the expense of obtaining and using detailed datasets. It is now used far more widely, and independent web-based backtesting platforms [2] have emerged. Although the technique is widely used, it is prone to weaknesses. [3] Basel financial regulations require large financial institutions to backtest certain risk models.

For a 1-day Value at Risk at 99% confidence, backtested over 250 consecutive days, the test is classified green (cumulative probability 0–95%), orange (95–99.99%), or red (99.99–100%) according to the following table: [4]

1-day VaR at 99% backtested 250 days
Zone    Number of exceptions  Probability  Cumulative
Green           0                8.11%        8.11%
                1               20.47%       28.58%
                2               25.74%       54.32%
                3               21.49%       75.81%
                4               13.41%       89.22%
Orange          5                6.66%       95.88%
                6                2.75%       98.63%
                7                0.97%       99.60%
                8                0.30%       99.89%
                9                0.08%       99.97%
Red            10                0.02%       99.99%
               11                0.00%      100.00%
              ...                  ...          ...
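The probabilities above follow from the Basel framework's working assumption that exceptions of a correctly calibrated 99% 1-day VaR occur independently with probability 1%, so the number of exceptions in 250 days is Binomial(250, 0.01). A minimal sketch reproducing the table (function name is illustrative):

```python
from math import comb

def exception_prob(k, n=250, p=0.01):
    """P(exactly k VaR exceptions in n days), assuming i.i.d. exceptions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

cumulative = 0.0
for k in range(6):
    cumulative += exception_prob(k)
    print(f"{k} exceptions: {exception_prob(k):7.2%}  cumulative: {cumulative:7.2%}")
# The k = 0 row gives 8.11%, matching the first line of the table.
```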

For a 10-day Value at Risk at 99% confidence, backtested over 250 consecutive days, the test is classified green (cumulative probability 0–95%), orange (95–99.99%), or red (99.99–100%) according to the following table:

10-day VaR at 99% backtested 250 days
Zone    Number of exceptions  Probability  Cumulative
Green           0               36.02%       36.02%
                1               15.99%       52.01%
                2               11.58%       63.59%
                3                8.90%       72.49%
                4                6.96%       79.44%
                5                5.33%       84.78%
                6                4.07%       88.85%
                7                3.05%       91.90%
                8                2.28%       94.17%
Orange          9                1.74%       95.91%
              ...                  ...          ...
               24                0.01%       99.99%
Red            25                0.00%       99.99%
              ...                  ...          ...

Backtesting through cross-validation in finance

Traditional backtesting evaluates a strategy on a single historical path. Although intuitive, this approach is sensitive to regime changes, path dependence, and look-ahead leakage. To address these limitations, practitioners adapt cross-validation (CV) methods to time-ordered financial data. Because financial observations are not independent and identically distributed (IID), randomized CV is inappropriate, motivating the use of specialized temporal CV procedures. [5]

Walk-forward / rolling-window backtesting

Walk-forward analysis divides historical data into sequential training and testing windows. A model is trained on an initial in-sample period, tested on the subsequent period, and the window is rolled forward repeatedly. [5]
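The rolling scheme can be expressed as an index generator. A minimal sketch, with illustrative names and window sizes:

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index lists for rolling-window backtesting.

    The training window always ends where the test window begins, so
    the model never sees data from its own evaluation period.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        split = start + train_size
        yield list(range(start, split)), list(range(split, split + test_size))
        start += test_size  # roll both windows forward by one test period

for train, test in walk_forward_splits(n_obs=10, train_size=4, test_size=2):
    print(train, "->", test)
# Each test window immediately follows its training window; with 10
# observations this produces three train/test pairs.
```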

Advantages

  • Provides a clear historical interpretation, as each testing period mirrors a realistic paper-trading scenario. [5]
  • Avoids look-ahead bias because the training set always predates the testing set; with trailing windows and proper purging, test samples remain fully out-of-sample. [6]
  • Enables robustness assessment across market regimes through periodic reoptimization, adapting to evolving volatility and price dynamics. [7]

Limitations

  • Relies on a single historical path, making results sensitive to sequencing and increasing overfitting risk. [8]
  • May not generalize to alternative market orderings, as reversing observations often yields inconsistent outcomes. [5]
  • Provides limited out-of-sample evaluation because each window uses only a subset of observations. [5]
  • Frequent reoptimization may overfit transient structures, overstating robustness. [7]

Purged cross-validation (with embargoing)

Purged cross-validation adapts k-fold CV to financial series by purging observations whose label-formation overlaps with the test fold and applying an embargo to avoid leakage from serial dependence. [6] Its purpose is not historical accuracy but evaluation across multiple out-of-sample stress scenarios. [5]
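A minimal sketch of the split logic, assuming each label depends on a fixed number of future observations (`label_span` and `embargo` are illustrative parameters):

```python
def purged_kfold_splits(n_obs, n_folds, label_span, embargo):
    """Yield (train, test) index lists for purged k-fold CV.

    Training observations whose label window [i, i + label_span) overlaps
    the test fold are purged; an embargo of extra observations after the
    fold limits leakage from serial correlation.
    """
    fold_size = n_obs // n_folds
    for f in range(n_folds):
        test_start = f * fold_size
        test_end = test_start + fold_size  # exclusive
        test = list(range(test_start, test_end))
        train = [i for i in range(n_obs)
                 if i + label_span <= test_start  # label closes before fold
                 or i >= test_end + embargo]      # past the fold and embargo
        yield train, test

for train, test in purged_kfold_splits(n_obs=20, n_folds=4, label_span=2, embargo=1):
    print("test", test[0], "-", test[-1], "| train size:", len(train))
```

Unlike walk-forward analysis, training data may lie on both sides of the test fold, which is why purging and embargoing are required.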

Advantages

  • Evaluates strategies across many alternative out-of-sample scenarios rather than one historical path. [5]
  • Uses each sample exactly once for testing, achieving maximal out-of-sample usage.
  • Prevents leakage through purging and embargoing.

Limitations

  • The training set does not trail the testing set, requiring careful purging/embargo to prevent leakage. [6]
  • Reduces effective sample size when labels span long periods. [5]
  • Still produces a single forecast per observation, yielding one performance path.

Combinatorial purged cross-validation (CPCV)

Combinatorial purged cross-validation partitions a time series into non-overlapping groups and evaluates combinations of these groups as test sets. Each fold is purged and embargoed, yielding a distribution of performance estimates and reducing selection bias inherent in walk-forward and standard CV methods. [5]
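A minimal sketch of the combinatorial structure (purging and embargoing at group boundaries are omitted for brevity; names are illustrative):

```python
from itertools import combinations
from math import comb

def cpcv_splits(n_groups, n_test_groups):
    """Enumerate CPCV splits: each split holds out one combination of
    groups as the test set and trains on the remaining groups."""
    for test in combinations(range(n_groups), n_test_groups):
        train = [g for g in range(n_groups) if g not in test]
        yield train, list(test)

n, k = 6, 2
splits = list(cpcv_splits(n, k))
# With N groups and k test groups there are C(N, k) splits, and the
# test folds can be stitched into k/N * C(N, k) distinct backtest paths.
n_paths = comb(n, k) * k // n
print(len(splits), "splits,", n_paths, "paths")
# → 15 splits, 5 paths
```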

Advantages

  • Produces a distribution of performance statistics rather than a single path, improving inference. [5]
  • Lowers variance in Sharpe ratio estimates by averaging across many nearly uncorrelated paths.
  • Reduces sensitivity to specific windows or local market regimes.
  • Used to compute the Probability of Backtest Overfitting (PBO). [9]

Limitations

  • Computationally intensive due to the number of path combinations. [5]
  • Requires selecting the number and size of groups, which affects variance.
  • More complex to implement and typically relies on custom tooling.

Backtest statistics in quantitative finance

Backtests often produce performance metrics that appear statistically significant even when driven by noise. Because financial returns have low signal-to-noise ratio, non-normal characteristics, and regime dependence, backtest evaluation requires statistics that adjust for multiple trials, selection bias, and sampling error. [10]

General characteristics

General structural characteristics affecting reliability include the time range covered, the frequency of bets, the average holding period, and the strategy's capacity and leverage. [10]


Performance

Performance statistics summarize a strategy's returns over the backtest period, for example through the time-weighted rate of return and the drawdown measures discussed below. [10]

Time-weighted rate of return

The time-weighted rate of return (TWRR) is a measure of investment performance that isolates the return generated by the portfolio itself, independent of external cash flows. It divides the performance into subperiods defined by deposits or withdrawals and compounds the returns of those subperiods, ensuring that each interval contributes equally to the final result. Because TWRR removes the effect of investor-driven cash flows, it is commonly used to evaluate asset managers and compare investment strategies. This contrasts with the CAGR, which reflects the growth of an investor’s actual account value and is therefore sensitive to the timing and size of contributions and withdrawals.
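The calculation can be sketched as follows, assuming each valuation is taken at the end of a subperiod, before the next subperiod's cash flow is applied (function and argument names are illustrative):

```python
def twrr(valuations, flows):
    """Time-weighted rate of return.

    flows: external cash flow at the start of each subperiod
        (flows[0] is the initial investment).
    valuations: portfolio value at the end of each subperiod,
        before the next subperiod's cash flow is applied.
    """
    growth = 1.0
    value = 0.0
    for end_value, flow in zip(valuations, flows):
        value += flow                 # cash flow lands at the period start
        growth *= end_value / value   # subperiod return factor
        value = end_value             # carry the value into the next period
    return growth - 1.0

# Invest 100, which grows to 110 (+10%); deposit 50 (base 160), which
# grows to 168 (+5%). TWRR = 1.10 * 1.05 - 1 = 15.5%, unaffected by the
# size or timing of the deposit.
print(f"{twrr([110.0, 168.0], [100.0, 50.0]):.1%}")
# → 15.5%
```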

Runs and drawdowns

Most investment strategies do not generate returns from an independent and identically distributed (IID) process. Because returns are not IID, they often exhibit sequences of same-direction outcomes, known as runs. For example, +1%, +0.8%, +0.5% constitute a positive run, while –0.7%, –1.2%, –0.4% form a negative run. Negative runs can significantly amplify downside risk, so averages or standard deviations alone are insufficient to assess a strategy's true risk profile. Instead, one must rely on risk measures, such as drawdown and time under water, that capture the impact of these persistent patterns. [10]
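Drawdown is one such measure: because it depends on the ordering of returns, it penalizes clustered negative runs that a standard deviation would miss. A minimal sketch:

```python
def max_drawdown(returns):
    """Maximum peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

clustered = [0.02, 0.02, -0.01, -0.01, -0.01]    # losses form one run
interleaved = [0.02, -0.01, 0.02, -0.01, -0.01]  # same returns, reordered
# The clustered ordering yields the deeper drawdown (about 2.97% versus
# about 1.99%), even though both sequences have identical means and
# standard deviations.
print(max_drawdown(clustered), max_drawdown(interleaved))
```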

Implementation shortfall

Implementation shortfall measures the erosion of performance due to execution frictions such as broker fees, slippage, and market impact. [10]
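For a single order, the shortfall can be sketched as the gap between the cost actually paid and the cost at the price prevailing when the trade was decided (names and numbers are illustrative):

```python
def implementation_shortfall(decision_price, fill_price, quantity, fees=0.0):
    """Per-trade implementation shortfall for a buy order, as a fraction
    of the decision-price notional. Captures slippage, market impact,
    and explicit fees; delay and opportunity costs are omitted here."""
    paper_cost = decision_price * quantity
    actual_cost = fill_price * quantity + fees
    return (actual_cost - paper_cost) / paper_cost

# Decide to buy 1,000 shares at 50.00; average fill 50.10 plus 20.00 in
# fees: shortfall = (50_100 + 20 - 50_000) / 50_000 = 0.24% of notional.
print(f"{implementation_shortfall(50.00, 50.10, 1000, fees=20.0):.2%}")
```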

Efficiency metrics

Efficiency metrics relate performance to the risk taken to achieve it, most commonly through the Sharpe ratio and adjusted variants such as the probabilistic and deflated Sharpe ratios. [10]

Overfitting and validation

Backtests are vulnerable to overfitting when many variations are tested. The Probability of Backtest Overfitting (PBO) quantifies this risk, often using CPCV. [9]

Classification-based metrics

Machine-learning-based strategies are additionally evaluated with classification metrics such as accuracy, precision, recall, and F1 score. [10]
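As a minimal sketch, the standard metrics for binary trade signals (1 = take the trade) can be computed from the confusion-matrix counts:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical signals: 6 observations, 4 predicted correctly.
acc, prec, rec, f1 = classification_metrics([1, 1, 0, 0, 1, 0],
                                            [1, 0, 0, 1, 1, 0])
print(acc, prec, rec, f1)
```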

Attribution

Performance attribution decomposes returns across risk categories (e.g., duration, credit, liquidity, sector). [10]

Limitations and pitfalls

Backtesting is widely used to evaluate historical performance, but it is vulnerable to several sources of error. Because backtests rely on historical data rather than controlled experiments, they cannot establish causality and may reflect patterns that occurred by chance. [11] [12]

Common sources of error

Typical pitfalls include survivorship bias (testing only instruments that survived to the present), look-ahead bias (using information that was not available at the time), data snooping, and neglected transaction costs. These issues reduce reliability even before considering sampling error or the risk of overfitting. [11]

Limits of backtesting as a research tool

Backtesting is frequently misused as an idea-generation tool. A backtest can evaluate a fully specified strategy but cannot explain why it should work or whether the economic rationale will persist. Iteratively modifying models in response to backtest outcomes increases the likelihood of overfitting, producing results that do not generalize out of sample. [12] [11]

Practical recommendations

Recommended practices include recording every trial so that reported performance can be discounted for multiple testing, backtesting only fully specified strategies rather than using backtests to generate ideas, and evaluating performance across simulated scenarios in addition to the single historical path. While none of these practices fully eliminates overfitting, they help identify strategies with a higher likelihood of out-of-sample validity. [11]

Hindcast

Temporal representation of hindcasting.

In oceanography [14] and meteorology, [15] backtesting is also known as hindcasting: a hindcast is a way of testing a mathematical model; researchers enter known or closely estimated inputs for past events into the model to see how well the output matches the known results.

Hindcasting usually refers to a numerical-model integration over a historical period in which no observations have been assimilated, which distinguishes a hindcast run from a reanalysis. Oceanographic observations of salinity and temperature, as well as observations of surface-wave parameters such as the significant wave height, are much scarcer than meteorological observations, making hindcasting more common in oceanography than in meteorology. Also, since surface waves represent a forced system in which the wind is the only generating force, wave hindcasting is often considered adequate for generating a reasonable representation of the wave climate with little need for a full reanalysis. Hydrologists use hindcasting to model stream flows. [16]

An example of hindcasting would be entering climate forcings (events that force change) into a climate model. If the hindcast showed reasonably-accurate climate response, the model would be considered successful.

The ECMWF re-analysis is an example of a combined atmospheric reanalysis coupled with a wave-model integration where no wave parameters were assimilated, making the wave part a hindcast run.

References

  1. Bailey, David H.; Borwein, Jonathan; Lopez de Prado, Marcos; Zhu, Qiji Jim (2014). "Pseudo-mathematics and financial charlatanism" (PDF). Notices of the American Mathematical Society. 61 (5): 458–471.
  2. "Example of web platform for cryptocurrency backtesting, providing historical trading simulations". Intralogic. Retrieved 2025-11-23.
  3. FinancialTrading (2013-04-27). "Issues related to back testing".
  4. "Supervisory framework for the use of "backtesting" in conjunction with the internal models approach to market risk capital requirements" (PDF). Basle Committee on Banking Supervision. January 1996. p. 14.
  5. Lopez de Prado, Marcos (2018). Advances in Financial Machine Learning. Wiley. Chapter 12.
  6. Lopez de Prado, Marcos (2018). Advances in Financial Machine Learning. Wiley. Chapter 7.
  7. Pardo, Robert (2008). The Evaluation and Optimization of Trading Strategies. Wiley. pp. 38–39.
  8. Bailey, David H.; Borwein, Jonathan; Lopez de Prado, Marcos; Zhu, Qiji Jim (2014). "Pseudo-mathematics and financial charlatanism". Notices of the American Mathematical Society. 61 (5): 458–471.
  9. Bailey, David H.; Lopez de Prado, Marcos (2013). "The Probability of Backtest Overfitting". SSRN. doi:10.2139/ssrn.2326253.
  10. Lopez de Prado, Marcos (2018). Advances in Financial Machine Learning. Wiley. Chapter 14.
  11. Lopez de Prado, Marcos (2018). Advances in Financial Machine Learning. Wiley. Chapter 11.
  12. "The Seven Sins of Quantitative Investing" (PDF). Deutsche Bank Markets Research. 2014. Retrieved 2025-11-28.
  13. Yeates, L. B. (2004). Thought Experimentation: A Cognitive Approach. Graduate Diploma in Arts (By Research) dissertation, University of New South Wales. p. 145.
  14. "Hindcast approach". OceanWeather Inc. Retrieved 22 January 2013.
  15. Huijnen, V.; Flemming, J.; Kaiser, J. W.; Inness, A.; Leitão, J.; Heil, A.; Eskes, H. J.; Schultz, M. G.; Benedetti, A.; Hadji-Lazaro, J.; Dufour, G.; Eremenko, M. (2012). "Hindcast experiments of tropospheric composition during the summer 2010 fires over western Russia". Atmos. Chem. Phys. 12 (9): 4341–4364. Bibcode:2012ACP....12.4341H. doi:10.5194/acp-12-4341-2012. Retrieved 22 January 2013.
  16. "Guidance on Conducting Streamflow Hindcasting in CHPS" (PDF). NOAA. Retrieved 22 January 2013.