
Backtesting Discipline

model-building | Level 2 (Intermediate)

What It Is

The rigorous process of testing systematic strategies on historical data in a way that produces reliable out-of-sample performance estimates — including walk-forward testing methodology, avoidance of look-ahead bias, realistic cost modeling, and explicit acknowledgment of multiple-comparison bias when testing many variations.

Correct Execution

Practitioner uses a strict separation between in-sample (parameter selection) and out-of-sample (performance evaluation) periods. Parameters are selected based on economic rationale, not backtest optimization — round numbers and economically motivated lookbacks are preferred over optimized parameters. All sources of look-ahead bias are identified and eliminated before trusting any result. Transaction costs include bid-ask spread, market impact, and borrowing costs (for short positions) — not just commission. Multiple tests on the same data are treated as a source of statistical inflation, and results are discounted accordingly.
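
The in-sample / out-of-sample separation described above can be sketched as a walk-forward split generator. This is a minimal illustration, not a prescribed implementation: the function name `walk_forward_splits`, the rolling (rather than expanding) window, and the example window sizes are all assumptions made here.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (in_sample, out_of_sample) index ranges for a rolling walk-forward.

    Each out-of-sample window starts exactly where its in-sample window ends,
    so parameters are always chosen on data strictly earlier than the data
    used to evaluate them.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        in_sample = range(start, start + train_size)
        out_of_sample = range(start + train_size, start + train_size + test_size)
        yield in_sample, out_of_sample
        start += test_size  # roll forward by one out-of-sample window


# Hypothetical example: 120 monthly observations, 60-month in-sample window,
# 12-month out-of-sample window.
splits = list(walk_forward_splits(n_obs=120, train_size=60, test_size=12))
for in_sample, out_of_sample in splits:
    print(f"select on {in_sample.start}..{in_sample.stop - 1}, "
          f"evaluate on {out_of_sample.start}..{out_of_sample.stop - 1}")
```

Only the out-of-sample windows should feed the reported performance; the in-sample windows exist solely for parameter selection.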

Coaching Cues

  • "Parameters should come from economic rationale, not optimization. Round numbers only."
  • "Walk-forward is the minimum. Even then, you should be suspicious." — Euan Sinclair framework
  • "Count your tests. Each one you run contaminates the dataset a little more."
  • "The best backtest you'll ever see is the in-sample result. Everything after that will be worse. Plan accordingly." — Euan Sinclair

Common Errors

  1. Survivorship bias: Testing equity strategies only on stocks currently in the index, excluding names that went bankrupt or were delisted. Inflates performance significantly, especially for value and momentum strategies.
  2. Point-in-time data errors: Using data with revisions (restated financials, corrected prices) rather than the data that would have been available at decision time. Creates look-ahead bias even when there is no intentional cheating.
  3. Treating transaction costs as zero or minimal: Realistic round-trip costs (spread + market impact) for a rebalancing strategy can easily exceed 0.5–1% per trade. Ignoring this in backtesting inflates performance for higher-turnover strategies.
  4. Not testing for regime robustness: A strategy backtested on only a single historical period (e.g., 2010–2020) may be valid only for the specific regime of that period. Out-of-sample testing should include multiple regime episodes.
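
The cost arithmetic behind error 3 can be made concrete with a simple linear drag model. This is a rough sketch under stated assumptions: costs are a fixed fraction of notional per round trip, borrow is charged on the average short fraction of the book, and every number below is hypothetical rather than taken from the source.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    spread: float       # bid-ask spread paid per round trip (fraction of notional)
    impact: float       # market impact per round trip (fraction of notional)
    commission: float   # commission per round trip (fraction of notional)
    borrow_rate: float  # annual borrow fee on short notional

    def round_trip(self) -> float:
        """Total trading cost of one full round trip."""
        return self.spread + self.impact + self.commission


def net_annual_return(
    gross: float, round_trips_per_year: float, short_fraction: float, costs: CostModel
) -> float:
    """Gross annual return minus trading drag and borrow drag."""
    trading_drag = round_trips_per_year * costs.round_trip()
    borrow_drag = short_fraction * costs.borrow_rate
    return gross - trading_drag - borrow_drag


# Hypothetical inputs: 12% gross return, a full rebalance every month,
# half the book held short.
costs = CostModel(spread=0.003, impact=0.002, commission=0.0005, borrow_rate=0.01)
net = net_annual_return(gross=0.12, round_trips_per_year=12, short_fraction=0.5, costs=costs)
print(f"gross 12.0%, net {net:.1%}")
```

With these illustrative inputs, more than half of the gross return is consumed by costs, which is why commission-only cost assumptions flatter higher-turnover strategies.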

Edges

Conventional Wisdom Is Wrong

A Backtest Sharpe Of 1.0 From 100 Trials Is Below The Noise Level

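When 100 strategy variations are tested on the same data, the expected best-result Sharpe ratio from purely random strategies is approximately 2.5 standard errors above zero. On one year of data, where the standard error of an annualized Sharpe is roughly 1, that corresponds to an annualized Sharpe of about 2.5; the floor shrinks roughly as 1/√T for a T-year sample. A backtest Sharpe of 1.0, which most practitioners would celebrate, is therefore below the noise level of 100 trials unless the sample is longer than about six years. The number of trials directly inflates the apparent quality of the best result, and almost no practitioners adjust for this.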
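
The size of this selection effect can be checked by simulation. The sketch below is a minimal illustration, assuming i.i.d. Gaussian weekly returns with zero true edge and fully independent trials; the function names, the weekly sampling, and the simulation counts are choices made here, not part of the source.

```python
import math
import random


def annualized_sharpe(returns, periods_per_year):
    """Sample mean over sample standard deviation, scaled to annual terms."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def expected_best_sharpe(n_trials, years, n_sims, periods_per_year=52, seed=0):
    """Average, over n_sims repetitions, of the best Sharpe among n_trials
    strategies whose returns are pure noise (true Sharpe = 0)."""
    rng = random.Random(seed)
    n_obs = int(years * periods_per_year)
    total = 0.0
    for _ in range(n_sims):
        best = max(
            annualized_sharpe(
                [rng.gauss(0.0, 1.0) for _ in range(n_obs)], periods_per_year
            )
            for _ in range(n_trials)
        )
        total += best
    return total / n_sims


one_year = expected_best_sharpe(n_trials=100, years=1, n_sims=100)
four_years = expected_best_sharpe(n_trials=100, years=4, n_sims=50)
print(f"best of 100 random strategies, 1 year:  {one_year:.2f}")
print(f"best of 100 random strategies, 4 years: {four_years:.2f}")
```

Under these assumptions the one-year figure should land in the neighborhood of 2.5 and the four-year figure near half of that, since the standard error of an annualized Sharpe falls roughly as 1/√T. Correlated variants (the usual case when tweaking one strategy) reduce the effective number of trials, so the simulation gives an upper-end estimate of the noise floor.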

What most people do
Test hundreds of strategy variants. Report the best-performing variant with its backtest Sharpe. Treat a Sharpe of 1.0+ as validation of strategy quality.
What the best do
Track the total number of strategy variants tested on any dataset. Calculate the expected best-result Sharpe from random strategies given that trial count. Require that the actual best result materially exceeds this noise floor before considering it a valid signal.
Why it's an edge: Eliminates an entire class of "strategies" that are simply winners in a 100-trial tournament of random noise. The practitioners who apply this test discover that most of their "good" backtests are statistically indistinguishable from random.
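How to exploit: For any strategy presented with a backtest, ask: "How many variants were tested before this one was selected?" If the answer is >20, require a Bonferroni adjustment (test each variant at significance level α/N rather than α). The minimum Sharpe ratio needed to clear the noise floor scales approximately as √(2 × ln(N)) × baseline_Sharpe, where N is the number of trials and baseline_Sharpe is best read as the standard error of a single Sharpe estimate (roughly 1/√T annualized for T years of data). For N = 100 the multiplier is about 3.0, a slightly conservative bound on the expected maximum of about 2.5.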
Euan Sinclair, "Find Edge and Trade Volatility," Outlier Podcast, 2022-11-08
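
The two thresholds mentioned in "How to exploit" can be computed directly. This is a sketch under explicit assumptions that are not in the source: baseline_Sharpe is interpreted as the standard error of one annualized Sharpe estimate, approximated as 1/√T for T years of data with a true Sharpe of zero, and the Bonferroni test is one-sided at α = 0.05. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def sharpe_standard_error(years: float) -> float:
    """Approximate standard error of an annualized Sharpe when the true Sharpe is 0."""
    return 1.0 / math.sqrt(years)


def sqrt_log_noise_floor(n_trials: int, years: float) -> float:
    """Noise floor from the sqrt(2 * ln(N)) scaling rule."""
    return math.sqrt(2.0 * math.log(n_trials)) * sharpe_standard_error(years)


def bonferroni_min_sharpe(n_trials: int, years: float, alpha: float = 0.05) -> float:
    """Minimum Sharpe that is significant at level alpha / N (one-sided)."""
    z = NormalDist().inv_cdf(1.0 - alpha / n_trials)
    return z * sharpe_standard_error(years)


# Hypothetical case: 100 variants tested on 10 years of data, best Sharpe 1.0.
floor = sqrt_log_noise_floor(n_trials=100, years=10)
required = bonferroni_min_sharpe(n_trials=100, years=10)
clears = 1.0 > required
print(f"sqrt-log floor {floor:.2f}, Bonferroni minimum {required:.2f}, clears: {clears}")
```

Both thresholds rise slowly with the trial count and fall with sample length, so the honest trial count and the length of the history matter more than the exact choice of adjustment.
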
🔑 Hidden Causal Lever

If The Stupid Versions Also Work, You're Finding Noise

A simple but devastating test for dataset contamination through multiple testing: randomly permute the strategy's entry/exit rules and test the permuted version on the same data. If the nonsense version also produces positive backtested returns, the dataset is contaminated — any positive result from it is noise, not signal. This test is almost never run.

What most people do
Test only the strategically motivated version of a rule. Assume positive returns validate the strategy logic.
What the best do
As a mandatory validation step, test the permuted/randomized version of any strategy on the same dataset. If the random version also generates positive returns, the dataset is contaminated and all results from it should be discarded.
Why it's an edge: One simple test that falsifies an entire line of research. Most practitioners don't run it because they fear what it will reveal.
How to exploit: For any candidate strategy, create three permuted variants (shuffled entry signals, random exit timing, inverted signal direction). If any two of three permuted variants produce positive backtest results on the same dataset, declare the dataset contaminated and stop testing. Move to a fresh out-of-sample period.
Euan Sinclair, "Find Edge and Trade Volatility," Outlier Podcast, 2022-11-08
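
One of the three permuted variants, shuffled entry signals, can be sketched as follows. This is an illustrative toy, not the source's procedure: the signal encoding (1 = long, 0 = flat), the summed-return P&L, and the synthetic data are all assumptions made here.

```python
import random


def strategy_pnl(signals, returns):
    """Sum of per-period returns earned while the signal is on."""
    return sum(s * r for s, r in zip(signals, returns))


def shuffled_signal_test(signals, returns, n_shuffles=1000, seed=0):
    """Compare the real signal against versions with the same number of
    entries placed at random times."""
    rng = random.Random(seed)
    actual = strategy_pnl(signals, returns)
    shuffled = list(signals)
    n_positive = 0
    n_at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        pnl = strategy_pnl(shuffled, returns)
        if pnl > 0:
            n_positive += 1
        if pnl >= actual:
            n_at_least_as_good += 1
    return {
        "actual_pnl": actual,
        # Share of nonsense variants that still made money.
        "share_positive": n_positive / n_shuffles,
        # Share of nonsense variants that matched or beat the real signal.
        "p_value": (n_at_least_as_good + 1) / (n_shuffles + 1),
    }


# Synthetic, drift-free data: returns alternate up and down, and the toy
# signal is long only on the up periods.
returns = [0.01, -0.01] * 50
signals = [1, 0] * 50
result = shuffled_signal_test(signals, returns)
print(result)
```

On drift-free data like this, about half of the shuffled variants should be profitable by chance. A share well above one half on real data suggests that positive returns are easy to obtain regardless of the rule's logic, which is the contamination symptom described above.
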
🔑 Hidden Causal Lever

Regime-Stratified Backtesting Prevents Single-Regime Strategies From Masquerading As All-Weather

A strategy that backtests on 2010-2020 data looks robust on aggregate metrics but may have 100% of its alpha concentrated in the low-vol, QE-driven regime of that period. Aggregate backtest metrics (overall Sharpe, max drawdown) hide regime concentration. Only by separately evaluating performance in high-vol vs. low-vol, trending vs. mean-reverting, inflationary vs. deflationary periods can the regime specificity of a strategy be detected.

What most people do
Evaluate strategies on total-period aggregate metrics. Attribute good aggregate Sharpe to strategy robustness.
What the best do
Mandatory regime-stratified evaluation: separate the backtest returns by macro regime (growth/inflation quadrant), vol regime (high/low), and autocorrelation regime (trending/mean-reverting). A robust strategy should show positive expected value in at least 3 of 4 macro quadrants.
Why it's an edge: Prevents deploying strategies that are regime-specific as if they were all-weather, avoiding the inevitable catastrophic failure when the regime that supported them ends.
How to exploit: For every strategy evaluated, build a performance table by regime state: one row per month of strategy returns, with columns recording which regime state applied in that month. Calculate the Sharpe ratio within each regime bucket. Require that the Sharpe in the worst-performing regime bucket is no lower than -0.5 (i.e., modestly negative is acceptable; catastrophic single-regime failure is not).
Corey Hoffstein / Jim Masturzo framework, Flirting with Models, 2021-04-10
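
The regime table and the two acceptance rules (positive in at least 3 of 4 macro quadrants, worst bucket no lower than -0.5) can be sketched as below. The regime labels, the monthly returns, and the function names are hypothetical, and a real evaluation would need far more than three months per bucket for the Sharpe estimates to mean anything.

```python
import math
from collections import defaultdict


def sharpe_by_regime(monthly_returns, regimes, periods_per_year=12):
    """Annualized Sharpe of the returns falling in each regime bucket."""
    buckets = defaultdict(list)
    for ret, regime in zip(monthly_returns, regimes):
        buckets[regime].append(ret)
    sharpes = {}
    for regime, rets in buckets.items():
        if len(rets) < 2:
            continue  # not enough observations to estimate volatility
        mean = sum(rets) / len(rets)
        sd = math.sqrt(sum((r - mean) ** 2 for r in rets) / (len(rets) - 1))
        if sd > 0:
            sharpes[regime] = mean / sd * math.sqrt(periods_per_year)
    return sharpes


def passes_regime_check(sharpes, floor=-0.5, min_positive=3):
    """Apply both rules: enough positive buckets and no catastrophic bucket."""
    n_positive = sum(1 for s in sharpes.values() if s > 0)
    return n_positive >= min_positive and min(sharpes.values()) >= floor


# Hypothetical monthly returns tagged with a growth/inflation quadrant.
data = {
    "growth_up_inflation_down": [0.02, 0.01, 0.03],
    "growth_up_inflation_up": [0.01, -0.01, 0.03],
    "growth_down_inflation_down": [0.00, 0.02, 0.01],
    "growth_down_inflation_up": [-0.03, -0.01, -0.02],
}
monthly_returns = [r for rets in data.values() for r in rets]
regimes = [name for name, rets in data.items() for _ in rets]

sharpes = sharpe_by_regime(monthly_returns, regimes)
ok = passes_regime_check(sharpes)
for regime, s in sharpes.items():
    print(f"{regime:28s} {s:6.2f}")
print("passes regime check:", ok)
```

The toy strategy is positive in three of four quadrants yet still fails, because its losses in the fourth quadrant are far below the -0.5 floor; the aggregate Sharpe alone would not reveal this.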

Sources

  • Euan Sinclair, "Positional Option Trading" (Flirting with Models, S3E12), 2021-04-10 — backtesting rigor, parameter selection from economic rationale
  • Euan Sinclair, "Find Edge and Trade Volatility," Outlier Podcast, 2022-11-08 — multiple testing bias, dataset contamination
  • "What is Signal Timing Luck? (Regime Filters)," YouTube, 2025-11-07 — parameter fragility test, sensitivity analysis
  • Giuseppe Paleologo, "Multi-Manager Hedge Funds" (Flirting with Models, S7E11), 2024-09-02 — institutional-grade research process, factor validation frameworks