The rigorous process of testing systematic strategies on historical data in a way that produces reliable out-of-sample performance estimates — including walk-forward testing methodology, avoidance of look-ahead bias, realistic cost modeling, and explicit acknowledgment of multiple-comparison bias when testing many variations.
The practitioner keeps a strict separation between in-sample (parameter selection) and out-of-sample (performance evaluation) periods. Parameters are chosen on economic rationale, not backtest optimization: round numbers and economically motivated lookbacks are preferred over optimized values. All sources of look-ahead bias are identified and eliminated before any result is trusted. Transaction costs include bid-ask spread, market impact, and borrowing costs for short positions, not just commission. Multiple tests on the same data are treated as a source of statistical inflation, and results are discounted accordingly.
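The in-sample / out-of-sample separation can be sketched as a rolling walk-forward split. This is a minimal illustration, not a prescribed implementation: the function name `walk_forward_splits` and the window sizes are hypothetical, and the sketch assumes observations are already sorted in time order.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (in_sample, out_of_sample) index ranges that roll forward in time.

    Each out-of-sample block starts strictly after its in-sample block ends,
    so nothing from the evaluation window can leak into parameter selection.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        in_sample = range(start, start + train_size)
        out_of_sample = range(start + train_size, start + train_size + test_size)
        yield in_sample, out_of_sample
        start += test_size  # advance by one out-of-sample block


# Hypothetical sizes: 1000 observations, 500 in-sample, 100 out-of-sample.
splits = list(walk_forward_splits(n_obs=1000, train_size=500, test_size=100))
for in_sample, out_of_sample in splits:
    print(f"select on [{in_sample[0]}, {in_sample[-1]}], "
          f"evaluate on [{out_of_sample[0]}, {out_of_sample[-1]}]")
```

Only the concatenated out-of-sample blocks should feed the reported performance estimate; the in-sample blocks exist solely for choosing parameters.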
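When 100 independent strategy variations are tested on the same data, the best of them is expected to score about 2.5 standard errors above zero even if every variation is pure noise. Because the standard error of an annualized Sharpe ratio near zero is roughly 1/sqrt(T) for T years of data, this corresponds to an expected best Sharpe of about 2.5 on one year of data and about 0.8 on ten years. A backtest Sharpe of 1.0, which most practitioners would celebrate, can therefore sit at or below the noise level of 100 trials when the history is short. The number of trials directly inflates the apparent quality of the best result, and few practitioners adjust for it.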
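The figure of roughly 2.5 is the expected maximum of 100 independent standard normal draws. The sketch below checks it two ways: with a closed-form approximation based on the Euler-Mascheroni constant (a commonly used extreme-value approximation, swapped in here for illustration) and with a small simulation. The function names are hypothetical, trials are assumed independent, and the conversion to an annualized Sharpe assumes a standard error of about 1/sqrt(years), which holds only for Sharpe ratios near zero.

```python
import math
import random
from statistics import NormalDist, mean

EULER_GAMMA = 0.5772156649015329


def expected_max_z(n_trials: int) -> float:
    """Approximate E[max] of n_trials independent N(0, 1) draws (needs n_trials >= 2)."""
    nd = NormalDist()
    return ((1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
            + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e)))


def expected_max_sharpe(n_trials: int, years: float) -> float:
    """Scale by the assumed standard error of an annualized Sharpe, 1/sqrt(years)."""
    return expected_max_z(n_trials) / math.sqrt(years)


def simulated_max_z(n_trials: int, n_sims: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    return mean(
        max(rng.gauss(0.0, 1.0) for _ in range(n_trials)) for _ in range(n_sims)
    )


analytic = expected_max_z(100)
simulated = simulated_max_z(100)
print(f"expected best z-score over 100 noise trials: "
      f"analytic {analytic:.2f}, simulated {simulated:.2f}")
for years in (1, 5, 10):
    print(f"{years:>2} years of data -> noise-level Sharpe "
          f"about {expected_max_sharpe(100, years):.2f}")
```

A backtest Sharpe is only interesting to the extent that it clears this noise level for the number of variations actually tried, including the ones that were discarded along the way.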
A simple but devastating test for contamination through multiple testing: randomly permute the strategy's entry/exit rules and run the permuted versions on the same data. If the nonsense versions also produce positive backtested returns comparable to the real strategy's, the result cannot be distinguished from noise or from plain market drift, and no positive result from that research process should be trusted as signal. This test is almost never run.
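One way to make the permutation idea concrete is to shuffle the strategy's position series in time and compare the real result against the shuffled distribution. This is a sketch under stated assumptions, not the only way to permute a rule set: positions are assumed to be already aligned with the returns they earn, the helper names are hypothetical, and the demo data is synthetic. Shuffling keeps the same mix of long and short exposure, so average market drift is controlled for.

```python
import random


def strategy_mean_return(positions, returns):
    """Average per-period return of holding positions[t] over returns[t]."""
    return sum(p * r for p, r in zip(positions, returns)) / len(returns)


def permutation_p_value(positions, returns, n_perm=1000, seed=0):
    """Share of time-shuffled position series that do at least as well as the real one."""
    rng = random.Random(seed)
    actual = strategy_mean_return(positions, returns)
    shuffled = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys timing, preserves exposure mix
        if strategy_mean_return(shuffled, returns) >= actual:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Synthetic demo: zero-drift noise returns.
rng = random.Random(42)
returns = [rng.gauss(0.0, 0.01) for _ in range(500)]

# A deliberately cheating signal that peeks at the same-period return,
# used only to show what genuine timing information looks like.
informed = [1 if r > 0 else -1 for r in returns]
# A coin-flip signal with no information at all.
uninformed = [rng.choice([-1, 1]) for _ in returns]

p_informed = permutation_p_value(informed, returns)
p_uninformed = permutation_p_value(uninformed, returns)
print(f"informed signal p-value:   {p_informed:.4f}")
print(f"uninformed signal p-value: {p_uninformed:.4f}")
```

A real strategy whose p-value looks like the uninformed case has not demonstrated any timing skill on that dataset, whatever its headline return.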
A strategy backtested on 2010-2020 data can look robust on aggregate metrics yet have nearly all of its alpha concentrated in the low-vol, QE-driven regime of that period. Aggregate backtest metrics (overall Sharpe, max drawdown) hide regime concentration. Only by separately evaluating performance in high-vol vs. low-vol, trending vs. mean-reverting, and inflationary vs. deflationary periods can the regime specificity of a strategy be detected.
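The regime breakdown can be sketched as a per-regime Sharpe calculation. The example assumes each return already carries a regime label (how the labels are produced, for instance from trailing realized volatility, is left open), and the function names and toy numbers are hypothetical, chosen only to make the effect visible.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe of per-period returns; NaN if it cannot be computed."""
    if len(returns) < 2:
        return float("nan")
    sd = stdev(returns)
    if sd == 0:
        return float("nan")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def sharpe_by_regime(returns, regimes, periods_per_year=252):
    """Group returns by regime label and compute a Sharpe ratio per group."""
    buckets = defaultdict(list)
    for r, label in zip(returns, regimes):
        buckets[label].append(r)
    return {label: annualized_sharpe(rs, periods_per_year)
            for label, rs in buckets.items()}


# Toy data: all of the edge sits in the low-vol regime.
low_vol_returns = [0.003, -0.001] * 250   # small swings, positive mean
high_vol_returns = [0.02, -0.02] * 250    # large swings, zero mean
returns = low_vol_returns + high_vol_returns
regimes = ["low_vol"] * 500 + ["high_vol"] * 500

overall = annualized_sharpe(returns)
per_regime = sharpe_by_regime(returns, regimes)
print(f"overall Sharpe: {overall:.2f}")
for label, value in per_regime.items():
    print(f"{label:>8} Sharpe: {value:.2f}")
```

The aggregate number is positive, but the breakdown shows the strategy earns nothing in the high-vol regime, which is exactly the concentration the overall Sharpe hides.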