Alternative Data Sourcing and Infrastructure

model-building | Level 2 (Intermediate)

What It Is

Alternative data sourcing is the discipline of identifying, acquiring, normalizing, and maintaining data sets that provide informational edge — including point-in-time correctness, symbology alignment, revision handling, and integration into a research pipeline.

Correct Execution

  • Maintain point-in-time correctness for all data: every timestamp must reflect when a value was actually observable to the market, not the period it describes or the date of a later revision
  • Handle corporate actions, restatements, and survivorship bias as first-class requirements, not afterthoughts
  • Match symbology consistently across data sets — rolling futures, corporate identifiers, and currency conventions differ between providers
  • Build a data pipeline that shortens time-to-alpha: the researcher should be testing ideas in hours, not rebuilding data infrastructure for weeks
  • Alpha is increasingly upstream: the edge is in the data pipeline, not in the model architecture
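The point-in-time requirement in the list above can be sketched as an as-of join. This is a minimal illustration using pandas, not a description of any particular vendor's schema: the table layout, the column names (`period_end`, `known_at`, `as_of`), the ticker `ABC`, and the helper `latest_known` are all hypothetical.

```python
import pandas as pd

# Hypothetical versioned fundamentals: one row per published version of a value.
# "period_end" is the period the number describes; "known_at" is when that
# version became observable (an assumed column name, not a vendor standard).
fundamentals = pd.DataFrame({
    "ticker": ["ABC", "ABC"],
    "period_end": pd.to_datetime(["2023-12-31", "2023-12-31"]),
    "known_at": pd.to_datetime(["2024-02-15", "2024-06-01"]),
    "eps": [1.00, 0.80],  # the second row is a later restatement
})

# Dates on which a backtest would have generated a signal.
signal_dates = pd.DataFrame({
    "ticker": ["ABC", "ABC", "ABC"],
    "as_of": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-07-01"]),
})


def latest_known(fundamentals, signal_dates):
    """For each (ticker, as_of), attach the most recently published row
    whose known_at is on or before as_of. Rows with nothing yet published
    come back with missing values."""
    left = signal_dates.sort_values("as_of")
    right = fundamentals.sort_values("known_at")
    return pd.merge_asof(
        left,
        right,
        left_on="as_of",
        right_on="known_at",
        by="ticker",
        direction="backward",
    )


pit = latest_known(fundamentals, signal_dates)
print(pit[["ticker", "as_of", "known_at", "eps"]])
```

The join keys on `known_at` rather than `period_end`, so the first signal date sees no value at all and the restated figure only appears after its publication date. A real pipeline would also need a rule for late restatements of older periods, which this sketch does not address.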

Progression Levels

Diagnostic Tree

Coaching Cues

  • "The magic isn't in the neural net or the XGBoost. It's in the data pipelines." — Angana Jacob, FWM S7E26
  • "Two firms can run the same model. The one with cleaner data and better feature construction should outperform. The competitive edge is upstream." — Angana Jacob, FWM S7E26
  • "Markets today are highly conditional — alpha comes from understanding WHEN a signal works, not just WHETHER it works. That requires data that spans domains." — Angana Jacob, FWM S7E26

Common Errors

  1. Building models before validating data quality: Research effort spent on a bad data set is wasted, and problems discovered late in the process cost more to fix → Audit data quality before building any signal; fix the data before optimizing the model
  2. Treating data revision histories as irrelevant: Fundamental data is frequently revised; using as-revised data for backtesting creates massive look-ahead bias → Use only as-reported, point-in-time data for all fundamentals-based backtesting
  3. Conflating more data with better data: Adding five poorly integrated data sets does not improve alpha → Each new data set needs proper point-in-time handling, symbology alignment, and validation before it adds value
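The "audit before modeling" step in the errors above can start as a handful of mechanical checks. The sketch below is one possible starting point, assuming a vendor table with hypothetical columns `vendor_id`, `period_end`, `known_at`, and `value`, plus a hypothetical `id_map` from vendor identifiers to internal security ids. The checks and their names are illustrative, not a complete audit.

```python
import pandas as pd


def audit_dataset(df, id_map):
    """Return simple data-quality counts for a vendor table.

    Assumed (hypothetical) columns: vendor_id, period_end, known_at, value.
    id_map maps vendor identifiers to an internal security id.
    """
    key = ["vendor_id", "period_end", "known_at"]
    return {
        "rows": len(df),
        # No observability timestamp: cannot be used point-in-time.
        "missing_known_at": int(df["known_at"].isna().sum()),
        # Published before the period ended: suspicious for reported results.
        "known_before_period_end": int((df["known_at"] < df["period_end"]).sum()),
        # The same version delivered more than once.
        "duplicate_versions": int(df.duplicated(subset=key).sum()),
        # Identifiers that do not align with the internal symbology.
        "unmapped_ids": int((~df["vendor_id"].isin(list(id_map))).sum()),
    }


# Small made-up sample with one instance of each problem.
sample = pd.DataFrame({
    "vendor_id": ["ABC US", "ABC US", "XYZ LN", "OLD CO"],
    "period_end": pd.to_datetime(["2023-12-31"] * 4),
    "known_at": pd.to_datetime(["2024-02-15", "2024-02-15", None, "2023-12-01"]),
    "value": [1.0, 1.0, 2.0, 3.0],
})
id_map = {"ABC US": 101, "XYZ LN": 202}

report = audit_dataset(sample, id_map)
print(report)
```

Running a report like this on every new data set before any signal work gives a cheap gate: nonzero counts are questions for the vendor or the ingestion code, not for the model.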

Edges

🔑 Hidden Causal Lever

The Competitive Edge Has Moved Upstream — Data Infrastructure Is Now the Moat

Two quant firms with access to the same alternative data sets and running the same signal architecture will diverge in performance based on data infrastructure quality — specifically, point-in-time correctness, corporate action handling, and symbology alignment. The modeling layer is commoditized; everyone has access to the same ML techniques and factor frameworks. The infrastructure layer — getting data clean, aligned, and point-in-time correct faster than competitors — is where the durable edge now lives. Most quant firms dramatically underinvest here relative to their investment in model architecture.

What most people do
Invest heavily in signal design, model architecture (neural nets, ensemble methods, gradient boosting); treat data infrastructure as engineering overhead; allocate more talent to modeling than to data quality.
What the best do
Build or acquire a data infrastructure that delivers pre-aligned, point-in-time correct, interoperable data sets; measure competitive advantage in terms of time-to-alpha (hypothesis to backtest in hours, not weeks).
Why it's an edge: Clean data with fast iteration generates more research output at lower cost than sophisticated models on dirty data. The firms that can test 10 hypotheses per day will discover more edges than those who take a week per hypothesis.
How to exploit: Audit your current time-to-alpha: pick a typical research hypothesis (e.g., "does earnings revision momentum predict forward returns in energy stocks?") and time the end-to-end process from idea to backtest result. If it takes more than 2 business days, the bottleneck is most likely data infrastructure rather than modeling skill. Fix the infrastructure first, then invest in model design.
Cross-domain parallel
In sports betting, the bettor who gets line access 10 minutes earlier than competitors has a significant edge on opening numbers — the data freshness advantage is the edge, not the model quality.
Angana Jacob, "Data as the True Competitive Moat," FWM S7E26, 2026-02-09

Conventional Wisdom Is Wrong

As-Reported vs As-Revised Data Is the Hidden Source of Backtest Illusion

Fundamental data (earnings, GDP, economic indicators) is frequently restated after initial release. A backtest run with "historical" fundamental data from a modern database is typically using the as-revised, corrected version, not the as-reported data that was available when the signal would have been generated. A strategy that shows a high Sharpe ratio on as-revised data may retain only a small fraction of it when restricted to information that was actually observable at each historical point. This is not a minor adjustment; it is often the entire apparent edge. Missing point-in-time correctness in fundamental data is the most common source of spurious backtest performance in systematic equity strategies.

What most people do
Download historical fundamental data from Compustat or Bloomberg; run backtests using this data; present results without noting the revision-history problem.
What the best do
Use databases that maintain revision history timestamps (point-in-time databases); run all backtests using only as-reported data; accept that the resulting performance will be lower than the as-revised version and treat the difference as a look-ahead bias estimate.
Why it's an edge: Quant researchers who correctly handle data revisions eliminate the most common source of false discovery in systematic factor research. Their results may look worse on paper but survive live trading — which is the actual goal.
How to exploit: For any earnings-based signal, identify whether your data source has revision history. Test: take a company that restated earnings significantly (e.g., GE, Enron, Lucent). Check what value appears in your database for their earnings in year T when you query in year T+5. If it's the restated value, not the original reported value, your database has look-ahead bias for this signal. Fix before publishing any results.
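The restatement test described above can be sketched against a versioned table. Everything here is made up for illustration: the tickers `ABC` and `XYZ`, the EPS values, the column names, and the helpers `value_as_of` and `restatement_gap`. It assumes the database exposes one row per published version with a `known_at` timestamp; a database that stores only the latest value cannot pass this test at all.

```python
import pandas as pd

# Hypothetical version history: ABC restated its 2019 EPS in 2021.
versions = pd.DataFrame({
    "ticker": ["ABC", "ABC", "XYZ"],
    "fiscal_year": [2019, 2019, 2019],
    "known_at": pd.to_datetime(["2020-02-20", "2021-03-05", "2020-02-25"]),
    "eps": [2.10, 1.40, 0.95],
})


def value_as_of(versions, ticker, fiscal_year, query_date):
    """Return the latest EPS version known on or before query_date, or None."""
    mask = (
        (versions["ticker"] == ticker)
        & (versions["fiscal_year"] == fiscal_year)
        & (versions["known_at"] <= pd.Timestamp(query_date))
    )
    known = versions.loc[mask].sort_values("known_at")
    return None if known.empty else float(known["eps"].iloc[-1])


def restatement_gap(versions):
    """Compare first-published and latest values per (ticker, fiscal_year)."""
    ordered = versions.sort_values("known_at")
    grouped = ordered.groupby(["ticker", "fiscal_year"])["eps"]
    out = pd.DataFrame({
        "as_reported": grouped.first(),
        "as_revised": grouped.last(),
    })
    out["gap"] = out["as_revised"] - out["as_reported"]
    return out


# What a backtest dated mid-2020 should see vs. what a query today returns.
seen_in_2020 = value_as_of(versions, "ABC", 2019, "2020-06-30")
seen_today = value_as_of(versions, "ABC", 2019, "2024-12-31")
gaps = restatement_gap(versions)
print(seen_in_2020, seen_today)
print(gaps)
```

If a backtest dated 2020 receives the value that `seen_today` returns, the source is leaking revisions into history. The `gap` column is a rough per-record view of how much a signal built on that field could be affected.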
Angana Jacob, FWM S7E26, 2026-02-09: "if revisions aren't timestamped properly, your backtest looks amazing until it goes live."

💎 Elite-Only Behavior

Alpha Is Conditional — "WHEN a Signal Works" Is More Valuable Than "WHETHER It Works"

Most factor research asks "does earnings momentum work?" and answers with an average return over a long backtest. The more useful question is "when does earnings momentum work?" — meaning, what market conditions, regimes, or cross-domain states predict above-average factor performance. Answering this conditional question requires data that spans multiple domains simultaneously (equity + macro + credit + rates). A signal that has average Sharpe of 0.3 may have Sharpe of 1.2 in the right regime and -0.2 in the wrong one. Conditioning on regime is the difference between a marginal edge and a compelling one.

What most people do
Evaluate signals on unconditional expected return; blend signals into a portfolio; rebalance mechanically.
What the best do
Build the data infrastructure that enables conditional signal evaluation; identify regime states where each signal is active vs inactive; apply signals only in the conditions where they have demonstrated conditional edge.
Why it's an edge: Conditional signal evaluation requires multi-domain data integration that most researchers lack. Building it creates a compounding advantage because each new data set doesn't just add a new signal — it creates new conditioning variables that multiply the value of all existing signals.
How to exploit: Take your best-performing systematic signal. Split the historical backtest into quartiles based on: (1) equity volatility level, (2) credit spread level, (3) yield curve slope. Measure signal Sharpe in each quartile. If the signal Sharpe in the top-performing quartile is 3x the average, you have a conditioning opportunity. Build the conditional signal, validate out-of-sample, and use it to turn off the signal in unfavorable regimes.
Cross-domain parallel
In sports betting, a model that works best in specific game contexts (dome teams in bad weather, home underdogs on short rest) has conditional edge — the condition is as valuable as the signal itself.
Angana Jacob, FWM S7E26, 2026-02-09 — "markets today are highly conditional — alpha comes from understanding WHEN a signal works."
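The quartile split described under "How to exploit" can be sketched as follows. This is a minimal illustration on synthetic data, not a validated research tool: the helper names, the `lag` parameter, the zero risk-free rate, and the 252-day annualization are all assumptions. The conditioning variable is lagged by one period so that each return is bucketed only by information available beforehand. Note that the quartile breakpoints are computed on the full sample, so the result is an in-sample diagnostic that still needs out-of-sample validation.

```python
import numpy as np
import pandas as pd


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (zero risk-free rate assumed)."""
    std = returns.std(ddof=1)
    if std == 0 or np.isnan(std):
        return float("nan")
    return float(returns.mean() / std * np.sqrt(periods_per_year))


def sharpe_by_quartile(signal_returns, conditioner, lag=1):
    """Bucket each return by the quartile of the lagged conditioning variable
    and compute the Sharpe ratio within each bucket."""
    df = pd.DataFrame({
        "ret": signal_returns,
        "cond": conditioner.shift(lag),  # only use information known beforehand
    }).dropna()
    df["quartile"] = pd.qcut(df["cond"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
    return df.groupby("quartile", observed=True)["ret"].apply(sharpe)


# Synthetic data: the signal only has drift when yesterday's volatility
# level was in roughly its top quartile. All numbers are made up.
rng = np.random.default_rng(0)
n = 1000
dates = pd.bdate_range("2015-01-01", periods=n)
vol_level = pd.Series(rng.uniform(10, 40, n), index=dates)
high_regime = (vol_level.shift(1) > 32.5).astype(float)
signal_returns = pd.Series(rng.normal(0, 0.01, n), index=dates) + 0.005 * high_regime

result = sharpe_by_quartile(signal_returns, vol_level)
print(result)
print("unconditional:", sharpe(signal_returns))
```

The same function can be reused with credit spread level or yield curve slope as the conditioner. Comparing each bucket against the unconditional Sharpe is the check the text describes; repeating the exercise on a held-out period guards against picking a regime that only looked good in sample.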

Sources

  • Angana Jacob, "Data as the True Competitive Moat," Flirting with Models S7E26, 2026-02-09 — full data infrastructure philosophy; point-in-time correctness; data as competitive moat; symbology in futures and options; research pipeline design
  • Katherine Glass-Hardenbergh, "All About Alternative Data," Flirting with Models S2E9, 2021-04-10 — alternative data taxonomy; evaluation framework for data vendors
  • Giuseppe Paleologo, "Quant Investing at Multi-Strat Hedge Funds," Odd Lots, 2025-06-23 — data infrastructure at multi-strategy hedge funds; alternative data in quant equity context