Training Data & the Low Sample Size Problem

Data Infrastructure | Level 2: Intermediate

What It Is

Football is an inherently low-sample-size sport for most individual skills. A striker might take 3 shots per game; over a 30-game season that's 90 shots, barely enough to profile finishing quality. A midfielder's long-range shots might total 15 in a season, far too few to support any conclusion about shooting ability. Training data, particularly from 11v11 sessions (~90 additional minutes per week), substantially increases the sample available for skill profiling. But training data carries a validity caveat: undefended training shots are not the same as defended game shots.
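A rough worked example shows why 90 shots is only barely enough. It assumes a true conversion rate of p = 0.10 (an illustrative figure, not taken from the source) and treats shots as independent trials, which is a simplification. Under those assumptions the standard error of an observed rate over n shots is:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \qquad
n = 90:\ \sqrt{\frac{0.10 \times 0.90}{90}} \approx 0.032, \qquad
n = 15:\ \sqrt{\frac{0.10 \times 0.90}{15}} \approx 0.077
```

At two standard errors, 90 shots leaves roughly 6 percentage points of uncertainty either side of a 10% rate, while 15 shots leaves about 15 points, which is larger than the rate itself.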

Correct Execution

Two approaches to the sample size problem: (1) aggregate multiple seasons of match data (3 seasons of Coutinho gives 226 long-range shots, which is usable but slow to accumulate); (2) supplement match data with training-level event data. When using training data, always filter by validity: only include training situations that approximate game conditions (e.g., 11v11 live play, not unopposed finishing drills). Report sample size prominently in any skill profile; flag profiles under 50 instances as unreliable.
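The filtering and flagging rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `ShotEvent` record, the session-type labels, and the `build_shot_profile` helper are all hypothetical names, and a real event feed will have its own schema. Only the 50-instance threshold and the "11v11 live play, not unopposed drills" rule come from the text.

```python
from dataclasses import dataclass

# Hypothetical session labels; a real event feed will use its own taxonomy.
VALID_SESSION_TYPES = {"match", "11v11_live"}
MIN_RELIABLE_SAMPLE = 50  # profiles under 50 instances are flagged unreliable


@dataclass
class ShotEvent:
    player: str
    session_type: str  # e.g. "match", "11v11_live", "unopposed_drill"
    is_goal: bool


def build_shot_profile(events, player):
    """Summarise a player's shots using only game-like situations."""
    valid = [
        e for e in events
        if e.player == player and e.session_type in VALID_SESSION_TYPES
    ]
    n = len(valid)
    goals = sum(e.is_goal for e in valid)
    return {
        "player": player,
        "n_shots": n,                      # always reported with the rate
        "goals": goals,
        "conversion": goals / n if n else None,
        "reliable": n >= MIN_RELIABLE_SAMPLE,
    }
```

Keeping `n_shots` and `reliable` in the same record as `conversion` makes it harder to quote the rate without its sample size.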

Coaching Cues

  • "Football gives you small samples. Training data is how you solve that — but only if the training data is valid." — Ted Knutson, 2018
  • "Sample size is part of the finding, not a footnote."

Common Errors

  1. Using a single season of data for rare-event metrics: One season of long-range shots is almost always too small. Aggregate multiple seasons or acknowledge the limitation.
  2. Treating all training data as equally valid: Unopposed drills, small-sided games, and 11v11 live play differ widely in how closely they approximate match conditions.
  3. Reporting rates without sample sizes: A 33% long-range conversion rate from 6 shots (2 goals) says almost nothing about a player's underlying ability.
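One way to avoid the third error is to attach an interval and the sample size to every reported rate. The sketch below uses the Wilson score interval, which is a standard choice for small binomial samples but is an assumption here, not a method named in the source. The function names `wilson_interval` and `format_rate` are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def format_rate(successes, n):
    """Render a rate together with its sample size and interval."""
    lo, hi = wilson_interval(successes, n)
    return (
        f"{successes}/{n} = {successes / n:.0%} "
        f"(95% CI {lo:.0%} to {hi:.0%}, n={n})"
    )


print(format_rate(2, 6))
```

For 2 goals from 6 shots the interval runs from roughly 10% to 70%, which makes the weakness of the headline rate visible at a glance.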

Sources

  • Ted Knutson, Barcelona Coach Analytics Summit, YouTube, 2018-11-18 — described the sample size constraint using Coutinho long-range shooting example (15 shots in one season → 226 over 3 seasons); cited NBA as having 2.5M training shots/year as the opposite extreme; emphasized validity filtering for training data