Filtering Training Data for Game-Valid Samples

Data Infrastructure | Level 3: Advanced

What It Is

Not all training data is equally predictive of match performance. An undefended finishing drill produces conversion rates 2-3x higher than those seen in defended match situations. Small-sided games create different decision contexts than 11v11 play. The "game validity coefficient" of a training drill measures how well performance in that drill predicts performance in match-equivalent situations. Before training data is used to build skill profiles, each data source must be filtered or weighted by its validity coefficient.

Correct Execution

Validity assessment process:

  1. Categorize training sessions by drill type (unopposed technical, small-sided game, 11v11 live play, set-piece practice).
  2. For each drill type, compute the correlation between drill performance and subsequent match performance on the same metric.
  3. Use this correlation as the validity coefficient.
  4. Weight training data by its validity coefficient before mixing it with match data.

High-validity data (11v11 live play) can be treated as nearly equivalent to match data. Low-validity data (unopposed drills) should be used only for technique assessment, not outcome prediction.
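Steps 1 to 3 of the process above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the record layout (drill type, a player's drill metric, the same player's later match metric), the function names, and the choice to clamp negative correlations to zero are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation means no measurable relationship
    return cov / sqrt(var_x * var_y)


def validity_coefficients(records):
    """Compute one validity coefficient per drill type.

    records: iterable of (drill_type, drill_metric, match_metric),
    one row per player (hypothetical layout).
    """
    by_type = defaultdict(lambda: ([], []))
    for drill_type, drill_metric, match_metric in records:
        by_type[drill_type][0].append(drill_metric)
        by_type[drill_type][1].append(match_metric)
    # Assumption: a drill that is negatively correlated with match
    # performance gets weight 0 rather than a negative weight.
    return {
        drill_type: max(0.0, pearson(xs, ys))
        for drill_type, (xs, ys) in by_type.items()
    }
```

In practice the correlation for each drill type would be estimated from many players and sessions; with only a handful of rows the coefficient is too noisy to use as a weight.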

Coaching Cues

  • "90% in an unopposed drill is not 90% in a game. Know the conversion factor." — Ted Knutson, 2018
  • "Validity isn't binary — it's a spectrum. Weight accordingly."

Common Errors

  1. Mixing high- and low-validity training data without weighting: an unopposed shot should not count the same as a match shot, so 1000 unopposed shots must not be allowed to swamp 100 match shots in a finishing profile.
  2. Assuming validity is constant across players: Some players are remarkably consistent between training and matches; others diverge significantly.
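The first error can be made concrete with a small sketch of validity-weighted pooling. The shot counts, the 0.05 weight for unopposed drills, and the function name are hypothetical values chosen for illustration; in line with the second error, the weights could just as well be looked up per player rather than per drill type.

```python
def pooled_rate(sources, weights):
    """Validity-weighted conversion rate across data sources.

    sources: list of (source_type, successes, attempts)
    weights: dict mapping source_type to a validity coefficient in [0, 1]
    """
    weighted_successes = 0.0
    weighted_attempts = 0.0
    for source_type, successes, attempts in sources:
        w = weights[source_type]
        weighted_successes += w * successes
        weighted_attempts += w * attempts
    if weighted_attempts == 0:
        raise ValueError("all sources have zero weight or zero attempts")
    return weighted_successes / weighted_attempts


# Hypothetical player: 1000 unopposed shots at 90%, 100 match shots at 12%.
sources = [("unopposed", 900, 1000), ("match", 12, 100)]

# Unweighted mixing: the drill data dominates the estimate.
naive = pooled_rate(sources, {"unopposed": 1.0, "match": 1.0})

# Assumed validity coefficient of 0.05 for unopposed drills.
weighted = pooled_rate(sources, {"unopposed": 0.05, "match": 1.0})
```

With these made-up numbers the unweighted estimate is about 0.83, while the weighted one is about 0.38. Down-weighting shrinks the drill data's influence but does not remove its upward bias, which is why low-validity data is better kept out of outcome prediction altogether.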

Sources

  • Ted Knutson, Barcelona Coach Analytics Summit, YouTube, 2018-11-18. Described training data validity filtering using an NBA example (2.5M training shots, yet the best players drop from 90-95% in training to 40-45% when defended in games); emphasized validity filtering as essential before mixing training and match data.