Assertion-Based Model Validation

Data Infrastructure | Level 3: Advanced

What It Is

Assertion-based model validation means building suites of ordered assertions for football analytics models. Instead of checking only top-line accuracy, the suite verifies that the model respects known football truths. Example: all else equal, a penalty-area shot should always have a higher expected-goals (xG) value than the same shot from 30 yards. These tests catch model failures that aggregate accuracy metrics miss, automate the "eye test," and create a contract between the model and football domain expertise. Every time an analyst says "that number is stupid," the case goes in as a permanent test that runs on every retrain.

Correct Execution

Build a growing suite of ordered assertions: "all else equal, output A > output B." Start simple (close shot > far shot), then add context (press-breaking pass > safe back pass; completed pass through the lines > lateral pass in the same zone). Involve analysts and football people in writing assertions — they know what should be true even if they can't specify the exact numbers. Run the full suite on every model retrain. When a test fails, investigate: either the model has a bug, or the assertion was wrong — both are valuable conversations. Never train on the test cases (that defeats the purpose). Accumulate tests: every reported "stupid number" becomes a permanent test.
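A minimal sketch of one ordered assertion, assuming a model that maps a shot description to an xG value. The `Shot` fields, the `toy_xg` function and its coefficients, and the `assert_ordered` helper are all hypothetical stand-ins for illustration, not the API of any real xG model.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shot:
    distance_m: float        # distance to the centre of the goal, in metres
    angle_deg: float         # visible goal angle, in degrees
    is_header: bool = False


def toy_xg(shot: Shot) -> float:
    """Hypothetical stand-in for a trained xG model (made-up coefficients)."""
    z = -0.5 - 0.15 * shot.distance_m + 0.03 * shot.angle_deg
    if shot.is_header:
        z -= 0.8
    return 1.0 / (1.0 + math.exp(-z))


def assert_ordered(model, higher: Shot, lower: Shot, reason: str) -> None:
    """Ordered assertion: all else equal, model(higher) must exceed model(lower)."""
    hi, lo = model(higher), model(lower)
    assert hi > lo, f"{reason}: expected {hi:.3f} > {lo:.3f}"


# Build each pair by changing exactly one feature, so "all else equal" holds.
close = Shot(distance_m=10.0, angle_deg=30.0)
far = replace(close, distance_m=27.0)      # roughly 30 yards
header = replace(close, is_header=True)

assert_ordered(toy_xg, close, far, "close shot > far shot")
assert_ordered(toy_xg, close, header, "foot shot > header from the same spot")
```

Deriving the second shot from the first with `replace` keeps every other feature fixed, so a failure points at the one feature under test and not at an accidental difference between the two inputs.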

Coaching Cues

  • "Every time someone says 'that number is stupid,' put a test in." — StatsBomb CTO, 2019
  • "Keep your hand up if you always write unit tests." (Nobody does — that's the problem.)
  • "When model and intuition disagree, both get questioned."
  • "Automate the eye test."

Common Errors

  1. Training on test cases: If your model is optimized to pass assertions, the assertions lose their independent diagnostic value. Keep test cases separate from training data.
  2. Only testing at the top of the leaderboard: "Messi is #1" is not a useful test — it will always pass. Test the model's understanding of football situations, not its ranking of famous players.
  3. Not accumulating tests: Tests are only valuable as a growing library. Every retrain should run the full historical suite.
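One narrow guard against the first error can be sketched as a leakage check that flags assertion inputs which also appear verbatim in the training data. The function name and the feature-row layout are assumptions made for illustration. An exact-match check like this does not detect a model that was tuned by hand to pass the suite, so it complements the discipline described above and does not replace it.

```python
from typing import Hashable, Iterable, List


def leaked_cases(
    assertion_inputs: Iterable[Hashable],
    training_rows: Iterable[Hashable],
) -> List[Hashable]:
    """Return assertion inputs that also appear verbatim in the training data."""
    training = set(training_rows)
    return [row for row in assertion_inputs if row in training]


# Hypothetical feature rows: (distance_m, angle_deg, is_header)
training_rows = [(12.0, 25.0, False), (27.0, 10.0, False), (8.0, 40.0, True)]
assertion_inputs = [(10.0, 30.0, False), (27.0, 10.0, False)]

leaks = leaked_cases(assertion_inputs, training_rows)
if leaks:
    print(f"{len(leaks)} assertion case(s) also appear in the training set: {leaks}")
```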

Edges

💎 Elite-Only Behavior

Every "Stupid Number" a Coach Reports Should Become a Permanent Test Case

The most valuable model validation isn't top-line accuracy metrics — it's ordered assertions that verify the model respects known football truths. "A penalty-area shot should always have higher xG than an identical shot from 30 yards." These behavioral tests catch failures that aggregate accuracy misses, automate the "eye test," and create a permanent contract between the model and domain expertise. The key practice: every time an analyst or coach says "that number is stupid," the fix goes in as a permanent test.

What most people do: Validate models on aggregate accuracy (log-loss, AUC) and spot-check a few outputs visually.
What the best do: Build a growing library of ordered assertions ("all else equal, A > B") written by domain experts and run on every retrain. Treat every "stupid number" report as a test case to be codified permanently. Never train on the test cases. The library only grows; it never shrinks.
Why it's an edge: Models degrade silently during retraining. A well-maintained assertion suite catches regressions that aggregate metrics miss because the regression might be confined to a specific scenario (like direct corners) that barely moves the aggregate number but produces absurd individual outputs. The assertion library is institutional knowledge that prevents the same mistake twice.
How to exploit: Start an assertion library today. Ask coaches and analysts: "What should always be true about this model's output?" Write each answer as a test. Run the full suite on every model update. Track which tests fail most often — those reveal systematic model weaknesses.
StatsBomb CTO, StatsBomb Innovation in Football Conference, 2019-10-25. "Keep your hand up if you always write unit tests" — nobody does, and that's the problem.
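Tracking which tests fail most often can be sketched as a failure counter accumulated across retrains. Everything named here is hypothetical: the two checks, the `(distance_m, is_header)` model signature, and the three toy "retrains", two of which are deliberately broken to show the counting.

```python
from collections import Counter


def check_close_beats_far(model):
    assert model(10.0, False) > model(27.0, False)


def check_foot_beats_header(model):
    assert model(10.0, False) > model(10.0, True)


# The suite is a list of (name, check) pairs that only ever grows.
SUITE = [
    ("close_beats_far", check_close_beats_far),
    ("foot_beats_header", check_foot_beats_header),
]


def run_suite(model, suite):
    """Run every check against the model and return the names that failed."""
    failed = []
    for name, check in suite:
        try:
            check(model)
        except AssertionError:
            failed.append(name)
    return failed


# Toy retrains taking (distance_m, is_header); v2 and v3 are deliberately broken.
retrains = {
    "v1": lambda d, h: (0.5 if h else 1.0) / (1.0 + d),    # sensible
    "v2": lambda d, h: 0.1,                                # ignores every feature
    "v3": lambda d, h: (0.5 if h else 1.0) * d / 100.0,    # value grows with distance
}

failure_counts = Counter()
for version, model in retrains.items():
    failure_counts.update(run_suite(model, SUITE))

print(failure_counts.most_common())
```

In this toy history the distance check fails in two of the three retrains and the header check in one, so the distance feature would be the first place to look for a systematic weakness.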
Conventional Wisdom Is Wrong

Most Football Models Are Never Validated Against Known Football Facts — And Many Fail Basic Sanity Checks

A model can have excellent log-loss on held-out data and still produce results that violate basic football knowledge. If your event valuation model says a penalty-area shot is worth less than a midfield pass, or that headers are more valuable than shots taken with the foot from the same location, the model has learned a statistical artifact, not football. Behavioral assertion tests, which verify that model outputs match known football truths, catch the failures that standard ML metrics miss.

What most people do: Validate models using statistical metrics (AUC, log-loss, calibration) on held-out data and declare them ready.
What the best do: After standard validation, run a suite of behavioral assertion tests: penalty-area shots > midfield shots, free kicks > open-play shots from the same distance, 1v1s > shots through a congested box. Any failure means either that the model has learned a spurious pattern or that the assertion itself needs revisiting. These tests are written like unit tests and run on every model update.
Why it's an edge: A model that passes statistical validation but fails behavioral assertions will produce recommendations that coaches immediately reject as nonsensical. This destroys trust in analytics — a single "the model says midfield passes are more valuable than shots" finding can undermine years of credibility-building.
How to exploit: Build a football-specific behavioral test suite (20-30 assertions). Run it on every model before deploying. Treat any failure as a blocker. This catches problems before they reach stakeholders.
StatsBomb CTO, StatsBomb Innovation in Football Conference, 2019-10-25. Behavioral assertion testing for SARSA model validation.
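The "written like unit tests, treat any failure as a blocker" workflow could look like the following sketch, which uses Python's standard `unittest` module. The `candidate_model` function, its values, the situation dictionaries, and `safe_to_deploy` are hypothetical placeholders for whatever model and deployment step a team actually has.

```python
import unittest


def candidate_model(situation: dict) -> float:
    """Hypothetical stand-in for the model under test; the values are made up."""
    value = {"penalty_area": 0.15, "midfield": 0.02}[situation["zone"]]
    if situation.get("one_v_one"):
        value *= 2.0
    return value


# Each case: (description, situation expected higher, situation expected lower)
ORDERED_CASES = [
    ("penalty-area shot > midfield shot",
     {"zone": "penalty_area"}, {"zone": "midfield"}),
    ("1v1 > congested shot",
     {"zone": "penalty_area", "one_v_one": True}, {"zone": "penalty_area"}),
]


class BehavioralAssertions(unittest.TestCase):
    def test_ordered_cases(self):
        for description, higher, lower in ORDERED_CASES:
            # subTest reports every failing case, not just the first one.
            with self.subTest(description):
                self.assertGreater(candidate_model(higher), candidate_model(lower))


def safe_to_deploy() -> bool:
    """Deployment gate: a single failed assertion blocks the release."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BehavioralAssertions)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("deploy" if safe_to_deploy() else "blocked")
```

Keeping the cases in a plain list means a newly reported "stupid number" becomes one extra tuple, with no new test code to write.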

Sources

  • StatsBomb CTO, StatsBomb Innovation in Football Conference, YouTube, 2019-10-25 — described building assertion-based test suites for xG and EPV models, involving analysts in assertion writing, and using test failure as a conversation between model and domain expertise