This practice builds ordered-assertion test suites for football analytics models: instead of checking top-line accuracy, the tests verify that the model respects known football truths. Example: all else equal, a penalty-area shot should always have higher xG than an identical shot from 30 yards. These tests catch model failures that aggregate accuracy metrics miss, automate the "eye test," and create a contract between the model and football domain expertise. Every time an analyst says "that number is stupid," the offending case goes in as a permanent test that runs on every retrain.
Build a growing suite of ordered assertions of the form "all else equal, output A > output B." Start simple (close shot > far shot), then add context (press-breaking pass > safe back pass; completed pass through the lines > lateral pass in the same zone). Involve analysts and football people in writing assertions: they know what should be true even if they can't specify the exact numbers. Run the full suite on every model retrain. When a test fails, investigate: either the model has a bug or the assertion was wrong, and both are valuable conversations. Never train on the test cases: a model that has seen them can pass by memorizing those cases without learning the underlying relationship, which defeats the purpose. Accumulate tests: every reported "stupid number" becomes a permanent test.
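A minimal sketch of such a suite in Python, under stated assumptions. The names `Shot`, `OrderedAssertion`, `toy_xg`, and `run_suite` are hypothetical, and `toy_xg` is a hand-written stand-in with illustrative coefficients, not a real xG model. In practice the retrained model's prediction function would be passed to `run_suite` instead.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Shot:
    distance_yards: float   # distance from the goal
    angle_degrees: float    # opening angle to the goal mouth


@dataclass(frozen=True)
class OrderedAssertion:
    name: str
    higher: Shot            # scenario that must score strictly higher
    lower: Shot             # scenario that must score strictly lower
    reason: str             # the football argument, in plain words


def toy_xg(shot: Shot) -> float:
    """Stand-in for a trained xG model (assumed, illustrative coefficients)."""
    z = 1.2 - 0.12 * shot.distance_yards + 0.02 * shot.angle_degrees
    return 1.0 / (1.0 + math.exp(-z))


BASE = Shot(distance_yards=10.0, angle_degrees=35.0)

# The suite only grows: each reported "stupid number" is appended as a new case.
SUITE: List[OrderedAssertion] = [
    OrderedAssertion(
        name="close_shot_beats_far_shot",
        higher=BASE,
        lower=replace(BASE, distance_yards=30.0),  # all else equal
        reason="penalty-area shot vs. identical shot from 30 yards",
    ),
    # Hypothetical example of a case added after an analyst report.
    OrderedAssertion(
        name="open_angle_beats_tight_angle",
        higher=BASE,
        lower=replace(BASE, angle_degrees=5.0),
        reason="same distance, but almost no goal to aim at",
    ),
]


def run_suite(model: Callable[[Shot], float],
              suite: List[OrderedAssertion]) -> List[str]:
    """Return one message per violated assertion; an empty list means all pass."""
    failures = []
    for case in suite:
        hi, lo = model(case.higher), model(case.lower)
        if not hi > lo:
            failures.append(
                f"{case.name}: expected {hi:.3f} > {lo:.3f} ({case.reason})"
            )
    return failures


if __name__ == "__main__":
    for message in run_suite(toy_xg, SUITE):
        print("FAIL", message)
```

One way to use this is to call `run_suite` as a gate after every retrain and block the release when the returned list is non-empty. Keeping the `reason` next to each case means a failure message already carries the football argument needed for the "model bug or wrong assertion" conversation.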
The most valuable model validation isn't top-line accuracy metrics — it's ordered assertions that verify the model respects known football truths. "A penalty-area shot should always have higher xG than an identical shot from 30 yards." These behavioral tests catch failures that aggregate accuracy misses, automate the "eye test," and create a permanent contract between the model and domain expertise. The key practice: every time an analyst or coach says "that number is stupid," the fix goes in as a permanent test.
A model can have excellent log-loss on held-out data yet produce outputs that violate basic football knowledge. If your event valuation model says a penalty-area shot is worth less than a midfield pass, or that headers are worth more than shots taken with the foot from the same location, the model has learned a statistical artifact, not football. Behavioral assertion tests, which verify that model outputs match known football truths, catch the failures that standard ML metrics miss.
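The header example can be checked across many shots at once by flipping a single feature and holding everything else fixed. The sketch below is illustrative only: `violations`, `sane_model`, and `artifact_model` are hypothetical names, both models are hand-written stand-ins with made-up coefficients, and `SHOTS` is a small synthetic sample rather than real data.

```python
import math
from typing import Callable, Dict, List

Shot = Dict[str, float]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sane_model(shot: Shot) -> float:
    """Stand-in model where headers are worth less than foot shots."""
    return _sigmoid(0.5 - 0.10 * shot["distance_yards"] - 0.6 * shot["is_header"])


def artifact_model(shot: Shot) -> float:
    """Stand-in model that has learned the wrong sign for headers."""
    return _sigmoid(0.5 - 0.10 * shot["distance_yards"] + 0.6 * shot["is_header"])


def violations(model: Callable[[Shot], float], shots: List[Shot],
               feature: str, higher_value: float, lower_value: float) -> List[Shot]:
    """Return the shots where setting `feature` to `higher_value` does not
    score strictly above setting it to `lower_value`, all else equal."""
    bad = []
    for shot in shots:
        hi = model({**shot, feature: higher_value})
        lo = model({**shot, feature: lower_value})
        if not hi > lo:
            bad.append(shot)
    return bad


# Synthetic sample of shot locations (assumed values, for illustration only).
SHOTS: List[Shot] = [
    {"distance_yards": float(d), "is_header": 0.0} for d in (6, 10, 14, 18)
]

if __name__ == "__main__":
    # Football truth under test: foot (is_header=0) beats header (is_header=1)
    # from the same location.
    for name, model in [("sane", sane_model), ("artifact", artifact_model)]:
        bad = violations(model, SHOTS, "is_header", 0.0, 1.0)
        print(f"{name}: {len(bad)} of {len(SHOTS)} shots violate the ordering")
```

Reporting the violating shots rather than a single pass/fail flag shows where in the feature space the artifact lives, which helps when deciding whether the model or the assertion is at fault.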