SARSA (state-action-reward-state-action) is a reinforcement learning approach to valuing every event in a match by iteratively propagating goal rewards backward through event sequences. Unlike EPV (Expected Possession Value, which uses Markov state transitions within possessions), SARSA treats the entire match as one continuous sequence, which removes the need for possession boundaries or arbitrary temporal horizons. Value spreads one link at a time over successive iterations: shots are valuable because they lead to goals, certain passes are valuable because they lead to shots, and tackles are valuable because they lead to passes. None of this requires defining "possession" or setting a 10-event cutoff.
Start by assigning a reward of 1 to goals (optionally, also assign xG values to shots to speed convergence). Train a predictive model on the dataset. After each training pass, apply the SARSA update: project a small fraction of each event's value backward onto the event that preceded it, so that value flows from high-value events to their predecessors. Retrain on the updated values, and repeat until the values converge.
The neural network architecture should include an LSTM layer that takes sequences of roughly 10 events as temporal context, so the model can learn, for example, that a pass received after a through ball is different from a pass received after a lateral pass. This window is input context only; it does not limit how far rewards propagate. The output layer should predict three classes: the probability that the home team scores next, that the away team scores next, and that nobody scores next. This three-outcome structure lets defensive intent (minimizing the opponent's "scores next" probability) be modeled separately from attacking intent.
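The backward propagation described above can be illustrated with a toy sketch. It makes several simplifying assumptions that are not part of the original method: a plain lookup table stands in for the retrained neural network (so there is no LSTM and no generalization across similar events), each event's value is a three-element vector [home scores next, away scores next, nobody scores next], and the step size `alpha` and the names `initial_values` and `sarsa_sweep` are hypothetical choices made for illustration.

```python
HOME, AWAY, NONE = 0, 1, 2  # index of each outcome in a value vector


def initial_values(scorers):
    """One-hot start: goals point at the scoring team, all other events at NONE."""
    values = []
    for scorer in scorers:
        vec = [0.0, 0.0, 0.0]
        vec[NONE if scorer is None else scorer] = 1.0
        values.append(vec)
    return values


def sarsa_sweep(values, scorers, alpha=0.5):
    """One synchronous pass: move each non-goal event toward its successor.

    Assumption: the update is v[t] <- v[t] + alpha * (v[t+1] - v[t]),
    applied to each of the three outcome probabilities.
    """
    end_of_match = [0.0, 0.0, 1.0]  # after the final event nobody scores next
    updated = []
    for t, current in enumerate(values):
        if scorers[t] is not None:
            updated.append(list(current))  # goals keep their fixed reward
            continue
        nxt = values[t + 1] if t + 1 < len(values) else end_of_match
        updated.append([c + alpha * (n - c) for c, n in zip(current, nxt)])
    return updated


# Hypothetical four-event sequence ending in a home goal.
events = ["tackle", "pass", "shot", "goal"]
scorers = [None, None, None, HOME]

values = initial_values(scorers)
for _ in range(3):
    values = sarsa_sweep(values, scorers)

for name, vec in zip(events, values):
    print(name, vec)
```

After one sweep only the shot has picked up value from the goal; the pass follows on the second sweep and the tackle on the third. With these toy settings the "home scores next" component after three sweeps is 0.875 for the shot, 0.5 for the pass, and 0.125 for the tackle. In the full method, a retraining step would sit between sweeps so the model can share what it learns across similar events.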
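Q-learning tries to find an optimal strategy, which presumes you can control the agents and have them try alternative actions. You cannot control football players retrospectively. SARSA is on-policy: it evaluates the strategy the players actually followed, which is why it, rather than Q-learning, is the valid approach for historical match data.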
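The difference shows up in the update target each algorithm bootstraps from. The sketch below uses the standard textbook forms of the two targets; the action names and value estimates are made-up numbers for illustration only, not figures from any dataset.

```python
def sarsa_target(reward, gamma, q_next, action_taken):
    """On-policy target: bootstrap from the action the player actually chose next."""
    return reward + gamma * q_next[action_taken]


def q_learning_target(reward, gamma, q_next):
    """Off-policy target: bootstrap from the best available next action."""
    return reward + gamma * max(q_next.values())


# Hypothetical value estimates for the two options open at the next event.
q_next = {"through_ball": 0.30, "lateral_pass": 0.05}

# In the recorded match the player played the lateral pass.
sarsa = sarsa_target(reward=0.0, gamma=1.0, q_next=q_next, action_taken="lateral_pass")
q_learning = q_learning_target(reward=0.0, gamma=1.0, q_next=q_next)

print(sarsa, q_learning)
```

Here the SARSA target credits the preceding event with the value of what was actually played, while the Q-learning target credits it with a through ball that never happened.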