Saturday, August 13, 2022

Statistical Inference vs Statcast Models

Since I am currently working on a Stuff model (written in Stan, with a writeup I hope to publish in their case studies documentation), I have been thinking a lot about the philosophical differences between models built with Statcast data and models built with game-level data.

In general, I feel like there is an implicit assumption that models built from Statcast data are always going to be better than ones built with just statistics. I think that viewpoint typically makes sense. Statcast data is more granular, which makes it more feasible to give individuals credit. Fielding statistics as a whole have benefited from Statcast data. It's really hard to properly attribute skill to fielders without granular data, and new statistics such as Catch Probability and OAA are really impressive pieces of work that show how important Statcast data is and how much we were missing before we had it.

However, I want to push back on the idea that Statcast models are always better than statistical models. I think this holds especially in pitching, which may be surprising. Two places where a Statcast model can be beaten by a well-calibrated statistical model are deception and pitch mix interaction. Deception is inherently visual and means different things to different people, which makes it hard to quantify. Furthermore, we run the risk of overfitting to small samples when we try to model it, which can make modelers overconfident in their perceived edge against others and lead to ruin.

Asking whether you want a good stats-only model or a good Statcast model is like the beer or tacos question. You want both, and ideally you want them to converge on the same prediction. Still, acknowledging the strengths and weaknesses of both approaches, rather than defaulting to one, is beneficial in the long run.
