Sunday, March 27, 2022

Introduction to Stan using Statcast Data

 This article was submitted to FanGraphs Community, but has not been picked up yet.

Over the past decade, Bayesian statistics has grown in popularity, and experience with it has become increasingly valuable. In this article, I want to give a brief tutorial on Stan, an open-source Bayesian probabilistic programming language that is popular with statisticians across many domains, including baseball.

The model that I will be estimating with Stan is a logistic regression, where I model whether a batted ball will be a home run based on launch angle and exit velocity. Obviously, there are many factors missing here, but for demonstration purposes I wanted to keep it simple and use just two predictors.

A Stan model has three core blocks: “Data,” “Parameters,” and “Model.” There are other blocks to do different things like transform data or transform parameters, but this article will stick to the basics.



The data block is where we describe the dataset we are working with in a generic way. The first entry is N, the number of observations in the dataset. One thing to note is that Stan is compiled to C++, so the user has to declare variable types. Since the number of observations is discrete and positive, N is an integer bounded below by 0. The next variable is home_run[N], our binary dependent variable, which records 1 when the observation is a home run and 0 otherwise. This variable is an integer bounded below by 0 and above by 1, which makes the set of possible values {0,1}. The final two variables, exit_velocity and launch_angle, are vectors of length N.
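A sketch of what the data block described above might look like (the variable names follow the article, but the exact code is my reconstruction, not the author's original):

```stan
data {
  int<lower=0> N;                    // number of batted balls
  int<lower=0, upper=1> home_run[N]; // 1 = home run, 0 = otherwise
  vector[N] exit_velocity;           // predictor: exit velocity
  vector[N] launch_angle;            // predictor: launch angle
}
```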

Bayesian modeling is computationally intensive, and it is important to think about how to write code that runs faster. To speed this model up, I use the “Transformed Data” block to standardize the independent variables to have a mean of zero and a standard deviation of one. With this block, the runtime for this model was around five minutes, whereas without it the model took around twenty minutes. This may seem trivial, but once you work on larger datasets this speed increase is massive.
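The standardization step could look something like this (again a hedged sketch; ev_std and la_std are names I chose, not necessarily the author's):

```stan
transformed data {
  // z-score each predictor: mean 0, standard deviation 1
  vector[N] ev_std = (exit_velocity - mean(exit_velocity)) / sd(exit_velocity);
  vector[N] la_std = (launch_angle - mean(launch_angle)) / sd(launch_angle);
}
```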

The parameters block declares all of the parameters in the regression model. The three parameters we need to estimate here are alpha, beta(exit_velocity), and beta(launch_angle). These are all real data types.
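In Stan, that block might read as follows (using beta_ev and beta_la as shorthand, matching the beta_ev name that appears in the output discussion later):

```stan
parameters {
  real alpha;    // intercept
  real beta_ev;  // coefficient on standardized exit velocity
  real beta_la;  // coefficient on standardized launch angle
}
```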

The final block is the model block. This is where we define our model, as well as priors for the parameters. Stan has built-in support for logistic regression, so we use the bernoulli_logit() function to set the distribution for the probability of hitting a home run.
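Putting it together, the model block might look like this, assuming the standardized predictors are named ev_std and la_std and using the priors described in the next paragraphs (this is my sketch of the structure, not the author's exact code):

```stan
model {
  // priors (see the discussion of prior choice below)
  alpha ~ normal(-3, 5);
  beta_ev ~ normal(0, 5);
  beta_la ~ normal(0, 5);

  // likelihood: logistic regression via the logit link
  home_run ~ bernoulli_logit(alpha + beta_ev * ev_std + beta_la * la_std);
}
```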

Setting priors for parameters is easy to program in Stan since there are prebuilt functions, but it can be difficult to know which distributions to use. For the alpha parameter, I used prior knowledge based on the data to create a moderately informed prior. In this dataset, around 5% of the batted balls were hit for home runs. For a batted ball with average exit velocity and launch angle (z-score of 0), we would not expect it to be a home run. Therefore, I chose the alpha prior to be normally distributed with a mean of -3 (on the logit scale, p ≈ 0.05 converts to a logit of about -3) and a standard deviation of 5. For the beta parameters, I did not have a good idea about what the prior should be, so I chose normal(0,5) for beta(exit_velocity) and beta(launch_angle), which is weakly informative. An uninformed prior would be uniform(0,5), where every value in the interval [0,5] is equally likely.
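As a quick sanity check on that logit conversion (a hypothetical calculation, not part of the article's code):

```python
import math

def logit(p):
    # log-odds of a probability p
    return math.log(p / (1 - p))

# Around 5% of batted balls were home runs, so centering the
# intercept prior near logit(0.05) is reasonable:
print(round(logit(0.05), 2))  # roughly -2.94, i.e. close to -3
```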

The Stan code can be called from a variety of languages. For this article, I used R, but I scraped the data in Python via the pybaseball package. The code is in the GitHub link at the end of the article and the output is below:
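Calling the model from R might look roughly like this (the file name, data frame, and column names here are illustrative placeholders, not the article's actual script):

```r
library(rstan)

# Assemble the data list to match the Stan data block
stan_data <- list(
  N = nrow(bb),
  home_run = bb$home_run,
  exit_velocity = bb$exit_velocity,
  launch_angle = bb$launch_angle
)

# Compile and sample; file name is hypothetical
fit <- stan(file = "home_run.stan", data = stan_data,
            chains = 4, iter = 2000)
print(fit)
```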



The histogram plots the frequency of sampled values for each parameter, and the summary gives a numeric view, along with some diagnostic statistics. Note that the Rhat values for alpha and beta_ev are 1.01, not exactly 1. This indicates that convergence is not perfect, which occurs at Rhat = 1. Rhat values above 1.03 are concerning and indicate that more iterations are needed.

In conclusion, this is a simple introduction to Stan and Bayesian modeling. There are a couple of great packages in R that do Bayesian estimation without requiring the user to write the model like we did here (brms/rstanarm), but using Stan directly is good for two reasons. One is the added flexibility, which was not important in this example but can be as the problem being modeled gets more complex. The other is that writing code in Stan forces you to deeply understand the models you are working with. It has noticeably improved my understanding of statistics by forcing me to actively engage with the data and with setting priors. Hopefully this article is helpful for those interested in starting to learn Bayesian modeling.

