This article was submitted to FanGraphs Community, but has not been picked up yet.
Over the past decade, Bayesian statistics has become more popular, and experience with it has become increasingly valuable. In this article, I want to give a brief tutorial on Stan, an open-source Bayesian probabilistic programming language that is popular among statisticians in many domains, including baseball.
The model I will estimate with Stan is a logistic regression that predicts whether a batted ball will be a home run based on launch angle and exit velocity. Obviously, many factors are missing here, but for demonstration purposes I wanted to keep things simple and use just two predictors.
A Stan model has three core blocks: data, parameters, and model. There are other blocks that do different things, such as transforming data or transforming parameters, but this article will stick to the basics.
The data block is where we describe the dataset we are working with in a generic way. The first entry is N, the number of observations in the dataset. One thing to note is that Stan compiles to C++, so the user has to declare variable types. Since the number of observations is discrete and positive, N is an integer bounded below by 0. The next variable is home_run[N], our binary dependent variable, which records 1 when the observation is a home run and 0 otherwise. It is an integer bounded below by 0 and above by 1, which restricts the set of possible values to {0, 1}. The final two variables, exit_velocity and launch_angle, are vectors of length N.
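As a sketch (my reconstruction of the block described above, not the exact code from the linked GitHub repo), the data block might look like:

```stan
data {
  int<lower=0> N;                     // number of batted balls
  int<lower=0, upper=1> home_run[N];  // 1 = home run, 0 = otherwise
  vector[N] exit_velocity;            // exit velocity of each batted ball
  vector[N] launch_angle;             // launch angle of each batted ball
}
```

Note that newer versions of Stan (2.26+) would declare the array as `array[N] int<lower=0, upper=1> home_run;` instead.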
Bayesian modeling is computationally intensive, so it is important to think about how to write code that makes the program run faster. To speed this model up, I use the transformed data block to standardize the independent variables to have a mean of zero and a standard deviation of one. With this block, the runtime for this model was around five minutes; without it, the run took around twenty minutes. That may seem trivial, but on larger datasets this speedup is massive.
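A minimal version of that standardization, assuming the standardized predictors are named ev_std and la_std (my naming, for illustration), could be:

```stan
transformed data {
  // z-score each predictor once, before sampling begins
  vector[N] ev_std = (exit_velocity - mean(exit_velocity)) / sd(exit_velocity);
  vector[N] la_std = (launch_angle - mean(launch_angle)) / sd(launch_angle);
}
```

Because this block runs only once, the standardization costs essentially nothing, while the better-scaled predictors make the sampler's job much easier.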
The parameters block declares all of the parameters in the regression model. The three parameters we need to estimate here are the intercept alpha and the two slopes, beta_ev (exit velocity) and beta_la (launch angle). These are all of type real.
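A sketch of that block (again my reconstruction, matching the parameter names used in the output summary later):

```stan
parameters {
  real alpha;    // intercept
  real beta_ev;  // slope on standardized exit velocity
  real beta_la;  // slope on standardized launch angle
}
```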
The final block is the model block. This is where we define the model itself, as well as the priors for the parameters. Stan has built-in support for logistic regression, so we use the bernoulli_logit() function to set the distribution for the probability of hitting a home run.
Setting priors on parameters is easy to program in Stan, since there are prebuilt distribution functions, but it can be difficult to know which distributions to use. For the alpha parameter, I used prior knowledge based on the data to create a moderately informative prior. In this dataset, around 5% of batted balls were hit for home runs, and we would not expect a batted ball with average exit velocity and launch angle (z-scores of 0) to be a home run. Therefore, I chose the alpha prior to be normally distributed with a mean of -3 (on the logit scale, p = 0.05 converts to a logit of about -3) and a standard deviation of 5. For the beta parameters, I did not have a good idea of what the priors should be, so I chose normal(0, 5) for both beta_ev and beta_la, which is weakly informative. An uninformative prior would be something like uniform(0, 5), where every value in the interval [0, 5] is equally likely.
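Putting the likelihood and the priors together, a model block consistent with the description above (assuming standardized predictors named ev_std and la_std from a transformed data block) would look like:

```stan
model {
  // priors
  alpha ~ normal(-3, 5);
  beta_ev ~ normal(0, 5);
  beta_la ~ normal(0, 5);
  // vectorized likelihood: logit(p) = alpha + beta_ev * ev_std + beta_la * la_std
  home_run ~ bernoulli_logit(alpha + beta_ev * ev_std + beta_la * la_std);
}
```

Using bernoulli_logit() rather than bernoulli(inv_logit(...)) is both more numerically stable and faster, since Stan computes the log likelihood directly on the logit scale.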
Stan can be called from a variety of languages. For this article, I used R, but I scraped the data in Python via the pybaseball package. The code is in the GitHub link at the end of the article, and the output is below:
The histograms plot the frequency of sampled values for each parameter, and the summary gives a numerical view, along with some diagnostics. Note that the Rhat values for alpha and beta_ev are 1.01, not 1. This indicates that convergence is not perfect, which occurs at Rhat = 1. Rhat values above roughly 1.03 are concerning and indicate that more iterations should be run.
In conclusion, this was a simple introduction to Stan and Bayesian modeling. There are a couple of great R packages that do Bayesian estimation without requiring the user to write out the model as we did here (brms and rstanarm), but using Stan directly is good for two reasons. One is the added flexibility, which was not important in this example but becomes important as the problem being modeled gets more complex. The other is that writing code in Stan forces you to deeply understand the models you are working with. It has noticeably improved my understanding of statistics by making me actively engage with the data and with setting priors. Hopefully this article is helpful for those interested in starting to learn Bayesian modeling.