Here is my package for scraping NCAA Women's Soccer data. The data comes from wosoindependent.com
To install:
https://github.com/dgerth5/wncaasocR/blob/main/README.md
This article was submitted to the FanGraphs Community Blog but has not been picked up yet.
With the Top 100 Prospects list out here at FanGraphs and
the SABR Analytics Conference coming up soon, I thought it would be exciting to
release a Top Prospects list using the methodology I presented at last year’s SABR
conference. For those who were not in attendance (I assume this is most readers,
and also recommend you go this year), the presentation I gave was on how to use
a version of the Kelly Criterion to rate prospects in the draft. The Kelly
Criterion is a formula that takes the probabilities of a set of outcomes and their
corresponding payoffs and returns the fraction of a bankroll to wager in order
to maximize the bankroll's long-run geometric growth rate. It is an important
concept in investing and advantage gambling, and there is a large body of
literature on the subject.
Here, imagine we are betting on a five-sided weighted die that represents a
player’s possible outcomes, with weights equal to the probability of rolling
each outcome. On the Top 100 list, in addition to the scouting report, each
player has a distribution of outcomes and a probability of each outcome
happening. The five categories are “Bust”, 40/45 FV, 50/55 FV, 60/65 FV, and
70+ FV, and we will let each of these categories be a face of the die.
The payoffs are a bit tricky, and there are multiple ways of setting them. I
opted for a simple payoff structure that assumes a standard $9M/WAR. The payoff
for a given FV is then WAR/Year x $M/WAR x 6, where the 6 represents the six
cost-controlled years a team has.
Combining this with the Prospect Week Primer from earlier in the week gives the
following table.
FV | WAR/Year | Payoff ($M)
40/45 | 1.0 | 40.5
50/55 | 2.5 | 135
60/65 | 5.0 | 270
70+ | 7.0 | 378
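As a quick check on the payoff formula, here is a minimal Python sketch that recomputes the Payoff column from the WAR/Year column. The variable and function names are mine, chosen for illustration. Applied literally, the formula gives 54 for 1.0 WAR/year; the table's 40.5 corresponds to 0.75 WAR/year, so the 40/45 row appears to use a different input than the one listed.

```python
# Payoff ($M) for an FV tier = WAR/year x $M per WAR x cost-controlled years.
DOLLARS_PER_WAR_M = 9.0  # the article's assumed $9M/WAR
CONTROL_YEARS = 6        # cost-controlled years a team has

# WAR/year values copied from the table above.
WAR_PER_YEAR = {"40/45": 1.0, "50/55": 2.5, "60/65": 5.0, "70+": 7.0}


def payoff_millions(war_per_year: float) -> float:
    """Return the payoff in $M for a given WAR/year."""
    return war_per_year * DOLLARS_PER_WAR_M * CONTROL_YEARS


payoffs = {fv: payoff_millions(war) for fv, war in WAR_PER_YEAR.items()}
for fv, value in payoffs.items():
    print(f"{fv}: {value:.1f}")
```

The three upper tiers match the table exactly; only the 40/45 tier differs.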
Therefore, for each player, we get the Kelly function
f(x) = p_Bust*log(1 - x) + p_40/45*log(1 + 40.5x) + p_50/55*log(1 + 135x)
+ p_60/65*log(1 + 270x) + p_70+*log(1 + 378x), which we maximize by finding the
x (the fraction of bankroll) that gives the highest value.
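To make the maximization concrete, here is a minimal sketch in Python. It assumes the Kelly function is the expected log of wealth (the quantity the Kelly Criterion maximizes) and uses the payoffs from the table above. The probabilities in the example are hypothetical, not taken from any player's actual distribution, and the function names are mine. Because the function is concave in x, the sketch finds the maximum by bisecting on the sign of the derivative rather than relying on an optimization library.

```python
import math

# Payoffs ($M) per FV tier, copied from the payoff table in the article.
PAYOFFS = {"40/45": 40.5, "50/55": 135.0, "60/65": 270.0, "70+": 378.0}


def expected_log_growth(x, p_bust, probs):
    """Kelly objective: expected log wealth when staking fraction x."""
    total = p_bust * math.log(1.0 - x)  # a bust loses the stake
    for tier, p in probs.items():
        total += p * math.log(1.0 + PAYOFFS[tier] * x)
    return total


def kelly_fraction(p_bust, probs, tol=1e-9):
    """Find the x in [0, 1) that maximizes expected_log_growth by bisection."""

    def slope(x):
        # Derivative of the objective; strictly decreasing in x.
        s = -p_bust / (1.0 - x)
        for tier, p in probs.items():
            b = PAYOFFS[tier]
            s += p * b / (1.0 + b * x)
        return s

    lo, hi = 0.0, 1.0 - 1e-12
    if slope(lo) <= 0.0:
        return 0.0  # no positive stake improves growth
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical outcome distribution (sums to 1 with the bust probability).
example_probs = {"40/45": 0.30, "50/55": 0.35, "60/65": 0.15, "70+": 0.05}
example_kelly = kelly_fraction(0.15, example_probs)
print(round(example_kelly, 3))
```

With a 15% bust probability, the result lands just below 0.85, which is the pattern the rankings later in the article show.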
Ironically, while this is a math-based exercise, I had to
use the eye test to gauge the probability of each outcome.
Here are the top 25 prospects after applying the Kelly Criterion to the given
probability distributions and the payoffs calculated above.
Name | Pos | Team | Kelly | FG Rank
SPENCER TORKELSON | 1B | DET | 0.924 | 5
JULIO RODRIGUEZ | RF | SEA | 0.899 | 4
ADLEY RUTSCHMAN | C | BAL | 0.879 | 1
JEREMY PENA | SS | HOU | 0.874 | 30
BOBBY WITT JR | SS | KCR | 0.849 | 2
RILEY GREENE | RF | DET | 0.849 | 6
GRAYSON RODRIGUEZ | SP | BAL | 0.849 | 3
VIDAL BRUJAN | 2B | TBR | 0.848 | 55
AUSTIN MARTIN | CF | MIN | 0.848 | 56
STEVEN KWAN | CF | CLE | 0.848 | 57
SHEA LANGELIERS | C | ATL | 0.848 | 70
GERALDO PERDOMO | SS | ARI | 0.848 | 83
JOSH H SMITH | 2B | TEX | 0.848 | 89
ALEK THOMAS | LF | ARI | 0.823 | 23
JOSH JUNG | 3B | TEX | 0.798 | 9
GABRIEL MORENO | C | TOR | 0.798 | 10
TRISTON CASAS | 1B | BOS | 0.798 | 16
NICK YORKE | 2B | BOS | 0.798 | 29
BRYSON STOTT | SS | PHI | 0.798 | 34
TYLER SODERSTROM | 1B | OAK | 0.798 | 36
CRISTIAN PACHE | CF | ATL | 0.798 | 72
GREG JONES | CF | TBR | 0.798 | 77
GEORGE KIRBY | SP | SEA | 0.798 | 28
SHANE BAZ | SP | TBR | 0.773 | 11
HENRY DAVIS | C | PIT | 0.772 | 22
You’ll notice that the biggest risers are the perceived “safe”
player types such as Steven Kwan and Geraldo Perdomo, who have a
lower-than-average probability of busting. This is one of the consequences of
Kelly. For example, if we compare two wagers on which we have the same 5% edge,
one on a -800 favorite and the other on a +650 underdog, Kelly prescribes
betting more on the favorite than on the underdog, even though the edge is the
same.
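The favorite-versus-underdog example can be worked through with the standard two-outcome Kelly formula, f* = (b*p - q) / b, where b is the net odds, p the win probability, and q = 1 - p. The sketch below assumes "5% edge" means an expected return of 5% per unit staked; that interpretation, and the function names, are my assumptions rather than something the article spells out.

```python
def net_odds(american: int) -> float:
    """Convert American odds to net decimal odds (profit per unit staked)."""
    if american < 0:
        return 100.0 / -american
    return american / 100.0


def kelly_for_edge(american: int, edge: float) -> float:
    """Kelly fraction for a two-outcome bet with a given expected return.

    edge is defined here as p*b - q, the expected profit per unit staked.
    """
    b = net_odds(american)
    p = (1.0 + edge) / (1.0 + b)  # win probability implied by that edge
    q = 1.0 - p
    return (b * p - q) / b        # equals edge / b


favorite = kelly_for_edge(-800, 0.05)
underdog = kelly_for_edge(650, 0.05)
print(f"favorite: {favorite:.3f}, underdog: {underdog:.4f}")
```

Under this definition the Kelly stake is simply the edge divided by the net odds, so the short-priced favorite gets a stake of about 40% of bankroll while the long shot gets under 1%.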
To better separate true ability from just being farther
along the development curve, I separated the list into two groups, those whose
highest level reached was AA or higher, and then those who had not reached AA
yet. I then computed the average Kelly staking size for the two groups,
recorded in the table below.
Group | Average Kelly Size
AA and above | 0.720
Below AA | 0.620
Then, I subtracted the average Kelly size of each player’s group from that
player’s Kelly size to get a “level adjusted” Kelly size. The top 25 from
this method are below, with the full list in the GitHub link at the end of the
article.
Name | Pos | Team | Highest Level | Kelly | Level Adjusted Kelly | FG Rank
SPENCER TORKELSON | 1B | DET | AAA | 0.924 | 0.204 | 5
JULIO RODRIGUEZ | RF | SEA | AA | 0.899 | 0.179 | 4
NICK YORKE | 2B | BOS | A+ | 0.798 | 0.178 | 29
TYLER SODERSTROM | 1B | OAK | A | 0.798 | 0.178 | 36
ADLEY RUTSCHMAN | C | BAL | AAA | 0.879 | 0.159 | 1
JEREMY PENA | SS | HOU | AAA | 0.874 | 0.154 | 30
HENRY DAVIS | C | PIT | A+ | 0.772 | 0.152 | 22
BOBBY WITT JR | SS | KCR | AAA | 0.849 | 0.129 | 2
RILEY GREENE | RF | DET | AAA | 0.849 | 0.129 | 6
GRAYSON RODRIGUEZ | SP | BAL | AA | 0.849 | 0.129 | 3
FRANCISCO ALVAREZ | C | NYM | A+ | 0.748 | 0.128 | 7
ANTHONY VOLPE | SS | NYY | A+ | 0.748 | 0.128 | 12
CORBIN CARROLL | CF | ARI | A+ | 0.748 | 0.128 | 14
VIDAL BRUJAN | 2B | TBR | MLB | 0.848 | 0.128 | 55
AUSTIN MARTIN | CF | MIN | AA | 0.848 | 0.128 | 56
STEVEN KWAN | CF | CLE | AAA | 0.848 | 0.128 | 57
SHEA LANGELIERS | C | ATL | AAA | 0.848 | 0.128 | 70
GERALDO PERDOMO | SS | ARI | MLB | 0.848 | 0.128 | 83
JOSH H SMITH | 2B | TEX | AA | 0.848 | 0.128 | 89
JACK LEITER | SP | TEX | R | 0.748 | 0.128 | 24
PATRICK BAILEY | C | SFG | A+ | 0.747 | 0.127 | 76
ALEK THOMAS | LF | ARI | AAA | 0.823 | 0.103 | 23
JOSH JUNG | 3B | TEX | AAA | 0.798 | 0.078 | 9
GABRIEL MORENO | C | TOR | AAA | 0.798 | 0.078 | 10
TRISTON CASAS | 1B | BOS | AAA | 0.798 | 0.078 | 16
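The level adjustment is simple enough to sketch in a few lines of Python. The group averages and the four sample rows are copied from the tables above. The assumption that AA, AAA, and MLB make up the "AA and above" group, along with the names used in the code, is mine.

```python
# Average Kelly size per group, from the table above.
GROUP_AVERAGE = {"AA and above": 0.720, "Below AA": 0.620}

# Assumed membership of the upper group; every other level counts as below AA.
UPPER_LEVELS = {"AA", "AAA", "MLB"}


def level_adjusted_kelly(kelly: float, highest_level: str) -> float:
    """Player's Kelly size minus the average Kelly size of their level group."""
    group = "AA and above" if highest_level in UPPER_LEVELS else "Below AA"
    return round(kelly - GROUP_AVERAGE[group], 3)


# A few rows from the level-adjusted table: (name, highest level, Kelly size).
sample = [
    ("SPENCER TORKELSON", "AAA", 0.924),
    ("NICK YORKE", "A+", 0.798),
    ("VIDAL BRUJAN", "MLB", 0.848),
    ("JACK LEITER", "R", 0.748),
]

adjusted = {name: level_adjusted_kelly(k, lvl) for name, lvl, k in sample}
for name, value in adjusted.items():
    print(f"{name}: {value:.3f}")
```

These reproduce the Level Adjusted Kelly column for the sampled players, which confirms the direction of the subtraction: player minus group average.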
After applying the change, we have mostly the same names, though
with a slightly different ordering near the top of the list and a few new
players. One thing that stands out is that there are only two pitching
prospects in the top 25. This is roughly in line with the original Top 100
list, but it is worth noting that pitchers have a riskier profile because of
injury risk, which makes it hard to rank a pitching prospect highly.
To wrap up, Kelly bet sizing favors players who have a low
risk of busting, which results in “safer” prospects being rated higher
than toolsier players with a higher risk of not making the major leagues. In
fact, there is close to a one-to-one correspondence between bust percentage and
Kelly size. This is why Torkelson is the #1 prospect; he has the lowest chance
of being a bust. There are some implications here for farm system building. One
is that it is important to have significant depth in a farm system. It is not
guaranteed that both Adley Rutschman and Grayson Rodriguez become All-Stars,
and in fact it is unlikely. Relying on the best possible outcome happening is
not a sustainable way to build a team.
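The link between bust percentage and Kelly size can be made explicit with a short derivation. This is a sketch, assuming the Kelly function is the expected log of wealth, with p_B the bust probability and p_i, b_i the probability and payoff of each non-bust tier; the symbol ε is introduced here for convenience.

```latex
\[
f(x) = p_B \log(1 - x) + \sum_i p_i \log(1 + b_i x)
\]
\[
f'(x) = -\frac{p_B}{1 - x} + \sum_i \frac{p_i b_i}{1 + b_i x} = 0
\]
% Use b_i / (1 + b_i x) = (1/x) * (1 - 1/(1 + b_i x)) and sum_i p_i = 1 - p_B.
\[
\varepsilon = \sum_i \frac{p_i}{1 + b_i x^*},
\qquad
\frac{1 - p_B - \varepsilon}{x^*} = \frac{p_B}{1 - x^*}
\quad\Longrightarrow\quad
x^* = \frac{1 - p_B - \varepsilon}{1 - \varepsilon}
\]
```

Because ε itself depends on x*, this is an implicit relation rather than a closed form. With payoffs between 40.5 and 378, though, ε is on the order of 0.01, so the Kelly size sits just under 1 - p_B, and the shape of the rest of the distribution only moves it in the third decimal place.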
From a draft standpoint, it makes sense to allocate most of the
draft pool to players with a low chance of busting. One reason why Nick Madrigal
was a great draft pick, even though he may not have the best career, is that he
had a low chance of busting because his hit tool was so strong. His best-case
scenario is not as valuable as others’, but that’s ok because we are
getting a quality major leaguer for little money. This isn’t to say that
drafting high schoolers or high-risk players should not happen. In fact, a team
with good coordination between the draft department and player development may
have an advantage over a team that relies on analytics alone. It can identify
high schoolers who look risky from a model standpoint but have traits that PD
believes it develops well. Those players carry a lower risk than the model
says, which makes them more valuable.