Tuesday, March 29, 2022

R Package for Scraping NCAA Women's Soccer Data

Here is my package for scraping NCAA Women's Soccer data. The data comes from wosoindependent.com.

To install, follow the README:

https://github.com/dgerth5/wncaasocR/blob/main/README.md


Kelly Ranking the FanGraphs Top 100 Prospects

This article was submitted to the FanGraphs Community Blog, but has not been picked up yet.

With the Top 100 Prospects list out here at FanGraphs and the SABR Analytics Conference coming up soon, I thought it would be exciting to release a Top Prospects list using the methodology I presented at last year’s SABR conference. For those who were not in attendance (I assume this is most readers, and I recommend you go this year), my presentation was on how to use a version of the Kelly Criterion to rate prospects in the draft. The Kelly Criterion is a formula that takes the probabilities of certain events and their corresponding payoffs and returns the percentage of bankroll to risk on a wager in order to maximize the geometric growth rate. This is an important concept in investing and advantage gambling, and there is a lot of literature on the subject.

Here, the setting is to imagine we are betting on a five-sided weighted die that represents a player’s possible outcomes, with weights that represent how likely we are to roll each outcome. On the Top 100 list, in addition to the scouting reports, each player has a corresponding distribution of outcomes, with a probability of each outcome happening. The five categories are “Bust”, 40/45 FV, 50/55 FV, 60/65 FV, and 70+ FV, and we will let each of these categories be a face of the die.

The payoffs are a bit tricky, and there are multiple ways of setting them. I opted for a simple payoff structure, where I assume a standard $9M/WAR. The payoff for a given FV is then WAR/Year × $M/WAR × 6, where the 6 is for the six cost-controlled years a team has.
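The payoff arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula as stated; the function and constant names are made up for this example.

```python
# Sketch of the payoff formula: WAR/Year x $M/WAR x 6 cost-controlled years.
# Names here are illustrative, not taken from any package.
DOLLARS_PER_WAR_M = 9.0  # assumed $9M per WAR, as stated in the text
CONTROL_YEARS = 6        # cost-controlled years a team has


def payoff_millions(war_per_year: float) -> float:
    """Return the payoff in $M for a given WAR/Year."""
    return war_per_year * DOLLARS_PER_WAR_M * CONTROL_YEARS


for war in (2.5, 5.0, 7.0):
    print(f"{war} WAR/Year -> {payoff_millions(war)} $M")
```

Under the same formula, a payoff of 40.5 corresponds to 0.75 WAR/Year.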

From the Prospect Week Primer from earlier in the week, we have the following table.

| FV | WAR/Year | Payoff ($M) |
|---|---|---|
| 40/45 | 1.0 | 40.5 |
| 50/55 | 2.5 | 135 |
| 60/65 | 5.0 | 270 |
| 70+ | 7.0 | 378 |

 

Therefore, for each player, we get the Kelly Function:

$$
f(x) = p_{\text{Bust}}\log(1 - x) + p_{40/45}\log(1 + 40.5x) + p_{50/55}\log(1 + 135x) + p_{60/65}\log(1 + 270x) + p_{70+}\log(1 + 378x)
$$

We have to maximize this by finding the x (the percentage of bankroll) that gives us the highest value.
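One way to carry out the maximization is sketched below, assuming the function being maximized is the standard Kelly expected log growth, with a bust losing the full stake (payoff of -1). The probabilities in the example are hypothetical placeholders, not values read from the Top 100 list, and the function name is invented for this sketch.

```python
# Sketch: find the stake x in [0, 1) that maximizes sum(p * log(1 + b * x)).
# The objective is concave in x, so its derivative is decreasing and
# bisection on the derivative finds the maximizer.

PAYOFFS = [-1.0, 40.5, 135.0, 270.0, 378.0]  # Bust, 40/45, 50/55, 60/65, 70+


def kelly_fraction(probs, payoffs=PAYOFFS, tol=1e-10):
    """Return the Kelly stake for outcome probabilities and net payoffs."""
    def slope(x):
        # Derivative of the expected log growth with respect to x.
        return sum(p * b / (1.0 + b * x) for p, b in zip(probs, payoffs))

    if slope(0.0) <= 0.0:
        return 0.0  # no positive edge, so stake nothing
    lo, hi = 0.0, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# HYPOTHETICAL distribution for illustration only (sums to 1).
example_probs = [0.15, 0.35, 0.35, 0.10, 0.05]
example_stake = kelly_fraction(example_probs)
print(round(example_stake, 3))
```

Because the payoffs are so large relative to the stake, the result lands just below one minus the bust probability.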

Ironically, while this is a math-based exercise, I had to use the eye test to gauge the probabilities for each outcome.

Here are the top 25 prospects after applying the Kelly Criterion to the probability distributions given and the payoff matrix calculated above.

| Name | Pos | Team | Kelly | FG Rank |
|---|---|---|---|---|
| SPENCER TORKELSON | 1B | DET | 0.924 | 5 |
| JULIO RODRIGUEZ | RF | SEA | 0.899 | 4 |
| ADLEY RUTSCHMANN | C | BAL | 0.879 | 1 |
| JEREMY PENA | SS | HOU | 0.874 | 30 |
| BOBBY WITT JR | SS | KCR | 0.849 | 2 |
| RILEY GREENE | RF | DET | 0.849 | 6 |
| GRAYSON RODRIGUEZ | SP | BAL | 0.849 | 3 |
| VIDAL BRUJAN | 2B | TBR | 0.848 | 55 |
| AUSTIN MARTIN | CF | MIN | 0.848 | 56 |
| STEVEN KWAN | CF | CLE | 0.848 | 57 |
| SHEA LANGELIERS | C | ATL | 0.848 | 70 |
| GERALDO PERDOMO | SS | ARI | 0.848 | 83 |
| JOSH H SMITH | 2B | TEX | 0.848 | 89 |
| ALEK THOMAS | LF | ARI | 0.823 | 23 |
| JOSH JUNG | 3B | TEX | 0.798 | 9 |
| GABRIEL MORENO | C | TOR | 0.798 | 10 |
| TRISTON CASAS | 1B | BOS | 0.798 | 16 |
| NICK YORKE | 2B | BOS | 0.798 | 29 |
| BRYSON STOTT | SS | PHI | 0.798 | 34 |
| TYLER SODERSTROM | 1B | OAK | 0.798 | 36 |
| CRISTIAN PACHE | CF | ATL | 0.798 | 72 |
| GREG JONES | CF | TBR | 0.798 | 77 |
| GEORGE KIRBY | SP | SEA | 0.798 | 28 |
| SHANE BAZ | SP | TBR | 0.773 | 11 |
| HENRY DAVIS | C | PIT | 0.772 | 22 |

 

You’ll notice that the top risers are the perceived “safe” player types such as Steven Kwan and Geraldo Perdomo, who have a lower probability of busting than average. This is one of the consequences of Kelly. For example, if we compare two wagers on which we have the same 5% edge, one a -800 favorite and the other a +650 underdog, Kelly prescribes a larger bet on the favorite than on the underdog, even though the edge is the same.
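The favorite versus underdog comparison can be checked with the standard two-outcome Kelly formula. This sketch assumes “edge” means expected profit per unit staked, which is one common definition but is not spelled out above; the helper names are invented for the example.

```python
# Sketch: Kelly stake for a simple win/lose bet at American odds,
# ASSUMING "edge" = expected profit per unit staked.

def net_odds(american: float) -> float:
    """Convert American odds to net profit per unit staked on a win."""
    return american / 100.0 if american > 0 else 100.0 / -american


def win_prob_for_edge(b: float, edge: float) -> float:
    """Win probability that gives the requested edge at net odds b."""
    return (1.0 + edge) / (1.0 + b)


def kelly_binary(p: float, b: float) -> float:
    """Two-outcome Kelly stake: (b*p - q) / b, floored at zero."""
    return max(0.0, (b * p - (1.0 - p)) / b)


EDGE = 0.05
b_fav, b_dog = net_odds(-800), net_odds(650)
favorite_stake = kelly_binary(win_prob_for_edge(b_fav, EDGE), b_fav)
underdog_stake = kelly_binary(win_prob_for_edge(b_dog, EDGE), b_dog)
print(favorite_stake, underdog_stake)
```

Under that definition the stake simplifies to edge divided by net odds, so the favorite gets about 40% of bankroll while the underdog gets under 1%.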

To better separate true ability from just being farther along the development curve, I separated the list into two groups, those whose highest level reached was AA or higher, and then those who had not reached AA yet. I then computed the average Kelly staking size for the two groups, recorded in the table below.

| Group | Average Kelly Size |
|---|---|
| AA and above | 0.720 |
| Below AA | 0.620 |

 

Then, I subtracted the average Kelly size of each player’s group from that player’s Kelly size to get a “level adjusted” Kelly size. The top 25 from this method are below, with the full list in the GitHub link at the end of the article.

| Name | Pos | Team | Highest Level | Kelly | Level Adjusted Kelly | FG Rank |
|---|---|---|---|---|---|---|
| SPENCER TORKELSON | 1B | DET | AAA | 0.924 | 0.204 | 5 |
| JULIO RODRIGUEZ | RF | SEA | AA | 0.899 | 0.179 | 4 |
| NICK YORKE | 2B | BOS | A+ | 0.798 | 0.178 | 29 |
| TYLER SODERSTROM | 1B | OAK | A | 0.798 | 0.178 | 36 |
| ADLEY RUTSCHMANN | C | BAL | AAA | 0.879 | 0.159 | 1 |
| JEREMY PENA | SS | HOU | AAA | 0.874 | 0.154 | 30 |
| HENRY DAVIS | C | PIT | A+ | 0.772 | 0.152 | 22 |
| BOBBY WITT JR | SS | KCR | AAA | 0.849 | 0.129 | 2 |
| RILEY GREENE | RF | DET | AAA | 0.849 | 0.129 | 6 |
| GRAYSON RODRIGUEZ | SP | BAL | AA | 0.849 | 0.129 | 3 |
| FRANCISCO ALVAREZ | C | NYM | A+ | 0.748 | 0.128 | 7 |
| ANTHONY VOLPE | SS | NYY | A+ | 0.748 | 0.128 | 12 |
| CORBIN CARROLL | CF | ARI | A+ | 0.748 | 0.128 | 14 |
| VIDAL BRUJAN | 2B | TBR | MLB | 0.848 | 0.128 | 55 |
| AUSTIN MARTIN | CF | MIN | AA | 0.848 | 0.128 | 56 |
| STEVEN KWAN | CF | CLE | AAA | 0.848 | 0.128 | 57 |
| SHEA LANGELIERS | C | ATL | AAA | 0.848 | 0.128 | 70 |
| GERALDO PERDOMO | SS | ARI | MLB | 0.848 | 0.128 | 83 |
| JOSH H SMITH | 2B | TEX | AA | 0.848 | 0.128 | 89 |
| JACK LEITER | SP | TEX | R | 0.748 | 0.128 | 24 |
| PATRICK BAILEY | C | SFG | A+ | 0.747 | 0.127 | 76 |
| ALEK THOMAS | LF | ARI | AAA | 0.823 | 0.103 | 23 |
| JOSH JUNG | 3B | TEX | AAA | 0.798 | 0.078 | 9 |
| GABRIEL MORENO | C | TOR | AAA | 0.798 | 0.078 | 10 |
| TRISTON CASAS | 1B | BOS | AAA | 0.798 | 0.078 | 16 |
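The level adjustment can be sketched as a group-by followed by a subtraction. The function names below are invented for this example, and the sample uses three rows from the table together with the published group averages (0.720 and 0.620), since the full 100-player list is not reproduced here.

```python
# Sketch of the level adjustment: player's Kelly size minus the average
# Kelly size of the player's group (AA and above vs. below AA).
from statistics import mean

UPPER_LEVELS = {"AA", "AAA", "MLB"}


def group_of(level: str) -> str:
    """Map a highest level reached to one of the two groups."""
    return "AA and above" if level in UPPER_LEVELS else "Below AA"


def group_averages(players):
    """players: list of (name, highest_level, kelly) tuples."""
    by_group = {}
    for _name, level, kelly in players:
        by_group.setdefault(group_of(level), []).append(kelly)
    return {group: mean(values) for group, values in by_group.items()}


def level_adjusted(players, averages):
    """Return {name: kelly - group average}, rounded to three decimals."""
    return {
        name: round(kelly - averages[group_of(level)], 3)
        for name, level, kelly in players
    }


sample = [
    ("SPENCER TORKELSON", "AAA", 0.924),
    ("NICK YORKE", "A+", 0.798),
    ("JACK LEITER", "R", 0.748),
]
# Group averages as published in the table above (computed on the full list).
published_averages = {"AA and above": 0.720, "Below AA": 0.620}
adjusted = level_adjusted(sample, published_averages)
print(adjusted)
```

With the full list as input, `group_averages` would supply the averages instead of the hard-coded published values.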

 

After applying the change, we have mostly the same names, though with a slightly different ordering near the top of the list and a few new players. One noticeable thing is that there are only two pitching prospects in the top 25. This is roughly in line with the original Top 100 list, but it’s worth noting that pitchers have a relatively riskier profile due to injury risk, and it is therefore hard to rank a pitching prospect highly.

To wrap up, Kelly bet sizing favors players who have a low risk of busting, which results in “safer” prospects being rated higher than toolsier players with a higher risk of not making the major leagues. In fact, there is a one-to-one correspondence between bust percentage and Kelly size. This is why Torkelson was the #1 prospect; he has the lowest chance of being a bust. There are some implications here for farm system building. One is that it is important to have significant depth in a farm system. It is not guaranteed that both Adley Rutschman and Grayson Rodriguez become All-Stars, and in fact it is unlikely. Relying on the best possible outcome happening is not a sustainable way to build a team.
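The link between bust percentage and Kelly size can be sketched from the first-order condition. This assumes the Kelly function is the expected log growth, with a bust losing the full stake; the symbol $b_i$ for the non-bust payoffs is introduced here only for illustration.

$$
f'(x) = -\frac{p_{\text{Bust}}}{1 - x} + \sum_i \frac{p_i b_i}{1 + b_i x} = 0
$$

Every non-bust payoff is at least 40.5, so $b_i / (1 + b_i x) \approx 1/x$, and the non-bust probabilities sum to $1 - p_{\text{Bust}}$:

$$
-\frac{p_{\text{Bust}}}{1 - x} + \frac{1 - p_{\text{Bust}}}{x} \approx 0 \quad\Longrightarrow\quad x^* \approx 1 - p_{\text{Bust}}
$$

Under these assumptions the stake is driven almost entirely by the bust probability, which would explain why so many players share nearly identical Kelly sizes.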

From a draft standpoint, it makes sense to allocate most of the draft pool to players with a low chance of busting. One reason why Nick Madrigal is a great draft pick, even though he may not have the best career, is that he had a low chance of busting given how strong his hit tool was. His best-case scenario is not as valuable as others’, but that’s OK because we are getting a quality major leaguer for little money. This isn’t to say that drafting high schoolers or high-risk players should not happen. In fact, a team with good coordination between the draft department and player development (PD) may have an advantage over a team that just uses analytics: it can identify high schoolers who look risky from a model standpoint but have traits that PD believes it develops well. Those players carry a lower risk than the model says, which makes them more valuable.