Wednesday, October 30, 2024
MLB

Similarity Score Package

Very early this morning, I tweeted out (@MaxSportsStudio) that I am fundraising for an organization called Heart 2 Hart Detroit.

It is my goal to make a difference in my community, and I’m hoping to raise $3,130 (313 is Detroit’s area code). Heart 2 Hart Detroit serves the homeless and needy. In 2019, the charity handed out…

14,200 lunches
140 winter coats
1,300 pairs of new underwear/thermal underwear
5,500 pairs of new socks
6,100 hygiene products (deodorant, razors, shampoo, soap, toothbrush, toothpaste, wet wipes etc.)
105 pairs of shoes
2,400 shirts, sweaters or hoodies
1,025 bus tickets (4-hour rides)
36 bus passes (monthly)

If you donate at least $5, I will email you copies of my similarity score spreadsheets upon request (DM me on Twitter @MaxSportsStudio or email me at maxsportingstudio@gmail.com).

There are seven different spreadsheets.

Most similar…

  • MLB careers (position players and pitchers)
  • position players through age-x season
  • pitchers through age-x season
  • MLB single-seasons
  • rookie seasons
  • MiLB single-seasons for position players
  • MiLB single-seasons for pitchers

The way they are calculated is actually quite simple. Let’s say you wanted to find the players most similar to player W. For each metric being accounted for, you subtract player W’s value from every other player’s value, square the differences, and sum the squared results. To put the metrics on roughly the same scale, I convert each one to a Z-score first (that way a difference in plate appearances, for instance, doesn’t distort the similarity scores).

If player W’s K% was 1 standard deviation above average and his BB% was 2 standard deviations above average, the calculation would look like this.

(Player Y’s K% – 1)^2 + (Player Y’s BB% – 2)^2 = similarity score

(Player Z’s K% – 1)^2 + (Player Z’s BB% – 2)^2 = similarity score

The lower the similarity score, the more similar the players are to each other. In other words, the more similar a hitter is to Player W, the smaller the squared results will be and the smaller the similarity score (sum of squared results) will be.
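The calculation described above can be sketched in a few lines of Python. This is only an illustration: the four players and their K% and BB% values are made up, the function names are hypothetical, and the use of the population standard deviation is an assumption (the spreadsheets may use the sample version).

```python
from statistics import mean, pstdev

# Hypothetical pool: raw K% and BB% for four made-up hitters.
pool = {
    "Player W": {"K%": 25.0, "BB%": 12.0},
    "Player X": {"K%": 15.0, "BB%": 6.0},
    "Player Y": {"K%": 24.0, "BB%": 11.0},
    "Player Z": {"K%": 20.0, "BB%": 7.0},
}
metrics = ["K%", "BB%"]


def z_scores(pool, metrics):
    """Convert each metric to a Z-score using the pool's own mean and SD."""
    out = {name: {} for name in pool}
    for m in metrics:
        values = [stats[m] for stats in pool.values()]
        mu, sigma = mean(values), pstdev(values)  # population SD (assumption)
        for name, stats in pool.items():
            out[name][m] = (stats[m] - mu) / sigma
    return out


def similarity(z, target, other, metrics):
    """Sum of squared Z-score differences; lower means more similar."""
    return sum((z[other][m] - z[target][m]) ** 2 for m in metrics)


z = z_scores(pool, metrics)
others = [name for name in pool if name != "Player W"]
ranked = sorted(others, key=lambda n: similarity(z, "Player W", n, metrics))
print(ranked)  # most similar to Player W first
```

In this toy pool, Player Y sits closest to Player W on both metrics, so Player Y comes out first in the ranking.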

The Z-scores should not be interpreted on their own. I say that because the average used is the average for the players in that particular pool. In reality, a wRC+ of 100 is league average; however, the pool average might be higher for players through their age-21 season in my sheet. I utilized Z-scores because they do an adequate job of preventing one metric from overshadowing the others (note that, in certain cases, I weight some stats to make them matter more).
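To see why the Z-scores depend on the pool, here is a small sketch with made-up wRC+ values. Both pools and the helper name `z_score` are hypothetical; the only point is that the same raw number gets a different Z-score depending on who else is in the pool.

```python
from statistics import mean, pstdev


def z_score(value, pool_values):
    """Z-score of a value relative to one particular pool (population SD assumed)."""
    return (value - mean(pool_values)) / pstdev(pool_values)


# Made-up wRC+ values for two different pools.
league_wide_pool = [80, 90, 100, 110, 120]  # pool average is 100
young_star_pool = [100, 110, 120, 130, 140]  # pool average is 120

z_league = z_score(100, league_wide_pool)  # exactly average in this pool
z_young = z_score(100, young_star_pool)    # below average in this pool
print(z_league, z_young)
```

A wRC+ of 100 is average in the first pool but well below average in the second, which is why the Z-scores are only meaningful inside the spreadsheet they came from.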

I will now share at least one example from each of the seven spreadsheets.

MLB careers

The stats that are considered for most similar MLB careers for position players are plate appearances (volume), BB%+ (walk rate divided by league average [it might be adjusted for AL/NL; I’m not sure]), K%+ (strikeout rate divided by league average), wRC+ (weighted runs created plus [measures batting performance relative to league average]), base running runs above average per plate appearance, fielding runs above average per game, and positional runs above average per game (I wish I could do it by inning, but FanGraphs’ leaderboards make that tricky).

Coefficients/multipliers for squared results

  • PA: 20
  • wRC+: 5
  • Pos/G: 2.5
  • BB%+, K%+, Fld/G: 1
  • BsR/PA: 0.5
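As a rough sketch of how these multipliers enter the score, each squared Z-score difference is multiplied by its coefficient before summing. The weights below are copied from the list above; the function name and the Z-score values are hypothetical.

```python
# Coefficients from the list above (MLB careers, position players).
CAREER_HITTER_WEIGHTS = {
    "PA": 20,
    "wRC+": 5,
    "Pos/G": 2.5,
    "BB%+": 1,
    "K%+": 1,
    "Fld/G": 1,
    "BsR/PA": 0.5,
}


def weighted_similarity(z_target, z_other, weights):
    """Weighted sum of squared Z-score differences; lower means more similar."""
    return sum(w * (z_other[m] - z_target[m]) ** 2 for m, w in weights.items())


# Made-up Z-scores for a target player and two comparison players.
target = {"PA": 1.0, "wRC+": 0.5, "Pos/G": 0.0, "BB%+": 0.0,
          "K%+": 0.0, "Fld/G": 0.0, "BsR/PA": 0.0}
differs_in_pa = {**target, "PA": 2.0}           # 1 SD away in PA only
differs_in_bsr = {**target, "BsR/PA": 1.0}      # 1 SD away in BsR/PA only

print(weighted_similarity(target, differs_in_pa, CAREER_HITTER_WEIGHTS))
print(weighted_similarity(target, differs_in_bsr, CAREER_HITTER_WEIGHTS))
```

With these weights, a one-standard-deviation gap in plate appearances costs 40 times as much as the same gap in base running, so career length dominates the career comps.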

Here are the most similar careers to Scott Rolen’s…

The position player similarity systems for major leaguers date back to 1913. The minimum plate appearance requirement is 2000, and the player must be retired in order to qualify.

Total players: 2,080

The stats that are considered for most similar MLB careers for pitchers are innings (volume), innings per game (helps differentiate between relievers and starters), BB%+, K%+, and ERA-.

Coefficients/multipliers for squared results

  • IP: 7.5
  • IP/G: 5
  • ERA-: 2.5
  • BB%+, K%+: 1

Here are the most similar careers to CC Sabathia’s…

The pitcher similarity systems for major leaguers date back to 1916. The minimum innings requirement is 500.

Total players: 1,802

Position players through age-x season

The stats that are considered for most similar position players through age-x season are the same as for the career spreadsheet for hitters.

The coefficients differ slightly, with less of an emphasis on plate appearances.

  • PA, wRC+: 5
  • BB%+, K%+, Fld/G, Pos/G: 1
  • BsR/PA: 0.5

Here are the most similar players to Kris Bryant through his age-27 season…

The minimum plate appearance requirement and total number of players vary by age subset.

Pitchers through age-x season

The stats that are considered for most similar pitchers through age-x season are the same as for the career spreadsheet for pitchers.

Coefficients/multipliers for squared results

  • IP: 7.5
  • IP/G: 5
  • ERA-: 2.5
  • BB%+, K%+: 1

Here are the most similar pitchers to Aaron Nola through his age-26 season…

The minimum inning requirement and total number of players vary by age subset.

MLB single-seasons

The metrics for the most similar major league single-seasons are the same as the previous spreadsheets, with a lone exception: age is taken into account.

Coefficients/multipliers for squared results (position players)

  • Age: 10
  • wRC+: 7.5
  • PA: 2.5
  • BB%+, K%+, Fld/G, Pos/G: 1
  • BsR/PA: 0.5

Here are the most similar single-seasons to Ketel Marte’s 2019 campaign…

The minimum plate appearance requirement is 300.

Total single-seasons: 20,573

Coefficients/multipliers for squared results (pitchers)

  • IP: 7.5
  • Age, IP/G: 5
  • ERA-: 2.5
  • BB%+, K%+: 1

Here are the most similar single-seasons to Gerrit Cole’s 2019 campaign…

The minimum inning requirement is 50.

Total single-seasons: 21,912

Rookie seasons

The metrics and plate appearance requirement are the same as for the single-seasons; the coefficients are different though.

Coefficients/multipliers for squared results (position players)

  • wRC+: 7.5
  • Age, PA: 5
  • BB%+, K%+, Fld/G, Pos/G: 1
  • BsR/PA: 0.5

Here are the most similar rookie seasons to Fernando Tatis Jr.’s 2019 campaign…

Total rookie seasons: 2,130

Coefficients/multipliers for squared results (pitchers)

  • IP, IP/G: 7.5
  • Age, ERA-: 5
  • BB%+, K%+: 1

Here are the most similar rookie seasons to Giovanny Gallegos’ 2019 campaign…

Total rookie seasons: 3,939

MiLB single-seasons for position players

The stats that are considered are age, BB%, K%, GB%, Spd, and wRC+, none of which are heavily affected by ballpark. The single-seasons are separated by minor league level (rookie domestic, rookie international, A-, A, A+, AA, AAA).

The coefficients are all 1, except that age has a coefficient of 1.5 at double-A and 2 at triple-A.

Here are the most similar single-seasons to Royce Lewis’ 2019 campaign at A+…

The minor league similarity systems date back to 2007.

The minimum plate appearance requirement is 200. The total number of single-seasons varies by level.

MiLB single-seasons for pitchers

The stats that are considered are age, IP/G, BB%, K%, GB%, and xFIP (not affected much by ballpark because it doesn’t incorporate home runs; instead, xFIP estimates home runs allowed by looking at fly ball rate).

The coefficients are all 1, except that age has a coefficient of 2 at triple-A.
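The level-dependent age coefficients for the two minor league spreadsheets could be written down as a small lookup. This is just a sketch: the function names and the level labels ("AA", "AAA", and so on) are my own placeholders, and the metric lists are taken from the descriptions above.

```python
HITTER_METRICS = ["Age", "BB%", "K%", "GB%", "Spd", "wRC+"]
PITCHER_METRICS = ["Age", "IP/G", "BB%", "K%", "GB%", "xFIP"]


def age_coefficient(player_type, level):
    """Age weight by level: 2 at AAA for both, 1.5 at AA for hitters only, else 1."""
    if level == "AAA":
        return 2.0
    if level == "AA" and player_type == "hitter":
        return 1.5
    return 1.0


def milb_weights(player_type, level):
    """Build the coefficient table for one MiLB spreadsheet and level."""
    metrics = HITTER_METRICS if player_type == "hitter" else PITCHER_METRICS
    weights = {m: 1.0 for m in metrics}
    weights["Age"] = age_coefficient(player_type, level)
    return weights


print(milb_weights("hitter", "AA"))
print(milb_weights("pitcher", "AA"))
```

The resulting dictionary can be plugged into any weighted sum of squared Z-score differences, one level at a time.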

Here are the most similar single-seasons to Luis Patino’s 2019 campaign at A+…

The minimum inning requirement is 50. The total number of single-seasons varies by level.

All seven of these spreadsheets are fun to play around with.

Purpose of each one

  • MLB careers: to identify most similar careers (statistically speaking, one can say this player had a similar career to this player) and potentially spot players who may be worthy of being voted into the Hall of Fame (be careful, though, as BB% and K% don’t measure HOF worthiness)
  • Through age-x season: to see if a player is on a Hall of Fame trend and to try to project career WAR (FanGraphs for hitters)
  • MLB single-seasons: this one is really just for fun
  • Rookie seasons: this one is also just for fun
  • MiLB single-seasons: to see statistical comps for minor leaguers or former minor leaguers (could maybe try to project WAR through age-29 season)

A couple of relevant articles

To illustrate how useful these similarity scores can be, I am going to project Kris Bryant’s career fWAR based on the players we identified as being most similar to him through his age-27 season (excluding Andrew McCutchen because his career is still going on and George Burns because his career began in 1911) plus Jim Ray Hart, Lance Berkman, Jim Bottomley, Willie McCovey, and Sal Bando.

The p-value for the linear regression above is slightly less than 10%, which is higher than I’d like, but we will proceed anyway.

Based on the equation produced by the line of best fit, we’d expect Kris Bryant’s fWAR/600 PA to be 4.4 for the rest of his career (in his age-28 season and beyond).

The average ratio of plate appearances after the age-27 season to plate appearances through the age-27 season is about 1.5 for the ten players.

If we apply that to Bryant, we’d expect him to step up to the plate around 4,775 more times in his career.

After that, we can divide 4,775 by 600 and multiply that quotient by 4.4. That gives a product of 35.1 fWAR, which can be added to Bryant’s WAR through age-27 (27.8).

62.9 wins above replacement is a rough estimate of what Kris Bryant’s career WAR could look like, which would put Bryant in the Hall of Fame conversation.
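The arithmetic of the projection can be reproduced in a few lines. The inputs are the rounded figures quoted above, and the function name is hypothetical. Because the inputs are rounded, the result lands a touch below the 35.1 and 62.9 in the text; I assume the difference comes from unrounded values in the original calculation.

```python
def project_career_fwar(fwar_to_date, fwar_per_600_pa, remaining_pa):
    """Return (projected remaining fWAR, projected career fWAR)."""
    remaining_fwar = remaining_pa / 600 * fwar_per_600_pa
    return remaining_fwar, fwar_to_date + remaining_fwar


# Rounded figures quoted in the text.
remaining_fwar, career_fwar = project_career_fwar(
    fwar_to_date=27.8,      # Bryant's fWAR through age 27
    fwar_per_600_pa=4.4,    # expected rate from the line of best fit
    remaining_pa=4775,      # expected plate appearances from age 28 on
)
print(round(remaining_fwar, 1), round(career_fwar, 1))  # about 35.0 and 62.8
```

Swapping in a different rate or a different plate appearance ratio is a quick way to see how sensitive the career estimate is to each assumption.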

With copies of my spreadsheets, you could do your own versions of what you see above. Obviously, you could change the coefficients as well. If you use my files for an article or tweet, just ensure that you cite me.

Once again, your donations to Heart 2 Hart Detroit would be much appreciated.

You can donate here.

Thanks, everyone!

Notes

  • In some of the spreadsheets, partial careers are included (through age-x and careers). For instance, Walter Johnson debuted in 1907; however, only his age-28 through age-39 seasons are included in the databases. There is no easy way for me to combat this, as I wanted to include as many players as possible for which the data was available.
  • Two pitchers were mistakenly included in the most similar careers for position players (Red Ruffing and Warren Spahn).
  • I made it so a positive Z-score is always a good thing by dividing by a negative when necessary (which flips the sign). For example, a higher BB%+ is a bad thing for a pitcher.
  • The age Z-score uses the age-x season (it doesn’t differentiate further than that).
  • Minor league stats are not adjusted for year (if only I had access to league average data for MiLB).
  • Minor league stats are not combined for a single-season if the player got traded to another team.
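The sign convention mentioned in the notes (a positive Z-score is always good) can be sketched as follows. The set of "lower is better" metrics is my own assumption for pitchers, and the function name and Z-score values are made up for illustration.

```python
# Assumption: for pitchers, lower ERA- and lower BB%+ are better.
LOWER_IS_BETTER_PITCHERS = {"ERA-", "BB%+"}


def orient(z_by_metric, lower_is_better):
    """Flip the sign of metrics where a lower raw value is better."""
    return {
        m: (-z if m in lower_is_better else z)
        for m, z in z_by_metric.items()
    }


# Made-up raw Z-scores for one pitcher.
raw = {"ERA-": -1.2, "K%+": 0.8, "BB%+": 0.5}
oriented = orient(raw, LOWER_IS_BETTER_PITCHERS)
print(oriented)  # ERA- becomes positive (good), BB%+ becomes negative (bad)
```

Since the similarity score squares the differences, flipping a metric’s sign for every player in the pool does not change any score; it only makes the Z-score columns easier to read.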