Wednesday, May 29, 2024

An introduction to pDRC+

The world and its population unfortunately find themselves in a tough position right now. Globally, there are over 1.1 million cases of the coronavirus, and over 60,000 people have died as a result.

I myself will not be returning to school this spring, as ordered by Michigan governor Gretchen Whitmer.

In my free time, I’ve spent too many hours trying to develop new baseball metrics to evaluate players.

I think it has paid off, and with that, I am proud to present pDRC+.

pDRC+ stands for predictive deserved runs created plus.

You may or may not be familiar with DRC+, a metric created by Baseball Prospectus. DRC+, similar to OPS+ and wRC+ in principle, is a rate statistic that seeks to convey a player’s expected contributions at the plate relative to league average, adjusting for countless variables (ballpark, temperature, opponent quality, etc.).

wRC+ and OPS+ are different from DRC+ in the sense that they are calculated using real outcomes. As far as wRC+ and OPS+ go, a bloop single counts the same as a hard-hit single.

On the other hand, DRC+, deserved runs created plus, is calculated off of expected outcomes on batted ball events (probability of a single, double, triple, etc.).

In spite of this meaningful contrast, DRC+ tends to agree with wRC+ and OPS+ most of the time.

For single-seasons of at least 300 plate appearances dating back to 2015 (n=1374), the r-squared for both DRC+ to wRC+ and DRC+ to OPS+ is about 0.9. The mean difference between a player’s DRC+ and wRC+ is about 6.4. For DRC+ and OPS+, it is about 6.2.

As documented by Jonathan Judge in this piece he wrote for Baseball Prospectus, team DRC+ has a stronger correlation to team runs/PA (1980-2018) than wRC+ and OPS+ do. It is better at predicting team runs/PA the following season, and it is the most reliable of the three metrics (strongest YoY correlation to itself).

The metric I have created, pDRC+, is built in such a way that it maximizes its ability to predict a hitter’s DRC+ in season n+1 while still maintaining a strong correlation to DRC+ during the season.

There are seven components to pDRC+.

  • HBP%
  • BB%
  • K%
  • Barrels/PA %
  • % of BBEs that are ground balls or pop ups that leave the player’s bat at an EV of less than 90 mph
  • Average Exit Velocity
  • Sprint Speed

I calculated standardized scores (Z-scores) for each of the stats and then combined them into a single Z-score.

Combined Z-score = (HBP% Z-score * 0.05) + (BB% Z-score * 0.15) + (K% Z-score * 0.25) + (Barrels/PA % Z-score * 0.25) + (% of BBEs that are ground balls… * 0.1) + (Average Exit Velocity Z-score * 0.1) + (Sprint Speed Z-score * 0.1)
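As a rough illustration of the formula above, here is a minimal Python sketch that standardizes each stat across a group of hitters and applies the seven weights. The dictionary keys, the use of the population standard deviation, and the `LOWER_IS_BETTER` sign flip are all assumptions made for the example; the article does not say how the Z-scores for K% and soft grounders/pop ups were signed, so treat that part in particular as a guess.

```python
import statistics

# Weights copied from the combined Z-score formula in the article.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "hbp_pct": 0.05,
    "bb_pct": 0.15,
    "k_pct": 0.25,
    "barrels_per_pa_pct": 0.25,
    "soft_gb_popup_pct": 0.10,
    "avg_exit_velocity": 0.10,
    "sprint_speed": 0.10,
}

# ASSUMPTION (not stated in the article): stats where a higher raw value
# hurts the hitter have their Z-scores sign-flipped before weighting.
LOWER_IS_BETTER = {"k_pct", "soft_gb_popup_pct"}


def z_scores(values):
    """Standardize a list of values (population standard deviation assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def combined_z(players):
    """Return one combined Z-score per player.

    players: list of dicts, each holding the seven stats keyed as in WEIGHTS.
    """
    totals = [0.0] * len(players)
    for stat, weight in WEIGHTS.items():
        sign = -1.0 if stat in LOWER_IS_BETTER else 1.0
        for i, z in enumerate(z_scores([p[stat] for p in players])):
            totals[i] += sign * weight * z
    return totals
```

Because the weights sum to 1, the combined score stays on roughly the same scale as the individual Z-scores.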

I determined the numbers being multiplied by the individual Z-scores through exploring what weighted combinations produce the strongest correlation to DRC+ in season n+1.
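The article does not spell out how that exploration was done. One way it could be carried out is a brute-force grid search: try every set of non-negative weights that sums to 1 (in steps of 0.05, say) and keep the set whose combined Z-score correlates best with DRC+ in season n+1. The sketch below shows that idea only; the function names, the step size, and the search strategy itself are assumptions, not the author's documented procedure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def weight_grid(n_components, units):
    """Yield every way to split `units` whole units among n_components slots."""
    if n_components == 1:
        yield (units,)
        return
    for first in range(units + 1):
        for rest in weight_grid(n_components - 1, units - first):
            yield (first,) + rest


def best_weights(z_columns, next_season_drc, steps=20):
    """Return (weights, r) for the grid point with the highest correlation.

    z_columns: one list of Z-scores per component, all the same length.
    steps=20 moves each weight in increments of 1/20 = 0.05.
    """
    best_w, best_r = None, -math.inf
    for units in weight_grid(len(z_columns), steps):
        weights = tuple(u / steps for u in units)
        combined = [
            sum(w * col[i] for w, col in zip(weights, z_columns))
            for i in range(len(next_season_drc))
        ]
        r = pearson_r(combined, next_season_drc)
        if r > best_r:
            best_w, best_r = weights, r
    return best_w, best_r
```

With seven components and a 0.05 step there are 230,230 weight combinations, so this pure-Python version would be slow on a full sample. Weights chosen this way are also tuned to the sample they were searched on, so the winning correlation will tend to look better than it would on new data.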

At this point, I’d like to talk about each factor included in the formula.

HBP% – getting hit by a pitch awards the batter first base, the same way a walk does. HBP% is not weighted as heavily as BB% in the combined Z-score equation because hit-by-pitches occur far less often (league average HBP% in 2019 for non-pitchers: 1.1%; league average BB% in 2019 for NPs: 8.7%), and they lack the same reliability that walks possess. Some HBPs are purely random, requiring minimal amounts of skill. A walk tells you much more about a hitter, which is no surprise. When unintentional, walks are earned by the hitter: the batter forced the pitcher to throw at least four pitches.

BB% – how often a hitter walks is useful. Walk percentage alone has a moderately strong correlation to DRC+, and it even has some predictive value.

K% – strikeout rate has little correlation to DRC+ (r-squared of 0.0183), but when it is combined with other variables, it is very useful. A plate appearance ending in a strikeout is an out, unless it’s a dropped third strike.

Barrels/PA % – a barrel is a type of contact that often leads to positive outcomes for the hitter. In the 2019 regular season, 81.6% of barrels resulted in base hits, with more than half of barrels ending in home runs. A barrel requires a high exit velocity (“softest” hit barrel in 2019 had a 97.5 mph EV) and must be within a certain LA range (depends on the EV). Hitters that barrel the ball consistently tend to have higher DRC+s. A benefit of incorporating barrels is that the frequency with which they occur is not physically affected by ballpark (note that the result of the barrel is).

% of BBEs that are ground balls or pop ups leaving the bat at an EV of less than 90 mph – in 2019, these batted ball events turned into hits 10.5% of the time. Ground balls and pop ups are bad enough to begin with, and they are even worse when hit softly.

Average exit velocity – balls hit at a higher exit velocity translate into hits more often than those hit at lower EVs.

2019 league AVG on balls leaving the bat at “x” velocity

Sprint speed – “Currently, the metric includes “qualified runs” from these two categories:

• Runs of two bases or more on non-homers, excluding runs from second base when an extra-base hit happens.
• Home-to-first runs on ‘topped’ or ‘weakly hit’ balls” (MLB.com).

Sprint speed, like K%, has no correlation to DRC+ when looked at on its own. It is useful though when combined with other aspects. Players that are faster have a higher chance of beating out ground balls and gaining more bases through their speed (for instance, stretching a single into a double).

Combining all seven factors outlined above into one score allows one to better predict a hitter’s DRC+ in season n+1.

The correlation for combined Z-score to DRC+ (n+1) is slightly higher than the correlation of DRC+ to DRC+ (n+1), a difference of almost 0.05 (R).

Then, I regressed the combined Z-scores against DRC+. The line of best fit produced pDRC+ values.
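For readers who want to see what that step looks like, here is a minimal sketch of a one-variable least-squares fit in Python. It assumes "regressed against DRC+" means an ordinary least-squares line with the combined Z-score as the input and a DRC+ column as the target; the article does not say whether that column is same-season or season n+1 DRC+, so the sketch simply takes whichever one it is given. The function names are made up for the example.

```python
def fit_line(combined_zs, drc_values):
    """Ordinary least-squares fit of drc = intercept + slope * combined_z."""
    n = len(combined_zs)
    mean_z = sum(combined_zs) / n
    mean_drc = sum(drc_values) / n
    sxx = sum((z - mean_z) ** 2 for z in combined_zs)
    sxy = sum((z - mean_z) * (d - mean_drc)
              for z, d in zip(combined_zs, drc_values))
    slope = sxy / sxx
    intercept = mean_drc - slope * mean_z
    return slope, intercept


def to_pdrc_plus(combined_z, slope, intercept):
    """Read a pDRC+ value off the fitted line for one combined Z-score."""
    return intercept + slope * combined_z
```

A least-squares line always passes through the point (mean combined Z-score, mean DRC+), so the average of the fitted values equals the average DRC+ of whatever sample the line was fit on.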

For my sample of 824 back-to-back player-seasons (had to have 300+ PAs each year), pDRC+ values in season n were on average 12.9 points away from DRC+ values in season n+1, whereas DRC+ values in season n were on average 14.9 points away from DRC+ values in season n+1.

pDRC+ values in season n were on average 15.4 points away from wRC+ values in season n+1, whereas wRC+ values in season n were on average 18.7 points away from wRC+ values in season n+1.

When there was at least a 10-point difference between DRC+ and pDRC+ in season n, pDRC+’s edge as a predictor of DRC+ in season n+1 widened: on average, pDRC+ missed by almost 5 points less than DRC+ did.
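The "points away on average" comparisons above can be sketched as a mean absolute difference, optionally restricted to player-seasons where DRC+ and pDRC+ disagreed by some minimum gap. That reading of "on average ... points away" is an assumption, as are the function names. The two rows are the Jesus Aguilar and DJ LeMahieu figures quoted later in the article; they are hand-picked illustrations, not a representative sample, so the outputs say nothing about the full 824-season result.

```python
def mean_abs_diff(predicted, actual):
    """Average absolute gap between paired predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def compare_predictors(rows, min_gap=0):
    """Return (DRC+ error, pDRC+ error) against next-season DRC+.

    rows: tuples of (name, DRC+ in season n, pDRC+ in season n, DRC+ in n+1).
    min_gap: keep only rows where |DRC+ - pDRC+| in season n is at least this.
    """
    kept = [r for r in rows if abs(r[1] - r[2]) >= min_gap]
    actual = [r[3] for r in kept]
    drc_error = mean_abs_diff([r[1] for r in kept], actual)
    pdrc_error = mean_abs_diff([r[2] for r in kept], actual)
    return drc_error, pdrc_error


# 2018 DRC+, 2018 pDRC+, 2019 DRC+ as quoted in the article.
rows = [
    ("Jesus Aguilar", 135, 109, 97),
    ("DJ LeMahieu", 98, 117, 128),
]

print(compare_predictors(rows))              # (34.0, 11.5)
print(compare_predictors(rows, min_gap=20))  # (38.0, 12.0)
```

Note that `compare_predictors` would divide by zero if `min_gap` filtered out every row; a fuller version would guard against that.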

At this point, I’d like to illustrate a key difference between wRC+ (and OPS+) and pDRC+ (and DRC+).

Take this Dansby Swanson liner for example…

It left Swanson’s bat on a line at 106.6 mph, but it was hit right at Arenado.

After this plate appearance, Dansby Swanson’s wRC+ and OPS+ would decrease because Swanson recorded an out.

His DRC+ would probably increase because there was a high chance of that liner going for a single (or double for that matter).

His pDRC+ would increase because it was hit at a high exit velocity (it wasn’t a soft grounder or pop up).

Here is another example…

This fly ball left Trout’s bat at 93.4 mph.

His wRC+ and OPS+ would decrease because Trout flied out.

His DRC+ would (likely) decrease because BBEs with 90+ mph EV and 45+ LA don’t often lead to much of anything (.020 wOBA).

His pDRC+ would increase because Trout hit it hard (and it was not a soft grounder or pop up).

One last example…

Tim Locastro hits a soft grounder that goes for a walk-off single. It left his bat at 78.5 mph.

Locastro’s wRC+ and OPS+ both go up because he singled.

His DRC+ may or may not have gone up (depends on how Baseball Prospectus assigned probabilities for this BBE).

His pDRC+ would see the biggest drop, as it was a softly hit grounder.

A difference between DRC+ and pDRC+ is that pDRC+ drops for the grounder more than DRC+ would (.118 wOBA for grounders w/ EV of less than 90 mph). In other words, DRC+ (presumably) drops more for a hard-hit fly ball (like the one Trout hit) than it does for a soft grounder (like Locastro’s).

pDRC+ realizes the ball was hit harder on that fly ball, and even though it had a lower chance of being a hit, Trout just missed an XBH or home run (too high an LA). Trout knew it too. After all, one can’t hit a homer on a ground ball.

pDRC+ rewards batters for hitting the ball hard even when the chance of a hit is small. That hitter might see better results in the future (maybe that player needs to alter his launch angle some). Hitting the ball hard is always a valuable skill.

Here are two examples of pDRC+ at work…

In 2018, Jesus Aguilar posted a 135 DRC+, which is very good. His pDRC+ was a much less impressive 109. While Aguilar performed well in the batted ball components of pDRC+, he’s very slow (1.61 StDev below average for sprint speed), he strikes out often (0.62 StDev above average), and the frequency with which he walks does not really make up for his tendency to whiff (0.51 StDev above average). In 2019, Aguilar recorded a 97 DRC+.

In 2018, DJ LeMahieu had a 98 DRC+. His pDRC+ was much better (117). LeMahieu hit the ball super hard (1.34 StDev above average for exit velocity). He rarely hit soft grounders and pop ups (1.73 StDev below average) and didn’t strike out much (1.30 StDev below average). His DRC+ in 2019 was 128.

Highest pDRC+ single-seasons since 2015 (min. 300 PA):

  1. 2019 Mike Trout (169)
  2. 2018 Mike Trout (167)
  3. 2018 Mookie Betts (165)
  4. 2017 Aaron Judge (163)
  5. 2015 Mike Trout (158)
  6. 2016 Mike Trout (155)
  7. 2015 Bryce Harper (154)
  8. 2019 Christian Yelich (154)
  9. 2016 David Ortiz (154)
  10. 2017 Mike Trout (153)

Highest pDRC+ last season:

  1. Mike Trout (169)
  2. Christian Yelich (154)
  3. Cody Bellinger (151)
  4. Anthony Rendon (148)
  5. Mookie Betts (145)
  6. Aaron Judge (145)
  7. Nelson Cruz (139)
  8. Yordan Alvarez (138)
  9. Carlos Santana (137)
  10. Juan Soto (137)

Guys one might expect to hit at a higher level next season given a significant positive difference between pDRC+ and DRC+:

Name / pDRC+ / DRC+

  • Dansby Swanson / 121 / 97
  • C.J. Cron / 124 / 101
  • JaCoby Jones / 110 / 87
  • Ian Desmond / 107 / 86
  • Ryan O’Hearn / 101 / 80
  • Robinson Cano / 109 / 89
  • Lorenzo Cain / 106 / 87
  • Kevin Kiermaier / 93 / 75
  • Jackie Bradley Jr. / 104 / 86
  • Marcell Ozuna / 129 / 111

Guys one might expect to hit at a lower level next season given a significant negative difference between pDRC+ and DRC+:

  • Omar Narvaez / 91 / 123
  • Eric Sogard / 88 / 112
  • Mitch Garver / 126 / 149
  • Alex Bregman / 135 / 157
  • Tim Anderson / 92 / 113
  • Eugenio Suarez / 111 / 132
  • Pete Alonso / 121 / 141
  • Nolan Arenado / 117 / 136
  • Charlie Blackmon / 112 / 129
  • Gleyber Torres / 108 / 124

One way pDRC+ is better than the other metrics is that it is the most reliable of the three.

The r-squared for pDRC+ YoY is significantly higher than it is for DRC+ and wRC+ (and OPS+). I didn’t feel it was necessary to include the graph for OPS+ because OPS+ and wRC+ are so close to each other in terms of the final outcome.

To review, understand and remember these things:

  • If you see a big difference between a player’s DRC+ and pDRC+, it does not necessarily mean that player was the victim of bad luck; what it does mean is that you’d expect his future DRC+ to be closer to his pDRC+ than his DRC+
  • If you see a big difference between a player’s pDRC+ and wRC+/OPS+, it is possible that the player was lucky/unlucky; with that being said, DRC+ is the best stat to see if a player got lucky/unlucky
  • pDRC+ is more predictive of future offensive performance (wRC+, OPS+, and DRC+) than wRC+, OPS+, and DRC+
  • pDRC+ is by no means perfect; I’m sure there are ways it can be enhanced; each player is unique, and it’s virtually impossible to capture all aspects of a player’s game objectively; some players completely overhaul their swing from one year to the next
  • The league average for pDRC+ is 98.5 right now; that value comes from a regression, which minimizes the sum of squared errors, so if I changed it, it would create more error. In an ideal world, it would be 100, but the ultimate goal is to predict DRC+ in season n+1 as precisely as possible

Hope you enjoyed reading. If you’d like a copy of the spreadsheet, DM me on Twitter (@MaxSportsStudio).

Note: all data comes from three great baseball websites: Baseball Savant, FanGraphs, and Baseball-Reference

10 thoughts on “An introduction to pDRC+”

  • Henry Bogardus

    Cool stuff. You ever think about trying to give a bigger importance to speed based on how often the batter puts the ball in play (i.e. less home runs, strikeouts, walks means higher speed coefficient)?

    • Max Goldstein

      This is an interesting point, Henry. I’ve thought about it, and it’s definitely something I will consider if/when I look to improve upon the current version of pDRC+. FWIW, r-squared between 2019 difference in pDRC+ and DRC+ and % of plate appearances ending in a ball put in play is roughly zero.

  • Would you be willing to share a database of results

    • Max Goldstein

      Sure! I could send it to you through Twitter (@MaxSportsStudio) or your email. Whatever you’d prefer works for me.

  • This is a great write up – super informative for someone who has never worked with large datasets before to get an idea of your process.

    If you’ll forgive a question from a data-novice, can you explain how you verified the predictive value of pDRC+? I imagine that after using the initial dataset to establish weights, etc, you would have checked your results against a second, blind dataset or something similar?

    If this was addressed in the article, I apologize for overlooking that.

    Thanks!

    • Max Goldstein

      I did not think to verify the predictive value of pDRC+, which could prove to be problematic. I am new to this process as well, having just completed AP Stats last month. It would be nice if I had a larger sample to work with, but Statcast wasn’t instituted until 2015. While it is possible that my model is overfitted to the data in my sample, I was working with more than 800 consecutive player-seasons.

  • About the numbers (i.e., coefficients) you multiplied each variable by: what was your sample size to get to those through trial and error? And did you break that sample into a training set and a validation set? If so, how large was each one (i.e., what was the % breakdown relative to sample size)…

    • Max Goldstein

      My sample size was 824 consecutive player-seasons (had to have 300+ PAs in both seasons). I didn’t break that sample into a training set and validation set. Right now, I am working on a pitch metric, and that is something I will look to do.

      • Did you use different seasons for the same player in your sample? For example, Mike Trout 2019-2020 and Mike Trout 2018-2019 as two different data points?

        • Max Goldstein

          Yes, that is correct.
