Comparing the Three WAR Measures
Many now treat WAR as a definitive measure of a player's success. However, it is important to realize that, like any other statistic developed in the Sabermetric Revolution, it is not perfect. As BaseballReference explains in its glossary entry on WAR:
There is no one way to determine WAR. There are hundreds of steps to make this calculation, and dozens of places where reasonable people can disagree on the best way to implement a particular part of the framework. We have taken the utmost care and study at each step in the process, and believe all of our choices are well reasoned and defensible. But WAR is necessarily an approximation and will never be as precise or accurate as one would like.
We present the WAR values with decimal places because this relates the WAR value back to the runs contributed (as one win is about ten runs), but you should not take any full season difference between two players of less than one to two wins to be definitive (especially when the defensive metrics are included).[1]
With top sabermetric websites like BaseballReference, Fangraphs, and Baseball Prospectus each calculating wins above a replacement player differently, we cannot trust the ordinal rankings of players from any one version of WAR. Additionally, even if we are 99% sure that one player is better than another, we are much less sure of the difference in production, in terms of WAR, between the two players from one measure alone.
In order to learn more from rWAR (the BaseballReference version), fWAR (the Fangraphs version), and WARP (the Baseball Prospectus version), we need to understand how the measures' characteristics differ. Note that I am not trying to determine which of the three measures is best; doing so would probably be quite subjective, especially considering that the actual WAR formulas are black boxes to me. I am merely looking at trends in the data, specifically hitter data in this article. I plan to look at this further with pitchers in a future article.
Are there relationships between the metrics?
One of the most basic questions to consider is whether one measure of WAR can imply something about another measure. To explore this question, I took all of the hitters who finished in the top 35 for the 2013 season in any one of the three statistics, and compared their results:
The light blue line represents the 45˚ line, and the black line represents the line of best fit comparing rWAR and WARP. The y = mx + b form of the black line and the R^2 value of the fit (an indicator of how well the data points map to the line of best fit, where an R^2 of 1 means every data point lies exactly on the line) are written at the top. If a player’s data point lies on the 45˚ line, his rWAR and WARP were equal this year; if it lies above the 45˚ line, his WARP was greater than his rWAR; and if it lies below the 45˚ line, his WARP was less than his rWAR.
There are a few key insights to notice. First, rWAR and WARP are not the same, as there are observations that do not lie on the 45˚ line. Second, we cannot accurately derive either of the two statistics from the other (since the R^2 value is small). Third, neither measure is consistently greater than the other (since some observations lie above the 45˚ line and some lie below it). This implies that rWAR and WARP use similar replacement-level thresholds, but that their actual methods of converting performance into wins differ.
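As a sketch of how those fit statistics are produced, the slope, intercept, and R^2 of the best-fit line can be computed with NumPy. The WAR pairs below are made-up stand-ins, since the full 2013 top-35 sample is not reproduced here:

```python
import numpy as np

# Hypothetical (rWAR, WARP) pairs for a handful of hitters -- stand-ins
# for the real 2013 top-35 sample, which is not reproduced in this article.
rwar = np.array([7.6, 6.5, 5.8, 4.8, 1.4, 0.9])
warp = np.array([6.0, 4.5, 5.9, 4.8, 1.6, 0.8])

# Slope (m) and intercept (b) of the least-squares line y = m*x + b.
m, b = np.polyfit(rwar, warp, 1)

# For a simple linear fit, R^2 is the squared Pearson correlation.
r_squared = np.corrcoef(rwar, warp)[0, 1] ** 2

print(f"WARP = {m:.2f} * rWAR + {b:.2f},  R^2 = {r_squared:.3f}")

# Points above the 45-degree line are players whose WARP exceeds their rWAR.
above_45 = int(np.sum(warp > rwar))
print(f"{above_45} of {len(rwar)} players lie above the 45-degree line")
```

With the real data, a small R^2 is what tells us one metric cannot be derived accurately from the other.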
I have included the other two graphs of this nature below; the same insights apply to them.
How differently distributed are the metrics?
Since WAR measures a player's value over a theoretical replacement-level player, it is important to see how many players outperform that replacement-level player. We gathered from the previous section that the three statistics use similar replacement-level baselines. If this is true, we should expect the measures to have roughly the same number of players at each win level.
To test this, I looked at the distribution of the top 200, top 100, top 50, and top 25 players according to each metric. To avoid cluttering the article with graphs, I have only included the histograms of the top 200 and top 25 here:
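The binning behind histograms like these can be sketched with NumPy. The values below are randomly generated stand-ins, not the real top-200 WAR lists:

```python
import numpy as np

# Hypothetical WAR values for the top-N players under one metric; the real
# analysis used the 2013 top-200 lists from each of the three sites.
rng = np.random.default_rng(0)
fwar_top200 = np.sort(rng.uniform(0.5, 8.0, 200))[::-1]

# Bin the top-200 values into equal-width bins, as in the histograms above.
counts, edges = np.histogram(fwar_top200, bins=8)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:4.2f}-{hi:4.2f}: {n} players")
```

Comparing the three metrics then amounts to binning each top-N list the same way and comparing the resulting counts bin by bin.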
For the top 200 players, there is little difference in the distributions of the three statistics. Slightly more of the top 200 players fall in the 4–5 fWAR bin than in the corresponding rWAR and WARP bins, but this difference is minimal given that we know these measures are not the same.
When we zoom in from the top 200 players to the top 25 players, however, there is a clearer difference in distribution between the three statistics.
WARP is the most bottom-heavy of the bunch, with 5 players in the 5.09–5.68 bin while the other measures have none in their equivalent bins, and it consistently has a smaller portion of its data in the higher bins. This gives us some context for Mike Trout’s season: even though he performed similarly according to all three metrics, WARP actually gives him the most credit, since the gap between his WAR and any other top player’s WAR is greatest when comparing WARPs.
Additionally, a more theoretical takeaway from these data is that all three WAR approximations approach similar limits (if not the same limit) in their distribution functions when a large enough sample of the population is taken. However, when we zoom in far enough, we can see differences in the distributions of the three approximations. This means that fWAR, rWAR, and WARP follow similar, but not identical, processes for determining player value.
What types of players seem to differ the most between the statistics? What kinds of players’ WAR values seem to be the most uniform?
Now that we have seen that we cannot derive the value of one statistic from another very accurately, and that all three distribution functions approach similar limits, where does the breakdown between the three occur? Can we attribute the differences in results to the metrics valuing specific kinds of skill sets differently? To explore this question, I again sampled the players who finished in the top 35 of any one of the three WAR measures in 2013, but I also included players who finished in the middle 30 and in the bottom 35 of any of the statistics, to better characterize the entire population of MLB players. I used only the middle 30 because there was much less overlap between the three approximations in the middle than in the top and bottom tiers, so the middle portion of the population was already well characterized.
After gathering all of this data, I calculated the standard deviation across the three metrics for every player. The players with the lowest standard deviations have the most uniform WAR values across the three measures, while those with the highest standard deviations have WAR values that fluctuate the most between them. The table below shows the players whose standard deviations are more than one deviation away from the average deviation for a player in the sample, where the italicized group contains the least variant players in the sample and the bold group contains the most variant.
*Least variant group*

| Name | Team | fWAR | rWAR | WARP | AVG WAR | STDEVA |
|---|---|---|---|---|---|---|
| | Phillies | 0.7 | 0.7 | 0.63 | 0.677 | 0.040 |
| | Angels | 0.7 | 0.7 | 0.78 | 0.727 | 0.046 |
| | Brewers | 1.1 | 1 | 1.06 | 1.053 | 0.050 |
| | Giants | 1.3 | 1.4 | 1.41 | 1.370 | 0.061 |
| | Reds | 0.8 | 0.9 | 0.77 | 0.823 | 0.068 |
| | Royals | 0.9 | 0.8 | 0.95 | 0.883 | 0.076 |
| | Royals | 1.4 | 1.5 | 1.58 | 1.493 | 0.090 |
| | Rays | 1.4 | 1.4 | 1.57 | 1.457 | 0.098 |
| | Mets | 6 | 5.8 | 5.87 | 5.890 | 0.101 |
| | Astros | 1.1 | 1.3 | 1.16 | 1.187 | 0.103 |
| | Pirates | 1.1 | 1.2 | 0.99 | 1.097 | 0.105 |
| | Yankees | 1 | 0.8 | 0.8 | 0.867 | 0.115 |
| | Mariners | 0.7 | 0.9 | 0.66 | 0.753 | 0.129 |
| | Nationals | 4.6 | 4.8 | 4.84 | 4.747 | 0.129 |
| | Nationals | 0.4 | 0.4 | 0.65 | 0.483 | 0.144 |
| | Rays | 1.2 | 1.3 | 1.5 | 1.333 | 0.153 |
| | Giants | 1.3 | 1 | 1.23 | 1.177 | 0.157 |
| | Athletics | 1.2 | 1.1 | 1.42 | 1.240 | 0.164 |
| | Nationals | 0.7 | 0.9 | 0.56 | 0.720 | 0.171 |
| | Marlins | 1.4 | 1.4 | 1.7 | 1.500 | 0.173 |
| | Phillies | 1 | 0.8 | 1.15 | 0.983 | 0.176 |
| | Marlins | 1.9 | 2.1 | 2.26 | 2.087 | 0.180 |
| | Phillies | 1.4 | 1.7 | 1.74 | 1.613 | 0.186 |
| | Nationals | 1.2 | 0.9 | 0.85 | 0.983 | 0.189 |
| | AVERAGE | 0.3 | 0.3 | 0.36 | 0.331 | 0.1211 |

**Most variant group**

| Name | Team | fWAR | rWAR | WARP | AVG WAR | STDEVA |
|---|---|---|---|---|---|---|
| | Nationals | 1.5 | 0 | 1.47 | 0.990 | 0.857 |
| | Twins | 0.2 | 0.6 | 1.12 | 0.240 | 0.861 |
| | White Sox | 1.5 | 2 | 0.27 | 1.257 | 0.890 |
| | Cardinals | 0.3 | 0.1 | 1.42 | 0.407 | 0.900 |
| | Royals | 2.6 | 3.2 | 1.41 | 2.403 | 0.911 |
| | Yankees | 6 | 7.6 | 6.04 | 6.547 | 0.912 |
| | Royals | 1.1 | 0.3 | 0.74 | 0.220 | 0.923 |
| | Mets/Pirates | 4.1 | 5 | 3.12 | 4.073 | 0.940 |
| | Braves | 0.5 | 1.3 | 0.88 | 0.560 | 0.942 |
| | Athletics | 7.7 | 8 | 6.19 | 7.297 | 0.970 |
| | Red Sox | 5.4 | 6.5 | 4.53 | 5.477 | 0.987 |
| | Cubs | 0.4 | 0.5 | 1.72 | 0.607 | 1.064 |
| | Braves | 4.7 | 6.8 | 5.41 | 5.637 | 1.068 |
| | Brewers | 3.4 | 3.9 | 5.55 | 4.283 | 1.125 |
| | Reds | 5.2 | 4.2 | 6.45 | 5.283 | 1.127 |
| | Rays | 5.4 | 5.1 | 3.31 | 4.603 | 1.130 |
| | Diamondbacks | 4.6 | 6.1 | 3.73 | 4.810 | 1.199 |
| | Brewers | 7.6 | 8.4 | 6.04 | 7.347 | 1.200 |
| | Diamondbacks | 3.6 | 3.5 | 1.47 | 2.857 | 1.202 |
| | Orioles | 3.4 | 3.7 | 1.47 | 2.857 | 1.210 |
| | Pirates | 8.2 | 8.2 | 6.03 | 7.477 | 1.253 |
| | Pirates | 4.6 | 5.4 | 2.85 | 4.283 | 1.304 |
| | Rangers | 2.5 | 4.9 | 5.27 | 4.223 | 1.504 |
| | AVERAGE | 3.5 | 3.8 | 2.91 | 3.409 | 1.064 |
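The STDEVA column above is Excel's sample standard deviation (dividing by n - 1). A few rows from the table can be reproduced directly with Python's standard library:

```python
import statistics

# Each row: (team, fWAR, rWAR, WARP) -- three rows taken from the table above.
rows = [
    ("Phillies", 0.7, 0.7, 0.63),
    ("Yankees", 6.0, 7.6, 6.04),
    ("Rangers", 2.5, 4.9, 5.27),
]

for team, *wars in rows:
    avg = statistics.mean(wars)
    # statistics.stdev uses the sample (n - 1) divisor, matching Excel's STDEVA
    # for numeric inputs.
    sd = statistics.stdev(wars)
    print(f"{team:10s} AVG WAR = {avg:.3f}  STDEVA = {sd:.3f}")
```

Running this recovers the table's values (e.g. 0.677 and 0.040 for the Phillies row, 6.547 and 0.912 for the Yankees row), confirming the spread measure being used.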
There are some trends in this subset that would likely carry over to the entire population. The first is that increased playing time leads to increased variation between the WAR measurements. Intuitively, this may seem obvious, since any difference between two WAR formulas should only be accentuated by larger amounts of data on a player. However, it carries an important implication: for an individual player, the three WAR measures do not approach the same limit as his playing time increases, since the results become more varied.
The second, and probably more interesting, trend in these groupings is the positional similarity within each group. In the bold group, most of the players are middle infielders and very strong defensive outfielders, while in the italicized group, most of the players are catchers, corner infielders, and bat-first outfielders. Note that this does not necessarily mean that the three WAR approximations drastically differ in how they value specific positions. More likely, since different positions are expected to have different skill sets, the positional groupings show which skill areas vary the most across the three formulas. Middle infielders tend to be faster players with good gloves and not as much power; perhaps the three WAR measures differ most in these typical strengths of middle infielders. This would explain why a middle infielder like Kelly Johnson appears in the italicized group: Kelly Johnson is atypical for a middle infielder (above-average power, below-average speed and glove).
This is just a starting point for comparing these three WAR measures. With WAR now used so often to analyze players, this is a deep topic that can (and should!) be expanded upon. The results in this article are a good starting point for answering this larger question, and hopefully they inspire you not just to take a WAR measure at face value, but to consider the deeper meaning behind a player's WAR value.
[1] http://www.baseballreference.com/about/war_explained.shtml
Data courtesy of Baseball Prospectus, Fangraphs, and BaseballReference
Featured Image courtesy of http://www.zimbio.com