Comparing the Three WAR Measures

Many now treat WAR as a definitive measure of a player's success. However, it is important to realize that, like any other statistic developed in the Sabermetric Revolution, it is not perfect. As Baseball-Reference explains in its glossary entry on WAR:

There is no one way to determine WAR. There are hundreds of steps to make this calculation, and dozens of places where reasonable people can disagree on the best way to implement a particular part of the framework. We have taken the utmost care and study at each step in the process, and believe all of our choices are well reasoned and defensible. But WAR is necessarily an approximation and will never be as precise or accurate as one would like.

We present the WAR values with decimal places because this relates the WAR value back to the runs contributed (as one win is about ten runs), but you should not take any full season difference between two players of less than one to two wins to be definitive (especially when the defensive metrics are included).[1]

With top Sabermetric websites like Baseball-Reference, Fangraphs, and Baseball Prospectus all calculating wins above a replacement player differently, we cannot take the ordinal ranking of players under any one version of WAR as definitive. Additionally, even if we are 99% sure that one player is better than another, we are much less sure of the difference in production, in terms of WAR, between the two players from one measure alone.

In order to learn more accurately from rWAR (the Baseball-Reference version), fWAR (the Fangraphs version), and WARP (the Baseball Prospectus version), we need to see how the characteristics of the measures differ. Note that I am not trying to find which of the three measures is best; doing so would probably be quite subjective, especially considering that the actual WAR formulas are black boxes to me. I am merely looking at trends in the data, specifically hitter data in this article. I plan to look at this further with pitchers in a future article.

Are there relationships between the metrics?

One of the most basic questions to consider is whether one measure of WAR can imply something about another measure. To explore this question, I took all of the hitters who finished in the top 35 for the 2013 season in any one of the three statistics and compared their results:

[Chart: rWAR vs. WARP]

The light blue line represents the 45˚ line, and the black line represents the line of best fit when comparing rWAR and WARP. The y = mx + b form of the black line and the R² value of the fit (an indicator of how well the data points map to the line of best fit, where an R² of 1 means every data point lies exactly on the line) are written at the top. If a player's data point lies on the 45˚ line, his rWAR and WARP were equal this season. If it lies above the 45˚ line, his WARP was greater than his rWAR, and if it lies below the 45˚ line, his WARP was less than his rWAR.

There are a few key insights to notice. First, rWAR and WARP are not the same, as there are observations that do not lie on the 45˚ line. Second, we cannot accurately derive either of the two statistics from the other, since the R² value is small. Third, neither measure is consistently greater or less than the other, since some observations lie above the 45˚ line and some lie below it. This implies that rWAR and WARP use similar replacement level thresholds, but that their actual methods of deriving wins differ.
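For concreteness, here is a minimal sketch of how such a plot could be reproduced. The file name and column names (war_2013.csv with fWAR, rWAR, and WARP columns) are assumptions for illustration, not the article's actual data pipeline; the same code works for any pair of the three measures.

```python
# Sketch: scatter plot of two WAR measures with a 45-degree line and a
# least-squares line of best fit, annotated with its R^2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("war_2013.csv")  # hypothetical file of 2013 hitter WARs
x, y = df["rWAR"].to_numpy(), df["WARP"].to_numpy()

# Line of best fit (y = mx + b) and its R^2.
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

plt.scatter(x, y, color="black")
lims = [min(x.min(), y.min()), max(x.max(), y.max())]
plt.plot(lims, lims, color="lightblue", label="45˚ line")
plt.plot(x, y_hat, color="black",
         label=f"y = {m:.2f}x + {b:.2f}, R² = {r2:.2f}")
plt.xlabel("rWAR")
plt.ylabel("WARP")
plt.legend()
plt.show()
```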

I have included the other two graphs of this nature below; the same insights apply to them.

[Chart: fWAR vs. rWAR]

[Chart: fWAR vs. WARP]

How differently distributed are the metrics?

Since WAR measures the value of a player over a theoretical replacement level player, it is important to see how many players outperform that replacement level. We gathered from the previous section that the three statistics use similar replacement level thresholds. If this is true, we should expect the measures to have roughly the same number of players at each win level.

To test this, I looked at the distribution of the top 200, top 100, top 50, and top 25 players according to each metric. To avoid cluttering the article with graphs, I have only included the histograms of the top 200 and top 25:


[Chart: distribution of the top 200 players by each WAR measure]

For the top 200 players, there is little difference in distribution between the three statistics. Slightly more players in the top 200 have an fWAR in the 4-5 bin than have an rWAR or WARP there, but this difference is minimal given that we already know the measures are not the same.
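A sketch of how these histograms might be produced, reusing the hypothetical DataFrame from the earlier snippet; the bin edges here are an assumption for illustration:

```python
# Sketch: overlaid histograms of the top-200 values of each WAR measure,
# using one-win-wide bins (e.g. the 4-5 bin discussed above).
import matplotlib.pyplot as plt

bins = range(-3, 11)  # assumed one-win-wide bins
for metric in ["fWAR", "rWAR", "WARP"]:
    plt.hist(df[metric].nlargest(200), bins=bins,
             histtype="step", label=metric)
plt.xlabel("WAR")
plt.ylabel("players in top 200")
plt.legend()
plt.show()
```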

When we zoom in from the top 200 players to the top 25, however, there is a clearer difference in distribution between the three statistics.


[Chart: distribution of the top 25 players by each WAR measure]

WARP is the most bottom-heavy of the bunch: it has five players in the 5.09-5.68 bin while the other measures have none in their equivalent bins, and it consistently has a smaller portion of the data in the higher bins. This gives us some context for Mike Trout's season; even though he performed similarly according to all three metrics, WARP actually gives him the most credit, since the gap between his WAR and any other top player's WAR is greatest when comparing WARPs.

Additionally, a more theoretical idea we can take away from these data is that all three WAR approximations approach similar limits (if not the same limit) for their distribution functions when we take a large enough sample of the population. However, when we magnify the data enough, we can see differences between the distributions of the three approximations. This means that fWAR, rWAR, and WARP all have similar, but not identical, processes for determining player value.
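One informal way to probe this "similar limits" idea (not something done in the article, just a plausible follow-up) is a two-sample Kolmogorov-Smirnov test between each pair of empirical distributions; a large p-value means the two samples cannot be statistically distinguished:

```python
# Sketch: pairwise two-sample KS tests between the empirical distributions
# of the three WAR measures, again using the hypothetical DataFrame df.
from itertools import combinations
from scipy.stats import ks_2samp

for a, b in combinations(["fWAR", "rWAR", "WARP"], 2):
    stat, pval = ks_2samp(df[a].nlargest(200), df[b].nlargest(200))
    print(f"{a} vs. {b}: KS statistic = {stat:.3f}, p = {pval:.3f}")
```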

What types of players differ the most between the statistics? For what kinds of players are the WAR values most uniform?

Now that we have seen that we cannot derive the value of one statistic from another very accurately, and that all three distribution functions approach similar limits, where does the breakdown between the three occur? Can we attribute the differences in results to the metrics valuing specific skill sets differently? To explore this question, I again sampled the players who finished in the top 35 of any of the three WAR measures in 2013, but I also included the players who finished in the middle 30 and the bottom 35 of any of the statistics, to better characterize the entire population of MLB players. I used only the middle 30 because there was much less overlap between the three approximations in the middle than there was in the top and bottom tiers, so the middle portion of the population was already well characterized.

After gathering all of these data, I calculated the standard deviation across the three metrics for every player. The players with the lowest standard deviations are the ones whose WAR measures are the most uniform, while the players with the highest standard deviations are the ones whose WAR measures fluctuate the most. The tables below show the players whose standard deviations are more than one standard deviation away from the average deviation for a player in the sample: the first table contains the least variant players in the sample, and the second contains the most variant (a sketch of this computation follows the tables).

Least variant players:

| Name | Team | fWAR | rWAR | WARP | AVG WAR | STDEVA |
|---|---|---|---|---|---|---|
| Laynce Nix | Phillies | -0.7 | -0.7 | -0.63 | -0.677 | 0.040 |
| Chris Nelson | Angels | -0.7 | -0.7 | -0.78 | -0.727 | 0.046 |
| Alex Gonzalez | Brewers | -1.1 | -1.0 | -1.06 | -1.053 | 0.050 |
| Jeff Francoeur | Giants | -1.3 | -1.4 | -1.41 | -1.370 | 0.061 |
| Ryan Ludwick | Reds | -0.8 | -0.9 | -0.77 | -0.823 | 0.068 |
| Jamey Carroll | Royals | -0.9 | -0.8 | -0.95 | -0.883 | 0.076 |
| Billy Butler | Royals | 1.4 | 1.5 | 1.58 | 1.493 | 0.090 |
| Jose Lobaton | Rays | 1.4 | 1.4 | 1.57 | 1.457 | 0.098 |
| David Wright | Mets | 6.0 | 5.8 | 5.87 | 5.890 | 0.101 |
| J.D. Martinez | Astros | -1.1 | -1.3 | -1.16 | -1.187 | 0.103 |
| Jose Tabata | Pirates | 1.1 | 1.2 | 0.99 | 1.097 | 0.105 |
| Brent Lillibridge | Yankees | -1.0 | -0.8 | -0.80 | -0.867 | 0.115 |
| Carlos Triunfel | Mariners | -0.7 | -0.9 | -0.66 | -0.753 | 0.129 |
| Jayson Werth | Nationals | 4.6 | 4.8 | 4.84 | 4.747 | 0.129 |
| Jhonatan Solano | Nationals | -0.4 | -0.4 | -0.65 | -0.483 | 0.144 |
| Kelly Johnson | Rays | 1.2 | 1.3 | 1.50 | 1.333 | 0.153 |
| Angel Pagan | Giants | 1.3 | 1.0 | 1.23 | 1.177 | 0.157 |
| John Jaso | Athletics | 1.2 | 1.1 | 1.42 | 1.240 | 0.164 |
| Scott Hairston | Nationals | -0.7 | -0.9 | -0.56 | -0.720 | 0.171 |
| Christian Yelich | Marlins | 1.4 | 1.4 | 1.70 | 1.500 | 0.173 |
| Casper Wells | Phillies | -1.0 | -0.8 | -1.15 | -0.983 | 0.176 |
| Adeiny Hechavarria | Marlins | -1.9 | -2.1 | -2.26 | -2.087 | 0.180 |
| Carlos Ruiz | Phillies | 1.4 | 1.7 | 1.74 | 1.613 | 0.186 |
| Tyler Moore | Nationals | -1.2 | -0.9 | -0.85 | -0.983 | 0.189 |
| AVERAGE | | 0.3 | 0.3 | 0.36 | 0.331 | 0.1211 |

Most variant players:

| Name | Team | fWAR | rWAR | WARP | AVG WAR | STDEVA |
|---|---|---|---|---|---|---|
| Anthony Rendon | Nationals | 1.5 | 0.0 | 1.47 | 0.990 | 0.857 |
| Chris Parmelee | Twins | -0.2 | 0.6 | -1.12 | -0.240 | 0.861 |
| Jeff Keppinger | White Sox | -1.5 | -2.0 | -0.27 | -1.257 | 0.890 |
| Daniel Descalso | Cardinals | -0.3 | 0.1 | 1.42 | 0.407 | 0.900 |
| Lorenzo Cain | Royals | 2.6 | 3.2 | 1.41 | 2.403 | 0.911 |
| Robinson Cano | Yankees | 6.0 | 7.6 | 6.04 | 6.547 | 0.912 |
| Alcides Escobar | Royals | 1.1 | 0.3 | -0.74 | 0.220 | 0.923 |
| Marlon Byrd | Mets/Pirates | 4.1 | 5.0 | 3.12 | 4.073 | 0.940 |
| Dan Uggla | Braves | 0.5 | -1.3 | -0.88 | -0.560 | 0.942 |
| Josh Donaldson | Athletics | 7.7 | 8.0 | 6.19 | 7.297 | 0.970 |
| Dustin Pedroia | Red Sox | 5.4 | 6.5 | 4.53 | 5.477 | 0.987 |
| Darwin Barney | Cubs | 0.4 | -0.5 | -1.72 | -0.607 | 1.064 |
| Andrelton Simmons | Braves | 4.7 | 6.8 | 5.41 | 5.637 | 1.068 |
| Jean Segura | Brewers | 3.4 | 3.9 | 5.55 | 4.283 | 1.125 |
| Shin-Soo Choo | Reds | 5.2 | 4.2 | 6.45 | 5.283 | 1.127 |
| Ben Zobrist | Rays | 5.4 | 5.1 | 3.31 | 4.603 | 1.130 |
| Gerardo Parra | Diamondbacks | 4.6 | 6.1 | 3.73 | 4.810 | 1.199 |
| Carlos Gomez | Brewers | 7.6 | 8.4 | 6.04 | 7.347 | 1.200 |
| A.J. Pollock | Diamondbacks | 3.6 | 3.5 | 1.47 | 2.857 | 1.202 |
| J.J. Hardy | Orioles | 3.4 | 3.7 | 1.47 | 2.857 | 1.210 |
| Andrew McCutchen | Pirates | 8.2 | 8.2 | 6.03 | 7.477 | 1.253 |
| Starling Marte | Pirates | 4.6 | 5.4 | 2.85 | 4.283 | 1.304 |
| Ian Kinsler | Rangers | 2.5 | 4.9 | 5.27 | 4.223 | 1.504 |
| AVERAGE | | 3.5 | 3.8 | 2.91 | 3.409 | 1.064 |
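The computation behind these tables can be reproduced in a few lines, under the same hypothetical DataFrame assumption as the earlier snippets (Name and Team columns are also assumed; STDEVA is Excel's sample standard deviation function, which pandas matches with ddof=1):

```python
# Sketch: per-player average and sample standard deviation across the three
# metrics, then flag players more than one SD away from the average deviation.
cols = ["fWAR", "rWAR", "WARP"]
df["AVG WAR"] = df[cols].mean(axis=1)
df["STDEVA"] = df[cols].std(axis=1, ddof=1)  # sample SD, like Excel's STDEVA

mean_sd, sd_of_sd = df["STDEVA"].mean(), df["STDEVA"].std(ddof=1)
least_variant = df[df["STDEVA"] < mean_sd - sd_of_sd]  # first table above
most_variant = df[df["STDEVA"] > mean_sd + sd_of_sd]   # second table above
print(most_variant.sort_values("STDEVA")[["Name", "Team"] + cols + ["STDEVA"]])
```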

There are some trends in this subset that would likely carry over to the entire population. The first is that the variation between WAR measurements increases with playing time. Intuitively, this may seem obvious, since any difference between two WAR formulas should only be accentuated by larger amounts of data on a player. However, it means something important: for an individual player, the three WAR measures do not approach the same limit as his playing time increases; instead, the results become more varied.

The second, and probably more interesting, trend in these groupings is the positional similarity within each group. In the most variant group, most of the players are middle infielders and very strong defensive outfielders, while in the least variant group, most of the players are catchers, corner infielders, and bat-first outfielders. Note that this does not necessarily mean that the three WAR approximations drastically differ in how they value specific positions. More likely, since different positions are expected to have different skill sets, the positional groupings show which skill areas vary the most across the three formulas. Middle infielders tend to be faster players with good gloves and not as much power; perhaps the three WAR measures differ most in these typical strengths of middle infielders. This would explain why a middle infielder like Kelly Johnson appears in the least variant group, since he is atypical for a middle infielder (above-average power, below-average speed and glove).

This is just a starting point for comparing these three WAR measures. With WAR used to analyze players so often now, this is a deep topic that can definitely (and should!) be expanded. Hopefully these results inspire you not to take a WAR measure at face value, but to consider the deeper meaning behind a player's WAR value.


[1] http://www.baseball-reference.com/about/war_explained.shtml

Data courtesy of Baseball Prospectus, Fangraphs, and Baseball-Reference


Responses to "Comparing the Three WAR Measures"

  1. George

    Nice article. My view of the list of the “high standard deviation players” is that they tend to be players that derive a large proportion of their value from defense. This theory could be explained by saying that the three systems vary most greatly in how they evaluate a player’s defensive ability.

  2. Morris Greenberg

    Thanks, George. That is definitely a good point, though that doesn’t hold for all positions. For instance, Manny Machado had a standard deviation of 0.247, just missing the least variant group. It is also noteworthy that catchers missed the “high standard deviation” players altogether, while being prominent in the low standard deviations. So, to expand on your idea, this data would lead me to believe that the measures for specific positions differ greatly. I do not know if the different approximations’ defensive metrics measure specific positions drastically differently, or if it’s a similar formula for all positions that just so happens to be most variant when balls are hit up the middle.

