# Comparing the Three WAR Measures, Part II

In my debut article last week for Batting Leadoff, I started to analyze trends in the three major WAR sources (FanGraphs’ fWAR, Baseball-Reference’s rWAR, and Baseball Prospectus’s WARP) for hitters. You can find the article here, though I’ll summarize the major conclusions:

- We cannot predict any one of the WAR measures for an individual hitter very well from knowing only an alternative WAR statistic. Additionally, the three measures do not have obvious relationships with each other (such as one consistently awarding players higher WAR values than the others). This likely means that all three approximations have similar replacement level thresholds and differ more in how they actually assign wins for hitter production.
- While all three measures’ distribution functions for hitters approach similar limits, when we consider smaller samples of the population, the differences between the statistics become clearer.
- Increased playing time leads to a greater variance between the measures. This means that the formulas for the three different hitter WAR approximations differ enough that they do not approach the same limit for an individual player as his playing time increases.
- Players who play up the middle (second basemen, shortstops, and center fielders) tend to have much more variant outcomes across the three WARs than players at other positions do. This is likely a result of those positions involving similar expected skill sets.

So, now I would like to consider pitcher WAR through a similar lens. Because the last article found that increased playing time increases the variance between the three WAR statistics, I wanted to avoid combining starters and relievers in this analysis; their inning totals are drastically different. This article will focus on starting pitchers, and I will likely explore relievers in a future article if their results differ enough from starting pitcher trends.

For the purposes of this article, I labeled a player as a starting pitcher if he was the starting pitcher in at least 30% of his appearances in the 2013 season. One thing to note is that because of this reliever exclusion, I am working with much less data than in the last article: every team carries roughly 13 hitters at any one time, but only 5 starting pitchers.
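As a sketch, that filter takes only a few lines of Python; the pitcher records below are invented for illustration, not drawn from my actual data:

```python
# Starting-pitcher filter: a pitcher qualifies if at least 30% of his
# appearances in the season were starts. All names and numbers are made up.
def is_starter(games_started, appearances, threshold=0.30):
    """True if the share of appearances that were starts meets the threshold."""
    return appearances > 0 and games_started / appearances >= threshold

pitchers = [
    {"name": "Pitcher A", "gs": 32, "g": 33},  # full-time starter
    {"name": "Pitcher B", "gs": 5, "g": 60},   # reliever with spot starts
]
starters = [p["name"] for p in pitchers if is_starter(p["gs"], p["g"])]
```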

**Are there relationships between the metrics?**

Similarly to the last article, I was interested in seeing whether we could predict anything about an individual’s value for one metric given that we know one of his other WAR measures. This question is likely more interesting for pitchers because the metrics treat pitchers very differently, largely due to how the different WAR approximations handle non-DIPS stats. These differences in valuing defense-independent pitching statistics are likely to cause more noticeable relationships between the WAR approximations than the hitter WAR differences do.

Similar to the last article, I started with players who finished in the top 30 in any of the WAR measures (I changed the sample size from 35 in the last article to 30 because there are fewer starting pitchers). However, I decided to do a more thorough analysis of relationships between the metrics by extending my data set to include pitchers who finished in the bottom 30 in any of the metrics, as well as pitchers who finished in the middle 25. This sample better characterizes the population, since it includes the good, the bad, and the average pitchers of 2013.

You may notice I included a quadratic line of best fit instead of a linear one (like I did in the last article). The reasoning for this is reflected by the 45° line; for a top pitcher, his rWAR is almost always greater than his WARP, whereas for a replacement or sub-replacement level pitcher, his WARP usually exceeds his rWAR. Thus, there actually is a relationship between rWAR and WARP: good pitchers tend to have higher rWARs than WARPs, whereas worse pitchers tend to have higher WARPs than rWARs. My reasoning behind this outcome is that rWAR applies larger scalars than WARP to specific factors, so good results are accentuated positively by rWAR relative to WARP, and bad results are accentuated negatively.
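For readers curious about the mechanics, a quadratic fit like the one on the chart can be produced with `numpy.polyfit`; the (WARP, rWAR) pairs below are placeholders invented to show the procedure, not the article’s data:

```python
import numpy as np

# Invented (WARP, rWAR) pairs spanning bad-to-good pitchers, for illustration.
warp = np.array([-1.0, 0.0, 1.0, 2.5, 4.0, 6.0])
rwar = np.array([-1.6, -0.3, 0.9, 2.7, 4.6, 7.2])

# Degree-2 least-squares fit: rWAR ~ a*WARP^2 + b*WARP + c
a, b, c = np.polyfit(warp, rwar, 2)
predicted = np.polyval([a, b, c], warp)

# Goodness of fit, analogous to the R^2 values quoted in the article.
ss_res = np.sum((rwar - predicted) ** 2)
ss_tot = np.sum((rwar - rwar.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```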

Now that we see different pitcher WAR approximations can have relationships, how do the other WAR metrics interact?

For WARP and fWAR, a pitcher’s WARP is rarely less than his fWAR. Additionally, the quadratic line of best fit for WARP vs. fWAR for pitchers is a better predictor than any line of best fit for the other graphs we have looked at, with an R² value of 0.9191. Consider a pitcher like **Gio Gonzalez**, who had an fWAR of 3.1 and a WARP of 3.32. While the difference in his value seems virtually nonexistent to the naked eye, his results are actually unusual given the relationship between the two measures. If we plug his 3.32 WARP into the quadratic line of best fit equation (written in the top right corner of the graph), he would be expected to have a 4.5 fWAR on the season, good for 14th in the league instead of the 43rd place he actually occupied. Since there is a consistent, predictable difference between the two approximations, their replacement level thresholds are likely different, by roughly 0.3718 if we accept the quadratic line of best fit as a reasonable estimate of the relationship between the two statistics.
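Reading a projected fWAR off the fitted curve is just a matter of evaluating the quadratic at a player’s WARP. Since the actual coefficients live on the chart, the ones below are stand-ins chosen for illustration (with the constant term echoing the roughly 0.37 offset discussed above):

```python
# Evaluate a fitted quadratic fWAR ~ a*WARP^2 + b*WARP + c at a given WARP.
# The coefficients here are placeholders, not the ones from the chart.
def predicted_fwar(warp, a, b, c):
    return a * warp ** 2 + b * warp + c

# At WARP = 0, the prediction reduces to c, the implied gap between the two
# measures' replacement levels under this fit.
proj = predicted_fwar(3.32, a=0.05, b=1.1, c=0.37)  # Gio Gonzalez's 3.32 WARP
```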

Lastly, for fWAR and rWAR, the results seem similar to the first article; there is not as clear a relationship between the two metrics as there was in the two examples above:

**How differently distributed are the metrics?**

Once again, I would like to consider how often different WAR levels are attained across the three measures. Should the results differ from before, now that we can actually see a clearer relationship between some of the WAR approximations? To test this, I used a similar approach, though with fewer observations in this sample because there are fewer starting pitchers in the league:


Consistent with the previous section, the top 150 and top 18 WARP pitchers have lower values than both the top 150 and top 18 fWAR and rWAR pitchers do. This means that the three distribution functions do not approach a similar limit, since WARP runs lower for both the top 150 and the top 18. This differs greatly from our findings on hitter WAR.

**What types of players seem to differ the most between the statistics? What kinds of players’ WAR values seem to be the most uniform?**

To find the pitchers who vary the most and least, I used the same criteria as I did in the previous article: include any pitcher whose standard deviation across the three measures is more than one standard deviation away from the average deviation.
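A sketch of that selection rule in Python (the WAR triples are invented; flipping the inequality would isolate the most uniform pitchers instead):

```python
import statistics

# Each pitcher maps to his (fWAR, rWAR, WARP); all numbers are invented.
wars = {
    "Pitcher A": (6.0, 7.0, 4.5),  # large spread across the measures
    "Pitcher B": (2.0, 2.1, 1.9),  # tight agreement
    "Pitcher C": (3.0, 3.4, 2.8),
}

# Standard deviation of each pitcher's three WAR values.
spreads = {name: statistics.stdev(vals) for name, vals in wars.items()}

# Flag pitchers whose spread sits more than one standard deviation above
# the average spread.
mean_s = statistics.mean(spreads.values())
sd_s = statistics.stdev(spreads.values())
variant = [name for name, s in spreads.items() if s > mean_s + sd_s]
```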

It is apparent that pitchers who log more innings have more variant WAR approximations, similar to how hitters who play more have more variant WAR. However, I wanted to look for more than this, just as I found positional patterns for hitters.

I approached finding the characteristics of the two different groups using relative risk. The basic idea behind relative risk is to identify how much likelier a person with a specific trait is to experience a certain outcome than a person in the population who does not possess the trait.
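In 2×2 terms (trait vs. no trait, outcome vs. no outcome), relative risk reduces to a ratio of two proportions. A minimal sketch, with invented counts:

```python
def relative_risk(exposed_with, exposed_total, unexposed_with, unexposed_total):
    """Risk of the outcome among those with the trait divided by the risk
    among those without it; values well above 1 suggest the trait and the
    outcome travel together."""
    return (exposed_with / exposed_total) / (unexposed_with / unexposed_total)

# Invented example: 8 of 20 variant pitchers land in a given BABIP quartile,
# versus 5 of 50 comparable non-variant pitchers.
rr = relative_risk(8, 20, 5, 50)  # (8/20) / (5/50) = 4.0
```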

Initially, when I compared all of the players from the variant population to the entire population, I did not have much success using relative risk. I did not find any statistics in which the players with inconsistent WAR measures behaved much differently than the rest of the population. However, I noticed that players from my sample who were drawn from the same initial WAR grouping (the good, the bad, or the average) tended to behave similarly. So, I further subdivided the variant group into three groups, and instead of comparing each of the three groups to the rest of the population, I compared them to the other players with similar WAR performances but lower variance. This resulted in more fruitful findings:

In terms of BABIP and LOB%, both the top performing variant players and the bottom performing variant players have a much greater percentage of their observations in a specific quartile than the rest of that sample of the population does.
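Quartile membership of this sort can be assigned with `numpy.percentile`; the BABIP values below are invented to show the mechanics:

```python
import numpy as np

def quartile(value, population):
    """Return 1-4 for the population quartile that the value falls in."""
    q1, q2, q3 = np.percentile(population, [25, 50, 75])
    if value <= q1:
        return 1
    if value <= q2:
        return 2
    if value <= q3:
        return 3
    return 4

# Invented BABIP sample; a very low BABIP lands in quartile 1.
babips = [0.252, 0.270, 0.281, 0.290, 0.295, 0.299, 0.305, 0.320]
```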

However, the top players with variant WAR have characteristics that are exactly the opposite of the bottom variant players’ habits. This may seem strange at first but actually makes a great deal of sense. BABIP and LOB% can be considered measures that combine skill and luck. On the one hand, better pitchers could post better BABIP and LOB% figures by inducing weaker contact from hitters. On the other hand, BABIP and LOB% rely on many factors that are uncontrollable from the pitcher’s perspective.

Players who have low BABIPs (in quartile 1) and high LOB% (in quartile 4) for a specific season tend to be players who had luckier seasons. Thus, variant players in the top group often are in quartile 1 in BABIP and quartile 4 in LOB% because the different WAR measures try to adjust for these luck factors differently. Note that this does not mean that top pitchers with high deviations for their 2013 WARs were “lucky”. It just means that they pitched in a way where the three WAR measures disagree on how to separate out the luck and skill involved. If one WAR measure credits BABIP and LOB% as 100% skill and 0% luck, while another credits them as 50% skill and 50% luck, pitchers who have more extreme values for the year in BABIP and LOB% should have more variant WAR statistics.
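The 100%-skill versus 50/50 hypothetical above can be made concrete with a toy calculation (all numbers invented):

```python
# Suppose two hypothetical WAR measures start from the same 3.0-win baseline
# but weight the BABIP/LOB%-driven component differently.
baseline = 3.0

def war_estimate(luck_component, skill_weight):
    """Toy WAR: baseline plus the luck-prone component, scaled by how much
    of that component the measure credits as skill."""
    return baseline + luck_component * skill_weight

# A typical season (+1.0 win of luck-prone value) vs. an extreme one (+2.0);
# the disagreement between the two measures doubles with the extremity.
gap_normal = abs(war_estimate(1.0, 1.0) - war_estimate(1.0, 0.5))   # 0.5
gap_extreme = abs(war_estimate(2.0, 1.0) - war_estimate(2.0, 0.5))  # 1.0
```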

On the opposite side of the spectrum, pitchers who have high BABIPs (in quartile 4) and low LOB% (in quartile 1) tend to be players who had unlucky seasons. Just as with the pitchers whose profiles look “lucky,” the WAR approximations likely differ on how to credit a specific pitcher for plays that involve greater contributions from other players on the team.

So, why do bottom pitchers who have variant measures tend to behave like unlucky pitchers, while top pitchers who have variant measures tend to behave like lucky ones? Why can’t it be the other way around? The simple answer is that top performing pitchers do not get very unlucky, and bottom performing pitchers do not get very lucky. The dramatic difference in the characteristics of variant players between top and bottom pitchers is a result of their different WAR levels, not the fact that their WAR measures vary.

This all leads to a striking (pun intended) conclusion: we still do not know how to correctly factor defensively dependent outcomes into WAR. I am not saying that this conclusion means we should abandon the measure completely. Instead, I am saying that we can do more than simply take a specific WAR value for a player and treat it as a definitive measure of his performance. Take, for example, **Mat Latos** and **Hisashi Iwakuma**. If we just look at their fWARs, Latos would appear to be the better pitcher, with a 4.4 fWAR to Iwakuma’s 4.2. However, Iwakuma had a ridiculous .252 BABIP and 81.9% LOB%, while Latos had a more normal .299 BABIP and 74.6% LOB%. Because of Iwakuma’s extreme results in both measures, we should realize that his fWAR value is less likely to reflect his true value than Latos’ fWAR is. Indeed, this is probably the case, as Iwakuma varied more than any other pitcher in my sample, largely because he led the AL with a 7.0 rWAR.

Hopefully, this piece inspires you to analyze WAR more thoroughly and dig deeper when trying to determine player value in terms of wins. I have only scratched the surface in analyzing WAR here, and such a popular measurement deserves further analysis.

Featured Image courtesy of http://www.zimbio.com
