Friday, May 28, 2010

THE FALLACIES OF DEFENSIVE METRICS

"Basketball scorers count mechanical errors, but those are a record of objective facts: team A has the ball, then team B has the ball... But the fact of a baseball error is that no play has been made but that the scorer thinks it should have. It is, uniquely, a record of opinions" -- Bill James

If you’ve ever instinctively recoiled at a statistic that could reckon Derek Jeter the worst shortstop in baseball, or doubted a respected front office's decision to replace Mike Lowell with Adrian Beltre and Jason Bay with Mike Cameron, a recent article in New York magazine will vindicate your skepticism.

In “Database Loaded,” (New York, April 18, 2010) author Will Leitch sheds light on the arcane world of defensive metrics. And in explaining how they operate, the author unearths many of the subjective judgments, speculative assumptions, and dubious conclusions that characterize them.

Space and audience constrain Leitch from an exhaustive accounting of their infirmities or a thorough consideration of their more far-reaching implications. But follow him one step further, and the flaws he identifies raise the question whether (1) the obstacles to quantifying players' defense are inherent and insuperable and (2) metrics that presume, further, to calibrate it in runs possess any real utility at all.

Now, this isn’t to join the obscurantists who summarily dismiss sabermetrics and/or disparage those who practice it. As the foundation of the old baseball Establishment’s power and privileges crumbles and fades and the grizzled, tobacco-chewing lifers abdicate power to the overeducated Excel specialists, the last redoubt of the old-time religion is defended by the Yahoos.

Item: Among the phalanx of pitchfork populists who dominate WFAN’s airwaves, one recently derided sabermetricians as “40-year-old virgins, who work at Burger King and live in their mother’s basement.” (About which one might object, "Joe, you've overestimated by at least twenty-five years. By the law of averages, at least a few sabermetricians had to have gone to high school with your daughter.")

THE SABERMETRIC REVOLUTION & THE NEW FRONTIER
Yet in their nostalgia for the pecking order of high school, what the Yahoos overlook is that the athletically-challenged "nerds" and "geeks" they once scorned have acceded to the GM's throne in organization after organization for a reason. The quantitative analysis at which they excelled held elementary insights into the game and profound wisdom about its players the cigar-chompers had ignored. One day, long ago, visionaries like Bill James had cleared a path through the trees and opened for them a new world.


James had detected the biases and inadequacies inherent in old statistical staples like RBIs, batting averages, and fielding errors-- fallacies that were perhaps self-evident but that no one had pondered thoroughly enough to recognize. For example, the conventional wisdom regarded a .300 batting average and 100 RBIs as the hallmarks of a great hitter. No one had stopped to see the obvious: that, short of homering himself, a hitter cannot earn a “run batted in” unless the players preceding him in the lineup have reached base before him. 100 RBIs could as easily attach to a player of Lou Gehrig’s caliber as Wally Pipp’s. Indeed, during the 1920s, it did. Pipp amassed 109 RBIs in 1923 chiefly because the player hitting third, one ahead of him, reached base in an extraordinary 54.5% (.545) of his plate appearances that year. (Guess who he was.)

By contrast, newer metrics like RISP and OBA-- a hitter’s batting average with “runners in scoring position” (RISP) and his on-base percentage (OBA)-- judged hitters far more on their individual merits and thus served as more effective and reliable analytic tools. Leitch observes how widely recognized the RBI's shortcomings have become. But I, for one, am old enough to recall a time, through the '70s and '80s, when Phil Rizzuto, Frank Messer, and Bill White could broadcast an entire season of Yankee games without once discussing a player’s on-base percentage, commenting on his plate discipline, or mentioning his proficiency at drawing walks.

Probably not until Michael Lewis’ Moneyball illustrated how the Oakland Athletics capitalized on OBA to acquire players the marketplace undervalued did the newer metrics enter the broadcast vernacular. But by then, the Wall Street whiz kids running things had discarded OBA and were seeking out newer tools to identify players whose price didn’t reflect their value. After all, once an innovation develops into a staple, it no longer confers a competitive advantage.

Enter the New Frontier, defensive metrics: statistics that can tell the forward-thinking GM which players’ deft fieldwork, sweeping range, and supple gloves prevent runs but whose salaries don't yet reflect the win value their skills confer.

The two defensive metrics that have acquired the greatest currency, Leitch explains, are Runs Saved (RS) and Ultimate Zone Rating (UZR). The former was devised by John Dewan’s company, Baseball Info Solutions, with help, no doubt, from Bill James. (Baseball Info Solutions' website lists the Master as a consultant.) The latter was developed by sabermetrician Mitchel Lichtman, and its findings are published on the FanGraphs website. While Leitch happens to focus on Dewan’s RS metric, he also notes that in logic, method, scale, and, above all, in the totals they assign most players, the two metrics differ very little.

Leitch's article carries significance for a number of reasons. First, although baseball journalists and commentators cite these defensive metrics all the time, few have actually taken the time to explain how they operate; Leitch does. Second, Leitch identifies much of the technological void that besets their method and the fundamental guesswork that necessarily burdens their data. Above all, Leitch conveys, without necessarily stating it explicitly, how misleading is defensive metrics' pretense of tabulating a mathematically certain result.

The "Runs Saved" statistic may suggest a mirror image of the "Runs Created" figure James conceived for hitting [(Hits + Walks) × Total Bases / (At Bats + Walks)] (since refined). Both, after all, reckon their totals in baseball's precious currency-- runs. But tallying up a player's Runs Created and Runs Saved to derive a total picture of his run value wouldn't differ much from adding the price of a Rogers Centre ticket to that of a Yankee Stadium ticket to project the total cost of a Blue Jays-Yankees home-and-home series. Sure, both franchises denominate their tickets in dollars. Only the Blue Jays' ticket prices reflect Canadian currency.


No, the analogy isn't exact, because prices are set, not derived. Still, enough fanciful conjecture, subjective judgment, speculative assumption, and fundamental bias riddle the "Runs Saved" model that the defects raise the question whether the final numbers tally anything useful at all.
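For readers who want to see the arithmetic, here is a minimal Python sketch of the basic Runs Created formula quoted above. The function name and the sample stat line are my own illustrative assumptions, not figures from James or Leitch, and the formula shown is only the basic version that James has since refined.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        # No plate appearances counted, so nothing was created.
        return 0.0
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, invented purely for illustration.
sample = runs_created(hits=180, walks=70, total_bases=300, at_bats=600)
print(f"Runs Created: {sample:.1f}")
```

Every input here is a counted event from the box score. That is the contrast worth keeping in mind when the Runs Saved inputs come up below.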

RUNS SAVED: THE NEW DOMINO THEORY?
According to Leitch, Dewan's "Runs Saved" statistic propelled much of the Boston Red Sox's off-season and, in particular, inspired the organization's decision to replace Lowell with Beltre and Bay with Cameron. I find it difficult to believe that a GM of Theo Epstein's competence, with the critical intelligence a legal education should instill, could examine the RS metric's methodology and take its results at face value. Though it wouldn’t be the first time the hubris of the New Frontier's Best and Brightest courted a tragic fall.

Whatever the case may be, let’s use Boston’s hot corner to illustrate a few of the RS metric's many frailties.

Baseball Info Solutions judges the two third basemen as follows: “Adrian Beltre made 26 more plays than the average third-baseman, saving 21 runs [while] Mike Lowell made 23 fewer plays than the average, costing his team 18 runs.”

• Beltre: +26 plays made vs. the average third baseman, which translates into Runs Saved = +21
• Lowell: -23 plays made vs. the average third baseman, which translates into Runs Saved = -18

The above numbers encourage, but do not justify, the following deduction. Had Beltre played third base instead of Lowell in 2009, the Red Sox would have allowed 39 fewer runs (21 – (-18)) and, in turn, won about four more games.

Actual Red Sox Runs Allowed in 2009 (Lowell) = 736
Projected Runs Allowed in 2009 (Beltre) = 736 – 39 = 697
Expected winning percentage = (Runs Scored)^2 / ((Runs Scored)^2 + (Runs Allowed)^2)
With Boston's 872 runs scored: Δ ≈ .026 ≈ 4 wins over 162 games
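The deduction can be reproduced with a short Python sketch of the Pythagorean expectation. The runs-scored input (872) is my own assumption, since the figures above give only runs allowed, and the function names are illustrative. The size of the win swing is sensitive to these inputs.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def wins_gained(runs_scored, runs_allowed, runs_prevented, games=162):
    """Extra expected wins if the defense allows `runs_prevented` fewer runs."""
    before = pythagorean_pct(runs_scored, runs_allowed)
    after = pythagorean_pct(runs_scored, runs_allowed - runs_prevented)
    return (after - before) * games


RUNS_SCORED = 872    # assumption: not given in the text above
RUNS_ALLOWED = 736   # 2009 figure cited above
SWING = 21 - (-18)   # Beltre's Runs Saved minus Lowell's

as_played = pythagorean_pct(RUNS_SCORED, RUNS_ALLOWED)
extra_wins = wins_gained(RUNS_SCORED, RUNS_ALLOWED, SWING)
print(f"Expected winning pct as played: {as_played:.3f}")
print(f"Extra wins implied by a {SWING}-run swing: {extra_wins:.1f}")
```

Nudging the exponent or either run total moves the answer by a fraction of a win, which is worth bearing in mind before reading the result as precise.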

The problem, however, is the human bias upon which the numbers rest.


Recall what James once said about the error statistic: "It is, uniquely, a record of opinions." Elide the "uniquely" and one could as easily have characterized Runs Saved's methodology. Subjectivity and conjecture abound.

THE DATA

According to Leitch’s article, Dewan employs 15 to 20 video scouts and charges them with watching every single game played during a season and logging every ball’s destination on a grid divided into 3,000 zones and superimposed over the playing field.

“Each of our video scouts has a computer screen with a replica of the field... We mark the exact location and velocity of everything,” Dewan told Leitch. Leitch adds: "At the end of the season, Dewan has a complete log of every [] hit to each of roughly 3,000 zones." Exact? No, not really. 3,000 zones spread over the 90,000 sq. ft. of the average MLB ballpark's fair territory still leave a considerable margin for error. In baseball, remember, mere inches determine seasons. Even less exactitude characterizes hits' velocity and their type. See below.
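A rough sense of that margin, as a sketch in Python. It takes the two figures above at face value and assumes, purely for illustration, that the zones are equal in area and square; the actual grid may well be laid out differently.

```python
import math

FAIR_TERRITORY_SQ_FT = 90_000  # figure used in the text above
NUM_ZONES = 3_000              # "roughly 3,000 zones"


def zone_dimensions(total_area_sq_ft, num_zones):
    """Return (area, side length) of one zone, assuming equal square zones."""
    area = total_area_sq_ft / num_zones
    side = math.sqrt(area)
    return area, side


area, side = zone_dimensions(FAIR_TERRITORY_SQ_FT, NUM_ZONES)
print(f"{area:.0f} sq. ft. per zone, about {side:.1f} ft on a side")
```

On those assumptions, two balls logged to the same zone could have landed several feet apart, a long way in a game of inches.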

A hypothetical record looks as follows:

• Location = Zone 187° /290’
• Velocity = Medium (or Fast or Slow)
• Type = Fly Ball (or Grounder, or Line Drive, or Fliner)

What the hell is a “fliner,” you ask? Or what speed qualifies as “medium”? Upon reviewing the hit recorded above, would all 20 video scouts have agreed independently on its taxonomy? And, by the way, where does the above log account for the angle of trajectory, spin, and precise vector? How does the model adjust for variations in ballparks’ square footage? What about where the relevant fielder was positioned when the ball landed in Zone 187° /290’?


APPLYING THE DATA


Once the scouts compile the raw data, the model applies it as follows. From what I can gather, the computer model more or less resembles the system AVM Systems devised for the Oakland Athletics, the operation of which Lewis describes in Moneyball.


The market for securities derivatives, evidently, inspired the model and informs its logic. Derivatives-- at least, as I understand them-- value and trade in an individual security's component parts. Likewise, each of Dewan's 3,000 zones receives a run value. The value stems from the average run contribution balls hit to that zone have produced over the last ten years, given x outs and y runners on base.


To see how this works, first imagine a hypothetical 2009 game. (Keep in mind: I've arbitrarily invented the run values for purposes of the illustration.) In the first inning, Derek Jeter hits the ball and winds up on second base with nobody out. Jeter, accordingly, has created 0.7 runs for the Yankees because over the last ten years MLB teams have scored, on average, 0.7 runs when a lead-off runner has reached second base with nobody out.


Of course, Johnny Damon, next at bat, will alter the game situation and, in turn, the Yankees' likelihood of scoring runs. Damon will either reach base or make an out. The former increases the Yankees' chance to score; the latter diminishes it. So let's say Damon hits a seeing-eye single (grounder) that travels (fast) to Zone #187/290 through the hole between third and short. Whether or not Jeter scores, Damon will receive a Runs Created number based on the average number of runs teams have scored in identical situations over the last ten years. Because over the last ten years hits recorded at Zone #187/290 with a runner on second base and no one out have resulted in the hitting team scoring, on average, 1.1 runs, Damon will receive a Runs Created value of +0.4 (1.1 – 0.7). Conversely, on the fielding team, the out Mike Lowell failed to record will earn him a negative Runs Saved value. Rather than simply assigning Lowell the inverse (-0.4), the model will base his Runs Saved debit upon the average performance of third basemen in fielding fast grounders hit to Zone #187/290 with a runner on second base and no one out, and upon the average run credit or debit that performance realized for their teams.
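The bookkeeping in this example can be sketched in a few lines of Python. The run values are the same invented ones used above; the table layout, key names, and function name are my own assumptions for illustration and not a description of Dewan's actual software.

```python
# Average runs scored in the rest of the inning from a given situation.
# Keys: (runners on base, outs). Values invented, as in the text.
RUNS_FROM_STATE = {
    ("runner on 2nd", 0): 0.7,
}

# Average runs scored after a hit to a given zone in a given situation.
# Keys: (zone, runners on base, outs). Values invented, as in the text.
RUNS_AFTER_HIT_TO_ZONE = {
    ("187/290", "runner on 2nd", 0): 1.1,
}


def batter_credit(zone, runners, outs):
    """Runs Created credit: average runs after the hit minus average before."""
    before = RUNS_FROM_STATE[(runners, outs)]
    after = RUNS_AFTER_HIT_TO_ZONE[(zone, runners, outs)]
    return after - before


damon = batter_credit("187/290", "runner on 2nd", 0)
print(f"Damon's Runs Created credit: {damon:+.1f}")
```

Note what the lookup keys contain: a zone, the runners, and the outs, and nothing else. Lowell's debit would need a further table of how often the average third baseman converts such balls, which my example doesn't supply.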


Sounds eminently valid, no? The problem, of course, is you can't simply equate Damon's batted ball to Zone 187/290 with every other one that travelled there over the last ten years. Each one has its own precise speed, angle of trajectory, vector, spin, and type. And because our current technology does not (or cannot) permit us to measure the latter variables with precision, Dewan's log doesn't record them and human error necessarily enters instead.

Neither, of course, does the model account for the Red Sox’s positioning of Lowell. It assumes that where Lowell positioned himself doesn't differ significantly from where every other third baseman in the league has positioned himself over the last ten years. Further imprecision stems from definition. Dewan's metric, evidently, doesn't register recorded outs. (It can't, right? Who records the out, the third baseman or the first baseman?) RS's unit of measure is "plays made." But what precisely qualifies as a "play made"? Does merely fielding the ball qualify? What if Lowell caught the ball but didn't throw it because he wanted to hold Jeter at third?

Does the system account for the play that saves the run but doesn't record the out? Or what if Lowell fielded the ball but then Damon's speed forced an errant throw to first? Is that a "play made"? What if Youkilis's agile stretch at first compensated for Lowell's throw? Is that a "play made"? What if Lowell holds the ball, baits Jeter, and tags him running to third? Is that a "play made"?

Even so, the “Plus/Minus” figure may still carry some heuristic value. But with all of its flaws, converting the number into Runs Saved further compounds the conjecture and the unreliability of the result. Think about it. How many shortstops' defense actually improves with age? Jeter's Runs Saved total was -23 as recently as 2007 (and as low as -28 in 2005) but came to +2 in 2009. Hey, the Captain's accomplishments never cease to amaze me. But could Jeter actually have defied everything we know about shortstops' historical performance, skill regression, and the aging process and not only elevated his defensive play at the age of 35 but improved it by 25 runs besides? And if not, what value is there in a metric that insists upon calculating its value in "runs saved" that don't share an identity with the runs scored on the field?

HITTING V. FIELDING: INCOMMENSURABLES?
Advances in technology may augment these metrics' accuracy, but they may never match the validity and utility of hitting statistics. In fact, by calibrating players’ defense in runs, defensive metrics obscure the fundamental difference between fielding and hitting and encourage the misconception that a perfect identity exists in their scales.

Leitch writes,

“A hitter’s job is easy to quantify: He succeeds in getting on base (or hitting the ball out of the park) or he doesn’t. But countless variables can affect whether a fielder even has a chance to make an error: where he’s positioned, how quickly he reacts to the ball off the bat, what route he takes toward it, how quickly he gets the ball out of his glove, how hard he throws the ball to another fielder… But how do you decide whether the left-fielder should have been in a better position to catch a shallow, looping fly ball…It’s an extremely structured method of collating subjective judgments."


Leitch is right, of course, but not precisely correct. Batting statistics do indeed owe their simplicity and their validity to quantifying (1) an empirical fact—a batter either hits the ball or doesn’t— and (2) a uniform and binary outcome— he either reaches base or he induces an out.

Yet strictly speaking, fielding statistics measure a verifiable either/or result as well: a fielder either "made the play" (so defined) or he didn't. As argued above, the ambiguity, in part, springs from the definition. What qualifies, that is, as a "made play" deserving of a '+'? Simply fielding the ball or actually recording the out? If Lowell dives, gloves the ball, and bounds to his feet, does he have to throw Damon out at first, or does holding Jeter at third qualify? Does "plays made" count the smart but prosaic play, the time when Lowell held the ball and yielded the infield hit to prevent the runner from scoring, and how does it weigh it against the consistently spectacular but foolish play that sacrifices a run to an out? (Think Robinson Cano in the latter instance.)

But the inchoate definition points to a more endemic ambiguity. I imagine the metric tabulates "plays made" rather than "recorded outs" because a recorded out frequently involves multiple players, among whom the ultimate credit is impossible to assign. If Lowell happens to record the out only because Youkilis's deftness saves the errant throw, does the first baseman deserve the credit, or does the third baseman for giving Youkilis the chance to begin with?

A hitter enters the batter’s box alone. Hit, walk, strikeout, or home run: on his team, he alone controls the outcome. Opposite him, only the pitcher compares in agency, and eight teammates still support him. Isolating, let alone quantifying, a single fielder’s influence on his team’s fate, on the other hand, introduces a range of variables and a labyrinth of complexity belonging to another order of magnitude entirely.

Compare the relative simplicity of isolating the variables that influence hitting stats. In the above hypothetical, Johnny Damon's Runs Created credit of 0.4, for example, depends upon the pitcher, the ballpark, and the opposition's fielding proficiency, among other factors. (And the unbalanced schedule means that Damon's runs creation incorporates a disproportionately greater number of AL East pitchers, ballparks, and opponents' fielders than average.) But then again, the batted balls of every other hitter in the American League are skewed in a like manner, and it's an average of them that determines the run value of Damon's batted ball.

More importantly, the most influential variable of all-- the pitcher whom Damon faces-- changes at least once every game. As a consequence, we can't attribute Damon's total Runs Created to his aberrant success against any one particular pitcher. Conversely, a pitcher's Fielding Independent Pitching statistics-- strikeouts, walks, home runs-- enjoy a parallel independence because of the number and variety of hitters he faces.

Not so with fielders, however. Mike Lowell played 107 games at third base in 2009. He played almost all of them with Nick Green or Julio Lugo at shortstop, Youkilis at first base, and Jason Bay in left field. How much of Lowell's -23 "plays made" and -18 Runs Saved is owed to the inadequacies of his adjacent fielders? If Adrian Beltre had had to play alongside Nick Green and Jason Bay in 2009, how many fewer "plays made" would he have recorded in the zones around third base, for example, or in foul left field, because he had to shade to his left to compensate for his shortstop's and left fielder's shortcomings? The Red Sox replaced all three this off-season. Perhaps they apprehended the inherent difficulty of assigning responsibility for the inordinate number of balls that fell on that side of the field to any one of them.

PITCHER OR FIELDER: WHOSE BALL IS IT ANYWAY?
But it isn't merely a matter of separating one fielder's responsibility from another's. Beneath the Runs Saved concept lies our ignorance of the respective contributions of fielders and pitchers to the ball in play.

Recall the insight about pitching that Voros McCracken discovered and that Moneyball's chapter on Chad Bradford discusses. Earned Run Average doesn't gauge a pitcher's performance accurately because, as it turns out, hits-- and by extension, the runs generated from them-- depend too greatly on both the fortuity of the ball’s bounce and his fielders' proficiency. McCracken found that the ERAs of pitchers like Greg Maddux and Randy Johnson fluctuated too widely from year to year to reflect their efficacy. Only outcomes they alone controlled-- strikeouts, walks, and home runs-- showed the uniformity indicative of their consistent dominance on the mound.

This doesn't mean the pitcher exerts zero influence on what becomes of a ball a hitter puts in play. It only means we haven't yet been able to quantify it. Quoth James (Baseball Abstract, p. 885): "A pitcher does have some input into the hits/inning ratio behind him, other than that which is reflected in the home run and strike out column."

The converse is necessarily true as well. On balls put in play, we can't quantify the relative influence of the fielder either. How much of Mike Lowell's and Adrian Beltre's '+/-' and "Runs Saved" totals, therefore, reflects the idiosyncratic abilities and deficiencies of the Red Sox's and Mariners' pitching staffs? Seattle yielded the fewest runs (689) of any team in the AL in 2009. So too, its pitching staff's park-adjusted ERA+ of 113 led the league. To whom do we credit the feat: the guy on the mound or the eight men surrounding him?

Why is this significant? Why do we need to quantify the relative contribution of pitcher and fielder? Because what does it matter that Beltre saved 21 runs above the average third baseman if you can’t calculate how this Runs Saved figure of 21 factors into the 689 runs his team allowed? One cannot simply deduce that with the average third baseman the Mariners would have allowed 21 more (689 + 21 = 710), because we don't know how many runs the Mariners' pitchers would have prevented regardless. Consequently, we can't translate Beltre's Runs Saved figure into the most important knowledge of all: how many wins did the Mariners accrue because of him?

E PLURIBUS UNUM: A ROMANTIC'S DISSENT
Baseball assumed the title of "The American Pastime" for myriad reasons owing to the institution's history and pedigree. But one reason often overlooked is a symbolic one-- the identity its character shares with the nation itself.

For alone among the country's three professional team sports, baseball embodies America in microcosm. It captures the tension roiling within a pluralistic society committed, at once, to forging a cohesive nation out of a motley immigrant people and, at the same time, to honoring the principle of individual freedom. In team athletics, the confrontation between pitcher and batter, resembling as it does an individual sport like tennis, is inimitable. From this individual contest suspended within the matrix of a group competition evolves baseball's unique fetish for numbers. What's more, upon it rests our country's peculiar fixation with individual merit and our celebration of the pastime as a Platonic arena which rewards it.

It perhaps also drives the compulsion to break the game down into its component parts and to place a value on each. More ominously, it may also encourage the hubris that we can. The scientific method, you see, is a seductive temptress. No sooner did Newton, for example, demonstrate we could apprehend much of our world through mathematical formulas than many came to believe we could explain all of it as such. That with scientific advances and technological progress, we could find the key to human history, to war and peace, to surfeit and famine, to eternal economic plenty, political freedom, and human happiness and reduce it to a universally applicable formula. Then, one day a mustachioed tyrant (or two) actually tried to implement it. And in trying to order the fate of millions, his sacrifice of millions gave wicked illustration to the limits of empirical formulas in comprehending social behavior and in measuring human motive and desire. The Age of Reason had made fools of us all.

This isn't to imply anything sinister about sabermetrics, of course, or to make common cause with the Yahoos who deny its revelations. I merely wish to sound a cautionary note. To warn how little space lies between James' insight that statistics contain hidden truths about the game the naked eye cannot see and the leap that statistics contain all of them. To dispute that with the latest technology and the right formula we can quantify and ultimately explain everything, from the precise interrelation of hitting and fielding on runs, to how adjacent fielders' proficiency affects a player's performance, to how and why hitters inexplicably enter slumps and just as precipitately emerge from them.

For at some point we will reach the tipping point. Imponderables like herd behavior and attribution bias and the bystander effect will manifest to thwart logic and reason. And the mysteries that enshroud the game will announce that they forever elude us.

And when the day arrives, may we have the wisdom to recognize it. More importantly, may we have the gratitude to welcome it. For that's the day the science of baseball will yield to the art of baseball and it will inspire the silent awe worthy of the sublime.