THE BOOK--Playing The Percentages In Baseball

Monday, July 20, 2009

John Dewan (and research assistant) speak!

Posted at 07:19 PM

Great job over at SOSH for the Q&A. Here are their detailed answers, including the first one on UZR:

Getting back to your question, what’s the difference between my system and UZR? While I don’t know for sure if the current version of UZR is an extension of my original UZR, or if it was independently developed, the bottom line is that they are based on the exact same concept. Both systems break the field into small areas and look at the probabilities of plays being made in those areas. The differences lie in the various adjustments that are made.

My research assistant, Ben Jedlovec, prepared the following:

Based on my understanding of both systems,

Similarities

* Both use BIS Data. UZR started with STATS data, but the most commonly referenced version uses BIS data.
* Both have the same idea- break down balls in play by type, location, velocity.
* Both are measured on an above/below average scale.
* Both have runs saved systems with components for GDP, OF Arms, Range.
* We use similar run value multipliers at each position.
* Both are available online (Fangraphs or Bill James Online).

Technical Differences

* UZR uses multi-year samples, while Plus/Minus adjusts for year-to-year league changes. As teams are increasingly recognizing the importance of a strong defense, the league as a whole will be stronger defensively. It is important to handle this trend appropriately.
* Plus/Minus uses smaller, more precise zones, or “buckets” of plays.
* UZR has several minute adjustments, such as batter hand, pitcher hand, base/out state, and pitcher groundball/flyball tendencies. We remain focused on the value contributed to the team in the player’s specific context.
* Park adjustments are handled differently- I believe UZR applies blanket adjustment across all buckets, while Plus/Minus has park factors in form of more precise buckets. A ball hit 395 feet to Vector 190 that stays in the park is only compared to all other balls hit 395 feet to Vector 190 that stay in the park. If it leaves the park, it neither helps nor hurts the fielder. Also, we added the “Manny Adjustment”, which removes fly balls hit unreachably high off a wall. We named the system after the Green Monster’s most notable victim, who went from being by far the worst left fielder in baseball before the adjustment to being only arguably the worst left fielder after the adjustment.
* Plus/Minus accommodates plays where the first baseman holds the runner and middle infielders are covering second on hit-and-run plays. UZR adjusts for all base/out states.
* The two systems apply the run values at different stages in the calculations. UZR applies runs right away, while we convert to Enhanced PM then apply the Run Factors.
* Plus/Minus is a little more aggressive in awarding credit/penalty. An example: 100 balls in a ‘bucket’ (specified type, velocity, location), 30 fielded by the 2B, 20 by the 1B, 50 go through for singles. On a groundout to the second baseman, we give +50/(50+30) = 5/8 = +.625. UZR gives +50/100 = +.50. On a single through both fielders, Plus/Minus gives -30/80 = -.375 to the 2B, and -20/70 = -.29 to the 1B. UZR gives -30/100 = -.3 to the 2B, and -20/100 = -.2 to the 1B. You could make an argument for either method of accounting, but neither one is better than the other. The differences are the greatest at the middle infield positions, where overlap between fielders is the highest.
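
The bucket example above can be checked with a few lines of Python. This is only a sketch of the arithmetic as stated in that example; the function names are made up for illustration, and nothing here is taken from either system's actual code.

```python
from fractions import Fraction

# Bucket from the example: 100 balls, 30 fielded by the 2B,
# 20 fielded by the 1B, 50 through for singles.
outs = {"2B": 30, "1B": 20}
hits = 50
total = hits + sum(outs.values())


def plus_minus_credit(fielder):
    """Credit for an out; the other fielder's outs are left out of the denominator."""
    return Fraction(hits, hits + outs[fielder])


def plus_minus_debit(fielder):
    """Penalty on a hit, using the same reduced denominator."""
    return -Fraction(outs[fielder], hits + outs[fielder])


def uzr_credit():
    """Credit for an out, using every ball in the bucket as the denominator."""
    return Fraction(hits, total)


def uzr_debit(fielder):
    """Penalty on a hit, using every ball in the bucket as the denominator."""
    return -Fraction(outs[fielder], total)


for name in outs:
    print(name,
          "Plus/Minus:", float(plus_minus_credit(name)), float(plus_minus_debit(name)),
          "UZR:", float(uzr_credit()), float(uzr_debit(name)))
```

In this bucket both conventions net to zero for a second baseman who makes exactly 30 of the plays and lets 50 through; the Plus/Minus figures are simply larger in magnitude.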

Fundamental Differences

* Runs Saved includes Bunt Runs Saved for corner infielders, pitcher fielding (Plus/Minus and holding runners), and catcher fielding (handling the pitching staff and the running game).
* Runs Saved measures the extra impact of HR Saving Catches. Runs Saved will add other Defensive Misplay/Good Fielding Play runs in the future.

A fairly descriptive and fair review, considering the kind of competing nature of the two systems. Good job to Ben.

When it comes down to it, I give our overall plus/minus numbers similar credibility as other baseball numbers, like batting average or on-base percentage. In my new book, The Fielding Bible--Volume II, we developed Runs Saved. I think of Runs Saved as the Runs Created of defense in that it encompasses a wide variety of methods. I give Runs Saved similar credibility to Runs Created.

Batting Average is a fair comparison for reliability. OBP is not. I prefer my answer: OBP tells you as much after 200 PA (50 games) as UZR tells you after 400 BIP (100 games). Both of which tell you half the story at those points in time.
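
One common way to put numbers on "half the story" is the shrinkage model r = n / (n + k), where k is the sample size at which the observed figure and the league average deserve equal weight. The sketch below assumes that model (the post does not state it) and takes k = 200 PA for OBP and k = 400 BIP for UZR from the comparison above.

```python
def reliability(n, k):
    """Weight on the observed figure under r = n / (n + k) (assumed model)."""
    if n < 0 or k <= 0:
        raise ValueError("n must be >= 0 and k must be > 0")
    return n / (n + k)


# Half-the-story points taken from the text; the formula itself is an assumption.
OBP_K = 200  # plate appearances
UZR_K = 400  # balls in play

for label, n, k in [
    ("OBP after 200 PA", 200, OBP_K),
    ("UZR after 400 BIP", 400, UZR_K),
    ("OBP after 600 PA", 600, OBP_K),
    ("UZR after 600 BIP", 600, UZR_K),
]:
    print(f"{label}: r = {reliability(n, k):.2f}")
```

Under this model, the same number of opportunities always says less about UZR than about OBP, because the UZR constant is twice as large.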

On offense I believe we’re measuring 80-90 percent of the true ability of players. On defense, I believe we’re at about the 60 percent level. But we’re still at the tip of the iceberg in terms of precision and a ton more can be done, especially defensively. As new forms of data become available, we’ll be able to enhance our defensive systems. One example: BIS has now developed a batted ball timer, which we believe will greatly improve the accuracy of our system.

Hang time! Finally. I wonder if that’s in the feed that Fangraphs or HardballTimes gets.

And then.. the chat ended. I don’t know what happened. Too bad, because there were a lot of good questions out there.


#1 BenJ 2009/07/21 (Tue) @ 05:06

John’s traveling and having trouble finding time to polish off the rest of his answers and post them. There are plenty of excellent (and tough!) questions in the queue, so check back at SOSH in the coming days…

#2 MGL 2009/07/21 (Tue) @ 09:25

I am in the middle of the chat over at SOSH. I have not yet gotten to the answers - I am just finishing up with the questions, which are excellent ones. They have a great readership over there - like us.

Here is one question, or at least a similar one, which I hear/read probably a hundred times a year, more or less.

What kinds of factors skew statistical analyses of defense?

I’ll give a concrete example: Jacoby Ellsbury’s defense has fallen off a cliff this year, according to most advanced metrics. I find it hard to believe that he was an elite defender in 2008 and is now a poor one. Can you make an educated guess whether one year’s rating is more likely to be an aberration than the other? If so, what factors would you look for as signs that a particular player’s rating is an aberration (or, inversely, is especially likely to be accurate)?

My response is very important because many people, even astute followers of sabermetric stats and principles, still do not get what it means to present a metric which represents a sample of a player’s performance, or I should say, a measurement of that performance (I qualified that because in addition to random fluctuation in the “value” of the performance, we also have random - or even biased - variation in the metric because of measurement error).

Let me re-state the problem succinctly:

Player A has a stellar UZR (or OPS, or lwts, or ERA, or FIP, or whatever) one year and then a really bad one the next year. Or vice versa. Player is not hurt in one of those years and healthy in the other, as far as we know. How can this be! Dial 911! How can a player be fantastic with the glove one year (say, according to UZR) and terrible the next year? That just can’t be! There must be something wrong with the metric!

When a player has a good or great UZR in any time period, it does NOT mean that he is a good or great player and it does not even mean that he had a good or great year with the glove. What does it mean, you ask? Well, it doesn’t mean anything other than he had a good or great UZR. It really doesn’t. He may be a bad or average defensive player who had a good 147 games with the glove (woop-de-do). He may be a bad or average defensive player who did NOT have a good or great year with the glove but the way we measure defense with UZR just got it wrong. That does NOT mean that UZR is a bad measurement device. It just means that it is not a PERFECT measurement device. It is the same thing with offense. People think that the measurement of offense is “perfect” because we have these neat little bins that offensive performance goes into (singles, doubles, triples). Well, guess what? For the purposes of measuring offensive talent and using that measurement to predict future offensive performance, those neat little bins are crap. A player could have a good or great year on lwts or OPS and he could be a bad player who either DID have a good or great year on offense, OR he could be an average (or bad) player who did NOT have a great year on offense, but we were fooled into thinking that he did because of those misleading artificial neat little offensive bins. If Tony Womack has a good year on lwts or OPS because he got a whole bunch of bleeders through the IF, a crapload of bloop hits to the OF, and 2 or 3 HR balls that went just over the fence on a windy day in Wrigley Field… Well, you get the idea.

So a good or great UZR doesn’t mean ANYTHING other than a good or great UZR. And the next year, when the same player, even with the same talent, has a bad UZR, guess what? That doesn’t mean anything either. It doesn’t mean he had a great defensive year and then a bad one. It doesn’t mean that he was a great defensive player in one year and a poor one in the other year. It doesn’t mean anything. It just means that he had a good or great UZR in one year and a poor one in another year. Nothing more and nothing less.

The sooner people think that way, the sooner they will not become so incredulous at the small percentage of players for whom that is the case (wild fluctuations in UZR or any other metric from one time period to another).

So how does the thinking process go, you ask, if you don’t want us to think this way? Good question and I’m glad you asked. It should go EXACTLY like this (the thinking process):

Player A has a UZR of +10 in 2008. I have no idea whether he is a good or great player, an average or poor one. I really don’t. AND, I have no idea whether he actually had a good season with the glove, an average one or even a poor one. (Yes, it is entirely possible that a player with a UZR of +10 could have had a poor season with the glove - it really is). So what do I do now?

The same thing you do with any other measure of sample performance. You make some inferences if it suits your needs. Our best guess is that the player is a pretty good defensive player if that is all we have to go on. If player B is -10 in 2009, then we guess that player A is better than player B. Do we know any of that? Of course we don’t. THAT is why we don’t jump out of our skin if player A is -10 next year. He might have been a true -10 in the first place. Or a true 0. Or -3. Or +4 and the -10 in 2010 is “wrong.” We don’t know the answer to any of those questions!

One of your questions at this point might be, “Well if a player can be a true -10 and have a +10 in an entire season, there must be something wrong with the metric.” My answer is how the heck do you know. The metric could be perfect and the only thing we are seeing is random fluctuation in sample performance.

Let’s say that we had a sophisticated device for measuring the result of a coin flip. Let’s call it, the “looking at the coin lying on the floor” device. OK, we flip a coin 50 times and it comes up 28 heads and 22 tails. No big deal, right? Now we flip it again 50 times and it comes up 23 heads and 27 tails. Oh my God, there must be something wrong with our measuring device! Get my point? I hope so.
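
The coin example can be made concrete with the exact binomial distribution. The sketch below (function names are illustrative) computes how often a fair coin gives 28 or more heads, or 23 or fewer, in 50 flips.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p=0.5):
    """Probability of k or more heads in n flips."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer heads in n flips."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


flips = 50
print(f"P(28 or more heads in {flips}) = {prob_at_least(28, flips):.3f}")
print(f"P(23 or fewer heads in {flips}) = {prob_at_most(23, flips):.3f}")
```

Neither result is unusual for a fair coin measured perfectly, which is the point of the example.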

Or consider that even a great measurement device, in baseball at least, has LOTS of measurement error. So part of that +10 one year and -10 the next year could be fluctuations in performance and part of it could be measurement error. How do you know how much measurement error constitutes a “bad measuring device” with regard to evaluating defense? You don’t. You really don’t. I don’t. Suppose I told you that 10% of all players who don’t change their true talent defensive value from one year to the next differ by more than 11 runs in their UZR per 150 from one year to another. Only 10%. So that -10 to +10 (a 20 run difference) does not occur very often. Is that a good number? That 10%. How about 5%? 15%? You have no idea, do you? Neither do I. I can tell you that there is around a 50% regression in UZR from one year to another for a full time player. Is that good or bad for a baseball metric? You have no idea. Neither do I.

Let’s just say for the sake of argument that 5% is great. 95% of all players are within 11 runs from one year to the next. That means that a +10 to -10 is a rare event - maybe 1% or less of all players will have that kind of year to year fluctuation due to whatever. That seems pretty good to me.
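
To see whether "1% or less" is a plausible figure, one can assume (this is an assumption, not something the comment states) that year-to-year differences for players with unchanged talent are roughly normal with mean zero. The sketch backs out the standard deviation implied by "X% differ by more than 11 runs" and then asks how often a 20-run swing would occur.

```python
from statistics import NormalDist


def sd_from_tail(threshold, tail_prob):
    """Standard deviation such that P(|diff| > threshold) equals tail_prob."""
    z = NormalDist().inv_cdf(1 - tail_prob / 2)
    return threshold / z


def tail_beyond(threshold, sd):
    """Two-sided probability that a zero-mean normal difference exceeds threshold."""
    return 2 * (1 - NormalDist(0, sd).cdf(threshold))


for share in (0.05, 0.10):
    sd = sd_from_tail(11, share)
    print(f"{share:.0%} beyond 11 runs -> sd = {sd:.2f}, "
          f"P(|diff| > 20) = {tail_beyond(20, sd):.4f}")
```

Under that normality assumption, both the 5% and the 10% starting points put a 20-run swing below 1% of players.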

Now, given that is the case, you will likely find at least a few players in that category every year. So you can point to those players - like Jacoby - and say, “How can UZR be a good metric! Look at Ellsbury (and 3 or 4 other players). They are great one year (NOT!) and terrible another (NOT!). UZR has to be a lousy metric!”

OK, getting back to how you should think of these numbers - as just numbers - and then you can make some inferences using those numbers, if it suits your needs and/or your fancy. We have our +10 player and if we want to we can say that our best estimate is that he is a pretty good defender. In fact, of all +10 UZR players, maybe 70% of them will be +5 or more next year, and 10% of them will be 0 or less. Just numbers. That’s all.

Now, the next year he is -10. Instead of that 911 call, it still means nothing. Just numbers. We now have a player that is zero after 2 years. More reliable data because we have 2 years rather than one. I am thrilled. Do I care that he was a +10 one year and a -10 the next year? Absolutely, positively not! I don’t even remember that he WAS a +10. I only know that he was zero in his last 2 years (or only 2 years if I have no more data). So if I want to make an inference now for whatever reasons, it is that he is an average defensive player. After last year, I thought there was a pretty good chance that he was a good one. Now I “changed my mind” because I have more data. So what? Doesn’t mean anything to me that he was a +10 last year and a -10 this year (OK, if you want to get sophisticated on me, you can do some adjustments if you think he may have been injured in his -10 year or he learned some new defensive technique in his +10 year or some such thing - and of course, technically you should weight the two years to account for changes in true talent as time goes on). Absolutely nothing.
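
The two-year bookkeeping described above amounts to a weighted average. The sketch below weights each season by games played and, optionally, discounts older seasons; the `recency_decay` parameter and its example value are hypothetical, standing in for the "weight the two years" remark rather than reproducing any actual weights.

```python
def combined_rating(seasons, recency_decay=1.0):
    """Playing-time-weighted average of per-150 ratings.

    seasons: list of (rating_per_150, games) tuples, oldest first.
    recency_decay: hypothetical factor < 1 that discounts older seasons.
    """
    newest = len(seasons) - 1
    weighted_sum = 0.0
    total_weight = 0.0
    for index, (rating, games) in enumerate(seasons):
        weight = games * recency_decay ** (newest - index)
        weighted_sum += rating * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no playing time to average over")
    return weighted_sum / total_weight


two_years = [(+10, 150), (-10, 150)]
print(combined_rating(two_years))                     # equal weights
print(combined_rating(two_years, recency_decay=0.8))  # newer season counts more
```

With equal weights the +10 and -10 seasons cancel to zero; with a recency discount the estimate leans slightly toward the more recent season.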

That is how you should think of these things.

And if I ever read or hear a question like that (“How could so-and-so be a +9 this year and a -11 last year?”) again, I am going to scream and write another diatribe…

#3 MGL 2009/07/21 (Tue) @ 10:19

The chat had like 2 or 3 answers (with a lot of good questions) and ended 5 days ago? Anyone know what happened?

I agree that the answer to the question about similarities and differences between plus/minus and UZR was pretty good. I’ll make some comments one by one:

Similarities

* Both use BIS Data. UZR started with STATS data, but the most commonly referenced version uses BIS data.

Correct.

* Both have the same idea- break down balls in play by type, location, velocity.
* Both are measured on an above/below average scale.
* Both have runs saved systems with components for GDP, OF Arms, Range.

Correct. Technically, UZR does not include GDP and OF arm runs, but I compute those separately for Fangraphs and I have been doing that for 15 years or so.

* We use similar run value multipliers at each position.
* Both are available online (Fangraphs or Bill James Online).

Correct, I guess.

Technical Differences

* UZR uses multi-year samples, while Plus/Minus adjusts for year-to-year league changes. As teams are increasingly recognizing the importance of a strong defense, the league as a whole will be stronger defensively. It is important to handle this trend appropriately.

Of course it is important to do everything appropriately. I could do it the way they do it by changing the code in about two minutes. I have gone back and forth. Because I think there are biases in the data from one year to another, I prefer to do it the way I currently do it - zeroing out every position each year. If someone wants to, they can do the league adjustments from year to year. I don’t recall seeing any evidence that the league has been getting better defensively over the last few years. I would not be surprised if that were the case.

It is a little arrogant of them to assume that I am not doing it “appropriately” and they are. As if I never thought about the way they do it and as if their way is more rigorous…
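
"Zeroing out every position each year" can be sketched as re-centering each (year, position) group so its runs sum to zero. The version below is an illustration only: the row fields and the per-chance re-centering are assumptions, not a description of how UZR is actually coded.

```python
from collections import defaultdict


def zero_out_by_position_year(rows):
    """Re-center runs so each (year, position) group sums to zero.

    rows: list of dicts with hypothetical keys 'player', 'year', 'pos',
    'runs' (raw runs against a multi-year baseline) and 'chances'.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # (year, pos) -> [runs, chances]
    for row in rows:
        group = totals[(row["year"], row["pos"])]
        group[0] += row["runs"]
        group[1] += row["chances"]

    adjusted = []
    for row in rows:
        runs, chances = totals[(row["year"], row["pos"])]
        league_rate = runs / chances if chances else 0.0  # runs per chance
        adjusted.append({**row, "runs": row["runs"] - league_rate * row["chances"]})
    return adjusted


sample = [
    {"player": "A", "year": 2008, "pos": "SS", "runs": 6.0, "chances": 400},
    {"player": "B", "year": 2008, "pos": "SS", "runs": 0.0, "chances": 200},
]
for row in zero_out_by_position_year(sample):
    print(row["player"], round(row["runs"], 2))
```

Skipping this step and keeping the multi-year baseline is what would let a league-wide trend show up in the numbers.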

* Plus/Minus uses smaller, more precise zones, or “buckets” of plays.

I don’t know that they do, but if they use distance and vector, say in degrees, that is a lot of buckets and if they are not using a “smoothing” function or some such adjustment, they are asking for trouble. You simply CANNOT use that many buckets without a smoothing function. And if you are using a smoothing function, you ALMOST might as well just use larger (fewer) buckets. Again, the use of the word “precise” is a little arrogant and misleading. I don’t think my buckets are less “precise.” They may be just larger and fewer (encompassing more area of the field).

And if you are going to be using tiny areas on the field, your data recording better be incredibly accurate. And that just isn’t the case. Another reason why it is probably a mistake to use such small buckets.

* UZR has several minute adjustments, such as batter hand, pitcher hand, base/out state, and pitcher groundball/flyball tendencies. We remain focused on the value contributed to the team in the player’s specific context.

While I fully admit that most of those adjustments I use are fairly trivial, “we remain focused on the value…” What a bunch of “Madison Avenue” BS! I can’t believe that those words actually came out of this guy’s mouth (or keyboard, I guess). I am NOT in a competition with these guys, so I don’t really care, but that statement is incredibly petty.

* Park adjustments are handled differently- I believe UZR applies blanket adjustment across all buckets, while Plus/Minus has park factors in form of more precise buckets.

That is pretty much the case with UZR. I use a blanket park adjustment for LF, CF, and RF and one park adjustment for the entire IF.

If they use park adjustments for each of hundreds of thousands of buckets, that is really problematic, I think. Talk about sample size problems. I have sample size problems when I do the blanket adjustments. To quote my Grandmother, “You think you have sample size problems!”

A ball hit 395 feet to Vector 190 that stays in the park is only compared to all other balls hit 395 feet to Vector 190 that stay in the park.

I don’t know if he is referring to park adjustments or computing plus/minus. Again, if he is referring to park adjustments, that is WAY too small a sample to get any meaningful park adjustments. If he is referring to just the regular computations, is he implying that UZR uses balls that leave the park in its computations? That would be ridiculous!

If it leaves the park, it neither helps nor hurts the fielder.

Really?

Also, we added the “Manny Adjustment”, which removes fly balls hit unreachably high off a wall. We named the system after the Green Monster’s most notable victim, who went from being by far the worst left fielder in baseball before the adjustment to being only arguably the worst left fielder after the adjustment.

I am going to add that - basically eliminating balls off a wall that can’t be caught. Right now, those are somewhat incorporated into the park adjustments of course.

* Plus/Minus accommodates plays where the first baseman holds the runner and middle infielders are covering second on hit-and-run plays. UZR adjusts for all base/out states.

Correct. I don’t adjust for hit and runs though or when the runner is stealing (and presumably one or more middle infielder is out of position). I could though.

* The two systems apply the run values at different stages in the calculations. UZR applies runs right away, while we convert to Enhanced PM then apply the Run Factors.

I don’t know, but it probably matters little.

* Plus/Minus is a little more aggressive in awarding credit/penalty. An example: 100 balls in a ‘bucket’ (specified type, velocity, location), 30 fielded by the 2B, 20 by the 1B, 50 go through for singles. On a groundout to the second baseman, we give +50/(50+30) = 5/8 = +.625. UZR gives +50/100 = +.50. On a single through both fielders, Plus/Minus gives -30/80 = -.375 to the 2B, and -20/70 = -.29 to the 1B. UZR gives -30/100 = -.3 to the 2B, and -20/100 = -.2 to the 1B. You could make an argument for either method of accounting, but neither one is better than the other. The differences are the greatest at the middle infield positions, where overlap between fielders is the highest.

Don’t have the time to follow this, but I think there are differences in the way we compute credits and debits. We have discussed these differences before.

Fundamental Differences

* Runs Saved includes Bunt Runs Saved for corner infielders, pitcher fielding (Plus/Minus and holding runners), and catcher fielding (handling the pitching staff and the running game).

Actually, I probably never mentioned this before, but I do bunts for corner infielders separately (and then just add it in with the rest of the numbers). I started doing that 2 or 3 years ago. I don’t do pitcher and catcher defense, at least as part of traditional UZR. I could of course, although catcher defense incorporates lots of other things besides fielding batted balls of course.

* Runs Saved measures the extra impact of HR Saving Catches. Runs Saved will add other Defensive Misplay/Good Fielding Play runs in the future.

Yes, those HR saves and “other plays” are great things to include. However, the way they describe how they treat those things in FB II, if you can understand their explanations (I cannot), leaves a lot to be desired, IMO. Or maybe I just don’t understand how they are handled.

There are some large similarities, but the bottom line is we’re not measuring exactly the same pieces of the puzzle, and we’re accounting for them differently.

That is true, although I think you can truthfully say, “The bottom line is that we are measuring essentially the same thing, and although we are accounting for them with some differences, the accounting is essentially the same as well.”

I have some more (critical) comments on this next post by Dewan:

“Over time, we have all developed a feel for what baseball data means. For example, looking for a player with a long career I randomly picked Juan Pierre flipping through my Bill James Handbook. In 2004 he hit .326 for the Marlins. One year later with the same team, he hit exactly 50 points lower (.276). With the wisdom of hindsight, but even at the time, we know his real ability is somewhere in between.”

Although his point is well-taken, we do NOT know his real ability. We can estimate it and we can make some inferences pertaining to it, but by no means, and in no way, shape or form, do we KNOW that his real ability is somewhere in between the two numbers. I don’t know why he makes a statement which is obviously not true.

So it is, for the most part, with our plus/minus numbers. But it can still vary from year to year and a player’s true ability generally lies between the fluctuations.

That is basically true, although I am not thrilled with his choices of words.

Another example: if a player has a plus/minus of +3 after five games, he has played well in those five games.

No, no, no and no! That might be true if we had a perfect measuring device. But we don’t and it is just as likely (or at least somewhat as likely) that he did NOT play well and that we measured that performance poorly.

It’s like going, say, 10-for-20 in those five games. There’s no question that he played well.

Well, yes and no. There is no question, according to the way that performance is categorized (hits and outs). But if your definition of “playing well” or “playing poorly” is based on how hard you hit the ball, for example, then his statement is not true of course, but I’ll give him the benefit of the doubt.

But the sample size is small and, in that limited timeframe, provides only a minuscule amount of insight into the player’s true ability.

No problem there.

Like other numbers in baseball, a small sample size tells you what a player is doing, but the larger the sample size gets, the more you know about what he is really capable of doing.

Again, nothing terribly wrong here, although I don’t like his choices of words. Again, he forgets about measurement error which is important in defensive evaluation, because we don’t have those neat convenient little buckets, like hits and outs, that we do for offense. We only have, for example, “a hard ground ball hit to the 130 degree vector.” And if anyone thinks that the recording of that piece of data, and thus the “measurement” we need to compute UZR or plus/minus, is perfect, or even close to perfect, well, I have news for you…

#4 Colin Wyers 2009/07/21 (Tue) @ 11:03

I did a little study on year-to-year correlation for UZR recently:

http://www.hardballtimes.com/main/blog_article/how-reliable-is-uzr/

By far I would say that the biggest element to the “unreliability” of UZR compared to offense is the vastly smaller number of chances a typical player gets on defense relative to offense.

And as a quasi-impartial observer here, I just have to say that I was rather taken aback that on pretty much every difference between UZR and Plus/Minus, I would side with UZR. I particularly think that you shouldn’t do fewer years of baseline data AND smaller buckets of hit locations.

#5 dan 2009/07/21 (Tue) @ 11:24

I loved reading this discussion (if you can call it that), but somebody needs to get MGL a coke. The man’s fingers must be exhausted.

#6 MGL 2009/07/21 (Tue) @ 11:34

Dan, thanks (I think). I just got done with 5 straight days of golf tournaments, so it was a welcome relief to sit down in front of a computer and only have to use my fingers. Not that playing golf is the most energetic of sports (some people think it is a pastime and not a sport)...

#7 dkappelman 2009/07/21 (Tue) @ 17:13

Just wanted to second what Dan said. Really insightful response to the Dewan Q&A and I’m going to put a link to it in the FanGraphs glossary.

#8 Nick 2009/07/21 (Tue) @ 19:19

Holy crap MGL! Great info.

#9 Dan Brooks 2009/07/21 (Tue) @ 20:53

Can we repost this at SoSH?

#10 BenJ 2009/07/22 (Wed) @ 01:50

Dan Brooks (#9)- Frisbetarian (Chuck) asked that everyone hold off with follow-ups until John gets a chance to respond to the rest of the questions. I say give him some time.

I’m not going to begin to respond to everything MGL said. I wish I had time to. A couple quick points-

The chat had like 2 or 3 answers (with a lot of good questions) and ended 5 days ago? Anyone know what happened?

See Post #1.

“If it leaves the park, it neither helps nor hurts the fielder.”

Really?

Yes… That’s one of the benefits of using smaller “buckets”. Maybe we’ll have to have a long park-effects discussion sometime.

While I fully admit that most of those adjustments I use are fairly trivial, “we remain focused on the value…” What a bunch of “Madison Avenue” BS! I can’t believe that those words actually came out of this guy’s mouth (or keyboard, I guess). I am NOT in a competition with these guys, so I don’t really care, but that statement is incredibly petty.

MGL, I think you’re grossly misinterpreting the statement. There’s the ongoing issue of measuring ‘value’ vs. ‘skill’. A lot of the adjustments you make (GB/FB ratios of pitchers, the way you credit/debit plays, etc.) serve to isolate the fielder’s ‘skill’, while Plus/Minus goes more for ‘value’. Hence the above statement. We’ve confirmed this statement with our internal comparisons similar to what Colin did with UZR, but with our numbers too.

About the sample sizes: it’s a trade-off. Say we have Bucket A and Bucket B. Bucket A is closer to the leftfielder, so Bucket B is a tougher play. Is it possible some balls from Bucket A are placed in Bucket B because of measurement error? Sure. There could also be balls from Bucket C in Bucket B. However, the average ball in Bucket B is still a tougher play than a ball in Bucket A. Because Plus/Minus doesn’t subdivide by as many variables, our sample sizes are large enough for this to hold up.

We could throw Buckets A, B, and C together and treat them all the same. Plus/Minus trades smaller samples for increased ‘precision’, for lack of a better word. I wouldn’t say that one way is right and the other isn’t; and it doesn’t make a huge difference either way.

#11 Tangotiger 2009/07/22 (Wed) @ 01:55

“If it leaves the park, it neither helps nor hurts the fielder.”

Really?

Yes… That’s one of the benefits of using smaller “buckets”. Maybe we’ll have to have a long park-effects discussion sometime.

I think MGL was being totally sarcastic there. You may as well have said: “If the pitcher walks a batter, it neither helps nor hurts the fielder.”

(Notwithstanding the “just over the eight-foot fence” HR).

#12 Tangotiger 2009/07/22 (Wed) @ 02:01

Re: the size of the buckets.

MGL is right, insofar as he echoes what I’ve been saying: you must use a continuous function.

For example, say you have five sequentially adjacent zones, with the following out rates:
95%, 85%, 65%, 75%, 55%

In no way can that third and fourth zone be accurate. If you make your buckets too small, then you will get these kinds of baseline out rates. If you make them too big, you lose important information (notably the balls hit at the edges of the zones).

So, you need to make sure that the out rates follow some function.
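
One simple stand-in for "follow some function" is to force the out rates to be non-increasing across the adjacent zones, pooling any neighbors that violate the order. The sketch below uses the pool-adjacent-violators method for that. This is a swapped-in technique for illustration: it gives a monotone step function rather than a continuous one, it assumes the zones are ordered from easiest to hardest, and it is not a description of what UZR or Plus/Minus does.

```python
def pool_adjacent_violators(rates, weights=None):
    """Closest (weighted least-squares) non-increasing sequence to `rates`."""
    if weights is None:
        weights = [1.0] * len(rates)
    # Each block holds [weighted mean, total weight, number of zones pooled].
    blocks = []
    for rate, weight in zip(rates, weights):
        blocks.append([rate, weight, 1])
        # Merge while the newest block's rate exceeds the one before it.
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / total, total, n1 + n2])
    smoothed = []
    for mean, _, count in blocks:
        smoothed.extend([mean] * count)
    return smoothed


# Out rates (in percent) for the five adjacent zones from the example.
observed = [95, 85, 65, 75, 55]
print(pool_adjacent_violators(observed))
```

Passing the number of balls in each zone as `weights` would let well-sampled zones pull the pooled rate toward themselves, which matters when small buckets have very uneven counts.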

***

As for value v skill, I don’t think that Ben’s statements are accurate, and again, I agree with MGL’s characterization.

If Barry Zito or Derek Lowe is pitching, and you make the adjustment based on their identities (or not), this is not a choice of skill v value. It is purely one of value.

The discussion of skill ONLY comes into play when you apply regression or Bayes. Otherwise, it’s always value.

#13 Dan Brooks 2009/07/22 (Wed) @ 02:12

Ben-

Yeah, I didn’t mean repost it now, just repost it when you guys are finished writing replies. =)

#14 MGL 2009/07/22 (Wed) @ 02:47

Yes, I was being sarcastic with the “really” comment. Of course I don’t include balls that leave the park in the data!

Ditto what Tango says about the buckets and sample size issues.

Also ditto what he says about the adjustments and value versus talent. To say that there is something wrong with any of those adjustments or even that there is some justification for not doing them is just not right. I fully admit that they are not a big deal and add very little overall, but to say that not using them is somehow better in terms of value or talent is disingenuous. For example, I use an adjustment for the G/F ratio of the pitchers. The reason is that I have found that even if two ground balls are recorded as medium, the one from a high ground ball pitcher tends to be slightly slower (and therefore easier to field) than one from a flyball pitcher. So that kind of adjustment is correct from a value as well as a talent perspective. It simply allows us to identify the speed of batted balls (flies and grounders) with more precision than the stringers provide (easy, hard, and medium). Another example is the bases and outs state. By adjusting for all bases and outs states, it allows us to proxy the position of the fielders (through no choice of their own). How can you say that NOT doing that kind of adjustment is better than doing it, from any perspective? That is simply not possible, assuming that the adjustments are done properly. Of course we want to know the position of the fielders or some approximation, if that positioning is NOT a choice by the fielders. It allows us to do the computations with more precision.

Again, I am not saying that these adjustments are particularly important or necessary. What I AM saying is that for them to say that NOT doing them is better is just plain wrong. If you want to say, “We don’t do them and we don’t think it takes away much at all from the eventual results,” then fine. But to say, “We don’t do it and we think not doing it is BETTER,” is just plain crap, to be blunt.

And again, yes, the sample size is in fact a trade-off. And honestly, I don’t know the exact number of buckets they use and what the numbers look like for each bucket, but I will re-iterate that if there are too many buckets and the samples in each bucket are fewer than, say, several hundred balls, there is potentially a problem. If they are saying (and I don’t know that they are) that using lots of buckets without a smoothing function is definitely the way to go, then I am skeptical, to say the least. And I can use any size buckets I want just as well as they do. I choose to use X number of buckets and my choice is fairly arbitrary. I really have no idea whether their buckets are “better” than mine and I think they have no idea either.
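
To put a rough number on the bucket-size trade-off, here is a minimal sketch using the plain binomial standard error of an out rate. It is only an illustration: the function name and the sample sizes are made up, and neither system necessarily computes anything this way.

```python
import math


def out_rate_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed out rate p over n balls in a bucket."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical bucket sizes, all with a 50% out rate (the noisiest case).
for n in (50, 400, 2000):
    print(n, round(out_rate_standard_error(0.5, n), 3))
```

With only 50 balls in a bucket the expected out rate is uncertain by several percentage points, which is the kind of noise a smoothing function or bigger buckets would reduce.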

And BTW, as John mentions in the blog, and as I have mentioned before, the idea for UZR and the acronym was NOT my idea. It was his. And I appreciate him not assuming that I “ripped it off.” I have given him credit (actually all I knew was that it was STATS - I did not know it was Dewan) for the idea and for the name many times over the years. Had he trademarked the name, of course, I could not use it. Not that I WANTED to use that name necessarily. It just happened that I started computing “UZR” using my own methodology about 15 years ago, and the name kind of grew a life of its own…

#15 BenJ (see all posts) 2009/07/22 (Wed) @ 02:59

Tango/12-

I think there’s more to the skill vs. value issue than just regression.

An example: Plus/Minus likes the Jays middle infield a lot more than UZR. I suspect part of the reason is the GB/FB adjustment, since Toronto has a GB staff (top 10 in GB%).

Plus/Minus gives Hill and Scutaro credit for fielding the plays they had a chance to field. If I understand UZR’s adjustments, Hill and Scutaro would be discounted because they played behind a GB staff.

In that sense, UZR gives us a more staff-neutral evaluation (what I was calling “skill”). However, if we want to know how much Hill and Scutaro helped in that specific context (what I referred to as “value”), we don’t want to make the adjustment.

I was thinking of it like WPA vs. wOBA, to use a more extreme distinction. WPA is more context-dependent, while wOBA isn’t. WPA would correlate better with wins (obviously), but wOBA would correlate better year-to-year. It really depends on what you’re trying to evaluate.

Does that make more sense?

#16 BenJ (see all posts) 2009/07/22 (Wed) @ 03:07

MGL/14, didn’t see your post before I submitted mine in #15. I think we’re on the same page there.

Lastly, I don’t think anyone ever claimed that one system is superior than the other. In fact, we were pretty careful to say the opposite.

#17 Tangotiger (see all posts) 2009/07/22 (Wed) @ 03:29

However, if we want to know how much Hill and Scutaro helped in that specific context (what I referred to as “value”), we don’t want to make the adjustment.

You are wrong here.

This is like saying that if Derek Jeter faces Johan Santana, CC, Roy Halladay, Tim Lincecum back-to-back-to-back-to-back that we don’t want to make the adjustment to get at Jeter’s “value”.

Or if Jeter plays at Coors for a week.

Or any context you want to bring him.

The fact of the matter is that if it’s easier for Scutaro/Hill to field a GB because they are behind a GB staff, you MUST make the adjustment from a value perspective.

Otherwise, what you are capturing is not Scutaro’s value, but Scutaro + Scutaro’s context, of which some, a little, or a lot could be traced back to Scutaro. The adjustment is required so that we are aware of the context.

That is value. If you want to define value as you seem to be doing, then you need to create your label as “Scutaro + context”, because that’s what you are measuring.

#18 BenJ (see all posts) 2009/07/22 (Wed) @ 03:44

MGL/14:
The reason is that I have found that even if two ground balls are recorded as medium, the one from a high ground ball pitcher tends to be slightly slower (and therefore easier to field) than one from a flyball pitcher.

Tango/17:
The fact of the matter is that if it’s easier for Scutaro/Hill to field a GB because they are behind a GB staff, you MUST make the adjustment from a value perspective.

That’s a different issue than what I was talking about. You’re not saying that they need an adjustment because they get more GB, but because the GB they get are easier to field? Even after using the hard/medium/soft distinctions? I can’t imagine the difference is very big, but maybe it’s worth adjusting for all the same. I guess I can’t speak to that, as I haven’t seen a study that backs that up. Can you point me to one?

#19 Tangotiger (see all posts) 2009/07/22 (Wed) @ 03:52

Sure, MGL showed it in part 2 of his UZR primer. IIRC, it’s something like .02 more outs per play from a GB-heavy pitcher than not. Not a big deal, considering there are not a lot of GB-heavy pitchers. But, it’s there.

#20 dan (see all posts) 2009/07/22 (Wed) @ 04:14

MGL—

Yes, my comment above (#5) should be taken as a positive.

#21 MGL (see all posts) 2009/07/22 (Wed) @ 07:12

Yes, agree with Tango 100% on the value/talent issue with respect to the Toronto infielders. Plus/minus and UZR present numbers that indicate the “value” of a fielder as compared to an average fielder fielding the exact same balls. If the G/F tendencies of the pitchers behind those fielders affect the speed of those ground balls, then that needs to be accounted for regardless of whether you are going to call your metric one that reflects value or talent. Same for bases/outs state. If a certain middle infielder only plays with a runner on first and 0 or one out, if you want to compare him to all other fielders (the essence of plus/minus or UZR), you have to compare him to all other fielders with a runner on first base and 0 or 1 out, to reflect the fact that that middle infielder is playing closer to the second base bag and a little shallower than he would with 2 outs or no runner on first base.

If plus/minus did not want to “normalize” each batted ball opportunity, they would not use any park adjustments. Of course they want to “normalize” every batted ball opportunity because they don’t want to provide misleading numbers.

Tango’s example about Jeter, or any other batter who would happen to face the best pitchers in the league, is a great one. NO OFFENSIVE METRIC would choose to present data that says, for example, “Jeter had an OPS of .586,” and then fail to tell you that, “Oh yeah, and all he faced were Santana, Lincecum, and Haren.”

Again, the whole point of plus minus and UZR is to tell you how many runs a fielder cost or saved his team, as compared to an average fielder who would have had the exact same batted ball opportunities, including the speed of the ball, the location of the ball, the position of the fielder, if it was not of his own volition (e.g., it was because of runners/outs or the characteristics of the stadium), the lighting in the stadium, the temperature, the altitude, etc.

My adjustments are done to include as much of these influences as possible so that we are REALLY comparing a particular fielder’s batted ball opportunities to that of the average fielder with exactly the same batted ball opportunities. It has NOTHING whatsoever to do with the value versus talent argument.

In defense of BenJ, I think that when he was referring to the Toronto fielder discrepancies, and the fact that TOR has a heavy GB pitching staff, he meant that the number of GB received was influencing the numbers. That may very well be the case (that each system would have a very different number) if, for whatever reasons, we handle ground balls to the middle infielders differently.

“Lastly, I don’t think anyone ever claimed that one system is superior than the other. In fact, we were pretty careful to say the opposite.”

I agree Ben. I was overly harsh with my criticisms, and I did misinterpret what you said about the “We remain focused on the value contributed to the team in the player’s specific context” statement. I definitely read that the wrong way and I take back my criticism of it. I do however, stick to my assertion that my adjustments have nothing to do with the “value/talent” issue, but you were clearly not saying that your metric is a better one for lack of doing those adjustments (I don’t think).

I have also never said that one is better than the other. One may be better than the other, but I have no idea or even a guess, which one it would be. Both are probably fine in their own ways. There are probably some things that each metric does better than the other and some things it does worse (or leaves out that the other includes). In fact, if I had to use a number for any given fielder, I would probably combine the two. If I actually knew all the nuts and bolts of plus/minus, and put the time into comparing it to UZR, I probably could have more of an idea as to which one is “better” but I don’t.

I appreciate all of BenJ’s responses above (and below).

#22 .(JavaScript must be enabled to view this email address) (see all posts) 2009/07/22 (Wed) @ 16:19

Someone needs to get on the ball and make a set of interactive graphs for fielding data spanning multiple years.

You put the batted ball locations for each ball around the player in question, color code for out/safe, make a nice 90% play ring, 50% play ring and 10% play ring (or whatever, selectable by the person), and be able to skip from year to year to see patterns.

Fangraphs…?

I also think that UZR should have a ‘confidence’ component which can be determined from past performance coupled with fan scouting reports. If the fans say someone dropped off a cliff defensively, and the numbers do too, then you give the current year slightly more weight. Otherwise you just weight 3 years (or whatever seems appropriate) to average out fluctuations.

#23 .(JavaScript must be enabled to view this email address) (see all posts) 2009/07/23 (Thu) @ 00:16

MGL/21- thanks for the kind words. I hope we can continue to have discussions like this, on or off line. Unfortunately most decent fielding systems are so complicated (and/or proprietary) that few people know enough details to really compare the gory details.

Sal/22- the trick is that every fly ball is different. You’ll have vastly different zones for liners, flies, fliners, or pop-ups. You could do 3-4 different plots, but then you may not have enough data to see anything significant.

For The Fielding Bible Vol. 2, we played around with a number of ideas to visually display an outfielder’s defense. In the end, we assembled color-coded plots based on the difficulty of each play, which gives you some of the picture. It’s easy to start picking out guys who play especially shallow (McLouth, for one) or deep. Check those out if you haven’t seen them, and let me know what you think of them. ESPN and SI have liked the plots when I ran them for articles they published.

#24 .(JavaScript must be enabled to view this email address) (see all posts) 2009/07/23 (Thu) @ 02:15

Dan B just linked me to this discussion, and I wanted to update you all on the status of the chat. John is traveling (baseball road trip with his son), and is posting responses whenever he gets a chance. A few new answers went up this morning, and, over time, I would expect all the questions to be answered. His responses have all been tremendous, as have Ben’s contributions.

Great stuff here, guys (I almost called you pendejos, but most of you don’t know me and might not realize I was using it as a compliment). I’ll try to get in on the conversation when the SoSH auction and Softball bash, which are taking up all of my time these days, are over.

Mgl - you should try to make it here this weekend and have an El Tesoro with me. A few of the other pendejos (there it is) from last summer’s class will be there. It’s very funny watching guys who criticize the game and players so readily actually try to play.

#25 Nick (see all posts) 2009/07/23 (Thu) @ 17:22

So MGL, if you had to guess, what would be the standard error on a sample of UZR?

Let’s say you have Brendan Ryan, who has put up a career UZR/150 of 14.5 in 833.1 innings. What is the potential variance of his true ability based off of that sample, not including information like scouting reports or anything?

#26 MGL (see all posts) 2009/07/23 (Thu) @ 20:34

Nick, first you have to regress in order to get the mean of your estimate. So, if a player was 14.5/150 games in one season, that might be +7 after regressing. A WAG on the standard error might be 4 runs.

#27 Nick (see all posts) 2009/07/23 (Thu) @ 20:52

Cool, thanks a lot. Assuming your WAG is close, that’s a lot smaller than I expected.

#28 MGL (see all posts) 2009/07/24 (Fri) @ 06:02

Well, once you regress, it’s not that small. If it were +14.5 plus or minus 8 runs (at 2 sigma) that might be small. But for a player who is +14.5 for one season and then gets regressed to +7 (estimate of his true talent), we’re only 95% certain that his true talent defensive value is between -1 and +15 runs (per 150). That is a pretty wide bar.

Again, just a WAG. You’d pretty much have to do a “with and without you” type of analysis to estimate how accurate UZR or a similar metric was (in order to get a standard error of the true talent estimate). Year to year correlations only give us reliability, not accuracy.
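
The arithmetic in #26 and #28 can be sketched as below. The 50% regression and the 4-run standard error are MGL's rough guesses from this thread, not measured values, and the function names are invented for the illustration.

```python
def regress_to_zero(observed: float, regression: float = 0.5) -> float:
    """Shrink an observed UZR/150 toward a league-average mean of zero."""
    return (1.0 - regression) * observed


def two_sigma_interval(estimate: float, standard_error: float) -> tuple:
    """Roughly 95% interval: estimate plus or minus two standard errors."""
    return (estimate - 2.0 * standard_error, estimate + 2.0 * standard_error)


estimate = regress_to_zero(14.5)               # 7.25, the "+7" in the comment
low, high = two_sigma_interval(estimate, 4.0)  # (-0.75, 15.25), i.e. about -1 to +15
print(estimate, low, high)
```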

#29 MGL (see all posts) 2009/07/24 (Fri) @ 06:07

Tango or someone did a nice “WOWY” for offense which “proved” that lwts is “real.” I forgot exactly how the research went, but it was something like looking at all players who averaged +10 in offensive lwts (say, per 150) when they were in the game and when they were not, and he/they (the researcher) found that indeed the team scored 10 extra runs per 150 games when they were in the lineup versus when they were not (or more than 10 runs if the replacement was a below-average player).

I would love to see someone do the same thing with UZR or plus minus. A WOWY on defense to validate these metrics. That is one of the criticisms of these metrics - “How do we know that they work?” If someone were to do that kind of a study using UZR or plus minus, we would know whether those metrics “worked” or not.

#30 .(JavaScript must be enabled to view this email address) (see all posts) 2009/07/24 (Fri) @ 06:24

the trick is that every fly ball is different. You’ll have vastly different zones for liners, flies, fliners, or pop-ups. You could do 3-4 different plots, but then you may not have enough data to see anything significant.

Circles for flies.
Triangles for line drives.
Squares for ground balls.
“X”s for fliners.

Then just make it red for out, green for hit. Allow people to draw their 50% rings or whatnot wherever they want, and move forward/backward by year. Allow them to have check boxes to eliminate certain batted ball types. Interactivity will let people explore the data to reach their own conclusions. Small sample size may be a factor, but let people at least try—who knows what someone will find?

#31 Colin Wyers (see all posts) 2009/07/24 (Fri) @ 10:24

I would love to see someone do the same thing with UZR or plus minus. A WOWY on defense to validate these metrics. That is one of the criticisms of these metrics - “How do we know that they work?” If someone were to do that kind of a study using UZR or plus minus, we would know whether those metrics “worked” or not.

That’s brilliant. If anyone from BIS would be willing to submit Plus/Minus numbers, I’d be happy to do it. (There’s also PMR, and hopefully one could include STATS UZR - I already have BIS UZR in database form.)

#32 MGL (see all posts) 2009/07/24 (Fri) @ 13:00

Colin, I am not exactly sure if you want to use regressed values to compare to the WOWY runs allowed or non-regressed. That is a tricky one. I’d have to think about it a bit. Maybe we can get some input from Tango and others.

IOW, let’s say we want to look at all players with at least a +5 per 150 in 5 years of data, min of 500 games. And let’s say that the average UZR/150 among those players is +9. Should we expect the WOWY analysis to yield 9 saved runs per 150 per player? Or, should we first regress each player’s UZR so that the average regressed UZR/150 of that group might be +6, and we expect the WOWY to yield 6 runs fewer allowed per player per 150?

Certainly, if we go forward, we want to use regressed numbers. IOW, if we have a bunch of players who played in one season, min of 100 games, and averaged +10 UZR/150 and then we did a WOWY for subsequent years, we would expect to find only 5 runs saved per player per 150 games.

But again, if we look at the same time period as the UZR (say, that one year only), I am not sure if we would expect to find +10, +5, or somewhere in between.

If someone says, +10, because that is what UZR says they ACTUALLY saved, keep in mind that ONE of the reasons we regress is not only to account for better or worse than average seasons (that are better or worse than a player’s true talent), but also because of measurement error. When a player is +10 in UZR for one season, not only did he likely perform better than his true talent for that one season, we also likely measured that one season incorrectly (for example, some of the balls we thought were “hard to field” according to the data were actually not hard to field). So when a player is measured at +10 for one season, the most likely scenario is that he is a true +5, but that he actually performed as a +7 for that season - 3 “extra” runs is due to measurement error and 2 extra runs is due to him having somewhat of a fluke good year.

If people don’t understand this concept, it would be like we had a bunch of stopwatches and a small percentage of them did not function properly (that is like the UZR - or any metric for that matter - methodology - it is by no means perfect - for one thing, the data is far from perfect). Now we time a person once running the 100 meter dash. The stopwatch says 10.5 seconds (I think that is really fast for an average person). Two things are likely: One, the person is pretty fast, but not THAT fast (10.5 seconds) - he happened to run really fast that one time, or had a strong wind at his back, etc. Two, we have more than a random chance that we were using a faulty stopwatch.

Anyway, Colin, if you do this for, say BIS UZR, or any other defensive metric, you might want to do it two ways: One, compute UZR for 3 groups of players - average, above average and below average, during some time period, say, 3 years. Compare that to their WOWY for that time period. Then compare that for a WOWY for another time period, say the 3 years before or after. So it might look like this:

Good group UZR 03-05: Average of +6 per 150
Average group UZR 03-05: zero per 150
Bad group, 03-05: -6 per 150

06-08 WOWY

Good group, maybe +4
Average group, zero
Bad group, maybe -4

Depends on the level of their replacements of course.

03-05 WOWY (the same time period of their UZR)

Good group, maybe +5
Average group, zero
Bad group, maybe -5

Then also show their 06-08 UZR.

You could also do it backwards of course. First break the groups up according to their 06-08 UZR and then look at each group’s 03-05 UZR and their WOWY for 03-05 and 06-08. Of course, aging will play a role in this too. We expect players to have their true talent UZR (and WOWY, presumably) decline with age, starting at almost any age, like triples do.
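
The study design in #32 could be sketched roughly as follows. Everything here is hypothetical: the field names (`uzr_0305`, `wowy_0305`, `wowy_0608`), the +/-3 cutoff for the groups, and the four sample players are invented just to show the grouping step, not taken from any real data.

```python
from statistics import mean


def label(uzr_per_150: float, cutoff: float = 3.0) -> str:
    """Assign a player to the good, average, or bad group by UZR/150."""
    if uzr_per_150 >= cutoff:
        return "good"
    if uzr_per_150 <= -cutoff:
        return "bad"
    return "average"


def group_averages(players: list, cutoff: float = 3.0) -> dict:
    """Average each group's UZR and its WOWY in the same and the following period."""
    groups = {}
    for player in players:
        groups.setdefault(label(player["uzr_0305"], cutoff), []).append(player)
    keys = ("uzr_0305", "wowy_0305", "wowy_0608")
    return {
        name: {key: mean([member[key] for member in members]) for key in keys}
        for name, members in groups.items()
    }


# Made-up players, runs per 150 games.
players = [
    {"uzr_0305": 8, "wowy_0305": 6, "wowy_0608": 5},
    {"uzr_0305": 4, "wowy_0305": 4, "wowy_0608": 3},
    {"uzr_0305": 0, "wowy_0305": 0, "wowy_0608": 0},
    {"uzr_0305": -6, "wowy_0305": -5, "wowy_0608": -4},
]
print(group_averages(players))
```

A real version would also have to account for the quality of each player's replacements and for aging, as noted above.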

#33 Nick (see all posts) 2009/07/27 (Mon) @ 14:53

Also, MGL, what is the deal with overlapping zones in the outfield? If you are a center fielder playing next to Carl Crawford, would your UZR be higher or lower as a result, or would it not make much of a difference?

#34 MGL (see all posts) 2009/07/28 (Tue) @ 00:33

I don’t think it will make much difference, at least the way that I do it, but I am not sure. In “bins” where most balls are caught, it doesn’t matter who catches the ball - no one gets penalized or credited much at all. In bins where few catches are made, I just don’t think that more than one player can make a catch, so I am not really sure how whether one plays next to a good or bad fielder affects one’s UZR. Are there really that many balls that more than one player can make a good or great catch on? And if there is, how often does fielder A make a great catch and take a catch away from fielder B? I don’t think that happens very often at all. It is true that if a ball drops in, all fielders who have any catch percentage in that bin get docked, so if fielder A is bad, it is true that all other fielders who ever make catches in that bin will be unfairly docked a little more than they should, but again, if fielder A makes few catches in that bin anyway, his skill won’t affect many balls in that bin, and if he makes lots of catches in that bin, the other fielders won’t make many catches in that bin, and they won’t get docked very much when a ball drops in.

If someone can give me a scenario where it matters, I will be happy to address it.

Now, if it is true that a fielder would play in a substantially different position due to the skill of an adjacent fielder, then that would probably affect the numbers, but I don’t know if that is true or not, and I don’t know what to do about that off the top of my head…

#35 puck (see all posts) 2009/07/28 (Tue) @ 01:05

How does UZR handle zones for a large park like Coors—are the outermost OF zones “simply” expanded outward to account for the added OF dimensions? Then I assume the pf is applied. (Are the Coors OF pf’s still .93, .91, .91 left to right, as they were in the 2003 Primer article? Did the infield pf change post-humidor [was .97]?)

#36 BenJ (see all posts) 2009/07/30 (Thu) @ 20:43

John has managed to plow through the remainder of the questions over at SOSH. If you’re interested in the rest of the responses, go check it out.

#37 birdo (see all posts) 2009/09/12 (Sat) @ 04:05

MGL - question in regards to your 50% regression comment in #2 - are the UZR values on fangraphs basically the average between the 1-year UZR output and 0? Or are you saying that if you are looking at a rookie’s UZR over his first full season, you should regress 50% to 0 to estimate his fielding talent level?

Thanks.

#38 MGL (see all posts) 2009/09/12 (Sat) @ 06:53

The numbers on Fangraphs are sample measurements, I assume. So yes, if you want to estimate a player’s true defensive talent and you have nothing else to go by, then yes, you would regress one year stats by around 50%. And yes, if you know nothing about the player, you would regress towards zero. However, if you know his age, you would regress towards another number. Figure maybe 1 run per year worse below and above 27 years of age. Sort of like offense. Plus, if you know something about his speed, you would definitely regress towards a number other than zero, especially in the OF. Take a look at my article on “speed and UZR”. I think it is in the THT archives. The difference between a fast and slow player in the OF, even if you know nothing else about him, is probably like 6 runs or more. So if a player is faster than the average player at his position, you can probably regress towards +2. If he is slower, -2. That does not hold true for 3B. And I think it holds a little less true for SS and 1B but I don’t remember off the top of my head. You can probably even use a young player’s offense as a proxy for what the teams and scouts think, and you can then regress toward that number. For example, if an OF’er is lousy offensively, he is probably regarded as very good defensively, so you can probably regress his UZR toward something like +5 (maybe that is too high, I don’t know). And if a player is very good offensively (as compared to others at his position), it is probably true that his average UZR is less than 0.
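
As a small worked example of regressing toward something other than zero, the sketch below uses the rough figures from this comment (about 50% regression for one season, +2 or -2 as the target for a fast or slow outfielder). The function name and the +10 sample season are made up for illustration.

```python
def regress_toward(observed: float, prior_mean: float = 0.0,
                   regression: float = 0.5) -> float:
    """Blend an observed UZR/150 with the mean of the group the player belongs to."""
    return regression * prior_mean + (1.0 - regression) * observed


observed = 10.0  # hypothetical one-season UZR/150 for an outfielder
print(regress_toward(observed))                   # nothing known: toward 0
print(regress_toward(observed, prior_mean=2.0))   # known to be fast: toward +2
print(regress_toward(observed, prior_mean=-2.0))  # known to be slow: toward -2
```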

#39 MGL (see all posts) 2009/09/12 (Sat) @ 09:39

“You can probably even use a young player’s offense as a proxy for what the teams and scouts think, and you can then regress toward that number.”

That sentence should read, “...teams and scouts think of his defense...”

IOW, if a player with poor offense for his position is playing, it is likely that he is only playing because he is an above average defender. That is especially true if it is obvious that he is a poor offensive player, like someone with little power and not great on base skills.

I’ve never done this before, but I would guess that if we split every position up into two groups, the top half and bottom half by offensive numbers, we would find that the bottom group were a few runs better in defense than the top group.

#40 Guy (see all posts) 2010/06/19 (Sat) @ 20:17

I noticed that Dewan put out the top 8 fielders in plus/minus. As you’d expect given that selection, UZR has these players lower. But the consistency and size of the gaps is striking. UZR is rating these players as saving about 1/3 as many runs on average, and in no case more than 50%. All of these P/M players have saved 10+ runs, while UZR on Fangraphs has only 1 player above 7 (and that’s Crawford, who doesn’t make the top 8 in P/M!).

Two thoughts:
1) Is Fangraphs perhaps regressing the UZR numbers, intentionally or unintentionally? I know UZR has a bit less variance, but this is an enormous difference.
2) The fact that two respected systems, using IDENTICAL data, can get results this different indicates that the “engines” are doing a lot of work here, and the decisions made in constructing those engines have big consequences for the ratings.

P/M / UZR
Zobrist 15 / 4
Jackson 15 / 2
Ramirez 12 / 3
Bourn 11 / 6
Zimmerman 11 / 6
Utley 11 / 5
Cano 10 / 2

#41 Colin Wyers (see all posts) 2010/06/19 (Sat) @ 20:42

It’s evidence of a colossal screw-up, is what it is. DRS sums to 410 or 415 for all players so far this season (depending on if you check Fangraphs or Baseball Reference), which is about 410 to 415 too much. I’ve mentioned this to Dave Cameron in the past.

#42 Colin Wyers (see all posts) 2010/06/19 (Sat) @ 20:58

Updated figures:

POS    DRS
1B    39
2B    30
3B    74
C    -17
CF    50
LF    -18
P    72
RF    50
SS    149
Total    429

The SS totals are really bizarre. Meanwhile, left field is really odd in the other direction. (Which is why you won’t see Crawford on the top ten DRS fielders - left fielders as a group are “below average” in DRS, apparently.)

#43 Rally (see all posts) 2010/06/19 (Sat) @ 21:59

Of course you’ll get bigger numbers for PM with that selection. What if you take the leaders for UZR, and compare those to PM (adjusting for league average, which looks like you’d have to subtract 5 runs from every SS)?

I’m sure if you do that, you’ll find bigger numbers for UZR. If you want a truer estimate of how much variance there is in one system compared to the other, try looking at the difference between best and worst fielder for each system. And of course, the best and worst in UZR may not be the same players as in PM.

SS UZR 8.4 DRS 22
2B UZR 15.5 DRS 24
3B UZR 11.8 DRS 18
CF UZR 22.7 DRS 31

OK, looking at it that way, much more variance in John Dewan’s data. Especially at shortstop.

#44 Tangotiger (see all posts) 2010/06/19 (Sat) @ 22:27

Colin is saying the sum of all the players is not 0, as it should be, but over +400 runs.

#45 .(JavaScript must be enabled to view this email address) (see all posts) 2010/06/19 (Sat) @ 23:12

So the average team is 14 runs above average defensively in Fangraphs DRS? And the average shortstop is 5 runs better than average?
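
The per-team figures in this question follow directly from Colin's table in #42, assuming the 30 MLB teams each account for an equal share. A quick check:

```python
# DRS totals by position, copied from comment #42.
drs_by_position = {
    "1B": 39, "2B": 30, "3B": 74, "C": -17, "CF": 50,
    "LF": -18, "P": 72, "RF": 50, "SS": 149,
}
TEAMS = 30  # number of MLB teams

total = sum(drs_by_position.values())
per_team = total / TEAMS
shortstop_per_team = drs_by_position["SS"] / TEAMS

print(total)                         # league total, which "should" be near zero
print(round(per_team, 1))            # runs above average per team
print(round(shortstop_per_team, 1))  # runs above average per team at shortstop
```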

#46 guy (see all posts) 2010/06/19 (Sat) @ 23:27

That’s a problem. But it doesn’t explain the difference in variance, as Rally showed.

#47 Colin Wyers (see all posts) 2010/06/20 (Sun) @ 00:04

Guy, I don’t know if that’s correct. It depends on what the source of the error is, isn’t it? There are at least two components to observed variance in DRS:

* Spread in fielding “performance” (skill plus luck plus bias)

* Playing time

If we were comparing variance in DRS among players with the same amount of playing time, you’re right, the “zeroing out” problem wouldn’t affect variance. But we aren’t necessarily comparing like to like when it comes to playing time, and that could artificially inflate the variance.

If I adjust DRS by taking (DRS/BIZ) for the league and subtract BIZ*(lgDRS/lgBIZ) from each fielder, I get results that seem more sane than what the raw numbers are on Fangraphs or Baseball Reference. I mean, without knowing what the problem is I don’t know if that’s the “right” way to address it but it seems to work.
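
The correction described in #47 could look something like the sketch below: subtract from each fielder his balls in zone (BIZ) times the league rate of DRS per BIZ. The three fielders and their numbers are invented, and in practice this would presumably be done separately for each position.

```python
def zero_out(fielders: list) -> dict:
    """fielders: list of (name, drs, biz) tuples; returns the adjusted DRS by name."""
    league_drs = sum(drs for _, drs, _ in fielders)
    league_biz = sum(biz for _, _, biz in fielders)
    league_rate = league_drs / league_biz  # league DRS per ball in zone
    return {name: drs - biz * league_rate for name, drs, biz in fielders}


# Hypothetical shortstops: (name, DRS, BIZ).
sample = [("A", 12, 200), ("B", 5, 100), ("C", -2, 100)]
adjusted = zero_out(sample)
print(adjusted)                # A: 4.5, B: 1.25, C: -5.75
print(sum(adjusted.values()))  # sums to zero, up to floating point error
```

Because the subtraction scales with BIZ, full-time players are pulled down more than part-timers, which is why playing time matters for the observed spread.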

#48 MGL (see all posts) 2010/06/20 (Sun) @ 00:05

If the systems were using different data sets, then you would expect that if you looked at the top players (any players actually) in one system that the other system would have numbers that would be equal to the first system somewhat regressed. So, if the top SS in one system were +10 and the proper regression (based on measurement error) for that time of the year were 50%, then the other system should have that SS at +5.

If they are using the same data set but different engines, you would expect the regression to be less, maybe a lot less. And of course the more similar the engines, the more the numbers will equal each other with no regression.

BTW, the UZR numbers that FG presents are not regressed at all.

#49 Colin Wyers (see all posts) 2010/06/20 (Sun) @ 00:13

Nope, I’m wrong. SS variance is the same once I “slope correct” based upon BIZ.

#50 Guy (see all posts) 2010/06/20 (Sun) @ 05:51

I compared UZR and DRS at three positions (among qualifiers). Since playing time for each player is identical by definition, that shouldn’t be a problem. The variance in DRS is clearly much bigger. The two ratings also diverge a lot at SS.

2B:
UZR SD: 4.1
DRS SD: 6.4
r = .79

SS:
UZR SD: 2.5
DRS SD: 5.3
r = .37 (!)

CF
UZR SD: 5.1
DRS SD: 6.9
r = .86

Big SS differences (UZR/DRS):
Ramirez 3/13
Escobar 2/18
Ryan 0/12
Gonzalez (Tor) -2/+10
Aybar -4/+9

#51 .(JavaScript must be enabled to view this email address) (see all posts) 2010/06/20 (Sun) @ 08:21

You are going to see much more agreement when the season is over. The numbers are apparently very sensitive to the methodologies in the short run.

#52 benj (see all posts) 2010/06/20 (Sun) @ 11:25

i’m late to the conversation, but wanted to add this.

plus/minus is now on a rolling one-year basis. As a result, it’s not guaranteed to zero out until the end of the season. according to plus/minus, defense was a lot weaker at the end of last year than at the beginning of this year; hence the skew. it’s particularly off at shortstop right now.

Colin is saying the sum of all the players is not 0, as it should be, but over +400 runs.

secondly, there are several components of runs saved that DON’T zero out, even at the end of the season. expecting the totals to add up to zero is a mistake. i’m extremely surprised this hasn’t come up before.

#53 Colin Wyers (see all posts) 2010/06/20 (Sun) @ 21:05

secondly, there are several components of runs saved that DONâ€™T zero out, even at the end of the season. expecting the totals to add up to zero is a mistake. iâ€™m extremely surprised this hasnâ€™t come up before.

Okay, so… I take it arm is one of those. So, uh, what do your arm ratings mean, then? A player’s throwing arm saves x arms over - the worst outfield throwing arm? An outfielder who makes no throws at all? If you’re not zeroing out, what’s the baseline? When we say that Matt Kemp has -4 arm runs, does that mean he’s really about -10 compared to average?

plus/minus is now on a rolling one-year basis. As a result, it’s not guaranteed to zero out until the end of the season. according to plus/minus, defense was a lot weaker at the end of last year than at the beginning of this year; hence the skew.

So let’s just look at rPM for a second, which is right now +296. Backing that out to plays made, that’s roughly 423 extra plays, compared to last season. (Actually it’s more than that, since this year is included in the baseline.)

Meanwhile, BABIP is .298 so far this season, compared to .299 last season. Where are these extra plays? Please, please tell me what’s more likely:

1) That the fielders this year are so, so much better than last year, and somehow the batted ball distribution got so, so much more difficult to field, to where it washes out and the rate of hits on BIP is exactly the same; or

2) You have a change in the way you’re observing BIP that is causing your rating of expected outs to be skewed.

I’m betting on two. It could be any number of things - the new stringers could be different than the old stringers in where they’re marking the LD/FB boundary, or how they’re marking the spray angle of GB. Maybe some teams moved camera positions in the offseason. Maybe this is just typical random error in scoring batted balls that will wash out over time, but because of how defensive stats have been presented up until now we’ve never noticed.

But based solely on the data I’ve seen so far, I think it’s in error to simply say “defense was a lot weaker at the end of last year than at the beginning of this year.” It presumes things that aren’t in evidence. (And if I’m wrong, please, show me the evidence.)
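
The back-of-envelope check in this comment can be sketched as below. The 0.7 runs per play is only what the quoted figures imply (296 runs to roughly 423 plays), and `HYPOTHETICAL_BIP` is a placeholder count of balls in play, not a sourced number. The sketch also ignores the wrinkle that this year is partly included in the rolling baseline.

```python
RUNS_PER_PLAY = 0.7        # assumption: implied by 296 runs ~ 423 plays
HYPOTHETICAL_BIP = 54_000  # placeholder for league balls in play so far

def runs_to_plays(runs, runs_per_play=RUNS_PER_PLAY):
    """Convert runs saved back into extra plays made."""
    return runs / runs_per_play

def implied_babip(babip_before, extra_plays, balls_in_play):
    """BABIP we would expect if `extra_plays` hits had become outs
    on the same mix of batted balls."""
    hits_before = babip_before * balls_in_play
    return (hits_before - extra_plays) / balls_in_play

extra = runs_to_plays(296)
print("extra plays:", round(extra))
print("implied BABIP:", round(implied_babip(0.299, extra, HYPOTHETICAL_BIP), 3))
```

With any plausible count of balls in play, that many extra plays should pull BABIP down noticeably from .299, which is why an observed .298 is hard to square with the +296.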

#54 Tangotiger (see all posts) 2010/06/20 (Sun) @ 21:12

secondly, there are several components of runs saved that DON’T zero out, even at the end of the season. expecting the totals to add up to zero is a mistake. i’m extremely surprised this hasn’t come up before.

I’m going to see your extremely surprised, and raise you a fabulously stupefied!

Seriously, when you say that all MLB defenders in 2010 are +400 runs above average, the presumption is that the MLB defenders of 2010 are +400 runs above the average MLB defender of 2010.

If you want to argue that the average MLB defender in 2010 is +400 runs above average over the last 162 games (meaning a big part of 2009), then it’s up to you to make that clear. In no way do I think anyone would understand that unless you say it.

Furthermore, why NOT zero out at the seasonal level? Exactly what is the point of saying a large percentage of SS are above average (in 2010, but compared to performance of all SS in last 162 games)? The substantial majority of people are going to expect that the average is the average for the given timeframe, and not compare 60 games in 2010 for each SS, to 162 games of all SS over the past 12 months.

Are you going to fold, call me, or raise?

#55 MGL (see all posts) 2010/06/21 (Mon) @ 09:15

As I’ve said before, the problem with zeroing everything out per season is that you lose (to some extent) the ability to gauge how defense at one or more positions may have gotten better or worse from one year to another and you lose the ability to accurately compare one season’s ratings to another season’s rating (although those things can actually be inferred using various methods). I talked about that in my UZR Primer on Fangraphs.

On the other hand, NOT zeroing out per season creates problems as well. For one thing, it is not readily evident how a particular fielder fares versus all other fielders at his position unless you know the league average numbers. For example, if Player A at SS were -4 and you knew that the totals were not zeroed out per season, you would know that this player was a below-average fielder compared to SOMETHING, but you would not know how he compares to the average SS THIS SEASON (again, unless you happened to know the league total), and you also would not know what that -4 is being compared to (the last 162 games, the last 4 years, etc.).

Also, most other metrics are zeroed out per season automatically. So for consistency’s sake, you probably want to do the same with defense. On the other hand, since you are dealing with such small samples of people on defense (e.g. all SS or all 2B, as opposed to all batters for offense or all pitchers for pitching), it is much more likely that you could have a very good or very bad crop of defenders at a particular position in a particular year. Again, you lose the ability to discern that when zeroing everything out per year.

The biggest argument for the zeroing out, in my opinion, is that it is likely that there are biases in the recording of the data that are unique to each year and that those biases change from one year to another. If you don’t zero everything out per season, you can easily THINK that overall defense at one or more positions has changed when it hasn’t. For example, if you find that all of a sudden the overall defense in 2010 is +300 runs, it is more likely that something has changed in the data than that players are that much better at defense in 2010. That is especially true if DER or BABIP has not changed considerably, as Colin points out. So not zeroing each season out can be very dangerous and lead to false conclusions.

Now, keep in mind that whether the defensive metric is UZR or Dewan’s P/M, when you do the computations, you are generally comparing a player’s catch rate in any one season (or whatever time period you are computing) to the league catch rate for that position and that “bucket” over several years (I use 6 years, I think), in order to get a decent sample size for your baseline. So there is a natural inclination to NOT zero everything out such that everyone’s result is “as compared to the average fielder at that position over the entire baseline.”
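
The core computation described above can be sketched roughly as follows. This is a simplified illustration, not the actual UZR or Plus/Minus code: the bucket labels and numbers are hypothetical, the league rates stand in for a multi-year baseline, and every adjustment (park, handedness, and so on) is left out.

```python
def plays_above_average(player_buckets, league_rates):
    """Sum of (plays made - expected plays) over all buckets.

    player_buckets: {bucket: (chances, plays_made)} for one fielder
    league_rates:   {bucket: league catch rate at that position},
                    taken from the multi-year baseline
    """
    total = 0.0
    for bucket, (chances, made) in player_buckets.items():
        expected = chances * league_rates[bucket]
        total += made - expected
    return total

# Hypothetical buckets for a shortstop; labels and values are made up.
league_rates = {"hole-soft": 0.50, "hole-hard": 0.25, "up-middle": 0.75}
player = {"hole-soft": (10, 6), "hole-hard": (4, 1), "up-middle": (8, 7)}

print(plays_above_average(player, league_rates))
```

Because each fielder is compared to the baseline period rather than to the current season, nothing forces the league total for one season to come out to zero.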

In any case, you must do one of three things when presenting the numbers: 1) Clearly tell everyone that you are zeroing every defensive position out per season such that just because a player is zero this year does not mean that he is equivalent to a zero at the same position last year, which goes without saying of course for all metrics. (But, as I said, it is much more likely that the average SS in the NL from one year to the next has changed than it is for the average pitcher or batter in the NL to have changed, simply because of the sample size of the players.) 2) Don’t say anything, and the reader should assume (or he can add it up himself) that each position sums to zero at all times, which is the default position with most other offensive and pitching metrics. Most importantly, if you are NOT zeroing everything out, 3) You MUST make sure that you inform the reader of the baseline that everyone is measured against!

And BTW, a “rolling one-year baseline” is a little odd. For my taste, I would either zero everything out per season, whether it is one week or 25 weeks into the season, I would leave the baseline as whatever it was when you compared everyone’s catch rate (as I said, 6 years for me), or I would arbitrarily use the last X number of years, including the current one. I guess a “one year rolling” baseline is my third option, where X = 1 year, but I would definitely use more than 1 year if I were choosing that option.
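
A minimal sketch of what "zeroing out per season" could look like at one position: take the position's total for the season and remove it from each fielder in proportion to playing time. Allocating by playing time is an assumption made for illustration, and the players and numbers are hypothetical.

```python
def zero_out(ratings, playing_time):
    """Re-center runs so the position sums to zero for the season.

    ratings:      {player: runs vs. the original (e.g. multi-year) baseline}
    playing_time: {player: innings or chances at the position}
    The league surplus is removed in proportion to playing time.
    """
    total_runs = sum(ratings.values())
    total_pt = sum(playing_time.values())
    return {
        player: runs - total_runs * playing_time[player] / total_pt
        for player, runs in ratings.items()
    }

# Hypothetical shortstops; the position as a whole sits at +12 runs.
ratings = {"A": 10, "B": 5, "C": -3}
innings = {"A": 500, "B": 300, "C": 200}

zeroed = zero_out(ratings, innings)
print(zeroed)
print("position total:", round(sum(zeroed.values()), 6))
```

The same function works whether the season is one week or 25 weeks old, since it only uses totals to date.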

BTW, according to my numbers, I do think that overall defense is a little better this year than last year (and probably the year before that, etc.) due to more of an emphasis on defense by several teams and due to declining power (where defense becomes more emphasized by teams in general), but I think it is more along the lines of maybe .1 rpg at the most, which would be something like 100 runs total - certainly not 400 runs. If overall defense were up or down, as Colin points out, we should certainly see it in DER or BABIP. After all, that IS the definition of overall defense. I would not expect there to be an anomalous number of hard or easy BIP once we get into 1000 games or so, and we only have one new stadium this year (which may have increased defense ever so slightly since catching fly balls in the Metrodome was slightly harder than in the average stadium), and I don’t think there were any changes to any other stadiums.

#56 Tangotiger (see all posts) 2010/06/21 (Mon) @ 10:59

Can Ben show the DRS numbers for the players this year and last year, for the same fielders, weighted by the lesser of the two playing time, for players in their 20s?

Ideally, they should be pretty close to identical.
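
The matched-pair check asked for above could be sketched like this. It assumes the caller has already filtered to fielders in their 20s and supplies (runs, innings) per player for each year; the function name and input layout are hypothetical.

```python
def matched_pair_rates(year1, year2):
    """Compare the same fielders across two years.

    year1, year2: {player: (runs, innings)}
    Each player's per-inning rate in each year is weighted by the
    lesser of his two playing times. Returns (rate_year1, rate_year2).
    """
    weight_total = 0.0
    sum1 = 0.0
    sum2 = 0.0
    for player in year1.keys() & year2.keys():
        runs1, inn1 = year1[player]
        runs2, inn2 = year2[player]
        if inn1 <= 0 or inn2 <= 0:
            continue  # no playing time in one of the years
        weight = min(inn1, inn2)
        sum1 += weight * runs1 / inn1
        sum2 += weight * runs2 / inn2
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no fielders with playing time in both years")
    return sum1 / weight_total, sum2 / weight_total

# Hypothetical data; player "c" only appears in one year and is ignored.
last_year = {"a": (10, 1000), "b": (0, 500), "c": (5, 100)}
this_year = {"a": (5, 500), "b": (5, 500)}

print(matched_pair_rates(last_year, this_year))
```

If the two rates differ a lot for the same group of fielders, the shift is more likely coming from the baseline or the data than from the players.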

In any case, I pretty much agree with everything MGL just said.

#57 Rally (see all posts) 2010/06/21 (Mon) @ 20:24

I had intended to zero out TotalZone for each season, but I discovered it wasn’t quite doing that when Sean Forman was putting the code together for Baseball-reference.

Even if I’m using the specific season’s rate, it’s the park factors that can throw things off. The park factors use multi-year data, and unless you have both the same mix of ballparks and the same proportion of batted ball events and buckets each year it will not perfectly zero out.

By buckets I don’t mean zones like UZR, but I do have buckets for hitter/pitcher counts, runner on first or not, left/right batter and pitcher. And for parks I’ve got limited data for the NY parks, and will have to wait to the end of the year for Minnesota.

Right now TZ has all of MLB at -41, or -1.25 per team. Close enough to zero I can live with it. UZR comes out to -.2 for the league, which is probably due to rounding. And of course Dewan’s numbers have a huge positive value for just about everything.

How is it that for the home runs saved metric there are no negatives? The Angels most certainly should show a -2 here. Did you guys not see Torii Hunter’s assist last week on the flyout turned homerun to Casey McGehee?

#58 MGL (see all posts) 2010/06/21 (Mon) @ 23:34

For the home runs saved metric, I don’t think that they scale everything to the league average for ANY time period (IOW, they are not runs or HR saved “above or below average”), which is ridiculous of course.

If on the average, all players save 1.5 HR a year, I think that an average player who saves 1.5 HR gets a positive rating, which is, again, ridiculous, at least without explicitly telling us.

You have metrics that are automatically zero’d out to SOME time period, like UZR and lwts. And then you have metrics that are not, like OPS or RC. That’s fine. But…

Either you have to tell us or it should be obvious. In the case of a metric whose units are runs and they are small, it is not so obvious, like HR saved. So you have to tell us, which they don’t.

And surely having some defensive metrics which are “runs above or below average” and some that are not, coming from the same source (and intending to be added together to give us a complete picture), is problematic.

#59 Rally (see all posts) 2010/06/22 (Tue) @ 01:21

I was kind of joking on that. I know the home runs saved is not scaled to average. But even considering this, you have to give a negative rating to an outfielder who has a ball bounce off his glove and over the wall for a homer.

Or head, if we go back to Jose Canseco.
