Somewhat Pretty Pictures
Ok so the last couple of posts have probably been pretty boring to most people, so I am going to interject with some plots from the pitchFX system so we can maybe get a better feel for things. First up, I want to break down all the pitches tracked by pitchFX by the park each pitch was thrown in.

Click on all of these plots to enlarge them. As you can see, the system rolled out first mostly on the west coast and then moved east, skipping some parks along the way. If you are a fan of a team in the AL West, your team will have huge amounts of pitchFX data to work with. If your favorite team is in the AL East or NL Central, then not so much. PitchFX has just been added to the New York parks and to Tropicana, with fewer than 2,000 pitches thrown in them. As I pointed out in my last post, RFK has only seen one game with the system turned on. Looking at this discrepancy in the amount of data, and then looking at the errors on each of the parks in the post below, it is pretty clear that getting to at least a few thousand pitches is important if we want to be able to do anything with the statistics in those parks. With less than a month left in the season, this seems unlikely in at least three parks.
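If you want to run the same kind of per-park tally on your own copy of the data, here is a minimal sketch in Python. The park names and records are made up for illustration, and the function names are my own invention, not anything from the pitchFX feed.

```python
from collections import Counter

# Hypothetical records: (park name, pitch id). Real pitchFX data would have
# one row per tracked pitch; these values are invented for the example.
pitches = [
    ("Park A", 1), ("Park A", 2), ("Park A", 3),
    ("Park B", 4),
    ("Park C", 5), ("Park C", 6),
]

def pitches_per_park(records):
    """Count how many tracked pitches each park has."""
    return Counter(park for park, _ in records)

def parks_below(counts, minimum):
    """Return the parks with fewer than `minimum` pitches, sorted by name."""
    return sorted(park for park, n in counts.items() if n < minimum)

counts = pitches_per_park(pitches)
# With real data the cut would be a few thousand pitches; 3 is used here
# only because the toy sample is tiny.
sparse = parks_below(counts, minimum=3)
print(counts.most_common())
print(sparse)
```

Parks that land in the `sparse` list are the ones where the statistics probably can't be trusted yet.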
Moving on to the actual data, the only quantities that pitchFX actually measures are an initial position, velocity, and acceleration in the x, y, and z directions. So let's take a look at some of these quantities, starting with our favorite, z0.

Starting with the uncorrected data, you can see that most pitches are thrown at a height of about 6 feet, but if you look closely you can see another bump around 3 feet. That must be where the sidearmers throw. Often when comparing data that spans several orders of magnitude, people will plot that data on a logarithmic plot (here is a wiki article on log plots if you are interested), and here is one of the same data below.
It might not look very similar, but if you look at the statistics in the upper right hand corner they are identical. The only difference is in the Y axis, which is now on a logarithmic scale. This scale is really handy for looking at the tails of the distribution, and looking at this plot we can see several bumps as we go down in release point. The data continues until the release point is actually negative! On the other side we see some pitches that were thrown at a release point of 10 feet. Obviously something is wrong here, and if you look at these plots pitcher by pitcher you won't find just one guy who is causing the problems. This appears to be our first sign (though others have shown plenty more) of pitchFX just whiffing on a pitch. What about when we apply our correction factor from the post below?
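For anyone curious why the log scale makes the tails pop out, here is a rough sketch in plain Python. It bins some made-up release heights and takes log10 of each bin count. The numbers are invented for illustration and are not the actual pitchFX values; the bin edges are an assumption too.

```python
import math

def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins covering [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the top edge
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts

def log_counts(counts):
    """log10 of each bin count; empty bins give None since log(0) is undefined."""
    return [math.log10(c) if c > 0 else None for c in counts]

# Made-up release heights in feet: a big peak near 6, a small bump near 3,
# and two stray values standing in for mis-tracked pitches.
z0 = [6.0] * 1000 + [5.5] * 100 + [3.0] * 10 + [-0.5, 9.8]

counts = histogram(z0, lo=-1.0, hi=10.0, nbins=11)
logs = log_counts(counts)
for i, (c, lg) in enumerate(zip(counts, logs)):
    print(f"bin starting at {-1.0 + i:4.1f} ft: count={c:5d} log10={lg}")
```

On a linear axis the bins holding 1 or 10 pitches are invisible next to the bin holding 1000. After taking the log they sit at 0, 1, and 3, so the small bumps and the stray points are easy to see.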
Here the data appears to be a little better behaved, with a smoother drop as we go down in z0. There are still some values that are just way out there, but things are looking a bit better and the RMS (which is a fancy way of measuring the spread of the data, closely related to the variance) has gone down, which is something we would expect from corrected data. What about the x0 data?
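As a quick illustration of what the RMS is measuring, here is a small sketch. It assumes the RMS in the stats box means the root-mean-square deviation from the mean (in other words the standard deviation); that is an assumption about the plotting tool, not something stated above. Both data sets are made up.

```python
import math

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def rms_about_mean(values):
    """Root-mean-square deviation from the mean (population standard deviation)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

# Invented release heights in feet, before and after a hypothetical correction.
raw = [6.0, 5.0, 7.0, 6.0, 2.0, 10.0]
corrected = [6.0, 5.5, 6.5, 6.0, 5.0, 7.0]

print(rms_about_mean(raw))
print(rms_about_mean(corrected))
```

Both lists have the same mean, but the second is bunched more tightly around it, so its RMS is smaller. That is the sense in which a drop in RMS suggests the correction is pulling the data together.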
The x0 data shows what you would expect, with two peaks: one on the left side of zero (these are the right-handed pitchers; remember, these coordinates are as the catcher sees them) and another on the right side. Both peak around two feet from zero and both have a tail that goes out. The real question is why are there so many points near zero? The answer is that pitchers stand on different parts of the rubber. If a left-handed pitcher stands at the extreme left side of the rubber, he is going to release the ball very close to zero, or right in the middle of home plate. Let's zoom in and take a look at the plot in log form.
Again we can see a shoulder in the data that is where the sidearmers are. This is easier to see for the lefties, but it is there for the right-handers as well. Also, we see some more pathological points extending out 10 feet from 0. Just like with the z0 data, we are going to have to clean that up before we can start to really analyze it. Do things get any better with the corrected data?
We still see just as many pathological points, but the peak where the sidearmers are throwing is clearer. Again, this is exactly what we would hope for in correcting the data. Hopefully, as the large error on the x0 correction goes down, the data will get even clearer. What about the initial velocities?
Because the y direction is towards home plate, the initial velocity in y is the largest component of the initial speed of the ball. All the values are negative because the ball is traveling to home plate from 55 feet. I have converted this into miles per hour to make things easier to see. The peak here is at about 90 MPH with a shoulder around 82 MPH. Presumably, the points near the peak are mostly fastballs and the points near the shoulder are breaking balls. How about the tails of these distributions? Do they show as many ugly points as the positions?
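The unit conversion is simple enough to sketch. This assumes the raw velocity components come in feet per second (the positions are in feet, but the velocity units are my assumption), and the function names are made up for the example.

```python
import math

FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def fps_to_mph(v_fps):
    """Convert a speed in feet per second to miles per hour."""
    return v_fps * SECONDS_PER_HOUR / FEET_PER_MILE

def speed_mph(vx0, vy0, vz0):
    """Total initial speed in MPH from the three velocity components in ft/s."""
    return fps_to_mph(math.sqrt(vx0 ** 2 + vy0 ** 2 + vz0 ** 2))

# Assumed example: a pitch moving toward the plate at 132 ft/s.
# The y component is negative because y decreases on the way to home plate.
vy0 = -132.0
print(fps_to_mph(abs(vy0)))       # 90.0
print(speed_mph(3.0, vy0, -4.0))  # slightly above 90, since x and z add a little
```

So a vy0 of about -132 ft/s lines up with the 90 MPH peak, and the x and z components only nudge the total speed up slightly.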
It looks to me that this plot is much more reasonable. There are a few points near 105 MPH, but some pitchers (Zumaya) might be able to get it close to that, and if they were pitching in a park with a fast pitchFX system that night, that doesn't seem too unreasonable. What about the pitches on the other side? Why are there so many pitches below 60 MPH? A possible explanation would be intentional balls. It is possible that a pitcher is just lofting those in at a very low speed. To check that, we can plot the velocities of all of those intentional balls, because that data has been added to the system.
So the really, really slow ones are not from intentional balls. Those seem to be a good portion of the balls thrown near 60 MPH, but few of them are below 50 MPH and none of them are below 40 MPH. So again it appears that there are some issues with the data. Hopefully these outliers can be properly removed and then real study can begin.
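One crude way to set the obvious outliers aside is a plain speed window. The sketch below is only an illustration: the 40 and 105 MPH limits are assumptions loosely based on the plots above, the sample pitches are made up, and a real cleanup would need something more careful than a hard cut.

```python
def split_by_speed(pitches, low=40.0, high=105.0):
    """Split (speed_mph, is_intentional_ball) records into kept and suspect lists.

    A pitch is kept when low <= speed <= high. Both limits are assumed
    values for illustration, not official pitchFX thresholds.
    """
    kept, suspect = [], []
    for speed, intentional in pitches:
        if low <= speed <= high:
            kept.append((speed, intentional))
        else:
            suspect.append((speed, intentional))
    return kept, suspect

# Invented sample: two normal pitches, one intentional ball, two bad readings.
sample = [(91.0, False), (82.0, False), (58.0, True), (38.0, False), (112.0, False)]

kept, suspect = split_by_speed(sample)
print(kept)
print(suspect)
```

The intentional ball at 58 MPH survives the cut, while the 38 and 112 MPH readings get flagged for a closer look rather than silently dropped.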

5 Comments:
Hi Josh,
A blogger pointed me in your direction, and after reading through your work I'm really impressed by your knowledge of the numbers side of baseball. I'd like to talk to you more about your writing, but I can't seem to find an e-mail link on your site. If you could, drop me a line at chumes@mvn.com at your leisure.
Thanks, and keep up the great work,
Cory Humes
chumes@mvn.com
http://mvn.com/mlb
Thanks for the kind words, Cory. You should have an email in your inbox :)
Excellent work, again, Josh. I really appreciate you sharing this stuff with the rest of us.
I've seen a few outliers in just about every data set of more than 100 pitches that I've worked with, even after intentional balls are taken out, but I haven't had a systematic way to get rid of them.
There are all sorts of weird things in other parts of the Gameday data, too, like an August game in San Francisco being recorded as "indoors" and a temperature of 0 degrees. I know it's cold in San Francisco in summer, but I don't think it's that cold.
Having spent a night game in the upper deck at AT&T, I can tell you it is cold. But obviously not that cold. And not when the game is being played indoors :) I am hopeful that things like that haven't messed up the pitchFX system and that it is just someone punching some keys wrong when the data was entered.
What I am really interested in is your comment about outliers in every data set over 100 pitches. It would have been nice to be able to just cut out a section of data once it was deemed bad but if the system is screwing up say, one time in 20 all the time there are ways of dealing with that as well.
If you look at some of my plots above it looks like the pathological points are pretty evenly distributed. Or at least they are linearly distributed. Background subtraction on data like that is something that is done all the time and when I look at plots like the ones above it actually makes me feel better about our chances of really understanding the data sooner rather than later.
I was looking at a data set for Jamie Moyer this evening, and found one of the obvious outliers in my data, and just thought I'd mention it to you. I don't know if it helps your work to have examples, particularly ones like this that are pretty obvious and thus easier to weed out of the data, but I'm just happy someone else is looking at this facet of the data and so I have to share.
I was particularly interested in Moyer's changeup, which he routinely throws with a start_speed of 68-71 mph (uncorrected for changes in y0), with the slowest pitch at 66mph. Except for one outlier at 38mph.
That pitch was recorded in a game on August 12, in the third inning, in the middle of a Yunel Escobar at-bat. It was a ball just off the inside corner, above the knees. It was surrounded by otherwise normal pitches of speeds 74, 83, 81, and 82 mph. It's just weird.