Monday, September 3, 2007

Preliminary Correction to the PitchFX data

So I have been doing a lot of work with the pitchFX data recently and I think I have a nice method for correcting the vertical release point, z0 (and other variables), in the data. This post will outline the method that I am using. I'd really, really like some input on what people think about this method. Because of this, this post is going to be very math heavy. If you aren't interested in that kind of stuff just skip it or skip to the final numbers at the end.

Joe P. Sheehan has started to look at this in his post at baseballanalysts.com. In that post he compared pitchers' home/road splits for the vertical release point and showed that the data has some serious quirks. Before any real analysis can take place the data must be normalized. One would like to find a league average release point and then calculate the difference between that average and an average of every pitch thrown in a particular park. The problem with that is that if a team's pitching staff is much taller or shorter than average, the data for its home park is going to appear skewed. Because of this, a more complicated method will be needed.

I want to start by mentioning that at different times this year the pitchFX system has been recording at different distances. They started at 55 ft. from home plate but have moved it in as far as 40 ft. To normalize the data we are going to need a consistent value for the initial position. So I am moving every point that wasn't recorded at 55 ft. back to 55 ft. using the standard equations of motion. You can find a good description of this process at Alan M. Nathan's page here. From now on if I use the words release point I am talking about the z0 data recorded at 55 ft. Once this correction has been done I can start the normalization process. Also, I am using almost all the data available going up to the games played Saturday night (Sept. 2nd).
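Here is a minimal sketch of that step in Python. It assumes each pitch comes with an initial position (y0, z0), initial velocity components (vy0, vz0) and constant accelerations (ay, az), with y measured in feet from home plate and vy0 negative (toward the plate). The function name `move_to_55ft` and the argument layout are made up for illustration and are not taken from my actual code.

```python
import math

def move_to_55ft(y0, z0, vy0, vz0, ay, az, y_target=55.0):
    """Project a pitch's vertical position back (or forward) to y_target.

    Assumes constant acceleration and vy0 < 0 (ball moving toward the plate).
    Returns (z, vz, t), where t is the time offset from the recorded point;
    t is negative when the recorded point is closer to the plate than y_target.
    """
    c = y0 - y_target
    if abs(ay) < 1e-12:
        # No acceleration along y: straight-line motion.
        t = -c / vy0
    else:
        # Solve y0 + vy0*t + 0.5*ay*t^2 = y_target for t.
        disc = vy0 * vy0 - 2.0 * ay * c
        # Pick the root that goes to t = 0 when y0 is already at y_target.
        t = (-vy0 - math.sqrt(disc)) / ay
    z = z0 + vz0 * t + 0.5 * az * t * t
    vz = vz0 + az * t
    return z, vz, t
```

For a pitch recorded at 50 ft., t comes out slightly negative and the returned z is the height the ball would have had at 55 ft. under the constant acceleration assumption.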

My plan is to compare each pair of parks using pitchers that threw in both parks while the pitchFX system was in place. By comparing the difference in release point between the two parks we can get an estimate of the difference in the pitchFX system in each park. This process assumes that pitchers aren't changing their release point over time. Obviously, some of this is going to occur, but pitchFX has tracked over 200,000 pitches so hopefully we can overcome this with the aid of large statistics. Sheehan used Josh Beckett as an example in his analysis and it turns out that Beckett's numbers are pretty interesting so I will use him as an example as well. Here is a histogram of Beckett's release point for every park he has pitched in that had an operational pitchFX system.

Obviously, Beckett has thrown the most pitches at Fenway and had exactly one start in each of the other parks. If you scaled each distribution to the same size, though, they would look pretty much like each other. This implies that Beckett hasn't changed his release point and that the only differences between the distributions come from the pitchFX system. As Sheehan noted, the data from Fenway appear to be way different from the data at the other parks. That said, there are still relatively large variations in his road starts as well.

So to start I would like to calculate the difference between the system's output in two parks. To do this I need to search for pitchers that have pitched in both parks while pitchFX has been turned on. To be added to the sample, a pitcher must also have thrown at least 20 pitches in each park. I then calculate the mean and the variance for each pitcher in each park. Then I calculate the difference between the parks by subtracting the means. To get an error on this difference I add the two errors in quadrature, which is the same as adding the variances. I then have a list of pitchers who pitched in both parks, each with a mean and a variance for the difference between the parks. I want to average them, but to do that I am going to use a weighted mean so pitchers with lower variance will count more. Do this for each pair of parks and you have the difference in release point between the parks.
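The steps above can be sketched roughly as follows. This is an illustration, not my actual code: the function name `park_difference` and the input layout (a dict mapping each pitcher to his list of z0 values in that park) are hypothetical. It also assumes that the variance attached to each pitcher's mean is the variance of the mean (sample variance divided by the number of pitches) and that the weights are inverse variances.

```python
from statistics import mean, variance

MIN_PITCHES = 20  # minimum pitches a pitcher needs in each park

def park_difference(z0_park_a, z0_park_b):
    """Weighted mean of (park A minus park B) over common pitchers.

    Each argument maps pitcher -> list of z0 values (ft.) in that park.
    Returns (difference, variance_of_difference), or None if no pitcher
    qualifies in both parks.
    """
    weight_sum = 0.0
    weighted_diff_sum = 0.0
    for pitcher in z0_park_a.keys() & z0_park_b.keys():
        a, b = z0_park_a[pitcher], z0_park_b[pitcher]
        if len(a) < MIN_PITCHES or len(b) < MIN_PITCHES:
            continue
        diff = mean(a) - mean(b)
        # Assumption: use the variance of each mean, then add the variances
        # (i.e. add the errors in quadrature).
        var = variance(a) / len(a) + variance(b) / len(b)
        if var <= 0.0:
            continue  # degenerate sample, cannot be weighted
        weight = 1.0 / var  # lower variance counts more
        weight_sum += weight
        weighted_diff_sum += weight * diff
    if weight_sum == 0.0:
        return None
    return weighted_diff_sum / weight_sum, 1.0 / weight_sum
```

With inverse-variance weights, the variance of the weighted mean is just one over the sum of the weights, which is what the second return value is.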

What we really want, though, is the difference between each park and a league average. Fortunately, if we add each of the differences for a park together we should get the difference between that park and league average. Do this for each park and you have every park's difference from league average.
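One way to read this step, sketched in Python: if each pairwise difference is (offset of park A) minus (offset of park B), then summing a park's differences against every other park and dividing by the number of parks gives that park's offset minus the average offset. The division by the number of parks and the name `league_offsets` are assumptions made for this sketch, and it only works out exactly when every pair of parks has a measured difference.

```python
def league_offsets(pair_diffs):
    """Turn pairwise park differences into offsets from league average.

    pair_diffs maps (park_a, park_b) -> mean difference (a minus b),
    with one entry per unordered pair of parks.
    """
    parks = sorted({park for pair in pair_diffs for park in pair})
    totals = {park: 0.0 for park in parks}
    for (a, b), diff in pair_diffs.items():
        totals[a] += diff  # a minus b counts toward park a
        totals[b] -= diff  # b minus a is the same number with the sign flipped
    # Sum over the other parks of (o_a - o_b) equals N*o_a - sum(o),
    # so dividing by N leaves o_a minus the league mean.
    return {park: totals[park] / len(parks) for park in parks}
```

If some pairs have no common pitchers, the simple division above is no longer exact and the offsets for those parks would need more care.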

As an aside, this is where interleague games really come in handy. Without them there would be little crossover between AL and NL parks and we would probably have to settle for separate AL and NL adjustments instead of a league adjustment. So interleague play is good for something.

So here are the results:
z0 release point at each park compared to league average (ft.)
bos -0.653
nyn -0.357
cha -0.276
sdn -0.243
sea -0.176
flo -0.175
tba -0.083
tor -0.083
det -0.073
lan -0.061
hou -0.046
sln -0.030
nya -0.011
atl -0.006
was 0.000
bal -----
pit -----
kca 0.016
ari 0.017
cle 0.019
mil 0.034
oak 0.071
col 0.145
phi 0.183
min 0.195
tex 0.243
chn 0.253
cin 0.285
ana 0.362
sfn 0.450

First, pitchFX has yet to be turned on in Baltimore and Pittsburgh. Both of those teams are on a road trip right now and I expect that when they come back home data will start rolling in for them as well. Washington has been turned on but only for a very limited time. Hopefully that system will be back up shortly and Washington can be added as well.

These results generally agree with what Sheehan found using home/road splits but there are a few parks that changed a good deal. For instance, Colorado looked like it was producing very high release points but now has settled down. Because pitchFX was installed very early in the west coast stadiums, and because San Francisco and San Diego are pretty far away from league average, the home/road splits could have been corrupted. Maybe something similar was happening with Boston and Toronto's Rogers Centre.

Going back to Beckett for a moment, we can now see from the chart that every single park he has pitched in with pitchFX turned on is below league average. So if you were to look at his data without first adjusting for these effects you would really be shortchanging Beckett.

In any case, it is important to test these results to see how much of a difference they make. To do this I looked at every pitcher in the league and calculated the mean and variance of his release point at home and on the road, first without applying these corrections and then after applying them. I then calculated a league average mean and variance for home and road games, again using weighted means for each of the samples. If the corrections are working the road variance should move much closer to the home variance, and that is exactly what I find. Without the correction the home variance is .00004187 and the road variance is .00005516. While these numbers look small, the percent difference between them is over 30%. After the corrections the home variance becomes .00004173 and the road variance becomes .00004213. Now that is a remarkable change. You may be asking why the home variance changed at all. After all, if you are just adding a constant to the home numbers the variance should remain unchanged. The slight change comes from pitchers who changed teams and so have at least two home parks. The way I have my code set up it is very hard to remove them from the sample, so I left them in and that slightly changed the home variance.
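For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes the percent difference is taken relative to the home variance, which is the reading that gives a figure over 30% for the uncorrected numbers; the helper name `percent_difference` is just for illustration.

```python
def percent_difference(road_var, home_var):
    """Percent by which the road variance exceeds the home variance."""
    return 100.0 * (road_var - home_var) / home_var

# Variances quoted in the post, before and after the park corrections.
before = percent_difference(0.00005516, 0.00004187)
after = percent_difference(0.00004213, 0.00004173)
print(f"before: {before:.1f}%  after: {after:.1f}%")
```

Under that definition the gap goes from roughly 32% before the corrections to about 1% after.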

So what is next? Well, I have to run the numbers for the horizontal release point, x0, and then for the initial velocities. All of these numbers will need to be corrected before we really can use the data properly. The initial velocities will be more difficult because pitchers might throw a different percentage of off speed pitches to different teams. The Marlins, for example, are supposed to be a very good fastball hitting team. When a visiting pitcher faces them he probably will throw more off speed pitches, making it appear like his overall velocity is down if you average over all pitches. A possible workaround will be to identify just the fastballs in the sample and then only compare them. In any case, it will take me a few days to produce those numbers as the code needed for this takes forever to run and my eyes, they bleed. If you got this far, congratulations. Please consider leaving a comment if you have any questions or concerns about the method. Thanks.

6 Comments:

At September 5, 2007 12:24 AM , Blogger Mike said...

Josh, this looks like great stuff. I'm definitely interested in your results for other parameters and whatever of the gory details you want to share about your method.

I've felt for a while that what Bill Ferris, John Beamer, and Joe P. Sheehan had done with looking at the data integrity from PITCHf/x was only scratching the surface of a much more complicated problem.

Have you looked at which of the deviations from the mean for z0 would be considered statistically significant?

Also, are there differences between LHP/RHP?

I'd love to get some feedback from Sportvision about this, since they claim their data is within an inch or two. I think if we can get together some statistically significant data from the complete data set, rather than the dribs and drabs we've had so far, we might be able to do that.

Again, excellent work.

 
At September 5, 2007 9:48 AM , Blogger Josh Kalk said...

Mike,

Thanks for the kind words. It is funny you bring up the statistically significant part because the first thing on my list to do today was to get the errors propagated over the last step. Because the last step involves adding the errors in quadrature, parks that don't have a lot of common pitchers with even one other park have their errors blow up. On the bad side, parks like Tropicana have errors like 6, which is obviously huge. On the good side, if a park has a decent number of common pitchers with all other parks things work out great. Here are some of those with their means and variances for data up to Sunday (everything in ft.):

park mean variance
sdn -0.262 0.02817
sea -0.174 0.00669
lan -0.061 0.02935
sln -0.029 0.01239
sfn 0.452 0.00089

I haven't run an ANOVA test on the sample but just looking at those five parks I think it is safe to say that pitchFX is clearly not accurate to an inch at 55 ft. from home plate across all parks.

So what can be done about the parks like Tropicana, Fenway, and Rogers Centre that all have errors larger than 1? Well, in each case it appears this is happening because of one or two parks. Maybe the answer is to move everything to second order: not just look at the pitchers who pitched in park A and park B, but also at pitchers who pitched in park A and park C and then other pitchers who pitched in park C and park B. That is probably what I will be trying tonight.

As for the differences between LHP/RHP, I haven't broken the sample down like that but that is something I definitely will be trying. I haven't posted the x0 results yet because they just finished this morning, but without doing any LHP/RHP corrections there I see just as good of a match as I do with the z0 data.

Thanks again for the interest and hopefully I will have more interesting stuff for you to look at tonight.

 
At September 5, 2007 8:16 PM , Blogger Harry Pavlidis said...

Very nice, I'm very glad someone is taking this on.

Also, perhaps some of the variance could be in mound height? Would that not even be on the order of magnitude to impact the z0, amongst other potential factors?

 
At September 5, 2007 9:15 PM , Blogger Alan said...

Josh...if I am reading your writeup correctly, there is over a 1 ft difference between the two extremes (Boston and SF). That is an enormous difference. Here is something you can do to see if it all makes sense. Consider only fastballs, which we can take to be pitches >90 mph. First thing to look at is the initial z-component of the velocity. A negative z velocity means the pitch is thrown slightly downward. Do you see a correlation between the release point and the initial z velocity? Does the pitcher compensate for the higher release point with a larger downward component of velocity? Ultimately, the ball has to drop into the strike zone, so another thing to look at is the z component of the acceleration (which should be negative). Do you see systematic differences between SF and Boston? If the camera calibrations are correct, I would not expect to see any differences. If you see a larger downward acceleration for SF than for Boston, that might indicate a difference in the vertical calibration between home plate (where it must be pretty close to right) and the mound.

 
At September 5, 2007 11:09 PM , Blogger Josh Kalk said...

Harry,

Yes, the mound height certainly could have an effect. MLB is supposed to make sure the mounds are the same but who knows how often, if at all, they check that. The problem with the data is that right now the corrections are very large, nearly a foot between the highest and lowest values. If the mounds really were that different in the parks people would notice. Once we get the data behaving better, though, mound height is something that we will definitely have to consider.

 
At September 5, 2007 11:26 PM , Blogger Josh Kalk said...

Alan,

Thanks for your suggestions. I definitely will take a peek at the correlations you suggested between z0 and vz0 (and the vertical acceleration too). I will try to get a post looking at that up tomorrow.

I am a bit leery of using pitches >90 MPH to define fastballs though. My initial numbers seem to indicate that some parks' velocity measurements might be off by 5 MPH. If that is anywhere close to being correct, then if I make a cut at 90 MPH I am going to get some 85 MPH pitches included in my sample. Certainly, many of those pitches could be off speed pitches. I may start with a cut like that but I definitely want something more advanced later on.

 
