Response to a Reader Question
A couple of posts back reader Alan brought up some interesting ideas for checking the data. Here is part of his comment.
Consider only fastballs, which we can take to be pitches>90 mph. First thing to look at is the initial z-component of the velocity. A negative z velocity means the pitch is thrown slightly downward. Do you see a correlation between the release point and the initial z velocity? Does the pitcher compensate for the higher release point with a larger downward component of velocity?
I want to examine these correlations and add in a few more variables to help complete the picture. At the time, Alan had wanted me to use the two parks with the largest separation in vertical release point, z0, which were Fenway and AT&T. Since then I found a bug in my code, and now the two parks that are furthest apart are Fenway and the Metrodome. The problem is that both of those parks are on the low end in number of pitches tracked by PITCHf/x. So instead I am going to start by looking at Petco Park in San Diego and AT&T Park in San Francisco. AT&T doesn't have a whole lot more statistics than the Metrodome, but it has a lower variance in my correction factor, and the Giants and the Padres play each other very regularly, so hopefully the overlap of pitchers in the data will be larger.
I should also note that I am a little bit concerned about defining every pitch with an initial speed above 90 MPH as a fastball. While I am not too concerned about actual fastballs below 90 MPH being missed by this definition, I am concerned that some breaking balls will enter the sample. Not too many pitchers throw a 90 MPH breaking ball, but my initial correction factor for the error on the pitch speed is about 5 MPH, and there are plenty of pitchers who can throw an 85 MPH breaking ball. Nevertheless, I haven't come up with a better definition at this time, and this definition will work for our purposes today.
To start with, I am going to check the correlation between the initial vertical release point and the vertical height when the ball crosses home plate. The reason I want to check this first is something else Alan said in his comment: that the calibration should be better near home plate. That got me remembering this tidbit from Joe P. Sheehan when he was writing about differences in the parks here:
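As a minimal sketch of the speed cut described above, the following filters a list of pitches on reported initial speed. The `Pitch` record, its field names, and the sample values are hypothetical and chosen only for illustration; they are not taken from the actual analysis code.

```python
from dataclasses import dataclass


@dataclass
class Pitch:
    """Hypothetical pitch record; field names are illustrative assumptions."""
    pitcher: str
    park: str
    start_speed_mph: float  # reported initial speed
    z0_ft: float            # reported vertical release point
    vz0_fps: float          # reported initial vertical velocity


def select_fastballs(pitches, min_speed_mph=90.0):
    """Keep only pitches whose reported initial speed exceeds the cut."""
    return [p for p in pitches if p.start_speed_mph > min_speed_mph]


# Made-up pitches, for illustration only.
sample = [
    Pitch("A", "Petco", 92.3, 6.1, -4.0),
    Pitch("A", "Petco", 84.7, 6.0, -1.5),
    Pitch("B", "AT&T", 90.0, 5.9, -3.2),
]
fastballs = select_fastballs(sample)
```

Note that the cut is applied to the reported speed, so any park-dependent error in that speed changes which pitches pass it.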
Almost all of the pitchers also get a smaller pfx_z [movement of the pitch vertically] value at home, which would seem to indicate that their pitches have more sink at Fenway, but is actually a result of the lower release height combined with the fact that, overall, the average height when a pitch crosses the plate at Fenway is similar to the height at other parks.
So he was seeing a very large variation in the release point but a small variation when the pitch crossed home plate. This doesn't seem to make sense and I want to look at this first. So finally, here is a plot comparing the initial and final height of the ball at Petco and AT&T parks.
Pitchers tend to release the ball about 6 feet above ground level, though obviously this will vary from pitcher to pitcher. We can see in this data, though, that the Petco data tends to be below 6 feet and the AT&T data tends to be above 6 feet. Also, we can see a bunch of points near 3 feet in the San Diego data. These are from sidearmer Cla Meredith of the Padres. You would expect to see a few points from him show up in the San Francisco data, but they appear to be missing. So I went back and checked, and Meredith has yet to pitch at AT&T Park while PITCHf/x was activated. There appears to be another grouping of pitches just above 4 feet at Petco. This is almost certainly another Padres pitcher, but I haven't yet tracked him down. If there are any Padres fans who know who this is, please let me know.
Anyway, besides the disparity in initial height, the height as the ball crosses home plate appears very consistent across both parks. If the initial position is off by as much as we think, then why is the final position so stable? It must be, as Alan suggested, that the PITCHf/x system is more stable near home plate. I have a theory as to why this is, but I am going to save that for my next post, when I go in depth into what I think is actually happening with the data. What this is showing is that the initial and final heights of the baseball aren't correlated at all. This means we should be free to correct the initial position without worrying about changing the final position (as funny as that sounds). Here then is the same plot with the vertical correction applied.
What an improvement that makes. Again, because this correction is based on a pitcher-by-pitcher comparison of each park, this shift isn't moving the center of the Petco data onto the center of the AT&T data. Because the Padres have a few pitchers who throw from a very low height, that difference still remains in the data. The "average pitcher" who releases the ball just above 6 feet, though, will come together, and that is exactly what the corrected plot shows. Now we are ready to look at the initial height and the initial vertical velocity to see if there is a correlation there. Because we aren't seeing a correlation between the two heights, something must be causing that, and it pretty much has to be either the initial velocity or the acceleration or both. We start again with the uncorrected data.
Here we can see a clear correlation, and it is exactly what we would expect: the higher the release point, the more negative (downward) the initial vertical velocity. This makes perfect sense. The only problem is that the data looks terrible. Again we see a difference in the initial height, but there appears to be more going on here. Let's start by correcting for the initial height and see what that gives us.
Now the heights seem to match up well (except again for the two blobs, now at 3 feet and near 5 feet), but the velocities seem off. The AT&T data appears to have more downward initial velocity than the Petco data. So I am going to apply a correction to the initial velocities that I calculated the same way I calculated the initial height correction. As I pointed out in previous posts, the errors that I am seeing on these corrections are huge. For instance, Petco checks in as being high by 0.5 ft/s with an error of 146 ft/s (AT&T is nearly 1 ft/s low). Obviously that doesn't make any sense, and either something is still wrong with my code, or we just need more data, or I need to correctly identify the fastballs, or I need to carry the calculation out further. Because of this I am not yet going to publish these corrections. I don't really trust these numbers and I don't want people using them until I feel confident that they are correct. Once I get them fixed, though, I will be putting the numbers out for people to use. Just for fun, let's put in the numbers and see what happens.
Wow, that looks pretty good. I just don't understand why I am seeing such a huge error when I look at plots like this showing things matching up well. There is another interesting thing that can be seen in this plot. Remember back when I said I was concerned about making a hard cut at 90 MPH for the pitch speed? The reason was that the cut wouldn't be uniform over the parks. Here, AT&T was increasing the initial pitch speed by having a more negative initial vertical velocity. Petco was doing exactly the opposite. That means we are actually seeing some 87ish MPH pitches in the Petco data and we are only seeing 93ish MPH pitches in AT&T. I believe that is why the AT&T data fits snugly inside the Petco data. The slower the ball is moving, presumably the more potential for break (acceleration) there is and the wider the variation in position and velocity.
That was interesting, but while Petco and AT&T were at the extremes for variation in initial height, they were closer to the middle of the pack for variation in initial downward velocity. What if we look at two parks that are very extreme in both categories? The two best (worst?) parks here are Fenway and Angel Stadium.
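The statement that the initial and final heights are uncorrelated is read off the plot, but it can also be checked numerically. Below is a small sketch of a Pearson correlation coefficient written with only the standard library; the function name is an assumption for illustration, and no real pitch data is included here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

Called with the release heights as `xs` and the plate-crossing heights as `ys`, a value near zero would support the reading of the plot, while a value near +1 or -1 would contradict it.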
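The correction itself was worked out in earlier posts and is not reproduced here. Purely as an illustration of what a pitcher-by-pitcher park comparison could look like, the sketch below averages, over pitchers seen both at a park and elsewhere, the difference between each pitcher's mean release height at that park and his mean release height everywhere else. The function name, the record layout, and the unweighted averaging are all assumptions, not the method actually used.

```python
from collections import defaultdict
from statistics import mean


def park_offset(records, park):
    """Estimate a park's release-height offset from per-pitcher differences.

    records: iterable of (pitcher, park_name, z0_ft) tuples (hypothetical layout).
    Returns the unweighted mean, over pitchers seen both at `park` and
    elsewhere, of (mean z0 at park) - (mean z0 elsewhere).
    """
    at_park = defaultdict(list)
    elsewhere = defaultdict(list)
    for pitcher, where, z0 in records:
        (at_park if where == park else elsewhere)[pitcher].append(z0)
    diffs = [
        mean(at_park[p]) - mean(elsewhere[p])
        for p in at_park
        if p in elsewhere
    ]
    if not diffs:
        raise ValueError("no pitcher appears both at this park and elsewhere")
    return mean(diffs)
```

Because each pitcher is compared only with himself, a staff full of low-slot pitchers does not drag the park's offset down, which is consistent with the low blobs remaining in the corrected plot.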
Wow that plot looks ugly. Hopefully after our corrections things will get better. Again we will start by correcting just the initial height.
Not quite the nice fit we saw before (in the initial height match). Part of this could be due to the Boston staff being shorter than usual, but part of it might be due to error on these numbers. Fenway is checking in at an error of nearly 0.2 feet, and while that might not seem like a lot, if you moved the purple points right 0.2 feet it sure would look better to me. Now on to the initial velocity adjustment. These two parks are over 4 ft/s (over 2 MPH) different in just their initial downward velocity according to my numbers. Again, the errors on these numbers are huge, but let's put them in and see what we get.
While not as perfect as the AT&T/Petco match, this is a huge improvement for two parks that were radically different to start with. This basically is the worst-case scenario for having to correct the data, and the results seem very good to me. If this was all the closer we could get with these corrections I would still be happy.
OK, so I have shared the good news with you. Looking at these plots, it really seems like not only can we understand what is going on with the data but we can fix it as well. Now the fly in the ointment. The other parameters that are vital to these calculations are the accelerations (in x, y, and z). For this data Sportvision is assuming that the acceleration is constant over time, meaning the change in velocity when the pitch is thrown is the same as the change in velocity as the pitch goes over home plate. Now, obviously this isn't a perfect assumption, as the ball could be slowing down more the closer it gets to home plate. The problem is that if you allow for a changing acceleration, then the nice equations of motion that they use fall apart and things become even more messy. In reality, it probably isn't bad at all to make the assumption that the acceleration isn't changing (though I can't say for sure; if you are looking for a topic to tackle using this data, this would be an idea), but the problem for us is that the method we are using for creating corrections for the initial distances and velocities won't work. This means that if we find that the accelerations need fixing, along with the positions and the velocities, then we are going to have to come up with a different method than the one I have outlined for fixing them.
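For reference, the "nice equations of motion" under a constant-acceleration assumption are the standard kinematic ones: each coordinate follows p(t) = p0 + v0*t + a*t^2/2. The sketch below evaluates them and solves for the time at which the ball reaches the plate. The function names are hypothetical, and the default plate distance of 1.417 ft (the front of home plate) is an assumption about where the final position is reported.

```python
from math import sqrt


def position(p0, v0, a, t):
    """Position along one axis after time t under constant acceleration."""
    return p0 + v0 * t + 0.5 * a * t * t


def time_to_plate(y0, vy0, ay, y_plate=1.417):
    """Earliest time at which y(t) == y_plate for a pitch thrown toward the plate.

    Assumes y decreases toward the plate, so vy0 < 0. The default y_plate
    is an assumed reference distance in feet, not a documented constant.
    """
    dy = y0 - y_plate
    if ay == 0:
        return -dy / vy0
    disc = vy0 * vy0 - 2.0 * ay * dy
    if disc < 0:
        raise ValueError("pitch never reaches the plate with these parameters")
    # Root of y0 + vy0*t + 0.5*ay*t**2 = y_plate that is positive for vy0 < 0.
    return (-vy0 - sqrt(disc)) / ay


def height_at_plate(y0, vy0, ay, z0, vz0, az, y_plate=1.417):
    """Vertical position when the ball reaches the plate."""
    t = time_to_plate(y0, vy0, ay, y_plate)
    return position(z0, vz0, az, t)
```

This also shows why the final height depends on the initial height, the initial vertical velocity, and the vertical acceleration together: an error in one can be hidden by compensating errors in the others.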
Close your eyes (or turn off your monitor) if you don't want to see the bad news.
Going back to Petco and AT&T here is the vertical acceleration compared to the initial height. Again we can see the problems in the initial height because this is uncorrected data, but the accelerations don't seem to be matching up well either. Correcting for the initial height we can fully see the problem.
Ick. Again we can see Meredith and his fastballs that appear to be breaking down very hard (sinkers). Also, pitcher X's data has come out from hiding a bit, and we can see his contribution near 5 feet in initial height and -40 ft/s^2. His fastball must be a sinker as well. The bad news, though, is that it appears an acceleration correction is going to have to be made for this data to match up. It is close, but just not close enough. This really sucks, because what appears to be happening is that the acceleration is being spread out in Petco, and this correction won't be a nice linear one like the position and velocity corrections have been. Just for more proof, here is the Fenway/Angel Stadium plot, uncorrected first.
Again, these two parks are just about as bad as the data is going to get unless one of the last two parks to come online really sucks. Correcting for the initial height things get better but still look pretty poor.
Again we are seeing a spreading out of the accelerations. Instead of being able to match these two distributions by moving one or the other left/right or up/down, the distributions will have to be shrunk or spread out. It is possible that my artificial cut at 90 MPH is doing some of this (like we saw in the position/velocity graphs), but I don't think it is responsible for all of it.
So where do we stand? Even without a great way of teasing the fastballs out of the data, it appears that we will eventually be able to get some good correction factors for the initial positions and velocities. The accelerations are another story and something that will have to be thought about. If anyone has a good way of cutting the data to produce fastballs and is interested in sharing it, please let me know. Also, if anyone thinks they have a good method for correcting the accelerations, even if they don't know exactly how to implement it, let me know.
You may have noticed that I started calling it PITCHf/x instead of pitchFX like in my previous posts. I had seen it written both ways a lot and thought pitchFX was correct, but after reading through the Sportvision website again, it definitely should be PITCHf/x. My apologies to the creators.
P.S. If reader Alan happens to be Dr. Alan Nathan, who published this excellent paper examining Jon Lester's start against the Mariners, please email me. You can find my email address under my profile on the right. I'd really like to chat about possibly using the spin magnitude and axis to classify pitches, and about why his theoretical fit to the data matched up so well when I am seeing such terrible agreement. Actually, anyone who wants to discuss any of that or anything else can email me with the link provided under my profile.
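To make the difference between moving a distribution and shrinking or spreading it concrete, here is a small sketch. It is not a proposed acceleration correction; it only illustrates that an offset can match the centers of two samples but never their widths, while a shift plus rescale matches both. The function names and the choice of mean and population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def shift_only(values, reference):
    """Move values so their mean matches the reference mean; width is unchanged."""
    offset = mean(reference) - mean(values)
    return [v + offset for v in values]


def shift_and_scale(values, reference):
    """Match both the mean and the standard deviation of the reference sample."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("cannot rescale a sample with zero spread")
    ref_m, ref_s = mean(reference), pstdev(reference)
    return [ref_m + (v - m) * ref_s / s for v in values]
```

A rescale like this treats the whole sample as one population, so it would also squeeze together real differences between pitchers, which is part of why the accelerations are harder to fix than the positions and velocities.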

6 Comments:
Josh, here is the list of pitchers with the most pitches in Petco with a recorded z0 of less than 4.5 feet:
404 Meredith
75 Thatcher
58 Kim
34 Fuentes
31 Smith
28 Moylan
I'm pretty sure Joe Thatcher's your man, since he and Meredith are the only guys with multiple games with low release points, whereas the others are mostly visiting pitchers who accumulated the total in a single game, e.g., Byung-Hyun Kim on July 5.
I should clarify something from my previous comment. Meredith and Thatcher are the only two with many games with lots of pitches with low release points.
Brian Fuentes has three such games as a visiting Rocky. Chad Bradford has a game in there with 8 pitches, and we know he's a submariner. Jake Peavy likes to come sidearm a couple times a game, and so he has seven games with one to three pitches with z0 < 4.5.
But Meredith and Thatcher are the two main contributors.
Yeah that looks right to me. Thanks Mike. The other points that appear more scattered but still with a low z0 could be some combination of the other guys but the two larger blobs almost certainly are Meredith and Thatcher.
I looked up a little bit more about Joe Thatcher, and here's what I found:
Thatcher called up to Padres
"Thatcher describes himself as a lefty with a 'funky, low three-quarter delivery' with a standard fastball and slider combination as his primary pitches."
ISU grad Joe Thatcher knocking on major league door
"Thatcher has been challenging hitters in minor league parks all over the country these past few seasons with his slider and his cutting fastball that reaches 91 mph."
Also, you might want to check out Joe P. Sheehan's article on separating out the fastballs in the dataset, as well as the comments to the article. Unfortunately, it requires a bit of work to implement.
Makin' a Filter
I've done some work with Alan Nathan's method using speed and spin direction to classify pitches. I haven't published any of it yet, but it looks like a promising method. I've been using the approximate formulas he presented in his paper rather than trying to solve the equations of motion, and that seems to work pretty well.
One nice thing I've found, although I don't know if it's universally true for all pitchers and pitch types, is that speed and spin direction seem to be sufficient to capture the uniqueness of most pitches. Spin rate is highly correlated to speed and can be effectively ignored in many cases. That makes some physical sense, certainly with the fastball, where the harder you throw it, the more backspin it will have, but probably also with other pitches. Being able to classify pitches with two parameters would be an incredible boon for presenting that information graphically. The release point is another important variable, as is handedness of the batter, but I think you can still separate pitch types more easily in two dimensions with speed and spin direction as the axes than with speed and horizontal break as the axes.
In addition, since the spin direction is dependent only on the acceleration in x and z, it is independent of the y0 measurement point. Only the speed has to be adjusted for y0, which makes it easier to compare data sets collected at different y0 points.
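As a rough sketch of why only the x and z accelerations are needed: if gravity is removed from the vertical acceleration, what remains in the x-z plane can be treated as the spin-induced deflection, and its angle depends on neither the speed nor y0. The function name, the angle convention (degrees counterclockwise from the +x axis), and the neglect of drag in x and z are all assumptions here, and the convention may differ from the one used in Alan Nathan's paper.

```python
from math import atan2, degrees

G_FTPS2 = 32.174  # standard gravity in ft/s^2


def deflection_direction_deg(ax, az, g=G_FTPS2):
    """Angle of the non-gravity acceleration in the x-z plane, in degrees.

    ax, az: fitted accelerations in ft/s^2, with az assumed to include gravity.
    Returns degrees counterclockwise from +x (an assumed convention).
    For a ball whose spin axis is perpendicular to its flight path, the spin
    axis lies at right angles to this deflection direction.
    """
    return degrees(atan2(az + g, ax))
```

Under this convention a pitch whose only non-gravity acceleration is upward comes out at 90 degrees, as pure backspin would.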
Wow that is some amazing stuff there Mike. I can't believe I missed that post from Joe. Thanks very much for that link. Also, your description of Thatcher fits perfectly with the blob for pitcher X.
You are absolutely right: if we can get pitch classifications down to two variables, that would be very, very useful in teasing out each pitch type from the data. If we can reduce x and z break down to just spin direction, that would be great. If you think about it, each type of pitch should have the same spin direction. A curveball HAS to spin like a curveball no matter if Barry Zito or Ben Sheets is throwing it. We might have to do a 180-degree correction for lefties, but that should be easy.
I encourage you to post your findings on your blog. Even if the work is just preliminary, stuff like that can really spark things forward (just look at how much I have learned from posting my preliminary results).
Josh, thanks for your encouragement. I've published what I have so far: