Thursday, September 27, 2007

Player Cards for Batters Now Available

I know I promised a post with a full explanation of my code but things just keep changing and I am going to hold off until I am really happy with everything to make that post. So again I have messed with my clustering algorithm and added a couple of dozen hand edits on top of that mostly changing classifications of sliders/cutters and splitters/sinkers. I think things are getting close as probably about 95% of pitches are correctly being classified.

This has allowed me to then run the data through another plot maker to make player cards for batters as well. Again, I require 100 pitches seen by PITCHf/x to qualify. I have added the most recent team a player has played for and what hand he throws with. Sadly some players, like Adam Dunn, bat with the opposite hand than they throw with. I'd love to add if they bat left handed, right handed, or are a switch hitter but that data doesn't seem to be easily available in the files I am grabbing from MLB. So I am going to have to grab some other files and cross reference to get that. We will see when I get around to that.

There are some more things I am planning on adding to the player cards like contact rate and how often they swing at balls but to make those numbers meaningful I need to find a league average first. I also am going to be comparing pitchers this way. Hopefully an update with that will come this weekend.

As always, if you see something you think is wrong or something you would like to see added, or a design thing you would like to see changed leave a comment below.

Thursday, September 20, 2007

Brand New Player Cards

Well I was able to get things finished a bit early so here are some brand new player cards. I am planning on making a big post about exactly how these plots are produced this weekend but for now let me just say that I am correcting the initial position/velocity of the ball and the accelerations (and a big thanks to Dr. Nathan for finding a mistake in my code that corrects the acceleration). This way I can properly combine home and road data. Once that correction is done I am pretending each ball was thrown at sea level at standard temperature. This is done so my classification code can look at each pitch on the same footing and then determine what type of pitch it was.

The classification code still needs some work. It is getting better but will still misclassify some pitches as belonging to the wrong group, or mislabel an entire group. A better way of thinking about the two possible errors: the first is when a pitcher throws both a slider and a curve and a certain pitch got labeled a slider when it should have been labeled a curve. The second is when a group of pitches is labeled as sliders when really every one of them is a curve. I am still trying to adjust this code but it is getting better. The only real big hole right now is that it is not calling any group of pitches split-finger fastballs. It is also having some issues with side-arm pitchers and whether they are throwing two- or four-seam fastballs. I am working to correct that but I thought it would still be useful to show what I currently have. The last time I did this I got some excellent reader response on what pitches the clustering algorithm was getting wrong and I'd like to ask for your assistance again. Take a look through the player cards and either comment below on what is incorrect or email me (my address is under my profile to the right).

Even if you don't find any pitches that are being classified incorrectly if there is something you don't like about the presentation of the player cards, or something you would like to see added, again please let me know. This really is my first attempt at something like this and I am not incredibly handy with html so feel free to suggest an alternate way of doing something.

Wednesday, September 19, 2007

Corrections to the Corrections?

Mike Fast made an interesting comment in my last post.
One other thing I had been wanting to ask you about...when you calculated the corrections to the x0, z0 initial point, did you assume that each park had a single correction factor that did not vary with time? I noticed with Papelbon that his data was quite different between two different Boston homestands, and Dr. Nathan mentioned to me that the PITCHf/x system is typically recalibrated between homestands.

I haven't known Mike for long (and really how much do you ever know someone from reading blogs?) but I do know that when he says something, it is worthwhile to look into. I had assumed that things in each home park stayed pretty much the same. Pretty much everywhere I look people have been making plots combining all home data. This is something I probably should have looked at earlier but better late than never. So the question is, do we need to add a daily (or home stand) correction to home parks?

To start, let's look at a few pitchers' initial release points from game to game and see how things look. Even though Mike mentioned Papelbon I am going to start by looking at Jake Peavy. The PITCHf/x system was installed from day one in San Diego and Peavy has been a workhorse for them. Here is Peavy's vertical release point by date.
I have added Peavy's road starts in just to give an idea of what kind of error you can expect from park to park. It looks like there is some variation in Peavy's release point as time goes on in his home starts. That variation is less than the variation you see in the road parks but it is there. That said, you can see the wide spread of his release point in game and that spread is larger than what the difference game to game is. By the way, I have removed pitches with speed less than 60 as Joe P. Sheehan suggested but I still see some pathological points. I am not quite sure what to do to remove these right now. The plot looked much worse before I made the cut on speed though so I do believe that is at least helping. What about his horizontal release?
This looks maybe a little worse than the vertical release. Even though Peavy is a right hander I changed the sign to report positive numbers here. Maybe there is some trend towards bringing his release point in closer to his body? Is that an adjustment, or is that from PITCHf/x getting recalibrated or is that just some random noise? With Peavy not really providing the answers let's turn our attention to Papelbon.
Papelbon being a reliever has stretches of getting into multiple games back to back. The last two games on the far right are Sept 12th and Sept 14th and the blob just left of that was a three game stretch from Sept 2nd to the 4th. PITCHf/x wasn't installed in Fenway or many AL East stadiums until recently so we don't have a ton of data to work with. Fenway is also noted as having one of the worst calibrated PITCHf/x systems which is kind of strange because it was installed relatively late. You can see Fenway tends to be lower than his road appearances, which the correction factor finds, and maybe the four home days on the left are lower than the four on the right. Could this be Sportvision realizing Fenway was messed up and recalibrating? Let's take a look at his horizontal release point.
Ugh, this is all over the place. That nice three game stretch we noted seemed to have a very consistent vertical release point but on the last day here it appears Papelbon's release was much closer to his body (mechanics breakdown from pitching three straight days?), or maybe he was a step left on the mound from where he normally was, or maybe the system was recalibrated mid series. If that was the case maybe we would need to do a daily correction to the release point like we have done with the acceleration. Well, from looking at these plots it doesn't appear we have any definitive answers. One pitcher just isn't enough; we need to look at the whole staff. We can't just plot every release point from the home team every day though because some pitchers have very different release points. We need to find each player's average release point and then subtract that from each pitch. This will show us the actual difference in release point from average for each pitcher, which puts them all on the same level and makes them easy to compare. Everything up until now I have been measuring in feet because these release points are far away from the origin. These differences are going to be much smaller though so I am going to move to inches to make these difference plots. Also, because the horizontal direction seems to be worse I will be using that to compare. Let's start with Fenway.
You now can clearly see the Red Sox home stands and what the differences were for each pitch thrown on each day. Again, notice how large the in-game spread is. It appears that at least the Boston pitchers are varying their release point by almost a foot during each game. I've added a grid to make it easier to see how each of these home stands compare with each other and with zero. If the system was getting recalibrated and that was changing the horizontal release points being measured you would expect to find some home stands higher than zero and some lower than zero. If you look very closely you can see that maybe the first few home stands are high by an inch and maybe the last two home stands are low by an inch but it is hard to tell. Maybe looking at a park like Petco which was around from the start would show some more variation.
The Petco data looks pretty consistent to me. Again, maybe the home stands in the middle and the one on the far right show a slight increase and the others a slight decrease but that appears to be very small. Interestingly, their second home stand, which was very short, seemed to have a few pitchers throwing with an increased difference. That is countered by a single pitcher who was almost a foot below average though. A few parks had the system installed for one day while ESPN was in town only to have their camera removed and then added again at a later date. Coors Field is one of those and we have seen that system seems to be pretty bad as well so let's look at that data next.
Even that first day, almost 80 days before their camera was installed full time, shows remarkable agreement with the rest of the data. Maybe that day is a little high and maybe the last home stand is as well but that again isn't anything larger than two inches at most. I have looked through every stadium and through several variables and have seen the same story in each one. The only stadium that really shows a recalibration changing the data is the horizontal release point at Chase Field.
This is what I would have expected to see from the other parks if the recalibration was really changing the data. The first two home stands appear to be about four inches above zero. The next home stand maybe about two inches above zero. The last three home stands appear to be two or three inches below zero. Interestingly, the vertical change appears to be much smaller than the horizontal.
So what can we conclude? Well it does appear that Sportvision is recalibrating their PITCHf/x systems between home stands but, in general, those corrections are relatively small. Chase Field does appear to be an exception though. My correction factor seems to think that, overall, Chase Field is moving the horizontal release point about four inches to the left (as the catcher sees it). But it appears that the difference in Diamondback home games alone is about four inches because of recalibration.
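
The difference plots above come from subtracting each pitcher's own average release point from every one of his pitches and converting to inches. A minimal sketch of that bookkeeping follows. It is not the code behind the plots: the field names (`pitcher`, `stand`, `x0`), the helper names, and the sample numbers are all made up for illustration.

```python
from collections import defaultdict

def deviations_in_inches(pitches):
    """Subtract each pitcher's own average x0 (feet) and convert to inches."""
    by_pitcher = defaultdict(list)
    for p in pitches:
        by_pitcher[p["pitcher"]].append(p["x0"])
    means = {name: sum(v) / len(v) for name, v in by_pitcher.items()}
    return [(p["x0"] - means[p["pitcher"]]) * 12.0 for p in pitches]

def mean_deviation_by_stand(pitches, deviations):
    """Average the per-pitch deviations within each home stand."""
    by_stand = defaultdict(list)
    for p, d in zip(pitches, deviations):
        by_stand[p["stand"]].append(d)
    return {stand: sum(v) / len(v) for stand, v in by_stand.items()}

# Invented numbers: two pitchers, two home stands, x0 in feet.
pitches = [
    {"pitcher": "A", "stand": 1, "x0": 2.00},
    {"pitcher": "A", "stand": 2, "x0": 2.50},
    {"pitcher": "B", "stand": 1, "x0": -1.50},
    {"pitcher": "B", "stand": 2, "x0": -1.00},
]
deviations = deviations_in_inches(pitches)
stand_offsets = mean_deviation_by_stand(pitches, deviations)
print(stand_offsets)
```

In this toy data both pitchers sit three inches below their own average in the first home stand and three inches above it in the second, which is the kind of pattern a recalibration would produce.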

So what should we do about this? I probably could just adjust Chase Field "by hand" and be done with it but what if a recalibration in another park messes up the data in these last few weeks or even next year? It sure would be nice to have that automated. So what I am planning on doing is writing a first correction algorithm that will sit in between my code that parses the data and the code that currently does the corrections. This code will do a home stand by home stand intra-park correction and then feed the results to the regular code that will handle the inter-park corrections. Unfortunately, this will push back the player cards to probably this weekend. I know I am such a tease, but hopefully making this last correction will really nail things down. I'd like to thank Mike again for pointing this out. If you have any comments or concerns with the PITCHf/x data please comment below or email me.
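
An automated home stand correction first needs to know where one home stand ends and the next begins. One simple way to do that, sketched here, is to sort a park's game dates and start a new home stand whenever the gap between games is longer than a few days. The function name and the three day threshold are assumptions for illustration, and the dates are just sample values.

```python
from datetime import date

def split_into_home_stands(game_dates, max_gap_days=3):
    """Group home game dates into home stands.

    A gap longer than max_gap_days (an assumed threshold) between
    consecutive home games starts a new home stand.
    """
    stands = []
    for d in sorted(set(game_dates)):
        if stands and (d - stands[-1][-1]).days <= max_gap_days:
            stands[-1].append(d)
        else:
            stands.append([d])
    return stands

# Sample dates only, not a real schedule.
dates = [date(2007, 9, 2), date(2007, 9, 3), date(2007, 9, 4),
         date(2007, 9, 12), date(2007, 9, 14)]
stands = split_into_home_stands(dates)
for number, stand in enumerate(stands, start=1):
    print(number, stand[0], "to", stand[-1])
```

Each home stand could then get its own offset before the data is handed to the inter-park correction.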

Tuesday, September 18, 2007

Breakthrough

Well basically nothing got done this weekend but I did have a little time Monday while watching baseball to muck with my code and I think I have finally solved the riddle of correcting the acceleration data. As I noted at the end of this post the solution to the problem would not be a nice linear solution like the initial position or the initial velocities were. As Dr. Alan Nathan points out in his analysis (p. 2), the forces (and, as such, the acceleration) on the ball in flight are definitely not linear.

This means a non-linear solution will be needed and in particular a solution is going to be needed for every park for every day that a game has been played. For example, the air density is needed for calculating both the drag and Magnus forces on the ball. The most widely known example of air density causing a problem is the thin air at Coors Field in Colorado. What isn't as widely known is that the air temperature plays just as big a role in calculating the air density as the distance above sea level. This year the Reds played a game against the Pirates at home where the game time temperature was 30 degrees Fahrenheit. They played another game against the Braves where the game time temperature was 99 degrees Fahrenheit. Obviously, the ball is going to behave quite differently in these two situations. Again, the equation for air density is non-linear so this is going to be a mess.
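
To get a feel for the size of the temperature effect, here is a rough sketch that estimates dry air density from temperature and elevation using the ideal gas law and the standard atmosphere pressure formula. It is only an approximation and not the calculation in my correction code: it ignores humidity and the actual barometric pressure on the day, the function name is made up, and the 5,200 foot figure for Coors Field is an approximate elevation assumed for the example.

```python
# Standard atmosphere constants (SI units).
P0 = 101325.0      # sea level pressure, Pa
T0 = 288.15        # sea level standard temperature, K (59 F)
LAPSE = 0.0065     # temperature lapse rate, K/m
G = 9.80665        # gravity, m/s^2
M = 0.0289644      # molar mass of dry air, kg/mol
R = 8.31446        # universal gas constant, J/(mol K)

def air_density(temp_f, elevation_ft):
    """Approximate dry air density in kg/m^3.

    Pressure comes from the standard atmosphere at the given elevation;
    temperature is the game time reading, not the standard profile.
    """
    temp_k = (temp_f - 32.0) * 5.0 / 9.0 + 273.15
    height_m = elevation_ft * 0.3048
    pressure = P0 * (1.0 - LAPSE * height_m / T0) ** (G * M / (R * LAPSE))
    return pressure * M / (R * temp_k)

# Same park, cold day versus hot day (elevation cancels in the ratio).
cold_vs_hot = air_density(30.0, 0.0) / air_density(99.0, 0.0)

# Same temperature, roughly Coors Field elevation versus sea level.
coors_vs_sea = air_density(59.0, 5200.0) / air_density(59.0, 0.0)

print(round(cold_vs_hot, 3), round(coors_vs_sea, 3))
```

Under these assumptions the 30 degree air comes out roughly 14% denser than the 99 degree air, while the air at Coors comes out roughly 17% thinner than sea level air at the same temperature, so the two effects really are of similar size.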

I am not going to go into great detail on everything that went into my 700 line C++ program that calculated these corrections because while I think I have everything correct there still might need to be a tweak or two to the code. Also, I really don't want to bore the readers with four paragraphs of hard core explanation. If this is something that you the readers want to see, please add a comment at the bottom and I will consider it. The short version is I modified my code that was used for the linear corrections to make it non-linear and added in some physics equations to find the corrections and only used fastballs for making this comparison. This is also going to make it very difficult if not impossible for me to properly disseminate the corrections. I will be thinking about this and hopefully will come up with a solution.

What I do want to do is show you the results. For that I am going to use Jeff Francis since he has been in the Rockies rotation the whole year and Colorado does indeed need the largest corrections even after the atmospheric values were taken into consideration. Instead of showing you break of the ball, like I have before, in these plots I am going to show you the actual accelerations. Break can be calculated from these values (along with the initial positions/velocities) and again Dr. Nathan has a good explanation of how to do that here. The reason I want to show acceleration here is because that is the value that is going to be corrected. So starting with the uncorrected data here are the x and z accelerations for home and away games while PITCHf/x was on for Jeff Francis.
Now maybe I don't know a lot but one thing I do know is the ball should break less (have a smaller acceleration) at Coors than at other parks around the league. Yet this data appears to be exactly opposite of that. Obviously, something is messed up with the data. The blob of data around (-5,-35) is Francis' curve ball and the huge mess to the upper right is a concoction of his fastball and change. Let's apply the correction factors to both the home and road pitches and see if we can't clear things up a bit.
You can see some of the non-linearness if you look very closely between these two plots. Some good things happened here. First, the away game data changed a bit but very slightly. This is exactly what we would expect to see as his road starts should be much closer to league average than his home starts. Second, the home game data now shows less acceleration both vertically and horizontally which is good. Francis' curve at home actually appears to have about the same horizontal acceleration as his curve on the road. There does appear to be a slight tail pointing towards zero in the road data though. I wonder if this is because I used only fastballs in my comparison or if maybe Francis is compensating and slightly overthrowing his curve at home? This is one of the loose ends I am still trying to track down. Lastly, the huge fastball/change blob in the uncorrected data has become much more distinguished and now it definitely appears to be two blobs with the fastball in the upper right and the change down and toward zero horizontally. That is exactly what we would expect from a change and the reason this blob is so close to the fastball blob is because Francis has a very good one. Looking back you can kind of make out the distinction in the previous plot but it is much more defined here which again is a sign that the corrections are working.

There still is one problem though. How do we know this correction is moving the data to the right spots? We know that the Coors data should show less acceleration than the road data but how can we tell if it is overcompensating or undercompensating? The only way I know how to check this is to transform each pitch as if it was thrown at sea level at standard temperature. You can check the air density link again if you want to look that up. In this frame everything should be equal and all of the accelerations now should match up. So does it?
Yes it matches up very well. A careful eye will notice that not only did the Coors data get an increase in acceleration (or decrease since these numbers are mostly negative) but the road data did as well (though much smaller). The reason for this is standard temperature is about 59 degrees Fahrenheit and most baseball games are played at temperatures above that. Again, Francis' curve looks slightly different at home and away here. Maybe that is from the extra tail I mentioned before in the road data but maybe not. The strange thing is the vertical acceleration seems to be spot on but the horizontal acceleration is slightly off. I don't really have a good explanation for that right now. Maybe we should ask Jeff Francis himself who studied physics while at college at the University of British Columbia.
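
For readers who want a picture of what "transforming to sea level at standard temperature" means, here is a deliberately simplified sketch. It is not the non-linear correction described above. It assumes the drag and Magnus parts of the acceleration scale directly with air density, that gravity is the only part that does not, and that the vertical acceleration includes gravity. The function name and the sample values are made up.

```python
GRAVITY = 32.174        # ft/s^2
RHO_STANDARD = 1.225    # kg/m^3, sea level at 59 F

def to_standard_air(ax, az, rho_game, rho_std=RHO_STANDARD):
    """Rescale accelerations (ft/s^2) to standard sea level air.

    Simplifying assumption: the aerodynamic part of the acceleration is
    proportional to air density, so it is scaled by rho_std / rho_game.
    Gravity is removed from az before scaling and added back afterwards.
    """
    scale = rho_std / rho_game
    ax_std = ax * scale
    az_std = (az + GRAVITY) * scale - GRAVITY
    return ax_std, az_std

# A pitch thrown in air that is already standard is left unchanged.
same_ax, same_az = to_standard_air(-5.0, -35.0, RHO_STANDARD)

# A pitch thrown in thinner air gets larger accelerations after scaling.
thin_ax, thin_az = to_standard_air(-5.0, -35.0, 1.0)
print(thin_ax, thin_az)
```

This also shows why the road data moves a little: most games are played in air that is warmer, and so thinner, than the standard value, so the scale factor ends up slightly above one.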

So where is all of this going? Well except for a little fine tuning I think I am ready to move to the full data set. You might remember me saying that I have stopped adding data so I would have a consistent set to work with. I am almost two weeks behind but I have started grabbing the new data now. Once that is done I have to run it through my parser to get the data in a usable form, then my correction code to get the new correction values, and lastly my player card generator which I still need to adjust a bit to output more plots than just the break. Hopefully, I will be able to have at least four plots like I showed for Jose Contreras in my previous post. If there is an extra pitching plot you would like to see added let me know in the comments section below. I also still need to remove those pathological points I showed in those Contreras plots. Joe P. Sheehan has suggested that a cut on initial speed might solve the problem. I had kind of thought that might be right but somehow that got lost in my memory so thanks to Joe for telling/reminding me of that. If things go smoothly I should have player cards with corrected home and road data by Thursday night. If things don't go smoothly or if I don't get a chance to work on this then I will have them up by the weekend. Once that is completed I can move on to other fun topics I wanted to look at with this data.

Thursday, September 13, 2007

Progress. Errrr, Sort of.

First I want to encourage everyone here to read Mike Fast's recent post about Greg Maddux. The analysis he has done on Maddux is what I am hoping my clustering algorithm can do on every pitcher. Things are moving along with the algorithm and I want to share some progress. The algorithm is still messing up Maddux's two and four seam fastballs but it now correctly identifies his cutter so that is some progress. Instead of showing you worse plots than what Mike had put together I decided to show a similar type of pitcher to Maddux in Jose Contreras. Now Contreras isn't having nearly as good of a year as Maddux but both are similar pitchers featuring several types of fastballs and a pretty small variation between pitches. Here is Contreras' horizontal and vertical movement.
What a mess we have here. Contreras is throwing a two seam fastball and what looks to be a cut fastball but also a change, a slider, and a curve. All of the pitches seem to blur together in this plot but if we add in the pitch speed they start to separate.
Here is a breakdown of horizontal break and pitch speed. I thought about adding vertical break to this plot as well but things were very messy as is. Here you can see Contreras' change break away from his fastballs and some separation between his sinker and his cutter. I was pretty impressed that the algorithm would pick up these differences. Also, even though we have much less statistics, we can see a clear speed difference between his slider and curve ball. Next the vertical movement.
Now you can see that his sinker really is sinking more than his cutter and the increased drop in his curve from his slider. What about his release point though? Contreras is known as someone who will drop down to 3/4 arm slot from time to time.
Perfect. We can see his regular arm slot and the 3/4 arm slot and it appears most of his cutters come from that 3/4 position. But hang on a second. What is with those stray points off to the right? This must be where PITCHf/x just screwed up and misread the pitch. As crafty as Contreras is I doubt he actually threw a pitch left handed. Every time I look up it seems there is something else to the data that needs correcting. Clearly that unknown point way to the right needs to go and the change that is off by more than a foot also needs to be removed. What about that cluster of five pitches in the upper right though? Is that a crafty vet showing a different arm angle for an important pitch or just a mistake from PITCHf/x? I don't have the answer right now but hopefully will soon.

This weekend looks very busy for me but hopefully I will have some time to work on this. The order of what I am planning on doing is fixing this release point issue first. Then hopefully going back to the clustering algorithm and getting that ready to go. I feel like that is close. Seeing what a good job it did with Contreras gives me hope. The biggest thing right now is probably getting it to merge more of those unknown points into established pitches. Lastly, tackling the acceleration correction which I will almost certainly not have time for. I actually had a decent idea for a work around with it but it is going to take a long time to code up and then test.

Tuesday, September 11, 2007

Player Cards

After a weekend of banging my head against the wall trying to figure out how to properly normalize the acceleration I needed a break. So back to just looking at home pitches for pitchers. I decided to skip ahead to the next thing I wanted to do which is upload some player cards. Basically, the plan was to use the PITCHf/x data to create a plot of the type of pitch each pitcher throws and then start to expand from there. What I needed was a clustering algorithm that could look at all the pitches thrown by a pitcher and classify them. I am not going to go into details about the algorithm as it still needs some fine tuning (as you will see below) but basically it examines every pitch and correlates speed and movement into clusters. Once it has those clusters for each pitcher it finds the pitcher's fastball and then calculates what his other pitches do in comparison to the fastball. It then compares his other offerings to all other pitchers and tries to guess what the other pitches are. Sometimes this algorithm performs well.
First, these plots show the movement of the pitch not the location. For a great description of what exactly this means read this excellent article by John Walsh. This is exactly what you would expect from the hard throwing Broxton. He has a great four seam fastball and what can be a devastating slider. His change though, is a work in progress. It doesn't have nearly the same movement as his fastball which helps tip the pitch to opposing batters. Because of this, you can see he doesn't throw it very often.
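
To make the idea at the top of this post concrete, here is a toy sketch of clustering pitches on speed and movement and then measuring the other clusters against the fastball. It is not the algorithm used for the player cards, whose details are not given here. It assumes a simple linking rule (pitches within `max_link` of each other join the same cluster), treats the fastest cluster as the fastball, uses plain distance on mixed units, and runs on invented numbers.

```python
from math import dist

def cluster_pitches(points, max_link):
    """Label pitches so that any two linked by a chain of neighbours
    closer than max_link end up in the same cluster."""
    labels = [None] * len(points)
    current = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] is None and dist(points[j], points[k]) <= max_link:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

def relative_to_fastball(points, labels):
    """Return the fastball label (assumed: fastest cluster) and each other
    cluster's average offset from it."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    centers = {lab: tuple(sum(c) / len(c) for c in zip(*pts))
               for lab, pts in groups.items()}
    fastball = max(centers, key=lambda lab: centers[lab][0])
    base = centers[fastball]
    offsets = {lab: tuple(a - b for a, b in zip(c, base))
               for lab, c in centers.items() if lab != fastball}
    return fastball, offsets

# Invented pitches: (speed in mph, horizontal break, vertical break in inches).
points = [(97.0, -5.0, 10.0), (96.0, -4.0, 9.0), (98.0, -6.0, 11.0),
          (87.0, 3.0, 1.0), (86.0, 2.0, 0.0), (88.0, 4.0, 2.0)]
labels = cluster_pitches(points, max_link=3.0)
fastball, offsets = relative_to_fastball(points, labels)
print(labels, fastball, offsets)
```

The offsets (slower, breaking the other way, dropping more) are the kind of numbers that could then be compared across pitchers to guess a pitch name.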

Sometimes though the algorithm can get messed up. This mostly happens in two ways. First, the clustering gets over active and combines two pitches that really are different.
Saito appears to be throwing two varieties of fastballs (two seamer? cutter?) but the clustering algorithm combines them into one type. This mostly occurs when the speed of the two pitches is very close. You can see that Saito's splitter and his curve are about as far apart as the two fastballs but the algorithm correctly separated them. The other failure is sometimes the algorithm will misidentify a pitch.
It is my understanding that Oswalt throws a slider not a split finger fastball but the pitch seems to move more like Saito's split finger fastball than Broxton's slider to the algorithm. Also, one pitch that Oswalt threw didn't seem to match up to anything and just got left out. Looking at the movement on the pitch it probably is a fastball but it could be a change. Missing one pitch from Oswalt really isn't a problem but a few pitchers have clusters of pitches that aren't combined. Rich Hill is an example of this.
Again, Rich Hill throws a slider not a splitter but the horizontal movement gets the pitch classified as a splitter. If the group of unidentified pitches were added in maybe the pitch would be correctly identified. Sometimes all hell breaks loose and the algorithm falls apart.
The great Greg Maddux who throws nothing but fastballs. So what is going on here? Well the clustering algorithm really needs some space between the types of pitches and Maddux doesn't really provide any. What I mean by that is Maddux will throw his fastball at a wide range of speeds. The low end of that range is very close to the high end of the velocity on his change. This provides a bridge for the clustering algorithm to lump them all together. The unknown points in the bottom right are some type of off speed pitch but it is unclear what. Lastly, we can look at the worst case scenario, the knuckleball.
Here the algorithm really doesn't have a chance. It does a good job of separating the knuckleball from the fastballs and most of the knuckleballs are grouped together with a few wrongly grouped at the edges. The problem comes in comparing Wakefield to other pitchers. Without any other knuckleballers in the league to compare him to the algorithm is lost and just throws out a guess and calls the pitch a slider.
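
The "bridge" problem described for Maddux above is easy to reproduce in a toy setting. The sketch below uses pitch speed alone and starts a new group wherever the gap between neighbouring speeds exceeds a threshold. The real algorithm also uses movement, and the function name, threshold, and speeds are invented for illustration.

```python
def split_by_gaps(speeds, max_gap):
    """Group speeds (mph), starting a new group when the gap between
    neighbouring values is larger than max_gap."""
    if not speeds:
        return []
    ordered = sorted(speeds)
    groups = [[ordered[0]]]
    for s in ordered[1:]:
        if s - groups[-1][-1] <= max_gap:
            groups[-1].append(s)
        else:
            groups.append([s])
    return groups

# A pitcher with clear space between his change and his fastball.
separated = [78.0, 79.0, 80.0, 86.0, 87.0, 88.0]

# A pitcher who throws his fastball at every speed in between.
bridged = [78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0]

print(len(split_by_gaps(separated, max_gap=1.5)))
print(len(split_by_gaps(bridged, max_gap=1.5)))
```

With space between the pitches the groups split cleanly. Once the in-between speeds are filled in, every pitch is within reach of its neighbour and everything collapses into one group.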

Anyway, here is where you come in. I have uploaded a plot for every pitcher who has thrown more than 100 pitches in their home park while PITCHf/x was on. If your favorite pitcher is missing don't worry, hopefully I will soon have a good league correction and can add in the away stats. What I need is for you to look over the plots and tell me where the algorithm has messed up. If the algorithm has combined two lumps of pitches that you think should be separated let me know in the comments below. If the algorithm has incorrectly identified a group let me know. If there is something aesthetically unpleasing about the graphs or if there is something you would like to see me add to them let me know. If you would rather email me than add a comment my email can be found under my profile to the right.

The whole process of going from downloading the data to producing the plots takes nearly half a day. The clustering algorithm itself takes over three hours on my super fast desktop. The moral of the story is I am going to stick with this data set for at least a few more days as I try to hammer out the kinks to the algorithm. Maybe this weekend I will do a full update.

Thursday, September 6, 2007

Response to a Reader Question

A couple of posts back reader Alan brought up some interesting ideas for checking the data. Here is part of his comment.

Consider only fastballs, which we can take to be pitches>90 mph. First thing to look at is the initial z-component of the velocity. A negative z velocity means the pitch is thrown slightly downward. Do you see a correlation between the release point and the initial z velocity? Does the pitcher compensate for the higher release point with a larger downward component of velocity?


I want to examine these correlations and add in a few more variables to help complete the picture. At the time Alan had wanted me to use the two parks that had the largest separation in vertical release point, z0, which were Fenway and AT&T. Since then I found a bug in my code and now the two parks that are furthest apart are Fenway and the Metrodome. The problem is both of those parks are on the low end in number of pitches tracked by PITCHf/x. So instead I am going to start by looking at Petco Park in San Diego and AT&T Park in San Francisco. AT&T doesn't have a whole lot more statistics than the Metrodome but it has a lower variance in my correction factor and the Giants and the Padres play each other very regularly so hopefully the overlap of pitchers in the data will be larger.

I also should note that I am a little bit concerned about using the definition that all pitches with an initial speed above 90 MPH are fastballs. While I am not too concerned about actual fastballs that are below 90 MPH being missed with this definition, I am concerned that some breaking balls will enter the sample. Not too many pitchers throw a 90 MPH breaking ball but my initial correction factor for the error on the pitch speed is about 5 MPH and there are plenty of pitchers who can throw an 85 MPH breaking ball. Nevertheless, I haven't come up with a better definition at this time and this definition will work for our purposes today.

To start with I am going to check the correlation between the initial vertical release point and the vertical height when the ball crosses home plate. The reason I want to check this first is something else that Alan said in his post that the calibration should be better near home plate. That got me remembering this tidbit from Joe P. Sheehan when he was writing about differences in the parks here:
Almost all of the pitchers also get a smaller pfx_z [movement of the pitch vertically] value at home, which would seem to indicate that their pitches have more sink at Fenway, but is actually a result of the lower release height combined with the fact that, overall, the average height when a pitch crosses the plate at Fenway is similar to the height at other parks.
So he was seeing a very large variation in the release point but a small variation when the pitch crossed home plate. This doesn't seem to make sense and I want to look at this first. So finally, here is a plot comparing the initial and final height of the ball at Petco and AT&T parks.
Pitchers tend to release the ball about 6 feet above ground level though obviously this will vary from pitcher to pitcher. We can see in this data though that the Petco data tends to be below 6 feet and AT&T data tends to be above 6 feet. Also, we can see a bunch of points near 3 feet in the San Diego data. This is from side armer Cla Meredith for the Padres. You would expect to see a few points from him show up in the San Francisco data but they appear to be missing. So I went back and checked and Meredith has yet to pitch at AT&T Park while PITCHf/x was activated. There appears to be another grouping of pitches just above 4 feet at Petco. This almost certainly is another Padre pitcher but I haven't yet tracked him down. If there are any Padre fans who know who this is please let me know.

Anyway, besides the disparity in initial height, the height as the ball crosses home plate appears very consistent across both parks. If the initial position is off by as much as we think then why is the final position so stable? It must be as Alan suggested that the PITCHf/x system is more stable near home plate. I have a theory as to why this is but I am going to save that for my next post when I go in depth as to what I think is actually happening with the data. What this is showing is the initial and final heights of the baseball aren't correlated at all. This means we should be free to correct the initial position without worrying about changing the final position (as funny as that sounds). Here then is the same plot with the vertical correction applied.
What an improvement that makes. Again because this correction is based on a pitcher by pitcher comparison of each park, this shift isn't moving the center of the Petco data on to the center of the AT&T data. Because the Padres have a few pitchers who throw at a very low height that difference still remains in the data. The "average pitcher" who releases his ball just above 6 feet though will come together and that is exactly what the corrected plot shows. Now we are ready to look at the initial height and the initial vertical velocity to see if we see a correlation there. Because we aren't seeing a correlation between the two heights something must be causing that and it pretty much has to be either the initial velocity or the acceleration or both. Starting again with the uncorrected data.
Here we can see a clear correlation and it is exactly what we would expect. The higher the pitch is released from, the more negative (or downward) its velocity. This makes perfect sense; the only problem is the data looks terrible. Again we see a difference in the initial height but there appears to be more here. Let's start out by correcting for the initial height and see what that gives us.

Now the heights seem to match up well (except again for the two blobs now at 3 feet and near 5 feet) but the velocities seem off. The AT&T data appears to have more downward initial velocity than the Petco data. So I am going to apply a correction to the initial velocities that I calculated the same way I calculated the initial height correction. As I pointed out in previous posts the errors that I am seeing on these corrections are huge. For instance, Petco checks in as being high by 0.5 ft/s with an error of 146 ft/s (AT&T is nearly 1 ft/s low). Obviously that doesn't seem to make any sense and either something is still wrong with my code or we just need more data or I need to correctly identify the fastballs or I need to carry the calculation out further. Because of this I am not yet going to publish these corrections. I don't really trust these numbers and I don't want people using them until I feel confident that they are correct. Once I get them fixed though I will be putting the numbers out for people to use. Just for fun let's put in the numbers and see what happens.
Wow that looks pretty good. I just don't understand why I am seeing such a huge error when I look at plots like this showing things matching up well. Another interesting thing can be seen in this plot. Remember back when I said I was concerned about making a hard cut at 90 MPH for the pitch speed? The reason was that cut wouldn't be uniform over the parks. Here, AT&T was increasing the initial pitch speed by having a more negative initial vertical velocity. Petco was doing exactly the opposite. That means we are actually seeing some 87ish MPH pitches in the Petco data and we are only seeing 93ish MPH pitches in AT&T. I believe that is why the AT&T data fits snugly inside the Petco data. The slower the ball is moving presumably the more potential for break (acceleration) there is and the wider the variation in position and velocity.

That was interesting but while Petco and AT&T were at the extremes for variation in initial height they were closer to the middle of the pack for variation in initial downward velocity. What if we look at two parks that are very extreme in both categories? The two best (worst?) parks here are Fenway and Angel Stadium.
Wow that plot looks ugly. Hopefully after our corrections things will get better. Again we will start by correcting just the initial height.
Not quite the nice fit we saw before (in the initial height match). Part of this could be due to the Boston staff being shorter than usual but part of it might be due to error on these numbers. Fenway is checking in at an error of nearly 0.2 feet and while that might not seem like a lot, if you moved the purple points right 0.2 feet it sure would look better to me. Now on to the initial velocity adjustment. These two parks are over 4 ft/s (over 2 MPH) different in just their initial downward velocity according to my numbers. Again, the errors on these numbers are huge but let's put them in and see what we get.
While not as perfect as the AT&T/Petco match this is a huge improvement for two parks that were radically different to start with. This basically is the worst case scenario for having to correct the data and the results seem very good to me. If this was all the closer we could get with these corrections I would still be happy.

Ok so I have shared the good news with you. Looking at these plots it really seems like not only can we understand what is going on with the data but we can fix it as well. Now the fly in the ointment. The other parameters that are vital to these calculations are the accelerations (in x, y, and z). For this data Sportvision is assuming that the acceleration is constant over time, meaning the change in velocity when the pitch is thrown is the same as the change in velocity as the pitch goes over home plate. Now, obviously this isn't a perfect assumption as the ball could be slowing down more the closer it gets to home plate. The problem is if you allow for a changing acceleration then the nice equations of motion that they use fall apart and things become even messier. In reality, it probably isn't bad at all to make the assumption that the acceleration isn't changing (though I can't say for sure; if you are looking for a topic to tackle using this data this would be an idea) but the problem for us is the method that we are using for creating corrections for the initial distances and velocities won't work. This means if we find that the accelerations need fixing, along with the positions and the velocities, then we are going to have to come up with a different method than the one I have outlined for fixing them.


Close your eyes (or turn off your monitor) if you don't want to see the bad news.
Going back to Petco and AT&T here is the vertical acceleration compared to the initial height. Again we can see the problems in the initial height because this is uncorrected data, but the accelerations don't seem to be matching up well either. Correcting for the initial height we can fully see the problem.
Ick. Again we can see Meredith and his fastballs that appear to be breaking down very hard (sinkers). Also, pitcher X's data has come out from hiding a bit and we can see his contribution near 5 feet in initial height and -40 ft/s^2. His fastball must be a sinker as well. The bad news though is it appears that an acceleration correction is going to have to be made for this data to match up. It is close, but just not close enough. This really sucks because what appears to be happening is the acceleration is being spread out in Petco and this correction won't be a nice linear one like the position and velocity corrections have been. Just for more proof here is the Fenway/Angel Stadium plot uncorrected first.
Again, these two parks are just about as bad as the data is going to get unless one of the last two parks to come online really sucks. Correcting for the initial height things get better but still look pretty poor.
Again we are seeing a spreading out of the accelerations. Instead of being able to match these two distributions by moving one or the other left/right or up/down the distributions will have to be shrunk or spread out. It is possible that my artificial cut at 90 MPH is doing some of this (like we saw in the position/velocity graphs) but I don't think it is responsible for all of it.

So where do we stand? Even without a great way of teasing the fastballs out of the data it appears that we will eventually be able to get some good correction factors for the initial positions and velocities. The accelerations are another story and something that will have to be thought about. If anyone has a good way of cutting the data to produce fastballs and is interested in sharing it please let me know. Also, if anyone thinks they have a good method for correcting the accelerations, even if they don't know exactly how to implement it, let me know.

You may have noticed that I started calling it PITCHf/x instead of pitchFX like my previous posts. I had seen it written both ways a lot and thought pitchFX was correct but after reading through the Sportvision website again it definitely should be PITCHf/x. My apologies to the creators.

ps. If reader Alan happens to be Dr. Alan Nathan who published this excellent paper examining Jon Lester's start against the Mariners please email me. You can find my email address under my profile on the right. I'd really like to chat about possibly using the spin magnitude and axis to classify pitches and why his theoretical fit to the data matched up so well when I am seeing such terrible agreement. Actually, anyone who wants to discuss any of that or anything else can email me with the link provided under my profile.

Wednesday, September 5, 2007

Somewhat Pretty Pictures

Ok so the last couple of posts probably have been pretty boring to most people. So I am going to interject with some plots from the pitchFX system so we can maybe get a better feel for things. First up I want to break down all the pitches tracked by pitchFX by what park the pitch was thrown at.

Click on all of these plots to enlarge them. As you can see, the system rolled out first mostly on the west coast and then moved east skipping some parks along the way. If you are a fan of a team in the AL West your team will have huge amounts of pitchFX data to work with. If your favorite team is in the AL East or NL Central then not so much. PitchFX has just been added to the New York parks and to Tropicana with fewer than 2,000 pitches thrown in them. As I pointed out in my last post, RFK has only seen one game with the system turned on. Looking at this discrepancy in the amount of data and then looking at the errors on each of the parks in the post below it is pretty clear that getting to at least a few thousand pitches is important if we want to be able to do anything with the statistics in those parks. With less than a month left in the season, this seems unlikely in at least three parks.

Moving on to the actual data, the only quantities that pitchFX actually measures are an initial position, velocity, and acceleration in the x, y, and z directions. So let's take a look at some of these quantities starting with our favorite, z0.

Starting with the uncorrected data you can see that most pitches are thrown at a height of about 6 feet but if you look closely you can see another bump around 3 feet. That must be where the sidearmers throw. Often when comparing data that spans several orders of magnitude people will plot that data on a logarithmic plot (here is a wiki article on log plots if you are interested) and here is one of the same data below.

It might not look very similar but if you look at the statistics in the upper right hand corner they are identical. The only difference is in the Y axis which now is on a logarithmic scale. This scale is really handy at looking at the tails of the distribution and looking at this plot we can see several bumps as we go down in release point. The data continues until the release point is actually negative! On the other side we see some pitches that were thrown at a release point of 10 feet. Obviously something is wrong here and if you look at these plots pitcher by pitcher you won't find just one guy who is causing the problems. This appears to be our first sign (though others have shown plenty more) of pitchFX just whiffing on a pitch. What about when we apply our correction factor from the post below?

Here the data appears to be a little better behaved, with a smoother drop as we go down in z0. There still are some values that are just way out there but things are looking a bit better and the RMS (a measure of the spread of the distribution, closely related to the variance) has gone down, which is something we would expect from corrected data. What about the x0 data?
The x0 data shows what you would expect with two peaks, one on the left side of zero (this is the right handed pitchers; remember these coordinates are as the catcher sees them) and another on the right side. Both peak around two feet from zero and both have a tail that goes out. The real question is why are there so many points near zero? The answer is pitchers stand on different parts of the rubber. If a left handed pitcher stands at the extreme left side of the mound he is going to release the ball very close to zero, or right in the middle of home plate. Let's zoom in and take a look at the plot in log form.

Again we can see a shoulder in the data that is where the sidearmers are. This is easier to see for the lefties but it is there for the right handers as well. Also, we see some more pathological points extending out 10 feet from 0. Just like with the z0 data we are going to have to clean that up before we can start to really analyze it. Do things get any better with the corrected data?
We still see just as many pathological points but the peak where the sidearmers are throwing is clearer. Again, this is exactly what we would hope for in correcting the data. Hopefully, as the large error on the x0 correction goes down the data will get even clearer. What about the initial velocities?

Because the y direction is towards home plate the initial velocity in y is the largest component in the initial speed of the ball. All the values are negative because the ball is traveling to home plate from 55 feet. I have converted this into miles per hour to make things easier to see. The peak here is at about 90 MPH with a shoulder around 82 MPH. Presumably, the points near the peak are mostly fastballs and the points near the shoulder are breaking balls. How about the tails of these distributions? Do they show as many ugly points as the positions?
It looks to me that this plot is much more reasonable. There are a few points near 105 MPH but some pitchers (Zumaya) might be able to get it close to that and if they were pitching in a park with a fast pitchFX system that night that doesn't seem too unreasonable. What about the pitches on the other side? Why are there so many pitches below 60 MPH? A possible explanation would be intentional balls. It is possible that a pitcher is just lofting those in at a very low speed. To check that we can plot all of the velocities of those intentional balls because that data has been added to the system.

So the really, really, slow ones are not from intentional balls. Those seem to be a good portion of the balls thrown near 60 MPH but few of them are below 50 MPH and none of them are below 40 MPH. So again it appears that there are some issues with the data. Hopefully these outliers can be properly removed and then real study can begin.
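
For anyone following along with the raw numbers, the velocities come in feet per second, so a unit conversion is needed before comparing with the miles-per-hour plots above. A minimal Python sketch (the function names are my own choice for illustration):

```python
FT_PER_MILE = 5280.0
SEC_PER_HOUR = 3600.0


def fps_to_mph(v_fps):
    """Convert a velocity from feet per second to miles per hour."""
    return v_fps * SEC_PER_HOUR / FT_PER_MILE


def mph_to_fps(v_mph):
    """Convert a velocity from miles per hour to feet per second."""
    return v_mph * FT_PER_MILE / SEC_PER_HOUR


# A y-velocity of -132 ft/s (negative because the ball moves toward the plate)
# corresponds to a 90 MPH pitch.
vy0_mph = fps_to_mph(-132.0)
```

The sign carries through the conversion, which is why the y-velocity histogram sits entirely on the negative side.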

Preliminary Correction to PitchFX Data Part II

So I am still working the kinks out of my code and this afternoon I found a bug that was messing up the weighted variance when it came time to calculate the differences between each of the parks. This carried over and messed with the release point factors as well. It did so in a nasty way that didn't have a huge effect on parks that had a lot of data which is why I didn't catch it earlier. Anyway, I fixed the bug and when I went to recalculate the park factors all of the errors went way up. This actually seemed not unreasonable to me as I was planning on adding in a second order calculation anyway. I started talking about it in response to one of the comments in the last post.

It really is a pretty simple concept. In addition to directly comparing pitchers who pitched in park A and park B I am adding in pitchers who pitched in park A and park C and then pitchers who pitched in park B and park C. I am adding this two step process together in quadrature just like I did when I calculated the differences originally. It turned out this improved things but not quite as much as I hoped. So I took it one step further and added a third order correction as well. Every additional step you take helps less and less but third order still was enough to produce some pretty pleasing results. Obviously, this isn't a very good writeup of the process but for reasons that I will detail later, the code still isn't quite where it needs to be. So, I am not going to do another full writeup of the process until it is more set in stone. Here are the results from the new method including all games played yesterday, Sept. 4th.
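
Since a full writeup is deferred, here is only a rough Python sketch of one way the two-step idea could work; it is an interpretation, not the actual code. It assumes each park-to-park difference is stored as a (value, variance) pair, that chaining A-C and C-B adds the variances (errors in quadrature), and that the direct and indirect estimates are merged with an inverse-variance weighted mean. The function names and numbers are invented.

```python
def two_step(diff_ac, diff_cb):
    """Chain (A minus C) and (C minus B) into an estimate of (A minus B).

    Each argument is a (value, variance) pair; the variances add.
    """
    return diff_ac[0] + diff_cb[0], diff_ac[1] + diff_cb[1]


def combine_weighted(estimates):
    """Inverse-variance weighted mean of a list of (value, variance) pairs."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    mean = sum(w * val for w, (val, _) in zip(weights, estimates)) / total
    return mean, 1.0 / total


# Invented numbers, in feet: a direct A-B estimate and one path through park C.
direct_ab = (0.30, 0.04)
indirect_ab = two_step((0.50, 0.02), (-0.15, 0.02))

combined_mean, combined_var = combine_weighted([direct_ab, indirect_ab])
```

In this toy case the indirect path is as precise as the direct one, so the combined variance is half of either alone. A third order step would chain three differences the same way, with each extra link adding more variance and therefore helping less.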

Correction to the z0 release point (in inches)

park factor variance
bos -5.703 0.18480
sdn -4.070 0.04284
was -3.607 0.36426
sln -3.148 0.03288
cha -2.761 0.04104
nyn -2.759 0.50190
flo -2.601 0.10395
mil -1.757 0.07686
lan -1.051 0.03574
ari -0.880 0.05964
hou -0.719 0.06288
sea -0.391 0.04026
bal -----
pit -----
det 0.139 0.10242
atl 0.349 0.03999
cle 0.359 0.07261
tor 0.451 0.04750
oak 0.769 0.03911
phi 0.862 0.11741
nya 1.053 0.36311
tba 1.115 0.24041
chn 1.436 0.06175
col 2.588 0.07733
kca 2.629 0.16184
cin 2.884 0.08300
tex 3.211 0.03561
ana 3.370 0.04835
sfn 3.848 0.04182
min 4.464 0.07266

I am moving to inches because if I report the numbers in feet some of the variances are incredibly small and kind of hard to write in a nice table format. The relative error doesn't change but this is easier to read. Still no data for Baltimore or Pittsburgh but Washington is showing up. I looked into this and found that the pitchFX system was turned on for one game at RFK but has been turned off since and no data was received so far for their home stand. Things look pretty good here with the statistical error being at most half an inch but when you turn to x0 things get a bit out of hand.

Correction to the x0 release point (in inches)

park factor variance
flo -9.687 3.16846
ari -6.589 1.70915
tex -3.249 1.36334
sdn -2.364 1.72881
chn -2.123 1.53192
hou -2.059 1.75174
cin -1.936 2.51889
sfn -1.880 1.80075
phi -1.704 3.41626
nyn -0.715 8.95378
col -0.492 2.44724
sln -0.449 1.25995
ana -0.441 1.11405
sea -0.275 1.50260
was -0.051 20.73689
bal ----
pit ----
cha 0.533 1.78187
det 0.625 3.14010
oak 0.632 1.39684
lan 0.851 1.62027
kca 0.911 3.62942
cle 1.453 4.05667
tor 2.261 1.42610
nya 2.999 8.05124
mil 3.103 3.15419
min 3.125 1.66268
bos 3.876 3.28594
tba 5.705 3.37659
atl 7.728 1.82724

Wow. The first thing to notice is that the correction in x needs to be bigger than the correction in z. If these numbers are correct, pitchFX is missing the horizontal release point by almost 10 inches in Florida. That is huge. Also huge are the errors on these numbers. Even the parks with a lot of data still have errors bigger than an inch. That is just too big. Maybe this is because I need to go to fourth order because the spread is much bigger. Maybe there needs to be a separate correction for left handers and right handers for each park. That would really suck because cutting an already thin sample by about 1/3 to look at just lefties would be pretty painful.

Anyway, work is in progress but I need to be able to hammer out the details for x0 and z0 before I move on to the initial velocities. If I use the same code run for vz0 I get a correction for Fenway that is 2.719 with an error of 170! Obviously, taking out the breaking pitches will be essential for correcting that data. If anyone has any thoughts about possible improvements to these corrections I'd love to hear them. Either comment below or drop me an email. If we can just get this data corrected I believe it would be a huge step forward in analyzing baseball.

Monday, September 3, 2007

Preliminary Correction to the PitchFX data

So I have been doing a lot of work with the pitchFX data recently and I think I have a nice method for correcting the vertical release point, z0, (and other variables) in the data. This post will outline the method that I am using. I'd really, really, like some input on what people think about this method. Because of this, this post is going to be very math heavy. If you aren't interested in that kind of stuff just skip it or skip to the final numbers at the end.

Joe P. Sheehan has started to look at this in his post at baseballanalysts.com. In that post he compared pitchers' home/road splits for the vertical release point and showed that the data has some serious quirks. Before any real analysis can take place the data must be normalized. One would like to find a league average release point and then calculate the difference between that average and an average of every pitch thrown in a particular park. The problem with that is if a team's pitching staff is much taller or shorter than average the data is going to appear skewed. Because of this, a more complicated method will be needed.

I want to start by mentioning that during different times this year the pitchFX system has been recording at different distances. They started at 55 ft. from home plate but have moved it in as far as 40 ft. To normalize the data we are going to need a consistent value for the initial position. So I am moving back each point not at 55 ft. to 55ft. using the standard equations of motion. You can find a good description of this process at Alan M. Nathan's page here. From now on if I use the word release point I am talking about the z0 data recorded at 55ft. Once this correction has been done I can start the normalization process. Also, I am using almost all the data available going up to the games played Saturday night (Sept. 2rd).
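
As a rough illustration of moving a point back to 55 ft, here is a minimal Python sketch using the constant-acceleration equations of motion. It is not the code behind these numbers. The dictionary keys, the function name, and the sample pitch are assumptions made for the example; y is taken as the distance from home plate in feet, so vy0 is negative for a ball heading toward the plate.

```python
import math


def move_to_distance(state, y_target=55.0):
    """Propagate a constant-acceleration trajectory to y = y_target.

    state holds x0, y0, z0 (ft), vx0, vy0, vz0 (ft/s) and ax, ay, az (ft/s^2).
    Returns a new dict with the position and velocity at y_target.
    """
    c = state["y0"] - y_target
    vy0, ay = state["vy0"], state["ay"]
    disc = vy0 * vy0 - 2.0 * ay * c
    if vy0 >= 0 or disc < 0:
        raise ValueError("trajectory never reaches the target distance")
    # Root of y0 + vy0*t + 0.5*ay*t^2 = y_target closest to t = 0.
    # This form also works when ay is exactly zero.
    t = 2.0 * c / (-vy0 + math.sqrt(disc))

    moved = dict(state)
    for axis in "xyz":
        p = state[axis + "0"]
        v = state["v" + axis + "0"]
        a = state["a" + axis]
        moved[axis + "0"] = p + v * t + 0.5 * a * t * t
        moved["v" + axis + "0"] = v + a * t
    return moved


# Invented pitch recorded at 40 ft from the plate.
pitch = {"x0": -2.0, "y0": 40.0, "z0": 5.5,
         "vx0": 5.0, "vy0": -125.0, "vz0": -4.0,
         "ax": -8.0, "ay": 28.0, "az": -20.0}
at_55 = move_to_distance(pitch)
```

Because 55 ft is farther from the plate than 40 ft, the solved time is negative: the sketch runs the trajectory backwards, so the ball comes out higher and moving faster than at the recorded point.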

My plan is going to be to compare each park using pitchers that threw in both parks while the pitchFX system was in place. By comparing the difference in release point between the two parks we can get an estimate of the difference in the pitchFX system in each park. This process assumes that pitchers aren't changing their release point over time. Obviously, some of this is going to occur but pitchFX has tracked over 200,000 pitches so hopefully we can overcome this with the aid of large statistics. Sheehan used Josh Beckett as an example in his analysis and it turns out that Beckett's numbers are pretty interesting so I will use him as an example as well. Here is a histogram of Beckett's release point for every park he has pitched in that had an operational pitchFX system.

Obviously, Beckett has thrown the most pitches at Fenway and had exactly one start in each of the other parks. If you were to scale each distribution to the same size though they would pretty much look like each other. This implies that Beckett hasn't changed his release point and the only differences in the distributions are from the pitchFX system. As Sheehan noted, the data from Fenway appear to be way different from the data at the other parks. That said, there still are relatively large variations in his road starts as well.

So to start I would like to calculate the difference between the system's output in two parks. To do this I need to search for pitchers that have pitched in both parks while pitchFX has been turned on. To be added into the sample I also require at least 20 pitches be thrown by the pitcher in each park. I then calculate the mean and the variance for each pitcher in each park. Then I calculate the difference between the parks by subtracting the means. To get an error on this I add the squares of variances. I then have a list of pitchers who pitched in both parks and the mean and variance for the difference between the parks. I want to average them but to do that I am going to use a weighted mean so pitchers with lower variance will count more. Do this for each pair of parks and you have the difference between release points between the parks.
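
The steps above can be sketched in Python roughly as follows. This is a simplified stand-in, not the actual code: the function name and sample data are invented, each pitcher's variance is taken to be the variance of his mean (sample variance divided by pitch count), and the error on a difference is the sum of the two variances, which is the usual propagation rule and my assumption about what is meant here.

```python
from statistics import mean, variance


def park_difference(pitches_a, pitches_b, min_pitches=20):
    """Weighted mean release-point difference (park A minus park B).

    pitches_a and pitches_b map pitcher name -> list of z0 values in feet.
    Returns (difference, variance), or None if no pitcher qualifies.
    """
    diffs = []
    for pitcher in pitches_a.keys() & pitches_b.keys():
        za, zb = pitches_a[pitcher], pitches_b[pitcher]
        if len(za) < min_pitches or len(zb) < min_pitches:
            continue  # require at least min_pitches in each park
        var_a = variance(za) / len(za)  # assumed: variance of the mean
        var_b = variance(zb) / len(zb)
        diffs.append((mean(za) - mean(zb), var_a + var_b))
    if not diffs:
        return None
    # Inverse-variance weights: lower-variance pitchers count more.
    weights = [1.0 / v for _, v in diffs]
    total = sum(weights)
    return sum(w * d for w, (d, _) in zip(weights, diffs)) / total, 1.0 / total


# Invented release heights (feet); p3 has too few pitches at the second park.
petco = {"p1": [6.0, 6.2] * 10, "p2": [5.0, 5.2] * 10, "p3": [5.5, 5.7] * 10}
att = {"p1": [6.4, 6.6] * 10, "p2": [5.4, 5.6] * 10, "p3": [5.9, 6.1] * 2}
result = park_difference(petco, att)
```

In the toy data both qualifying pitchers read 0.4 feet lower at the first park, so the weighted difference comes out at -0.4 feet regardless of how tall each pitcher is, which is the point of comparing pitcher by pitcher.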

What we really want though is the difference between each park and a league average. Fortunately, if we add each of the differences for a park together we should get the difference between that park and league average. Do this for each park and you have every park's difference from league average.
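
A small Python sketch of that last step, under two assumptions of mine: a difference is available for every pair of parks, and the summed differences are divided by the number of parks (without that normalization the sum grows with the park count). The park offsets are invented.

```python
def league_offsets(pair_diff, parks):
    """Offset of each park from the league average.

    pair_diff[(a, b)] holds (park a minus park b) for each pair, stored once.
    Assumes every pair of parks has an entry in one order or the other.
    """
    offsets = {}
    for a in parks:
        total = 0.0
        for b in parks:
            if a == b:
                continue
            if (a, b) in pair_diff:
                total += pair_diff[(a, b)]
            else:
                total -= pair_diff[(b, a)]  # flip the sign for the reverse order
        offsets[a] = total / len(parks)
    return offsets


# Invented pairwise differences in feet, consistent with park offsets of
# bos = -0.6, sfn = 0.4, sea = -0.1 (league average -0.1).
pair_diff = {("bos", "sfn"): -1.0, ("bos", "sea"): -0.5, ("sfn", "sea"): 0.5}
offsets = league_offsets(pair_diff, ["bos", "sfn", "sea"])
```

Here the result is -0.5 for bos, 0.5 for sfn, and 0.0 for sea, and the offsets sum to zero by construction. With real data some pairs have no shared pitchers, so the missing entries would need separate handling.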

As an aside this is where interleague games really come in handy. Without them there would be little crossover between AL and NL parks and we would probably have to settle for an AL and NL adjustment instead of a league adjustment. So interleague play is good for something.

So here are the results:
z0 release point at each park compared to league average
bos -0.653
nyn -0.357
cha -0.276
sdn -0.243
sea -0.176
flo -0.175
tba -0.083
tor -0.083
det -0.073
lan -0.061
hou -0.046
sln -0.030
nya -0.011
atl -0.006
was 0.000
bal -----
pit -----
kca 0.016
ari 0.017
cle 0.019
mil 0.034
oak 0.071
col 0.145
phi 0.183
min 0.195
tex 0.243
chn 0.253
cin 0.285
ana 0.362
sfn 0.450

First, pitchFX has yet to be turned on in Baltimore and Pittsburgh. Both those teams are on a road trip right now and I expect that when they come back home data will start rolling in for them as well. Washington has been turned on but only for a very limited time. Hopefully, that will get back shortly and that can be added as well.

These results generally agree with what Sheehan found using home/road splits but there are a few parks that changed a good deal. For instance, Colorado looked like it was producing very high release points but now has settled down. Because pitchFX was installed very early in the west coast stadiums, and because San Francisco and San Diego are pretty far away from league average, the home/road splits could have been corrupted. Maybe a similar thing was happening with Boston and Toronto with Rogers Centre.

Going back to Beckett for a moment. Now we can see from the chart that every single park he has pitched in with pitchFX turned on is below league average. So if you were to look at his data without first adjusting for these effects you would really be short changing Beckett.

In any case, it is important to test these results to see how much of a difference they make. To do this I looked at every pitcher in the league and calculated his mean and variance of his release point at home and on the road first without applying these corrections then after applying these corrections. I then calculated a league average mean and variance for home and road games again using weighted means for each of the samples. If the corrections are working the road variance should move much closer to the home variance, and that is exactly what I find. Without the correction the home variance is .00004187 and the road variance is .00005516. While these numbers look small the percent difference between them is over 30%. Now after the corrections the home variance becomes .00004173 and the road variance becomes .00004213. Now that is a remarkable change. You may be asking why the home variance changed at all. After all, if you are just adding a constant to the home numbers the variance should remain unchanged. The reason there is a slight change is from pitchers who changed teams and have at least two home parks. The way I have my code setup it is very hard to remove them from the sample so I left them in and that slightly changed the home variance.
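
The percent differences quoted above are easy to check. A short Python sketch using the four variances from the paragraph (the function name is just for illustration):

```python
def percent_difference(road, home):
    """Road variance relative to home variance, as a percentage."""
    return 100.0 * (road - home) / home


# Home and road variances before and after the correction, as quoted above.
before = percent_difference(0.00005516, 0.00004187)
after = percent_difference(0.00004213, 0.00004173)
```

This gives a gap of a little under 32% before the correction and just under 1% after it.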

So what is next? Well I have to run the numbers for the horizontal release point, x0 and then for the initial velocities. All of these numbers will need to be corrected before we really can use the data properly. The initial velocities will be more difficult because pitchers might throw a different percentage of off speed pitches to different teams. The Marlins, for example, are supposed to be a very good fastball hitting team. When a visiting pitcher faces them he probably will throw more off speed pitches making it appear like his overall velocity is down if you average over all pitches. A possible work around to this will be to identify just the fastballs in the sample and then only compare them. In any case, it will take me a few days to produce those numbers as the code needed for this takes forever to run and my eyes, they bleed. If you got this far congratulations, please consider leaving a comment if you have any questions or concerns about the method. Thanks.

Sunday, September 2, 2007

Estrada Hit Chart

So TheJay keeps requesting them and I am going to keep posting them. Again, if anyone has a player they would like to see please leave a comment below. Also, I am getting close to getting the release point of the pitch normalized for each park. Since all quantities depend solely on the initial conditions this should be very useful. Hopefully today or tomorrow I will be blogging about that with some numbers for people to use. The velocity normalization is going to be harder to do but I do have an idea on that. The acceleration will be very tricky and may not be possible with the data we currently have. Anyway, on to Estrada.

Estrada is a hitter that swings at a lot of pitches and makes a lot of contact. For a long time this year he was swinging at the first pitch over 50% of the time. I'm not sure if he is still doing that but here are the results from pitchFX for all of the pitches to Estrada.

So I have combined the hit chart and the strike chart into one and added foul balls. This plot is very busy but if you click on it it will enlarge and look a lot better. Estrada is a switch hitter so you would have to assume most of those strike calls that were off the plate came from him batting left handed. Estrada is mostly a singles hitter with a little power mixed in mostly down and away when he is batting left handed. What about him swinging at the first pitch?

Estrada appears to be laying off more first pitches recently and because most of his pitchFX data has come recently he appears to be more patient than he really has been over the whole year. Almost all of his power comes swinging at the first pitch so it appears that he is swinging very hard when he does swing. Most of his swinging strikes came out of the zone and most of his foul balls came in the zone. Honestly with his track record I am shocked that pitchers continue to give him first pitch fastballs in the zone. If he is going to burn you it most likely is on that first pitch and starting him off with offspeed stuff out of the zone seems like a good bet to me.