Monday, November 12, 2007

Classification Algorithm Explained

Once the data has been corrected we are ready to start classifying the pitches. But first there is a little trick I want to apply. Because the atmospherics can reduce the effect of spin on the ball by up to 25% on a hot day at Coors, I translate each pitch as if it had been thrown at sea level at standard temperature (59 degrees Fahrenheit). This is sort of like applying a park factor to correct for runs scored, and it puts each pitch on a level playing field. This is very important for the classification algorithm because, if these pitches weren't translated, pitchers who spent half of their time at Coors would appear to have two separate curve balls. That would really mess the algorithm up, and while Coors is the biggest problem, some other parks in midsummer or during a cold spell can show a change of more than 10% as well. Translating these pitches solves these problems.
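To get a feel for the size of the effect, here is a minimal sketch of an air density calculation. It is an illustration rather than the code behind this site, and it rests on assumptions that are not in the post: dry air, standard-atmosphere pressure at the park's altitude, and the ideal gas law. The function names and the approximate Coors altitude of 1,580 m are mine.

```python
SEA_LEVEL_PRESSURE_PA = 101325.0  # standard atmosphere at sea level
SEA_LEVEL_TEMP_K = 288.15         # 59 degrees Fahrenheit
LAPSE_RATE_K_PER_M = 0.0065       # standard temperature lapse rate
R_DRY_AIR = 287.05                # specific gas constant, J/(kg K)
BARO_EXPONENT = 5.25588           # exponent of the barometric formula


def fahrenheit_to_kelvin(temp_f):
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


def air_density(altitude_m, temp_f):
    """Approximate dry-air density in kg/m^3 (humidity ignored)."""
    # Assumption: pressure follows the standard atmosphere at this altitude.
    pressure = SEA_LEVEL_PRESSURE_PA * (
        1.0 - LAPSE_RATE_K_PER_M * altitude_m / SEA_LEVEL_TEMP_K
    ) ** BARO_EXPONENT
    # Ideal gas law with the actual game temperature.
    return pressure / (R_DRY_AIR * fahrenheit_to_kelvin(temp_f))


def density_ratio(altitude_m, temp_f):
    """Density relative to sea level at 59 degrees Fahrenheit."""
    return air_density(altitude_m, temp_f) / air_density(0.0, 59.0)


# Hypothetical hot day at roughly the altitude of Coors Field.
coors_ratio = density_ratio(1580.0, 90.0)
```

Under these assumptions the ratio for a 90 degree day at that altitude comes out at roughly 0.78, which is in the same neighborhood as the reduction quoted above.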

Ok, so now that the pitches are translated we are ready to classify them. I am using an incredibly simple algorithm that clusters pitches by determining how close each pitch was to every other pitch thrown by that pitcher. It calculates a "distance" between each pair of pitches by comparing the speed the pitch was thrown at and the vertical and horizontal accelerations. The two pitches that are closest together get merged. This process continues until all pitches are in clusters and the clusters are far enough away from each other.
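The merging loop described above can be sketched in a few lines. This is only a sketch under assumptions of my choosing: the distance is Euclidean after dividing each coordinate by a scale factor, clusters are compared by their centroids, and merging stops once the closest pair is at least `min_separation` apart. The post does not specify the actual distance formula or stopping threshold, so `scales` and `min_separation` are hypothetical parameters.

```python
import math


def centroid(cluster):
    """Mean (speed, horizontal accel, vertical accel) of a cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(3))


def distance(a, b, scales):
    """Scaled Euclidean distance between two points (an assumed metric)."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


def cluster_pitches(pitches, min_separation, scales=(1.0, 1.0, 1.0)):
    """Merge the closest pair of clusters until all are far enough apart."""
    clusters = [[p] for p in pitches]  # every pitch starts on its own
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]), scales)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= min_separation:
            break  # remaining clusters are far enough away from each other
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up pitches: (speed in mph, horizontal accel, vertical accel).
pitches = [
    (95.0, -10.0, -15.0), (94.0, -9.0, -16.0), (96.0, -11.0, -14.0),
    (78.0, 5.0, -40.0), (77.0, 6.0, -41.0), (79.0, 4.0, -39.0),
]
clusters = cluster_pitches(pitches, min_separation=5.0)
```

The brute-force search over all pairs is slow for a full season of pitches, but it keeps the idea visible.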

Once the clusters are formed the algorithm finds the pitcher's fastball. It does this by simply taking the cluster that has the highest speed. Once the fastball is found, every other cluster is compared to the fastball in speed and the two accelerations. Now the cluster algorithm is run again on the remaining clusters and pitch types are formed. By first comparing the pitches to the pitcher's fastball, Jamie Moyer's other pitches can be put on the same footing as Joel Zumaya's pitches. The algorithm can't say these are curve balls, but it can put all the curve balls together and then I can label the group curve balls. Once this is done it goes back to the fastballs we started with and reclassifies those, in case a pitcher only throws sinkers or cutters for example.
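The fastball step can be sketched as follows. This is an illustration under my own assumptions: "highest speed" is taken to mean the highest average speed of a cluster, and each remaining cluster is described by the difference between its averages and the fastball's. The function names are hypothetical.

```python
def cluster_mean(cluster):
    """Average (speed, horizontal accel, vertical accel) of a cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(3))


def relative_to_fastball(clusters):
    """Pick the fastest cluster and express the others relative to it."""
    means = [cluster_mean(c) for c in clusters]
    # Assumption: the fastball is the cluster with the highest mean speed.
    fastball_index = max(range(len(means)), key=lambda i: means[i][0])
    fastball = means[fastball_index]
    offsets = {
        i: tuple(m[k] - fastball[k] for k in range(3))
        for i, m in enumerate(means)
        if i != fastball_index
    }
    return fastball_index, offsets


# Made-up clusters for one pitcher.
clusters = [
    [(78.0, 5.0, -40.0), (80.0, 5.0, -40.0)],     # something slow and breaking
    [(95.0, -10.0, -15.0), (93.0, -10.0, -15.0)], # the hardest pitches
]
fastball_index, offsets = relative_to_fastball(clusters)
```

Working with offsets like these, rather than raw speeds, is what lets a soft-tosser and a flamethrower be compared in the second clustering pass.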

Sadly, this algorithm is far from perfect and needs some human intervention. I have to hand edit about 40 pitchers who might have a splitter that looks like a sinker to the algorithm, or a slider that looks like a cutter, and so on. I have tried to check other references to make sure I have the right pitches for each pitcher, but for many pitchers who have thrown just a few pitches in the big leagues this is particularly hard. If you are browsing the player cards and find something you think I got wrong, please leave a comment below.

Explanation of the correction code

This post is way overdue, but finally here is a detailed explanation of the correction code for the PITCHf/x data. As we have seen in previous posts, the PITCHf/x data needs some serious corrections. This is going to be a pretty hard core post, so feel free to skip it if you aren't interested in the method or in how to correct the data. I am going to describe the process for one variable, the initial position of the ball in the vertical direction, or z0. After that I will discuss alterations for the other variables.

Once I have all the data read in and all initial positions are moved back to 55 feet from home plate, I am ready to correct the data from park to park. What we would really like to do is first calculate a league average and then calculate how each park varies from that. But because of the nature of the data this is impossible. For instance, if a home team has a very short pitching staff, that park is going to have a low average z0 if we simply average all the pitches thrown in the park. Having a trustworthy average for each park is essential for the league average calculation, so we must do something else.

What I have come up with is this: instead of calculating an average, I calculate the difference between two parks based on the pitchers common to both. I first calculate a mean and a variance for z0 for each pitcher for each park he has pitched in. I then take every pitcher who has thrown a tracked pitch in both park A and park B and calculate the difference between his two means. I carry out a similar trick with the variances, combining the two errors in quadrature to find the error on this difference. So, if a pitcher had a mean of 6 feet in park A and a mean of 6.5 feet in park B, then his difference would be -0.5 feet. Once I have done this for every pitcher who has thrown in the two parks I can combine the differences. But because some pitchers contributed a lot of pitches in both parks and some just a few, I actually find a weighted average. This is where the error on each pitcher's difference comes in. If a pitcher threw just a few pitches in both parks, he is going to have a very large error and won't count as much toward the weighted mean.
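Here is a minimal sketch of that weighted difference for one pair of parks. It makes assumptions the post does not spell out: the error on each pitcher's mean is the sample variance divided by the number of pitches, and each pitcher is weighted by one over the squared error of his difference (inverse-variance weighting). The function names and data layout are hypothetical.

```python
from statistics import mean, variance


def mean_and_error_sq(values):
    """Sample mean and squared standard error of that mean."""
    return mean(values), variance(values) / len(values)


def park_difference(z0_park_a, z0_park_b):
    """Weighted mean of (park A - park B) over pitchers common to both.

    Each argument maps a pitcher name to the list of z0 values (feet)
    he recorded in that park. Returns (difference, error) or None.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pitcher in z0_park_a.keys() & z0_park_b.keys():
        a, b = z0_park_a[pitcher], z0_park_b[pitcher]
        if len(a) < 2 or len(b) < 2:
            continue  # a variance needs at least two pitches
        mean_a, err_sq_a = mean_and_error_sq(a)
        mean_b, err_sq_b = mean_and_error_sq(b)
        err_sq = err_sq_a + err_sq_b  # errors combined in quadrature
        if err_sq == 0.0:
            continue  # identical readings give no usable weight
        weight = 1.0 / err_sq  # assumed inverse-variance weighting
        weighted_sum += weight * (mean_a - mean_b)
        weight_total += weight
    if weight_total == 0.0:
        return None  # no common pitchers with enough data
    return weighted_sum / weight_total, weight_total ** -0.5


# Made-up z0 readings in feet.
park_a = {"x": [6.0, 6.1, 5.9], "y": [5.0, 5.2], "only_a": [6.2, 6.3]}
park_b = {"x": [6.5, 6.6, 6.4], "y": [5.4, 5.6]}
difference, error = park_difference(park_a, park_b)
```

In this toy data pitcher "x" has the tighter readings, so the combined difference lands closer to his -0.5 feet than to the other pitcher's -0.4 feet.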

So this should give me a difference between every pair of parks. The problem is that there are many park combinations where no pitcher threw in both parks while PITCHf/x was tracking. To solve this problem I carry out the above procedure to higher orders. I do that by adding intermediary parks. So instead of going straight from park A to park B, I also add in pitchers who threw in park A and park C and then pitchers who threw in park B and park C. Because park C has been added we now have two sets of errors, which again we need to combine in quadrature. That means this measurement will be less accurate than going directly from park A to park B, but it is the only solution for parks with no common pitchers. In fact, I carry this procedure out to 4th order to get the best possible results. I could go further, but I have found that 5th order and beyond change the numbers by less than half a percent. Needless to say, this takes a long time. Hours, in fact, on my desktop. But the result is that I now have a difference between all the parks. From now on I will call the difference between park A and park B D(A)(B).
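A minimal sketch of one intermediary step, with the caveat that the post does not say how the direct and indirect estimates are merged. I assume here that an indirect estimate is D(A)(C) + D(C)(B) with the two errors added in quadrature, and that several estimates of the same difference are merged with inverse-variance weights. Both function names are hypothetical.

```python
import math


def chain(d_ac, err_ac, d_cb, err_cb):
    """Estimate D(A)(B) through park C, with errors added in quadrature."""
    return d_ac + d_cb, math.hypot(err_ac, err_cb)


def combine(estimates):
    """Merge (difference, error) estimates using inverse-variance weights."""
    weights = [1.0 / err ** 2 for _, err in estimates]
    total = sum(weights)
    value = sum(w * d for w, (d, _) in zip(weights, estimates)) / total
    return value, total ** -0.5


# Made-up numbers: A to B through park C, then merged with a direct estimate.
via_c = chain(0.3, 0.03, -0.1, 0.04)
merged = combine([via_c, (0.3, 0.05)])
```

The chained error is always larger than either leg on its own, which is why a route through an extra park counts for less than a direct comparison of the same quality.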

I now have all of the differences, but this doesn't get me any closer to the league average. So we will now apply a nifty statistics trick. While I would really like to find the league average, I don't actually need it. What I really need is the difference between each park and the league average. I will also denote park A's average as PA. Again, I can't actually find this number, but we will need it in the calculation of the difference between each park and the league average. Here is how we are going to find that.

By definition, the league average is the sum of the park averages divided by the number of parks. Multiplying each side of that equation by the number of parks, we get:

P1+P2+P3+...+P28+P29 = LgAve * 29

Note we are using 29 here because the system was never turned on in Baltimore. Also, the numbers 1 through 29 are just placeholders for each of the parks. If we want to now find the difference between park 1 and league average we can start by adding P1-P2 to both sides.

2*P1+P3+...+P28+P29 = LgAve*29 + P1 - P2

We have got P2 out of the left side, which is good, but now it is on the right side, which is bad. The good news is we know what P1 - P2 is: that is D(1)(2), which we have already measured. In fact, I can now add P1 - P3 and P1 - P4 and so on to each side, and then replace each difference on the right side with the corresponding D, until I get:

29 * P1 = LgAve * 29 + D(1)(2) + D(1)(3) + ... + D(1)(28) + D(1)(29)

Moving the LgAve to the left side and dividing by 29 we get:

P1 - LgAve = (D(1)(2) + D(1)(3) + ... + D(1)(28) + D(1)(29))/29

The left side is exactly what we want: the difference between one park and the league average. The right side is made up entirely of numbers we have already calculated. So we can apply this method for each park, and just like that we have the park corrections for the initial vertical release point.
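The final formula is easy to check numerically. The sketch below assumes the differences are stored in a square table where entry [i][j] holds D(i)(j) and the diagonal is zero; the function name and the three made-up park averages are mine and exist only to verify the algebra.

```python
def park_offsets(diff):
    """Return P_i - LgAve for every park.

    diff[i][j] holds D(i)(j) = P_i - P_j, with diff[i][i] == 0.
    The zero on the diagonal stands in for P_i - P_i, so dividing the
    row sum by the number of parks matches the formula above.
    """
    n = len(diff)
    return [sum(row) / n for row in diff]


# Check with made-up park averages that we pretend not to know.
hidden_averages = [6.0, 6.3, 5.7]
league_average = sum(hidden_averages) / len(hidden_averages)
diff = [[a - b for b in hidden_averages] for a in hidden_averages]
offsets = park_offsets(diff)
expected = [p - league_average for p in hidden_averages]
```

The offsets recovered from the differences alone agree with `expected` up to floating point rounding, even though the function never sees the park averages themselves.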

Whew. We now need to apply this method for each park for each variable: all the initial locations, the initial velocities, and the accelerations. The accelerations are a little more complicated because they are also affected by the atmospheric conditions. For them I find the altitude and temperature of the game and from those the air density. Because the ball is being manipulated by drag and spin (the Magnus force), and both forces are proportional to air density, I can multiply in the air density and then run the correction code. This actually gives me the correction factor times the density, but I can divide that out when I go to apply it.

Lastly, the z direction acceleration needs another trick. Gravity is also acting on the ball, but it doesn't care about air density, so it must be subtracted first. Once the correction factor is found, gravity can be added back in to find the true acceleration in the z direction.
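To show why gravity has to come out first, here is a minimal sketch of rescaling a vertical acceleration from game conditions to standard conditions. It is one possible reading rather than the actual correction code, and it assumes that z points up, that the reported acceleration includes gravity at 32.174 ft/s^2, and that the aerodynamic part scales linearly with air density. The function name and the density values are hypothetical.

```python
GRAVITY_FT_S2 = 32.174  # magnitude of gravity; assumed to act along -z


def translate_az(az_measured, rho_game, rho_standard):
    """Rescale only the aerodynamic part of the vertical acceleration."""
    aero = az_measured + GRAVITY_FT_S2  # take gravity out first
    # Assumption: drag and Magnus accelerations are proportional to density.
    aero_standard = aero * rho_standard / rho_game
    return aero_standard - GRAVITY_FT_S2  # put gravity back in


# Made-up pitch with 10 ft/s^2 of lift, thrown in thin air.
az_standard = translate_az(-22.174, rho_game=0.96, rho_standard=1.2)

# A pitch with no aerodynamic lift at all is left untouched.
az_no_lift = translate_az(-GRAVITY_FT_S2, rho_game=0.96, rho_standard=1.2)
```

Scaling the raw acceleration without removing gravity would wrongly scale gravity along with the lift, which is the mistake this step avoids.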