Monday, November 12, 2007

Classification Algorithm Explained

Once the data has been corrected, we are ready to start classifying the pitches. But first there is a little trick I want to apply. Because the atmospherics can reduce the effect of spin on the ball by up to 25% on a hot day at Coors, I translate each pitch as if it had been thrown at sea level at standard temperature (59 degrees Fahrenheit). This is sort of like applying a park factor to correct for runs scored, and it puts each pitch on a level playing field. This is very important for the classification algorithm: if these pitches weren't translated, pitchers who spent half of their time at Coors would appear to have two separate curve balls. That would really mess the algorithm up, and while Coors is the biggest problem, some other parks can see a change of more than 10% during midsummer or a cold spell. Translating the pitches solves these problems.
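A minimal sketch of this kind of translation, not the actual code behind the player cards, might look like the following. It assumes the spin-induced acceleration scales linearly with air density, that the vertical acceleration includes gravity (which should not be rescaled), and that units are ft/s^2. The function name and the sample density are made up for illustration.

```python
STANDARD_DENSITY = 1.225  # kg/m^3, dry air at sea level and 59 F (15 C)
GRAVITY = 32.174          # ft/s^2

def translate_pitch(ax, az, air_density):
    """Rescale a pitch's accelerations (ft/s^2) to standard air density.

    Assumption: the spin-induced part of the acceleration is proportional
    to air density, so it is scaled by standard / actual density.
    """
    scale = STANDARD_DENSITY / air_density
    # az includes gravity, which does not depend on air density,
    # so remove it before scaling and put it back afterwards.
    spin_az = az + GRAVITY
    return ax * scale, spin_az * scale - GRAVITY

# Hypothetical thin-air day: density is 80% of standard,
# so the spin-induced accelerations are scaled up by 1.25.
ax_std, az_std = translate_pitch(10.0, -20.0, 0.98)
```

Drag along the flight path also depends on air density, so a fuller version would probably adjust the speed as well. That is left out here.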

OK, so now that the pitches are translated, we are ready to classify them. I am using an incredibly simple algorithm that clusters pitches by determining how close each pitch was to every other pitch thrown by that pitcher. It calculates a "distance" between each pair of pitches by comparing the speed the pitch was thrown at and its vertical and horizontal accelerations. The two pitches that are closest together get merged. This process continues until all pitches are in clusters and the remaining clusters are far enough away from each other.
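To make the merging idea concrete, here is a rough sketch of that style of bottom-up clustering. The details are assumptions, not the exact rules used for the cards: each pitch is a (speed, horizontal acceleration, vertical acceleration) tuple, the distance is a plain Euclidean distance with optional per-axis scales, clusters are compared by their centroids, and `stop_distance` is a made-up cutoff for "far enough away".

```python
import math

def distance(a, b, scales=(1.0, 1.0, 1.0)):
    """Euclidean distance between two (speed, ax, az) points."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))

def centroid(cluster):
    """Average (speed, ax, az) of the pitches in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

def cluster_pitches(pitches, stop_distance):
    """Merge the two closest clusters until all are farther apart than stop_distance."""
    clusters = [[p] for p in pitches]  # every pitch starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:
            break  # remaining clusters are far enough apart
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Two hypothetical fastballs and two hypothetical curve balls.
pitches = [(92, -5, -12), (93, -6, -13), (78, 4, -35), (77, 5, -36)]
groups = cluster_pitches(pitches, stop_distance=5.0)
```

With these sample values the four pitches end up in two clusters of two. Comparing every pair on every pass is slow for a full season of pitches, so this is only meant to show the logic.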

Once the clusters are formed, the algorithm finds the pitcher's fastball. It does this by simply taking the cluster that has the highest speed. Once the fastball is found, every other cluster is compared to the fastball in speed and the two accelerations. Now the cluster algorithm is run again on the remaining clusters and pitch types are formed. By first comparing the pitches to the pitcher's fastball, Jamie Moyer's other pitches can be put on the same footing as Joel Zumaya's. The algorithm can't say these are curve balls, but it can put all the curve balls together and then I can label the group curve balls. Once this is done, it goes back to the fastballs we started with and reclassifies those, in case a pitcher only throws sinkers or cutters, for example.
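The fastball step can be sketched roughly like this. It assumes each cluster is a list of (speed, horizontal acceleration, vertical acceleration) tuples and that "highest speed" means the highest average speed. The function names are hypothetical.

```python
def cluster_mean(cluster):
    """Average (speed, ax, az) of a cluster of pitches."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

def relative_to_fastball(clusters):
    """Return the fastball cluster's index and every other cluster's offset from it."""
    means = [cluster_mean(c) for c in clusters]
    # Assumption: the fastball is the cluster with the highest average speed.
    fb_index = max(range(len(means)), key=lambda i: means[i][0])
    fastball = means[fb_index]
    offsets = [
        tuple(v - f for v, f in zip(m, fastball))
        for i, m in enumerate(means)
        if i != fb_index
    ]
    return fb_index, offsets

# Hypothetical pitcher with one slow breaking-ball cluster and one fastball cluster.
clusters = [
    [(78, 4, -35), (76, 6, -37)],
    [(92, -5, -12), (94, -7, -14)],
]
fb_index, offsets = relative_to_fastball(clusters)
```

The offsets (here 16 mph slower than the fastball, with different accelerations) are what would get pooled across pitchers and clustered a second time.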

Sadly, this algorithm is far from perfect and needs some human intervention. I have to hand edit about 40 pitchers who might have a splitter that looks like a sinker to the algorithm or a slider that looks like a cutter and so on. I have tried to check other references to make sure I have the right pitches for each pitcher but for many pitchers who have just thrown a few pitches in the big leagues this is particularly hard. If you are browsing the player cards and find something you think I got wrong please leave a comment below.

10 Comments:

At November 15, 2007 11:59 AM , Blogger Matt said...

Fantastic player cards. I think what Derek Lowe throws most often is called a sinker, not a splitter. I have never heard the announcers call it a splitter.

 
At November 16, 2007 10:06 AM , Blogger Josh Kalk said...

Yes, you are absolutely correct. Those are all sinkers, and that is one that got by me. So add another pitcher to the list that will need hand editing. The sinker/splitter distinction is by far the hardest one for the algorithm to handle. I have a slightly newer version of the code which might correct this by itself, but if it doesn't then I will again do a hand edit. Hopefully you will see new player cards this weekend or early next week.

 
At November 22, 2007 6:45 AM , Blogger Readercon said...

Josh, you're doing great work. I'll probably steal the correction code stuff.

However, I think there's a fundamental flaw in your pitch classification. And that's the seemingly obvious step of simply lumping all a pitcher's games together, even after adjusting for atmospheric conditions.

The problem is that a given pitch or pair of pitches will consistently break more on a good day than on a bad one, and this creates false continua that are causing your algorithm to lump clearly disparate pitches together. For instance, Dice-K's bad-day slider, good-day slider, bad-day curve, and good-day curve form a continuum that your algorithm identifies as "slider," but if you look at individual games, the slider and curve are always distinct from one another. Josh Beckett throws both a 4-seam fastball and 2-seamer (or sinker) and this is very clear (and important to his success!) from most single games but almost impossible to pick up from merged data.

I also wonder if your lumping algorithm is overly aggressive; the Joba card, for instance, has what seems to the naked eye to be 5 obvious curves and 3 obvious changes lumped as sliders.

I'm frankly not sure what the solution to this problem is! My guess is that we need a really terrific lumping algorithm that can be applied game-by-game. I'd be interested in discussing that with you.

Eric M. Van
emvan@post.harvard.edu

(logged in under a non-baseball organizational account!)

 
At November 23, 2007 2:58 PM , Blogger dan said...

I think the Felix Hernandez card needs hand editing. There's no way he throws 50% splitters and 5% fastballs.

http://baseball.bornbybits.com/plots/gifs/Felix_Hernandez3.gif

That shows the vertical break of the splitters and fastballs to be extremely similar, which they shouldn't be.

 
At November 24, 2007 10:44 AM , Blogger chip said...

One of the worst things that can happen to a pitcher is when his release point wanders, either telegraphing a particular pitch, having a bad day, or if mechanics have just shifted. At least that's something we always watched for with young pitchers.

The stability of release point in many of the pitchers you've cataloged is pretty amazing. I'm wondering about filtering on deviations from their preferred release point. Can you create a box cut in x and y and then look at the quality of pitches outside of that cut? Can it even be correlated with strikes/balls/walks/hits/losses? Such a warning sign could have real application, for defense and offense.
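A box cut like that could be sketched as follows. This is only an illustration under assumptions: release points are (x, y) pairs in feet, the "preferred" release point is taken to be the median, and the box half-widths are arbitrary.

```python
from statistics import median

def outside_release_box(release_points, half_width, half_height):
    """Flag each pitch whose release point falls outside a box around the median."""
    cx = median(x for x, _ in release_points)
    cy = median(y for _, y in release_points)
    return [
        abs(x - cx) > half_width or abs(y - cy) > half_height
        for x, y in release_points
    ]

# Hypothetical release points; the last one has wandered.
points = [(-1.5, 6.0), (-1.6, 6.1), (-1.4, 5.9), (-0.5, 5.0)]
flags = outside_release_box(points, half_width=0.3, half_height=0.3)
```

The flagged pitches could then be compared with the rest on ball/strike or hit outcomes.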

 
At November 25, 2007 7:01 PM , Blogger Andre said...

Great work, Josh. I'm wondering, though, if you can put the "break angle" on the form, too? Cuz I think it'd help us to understand the pitch better, knowing which direction it's breaking. Thanks.

 
At November 25, 2007 7:15 PM , Blogger Andre said...

btw, I believe that Smoltz's breaking pitches could be separated into slider and curve, while there's only a slider in his card.

 
At November 26, 2007 9:40 PM , Blogger Mike Fast said...

Josh, thanks a bunch for your explanations. I'm curious about the details of your air density correction. I assume you "correct" or adjust the accelerations according to F=ma, such that if the air density is greater, you need to increase the acceleration by an amount proportional to the difference between the air density on a given day and standard air density.

I understand how to do this for altitude, I think, but I'm not sure how you do it for temperature. Is it as simple as saying air density is inversely proportional to temperature, where temperature is measured from absolute zero?

 
At December 3, 2007 10:56 AM , Blogger Josh Kalk said...

Hey everyone,

Sorry for the slow reply I was away for a bit.

Eric,

You are absolutely correct that the algorithm is pretty aggressive, especially with pitchers with small samples. The problem is that if I loosen the algorithm, it causes huge problems for a lot of other pitchers. The algorithm will have to be rewritten this off-season, though, because I want the code to update overnight and it is far too slow to do that right now. I'll be emailing you shortly to chat more.

Dan,

Noted. Thanks a lot for that. I think that is solved in the latest version, which isn't updated on the cards yet. Hopefully in a few days that will be fixed. If you find more issues like this, please keep letting me know.

Chip,

That is a great idea. That is something that I will definitely add shortly to the web based tool. The cards are meant for a global view, and the web based tool is going to be for that kind of analysis. While I used to completely agree with you about the mechanics and release point, check out Jake Peavy. His release point seems to wander all over the place and it clearly isn't hurting him. Maybe he is just an exception.

Jake,

Because I have to recalculate the data, I will have to regenerate that before I post it. I will get around to that, but probably closer to the Christmas update. As for Smoltz, I will take a look at that. If the algorithm is putting them together, I will be hesitant to make any manual changes like that. Hopefully a new algorithm will be better.

Mike,

You are absolutely correct about the F=ma correction stuff. And the temperature correction can be found on the air density wiki page here: http://en.wikipedia.org/wiki/Air_density

The temperature indeed has to be in Kelvins and because of that I convert everything into metric to find the density before using it.
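For anyone who wants to try it, the dry-air formula from that page is density = pressure / (R * T), with R about 287.058 J/(kg K) and T in kelvin. Here is a small sketch of that calculation, assuming dry air (humidity ignored) and that you already have the absolute pressure for the park. The function names are just for illustration.

```python
R_DRY_AIR = 287.058  # specific gas constant for dry air, J/(kg*K)

def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvin."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

def air_density(pressure_pa, temp_f):
    """Dry-air density in kg/m^3 from the ideal gas law (humidity ignored)."""
    return pressure_pa / (R_DRY_AIR * fahrenheit_to_kelvin(temp_f))

# Standard conditions: sea-level pressure (101325 Pa) at 59 F.
standard = air_density(101325.0, 59.0)
# Same pressure on a 95 F day: warmer air is less dense.
hot_day = air_density(101325.0, 95.0)
```

At standard conditions this gives about 1.225 kg/m^3, and the ratio of the two densities is the factor that goes into the F=ma adjustment.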

 
At March 8, 2008 2:22 PM , Blogger Scott said...

This is very interesting. I have a question though. Given the Pitcher and the pitchfx info would I be able to query your database to determine the pitch type? Would your classification algorithm "know" (with a certain level of accuracy) what the pitch was?

 
