Player Cards
After a weekend of banging my head against the wall trying to figure out how to properly normalize the acceleration, I needed a break. So it is back to just looking at home pitches for pitchers. I decided to skip ahead to the next thing I wanted to do, which is upload some player cards. Basically, the plan was to use the PITCHf/x data to create a plot of the types of pitches each pitcher throws and then start to expand from there. What I needed was a clustering algorithm that could look at all the pitches thrown by a pitcher and classify them. I am not going to go into details about the algorithm as it still needs some fine tuning (as you will see below), but basically it examines every pitch and groups pitches with similar speed and movement into clusters. Once it has those clusters for each pitcher, it finds the pitcher's fastball and then calculates what his other pitches do in comparison to the fastball. It then compares his other offerings to those of all other pitchers and tries to guess what the other pitches are. Sometimes this algorithm performs well.
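The fastball comparison step can be sketched roughly as follows. This is only an illustration, not the actual algorithm: it assumes the clusters already exist, that each pitch is a (speed, horizontal break, vertical break) tuple, and that the fastball is simply the cluster with the highest average speed. The function names and the example numbers are made up.

```python
from statistics import mean


def cluster_centers(clusters):
    """Average (speed, horizontal break, vertical break) for each cluster."""
    return {
        label: tuple(mean(pitch[i] for pitch in pitches) for i in range(3))
        for label, pitches in clusters.items()
    }


def offsets_from_fastball(clusters):
    """Return the fastball label and every other cluster's offset from it."""
    centers = cluster_centers(clusters)
    # Assumption: the fastball is the cluster with the highest average speed.
    fastball = max(centers, key=lambda label: centers[label][0])
    fb = centers[fastball]
    offsets = {
        label: tuple(center[i] - fb[i] for i in range(3))
        for label, center in centers.items()
        if label != fastball
    }
    return fastball, offsets


# Invented pitches: (speed in mph, horizontal break, vertical break in inches).
example = {
    "a": [(96.0, -5.0, 10.0), (98.0, -7.0, 12.0)],
    "b": [(86.0, 3.0, 1.0), (88.0, 5.0, 3.0)],
}
fastball, offsets = offsets_from_fastball(example)
print(fastball, offsets)
```

Working with offsets instead of raw values is what lets one pitcher's slower, breaking cluster be compared to another pitcher's, even if the two throw at very different speeds.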
First, these plots show the movement of the pitch, not the location. For a great description of what exactly this means, read this excellent article by John Walsh. This is exactly what you would expect from the hard-throwing Broxton. He has a great four-seam fastball and what can be a devastating slider. His change, though, is a work in progress. It doesn't have nearly the same movement as his fastball, which helps tip the pitch to opposing batters. Because of this, you can see he doesn't throw it very often.
Sometimes, though, the algorithm can get messed up. This mostly happens in two ways. First, the clustering gets overactive and combines two pitches that really are different.
Saito appears to be throwing two varieties of fastballs (two-seamer? cutter?) but the clustering algorithm combines them into one type. This mostly occurs when the speeds of the two pitches are very close. You can see that Saito's splitter and his curve are about as far apart as the two fastballs, but the algorithm correctly separated them. The other failure is that the algorithm will sometimes misidentify a pitch.
It is my understanding that Oswalt throws a slider, not a split-finger fastball, but to the algorithm the pitch seems to move more like Saito's split-finger fastball than Broxton's slider. Also, one pitch that Oswalt threw didn't seem to match up to anything and just got left out. Judging from the movement on the pitch, it probably is a fastball, but it could be a change. Missing one pitch from Oswalt really isn't a problem, but a few pitchers have whole clusters of pitches that aren't combined with anything. Rich Hill is an example of this.
Again, Rich Hill throws a slider, not a splitter, but the horizontal movement gets the pitch classified as a splitter. If the group of unidentified pitches were added in, maybe the pitch would be correctly identified. Sometimes all hell breaks loose and the algorithm falls apart.
Here is the great Greg Maddux, who apparently throws nothing but fastballs. So what is going on here? Well, the clustering algorithm really needs some space between the types of pitches, and Maddux doesn't really provide any. What I mean by that is Maddux will throw his fastball at a wide range of speeds. The low end of that range is very close to the high end of the velocity on his change. This provides a bridge for the clustering algorithm to lump them all together. The unknown points in the bottom right are some type of off-speed pitch, but it is unclear what. Lastly, we can look at the worst case scenario, the knuckleball.
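As an aside, the bridging effect described for Maddux is easy to reproduce with a toy example. The sketch below is not the algorithm used for these plots. It assumes a simple rule that links any two pitches whose speeds are within a fixed gap of each other (one-dimensional single-linkage clustering with a cutoff), and the speeds are invented, not Maddux's actual velocities.

```python
def gap_clusters(speeds, max_gap):
    """Group speeds, starting a new group whenever the gap to the previous
    (sorted) speed exceeds max_gap. This is 1-D single linkage with a cutoff."""
    groups = []
    for speed in sorted(speeds):
        if groups and speed - groups[-1][-1] <= max_gap:
            groups[-1].append(speed)
        else:
            groups.append([speed])
    return groups


# Invented speeds in mph: a changeup group and a fastball group with space between.
separated = [78, 79, 80, 88, 89, 90]
# The same two groups, plus pitches at every speed in between (the "bridge").
bridged = list(range(78, 91))

print(len(gap_clusters(separated, max_gap=1.5)))  # two groups
print(len(gap_clusters(bridged, max_gap=1.5)))    # one group
```

With a clear gap the two groups stay apart, but once there is a pitch at every speed in between, each pitch links to its neighbor and everything collapses into one cluster.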
Here the algorithm really doesn't have a chance. It does a good job of separating the knuckleball from the fastballs, and most of the knuckleballs are grouped together, with a few wrongly grouped at the edges. The problem comes in comparing Wakefield to other pitchers. Without any other knuckleballers in the league to compare him to, the algorithm is lost and just throws out a guess and calls the pitch a slider.
Anyway, here is where you come in. I have uploaded a plot for every pitcher who has thrown more than 100 pitches in their home park while PITCHf/x was on. If your favorite pitcher is missing, don't worry; hopefully I will soon have a good league correction and can add in the away stats. What I need is for you to look over the plots and tell me where the algorithm has messed up. If the algorithm has combined two lumps of pitches that you think should be separated, let me know in the comments below. If the algorithm has incorrectly identified a group, let me know. If there is something aesthetically unpleasing about the graphs, or if there is something you would like to see me add to them, let me know. If you would rather email me than add a comment, my email can be found under my profile to the right.
The whole process, from downloading the data to producing the plots, takes nearly half a day. The clustering algorithm itself takes over three hours on my super fast desktop. The moral of the story is that I am going to stick with this data set for at least a few more days as I try to hammer out the kinks in the algorithm. Maybe this weekend I will do a full update.

7 Comments:
Josh, I took a look at Greg Maddux using my analysis techniques. The slow pitch is his curveball. I can pick out his changeup pretty well. It's a little tougher finding the boundary between his two-seam fastball and his cutter. He's also got a slider that he uses occasionally that I can pick out pretty well.
I've still got one unknown group of pitches that's close to the cutter and slider. I'm not sure if it belongs to one or the other or if it's a distinct pitch.
This graph is a point on a work in progress, but it shows you basically what I've found so far.
http://fastballs.files.wordpress.com/2007/09/maddux_sep10_speed_vs_spin_direction.jpg
My remaining loose ends are finding the best way to quantify the boundary between the two-seamer and the cutter and identifying the unknown group of pitches.
You also might want to check out my list of articles that have already been written about various pitchers:
http://fastballs.wordpress.com/2007/09/01/enhanced-gameday-analysis-cataloged-by-pitcher/
Also, ultxmxpx has classified pitch types for about 130 pitchers on his website:
http://theuniverseas.com/baseball/conrate.html
He doesn't disclose his method, but I find his data can be useful as a check to make sure I'm not missing something.
Thanks for those links Mike. That will certainly help me check my results.
Mike,
I loved your most recent post on your site. I tried to leave a comment but after I pushed the enter button the comment got erased and nothing got posted. This actually has happened a few times and since I don't have your email address I am posting this here hoping you will see it.
josh
Hi Josh...interesting site. I was going to start taking a look at PITCHf/x data myself in the next few months, and stumbled my way here while trying to see what has already been done.
So, since you want feedback on other players, I perused a few, and noticed some funny stuff going on with Dice-K.
Specifically, he has a "slider" that doesn't break and that, movement-wise, looks identical to a grouping of fastballs (gyroball? or maybe just a BP fastball). If I were to take a guess, with more data this particular cluster would probably later be identified as a fastball.
Since just the break numbers are up, and not velocity, I'd also tend to think that what is defined as his splitter may be 2 or even 3 separate pitches. The grouping below the fastball I would probably call a changeup.
As for the "unknowns"...shuuto?
Probably better just to leave them as unknown.
Anyway, just wanted to throw stuff out there, and compliment you on your site.
Cheers,
Ike H.
Thanks for the compliment, Ike. Dice-K is an interesting pitcher to look at, and I am not surprised by the problems that the algorithm is having with him. Besides the things you mention, the PITCHf/x system at Fenway appears to be the least calibrated system in the league. I am only using home pitches here partly to give the algorithm a chance to group pitches properly, but it appears that the Fenway pitches are losing up to half of their true break due to the poor calibration. If that is the case, then that slider/gyroball/BP fastball may be something completely different and only look like a slider to the algorithm because the incorrectly assigned break looks like the break other pitchers get on their sliders.
The new and improved algorithm is now correctly identifying Dice-K's change though and it appears you were 100% right on that. I am holding off on publishing that data until I can get this release point issue fixed though.
Josh, there are proper cluster analysis techniques available. I'm not sure of your level of statistical savvy, but any good textbook on multivariate analysis will have a chapter on it. Some of the higher order stat programs (SPSS, SAS) also have built-in algorithms for this work.
Josh, I sent you an email but haven't heard anything back from you. You can contact me at mikefast@gmail.com.
My feeling on the first pass of your algorithm is that you were identifying way too many things as splitters. I haven't found many pitchers to throw that pitch as their offspeed pitch.
In the case of Zack Greinke, you were lumping his slider, curveball, and changeup under the label of splitter.
Maybe you've already corrected this with newer revisions of your algorithm.