STEVE NASH: Building the Model

I've decided to ring in the new year by building a half-complex, half-simplistic model to statistically assess prior information about the season and come up with, essentially, a preseason view of where things stand in the league post-free agency and before a single game is played. I won't dither any longer, though. This is a short appendix outlining the creation of the model and how I came to some of my projection decisions.

• • •

BUILDING THE MODEL

This is rather wonky. I wouldn't particularly blame you if you simply skipped this section entirely and assumed STEVE NASH is nothing but a magic box that gives numbers. While the model itself isn't particularly complicated, explaining how I came to it is. I started this project with some extremely complicated models I was going to experiment with, but I ran out of time and (frankly) decided it was best to put out a decent model now and a better one next year, after a summer of further research. As it stands, the model is a series of segmented linear regressions. It uses the Four Factors variables along with some new and interesting manually calculated variables. Some of the newly introduced variables (a rough code sketch of several of them follows the list):

  • Game Score per minute (recurring players). This is a calculation of the game score of all players that are holdovers from the year before, adjusted by their minutes into a per-minute metric. Game score is a box score aggregate statistic created by John Hollinger, useful for my purposes because I had a personal database available with simple box score statistics going back to 1992, but not more advanced box scores. Because it's a simple equation, you can apply it to a full season of games without much trouble, and turning the total into a per-minute metric is elementary. The equation, for the curious:
GMSC = 1.0 * PTS + 0.4 * FG - 0.7 * FGA - 0.4 * (FTA - FT) + 0.7 * ORB +
       0.3 * DRB + 1.0 * STL + 0.7 * AST + 0.7 * BLK - 0.4 * PF - 1.0 * TOV
  • Game Score per minute (new players). By that same token, this is a calculation of the previous year's game score per minute of all players that are new offseason additions to the team.
  • Continuation percentage. The percentage of last year's minutes played by the remaining players on the team -- that is to say, if a team had two players who combined to play 20% of their team's minutes in 2011 and that team lost only those players in the offseason, their continuation percentage would be 80%.
  • Minutes played in the previous year (new players). This ended up being, interestingly enough, the strongest "free agency" related variable of all the ones I'm enumerating here. If a team's new additions played a lot of minutes in the previous year, it's likely they're good, and it's likely the team that added them will improve the next year. That's my understanding of the correlation, at least.
  • Weighted Age. I created two different metrics meant to weight a team's age -- the first simply weights it by minutes, multiplying each player's minutes played by his age and dividing the sum total by the total minutes available (48 * 5 * # of Games). The second weights it by the percentage of the team's collective game score the player was "responsible" for -- i.e., if one player produced 2000 game score in a season for a team whose collective game score is an atrocious 3000, that player's age would be weighted at 2/3, with all the other players' weights summing to 1/3, ending with an average age weighted by game score contributions. Rather than picking one as better than the other, I made both available to the variable selection process and found that while almost every model liked age as one of the influential variables, different segments liked different weightings.
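
To make the bookkeeping above concrete, here's a minimal sketch in R of how several of these variables fall out of a player-season table. Everything here is hypothetical scaffolding -- the data frame players and all of its column names (team, season, returning, prev_min, age, MIN, n_games, and the box score columns from the game score equation) are stand-ins for illustration, not my actual database schema.

library(dplyr)

# Hollinger game score for every player-season row.
players <- players %>%
  mutate(GMSC = PTS + 0.4 * FG - 0.7 * FGA - 0.4 * (FTA - FT) +
                0.7 * ORB + 0.3 * DRB + STL + 0.7 * AST + 0.7 * BLK -
                0.4 * PF - TOV)

team_vars <- players %>%
  group_by(team, season) %>%
  summarise(
    # Game score per minute, split by holdovers vs. offseason additions.
    gmsc_per_min_recur = sum(GMSC[returning]) / sum(MIN[returning]),
    gmsc_per_min_new   = sum(GMSC[!returning]) / sum(MIN[!returning]),
    # Continuation percentage: share of last year's minutes still on the roster.
    cpct = sum(prev_min[returning]) / sum(prev_min),
    # Weighted age, once against total minutes available (48 * 5 * games)
    # and once by each player's share of the team's collective game score.
    age_by_min  = sum(age * MIN) / (48 * 5 * first(n_games)),
    age_by_gmsc = sum(age * GMSC) / sum(GMSC),
    .groups = "drop"
  )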

After calculating these variables (along with an indicator for whether the team was flying with a new coach that season -- a variable that essentially serves to up the variance among some segments), I decided to test several segmentations to try to increase model fit and predictive accuracy on my holdout populations (1999 and 2011), and to allow a varied selection of variables for each segment-level model. While several variables weren't, on their face, valuable to the model-fitting process, I was able to find one segmentation that consistently outperformed the others -- a 2x2x2 segmentation on previous-year SRS, continuation percentage, and game score added. After manipulating my segmentation cuts to give me a roughly even distribution (using zero as my cut point for SRS, 0.73 for continuation percentage, and 1847 for game score added), I arrived at eight segments -- here are the sample sizes in each segment-level population:

                             HI CPCT   LO CPCT
   LOW GMSC ADDED     - SRS      100        43
                      + SRS       74        52
  HIGH GMSC ADDED     - SRS       66        65
                      + SRS       52        97

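For what it's worth, the segment assignment itself is trivial. A sketch, again with hypothetical names -- a team-season frame teams carrying prev_srs, cpct, and gmsc_added columns:

# Assign each team-season to one of the eight segments using the cut
# points above: 0 for SRS, 0.73 for continuation %, 1847 for GMSC added.
teams$segment <- interaction(
  ifelse(teams$prev_srs   >= 0,    "+SRS",    "-SRS"),
  ifelse(teams$cpct       >= 0.73, "HI_CPCT", "LO_CPCT"),
  ifelse(teams$gmsc_added >= 1847, "HI_GMSC", "LO_GMSC"),
  sep = "."
)
table(teams$segment)  # should roughly reproduce the cell counts above
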
Once I had arrived at my segmentations, I chose to fit models applying different variable selections to different segments, in an attempt to help the model differentiate between the lot of them. To do this, I used one of my favorite standby tools for variable selection in predictive models -- the BAS package in R (which is currently offline, so that link is to a cached archive of the page), built around a highly flexible Bayesian Adaptive Sampling algorithm (courtesy of Merlise Clyde). Using Clyde's tool, you can quickly compare the log marginal likelihoods of a large number of unique models and examine each variable's inclusion percentage post-sampling. Here's an example of a BMA output plot, which is a visual representation of the variable selections made for the top 20 models in this run of the bas.lm() function.
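
For the curious, a bare-bones version of that selection step looks something like the sketch below. The formula is a placeholder -- I'm pretending the target is the team's season SRS and throwing every variable from the list above at it -- but the BAS calls themselves are the real ones:

library(BAS)

# Sample the model space over all candidate variables for one segment;
# segment_data here is a hypothetical frame of team-seasons in that segment.
fit <- bas.lm(srs ~ gmsc_per_min_recur + gmsc_per_min_new + cpct +
                new_min_prev + age_by_min + age_by_gmsc + new_coach,
              data = segment_data,
              prior = "ZS-null", modelprior = uniform(), method = "BAS")

summary(fit)                 # posterior inclusion probabilities, best models
image(fit, top.models = 20)  # the BMA plot: variable inclusion across the top 20 models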

Then, after I had picked my variables?

I... basically copped out and did simple linear regressions on my chosen variables for the segment-level buckets. I know what my statistician friends are thinking here. And I agree. It's pretty ridiculous that I'd go to all this trouble to get proper segmentation, new variables to describe free agency at the team level, and adaptive Bayesian methods for my variable selections... only to default to the most textbook model I could possibly fit. And I realize that's silly. Truth be told? Sort of ran out of time, sort of ran out of motivation, a little of this, a little of that. My original intent was to use a Markov chain type model once I got to the segment-stage models, or perhaps a Gaussian process model like the one I used for my thesis. In the end, I decided that I really needed to push this prediction model out sooner rather than later. I'll probably revisit this in the postseason and roll out a completely different type of model before next season, using some of my variable calculations and segmentation methods. I feel they're worthy of better.
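
So the final stage, in sketch form, is nothing fancier than this -- one lm() per segment, each fed whatever variable list the BAS run favored for that segment. The two formulas below are illustrative, not my actual selections:

# One textbook linear regression per segment, with per-segment variable lists.
segment_formulas <- list(
  "+SRS.HI_CPCT.LO_GMSC" = srs ~ gmsc_per_min_recur + cpct + age_by_min,
  "-SRS.LO_CPCT.HI_GMSC" = srs ~ gmsc_per_min_new + new_min_prev + age_by_gmsc
)
segment_models <- lapply(names(segment_formulas), function(seg)
  lm(segment_formulas[[seg]], data = subset(teams, segment == seg)))
names(segment_models) <- names(segment_formulas)

# Projecting a team is then just routing it through its segment's model:
# predict(segment_models[[as.character(new_team$segment)]], newdata = new_team)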

But as it stands, a linear regression was the easiest thing to do. And the thing is? It fit everything rather well. Perhaps not spot-on as predictions (and with some idiosyncrasies), but as a prior expectations model for preseason projection purposes? There's not a lot to fault here; it's an easy model to monitor, and it's an easy model for me to build on later. I have a lot of plans for STEVE NASH and for making it better down the road, but for now, it's good enough for rock and roll. Click here for the spreadsheet of initial predictions. Good day, all.