Who does a compressed season "help"? (A stats-heavy followup.)

Posted on Fri 02 December 2011 in 2012 Season Preview by Aaron McGuire

About a week ago, Zach Lowe reached out to his followers on Twitter to ask if anyone would run some numbers for him, and I decided to follow through on it. He wanted to see some simple correlations between team win percentage and the offensive and defensive four factors. He used the numbers to back the main claim of a piece that dropped earlier this week, where he came to the well-supported conclusion that we have no real idea what kinds of teams a compressed season helps or hurts. At this point, we may as well assume the season proceeds as normal, because we don't know what predicts performance in a shortened season. I essentially told him that, although there were some tertiary trends that seemed marginally predictive, the stats weren't telling us anything valuable. There weren't any jarringly common statistical differences between teams that did well in the lockout season and teams that did poorly.

Something about those specious, tertiary trends bugged me, though. I thought there might be more to it than the numbers were showing. So I expanded the amount of data I was working with, did some spreadsheet wrangling, and tried to tease out a few more predictive metrics for figuring out the win percentage in a lockout season versus the win percentage in a non-lockout season. This post walks through my analysis, shares the data, and comes to a few key conclusions that supplement Zach's excellent piece. So, dally no longer. Let's dig in. All sheet/cell references point to the main spreadsheet I made for the analysis, which I've uploaded to Google Docs for your reading pleasure. You know. If you like that sort of thing.

• • •

Part I: Correlations within seasons.

For this part, turn to Sheet "C1" in the spreadsheet.

This was where it all started. Zach's initial request was for some within-season correlations, between a team's win percentage and the four factors, for the lockout-compressed 1999 season. He also wanted a few years around it, attempting to see if anything was more or less predictive in 1999 than it was for other years. A fair question. There are more years here than there were when I initially gave Zach the results (in my attempts to broaden the dataset for part two of this analysis), but the basic trend is the same. Nothing really stands out all that much. To wit, compare 1999 correlations with the average correlations among all other years in the sample:

               --------- OFFENSIVE ---------   --------- DEFENSIVE ---------
       Pace    eFG%    TOV%    ORB%    FT/FG   eFG%    TOV%    DRB%    FT/FG
 AVG  -0.151   0.689  -0.417   0.018   0.224  -0.685   0.093   0.440  -0.270
1999  -0.286   0.553   0.051   0.034   0.374  -0.738  -0.139   0.483  -0.448

See what I mean? Every one of 1999's values that falls two standard deviations or more outside the population average is highlighted in red. That's right -- offensive turnover percentage is the only one. And I admit, it's a bit strange -- in 1999, offensive turnover percentage was (very slightly) positively correlated with win percentage, which means that teams with a higher turnover percentage actually tended to win a bit more, not less. A very odd result. But not really a notable one, without any further context. Probably worth looking into (as I eventually did), but not really worth calling the be-all and end-all of lockout impacts. At this point, I sent the analysis to Zach and he used it as support for his article.
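For anyone who would rather reproduce this step outside a spreadsheet, here is a minimal Python sketch of a within-season correlation: one Pearson coefficient per factor, computed across the teams of a single season. The function names, the dictionary keys, and the four-team toy season are all illustrative assumptions, not values from the sheet.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def within_season_correlations(teams, factors):
    """Correlate each factor with win percentage across one season's teams.

    teams: list of dicts, one per team, each holding 'win_pct' plus factor keys.
    """
    wins = [t["win_pct"] for t in teams]
    return {f: pearson([t[f] for t in teams], wins) for f in factors}

# Made-up numbers for four hypothetical teams. NOT real 1999 data.
toy_season = [
    {"win_pct": 0.70, "off_efg": 0.50, "off_tov": 0.130},
    {"win_pct": 0.55, "off_efg": 0.48, "off_tov": 0.140},
    {"win_pct": 0.45, "off_efg": 0.47, "off_tov": 0.145},
    {"win_pct": 0.30, "off_efg": 0.45, "off_tov": 0.155},
]
result = within_season_correlations(toy_season, ["off_efg", "off_tov"])
print(result)
```

In the toy season, shooting tracks winning and turnovers track losing, which is the "normal" pattern the AVG row describes. Running the same function once per season gives one row of correlations per year.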

Then, this week, I got a little bit deeper.

Part II: Comparing season averages.

For this part, turn to Sheet "AVG" in the spreadsheet.

I wanted to dig a little bit deeper into the statistics. So I decided to ignore correlations to win percentage for a bit, and see if there are any key differences between season averages. This is a more customary analysis, and there are many people who have done quite well at it themselves. Here, though, I was just looking for some basic numbers. There was a little more to go on in terms of isolating 1999's differentiating factors here, though not much. To wit, let's again compare the averages for several key statistics between non-1999 seasons and 1999:

               --------- OFFENSIVE ---------   --------- DEFENSIVE ---------
       Pace    eFG%    TOV%    ORB%    FT/FG   eFG%    TOV%    DRB%    FT/FG
 AVG  91.683   0.488   0.140   0.291   0.236   0.488   0.140   0.709   0.236
1999  88.917   0.466   0.146   0.301   0.241   0.465   0.146   0.698   0.241

Same deal -- numbers outside two standard deviations of the population average are highlighted in red. While eFG% is the only "true" outlier here, pace should probably be highlighted too. If you remove the two outlier fast-paced years of 1992 and 1993 (96.6 and 96.7 respectively) from the analysis, pace is well beyond the two standard deviation threshold. Which fits expectations: 1999 is without any real comparison in terms of how slow it was -- it's over a full possession slower than any other year in the dataset. I'd also turn your attention to turnover percentage, which is insignificantly above the average for the other years -- this isn't particularly important, but does sort of point to one of the reasons why basketball was so odd and nigh-unwatchable in 1999 (if the games I've seen from that season are any indication). Everyone, even good teams that weren't usually associated with doing so, was turning the ball over at a slightly higher rate than usual. Which wouldn't make basketball unwatchable on its face.
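The two standard deviation check, and the way it flips once the fast years are dropped, can be sketched in a few lines of Python. Only the 1992, 1993, and 1999 pace figures below come from this post. Every other season value is a round placeholder chosen to illustrate the mechanics, and the function names are assumptions as well.

```python
from statistics import mean, stdev

def z_score(value, reference):
    """Number of sample standard deviations between value and the reference mean."""
    return (value - mean(reference)) / stdev(reference)

def is_outlier(value, reference, threshold=2.0):
    """True when value sits at least `threshold` standard deviations from the mean."""
    return abs(z_score(value, reference)) >= threshold

PACE_1999 = 88.917  # from the table above

# League-average pace by season, excluding 1999.
pace_by_year = {
    1992: 96.6,  # from the post
    1993: 96.7,  # from the post
    # Everything below is a PLACEHOLDER, not the real league average.
    1994: 92.0, 1995: 91.0, 1997: 90.5, 1998: 91.5, 2000: 90.0, 2001: 91.0,
}

all_years = list(pace_by_year.values())
trimmed = [v for year, v in pace_by_year.items() if year not in (1992, 1993)]

print(round(z_score(PACE_1999, all_years), 2), is_outlier(PACE_1999, all_years))
print(round(z_score(PACE_1999, trimmed), 2), is_outlier(PACE_1999, trimmed))
```

The point of the sketch is the mechanism: two unusually fast seasons inflate the standard deviation, so a slow year can hide inside the interval until they are removed.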

Usually, though, the majority of the high turnover teams were the singularly bad teams in the league. In 1999, that wasn't necessarily the case. You had the crummy ball control usually kept solely to the lower-tier teams being played by upper tier teams as well. I mean, hell -- the New York Knicks had the 3rd highest turnover rate in the league, and they made the finals. The Utah Jazz had the highest, and they won 70% of their games! Both odd in a normal season, but entirely par for the course in 1999. Hence, the difference in the average wasn't huge, but the distribution of what teams were turning the ball over more often was skewed far more towards teams that got TV air-time and playoff dap than in any other season. This contributes a lot to the general consensus that 1999 had the worst basketball ever played. And while it may not be entirely accurate on an aggregated leaguewide level, there's no doubting that the distribution of teams with the traits of crummy teams was skewed in such a way that the best teams in the league were sharing traits that the crummy teams usually kept to themselves.

This isn't really relevant to the broader analysis here at all, but I think it's interesting and worth noting. What is relevant to the analysis at this point is that even though pace was barely within the two standard deviation threshold, I had some intuition here. I was curious if we could be seeing something of a joint effect between the relatively high (though within range) correlation between slow pace and winning in 1999 and the raw average pace being so incredibly low. So, I followed up on that.

Part III: Correlations between seasons.

For this part, turn to Sheet "C2" in the spreadsheet.

Here's the meat of the post -- the correlation structure that made me go "eureka" and start cleaning this up to post-quality levels. In this part, I realized that within-season correlations, while useful, aren't really what we're going for here. What we actually want? We're trying to get predictive analytics -- we want to find the key barometers of a successful or unsuccessful lockout team from the season before. So in this part, what I did was find the correlations between the four factors from the season before and win percentage in the current season. I excluded 1996 and 2005 (expansion years for the Raptors, Grizzlies, and Bobcats) from the analysis because I didn't particularly want to bias the results towards the mean for those years, and I did all of this in Excel, where my usual missing data tricks are less easy to use. Anyway. As before, here's a table comparing the average values among correlations for non-1999 seasons and the correlations in 1999. Lo and behold, we finally get something useful.

               --------- OFFENSIVE ---------   --------- DEFENSIVE ---------
       Pace    eFG%    TOV%    ORB%    FT/FG   eFG%    TOV%    DRB%    FT/FG
 AVG  -0.103   0.568  -0.348   0.015   0.163  -0.575   0.069   0.338  -0.248
1999  -0.455   0.413   0.127  -0.081   0.475  -0.584  -0.197   0.433  -0.173

And here's the big reveal. Two correlations were well outside the aggregate interval for 1999 -- pace, and turnover percentage. Meaning the effect that a team's performance in 1998 had on the 1999 season was markedly different than in previous and future seasons -- in the case of pace, quite a bit more intense. And in the case of turnover percentage? Completely the opposite direction. Teams that were slow paced in 1998 were far better than expected in 1999, while teams that were fast paced were for the most part worse. Looking at the raw data confirms this and actually makes the correlation here look like an understatement, because there are two big outliers (the 1999 Sacramento Kings and the 1999 Chicago Bulls) that have a lot of leverage on the results. Take those two out, and the correlation balloons beyond -0.6, which is pushing beyond eFG% range in terms of predictive value.
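The between-season setup boils down to pairing each team's factor from the prior season with its win percentage in the current one, then correlating the pairs. Here is a hedged Python sketch of that pairing. The table layout, the function names, and the hypothetical teams A through E are assumptions for illustration. The actual analysis was done in a spreadsheet.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_pairs(table, factor, year):
    """Pair each team's `factor` from year - 1 with its win percentage in `year`.

    table maps (team, season) -> dict of stats.
    Teams without a prior-season row (e.g. expansion teams) are skipped.
    """
    pairs = {}
    for (team, season), stats in table.items():
        if season != year:
            continue
        prev = table.get((team, year - 1))
        if prev is not None:
            pairs[team] = (prev[factor], stats["win_pct"])
    return pairs

# Hypothetical teams with made-up numbers, purely to show the pairing.
toy = {
    ("A", 1998): {"pace": 88.0, "win_pct": 0.50},
    ("A", 1999): {"pace": 86.0, "win_pct": 0.62},
    ("B", 1998): {"pace": 90.0, "win_pct": 0.55},
    ("B", 1999): {"pace": 88.5, "win_pct": 0.56},
    ("C", 1998): {"pace": 93.0, "win_pct": 0.60},
    ("C", 1999): {"pace": 90.0, "win_pct": 0.48},
    ("D", 1998): {"pace": 95.0, "win_pct": 0.45},
    ("D", 1999): {"pace": 91.0, "win_pct": 0.30},
    ("E", 1999): {"pace": 89.0, "win_pct": 0.40},  # no 1998 row, so skipped
}

pairs = lagged_pairs(toy, "pace", 1999)
prev_pace = [p for p, _ in pairs.values()]
wins = [w for _, w in pairs.values()]
r = pearson(prev_pace, wins)
print(sorted(pairs), round(r, 3))
```

Skipping teams with no prior season is one simple way to handle missing rows. It is a different choice from dropping entire expansion years, which is what the spreadsheet analysis did.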

Even if you don't exclude those two outliers, though, the numbers look pretty stark. If you take one stat away from this post, make it this one: of the 14 teams that played at an above-average pace in 1998, a full 10 of them had a decreased winning percentage in the 1999 season. And of the 15 teams who played at a below-average pace in 1998? A full 12 of them had the same or better win percentage in 1999.
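In rate terms, that's 10 of 14 (about 71%) fast teams getting worse and 12 of 15 (80%) slow teams holding steady or improving. The tally itself is simple enough to sketch in Python. The field names and the five hypothetical teams are assumptions, and a team sitting exactly on the average pace would land in the "slow" group under this particular rule.

```python
def pace_split(teams):
    """Count win-percentage outcomes for teams above vs. at-or-below average prior pace.

    teams: list of dicts with 'prev_pace', 'prev_win_pct' and 'win_pct'.
    """
    avg_pace = sum(t["prev_pace"] for t in teams) / len(teams)
    tally = {
        "fast": {"declined": 0, "same_or_better": 0},
        "slow": {"declined": 0, "same_or_better": 0},
    }
    for t in teams:
        group = "fast" if t["prev_pace"] > avg_pace else "slow"
        outcome = "declined" if t["win_pct"] < t["prev_win_pct"] else "same_or_better"
        tally[group][outcome] += 1
    return tally

# Hypothetical teams with made-up numbers. NOT the real 1998/1999 league.
toy_teams = [
    {"team": "A", "prev_pace": 95.0, "prev_win_pct": 0.60, "win_pct": 0.50},
    {"team": "B", "prev_pace": 94.0, "prev_win_pct": 0.40, "win_pct": 0.45},
    {"team": "C", "prev_pace": 89.0, "prev_win_pct": 0.50, "win_pct": 0.50},
    {"team": "D", "prev_pace": 88.0, "prev_win_pct": 0.55, "win_pct": 0.65},
    {"team": "E", "prev_pace": 90.0, "prev_win_pct": 0.70, "win_pct": 0.60},
]

tally = pace_split(toy_teams)
print(tally)
```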

That's a big factor, and it points to (quite possibly) the defining effect of season compression: in 1999's compressed season, the type of basketball your team was used to playing was important. Run-and-gun offenses were harmed by the compression, while slowdown offenses prospered in adapting to the generalized decrease in speed of the game caused by the compression. Compounding this, there's the strange turnover percentage correlation. I was really confused about this at first, but realized after a while that it could be a result of the 1998 Bulls losing Jordan. The 1998 Bulls had a turnover percentage of 0.133 and a winning percentage of 0.756, while the 1999 Bulls had a turnover percentage of 0.151 to a winning percentage of 0.260. A pretty big difference, that.

To test what kind of effect eliminating the explicable Jordan-caused outlier would have, I changed the 1998 Bulls to a turnover percentage of 0.151 (dang, MJ, stop losin' the ball!) in an alternate spreadsheet and checked the correlations. It's still positive and well outside the standard deviation, but it's a more reasonable 0.026, indicating that turnover percentage isn't wildly positively correlated, just slightly. Which makes a bit more sense. You see tics like that all the time in odd datasets like this, but the initial result (a 0.127 POSITIVE correlation?) was simply too weird to not have some kind of strange outlier affecting the result. I'd also note that taking out the Bulls actually makes the pace factor slightly larger -- the Bulls were one of only three 1998 below-pace teams that got worse in 1999, so removing them from the dataset actually raises the correlation to -0.521. I would've put the Jordan-lacking numbers in the comparison, but I felt it would be better to explain why I took them out with the context of what I was seeing rather than just stating it as a prior fact. Throwing away data is bad, folks.
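A more systematic version of this kind of outlier check is a leave-one-out pass: drop each team in turn, recompute the correlation, and see which single team moves the number the most. This is a generic sensitivity technique, not what the alternate spreadsheet did (that one edited a single Bulls value by hand). The function names and the five hypothetical teams below are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation(pairs):
    """Correlation over a dict of team -> (x, y)."""
    return pearson([x for x, _ in pairs.values()], [y for _, y in pairs.values()])

def leave_one_out(pairs):
    """Return team -> correlation recomputed with that one team removed."""
    return {
        dropped: correlation({t: v for t, v in pairs.items() if t != dropped})
        for dropped in pairs
    }

# Hypothetical (prior-season pace, current win pct) pairs. Made-up numbers.
# "OUT" is a deliberately planted outlier: fast the year before, yet very good.
toy_pairs = {
    "A": (88.0, 0.62),
    "B": (90.0, 0.56),
    "C": (93.0, 0.48),
    "D": (95.0, 0.30),
    "OUT": (96.0, 0.70),
}

full = correlation(toy_pairs)
loo = leave_one_out(toy_pairs)
most_influential = min(loo, key=loo.get)  # the drop that makes r most negative
print(round(full, 3), most_influential, round(loo[most_influential], 3))
```

In the toy data, one fast-but-good team is enough to mask a strong negative relationship. That is the same kind of leverage the Kings and Bulls exert on the real 1999 pace correlation.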

Part IV: Conclusion & Followups

So, overall takeaway? Zach isn't really wrong in his excellent post, let's start with that. This lockout may create a new trend, and the lessons of the past aren't strictly prescriptive. While 66 games crammed into the window of December 25th to May 1st is pretty bad, I'm not positive it's as bad as the insane pace the 1999 season was played at. It's comparable, but certainly not worse and possibly not on the same order of magnitude as bad, if they set up the schedule correctly. My analysis here doesn't even begin to touch the way that prevailing sentiment somehow says that the lockout will advantage the following types of teams:

  1. Young teams, because they have "young legs."
  2. Old teams, because they have "experience."
  3. Strong systems, because they "don't need training camp."
  4. Weak systems with talented players, because they "don't need training camp."
  5. Samardo Samuels, because he "went to St. Benedicts."

In short, depending on who you ask, the lockout is either going to irrevocably destroy every team in the league's mojo or grant them infinite advantages over everyone else. Not contradictory in the slightest! But the short answer is that we honestly don't know what exogenous factors are going to prove to be the defining things that advantage and disadvantage teams in this lockout. Zach is absolutely right about this.

But by looking a bit deeper into what happened directly before the lockout, we can tease out one big conclusion. Teams with high pace last season (the Raptors, Knicks, and Suns are big offenders here) have a reasonably good chance of starting day one at a disadvantage relative to teams that operated at a slow pace. We're going to see some outliers to this rule, because they're always there, but in the last lockout this was a relatively hard and fast rule. Teams that played fast in 1998 played worse in 1999. Teams that played slow in 1998 played better in 1999. Not hard to understand, nor is it altogether unintuitive -- fast paced teams traditionally do worse on back-to-backs, and the compressed season means less recovery time between games for teams that try to push the pace.

Regardless, we probably could use a bit more analysis here. In particular, I'd like to do a ridge regression predicting season stats from the previous season's four factors stats -- there's enough data for it, and using the lockout data for a test sample could help put in context any joint effects among the covariates. But that's enough for today. This result is useful enough to post without having done the regressions yet, I think.
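For reference, here is roughly what a ridge fit looks like in plain Python: ordinary least squares with a penalty `lam` added to the diagonal of the normal equations, which shrinks the coefficients toward zero. This is only a sketch under stated assumptions and is not the planned analysis. The helper names are hypothetical, the data are synthetic, and a real run would want the four factors put on comparable scales first.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Closed-form ridge regression on centered data; the intercept is not penalized.

    Solves (Xc'Xc + lam * I) beta = Xc'yc. Features are NOT rescaled here.
    """
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    coefs = solve(A, b)
    intercept = y_mean - sum(c * m for c, m in zip(coefs, x_means))
    return intercept, coefs

# Synthetic data built so that y = 1 + 2*x1 - 1*x2 exactly. NOT basketball data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
y = [1.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in X]

ols_intercept, ols_coefs = ridge_fit(X, y, lam=0.0)       # no penalty: plain OLS
ridge_intercept, ridge_coefs = ridge_fit(X, y, lam=10.0)  # penalized: shrunken
print(ols_coefs, ridge_coefs)
```

With `lam=0.0` the fit recovers the coefficients the synthetic data was built from, and a larger `lam` pulls them toward zero. In practice the penalty would be chosen by holding out seasons, and the lockout year could serve as the test sample.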

• • •

Feel free to poke around the sheet, comment on the conclusions, and note the things I've forgotten to mention. Because this is unfortunately a pretty slipshod analysis right now, and there's plenty of room to critique the methodology here. But I think it's useful, and worth a look. Please let me know if you have any followup questions in the comments, because I'll probably take a revised look at this sheet and the data this weekend with R and get some higher order models built for the sake of teasing out any joint effects. Until then? Keep it real, readers.