It's the most wonderful time of the year. A time of Cinderella stories, buzzer beaters, and rooting for a team you've never heard of to beat a perennial powerhouse. Yes, I'm talking about March Madness. I'd be lying if I said I won't be glued to a TV or checking scores for the next few days (weeks if my team can make a run). This year I decided to do more than observe; I wanted to dig a bit deeper to see just how crazy these tournaments can be. So, let's have our one shining moment with some March Madness analytics.
My analysis was motivated by upsets in the tournament. To explore them, I needed historical data that captures the seed of each team in the NCAA tournament. It turns out the Washington Post has a pretty great database covering the past 33 years of the tournament.
With the data in hand, I began by calculating the average seed in each round across tournament history. I then compared that average to the expected average for that round, meaning the average you would see if the better seed won every game. For example, in the Elite Eight you'd expect each of the four regions to send its 1 and 2 seeds, so the average would be 1.5. These values are depicted in the chart below as a red line (actual round average) and green line (expected round average).
As you can see, there are some big differences between the expected and actual average of each round. It becomes apparent why these tournaments are so exciting: low seeds can really come out of nowhere and shift the landscape. The gap tends to widen as the tournament goes on, which is likely a product of fewer teams remaining and therefore more volatility in the average. Some interesting years also start to emerge. In the 2011 Final Four, UCONN was the best seed at 3, with VCU the lowest at 11. In the 2000 Final Four, Michigan State was the only 1 seed, joined by two 8 seeds. And finally, in 2014 UCONN and Kentucky faced off in the National Championship as 7 and 8 seeds, respectively.
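For anyone who wants to reproduce the expected line, here is a minimal Python sketch of that calculation. It assumes a 64-team bracket with four regions seeded 1 through 16 and no upsets; the function names and round labels are my own for illustration and do not come from the Washington Post data.

```python
# Sketch: expected average seed per round if the better seed always wins.
# Assumes a 64-team bracket: four regions, each seeded 1 through 16.

ROUNDS = [
    "First Round",
    "Second Round",
    "Sweet Sixteen",
    "Elite Eight",
    "Final Four",
    "National Championship",
]


def expected_average_seeds(seeds_per_region=16):
    """Return {round name: average seed} assuming no upsets."""
    expected = {}
    remaining = seeds_per_region  # seeds still alive in each region
    for name in ROUNDS:
        # With no upsets, the survivors in a region are seeds 1..remaining,
        # and the mean of 1..n is (n + 1) / 2.
        expected[name] = (remaining + 1) / 2
        # Each round halves the field; once only the 1 seeds are left,
        # the average stays at 1.
        remaining = max(1, remaining // 2)
    return expected


def upset_gap(actual_average, expected_average):
    """How far the actual average seed sits above the no-upset average."""
    return actual_average - expected_average


if __name__ == "__main__":
    for name, avg in expected_average_seeds().items():
        print(f"{name}: {avg}")
```

As a quick check against the years mentioned above, the 2014 title game between a 7 seed and an 8 seed averages 7.5, so upset_gap(7.5, expected_average_seeds()["National Championship"]) comes out to 6.5.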
All of this got me wondering: how do the individual seeds typically perform? To answer this question, I put together a distribution of the seeds over the past 33 years for each round. The bars show the percentage of teams at each seed that have reached that round.
Some of this information is well known. For instance, every 1 seed has made it to the second round. More interesting is that 12 seeds in the second round are not far behind 11 seeds but are well ahead of 13 seeds. It looks like there is something to the excitement of those 5/12 seed matchups. The Sweet Sixteen also shows some interesting patterns. There is a pretty steady decline until the 10, 11, and 12 seeds, which appear almost as frequently as a 7 seed in that round. In the Elite Eight, the 6 seeds also interrupt the downward trend. Then the 8 and 11 seeds stand out in the Final Four. Finally, three 8 seeds have reached the National Championship, with one winning it all (Villanova).
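The percentages behind those bars boil down to a simple count. Here is a rough Python sketch, assuming each seed line has four entrants per year (one per region, ignoring play-in games) over the 33 years covered. The function name and input layout are hypothetical, and the sample uses only the two counts stated above rather than the full dataset.

```python
from collections import Counter, defaultdict

YEARS = 33          # tournaments covered, per the post
TEAMS_PER_SEED = 4  # assumption: one team per seed line in each of four regions


def seed_share_by_round(appearances, years=YEARS, teams_per_seed=TEAMS_PER_SEED):
    """Percentage of each seed's entrants that reached each round.

    appearances: iterable of (round_name, seed) pairs, one pair for every
    team that played in that round.
    """
    entrants = years * teams_per_seed  # teams that ever held a given seed
    counts = defaultdict(Counter)
    for round_name, seed in appearances:
        counts[round_name][seed] += 1
    return {
        round_name: {
            seed: 100 * n / entrants for seed, n in sorted(by_seed.items())
        }
        for round_name, by_seed in counts.items()
    }


# Only the two counts mentioned in the post: every 1 seed (33 * 4 = 132)
# reached the second round, and three 8 seeds reached the title game.
sample = [("Second Round", 1)] * 132 + [("National Championship", 8)] * 3
shares = seed_share_by_round(sample)
print(shares["Second Round"][1])                     # 100.0
print(round(shares["National Championship"][8], 1))  # 2.3
```

Dividing by the number of entrants at each seed, rather than by the number of teams in the round, is what lets you read a bar as "how often does a team at this seed get this far."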
The cool thing about all this data is that it can actually be useful for your office bracket pool! The first chart, on average seed by round, gives you an idea of what the landscape should look like. The seed distribution helps identify matchups that may produce upsets. So, if you'll excuse me, I'm off to fill out my bracket at the last minute.