A March Madness Analysis

by Navneeth Babra, Rohan Bharadwaj, Maxwell Newman

Introduction

The National Collegiate Athletic Association (NCAA) governs most college sports in the United States. In college basketball, the NCAA Division I tournament (also known as March Madness) is a single-elimination, bracket-style tournament featuring 68 college teams. Teams compete across 7 rounds for the national championship. Given the attention college basketball gets every year, many people invest time in predicting the tournament bracket and bet on their bracket matching the final one. Read here for information about the odds of a perfect bracket: https://www.ncaa.com/webview/news%3Abasketball-men%3Abracketiq%3A2020-01-15%3Aperfect-ncaa-bracket-absurd-odds-march-madness-dream.

Data science has changed how sports betting works, with some algorithms proving more accurate than experts. More information about how data science has influenced betting strategy: https://towardsdatascience.com/making-big-bucks-with-a-data-driven-sports-betting-strategy-6c21a6869171.

In this tutorial, our goal is to showcase two correlations: between winrate and bracket placement, and between in-game statistics (3-pointers, turnovers, etc.) and winrate. Along the way, we also try to find which in-game statistics influence winrate. Together, these correlations should give us a better understanding of what influences a team's winrate and bracket placement, and in turn what determines the overall bracket.

Part 1 -- Analyzing Factors That Help Analysts Make Bracket Predictions

To understand why data science is useful for making predictions in sports, it helps to focus on a single variable for predicting brackets. In this case, we will analyze seasonal winrate and its relation to tournament placement.

Data Scraping

Seasonal data

Seasonal data for the NCAA was acquired from a Kaggle dataset and is separated by year.

1a. Determining the correlation between winrate and bracket placement

Prediction: Winrate will be a very good indicator for selecting top 16 placements.

Many people who create March Madness brackets base the majority of their predictions on which teams had the highest winrates that season. Let us examine how effective this strategy would have been last year. (We can't use this year because the 2020 tournament was canceled due to COVID-19.) We begin by calculating the win percentage for each team during the 2018 and 2019 seasons, and use the 16 teams with the highest win percentages as our predictions for the top 16 placers. We will then take our March Madness bracket game data and create a list of the true top 16. Finally, we will compare this list to our predicted top 16 and look at the results.

Function for getting the bracket placement data based on a given year:

We call this function whenever we want the top 16 results for a given year.

This function works by iterating through the bracket game data that we scraped and counting the number of wins and total games for each team in the bracket. We then sort the teams by their number of wins to get the placements.
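The counting-and-sorting step described above might look roughly like the following. This is a minimal sketch, not the original function: the name `bracket_placements` and the assumption that each game is a `(winner, loser)` pair are illustrative, and the scraped data may be laid out differently.

```python
from collections import Counter

def bracket_placements(games, top_n=16):
    """Rank tournament teams by number of bracket wins.

    `games` is an iterable of (winner, loser) pairs; this layout is an
    assumption, not the actual schema of the scraped data.
    """
    wins = Counter()
    played = Counter()
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    # Sort by wins (descending); break ties alphabetically for a stable order.
    ranked = sorted(played, key=lambda team: (-wins[team], team))
    return [(team, wins[team], played[team]) for team in ranked[:top_n]]

# Toy bracket: A beats B, C beats D, then A beats C.
toy_games = [("A", "B"), ("C", "D"), ("A", "C")]
print(bracket_placements(toy_games, top_n=2))
```

Because a single-elimination bracket eliminates a team on its first loss, sorting by wins is equivalent to sorting by how far each team advanced.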

Next, we will get the data from the 2019 season, determine each team's winrate for the season, and select the top 16. Our 'rough' prediction is that the teams with the highest winrates will be the final 16 in the March Madness bracket.

2019 top16 Predictions based off of winrate

Here we take the season data, calculate the total winrate for each team for that year, and then use the teams with the highest winrates as our prediction for the top 16.
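As a rough illustration of this step, the sketch below computes each team's winrate from a list of season games and picks the highest-rated teams. The `(winner, loser)` game layout and the helper names `season_win_rates`, `predict_top`, and `overlap` are assumptions made for the example, not the names used in the original notebook.

```python
from collections import defaultdict

def season_win_rates(games):
    """Return {team: wins / games_played} from (winner, loser) pairs."""
    record = defaultdict(lambda: [0, 0])  # team -> [wins, games played]
    for winner, loser in games:
        record[winner][0] += 1
        record[winner][1] += 1
        record[loser][1] += 1
    return {team: w / g for team, (w, g) in record.items()}

def predict_top(games, n=16):
    """Pick the n teams with the highest winrate (ties broken by name)."""
    rates = season_win_rates(games)
    return sorted(rates, key=lambda team: (-rates[team], team))[:n]

def overlap(predicted, actual):
    """Count how many predicted teams appear in the actual list."""
    return len(set(predicted) & set(actual))

# Toy season with three teams.
toy_season = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
print(predict_top(toy_season, n=2))
```

Comparing a prediction against the true top 16 then reduces to a set intersection, which is what `overlap` does.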

2019 Actual top16 Bracket

Here we get the actual top 16 teams from our scraped data. We can see that we got 5/16 teams correct in our prediction.

2018 top16 Predictions based off of winrate

Here we take the season data, calculate the total winrate for each team for that year, and then use the teams with the highest winrates as our prediction for the top 16.

2018 Actual top16 Bracket

Here we get the actual top 16 teams from our scraped data. We can see that we got 4/16 teams correct in our prediction.

1a (cont). Results: 5/16 Correctly Predicted For 2019, 4/16 Correctly Predicted For 2018

Out of the 16 teams we selected for 2018, 4 ended up making the top 16. For 2019, 5 of the 16 teams we selected made it to the top 16. Despite these low numbers, these are really good final 16 picks given that over 350 teams play during the season and compete for a place in the bracket! To put this in perspective, by our estimate the probability of randomly guessing at least 4 of the top 16 teams is 0.0002%, and at least 5 of the top 16 teams is 0.00001%. The same reasoning applies to other years, but it is not necessary to show them. If there truly were no correlation between winrate and March Madness placement, we estimate the odds of randomly guessing at least 25% of the top 16 two years in a row at 4e-12, or 0.0000000004%. And keep in mind, we predicted better than 25% for 2019.
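A random-guess baseline of this kind can be cross-checked with a hypergeometric tail probability. The sketch below assumes that 16 teams are picked uniformly at random, without replacement, from a pool of 350, and asks how likely it is that at least k of them fall among the 16 top placers. The function name and the default pool size are illustrative assumptions; depending on how the baseline is defined, the result may differ from the figures quoted above.

```python
from math import comb

def prob_at_least(k, n_teams=350, top=16, picks=16):
    """P(at least k of `picks` random picks land among the `top` teams).

    Hypergeometric tail: picks are drawn uniformly without replacement
    from `n_teams` teams (350 is an assumed pool size).
    """
    total = comb(n_teams, picks)
    hits = sum(
        comb(top, i) * comb(n_teams - top, picks - i)
        for i in range(k, min(top, picks) + 1)
    )
    return hits / total

p4 = prob_at_least(4)
p5 = prob_at_least(5)
print(f"P(at least 4 correct) = {p4:.6f}")
print(f"P(at least 5 correct) = {p5:.6f}")
# If the two seasons are treated as independent, the joint chance is the product.
print(f"P(at least 4 correct two years in a row) = {p4 * p4:.3e}")
```

Treating the two seasons as independent is itself a simplification, since team strength carries over from year to year.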

We want to see how correlated seasonal winrate is with placement. To do so, we create a linear regression model using seasonal winrate as the dependent variable and bracket placement as the independent variable.

1b. Using Linear Regression to determine Correlation of Seasonal Winrate to March Madness Bracket Placement

The small p-value and moderately small r-squared value indicate that there is a correlation, but not a tight one. This is because a multitude of factors, not only seasonal winrate, influence bracket placement. However, winrate is still a good identifier of trends in teams' placement.
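For readers who want to see what the regression computes, here is a minimal pure-Python sketch of a one-variable least-squares fit and its r-squared value. The function name and the toy numbers are illustrative assumptions, not the notebook's data. A p-value is not computed here; a library routine such as `scipy.stats.linregress` reports one alongside the slope and r-value.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up values: bracket wins (x) against seasonal winrate (y).
bracket_wins = [0, 1, 2, 3, 4, 6]
winrate = [0.62, 0.70, 0.66, 0.81, 0.78, 0.88]
slope, intercept, r2 = simple_linear_regression(bracket_wins, winrate)
print(f"slope={slope:.3f} intercept={intercept:.3f} r_squared={r2:.3f}")
```

An r-squared well below 1 with a small p-value is the pattern described above: a real but loose relationship.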

Let's take a quick look at the winrates of the top 16 teams in the years before this bracket to see if there are any trends.

1c. Winrates Over The Last 5 Years For 2019's Top16 Teams

Here we take our season data and calculate the winrates for each team. We then take the top 16 teams for 2019 and plot year vs. winrate to see the trend.

We can see that these top 16 teams had varying winrates in the years prior to 2019, and then in 2018-2019 their winrates shot up, with the majority above 80%. This leads to an interesting question: How are these teams increasing their winrates? Perhaps more interestingly, how do certain stats (3-pointers, turnovers, etc.) correlate with winrate?

1d. Determining the correlation between game stats (turnovers, 3pointers, etc) and winrate

Since we have taken a look at how winrate affects bracketeering, let's take a look at how game stats affect winrate.

First, we must calculate the winrates from the seasonal data for the last 5 years.

Great.

Now, for this linear regression model, we want to see the correlation between in-game statistics and seasonal winrate. We are using 7 in-game statistics: effective field goal percentage (EFG_O), turnover rate (TOR), steal rate (TORD), free throw rate (FTR), two-point shooting rate (2P_O), three-point shooting rate (3P_O), and adjusted tempo (ADJ_T).

We believe these are the most 'valuable' in-game statistics for contributing to seasonal winrate, since we know that relatively high rates of ball possession and scoring lead to more wins.

We can see here that seasonal winrate is strongly dependent on the team's in-game statistics (the r-squared value is much higher than for winrate vs. bracket placement). This makes a lot of sense, since the outcome of a game depends on points scored, and points scored depend heavily on these in-game statistics.
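A multiple regression of this kind can be sketched with NumPy's least-squares solver. The snippet below is an illustration under assumptions: the helper name `fit_ols` is hypothetical, the feature matrix stands in for the seven in-game statistics, and the toy numbers are made up rather than taken from the dataset.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on the columns of X plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, 1 - ss_res / ss_tot

# Toy data: two made-up features per team and a made-up winrate.
features = [[50.1, 18.2], [52.3, 17.5], [48.7, 20.1], [55.0, 16.0], [51.2, 19.0]]
winrate = [0.55, 0.68, 0.41, 0.83, 0.57]
coef, r2 = fit_ols(features, winrate)
print("coefficients:", np.round(coef, 4), "r_squared:", round(float(r2), 3))
```

With the real data, each row would hold one team-season and each column one of the statistics listed above (EFG_O, TOR, and so on).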

1e. Consistent Top 16 Placers

Another piece of information that may influence bracketeers is consistency. Our goal here is to determine which teams make consistent progress in the bracket over the years. To do this, we will find and plot the most consistent teams. We define "consistent" teams as those with at least 4 top 16 placements within a 5-year period.
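The definition above can be checked with a sliding window over the yearly top 16 lists. This is a minimal sketch: the function name, the `{year: [teams]}` input layout, and the placeholder team names are assumptions for illustration.

```python
def consistent_teams(top16_by_year, window=5, min_placements=4):
    """Teams with at least `min_placements` top-16 finishes in some
    run of `window` consecutive years.

    `top16_by_year` maps a year to the teams that placed top 16 that year.
    """
    if not top16_by_year:
        return []
    first, last = min(top16_by_year), max(top16_by_year)
    # If fewer than `window` years are available, use a single window.
    last_start = max(first, last - window + 1)
    found = set()
    for start in range(first, last_start + 1):
        counts = {}
        for year in range(start, start + window):
            for team in top16_by_year.get(year, ()):
                counts[team] = counts.get(team, 0) + 1
        found.update(t for t, c in counts.items() if c >= min_placements)
    return sorted(found)

# Placeholder data: Team A places every year, Team B only some years.
toy = {
    2014: ["Team A", "Team B"],
    2015: ["Team A", "Team B"],
    2016: ["Team A"],
    2017: ["Team A", "Team B"],
    2018: ["Team A"],
    2019: ["Team B"],
}
print(consistent_teams(toy))
```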

These are the most consistent top 16 placement teams throughout the years 2009-2018. Each of these teams placed in the top 16 at least 4 times in some 5-year period. We can see that the field is relatively stable from 2010 to 2015, when many of the same teams were fighting for a spot in the top 16. After 2015, Gonzaga emerges as the most consistent top 16 team by far, being the first to hit 5 consecutive years of placement.

NOTE: a sudden stop in a team's trend indicates that the team did not place in the top 16 after that year.

1f. Seeding Matchups

Here, we want to explore how often each seed has been able to beat its matchup, and vice-versa.
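One way to tabulate seed matchups is to count, for every pair of seeds that met, how often the better (lower-numbered) seed won. The sketch below assumes each game is recorded as a `(winner_seed, loser_seed)` pair; that layout and the function name are illustrative assumptions, and the sample games are made up.

```python
from collections import Counter

def seed_matchup_win_rates(games):
    """Return {(better_seed, worse_seed): share of games won by better_seed}.

    `games` is an iterable of (winner_seed, loser_seed) pairs.
    """
    totals = Counter()
    better_seed_wins = Counter()
    for winner_seed, loser_seed in games:
        if winner_seed == loser_seed:
            continue  # same-seed games have no favorite to track
        pair = (min(winner_seed, loser_seed), max(winner_seed, loser_seed))
        totals[pair] += 1
        if winner_seed < loser_seed:
            better_seed_wins[pair] += 1
    return {pair: better_seed_wins[pair] / totals[pair] for pair in totals}

# Made-up results: one 16-over-1 upset and one 12-over-5 upset.
toy_games = [(1, 16), (1, 16), (16, 1), (5, 12), (12, 5)]
for pair, rate in sorted(seed_matchup_win_rates(toy_games).items()):
    print(pair, round(rate, 3))
```

One minus each rate gives the upset frequency for that matchup, which covers the "vice-versa" direction.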

Part 2 -- Winner Prediction

Now that we have explored the potential correlations between game stats and bracket placements, it's time to make a winner predictor. To get a more accurate predictor, we are going to need more than 5 years of data, so we must scrape ESPN to get more seasonal data on team stats.

Data Collection

Like the game data, this data was collected from ESPN. It is collected on an as-needed basis: only teams that competed in the March Madness bracket in a given year have their data scraped for that year.

Linear Regression

We will first use a linear regression model to establish a benchmark for a more robust prediction model. A multiple linear regression model will also enable us to identify which features explain the variation in point spread, which can help us understand which features explain the variation in win probability.

Explaining the regression

As we can see from the regression table, the difference in seeds explains a great deal of the difference between team scores. This makes sense, as the seed number somewhat encodes skill: teams with lower seed numbers are usually at the top of their respective leagues.

Logistic Regression

While linear regression works well in predicting point spread, we care about win probabilities between teams. As a result we will use a logistic regression to classify whether Team 1 will beat Team 2 based on the difference in their average team metrics.
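To make the idea concrete, here is a small from-scratch logistic regression trained with gradient descent on feature differences between Team 1 and Team 2. It is a sketch under assumptions, not the model used in this tutorial: the function names, learning rate, epoch count, and the single made-up feature (seed difference) are all illustrative. In practice a library such as scikit-learn would normally be used.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * v for w, v in zip(weights, row))
            error = sigmoid(z) - label
            grad_b += error
            for j, v in enumerate(row):
                grad_w[j] += error * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def win_probability(weights, bias, diff):
    """P(Team 1 beats Team 2) given their feature differences."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, diff)))

# Made-up games: feature = Team 1 seed minus Team 2 seed; label 1 = Team 1 won.
X = [[-15], [-7], [-3], [-1], [1], [3], [7], [15]]
y = [1, 1, 1, 0, 1, 0, 0, 0]
weights, bias = fit_logistic(X, y)
print(round(win_probability(weights, bias, [-10]), 3))
print(round(win_probability(weights, bias, [10]), 3))
```

Using differences rather than raw team metrics keeps the model symmetric: swapping Team 1 and Team 2 flips the sign of every feature.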

Confusion Matrix

When running a logistic regression, accuracy alone does not provide enough information about the model, because it does not show how the correct and incorrect decisions are distributed. Therefore, a confusion matrix is often used to compute precision and recall, two other metrics that better explain how effectively a model classifies.
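As a brief illustration, precision and recall follow directly from the four cells of the confusion matrix. The helper names and the toy labels below are assumptions for the example; label 1 means "Team 1 wins".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example of a model that predicts "win" too often.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 1, 1, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))
print(precision_recall(actual, predicted))
```

A model that predicts wins too readily catches nearly every true win (high recall) but also labels many losses as wins (lower precision).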

As we can see, our recall is much higher than our precision, suggesting that our model was overpredicting wins. This could be due to the statistically significant variable of seed difference, which is likely a key contributor in predictions.

Bracket Calculation

The fun of March Madness comes from predicting an accurate bracket based on the teams that have been seeded. This is a very fun experience as college basketball fans and statisticians alike try their hardest to pick the perfect bracket. However, the odds of picking a perfect bracket are about 1 in 120 billion, making it all the more fun.

Model shortcomings

Based on the trained logistic regression model, our bracket predicted that the 1 seed would win every round in each of the 2019 regions. This is most likely because seed number correlates so strongly with win probability that it dominates our other features. Had we omitted that variable, the model might have been more effective at identifying win probability.
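The behavior described above is easy to reproduce with a toy bracket filler that always advances the team with the higher predicted win probability. In this sketch, `simulate_region` and `seed_only_prob` are hypothetical helpers, and the probability function depends only on seed difference with an arbitrary weight, standing in for a model dominated by that one feature. The first-round pairing order is the standard 1 vs. 16, 8 vs. 9, and so on.

```python
import math

# Standard first-round pairing order within one region.
FIRST_ROUND_ORDER = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]

def seed_only_prob(seed_a, seed_b, weight=0.15):
    """Hypothetical P(seed_a beats seed_b) driven only by seed difference."""
    return 1.0 / (1.0 + math.exp(-weight * (seed_b - seed_a)))

def simulate_region(win_prob, order=FIRST_ROUND_ORDER):
    """Advance the predicted favorite of every game until one seed remains.

    Returns the list of surviving seeds after each round.
    """
    rounds = []
    field = list(order)
    while len(field) > 1:
        field = [
            a if win_prob(a, b) >= 0.5 else b
            for a, b in zip(field[0::2], field[1::2])
        ]
        rounds.append(field)
    return rounds

for survivors in simulate_region(seed_only_prob):
    print(survivors)
```

Because the favorite always advances, any model whose probabilities are monotonic in seed difference produces the same chalk bracket, regardless of the other features.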

Further work

Our current model works in a static fashion: it does not take into account how teams place within the tournament as games progress, which is likely a good predictor of future win probability. Additionally, seasonal stats seem to be quite ineffective at explaining the variance in win probability, and there are likely more granular features that could explain it better and even track it across time. Ultimately, the challenge will be to avoid overcomplicating the model and introducing confounding variables, and instead to understand which team metrics best predict success in a tournament.

Conclusion

March Madness is a tough competition to predict. There are too many variables affecting the outcome, and in a single-elimination tournament, a Black Swan event is likely to alter the entire prediction. Ultimately, a static model like the one we used here is not going to be a useful predictor of such tournaments, and a model that updates win probabilities as the tournament progresses would likely be a better predictor.