The NCAA (National Collegiate Athletic Association) governs the majority of college sports. In college basketball, the NCAA Division I tournament (also known as March Madness) is a single-elimination, bracket-style tournament featuring 68 college teams. Teams compete across 7 rounds for the national championship. Given the attention college basketball draws every year, many people get invested in predicting brackets for the tournament, and some bet on their bracket matching the final results. For more on the odds of a perfect bracket, see: https://www.ncaa.com/webview/news%3Abasketball-men%3Abracketiq%3A2020-01-15%3Aperfect-ncaa-bracket-absurd-odds-march-madness-dream.
Data science has changed how sports betting functions, with some algorithms proving more accurate than experts. For more on how data science has influenced betting strategy, see: https://towardsdatascience.com/making-big-bucks-with-a-data-driven-sports-betting-strategy-6c21a6869171.
In this tutorial, our goal is to showcase the correlation between winrate and bracket placement, as well as between in-game statistics (3-pointers, turnovers, etc.) and winrate. Through this process, we also try to determine which in-game statistics influence winrate. Together, these two correlations should give us a better understanding of what drives a team's winrate and bracket placement, and in turn the overall bracket.
To understand why data science is useful for making predictions in the sports realm, it helps to focus on a single variable for predicting bracket outcomes. In this case, we will analyze seasonal winrate and its relation to tournament placement.
from bs4 import BeautifulSoup as bs
import requests as rq
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16,9]
def parse_match(match, year):
    # Split the match element's text into the team line and the score line
    match_string = match.get_text().split('\n\n')
    teams = match_string[0]
    score = match_string[1]
    # Each team entry looks like "<seed> <name>"
    team_vals = re.findall("(\d+) (.+)", teams)
    seeds = [int(v[0]) for v in team_vals]
    team_names = [v[1] for v in team_vals]
    scores = [int(s) for s in score.split('\n')]
    # The round number is encoded in the element's CSS classes
    class_str = " ".join(match.get('class'))
    ids = [int(re.findall("\/id\/(.+)\/", m.get('href'))[0]) for m in match.find_all('a')]
    bracket_round = re.findall('round(\d)', class_str)
    match_info = {"year": year, "team1_name": team_names[0], "team2_name": team_names[1],
                  "team1_id": ids[0], "team2_id": ids[1], "team1_seed": seeds[0], "team2_seed": seeds[1],
                  "team1_score": scores[0], "team2_score": scores[1],
                  "match_round": int(bracket_round[0]) if len(bracket_round) > 0 else 0}
    return match_info
def get_matches(year):
    url = "http://www.espn.com/mens-college-basketball/tournament/bracket/_/id/{}22/".format(year)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    r = rq.get(url, headers=headers)
    page = r.text
    soup = bs(page, 'html.parser')
    # Normalize the markup so get_text() yields predictable separators
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for dt in soup.find_all("dt"):
        dt.append('\n\n')
    matches = soup.find('div', {'id': 'bracket'}).select('.match')
    matches_list = []
    for match in matches:
        matches_list.append(parse_match(match, year))
    return matches_list
def generate_game_data():
    data = []
    for year in range(2009, 2020):
        data.append(pd.DataFrame(get_matches(year)))
    return pd.concat(data).sort_values(by=['year', 'match_round']).reset_index(drop=True)
game_data = generate_game_data()
game_data
 | year | team1_name | team2_name | team1_id | team2_id | team1_seed | team2_seed | team1_score | team2_score | match_round
---|---|---|---|---|---|---|---|---|---|---
0 | 2009 | Louisville | Morehead State | 97 | 2413 | 1 | 16 | 74 | 54 | 1 |
1 | 2009 | Ohio State | Siena | 194 | 2561 | 8 | 9 | 72 | 74 | 1 |
2 | 2009 | Utah | Arizona | 254 | 12 | 5 | 12 | 71 | 84 | 1 |
3 | 2009 | Wake Forest | Cleveland State | 154 | 325 | 4 | 13 | 69 | 84 | 1 |
4 | 2009 | West Virginia | Dayton | 277 | 2168 | 6 | 11 | 60 | 68 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
725 | 2019 | Virginia | Purdue | 258 | 2509 | 1 | 3 | 80 | 75 | 4 |
726 | 2019 | Auburn | Kentucky | 2 | 96 | 5 | 2 | 77 | 71 | 4 |
727 | 2019 | Michigan State | Texas Tech | 127 | 2641 | 2 | 3 | 51 | 61 | 5 |
728 | 2019 | Virginia | Auburn | 258 | 2 | 1 | 5 | 63 | 62 | 5 |
729 | 2019 | Texas Tech | Virginia | 2641 | 258 | 3 | 1 | 77 | 85 | 6 |
730 rows × 10 columns
Seasonal data for the NCAA was acquired from a Kaggle dataset and is separated by year.
seasondata = pd.read_csv('cbb.csv')
seasondata2019 = pd.read_csv('cbb19.csv')
seasondata2018 = pd.read_csv('cbb18.csv')
seasondata2017 = pd.read_csv('cbb17.csv')
seasondata2016 = pd.read_csv('cbb16.csv')
seasondata2015 = pd.read_csv('cbb15.csv')
team_stats = pd.read_csv('team_stats.csv')
team_stats
 | Unnamed: 0 | id | year | games_played | points | rebounds | assists | steals | blocks | turnovers | fg_pct | ft_pct | 3P_pct
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 9 | 2009 | 33.0 | 69.5 | 31.3 | 15.6 | 5.9 | 2.1 | 11.5 | 48.2 | 74.0 | 36.6 |
1 | 1 | 12 | 2009 | 32.0 | 72.2 | 33.9 | 14.4 | 6.0 | 3.1 | 12.6 | 47.5 | 73.4 | 39.5 |
2 | 2 | 25 | 2009 | 32.0 | 75.0 | 33.7 | 15.5 | 4.9 | 2.0 | 12.0 | 48.5 | 75.6 | 43.4 |
3 | 3 | 26 | 2009 | 33.0 | 76.0 | 33.0 | 15.7 | 8.4 | 3.2 | 12.4 | 49.4 | 72.1 | 39.8 |
4 | 4 | 30 | 2009 | 33.0 | 68.4 | 35.8 | 12.4 | 6.3 | 4.7 | 14.0 | 47.4 | 66.6 | 32.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
736 | 736 | 2608 | 2019 | 33.0 | 72.9 | 34.7 | 10.1 | 6.0 | 2.5 | 10.6 | 47.4 | 74.5 | 37.8 |
737 | 737 | 2633 | 2019 | 33.0 | 82.3 | 37.7 | 18.5 | 6.1 | 5.4 | 11.0 | 49.8 | 76.2 | 35.8 |
738 | 738 | 2641 | 2019 | 32.0 | 73.1 | 34.3 | 14.0 | 7.4 | 4.9 | 12.4 | 47.2 | 72.8 | 36.8 |
739 | 739 | 2670 | 2019 | 32.0 | 71.4 | 36.8 | 13.9 | 8.0 | 4.5 | 13.9 | 44.2 | 69.8 | 30.7 |
740 | 740 | 2747 | 2019 | 33.0 | 83.0 | 35.5 | 15.4 | 6.9 | 3.0 | 10.9 | 49.3 | 69.9 | 41.6 |
741 rows × 13 columns
Many people who create their March Madness brackets base the majority of their predictions on the highest win rates among the teams playing that season. Let's examine how effective this strategy would have been last year. (We can't do this year, because the 2020 tournament was canceled due to COVID-19.) We begin by calculating the win percentage of each team during the 2018 and 2019 seasons and use the leading teams as our predictions for the top 16 finishers. We will then take our March Madness bracket game data, build a list of the true top 16, and compare it against our predictions.
The function below gets the top 16 results for a given year. It works by iterating through the bracket game data we scraped, counting the number of wins and total games for each team in the bracket, and then sorting teams by those counts to produce the placements.
def getBracketPlacement(year):
    teams = {}
    for row in game_data.iterrows():
        row = row[1]
        if row['year'] != year:
            continue
        # Figure out which team won this match
        team1 = team2 = 0
        if int(row['team1_score']) > int(row['team2_score']):
            team1 += 1
        else:
            team2 += 1
        if row['team1_name'] not in teams:
            teams[row['team1_name']] = [0, 0]
        if row['team2_name'] not in teams:
            teams[row['team2_name']] = [0, 0]
        # Track [wins, games played] for each team
        teams[row['team1_name']][0] += team1
        teams[row['team2_name']][0] += team2
        teams[row['team1_name']][1] += 1
        teams[row['team2_name']][1] += 1
    # Sort by (wins, games played), descending, to get placements
    return sorted(teams.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
Next, we take the 2019 season data, calculate the total win rate for each team, and select the 16 teams with the highest win rates. This is our 'rough' prediction for the final 16 of the March Madness bracket.
# Compute each team's seasonal winrate and list our 16 picks
seasondata2019['winrate'] = seasondata2019['W'] / seasondata2019['G']
seasondata2019.sort_values(by=['winrate'], ascending=False).head(16)['TEAM']
25                Wofford
1                Virginia
22                Buffalo
47         New Mexico St.
48             Murray St.
61      Abilene Christian
0                 Gonzaga
10                Houston
49                Liberty
50              UC Irvine
7               Tennessee
114        UNC Greensboro
27                 Nevada
42                Belmont
2                    Duke
84                 Furman
Name: TEAM, dtype: object
Here we get the actual top 16 teams from our scraped data. We can see that our prediction got 5 of the 16 teams correct.
bracket2019 = getBracketPlacement(2019)
top16_2019 = []
for team in bracket2019[:16]:
    print(team)
    top16_2019.append(team[0])
('Virginia', [6, 6])
('Texas Tech', [5, 6])
('Michigan State', [4, 5])
('Auburn', [4, 5])
('Purdue', [3, 4])
('Kentucky', [3, 4])
('Gonzaga', [3, 4])
('Duke', [3, 4])
('Virginia Tech', [2, 3])
('Tennessee', [2, 3])
('Oregon', [2, 3])
('North Carolina', [2, 3])
('Michigan', [2, 3])
('LSU', [2, 3])
('Houston', [2, 3])
('Florida State', [2, 3])
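As a quick sanity check, we can count this overlap programmatically. This is only a rough sketch: team-name formats differ between the Kaggle season data (e.g. 'Murray St.') and the ESPN bracket data (e.g. 'Michigan State'), so exact string matching can undercount without name normalization.
# Hedged sketch: intersect our winrate-based picks with the actual top 16.
# Abbreviation mismatches ('St.' vs 'State') would need normalization in a
# fuller implementation; the five matches here happen to share exact names.
predicted_2019 = set(seasondata2019.sort_values(by=['winrate'], ascending=False).head(16)['TEAM'])
print(predicted_2019 & set(top16_2019))  # Virginia, Gonzaga, Houston, Tennessee, Duke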
As before, we calculate the total win rate for each team in the 2018 season and use the teams with the highest win rates as our prediction for the top 16.
seasondata2018['winrate'] = seasondata2018['W'] / seasondata2018['G']
seasondata2018.sort_values(by=['winrate'], ascending=False).head(16)['TEAM']
1                 Virginia
63          New Mexico St.
98       Stephen F. Austin
61        South Dakota St.
0                Villanova
78              Murray St.
40          Loyola Chicago
12                 Gonzaga
4               Cincinnati
8             Michigan St.
65     Louisiana Lafayette
86          UNC Greensboro
77            South Dakota
38            Saint Mary's
11                  Xavier
6                 Michigan
Name: TEAM, dtype: object
Here we get the actual top 16 teams from our scraped data. We can see that our prediction got 4 of the 16 teams correct.
bracket2018 = getBracketPlacement(2018)
top16_2018 = []
for team in bracket2018[:16]:
    print(team)
    top16_2018.append(team[0])
('Villanova', [6, 6])
('Michigan', [5, 6])
('Loyola Chicago', [4, 5])
('Kansas', [4, 5])
('Texas Tech', [3, 4])
('Syracuse', [3, 4])
('Kansas State', [3, 4])
('Florida State', [3, 4])
('Duke', [3, 4])
('West Virginia', [2, 3])
('Texas A&M', [2, 3])
('Purdue', [2, 3])
('Nevada', [2, 3])
('Kentucky', [2, 3])
('Gonzaga', [2, 3])
('Clemson', [2, 3])
Out of the 16 teams we selected for 2018, 4 ended up making the top 16. For 2019, 5 of our 16 selections made the top 16. Despite these low numbers, these are really good final-16 picks given that there are over 350 Division I teams eligible for the bracket! To put this in perspective, the probability of randomly guessing at least any 4 of the top 16 teams is .0002%, and at least 5 of the top 16 teams is .00001%. This also applies to the other years as well, but it is not necessary to show them. If there truly were no correlation between winrate and March Madness placement, the odds of us randomly guessing at least 25% of the top 16 two years in a row would be about 4e-12, or .0000000004% -- and keep in mind, we predicted better than 25% for 2019.
We want to see how correlated seasonal winrate is with placement, so we create a linear regression model using tournament games won (our measure of bracket placement) as the dependent variable and seasonal winrate as the independent variable.
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Dependent variable: tournament games won; independent variable: seasonal winrate
rankings = []
for tup in bracket2019:
    rankings.append([tup[0], tup[1][0], tup[1][1]])
df_bracket = pd.DataFrame(rankings, columns=['TEAM', 'Games Won', 'Games Played'])
# Join the bracket results with the season stats on team name
df_final = pd.merge(df_bracket, seasondata2019, how='inner', on='TEAM')
winrate_model = smf.ols(formula="Q('Games Won') ~ winrate", data=df_final).fit()
winrate_model.summary()
Dep. Variable: | Q('Games Won') | R-squared: | 0.216 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.199 |
Method: | Least Squares | F-statistic: | 12.91 |
Date: | Mon, 21 Dec 2020 | Prob (F-statistic): | 0.000780 |
Time: | 19:42:30 | Log-Likelihood: | -79.016 |
No. Observations: | 49 | AIC: | 162.0 |
Df Residuals: | 47 | BIC: | 165.8 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -3.1980 | 1.221 | -2.619 | 0.012 | -5.654 | -0.742 |
winrate | 5.8873 | 1.638 | 3.593 | 0.001 | 2.591 | 9.184 |
Omnibus: | 13.022 | Durbin-Watson: | 0.449 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 13.637 |
Skew: | 1.113 | Prob(JB): | 0.00109 |
Kurtosis: | 4.311 | Cond. No. | 14.3 |
from statsmodels.graphics.regressionplots import abline_plot
# plot regression line
abline_plot(model_results=winrate_model, ax=df_final.plot(x='winrate', y='Games Won', kind='scatter'))
The small p-value (0.001) together with the modest R-squared (0.216) indicates a real but loose correlation: seasonal winrate influences bracket placement, but it is far from the only factor. Still, winrate is a good indicator of placement trends.
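These figures can also be pulled directly from the fitted statsmodels results object rather than read off the summary table:
# R-squared of the fit and the p-value on the winrate coefficient
print(winrate_model.rsquared)            # ~0.216
print(winrate_model.pvalues['winrate'])  # ~0.001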
Let's take a quick look at the win rates of the top 16 teams in the years before this bracket to see if there are any trends.
Here we take our season data, calculate the win rate for each team, and then plot year vs. winrate for the 2019 top 16 teams.
seasondata = pd.read_csv('cbb.csv')
seasondata['winrate'] = seasondata['W'] / seasondata['G']
grouped = seasondata.sort_values(by=['YEAR']).groupby(['TEAM'])
fig, ax = plt.subplots()
for key, group in grouped:
    if key in top16_2019:
        group.plot('YEAR', 'winrate', label=key, ax=ax)
plt.legend()
plt.title('Winrates for 2019 top16 NCAA')
plt.xlabel('YEAR')
plt.ylabel('Winrate')
plt.xticks([2015,2016,2017,2018,2019])
plt.show()
We can see that these top 16 teams had varying win rates in the years prior to 2019; then in the 2018-2019 season their win rates shot up, with the majority above 80%. This leads to an interesting question: how are these teams increasing their win rates? Perhaps more interestingly, how do specific stats (3-pointers, turnovers, etc.) correlate with winrate?
Since we have taken a look at how winrate affects bracketeering, let's take a look at how game stats affect winrate.
First we calculate the win rates in the seasonal data, which covers the last five years.
seasondata['winrate'] = seasondata['W'] / seasondata['G']
seasondata
 | TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR | winrate
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 | 0.825000 |
1 | Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | ... | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1.0 | 2015 | 0.900000 |
2 | Michigan | B10 | 40 | 33 | 114.4 | 90.4 | 0.9375 | 53.9 | 47.7 | 14.0 | ... | 54.7 | 46.8 | 35.2 | 33.2 | 65.9 | 6.9 | 2ND | 3.0 | 2018 | 0.825000 |
3 | Texas Tech | B12 | 38 | 31 | 115.2 | 85.2 | 0.9696 | 53.5 | 43.0 | 17.7 | ... | 52.8 | 41.9 | 36.5 | 29.7 | 67.5 | 7.0 | 2ND | 3.0 | 2019 | 0.815789 |
4 | Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | ... | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1.0 | 2017 | 0.948718 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1752 | Texas A&M | SEC | 35 | 22 | 111.2 | 94.7 | 0.8640 | 51.4 | 46.9 | 19.2 | ... | 52.5 | 45.7 | 32.9 | 32.6 | 70.3 | 1.9 | S16 | 7.0 | 2018 | 0.628571 |
1753 | LSU | SEC | 35 | 28 | 117.9 | 96.6 | 0.9081 | 51.2 | 49.9 | 17.9 | ... | 52.9 | 49.4 | 31.9 | 33.7 | 71.2 | 7.3 | S16 | 3.0 | 2019 | 0.800000 |
1754 | Tennessee | SEC | 36 | 31 | 122.8 | 95.2 | 0.9488 | 55.3 | 48.1 | 15.8 | ... | 55.4 | 44.7 | 36.7 | 35.4 | 68.8 | 9.9 | S16 | 2.0 | 2019 | 0.861111 |
1755 | Gonzaga | WCC | 35 | 27 | 117.4 | 94.5 | 0.9238 | 55.2 | 44.8 | 17.1 | ... | 54.3 | 44.4 | 37.8 | 30.3 | 68.2 | 2.1 | S16 | 11.0 | 2016 | 0.771429 |
1756 | Gonzaga | WCC | 37 | 32 | 117.2 | 94.9 | 0.9192 | 57.0 | 47.1 | 16.1 | ... | 58.2 | 44.1 | 36.8 | 35.0 | 70.5 | 4.9 | S16 | 4.0 | 2018 | 0.864865 |
1757 rows × 25 columns
Great.
Now, for this linear regression model, we want to see the correlation between in-game statistics and seasonal winrate. We are using 7 in-game statistics: effective field goal percentage (EFG_O), turnover rate (TOR), steal rate (TORD), free throw rate (FTR), two-point shooting rate (2P_O), three-point shooting rate (3P_O), and adjusted tempo (ADJ_T).
We believe these are the most 'valuable' in-game statistics for explaining seasonal winrate, since relatively high rates of ball possession and scoring should lead to more wins.
# Regress seasonal winrate on the seven in-game statistics
game_model = smf.ols(formula="winrate ~ Q('EFG_O') + Q('TOR') + Q('TORD') + Q('FTR') + Q('2P_O') + Q('3P_O') + Q('ADJ_T')", data=seasondata).fit()
game_model.summary()
Dep. Variable: | winrate | R-squared: | 0.575 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.573 |
Method: | Least Squares | F-statistic: | 337.9 |
Date: | Mon, 21 Dec 2020 | Prob (F-statistic): | 2.43e-319 |
Time: | 19:42:33 | Log-Likelihood: | 1287.2 |
No. Observations: | 1757 | AIC: | -2558. |
Df Residuals: | 1749 | BIC: | -2515. |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -0.8161 | 0.093 | -8.787 | 0.000 | -0.998 | -0.634 |
Q('EFG_O') | 0.0043 | 0.010 | 0.417 | 0.677 | -0.016 | 0.024 |
Q('TOR') | -0.0284 | 0.002 | -18.583 | 0.000 | -0.031 | -0.025 |
Q('TORD') | 0.0192 | 0.001 | 14.243 | 0.000 | 0.017 | 0.022 |
Q('FTR') | 0.0072 | 0.001 | 12.277 | 0.000 | 0.006 | 0.008 |
Q('2P_O') | 0.0214 | 0.007 | 3.223 | 0.001 | 0.008 | 0.034 |
Q('3P_O') | 0.0115 | 0.006 | 2.006 | 0.045 | 0.000 | 0.023 |
Q('ADJ_T') | -0.0060 | 0.001 | -6.905 | 0.000 | -0.008 | -0.004 |
Omnibus: | 2.749 | Durbin-Watson: | 1.722 |
---|---|---|---|
Prob(Omnibus): | 0.253 | Jarque-Bera (JB): | 2.506 |
Skew: | -0.020 | Prob(JB): | 0.286 |
Kurtosis: | 2.820 | Cond. No. | 3.77e+03 |
We can see here that seasonal winrate is strongly dependent on a team's in-game statistics (an R-squared of 0.575 here, versus 0.216 for winrate vs. bracket placement). This makes a lot of sense, since the outcomes of games depend on points scored, and points scored depend heavily on these in-game statistics.
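Since we also want to know which in-game statistics influence winrate the most, one quick (if rough) way to compare them is to sort the fitted coefficients. Note these coefficients are unstandardized, so this ordering ignores differences in each statistic's scale:
# Rank the in-game statistics by their fitted effect on winrate
game_model.params.drop('Intercept').sort_values()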
Another piece of information that may influence bracketeers is consistency. Our goal here is to determine which teams make consistent progress in the bracket year after year. To do this, we will find and plot the most consistent teams, defining "consistent" as at least four top-16 placements within any five-year period.
top16 = []
# Create a dataframe with an instance of every team with a top 16 placement for each year
for year in range(2009, 2020):
    bracket = getBracketPlacement(year)
    for team in bracket[:16]:
        top16.append([year, team[0]])
top_teams = pd.DataFrame(data=top16, columns=['year', 'team'])
# Iterate through the placements to count each team's top 16 finishes in the last 5 years
team_frequency = {}
oldestwin = {}
legend = []
i = 0
year = 2009
for row in top_teams.iterrows():
    row = row[1]
    if row['team'] not in team_frequency:
        team_frequency[row['team']] = 1
        oldestwin[row['team']] = [row['year']]
    else:
        team_frequency[row['team']] += 1
        oldestwin[row['team']] += [row['year']]
    # Win decay -- once 5 years have passed since a team's oldest counted placement, remove it
    if row['year'] > 2013:
        if row['year'] != year:
            year = row['year']
            for key in team_frequency.keys():
                if team_frequency[key] > 0 and year - oldestwin[key][0] >= 5:
                    team_frequency[key] -= 1
                    oldestwin[key] = oldestwin[key][1:]
    # For brevity, we only keep teams in the top 16 for a minimum of 4 out of any 5 year period
    if team_frequency[row['team']] >= 4:
        legend.append(row['team'])
    top_teams.loc[i, 'placements'] = team_frequency[row['team']]
    i += 1
# Create a plot with each of the most consistent teams
grouped = top_teams.sort_values(by=['year']).groupby(['team'])
fig, ax = plt.subplots()
for key, group in grouped:
    if key in legend:
        group.plot('year', 'placements', label=key, ax=ax)
plt.legend()
plt.title('Top 16 consistency')
plt.xlabel('YEAR')
plt.ylabel('number of top16 placements in the last 5 years')
Text(0, 0.5, 'number of top16 placements in the last 5 years')
These are the most consistent top-16 teams throughout the years 2009-2018; each placed in the top 16 at least four times within some five-year period. The picture is relatively stable from 2010 to 2015, when many of the same teams were fighting for a spot in the top 16. After 2015, Gonzaga emerges as the most consistent top-16 team by far, being the first to hit five consecutive years of placement.
NOTE: any team whose line stops suddenly did not place in the top 16 after that year.
Here, we want to explore how often each seed has been able to beat its matchup, and vice-versa.
df_matchup = game_data[['team1_seed', 'team2_seed', 'team1_score', 'team2_score']].copy()
# result is 1 if team 1 won, 2 if team 2 won
df_matchup['result'] = np.where(df_matchup['team1_score'] > df_matchup['team2_score'], 1, 2)
df_matchup = df_matchup[['team1_seed', 'team2_seed', 'result']]
# Keep only standard matchups, where the seeds sum to 17 (1v16, 2v15, ...)
df_matchup = df_matchup[df_matchup['team1_seed'] + df_matchup['team2_seed'] == 17]
# Count how often each (matchup, result) combination occurred
df_matchup = df_matchup.groupby(df_matchup.columns.tolist(), as_index=False).size()
df_matchup['matchup'] = df_matchup["team1_seed"].astype(str) + ":" + df_matchup["team2_seed"].astype(str)
ax = df_matchup.pivot("matchup", "result", "size").plot(kind='bar', title="Number of wins per seed matchup")
ax.legend(['team 1 wins', 'team 2 wins'])
ax.set_xlabel("matchup (team 1 : team 2)")
Text(0.5, 0, 'matchup (team 1 : team 2)')
Now that we have explored the potential correlations between game stats and bracket placements, it's time to build a winner predictor. To get a more accurate predictor we need more than five years of data, so we scrape ESPN for more seasonal data on team stats.
Like the game data, this data is collected from ESPN, on a need-to-collect basis: only teams that competed in the March Madness bracket in a given year have their data scraped for that year.
def get_stats(teamId, year):
    url = 'https://www.espn.com/mens-college-basketball/team/stats/_/id/{}/season/{}'.format(teamId, year)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    r = rq.get(url, headers=headers)
    page = r.text
    soup = bs(page, 'html.parser')
    # The last row of the stats table holds the team's season averages
    table = soup.select('.Table__Scroller')[0]
    stat_row = table.find_all('tr')[-1]
    for td in stat_row.find_all('td'):
        td.append('\n')
    data_arr = [float(x) for x in stat_row.get_text().split('\n') if x]
    team_stats = {"id": teamId, "year": year, "games_played": data_arr[0], "points": data_arr[1],
                  "rebounds": data_arr[2], "assists": data_arr[3], "steals": data_arr[4],
                  "blocks": data_arr[5], "turnovers": data_arr[6], "fg_pct": data_arr[7],
                  "ft_pct": data_arr[8], "3P_pct": data_arr[9]}
    return team_stats
def get_team_stats(game_data):
    # Collect every (year, team id) pair that appears in the bracket data
    subset = game_data[['year', 'team1_id']].drop_duplicates()
    subset2 = game_data[['year', 'team2_id']].drop_duplicates()
    tuples = [tuple(x) for x in subset.to_numpy()]
    tuples += [tuple(x) for x in subset2.to_numpy()]
    stats = []
    for tup in set(tuples):
        stats.append(get_stats(tup[1], tup[0]))
    return pd.DataFrame(stats).sort_values(by=['year', 'id']).reset_index(drop=True)
team_stats = get_team_stats(game_data)
df = game_data
df = pd.merge(df, team_stats, how="left", left_on=['year', 'team1_id'], right_on=['year', 'id'])
df = pd.merge(df, team_stats, how="left", left_on=['year', 'team2_id'], right_on=['year', 'id'], suffixes=("_team1", "_team2"))
We will first use a linear regression model to establish a benchmark for a more robust prediction model. Multiple linear regression will also let us identify which features explain the variation in point spread, which in turn helps explain the variation in win probability.
import warnings
warnings.filterwarnings('ignore')
# Keep only the columns needed for modeling
df = df[['year', 'team1_name', 'team2_name', 'team1_score', 'team2_score', 'team1_seed', 'team2_seed',
         'points_team1', 'rebounds_team1', 'assists_team1', 'steals_team1', 'blocks_team1',
         'turnovers_team1', 'fg_pct_team1', 'ft_pct_team1', '3P_pct_team1', 'games_played_team2',
         'points_team2', 'rebounds_team2', 'assists_team2', 'steals_team2', 'blocks_team2',
         'turnovers_team2', 'fg_pct_team2', 'ft_pct_team2', '3P_pct_team2']]
# Build per-matchup differentials (team 1 minus team 2); .copy() avoids a SettingWithCopyWarning
reg_df = df[['year', 'team1_name', 'team2_name']].copy()
reg_df['score_diff'] = df['team1_score'] - df['team2_score']
reg_df['seed_diff'] = df['team1_seed'] - df['team2_seed']
reg_df['avg_points_diff'] = df['points_team1'] - df['points_team2']
reg_df['avg_rebounds_diff'] = df['rebounds_team1'] - df['rebounds_team2']
reg_df['avg_assists_diff'] = df['assists_team1'] - df['assists_team2']
reg_df['avg_steals_diff'] = df['steals_team1'] - df['steals_team2']
reg_df['avg_blocks_diff'] = df['blocks_team1'] - df['blocks_team2']
reg_df['avg_turnovers_diff'] = df['turnovers_team1'] - df['turnovers_team2']
reg_df['avg_fg_pct_diff'] = df['fg_pct_team1'] - df['fg_pct_team2']
reg_df['avg_ft_pct_diff'] = df['ft_pct_team1'] - df['ft_pct_team2']
reg_df['avg_3P_pct_diff'] = df['3P_pct_team1'] - df['3P_pct_team2']
# Hold out the most recent year (2019) as the test set; train on all earlier years
test_df = [rows for _, rows in reg_df.groupby('year')][-1]
dfs = [rows for _, rows in reg_df.groupby('year')][0:-1]
reg_dfs = pd.concat(dfs)
import statsmodels.formula.api as smf
# Regress point spread on seed difference plus the average stat differentials.
# (Named spread_model since the response here is score_diff, not winrate.)
spread_model = smf.ols(formula="score_diff ~ seed_diff + avg_points_diff + avg_rebounds_diff + avg_assists_diff + avg_steals_diff + avg_blocks_diff + avg_fg_pct_diff + avg_ft_pct_diff + avg_3P_pct_diff", data=reg_dfs).fit()
spread_model.summary()
Dep. Variable: | score_diff | R-squared: | 0.258 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.248 |
Method: | Least Squares | F-statistic: | 25.24 |
Date: | Mon, 21 Dec 2020 | Prob (F-statistic): | 2.30e-37 |
Time: | 19:42:35 | Log-Likelihood: | -2571.0 |
No. Observations: | 663 | AIC: | 5162. |
Df Residuals: | 653 | BIC: | 5207. |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | 0.7756 | 0.535 | 1.450 | 0.147 | -0.274 | 1.825 |
seed_diff | -1.0090 | 0.080 | -12.624 | 0.000 | -1.166 | -0.852 |
avg_points_diff | 0.0012 | 0.126 | 0.009 | 0.993 | -0.246 | 0.248 |
avg_rebounds_diff | 0.1978 | 0.205 | 0.966 | 0.335 | -0.204 | 0.600 |
avg_assists_diff | -0.2251 | 0.243 | -0.925 | 0.355 | -0.703 | 0.253 |
avg_steals_diff | 0.7276 | 0.315 | 2.308 | 0.021 | 0.109 | 1.347 |
avg_blocks_diff | 0.1646 | 0.314 | 0.524 | 0.601 | -0.453 | 0.782 |
avg_fg_pct_diff | 0.0460 | 0.228 | 0.202 | 0.840 | -0.402 | 0.494 |
avg_ft_pct_diff | 0.1358 | 0.111 | 1.228 | 0.220 | -0.081 | 0.353 |
avg_3P_pct_diff | 0.0942 | 0.167 | 0.565 | 0.573 | -0.233 | 0.422 |
Omnibus: | 13.563 | Durbin-Watson: | 1.826 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 15.941 |
Skew: | 0.259 | Prob(JB): | 0.000346 |
Kurtosis: | 3.556 | Cond. No. | 10.6 |
As we can see from the regression table, the difference in seeds explains a great deal of the difference between team scores. This makes sense, since seed number roughly encodes skill: low seed numbers (1, 2, ...) go to teams at the top of their respective conferences.
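As a worked example of what the coefficients imply, consider a hypothetical 1-vs-16 matchup with every stat differential held at zero (the DataFrame below is illustrative, not drawn from our dataset):
# Hypothetical 1 seed (team 1) vs. 16 seed (team 2) with identical season stats
matchup = pd.DataFrame({'seed_diff': [1 - 16], 'avg_points_diff': [0.0], 'avg_rebounds_diff': [0.0],
                        'avg_assists_diff': [0.0], 'avg_steals_diff': [0.0], 'avg_blocks_diff': [0.0],
                        'avg_fg_pct_diff': [0.0], 'avg_ft_pct_diff': [0.0], 'avg_3P_pct_diff': [0.0]})
print(spread_model.predict(matchup))  # roughly 0.78 + 1.01 * 15, about a +16 point spread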
While linear regression works well for predicting point spread, we ultimately care about win probabilities between teams. As a result, we will use logistic regression to classify whether team 1 beats team 2 based on the differences in their average team metrics.
# .copy() so the binary outcome column doesn't mutate the training frame
log_reg_df = reg_dfs.copy()
log_reg_df['win'] = log_reg_df['score_diff'] > 0
import statsmodels.api as sm
log_reg = sm.Logit(log_reg_df[['win']], log_reg_df[['seed_diff', 'avg_points_diff', 'avg_rebounds_diff', 'avg_assists_diff', 'avg_steals_diff', 'avg_fg_pct_diff', 'avg_ft_pct_diff', 'avg_3P_pct_diff']]).fit()
log_reg.summary()
Optimization terminated successfully.
         Current function value: 0.572254
         Iterations 5
Dep. Variable: | win | No. Observations: | 663 |
---|---|---|---|
Model: | Logit | Df Residuals: | 655 |
Method: | MLE | Df Model: | 7 |
Date: | Mon, 21 Dec 2020 | Pseudo R-squ.: | 0.1411 |
Time: | 19:42:35 | Log-Likelihood: | -379.40 |
converged: | True | LL-Null: | -441.74 |
Covariance Type: | nonrobust | LLR p-value: | 8.120e-24 |
 | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---
seed_diff | -0.1511 | 0.015 | -9.928 | 0.000 | -0.181 | -0.121 |
avg_points_diff | -0.0075 | 0.024 | -0.313 | 0.754 | -0.055 | 0.040 |
avg_rebounds_diff | 0.0207 | 0.036 | 0.577 | 0.564 | -0.050 | 0.091 |
avg_assists_diff | -0.0540 | 0.046 | -1.162 | 0.245 | -0.145 | 0.037 |
avg_steals_diff | 0.1265 | 0.059 | 2.135 | 0.033 | 0.010 | 0.243 |
avg_fg_pct_diff | 0.0314 | 0.044 | 0.710 | 0.478 | -0.055 | 0.118 |
avg_ft_pct_diff | 0.0206 | 0.021 | 0.968 | 0.333 | -0.021 | 0.062 |
avg_3P_pct_diff | -0.0033 | 0.032 | -0.103 | 0.918 | -0.066 | 0.059 |
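To put the seed_diff coefficient on an interpretable scale, we can exponentiate it into an odds ratio (a quick check using the fitted parameters; numpy was imported at the top of the tutorial):
# exp(coef) is the multiplicative change in team 1's win odds per one-unit
# increase in seed_diff (i.e., team 1 seeded one spot worse)
print(np.exp(log_reg.params['seed_diff']))  # ~0.86, about 14% lower odds per seed spot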
When running a logistic regression, accuracy alone does not provide enough information about the model, since it hides how the decisions break down. A confusion matrix is therefore often used to derive precision (the fraction of predicted positives that are actually positive) and recall (the fraction of actual positives the model finds) - two metrics that better explain how effectively a model classifies.
Xtest = test_df[['seed_diff', 'avg_points_diff', 'avg_rebounds_diff', 'avg_assists_diff', 'avg_steals_diff', 'avg_fg_pct_diff', 'avg_ft_pct_diff', 'avg_3P_pct_diff']]
test_df['win'] = test_df['score_diff'] > 0
ytest = test_df['win']
yhat = log_reg.predict(Xtest)
prediction = pd.Series(list(map(round, yhat)))
actual = pd.Series([1 if x else 0 for x in list(ytest.values)])
confusion_matrix = pd.crosstab(actual, prediction)
precision = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[1][0]) # Precision
recall = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[0][1]) # Recall
print("Precision: %.2f" % precision)
print("Recall: %.2f" % recall)
Precision: 0.47
Recall: 0.70
As we can see, our recall is much higher than our precision, suggesting that our model was overpredicting wins. This could be due to the statistically significant variable of seed difference, which is likely a key contributor in predictions.
The fun of March Madness comes from predicting an accurate bracket based on the teams that have been seeded, with college basketball fans and statisticians alike trying their hardest to pick the perfect bracket. The odds of picking a perfect bracket are about 1 in 120 billion, which makes it all the more fun.
bracket_2019 = pd.read_csv('bracket_data_2019.csv') # Auto-annotated dataset based on game_data
bracket_2019 = bracket_2019[bracket_2019.year == 2019]
# Matchups of a March Madness region, nested by round:
# each inner pair plays, and the winners meet one level up
bracketology = [
    [
        [[1, 16], [8, 9]],
        [[5, 12], [4, 13]]
    ],
    [
        [[6, 11], [3, 14]],
        [[7, 10], [2, 15]]
    ]
]
bracket_2019 = pd.merge(bracket_2019, team_stats, how="left", left_on=['year', 'team1_id'], right_on=['year', 'id'])
bracket_2019 = pd.merge(bracket_2019, team_stats, how="left", left_on=['year', 'team2_id'], right_on=['year', 'id'], suffixes=("_team1", "_team2"))
region = bracket_2019[bracket_2019.match_round == 1].groupby('region')
regions = [rows for _, rows in region]
def bracket_calc(bracket, region):
    if isinstance(bracket[0], list):
        # Recurse: resolve each sub-bracket, then play the two winners against each other
        new_game = [bracket_calc(bracket[0], region), bracket_calc(bracket[1], region)]
        test_game = region[(region.team1_seed == new_game[0]) | (region.team2_seed == new_game[0])]
        test = log_reg.predict(test_game[['seed_diff', 'avg_points_diff', 'avg_rebounds_diff', 'avg_assists_diff', 'avg_steals_diff', 'avg_fg_pct_diff', 'avg_ft_pct_diff', 'avg_3P_pct_diff']])
        return new_game[0 if list(test)[0] > 0.5 else 1]
    else:
        # Base case: a single seed-vs-seed matchup
        test_game = region[(region.team1_seed == bracket[0]) | (region.team2_seed == bracket[0])]
        test = log_reg.predict(test_game[['seed_diff', 'avg_points_diff', 'avg_rebounds_diff', 'avg_assists_diff', 'avg_steals_diff', 'avg_fg_pct_diff', 'avg_ft_pct_diff', 'avg_3P_pct_diff']])
        return bracket[0 if list(test)[0] > 0.5 else 1]
reg_str = ['Midwest', 'South', 'West', 'East']
for reg in regions:
    df = reg[['year', 'team1_name', 'team2_name', 'team1_id', 'team2_id', 'team1_score', 'team2_score', 'team1_seed', 'team2_seed',
              'points_team1', 'rebounds_team1', 'assists_team1', 'steals_team1', 'blocks_team1',
              'turnovers_team1', 'fg_pct_team1', 'ft_pct_team1', '3P_pct_team1', 'games_played_team2',
              'points_team2', 'rebounds_team2', 'assists_team2', 'steals_team2', 'blocks_team2',
              'turnovers_team2', 'fg_pct_team2', 'ft_pct_team2', '3P_pct_team2']]
    # Build the same differential features the model was trained on
    reg_df = df[['year', 'team1_name', 'team2_name', 'team1_id', 'team2_id', 'team1_seed', 'team2_seed']].copy()
    reg_df['score_diff'] = df['team1_score'] - df['team2_score']
    reg_df['seed_diff'] = df['team1_seed'] - df['team2_seed']
    reg_df['avg_points_diff'] = df['points_team1'] - df['points_team2']
    reg_df['avg_rebounds_diff'] = df['rebounds_team1'] - df['rebounds_team2']
    reg_df['avg_assists_diff'] = df['assists_team1'] - df['assists_team2']
    reg_df['avg_steals_diff'] = df['steals_team1'] - df['steals_team2']
    reg_df['avg_blocks_diff'] = df['blocks_team1'] - df['blocks_team2']
    reg_df['avg_turnovers_diff'] = df['turnovers_team1'] - df['turnovers_team2']
    reg_df['avg_fg_pct_diff'] = df['fg_pct_team1'] - df['fg_pct_team2']
    reg_df['avg_ft_pct_diff'] = df['ft_pct_team1'] - df['ft_pct_team2']
    reg_df['avg_3P_pct_diff'] = df['3P_pct_team1'] - df['3P_pct_team2']
    print("Top in %s: %d seed" % (reg_str.pop(), bracket_calc(bracketology, reg_df)))
Top in East: 1 seed
Top in West: 1 seed
Top in South: 1 seed
Top in Midwest: 1 seed
Our model predicted that, for each region in 2019, the 1 seed would win every round. This is largely due to the dominance of the seed-difference feature: seed number correlates so strongly with win probability that it swamps the other features. Had we omitted that variable, the model might have been more effective at identifying win probability from the team statistics themselves.
Our current model also works in a static fashion, not taking into account placements within the tournament as games progress, which is likely a good predictor of future win probability. Additionally, seasonal stats seem quite ineffective at explaining the variance in win probability; there are likely more granular features that explain it better and can even be tracked across time. Ultimately, the challenge is to avoid overcomplicating the model with confounding variables, and instead to understand which team metrics best predict tournament success.
March Madness is a tough competition to predict. There are many variables affecting the outcome, and in a single-elimination tournament a black swan event can alter the entire prediction. Ultimately, a static model like the one used here is not going to be a useful predictor of such tournaments; a model that updates win probabilities as the tournament progresses would likely do better.