from IPython.display import Image
Image(filename='steph.jpg')
Let's travel back to October 7, 1979. The National Basketball Association's 34th season is about to begin, and with it comes the introduction of the 3-point line and the 3-point field goal. Let's read what the New York Times' NBA Preview (https://www.nytimes.com/1979/10/07/archives/nba-preview-new-faces-and-some-gimmicks.html) has to say about this new way to score:
"But all the players in the National Basketball Association will have one new gimmick in common — the 3‐point field Goal."
"The N.B.A. is hoping the new players, the new gimmick, the return of pro basketball to Utah, a new alignment and a new schedule with more interconference play, will help rebuild the sagging image the league suffered last season when attendance leveled off and television ratings dropped sharply. The declines were especially apparent in the major markets — New York, Los Angeles, Chicago, Boston and Philadelphia."
As we can see, the 3 point shot was originally seen as a gimmick: a trick to win back the fans the league desperately needed, both in the stands and in front of the television.
Here's a quick explanation of what a 3 pointer is: "Players can get 3 points by shooting from beyond 22 feet away from the basket at the sidelines and 23 feet and 9 inches away from the basket everywhere else."
Also, here's a picture of the NBA 3 point arc. Any shot taken from outside the outer arc is a 3 pointer.
from IPython.display import Image
Image(filename='arc.png')
Within the NBA community at the time, the view of the 3 pointer was just as cynical, and not all coaches were thrilled with the new addition.
“It may change our game at the end of the quarters,” said John MacLeod, coach of the Phoenix Suns, “but I'm not going to set up plays for guys to bomb from 23 feet. I think that's very boring basketball.”
So at the time, there seemed to be little hope for, or belief in, the NBA 3 pointer. The media thought it was a gimmick, and NBA coaches were adamant about not drawing up plays (opportunities to score) for players to shoot from 23+ feet away.
Were they right? Would teams really choose to shoot 3 pointers from farther away rather than stick with the method that had worked for them for the previous 33 years: shooting 2 pointers near the basket? At a time when television ratings and attendance were dropping, would the 3 pointer help bring back popularity?
In this tutorial, our goal is to look at NBA teams' 3-point data and see what insight we can find about the adoption and use of the 3-pointer since it was introduced in 1979, as well as its ties to teams' success and scoring output. We'll see its impact on the league, and show that it's more than just "a gimmick."
You will need the following libraries for this project: pandas, NumPy, scikit-learn, statsmodels, seaborn, and matplotlib.
I've listed more information about using these libraries at the bottom of this tutorial, under 'Resources'.
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn import linear_model
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
This is the Data Collection stage of the data lifecycle. During this stage, we'll focus on getting our initial data from a specific source.
We'll get our data from two Kaggle CSV files. The Seasons_Stats.csv file has individual NBA players' data from 1950 to the present, while Team_Records.csv has data on NBA teams' success on the court (win record, playoff information, ratings).
More information about the Kaggle datasets:
# read csv files into pandas dataframes
players_stats = pd.read_csv("Seasons_Stats.csv")
team_records = pd.read_csv("Team_Records.csv")
players_stats.head()
team_records.head()
As we see above, our two dataframes contain a lot of information about individual NBA players as well as NBA teams. While this is a good start, we'll need to manipulate the data and compute new values from it before it gives us the specific information we want to analyze.
From a high-level overview of these two dataframes, we already see some issues that we'll have to deal with as we process our data. There are a lot of NaNs in the first DataFrame, and both DataFrames have many columns whose usefulness we need to decide on.
This is the Data Processing stage of the data lifecycle. During this stage, we will restructure, reorganize, and tidy our data (right now, our two dataframes) to prepare it for readability and analysis.
We can call what we'll be doing in this section Data Wrangling, which is the process of cleaning, structuring, and enriching raw data into a desired format so that we can effectively apply analysis and machine learning to it. For additional resources explaining data wrangling, see: https://www.trifacta.com/data-wrangling/
For example, the data wrangling in this section involves narrowing our DataFrames down to the necessary columns (structuring our data), dealing with NaN values (cleaning and tidying our data), and calculating additional three point metrics for teams (enriching our raw data). Let's begin.
Our players_stats DataFrame includes about 50 columns of NBA player data, which is a lot to interpret. Since we are only looking at scoring, and 3 point scoring in particular, we can filter our dataframe to keep only the columns relevant to those metrics.
We'll keep the columns Year, Player, Pos, Tm, 3PAr, 3P, 3PA, 3P%, and PTS.
And ditch the rest of the columns, as they don't relate to 3 point metrics (such as defensive stats like rebounds, steals, and blocks, or assists).
#create new data frame with only needed columns
players_stats = players_stats[['Year', 'Player', 'Pos', 'Tm',
                               '3PAr', '3P', '3PA', '3P%', 'PTS']].copy()
players_stats.head()
Wait a minute... why do the values in our players_stats dataframe include NaN/null values, specifically for our desired 3 point statistics?
Well, that's because, as mentioned before, the 3 point shot was introduced in the 1979-1980 season! Before that time, the shot didn't even exist, so there was no recorded data for this statistic for any player.
We can resolve this by dropping every row whose 3P% value is NaN, which leaves our players_stats dataframe with player data from 1980 onwards, when our focus metric started to exist.
#drop rows with NaN 3P% -- removes every season before 1980, when the 3 point line was created
players_stats = players_stats[players_stats['3P%'].notna()]
players_stats.head()
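One thing to keep in mind about the filter above: if the dataset also leaves 3P% empty for players who never attempted a three in a season (an assumption worth checking against the real file), dropping NaN rows removes those players too, along with their points. Here's a minimal sketch on made-up rows that compares the NaN filter with a filter on Year; the toy dataframe and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: one pre-1980 player and two 1985 players,
# one of whom never attempted a three (so his 3P% is undefined).
toy = pd.DataFrame({
    "Year": [1975, 1985, 1985],
    "Player": ["A", "B", "C"],
    "3PA": [np.nan, 40.0, 0.0],
    "3P%": [np.nan, 0.35, np.nan],
    "PTS": [900, 1200, 600],
})

by_nan = toy[toy["3P%"].notna()]    # the approach used in this tutorial
by_year = toy[toy["Year"] >= 1980]  # alternative: filter on the season itself

print(by_nan["Player"].tolist())    # players kept by the NaN filter
print(by_year["Player"].tolist())   # players kept by the Year filter
```

In this toy case the Year filter keeps player C and his 600 points, while the NaN filter does not. Which behavior you want depends on whether you later sum points per team.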
The dataset behind our players_stats dataframe handles players who played for two or more teams in a season (usually because of a trade) by adding an extra row for that player, with Tm (team) labeled 'TOT' (total), that sums their statistics across all of the teams they played for that year.
Since we will be doing an analysis with teams and 3 point statistics coming up later in the tutorial, we can get rid of any rows that have a team name of 'TOT', as that's actually not a team name at all. We'll still have the player's statistics for the different teams they played for that year (in different rows), just not a sum of their statistics for that year (in all one row).
#drop rows with team name as 'TOT' for clarity
players_stats = players_stats[players_stats.Tm != "TOT"]
players_stats.Tm.unique() # should not include 'TOT'
In the current NBA, there are 30 teams in the league. However, if you look at the number of unique team names in our dataframe above, you'll see that there are more than 30.
This is because the NBA, throughout its existence, has had a history of teams moving from city to city (for example, in 2012 the New Jersey Nets moved to Brooklyn and became the Brooklyn Nets), and of teams simply changing their names (the Charlotte Bobcats becoming the Charlotte Hornets).
We investigated this and found an image that displays NBA teams' name and location history over the years:
from IPython.display import Image
Image(filename='historyOfTeamNames.png')
As we can see from the image above, some teams (like the Houston Rockets) never changed names, while other teams (like the New Orleans Pelicans) have gone through at least one name change from 1980 onwards.
If you're interested in the history of name changes throughout the entirety of the NBA's history, check out the full image here: https://en.wikipedia.org/wiki/Timeline_of_the_National_Basketball_Association
So to deal with name or location changes in our players_stats dataframe, we replaced each old team acronym with its current one. For example, every mention of the Vancouver Grizzlies (VAN) was replaced with the Memphis Grizzlies (MEM), since the Vancouver Grizzlies moved to Memphis for the 2001-02 season and the two share the same team history.
# Replace old team acronyms with current ones (in the Tm column only)
old_to_current = {
    'VAN': 'MEM', 'NJN': 'BRK', 'SDC': 'LAC', 'SEA': 'OKC',
    'WSB': 'WAS', 'KCK': 'SAC', 'CHH': 'CHA', 'CHO': 'CHA',
    'NOK': 'NOP', 'NOH': 'NOP',
}
players_stats = players_stats.replace({'Tm': old_to_current})
len(players_stats.Tm.unique()) # Should be 30, as there are currently 30 NBA teams
Great! We have successfully replaced all the old team acronyms with their current ones, as we can see that our players_stats dataframe has 30 unique NBA teams.
In our team_records dataframe, we'll keep the columns Season, Team, W, L, W/L%, and ORtg.
These are the columns that will be relevant to our analysis, as the columns represent a team's success (wins) as well as a metric that reflects how well they play (offensive rating).
team_records = team_records[['Season', 'Team', 'W', 'L', 'W/L%', 'ORtg']].copy()
team_records.head()
As mentioned before, three pointers were introduced in 1979, so, as we did with the players_stats dataframe, we will filter our team_records dataframe to only include data from 1980 onwards, when our 3 point metrics started to exist.
team_records = team_records.sort_values(by=['Season'],ascending=True)
# Filtering data to only include data from 1980 onwards
team_records = team_records[team_records.Season >= "1980-81"]
team_records.head()
We define 'tidy' data as data in which (1) each variable forms a column, (2) each observation forms a row, and (3) each type of observational unit forms a table.
When we look at our team_records dataframe above, it's tidy in terms of #2 and #3, but we have an issue with #1. In the Team column, the original creator of the dataset put an asterisk (*) next to the team name if the team made the playoffs that year. So for example, in the dataframe above, the Lakers have a (*) next to their name because they made the playoffs in '80-'81.
This is an awkward way to store whether a team made the playoffs, and it makes our data untidy: two variables (team name and whether the team made the playoffs) occupy one column, when they should occupy two separate columns.
We'll clean up team_records by replacing every team name that has a (*) with the same string without the (*).
Our analysis does not use whether teams made the playoffs, but to fully tidy the data we would add another column that says true or false for whether the team made the playoffs, so that both variables have their own columns.
In order to learn more about untidy data and how to resolve and clean the data, check out this link: https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it
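Although we don't need the playoff flag for our analysis, here's a minimal sketch of how it could be pulled into its own column before the asterisk is removed. The `Playoffs` column name and the two toy rows are our own choices for illustration, not part of the dataset.

```python
import pandas as pd

# Two made-up rows in the same shape as team_records
toy_records = pd.DataFrame({
    "Season": ["1980-81", "1980-81"],
    "Team": ["Los Angeles Lakers*", "Dallas Mavericks"],
})

# True when the original name carries the trailing asterisk
toy_records["Playoffs"] = toy_records["Team"].str.endswith("*")
# Now the marker can be stripped without losing information
toy_records["Team"] = toy_records["Team"].str.rstrip("*")

print(toy_records)
```

After these two steps, each variable (team name and playoff appearance) sits in its own column.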
#Strip the trailing playoff asterisk from every team name (untidy data)
team_records = team_records.assign(Team=team_records['Team'].str.rstrip('*'))
team_records.head()
We'll replace all the old team names with the current team names in our team_records dataframe. Then we create an acronym column, and fill it in with the row's respective team name's acronym.
#Replace old team names with current team names
old_to_current_names = {
    'Vancouver Grizzlies': 'Memphis Grizzlies',
    'New Jersey Nets': 'Brooklyn Nets',
    'San Diego Clippers': 'Los Angeles Clippers',
    'Seattle SuperSonics': 'Oklahoma City Thunder',
    'Washington Bullets': 'Washington Wizards',
    'Kansas City Kings': 'Sacramento Kings',
    'Charlotte Bobcats': 'Charlotte Hornets',
    'New Orleans/Oklahoma City Hornets': 'New Orleans Pelicans',
    'New Orleans Hornets': 'New Orleans Pelicans',
}
team_records = team_records.replace({'Team': old_to_current_names})
#Add team acronym column
team_acronyms = {
    "Atlanta Hawks": "ATL", "Boston Celtics": "BOS",
    "Brooklyn Nets": "BRK",  # BRK (not BKN) to match the code used in players_stats
    "Charlotte Hornets": "CHA", "Chicago Bulls": "CHI",
    "Cleveland Cavaliers": "CLE", "Dallas Mavericks": "DAL",
    "Denver Nuggets": "DEN", "Detroit Pistons": "DET",
    "Golden State Warriors": "GSW", "Houston Rockets": "HOU",
    "Los Angeles Clippers": "LAC", "Los Angeles Lakers": "LAL",
    "Memphis Grizzlies": "MEM", "Miami Heat": "MIA",
    "Milwaukee Bucks": "MIL", "Minnesota Timberwolves": "MIN",
    "New Orleans Pelicans": "NOP", "New York Knicks": "NYK",
    "Oklahoma City Thunder": "OKC", "Orlando Magic": "ORL",
    "Philadelphia 76ers": "PHI", "Phoenix Suns": "PHX",
    "Portland Trail Blazers": "POR", "Sacramento Kings": "SAC",
    "San Antonio Spurs": "SAS", "Toronto Raptors": "TOR",
    "Utah Jazz": "UTA", "Washington Wizards": "WAS",
}
team_records['Acronym'] = team_records['Team'].replace(team_acronyms)
team_records.head()
We'll have the 'Season' column of the dataframe include just the starting year of the season (so 1980 instead of 1980-81). We then encode both 'Season' in our team_records dataframe and 'Year' in our players_stats dataframe as integers.
team_records.Season = team_records.Season.str.slice(0, 4)
#convert season datatype to int
team_records = team_records.astype({"Season": int})
team_records.head()
#convert year datatype to int
players_stats = players_stats.astype({"Year": int})
players_stats.head()
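Which integer makes the right join key depends on how players_stats labels its Year column. We haven't confirmed this from the dataset's documentation, so treat the following as a sketch under an assumption: if Year is the calendar year in which a season ends, then the season's starting year plus one is the value that lines up with it. The season labels below are just examples.

```python
import pandas as pd

# Example season labels in the same "YYYY-YY" format as team_records.Season
seasons = pd.Series(["1980-81", "1999-00", "2016-17"])

start_year = seasons.str.slice(0, 4).astype(int)  # what this tutorial uses
end_year = start_year + 1                         # season-ending year (assumed convention)

print(pd.DataFrame({"Season": seasons, "start": start_year, "end": end_year}))
```

Working from the first four characters avoids any trouble with labels like "1999-00", where the two-digit suffix wraps around the century.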
Utilizing our players_stats dataframe, we'll group the data by Year and Team, which gives us one group of rows for each (team, year) pairing in our players_stats dataframe.
As we iterate through the groups, we'll calculate each team's mean 3P%, mean 3PAr, total 3P made, total 3PA, and total PTS,
and store those values in a new dataframe.
#add 3pt stats to team dataframe
rows = []
for (year, team), g in players_stats.groupby(["Year", "Tm"]):
    rows.append({
        "Season": year,
        "Acronym": team,
        "3P%_mean": g["3P%"].mean(),    # unweighted mean of the players' 3P%
        "3PAr_mean": g["3PAr"].mean(),  # unweighted mean of the players' 3PAr
        "3P_sum": g["3P"].sum(),        # total threes made
        "3PA_sum": g["3PA"].sum(),      # total threes attempted
        "PTS_sum": g["PTS"].sum(),      # total points
    })
result = pd.DataFrame(rows)
result.head()
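The same per-team summary can also be written without an explicit loop, using pandas named aggregation. This is a sketch on made-up player rows, not a replacement for the code above; the numbers are invented and only the column names follow our dataframes.

```python
import pandas as pd

# Made-up player rows with the same columns as players_stats
toy_players = pd.DataFrame({
    "Year": [1980, 1980, 1980],
    "Tm": ["BOS", "BOS", "LAL"],
    "3P%": [0.40, 0.30, 0.20],
    "3PAr": [0.10, 0.05, 0.02],
    "3P": [50, 20, 5],
    "3PA": [125, 67, 25],
    "PTS": [1500, 900, 1800],
})

# Output names like "3P%_mean" are not valid Python identifiers,
# so they are passed to agg() through a dict unpacked with **.
toy_result = (
    toy_players.groupby(["Year", "Tm"], as_index=False)
    .agg(**{
        "3P%_mean": ("3P%", "mean"),
        "3PAr_mean": ("3PAr", "mean"),
        "3P_sum": ("3P", "sum"),
        "3PA_sum": ("3PA", "sum"),
        "PTS_sum": ("PTS", "sum"),
    })
    .rename(columns={"Year": "Season", "Tm": "Acronym"})
)

print(toy_result)
```

Each output column is defined by a (source column, function) pair, which keeps the "what" of the summary in one place.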
We have two dataframes: result and team_records. For each Season and Team observation, team_records holds metrics on team success (records, ratings), while result holds metrics on 3 pointers. Let's merge the two dataframes to get a structured, unified view of our data, so that our interpretation and analysis can focus on one dataframe.
To combine the two dataframes, we use an inner join on "Season" and "Acronym". We use an inner join because we only want observations whose ("Season", "Acronym") combination appears in both dataframes.
To learn more about the different types of joins that can be used to merge tables, you can read more here: https://www.diffen.com/difference/Inner_Join_vs_Outer_Join
merged_df = team_records.merge(result, how="inner", on=["Season", "Acronym"])
merged_df.tail()
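Because an inner join silently drops any row whose key has no partner, it can be worth checking which keys fail to match. Here's a minimal sketch using an outer join with pandas' `indicator` option; the two toy dataframes and their mismatched team codes are made up to show what a mismatch looks like.

```python
import pandas as pd

# Made-up frames: the same team is spelled "BKN" on one side and "BRK" on the other
left = pd.DataFrame({"Season": [2012, 2012], "Acronym": ["BKN", "BOS"], "W": [49, 41]})
right = pd.DataFrame({"Season": [2012, 2012], "Acronym": ["BRK", "BOS"], "3PA_sum": [1700, 1400]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
check = left.merge(right, how="outer", on=["Season", "Acronym"], indicator=True)
unmatched = check[check["_merge"] != "both"]

print(unmatched[["Season", "Acronym", "_merge"]])
```

Any row listed in `unmatched` would have been dropped by the inner join, so an empty result is what we hope to see on the real data.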
# Some of the teams do not have data for all 82 games of the 2017 season,
# so we will not use that season.
merged_df = merged_df[merged_df.Season <= 2016]
merged_df.tail()
To have more data available for analyzing the 3 point shot's adoption, impact, and success in the NBA, we compute the points scored off 3's for each (team, year), as well as the points not scored off 3's.
merged_df = merged_df.copy()  # work on a copy rather than a filtered view
# Number of threes made * 3 = points scored off threes
merged_df['3PScored'] = merged_df['3P_sum'] * 3
# Total points - points scored off threes = all other points
merged_df['OtherScored'] = merged_df['PTS_sum'] - merged_df['3PScored']
merged_df.tail()
This is the Exploratory Data Analysis & Visualization stage of the data lifecycle. During this stage, we will use the summary statistics that we computed, as well as graphical representations, to discover trends in our data and test our assumptions about it.
A formal definition and explanation of Exploratory Data Analysis can be found here as a resource: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
In our introduction, we saw that when the 3 point line and field goal were introduced to the NBA, at least one coach said he was "not going to set up plays for guys to bomb from 23 feet," and coaches in general were not passionate about adopting the 3 point shot (the "gimmick").
So let's see whether there's a general increase in 3 point attempts per team over time, despite that remark.
fig, ax = plt.subplots(figsize=(20,10))
for key, g in merged_df.groupby("Team"):
    ax.plot(g["Season"], g["3PA_sum"], label=key)
ax.set_xlim(1980, 2016)
ax.legend()
ax.set_title('Three Point Attempts per Team Over Time')
#Dips are due to NBA lockouts, when fewer games were played
The two dips in the graph above line up with the NBA's lockout-shortened seasons, 1998-99 and 2011-12. In a lockout, the owners shut the players out while a new collective bargaining agreement is negotiated, so games are cancelled: teams played 50 games in 1998-99 and 66 in 2011-12, rather than the full 82. Fewer games means fewer 3 point attempts.
Other than the years of the dips (explained above), the graph shows that teams have been adopting the 3 point shot as the years have gone on.
Even though the NBA community was originally hesitant about the shot and about how often it would be taken (since it's MUCH farther from the basket), it has become prevalent and impactful in the league. It has clearly been respected and adopted: reading the y-axis of the graph, the number of attempts has roughly quintupled over the years!
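One way to take the shortened seasons out of the picture is to look at attempts per game instead of raw attempts, using W + L as the number of games played. This is a sketch on made-up numbers; the `Games` and `3PA_per_game` column names are ours, while W, L, and 3PA_sum follow the merged dataframe.

```python
import pandas as pd

# Made-up rows: a full 82-game season followed by a 50-game season
toy = pd.DataFrame({
    "Season": [1997, 1998],
    "Team": ["Team A", "Team A"],
    "W": [50, 30],
    "L": [32, 20],
    "3PA_sum": [1230, 800],
})

toy["Games"] = toy["W"] + toy["L"]                   # games played that season
toy["3PA_per_game"] = toy["3PA_sum"] / toy["Games"]  # attempts normalized by games

print(toy[["Season", "3PA_sum", "Games", "3PA_per_game"]])
```

In this toy case the raw total falls in the short season while the per-game rate actually rises, which is the kind of effect a raw-count plot hides.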
Let's explore the correlation between 3 point percentage and wins for teams. To compare different eras, we split our data into 4 distinct time periods, and also calculate summary statistics for each of the teams during those periods.
#split data into time periods: 1980-1989, 1990-2002, 2003-2010, 2011-2016
df1 = merged_df.loc[merged_df['Season'] < 1990]
df2 = merged_df.loc[(merged_df['Season'] >= 1990) & (merged_df['Season'] < 2003)]
df3 = merged_df.loc[(merged_df['Season'] >= 2003) & (merged_df['Season'] < 2011)]
df4 = merged_df.loc[merged_df['Season'] >= 2011]
df4.head() # Should be seasons from 2011 onwards
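The same split can be expressed as a single labeled column with `pd.cut`, which is handy for checking that every season lands in exactly one period. A minimal sketch, assuming the four period boundaries used above; the `Era` column name and the sample seasons are ours.

```python
import pandas as pd

# Boundary seasons of each period, to check the edges
seasons = pd.DataFrame({"Season": [1980, 1989, 1990, 2002, 2003, 2010, 2011, 2016]})

# pd.cut uses right-inclusive bins by default: (1979, 1989], (1989, 2002], ...
bins = [1979, 1989, 2002, 2010, 2016]
labels = ["1980-1989", "1990-2002", "2003-2010", "2011-2016"]
seasons["Era"] = pd.cut(seasons["Season"], bins=bins, labels=labels)

print(seasons)
```

With an `Era` column in place, a single `groupby("Era")` could stand in for the four separate dataframes.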
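Within each of the 4 period dataframes, we'll calculate summary statistics revolving around the 3 point metrics.
As we loop through each dataframe's teams, we sum each team's three point attempts, three point makes, wins, points scored off threes, and other points over the period, and we average its three point attempt rate, three point percentage, win percentage, and offensive rating.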
#1980-1989
arr = []
for team, grp in df1.groupby('Team'):
    threeAttempts = grp["3PA_sum"].sum()
    threeRatio = grp["3PAr_mean"].mean()
    threeMakes = grp["3P_sum"].sum()
    threePercentage = grp["3P%_mean"].mean()
    wins = grp["W"].sum()
    winPercentage = grp["W/L%"].mean()
    ORtg = grp["ORtg"].mean()
    threePointSum = grp["3PScored"].sum()
    otherPointSum = grp["OtherScored"].sum()
    arr.append([team, threeAttempts, threeRatio, threeMakes, threePercentage, wins, winPercentage, ORtg, threePointSum, otherPointSum])
df1_averages = pd.DataFrame(arr)
df1_averages.columns = ["Team", "3PA_sum", "3PAr_mean", "3P_sum", "3P%_mean", "W", "W/L%", "ORtg", "3PScored", "OtherScored"]
df1_averages.head()
#1990-2002
arr = []
for team, grp in df2.groupby('Team'):
    threeAttempts = grp["3PA_sum"].sum()
    threeRatio = grp["3PAr_mean"].mean()
    threeMakes = grp["3P_sum"].sum()
    threePercentage = grp["3P%_mean"].mean()
    wins = grp["W"].sum()
    winPercentage = grp["W/L%"].mean()
    ORtg = grp["ORtg"].mean()
    threePointSum = grp["3PScored"].sum()
    otherPointSum = grp["OtherScored"].sum()
    arr.append([team, threeAttempts, threeRatio, threeMakes, threePercentage, wins, winPercentage, ORtg, threePointSum, otherPointSum])
df2_averages = pd.DataFrame(arr)
df2_averages.columns = ["Team", "3PA_sum", "3PAr_mean", "3P_sum", "3P%_mean", "W", "W/L%", "ORtg", "3PScored", "OtherScored"]
df2_averages.head()
#2003-2010
arr = []
for team, grp in df3.groupby('Team'):
    threeAttempts = grp["3PA_sum"].sum()
    threeRatio = grp["3PAr_mean"].mean()
    threeMakes = grp["3P_sum"].sum()
    threePercentage = grp["3P%_mean"].mean()
    wins = grp["W"].sum()
    winPercentage = grp["W/L%"].mean()
    ORtg = grp["ORtg"].mean()
    threePointSum = grp["3PScored"].sum()
    otherPointSum = grp["OtherScored"].sum()
    arr.append([team, threeAttempts, threeRatio, threeMakes, threePercentage, wins, winPercentage, ORtg, threePointSum, otherPointSum])
df3_averages = pd.DataFrame(arr)
df3_averages.columns = ["Team", "3PA_sum", "3PAr_mean", "3P_sum", "3P%_mean", "W", "W/L%", "ORtg", "3PScored", "OtherScored"]
df3_averages.head()
#2011-2016
arr = []
for team, grp in df4.groupby('Team'):
    threeAttempts = grp["3PA_sum"].sum()
    threeRatio = grp["3PAr_mean"].mean()
    threeMakes = grp["3P_sum"].sum()
    threePercentage = grp["3P%_mean"].mean()
    wins = grp["W"].sum()
    winPercentage = grp["W/L%"].mean()
    ORtg = grp["ORtg"].mean()
    threePointSum = grp["3PScored"].sum()
    otherPointSum = grp["OtherScored"].sum()
    arr.append([team, threeAttempts, threeRatio, threeMakes, threePercentage, wins, winPercentage, ORtg, threePointSum, otherPointSum])
df4_averages = pd.DataFrame(arr)
df4_averages.columns = ["Team", "3PA_sum", "3PAr_mean", "3P_sum", "3P%_mean", "W", "W/L%", "ORtg", "3PScored", "OtherScored"]
df4_averages.head()
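Since the four loops above differ only in which period dataframe they read, the same summary could be wrapped in one helper. Here's a sketch: `summarize_period` is a hypothetical function name of ours, and the toy rows are made up, but the sums and means mirror the ones computed above.

```python
import pandas as pd

def summarize_period(period_df):
    """Collapse one period's (team, season) rows into one row per team."""
    return (
        period_df.groupby("Team", as_index=False)
        .agg(**{
            "3PA_sum": ("3PA_sum", "sum"),
            "3PAr_mean": ("3PAr_mean", "mean"),
            "3P_sum": ("3P_sum", "sum"),
            "3P%_mean": ("3P%_mean", "mean"),
            "W": ("W", "sum"),
            "W/L%": ("W/L%", "mean"),
            "ORtg": ("ORtg", "mean"),
            "3PScored": ("3PScored", "sum"),
            "OtherScored": ("OtherScored", "sum"),
        })
    )

# Made-up rows: two seasons for Team A, one for Team B
toy_period = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team B"],
    "3PA_sum": [100, 200, 150],
    "3PAr_mean": [0.10, 0.20, 0.15],
    "3P_sum": [30, 70, 50],
    "3P%_mean": [0.30, 0.35, 0.33],
    "W": [40, 50, 45],
    "W/L%": [0.488, 0.610, 0.549],
    "ORtg": [105.0, 107.0, 106.0],
    "3PScored": [90, 210, 150],
    "OtherScored": [8000, 8100, 8050],
})

toy_averages = summarize_period(toy_period)
print(toy_averages)
```

With a helper like this, each period's summary becomes a one-line call, for example `summarize_period(df1)`.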
#1980-1989 graph
fig, ax = plt.subplots()
df1_averages.plot(ax = ax, kind = 'scatter', x = '3P%_mean', y = 'W', title= 'Three Point Percentage vs. Wins per Team for Years 1980-1989', figsize = (10, 5))
#label each point with its team name
for name, x, y in zip(df1_averages['Team'], df1_averages['3P%_mean'], df1_averages['W']):
    ax.text(x, y, str(name))
#plotting the linear regression line
slope, intercept = np.polyfit(df1_averages['3P%_mean'], df1_averages['W'], deg = 1)
xs = np.unique(df1_averages['3P%_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
#1990-2002 graph
fig, ax = plt.subplots()
df2_averages.plot(ax = ax, kind = 'scatter', x = '3P%_mean', y = 'W', title= 'Three Point Percentage vs. Wins per Team for Years 1990-2002', figsize = (10, 5))
for name, x, y in zip(df2_averages['Team'], df2_averages['3P%_mean'], df2_averages['W']):
    ax.text(x, y, str(name))
slope, intercept = np.polyfit(df2_averages['3P%_mean'], df2_averages['W'], deg = 1)
xs = np.unique(df2_averages['3P%_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
#2003-2010 graph
fig, ax = plt.subplots()
df3_averages.plot(ax = ax, kind = 'scatter', x = '3P%_mean', y = 'W', title= 'Three Point Percentage vs. Wins per Team for Years 2003-2010', figsize = (10, 5))
for name, x, y in zip(df3_averages['Team'], df3_averages['3P%_mean'], df3_averages['W']):
    ax.text(x, y, str(name))
slope, intercept = np.polyfit(df3_averages['3P%_mean'], df3_averages['W'], deg = 1)
xs = np.unique(df3_averages['3P%_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
#2011-2016 graph
fig, ax = plt.subplots()
df4_averages.plot(ax = ax, kind = 'scatter', x = '3P%_mean', y = 'W', title= 'Three Point Percentage vs. Wins per Team for Years 2011-2016', figsize = (10, 5))
for name, x, y in zip(df4_averages['Team'], df4_averages['3P%_mean'], df4_averages['W']):
    ax.text(x, y, str(name))
slope, intercept = np.polyfit(df4_averages['3P%_mean'], df4_averages['W'], deg = 1)
xs = np.unique(df4_averages['3P%_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
Our results show that in the earliest period there was actually an inverse correlation: teams with lower 3 point percentages were still successful and winning games. Perhaps they didn't want to abandon the style that was already winning for them. Most teams during the 80s and 90s still had a low 3 point percentage, so it looks like few teams at that time had truly adopted the shot.
However, with the exception of 2003-2010, the correlation seems to become more positive as time goes on. By 2011-2016, the correlation has flipped and shows a positive trend between three point percentage and wins.
So even though teams didn't immediately build the 3 point shot into their games, which may explain the negative correlation at the beginning, by recent years (2011-2016) the shot appears to have been adopted, and teams that shoot it well tend to win more.
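Reading the direction of a trend off a regression line is a judgment call, so it can help to put a number on it. Here's a minimal sketch that computes the Pearson correlation coefficient with `Series.corr`; the two toy dataframes are made up to show one clearly positive and one clearly negative relationship, and are not our real results.

```python
import pandas as pd

# Made-up period summaries: wins rise with 3P% in one, fall in the other
toy_positive = pd.DataFrame({"3P%_mean": [0.30, 0.32, 0.34, 0.36],
                             "W": [300, 320, 340, 360]})
toy_negative = pd.DataFrame({"3P%_mean": [0.30, 0.32, 0.34, 0.36],
                             "W": [360, 340, 320, 300]})

# Series.corr defaults to the Pearson correlation, which ranges from -1 to 1
r_pos = toy_positive["3P%_mean"].corr(toy_positive["W"])
r_neg = toy_negative["3P%_mean"].corr(toy_negative["W"])

print(round(r_pos, 3), round(r_neg, 3))
```

Applied to each period's summary dataframe, a value near 0 would suggest that the regression line's slope is not telling us much.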
So we didn't see a clear-cut correlation from the first set of graphs. Let's investigate whether simply shooting more threes has any correlation with teams winning.
#1980-1989 3 point attempt ratio graph
fig, ax = plt.subplots()
df1_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'W', title= 'Three Point Attempt Ratio vs. Wins per Team for Years 1980-1989', figsize = (10, 5))
for name, x, y in zip(df1_averages['Team'], df1_averages['3PAr_mean'], df1_averages['W']):
    ax.text(x, y, str(name))
slope, intercept = np.polyfit(df1_averages['3PAr_mean'], df1_averages['W'], deg = 1)
xs = np.unique(df1_averages['3PAr_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
#1990-2002 3 point attempt ratio graph
fig, ax = plt.subplots()
df2_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'W', title= 'Three Point Attempt Ratio vs. Wins per Team for Years 1990-2002', figsize = (10, 5))
for name, x, y in zip(df2_averages['Team'], df2_averages['3PAr_mean'], df2_averages['W']):
    ax.text(x, y, str(name))
slope, intercept = np.polyfit(df2_averages['3PAr_mean'], df2_averages['W'], deg = 1)
xs = np.unique(df2_averages['3PAr_mean'])
ax.plot(xs, slope * xs + intercept, color = 'black')
plt.show()
#2003-2010 3 point attempt ratio graph
fig, ax = plt.subplots();
df3_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'W', title= 'Three Point Attempt Ratio vs. Wins per Team for Years 2003-2010', figsize = (10, 5))
for name, grp in df3_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df3_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df3_averages['3PAr_mean'], y = df3_averages['W'], deg = 1))(np.unique(df3_averages['3PAr_mean'])), color = 'black')
plt.show()
#2011-2016 3 point attempt ratio graph
fig, ax = plt.subplots();
df4_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'W', title= 'Three Point Attempt Ratio vs. Wins per Team for Years 2011-2016', figsize = (10, 5))
for name, grp in df4_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df4_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df4_averages['3PAr_mean'], y = df4_averages['W'], deg = 1))(np.unique(df4_averages['3PAr_mean'])), color = 'black')
plt.show()
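The black line drawn in each chart is a least-squares straight line. As a minimal sketch of what the `np.polyfit` / `np.poly1d` combination in the plotting cells computes, here it is unpacked step by step. The numbers are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np

# Made-up (x, y) pairs standing in for '3PAr_mean' and 'W'; not real team data.
x = np.array([0.02, 0.03, 0.05, 0.08, 0.10])
y = np.array([30.0, 41.0, 38.0, 47.0, 52.0])

# polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)

# poly1d wraps the coefficients in a callable: trend(v) = slope * v + intercept.
trend = np.poly1d([slope, intercept])

# Evaluate the line on the sorted unique x values, as the plotting cells do.
xs = np.unique(x)
ys = trend(xs)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A positive slope means the y value tends to rise with x, and a negative slope means it tends to fall. The slope alone does not say how tightly the points follow the line.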
Now we can see a positive correlation forming as the years go on from 1980 to 2016. In the 80s, there looked to be no correlation: simply shooting more 3's during this era had little visible relationship with winning. Perhaps teams during this era hadn't yet worked the shot into their game well enough to win with it, or perhaps this era, which was dominated by big, tall centers (players above 6'10"), was more focused on using physical tools to score inside the arc rather than outside.
However, as we move toward modern times, attempting more threes is increasingly associated with wins. Especially in the last 2 eras, we see the positive correlation forming: teams that shoot more threes, even without necessarily making all of them, still tend to win more. So the adoption of the 3 point shot is happening, and for the better, as we see its relationship with wins and success.
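Reading a correlation off a scatter plot is subjective, so it can help to put a number on it. The sketch below computes the Pearson correlation coefficient with `np.corrcoef`. The helper name `pearson_r` and the era values are hypothetical and chosen only to illustrate a weak negative and a strong positive relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-team values for two eras (illustrative only, not real data).
eras = {
    "early era": ([0.02, 0.03, 0.04, 0.05], [45, 40, 44, 41]),
    "recent era": ([0.25, 0.28, 0.31, 0.35], [35, 42, 48, 55]),
}

for label, (attempt_rate, wins) in eras.items():
    print(f"{label}: r = {pearson_r(attempt_rate, wins):+.2f}")
```

In the notebook, the same helper could be called as `pearson_r(df1_averages['3PAr_mean'], df1_averages['W'])` for each era, which would allow the eras to be compared by a single value between -1 and 1.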
Since we're interested in the 3 point shot's adoption and impact, we also wanted to see whether there's a correlation between simply shooting more threes (the attempt rate) and how many points a team produces (offensive rating, i.e. points scored per 100 possessions).
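For readers unfamiliar with the two metrics, here is a small worked sketch. It assumes the columns follow the usual definitions (attempt rate as three-point attempts divided by all field goal attempts, and offensive rating as points per 100 possessions); this section does not show how the dataset computed them. The function names and game totals are hypothetical.

```python
def three_point_attempt_rate(three_pa, fga):
    """Share of field goal attempts taken from three-point range."""
    return three_pa / fga

def offensive_rating(points, possessions):
    """Points scored per 100 possessions."""
    return 100 * points / possessions

# Hypothetical single-game totals, for illustration only.
print(round(three_point_attempt_rate(three_pa=30, fga=90), 3))
print(round(offensive_rating(points=110, possessions=100), 1))
```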
#1980-1989 Ortg graph
fig, ax = plt.subplots();
df1_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'ORtg', title= 'Three Point Attempt Ratio vs. Offensive Rating per Team for Years 1980-1989', figsize = (10, 5))
for name, grp in df1_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['ORtg'].iloc[0], str(name))
plt.plot(np.unique(df1_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df1_averages['3PAr_mean'], y = df1_averages['ORtg'], deg = 1))(np.unique(df1_averages['3PAr_mean'])), color = 'black')
plt.show()
#1990-2002 Ortg graph
fig, ax = plt.subplots();
df2_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'ORtg', title= 'Three Point Attempt Ratio vs. Offensive Rating per Team for Years 1990-2002', figsize = (10, 5))
for name, grp in df2_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['ORtg'].iloc[0], str(name))
plt.plot(np.unique(df2_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df2_averages['3PAr_mean'], y = df2_averages['ORtg'], deg = 1))(np.unique(df2_averages['3PAr_mean'])), color = 'black')
plt.show()
#2003-2010 Ortg graph
fig, ax = plt.subplots();
df3_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'ORtg', title= 'Three Point Attempt Ratio vs. Offensive Rating per Team for Years 2003-2010', figsize = (10, 5))
for name, grp in df3_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['ORtg'].iloc[0], str(name))
plt.plot(np.unique(df3_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df3_averages['3PAr_mean'], y = df3_averages['ORtg'], deg = 1))(np.unique(df3_averages['3PAr_mean'])), color = 'black')
plt.show()
#2011-2016 Ortg graph
fig, ax = plt.subplots();
df4_averages.plot(ax = ax, kind = 'scatter', x = '3PAr_mean', y = 'ORtg', title= 'Three Point Attempt Ratio vs. Offensive Rating per Team for Years 2011-2016', figsize = (10, 5))
for name, grp in df4_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3PAr_mean'].iloc[0], grp['ORtg'].iloc[0], str(name))
plt.plot(np.unique(df4_averages['3PAr_mean']), np.poly1d(np.polyfit(x = df4_averages['3PAr_mean'], y = df4_averages['ORtg'], deg = 1))(np.unique(df4_averages['3PAr_mean'])), color = 'black')
plt.show()
Evidently, as the years go on and the 3 point shot becomes more widely accepted throughout the league, a positive trend forms between attempting more threes and scoring production.
We see more scoring overall as teams attempt more threes, especially within the last two eras. This is important to note: as basketball fans ourselves, we find scoring, and just putting the ball in the basket, exciting to see. When the three point shot was introduced, one justification for it, as mentioned in the introduction, was to improve game attendance and viewership.
With the positive correlation between three point attempts and offensive rating, it's exciting to see that the league is scoring more, and in a variety of ways that are further away from the basket.
Finally, to cap it off, let's see if the number of 3 pointers made by a team correlates with wins.
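The plotting cells in this section repeat the same steps for every era and metric. One way to cut the repetition is a small helper function. The sketch below is an assumption about how that could look: the name `plot_with_trend` is hypothetical, and the demo frame is made-up data standing in for one of the `df*_averages` tables.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_with_trend(df, x, y, title, label_col="Team"):
    """Scatter df[x] against df[y], label each point, and overlay a linear fit."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(df[x], df[y])
    # Label every point with its team name.
    for _, row in df.iterrows():
        ax.text(row[x], row[y], str(row[label_col]))
    # Least-squares straight line through the points.
    trend = np.poly1d(np.polyfit(df[x], df[y], deg=1))
    xs = np.sort(df[x].unique())
    ax.plot(xs, trend(xs), color="black")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(title)
    return ax

# Tiny made-up frame standing in for one of the df*_averages tables.
demo = pd.DataFrame({
    "Team": ["AAA", "BBB", "CCC", "DDD"],
    "3PAr_mean": [0.20, 0.25, 0.30, 0.35],
    "W": [30, 38, 41, 50],
})
ax = plot_with_trend(demo, "3PAr_mean", "W",
                     "Three Point Attempt Ratio vs. Wins (demo data)")
```

With a helper like this, each era's chart becomes a single call followed by `plt.show()`, and the era label only has to be typed once per chart.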
#1980-1989 graph
fig, ax = plt.subplots();
df1_averages.plot(ax = ax, kind = 'scatter', x = '3P_sum', y = 'W', title= 'Three Point Makes vs. Wins per Team for Years 1980-1989', figsize = (10, 5))
for name, grp in df1_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3P_sum'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df1_averages['3P_sum']), np.poly1d(np.polyfit(x = df1_averages['3P_sum'], y = df1_averages['W'], deg = 1))(np.unique(df1_averages['3P_sum'])), color = 'black')
plt.show()
#1990-2002 graph
fig, ax = plt.subplots();
df2_averages.plot(ax = ax, kind = 'scatter', x = '3P_sum', y = 'W', title= 'Three Point Makes vs. Wins per Team for Years 1990-2002', figsize = (10, 5))
for name, grp in df2_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3P_sum'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df2_averages['3P_sum']), np.poly1d(np.polyfit(x = df2_averages['3P_sum'], y = df2_averages['W'], deg = 1))(np.unique(df2_averages['3P_sum'])), color = 'black')
plt.show()
#2003-2010 graph
fig, ax = plt.subplots();
df3_averages.plot(ax = ax, kind = 'scatter', x = '3P_sum', y = 'W', title= 'Three Point Makes vs. Wins per Team for Years 2003-2010', figsize = (10, 5))
for name, grp in df3_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3P_sum'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df3_averages['3P_sum']), np.poly1d(np.polyfit(x = df3_averages['3P_sum'], y = df3_averages['W'], deg = 1))(np.unique(df3_averages['3P_sum'])), color = 'black')
plt.show()
#2011-2016 graph
fig, ax = plt.subplots();
df4_averages.plot(ax = ax, kind = 'scatter', x = '3P_sum', y = 'W', title= 'Three Point Makes vs. Wins per Team for Years 2011-2016', figsize = (10, 5))
for name, grp in df4_averages.groupby('Team'):
    # Label each point with its team name
    plt.text(grp['3P_sum'].iloc[0], grp['W'].iloc[0], str(name))
plt.plot(np.unique(df4_averages['3P_sum']), np.poly1d(np.polyfit(x = df4_averages['3P_sum'], y = df4_averages['W'], deg = 1))(np.unique(df4_averages['3P_sum'])), color = 'black')
plt.show()
Impressively, even though in some earlier eras a metric like the attempt rate or 3 point shooting efficiency did not show a positive correlation with wins, in every era we see a positive correlation between making more threes and winning.
Even though teams were originally hesitant and didn't believe in building the three pointer into their game and plays in 1979, we can see from these plots that the teams who were willing to shoot and make more threes, rather than barely shooting them at all, generally had more success when it came to scoring more and winning more (which is entertaining, popular, and exciting).
Finally, to solidify the notion that 3 pointers have been adopted by NBA teams over the years, and that the shot is here to stay and keep growing, here are pie charts showing the points scored off 3 pointers vs. points scored off everything else (2 pointers or free throws).
# Getting size of the total pie chart, with boundaries being 3 points scored
# in that specific era vs other points scored in that specific era.
sizes1 = [df1["3PScored"].sum(), df1["OtherScored"].sum()]
sizes2 = [df2["3PScored"].sum(), df2["OtherScored"].sum()]
sizes3 = [df3["3PScored"].sum(), df3["OtherScored"].sum()]
sizes4 = [df4["3PScored"].sum(), df4["OtherScored"].sum()]
# Defining colors and dimensions to plot our pie charts.
labels = "Points from Threes", "Points from Other"
colors = ['lightskyblue', 'gold']
explode = (0.1, 0)
plt.figure(0)
ax1 = plt.pie(sizes1, explode=explode, labels = labels, colors = colors, autopct='%1.1f%%')
plt.title("1980-1989 ");
plt.figure(1)
ax2 = plt.pie(sizes2, explode=explode, labels = labels, colors = colors, autopct='%1.1f%%')
plt.title("1990-2002");
plt.figure(2)
ax3 = plt.pie(sizes3, explode=explode, labels = labels, colors = colors, autopct='%1.1f%%')
plt.title("2003-2010");
plt.figure(3)
ax4 = plt.pie(sizes4, explode=explode, labels = labels, colors = colors, autopct='%1.1f%%')
plt.title("2011-2016");
From accounting for a mere 3% of all points scored to now accounting for close to a quarter of them, the 3 pointer has clearly been adopted and has impacted the league, despite original claims that it wouldn't do so at all.
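The percentages printed on the pie charts can also be computed directly, which makes it easier to compare eras side by side. This is a minimal sketch: the function name `three_point_share` is hypothetical and the totals passed to it are placeholders, not the notebook's real sums.

```python
def three_point_share(points_from_threes, points_from_other):
    """Percentage of all points that came from three-pointers."""
    total = points_from_threes + points_from_other
    return 100 * points_from_threes / total

# Placeholder totals, NOT the notebook's real sums.
print(f"{three_point_share(3_000, 97_000):.1f}%")
```

In the notebook, each era's share could be obtained with `three_point_share(*sizes1)`, `three_point_share(*sizes2)`, and so on.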
This is the Machine Learning/Predictive Modeling stage of the data lifecycle. During this stage, we will use linear regression to obtain a predictive model of our data. We will use the statsmodels.formula.api library to fit a model on our training data, and then generate predictions for the data that we feed it.
To read more about using statsmodels.formula.api to apply linear regression, visit: https://www.statsmodels.org/stable/index.html
We are looking at 3 different winning statistics for a team (Number of Wins, Offensive Rating, and Points Scored) to see if they have an impact on the points scored off 3 pointers.
Null Hypothesis: None of the 3 different winning statistics have an impact on points scored off 3 pointers.
To test the null hypothesis, we will perform Linear Regression on our dataset, using the Ordinary Least Squares (OLS) technique through the statsmodels.formula.api library.
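As background, Ordinary Least Squares chooses the intercept and coefficients that minimise the sum of squared differences between the observed and predicted values. The sketch below illustrates the idea with `np.linalg.lstsq` on synthetic data. It is not the statsmodels call used in this tutorial, and the three predictors are random numbers standing in for W, ORtg and PTS_sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors standing in for W, ORtg and PTS_sum (not real data).
n = 200
X = rng.normal(size=(n, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=n)

# Add a column of ones so the first fitted coefficient is the intercept.
design = np.column_stack([np.ones(n), X])

# Least-squares solution: minimises the sum of squared residuals.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("intercept:", round(float(beta[0]), 2))
print("slopes:", np.round(beta[1:], 2))
```

Because the synthetic data was generated from known coefficients, the fitted values should land close to them, which is a quick sanity check that the procedure recovers the underlying relationship.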
We only want to consider winning statistics and predict points scored off threes, so we can drop the columns that we won't need for our regression model from our dataframe.
# Rename the columns whose names start with a digit so they are easier to reference
df4 = df4.rename(index=str, columns={
    "3P_sum": "ThreePSum",
    "3PA_sum": "ThreePAttemptSum",
    "3P%_mean": "ThreePercentageMean",
    "3PAr_mean": "ThreeRatioMean",
    "3PScored": "ThreePScored",
})
# Drop the columns that the regression will not use
df4 = df4.drop(columns=['L', 'W/L%', 'OtherScored', 'Acronym', 'ThreePercentageMean', 'ThreeRatioMean'])
df4.head()
Now we can split our data into training and test sets, and then feed the training set into the ols() function to get back a linear regression model that we can use for predictions.
import warnings
warnings.filterwarnings('ignore')
import statsmodels.formula.api as smf
from sklearn import model_selection
X = df4.drop('ThreePScored', axis=1)
y = df4['ThreePScored']
# Split data into Train and Test (70% / 30%)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
# Reattach the target column so the formula can refer to it by name
X_train = X_train.assign(ThreePScored=y_train)
# Fit the Linear Regression on Train split
lr = smf.ols(formula='ThreePScored ~ W + ORtg + PTS_sum', data=X_train).fit()
# Predict using Test split
preds_lr = lr.predict(X_test)
# Plot how the predicted points scored off threes compare to the actual values
fig, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted');
# kdeplot draws the same density curve as the deprecated distplot(hist=False)
sns.kdeplot(y_test, label="Actual", ax=ax)
sns.kdeplot(preds_lr, label="Linear Regression Predictions", ax=ax)
ax.legend()
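Comparing two density curves by eye only goes so far. A numeric error on the held-out test split gives a more direct sense of how close the predictions are. The sketch below defines two common measures, root mean squared error and R squared, with NumPy. The function names are hypothetical, and the example arrays are toy values, not the notebook's predictions.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """Fraction of the variance in the actual values explained by the predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Toy values for illustration only.
toy_actual = [100.0, 150.0, 200.0, 250.0]
toy_predicted = [110.0, 140.0, 205.0, 245.0]
print(round(rmse(toy_actual, toy_predicted), 2))
print(round(r_squared(toy_actual, toy_predicted), 3))
```

In the notebook these could be called as `rmse(y_test, preds_lr)` and `r_squared(y_test, preds_lr)`.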
So we have our fitted model, which uses wins, offensive rating, and points scored to predict points scored off 3's. Let's check the summary of our model and look at the coefficients for our inputs.
lr.summary()
Let's use a concrete, testable metric to show how good our model is. We'll use the F-test for goodness of fit. If the model's F-statistic is greater than the critical F-value, or equivalently the p-value is less than alpha (which is 0.05 in our case), a significant model has been produced.
Here's more information about the F-Test: https://subscription.packtpub.com/book/application_development/9781783284375/8/ch08lvl1sec124/t-test-and-f-test
This test can help us decide whether we can reject our null hypothesis.
# Import the F distribution to look up the critical F-value
from scipy.stats import f
from sklearn import model_selection
from sklearn import linear_model
# F-Test to evaluate goodness of fit (jointly tests all parameters, including the intercept)
test = lr.f_test(np.identity(len(lr.params)))
# fvalue may be a scalar or a 1x1 array depending on the statsmodels version; np.squeeze handles both
print(' Model - Critical F-Value: ' + str(f.ppf(.95,test.df_num,test.df_denom)) + \
' F-Statistic: ' + str(float(np.squeeze(test.fvalue))) + ' P-Value: ' + str(float(np.squeeze(test.pvalue))))
We have produced a significant model, as our F-statistic is greater than the critical F-value and our p-value is less than 0.05. So we can use our model to predict the points scored off 3 pointers.
When we look at the parameters reported by .summary(), Offensive Rating appears to have a major impact on the ability to predict points scored off 3 pointers, as its coefficient is significant and plays a major part in our regression model.
We have found a winning statistic (ORtg) whose coefficient has a significant impact on points scored off 3 pointers, so we can reject the null hypothesis.
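As a cross-check on the F-test used in this section, the more common overall F-test asks whether all slope coefficients are zero and leaves the intercept out. statsmodels reports this version as `lr.fvalue` and `lr.f_pvalue`. The sketch below computes it by hand on synthetic data, under the assumption that the fitted values come from an OLS fit with an intercept. The function name `overall_f_test` is hypothetical.

```python
import numpy as np
from scipy.stats import f

def overall_f_test(y, y_hat, n_predictors, alpha=0.05):
    """Overall F-test that all slope coefficients are zero (intercept excluded)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    df_num = n_predictors
    df_denom = len(y) - n_predictors - 1
    ss_res = np.sum((y - y_hat) ** 2)         # variation left unexplained
    ss_reg = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the fit
    f_stat = (ss_reg / df_num) / (ss_res / df_denom)
    critical = f.ppf(1 - alpha, df_num, df_denom)
    p_value = f.sf(f_stat, df_num, df_denom)
    return float(f_stat), float(critical), float(p_value)

# Synthetic data with a strong linear signal (not the notebook's data).
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y = 1.0 + x @ np.array([3.0, -2.0]) + rng.normal(scale=0.5, size=50)
design = np.column_stack([np.ones(50), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

f_stat, critical, p_value = overall_f_test(y, design @ beta, n_predictors=2)
print(f"F-statistic={f_stat:.1f}, critical value={critical:.2f}, p-value={p_value:.3g}")
```

The model is significant at the chosen alpha when the F-statistic exceeds the critical value, which is the same condition as the p-value falling below alpha.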
Our goal for this tutorial, as mentioned in our introduction, was:
In this tutorial, our goal is to look at NBA teams' data around the 3-pointer and see if we can find any insight regarding the adoption and use of 3-pointers from the moment it was introduced in 1979, as well as its ties to teams' success and scoring output. We'll see its impact in the league, and show that it is more than just "a gimmick."
With our exploratory data analysis, visualizations, and machine learning, we have shown how popular and prevalent the 3 pointer has become in the NBA. Teams are shooting and making more of them, and this has correlated with an increase in wins and offensive output. Positive correlation, especially within the last 10 years, is visible across 3 point attempts, makes, and efficiency.
NBA teams have adopted, been impacted by, and found success with the 3 pointer since it was first introduced in 1979, despite the negative reaction that it originally faced. John MacLeod, coach of the Phoenix Suns, said, “but I'm not going to set up plays for guys to bomb from 23 feet. I think that's very boring basketball.” Now we've seen quite the opposite: NBA teams are working 3 pointers into their game as much as they can, and doing so is associated with more wins and a higher offensive rating.
It was also introduced as a way for the league to woo its audience back and increase in-game attendance and viewership. Currently, the NBA's league rating is the highest that it has ever been, and with teams now scoring in more ways than just two pointers, and scoring much more because of 3's, who wouldn't tune in to watch a game?
Oh, you still wouldn't? Well, here's a picture to change your mind (by the way, this is a game winner, and look at where Steph Curry is shooting it from!):
from IPython.core.display import Image
Image(filename='stephokc.png')
As the prevalence of the 3 pointer in the NBA keeps increasing over time, we'd like to mention some crucial decisions that need to be made on how to evaluate this phenomenon.
from IPython.core.display import Image
Image(filename='bensimmons.png')