Modules for Import¶

In [1]:
import pandas as pd
from scipy.stats import shapiro

from datetime import datetime, date

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.io as pio

from IPython.display import HTML, Javascript, display

from pypalettes import load_cmap
import seaborn as sns
import cufflinks as cf
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.io as pio
from plotly.offline import init_notebook_mode

Read df from disk¶

In [2]:
# Read from CSV
df = pd.read_csv('nba-stats-data.csv', low_memory=False)
print(df)
        assists  blocks comment  defReb   fga  fgm   fgp  fta  ftm    ftp  \
0           3.0     1.0     NaN     7.0   6.0  1.0  16.7  4.0  3.0   75.0   
1           3.0     1.0     NaN     7.0   6.0  1.0  16.7  4.0  3.0   75.0   
2           3.0     1.0     NaN     7.0   6.0  1.0  16.7  4.0  3.0   75.0   
3           0.0     0.0     NaN     5.0   8.0  5.0  62.5  0.0  0.0    0.0   
4           1.0     0.0     NaN     6.0  10.0  5.0  50.0  4.0  4.0  100.0   
...         ...     ...     ...     ...   ...  ...   ...  ...  ...    ...   
394809      1.0     0.0     NaN     2.0   7.0  0.0   0.0  1.0  1.0  100.0   
394810      0.0     1.0     NaN     1.0  10.0  1.0  10.0  3.0  3.0  100.0   
394811      2.0     1.0     NaN     6.0   5.0  3.0  60.0  0.0  0.0    0.0   
394812      5.0     0.0     NaN     3.0  12.0  6.0  50.0  5.0  4.0   80.0   
394813      0.0     0.0     NaN     0.0   1.0  0.0   0.0  0.0  0.0    0.0   

        ...         DOB   age  feet  meters  position active  jersey_number  \
0       ...  1985-02-10  40.0   6"7    2.01         F   True            8.0   
1       ...  1985-02-10  40.0   6"7    2.01         F   True            8.0   
2       ...  1985-02-10  40.0   6"7    2.01         F   True            8.0   
3       ...  1989-07-01  35.0   6"4    1.93       G-F   True           24.0   
4       ...  1986-06-03  38.0   6"9    2.06       C-F   True           42.0   
...     ...         ...   ...   ...     ...       ...    ...            ...   
394809  ...         NaN   NaN   NaN     NaN       NaN    NaN            NaN   
394810  ...         NaN   NaN   NaN     NaN       NaN    NaN            NaN   
394811  ...         NaN   NaN   NaN     NaN       NaN    NaN            NaN   
394812  ...         NaN   NaN   NaN     NaN       NaN    NaN            NaN   
394813  ...         NaN   NaN   NaN     NaN       NaN    NaN            NaN   

        years_pro  rookie_year  pounds  
0            15.0       2006.0   257.0  
1            15.0       2006.0   257.0  
2            15.0       2006.0   257.0  
3             9.0       2012.0   195.0  
4            14.0       2007.0   240.0  
...           ...          ...     ...  
394809        NaN          NaN     NaN  
394810        NaN          NaN     NaN  
394811        NaN          NaN     NaN  
394812        NaN          NaN     NaN  
394813        NaN          NaN     NaN  

[394814 rows x 45 columns]
In [3]:
# set season (year) as main DF index to allow for time series analysis of player data
df = df.reset_index().set_index('season')
df.index = pd.to_datetime(df.index, format='%Y').to_period('Y')
df = df.sort_index(axis=0)

Data Analysis¶

  • Set font size
  • Define function to create time series plot
  • Define function to summarise top players for a given performance statistic (i.e. Points scored, Blocks, Steals, Rebounds, Field Goal Percentage etc.)
  • Define function to highlight cell greater or equal threshold of 0.7 and less than 1
In [4]:
fontsize=22

# convenience function for time series plots
def create_ts_plot(data, title, ylabel, color="navy", rhs_vals=None):
    # create a time series plot
    pd.options.plotting.backend = "matplotlib"
    ax = data.plot(figsize=(9, 4.3), legend=False, lw=2, zorder=3, color=color)
    plt.title(title, fontsize=fontsize-1)
    plt.ylabel(ylabel)
    plt.xlabel(None)
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
    return ax

# convenience function to aggregate top players by a given statistic or feature
def summary_top_players(grp, stat,  sort_by = 'mean', results = 10):
    df[stat] = pd.to_numeric(df[stat], errors='coerce')# handle nans and plusMinus values which are char in format ''+/-[0-9]*'
    df_aggs = round(df.groupby([grp])[stat].agg(['mean', 'sum']), 2)
    return df_aggs.sort_values(by=sort_by, ascending=False).head(results)

# convenience method to highlight data frame cell lime green
def highlighter(cell_value, thresh_lower = 0.7, thresh_upper = 0.99):
    if cell_value < thresh_upper and cell_value >= thresh_lower:
        return "background-color: #32CD32"
    elif cell_value == 1:
        return "background-color: yellow"
  • Convert pounds feature to numeric column and get mean value
In [5]:
df['pounds'] = pd.to_numeric(df['pounds'])
print("Average player weight %s lbs." % round(df['pounds'].mean(skipna = True), 2))
Average player weight 219.89 lbs.
  • Convert meters feature to numeric column and get mean value
In [6]:
df['meters'] = pd.to_numeric(df['meters'])
round(df['meters'].mean(skipna = True), 2)
print("Average player height %s meters" % round(df['meters'].mean(skipna = True), 2))
Average player height 1.99 meters
  • Check is there a correlation between player weight in lbs and height in meters
In [7]:
df[['pounds', 'meters']].corr()
Out[7]:
pounds meters
pounds 1.000000 0.801738
meters 0.801738 1.000000
  • There is a strong correlation between player weight and height
  • Taller players are more likely to be heavier, However, there is still room for some outliers to this trend
  • Lets visualise this as an interactive scatter plot
In [8]:
df.plot('pounds', 'meters', 'scatter')
Out[8]:
<Axes: xlabel='pounds', ylabel='meters'>
No description has been provided for this image
  • I think this is a good approach to find some important features in this dataset related to player performance
  • I am now going to look at the correlation table for all numeric columns
  • Then I will apply a cell highlighter for interesting fields in the correlation matrix with a strong relationship or corerelation (over 70%)
In [9]:
numeric_cols = ['assists', 'blocks', 'defReb', 'fga', 'fgm', 'fgp', 'fta', 'ftm', 'ftp', 'offReb', 'pFouls', 
                'points', 'steals', 'totReb', 'tpa', 'tpm', 'tpp', 'turnovers', 'age', 'meters', 'min', 'plusMinus',
                'pounds', 'years_pro', 'rookie_year', 'jersey_number']

df_corr_c = df[numeric_cols].corr()

df_corr_c.style.map(highlighter)
Out[9]:
  assists blocks defReb fga fgm fgp fta ftm ftp offReb pFouls points steals totReb tpa tpm tpp turnovers age meters min plusMinus pounds years_pro rookie_year jersey_number
assists 1.000000 0.079840 0.341542 0.548966 0.484388 0.193481 0.369765 0.374962 0.311134 0.099026 0.245023 0.506260 0.316305 0.304469 0.390774 0.302927 0.203165 0.452770 0.128832 -0.252423 0.567290 0.159905 -0.166412 0.219776 0.106663 -0.147452
blocks 0.079840 1.000000 0.336355 0.195425 0.220408 0.196177 0.191473 0.159980 0.140813 0.278480 0.211656 0.208109 0.111982 0.367662 0.024140 0.017648 0.004017 0.141856 0.077320 0.251926 0.240309 0.085996 0.243935 0.110978 0.068239 0.012307
defReb 0.341542 0.336355 1.000000 0.510386 0.510206 0.327576 0.409690 0.370083 0.322418 0.420608 0.365465 0.504478 0.236909 0.940426 0.228649 0.181592 0.117611 0.376868 0.137425 0.261249 0.569548 0.162405 0.312349 0.224570 0.120884 -0.014604
fga 0.548966 0.195425 0.510386 1.000000 0.897029 0.322502 0.538305 0.538127 0.465845 0.287690 0.381801 0.897463 0.359770 0.507344 0.708842 0.565154 0.315441 0.494552 0.157100 -0.128543 0.794232 0.081852 -0.062387 0.262914 0.151250 -0.148484
fgm 0.484388 0.220408 0.510206 0.897029 1.000000 0.554352 0.513336 0.503563 0.435286 0.302518 0.362593 0.968215 0.326853 0.512760 0.566037 0.585903 0.396349 0.452816 0.148848 -0.029965 0.728871 0.181727 0.029610 0.245894 0.144353 -0.114200
fgp 0.193481 0.196177 0.327576 0.322502 0.554352 1.000000 0.226268 0.202018 0.253894 0.233494 0.297128 0.512236 0.178603 0.343931 0.145002 0.297741 0.422328 0.215789 0.087811 0.130382 0.263894 0.169360 0.146847 0.127607 0.077829 -0.015409
fta 0.369765 0.191473 0.409690 0.538305 0.513336 0.226268 1.000000 0.953071 0.629833 0.258113 0.273107 0.662380 0.238647 0.417436 0.268027 0.203644 0.110822 0.381546 0.118008 0.044996 0.486043 0.085496 0.108121 0.213502 0.106079 -0.071535
ftm 0.374962 0.159980 0.370083 0.538127 0.503563 0.202018 0.953071 1.000000 0.705585 0.203123 0.251555 0.671765 0.229993 0.365823 0.300367 0.230797 0.129817 0.371399 0.113967 0.000897 0.477959 0.095969 0.055910 0.206695 0.100236 -0.077106
ftp 0.311134 0.140813 0.322418 0.465845 0.435286 0.253894 0.629833 0.705585 1.000000 0.184509 0.265014 0.544077 0.215306 0.321535 0.292309 0.229177 0.157550 0.301711 0.095384 -0.023133 0.435448 0.083457 0.009371 0.161579 0.086646 -0.067285
offReb 0.099026 0.278480 0.420608 0.287690 0.302518 0.233494 0.258113 0.203123 0.184509 1.000000 0.266119 0.269383 0.131953 0.704013 -0.045042 -0.049622 -0.051849 0.180521 0.062571 0.330219 0.283937 0.040610 0.374211 0.109273 0.061528 0.040387
pFouls 0.245023 0.211656 0.365465 0.381801 0.362593 0.297128 0.273107 0.251555 0.265014 0.266119 1.000000 0.364967 0.218776 0.385816 0.234468 0.187246 0.148717 0.332724 0.089405 0.092152 0.432809 0.008543 0.116605 0.130562 0.075931 -0.029604
points 0.506260 0.208109 0.504478 0.897463 0.968215 0.512236 0.662380 0.671765 0.544077 0.269383 0.364967 1.000000 0.334354 0.495859 0.631962 0.656948 0.442276 0.472023 0.154545 -0.062424 0.744084 0.186617 -0.002027 0.258023 0.145995 -0.122387
steals 0.316305 0.111982 0.236909 0.359770 0.326853 0.178603 0.238647 0.229993 0.215306 0.131953 0.218776 0.334354 1.000000 0.234903 0.253356 0.196634 0.138973 0.262363 0.084613 -0.096721 0.359245 0.113339 -0.054177 0.124488 0.075410 -0.084824
totReb 0.304469 0.367662 0.940426 0.507344 0.512760 0.343931 0.417436 0.365823 0.321535 0.704013 0.385816 0.495859 0.234903 1.000000 0.162106 0.123553 0.072635 0.362664 0.131156 0.328686 0.557822 0.142350 0.385286 0.216941 0.117785 0.003703
tpa 0.390774 0.024140 0.228649 0.708842 0.566037 0.145002 0.268027 0.300367 0.292309 -0.045042 0.234468 0.631962 0.253356 0.162106 1.000000 0.824836 0.435104 0.309413 0.104815 -0.318287 0.585642 0.071578 -0.303864 0.156840 0.098521 -0.137527
tpm 0.302927 0.017648 0.181592 0.565154 0.585903 0.297741 0.203644 0.230797 0.229177 -0.049622 0.187246 0.656948 0.196634 0.123553 0.824836 1.000000 0.705648 0.241962 0.088194 -0.248232 0.474144 0.165116 -0.238167 0.132337 0.078425 -0.106122
tpp 0.203165 0.004017 0.117611 0.315441 0.396349 0.422328 0.110822 0.129817 0.157550 -0.051849 0.148717 0.442276 0.138973 0.072635 0.435104 0.705648 1.000000 0.155573 0.043654 -0.201655 0.272089 0.145718 -0.199015 0.066599 0.035838 -0.080371
turnovers 0.452770 0.141856 0.376868 0.494552 0.452816 0.215789 0.381546 0.371399 0.301711 0.180521 0.332724 0.472023 0.262363 0.362664 0.309413 0.241962 0.155573 1.000000 0.111894 -0.059444 0.468933 -0.019882 0.014308 0.201057 0.098182 -0.098600
age 0.128832 0.077320 0.137425 0.157100 0.148848 0.087811 0.118008 0.113967 0.095384 0.062571 0.089405 0.154545 0.084613 0.131156 0.104815 0.088194 0.043654 0.111894 1.000000 0.016241 0.107033 0.041469 0.176578 0.658807 0.900022 -0.111559
meters -0.252423 0.251926 0.261249 -0.128543 -0.029965 0.130382 0.044996 0.000897 -0.023133 0.330219 0.092152 -0.062424 -0.096721 0.328686 -0.318287 -0.248232 -0.201655 -0.059444 0.016241 1.000000 -0.085559 0.003383 0.801738 0.022226 0.006351 0.337993
min 0.567290 0.240309 0.569548 0.794232 0.728871 0.263894 0.486043 0.477959 0.435448 0.283937 0.432809 0.744084 0.359245 0.557822 0.585642 0.474144 0.272089 0.468933 0.107033 -0.085559 1.000000 0.089483 -0.049958 0.146762 0.125697 -0.075726
plusMinus 0.159905 0.085996 0.162405 0.081852 0.181727 0.169360 0.085496 0.095969 0.083457 0.040610 0.008543 0.186617 0.113339 0.142350 0.071578 0.165116 0.145718 -0.019882 0.041469 0.003383 0.089483 1.000000 0.017134 0.067028 0.021486 -0.004002
pounds -0.166412 0.243935 0.312349 -0.062387 0.029610 0.146847 0.108121 0.055910 0.009371 0.374211 0.116605 -0.002027 -0.054177 0.385286 -0.303864 -0.238167 -0.199015 0.014308 0.176578 0.801738 -0.049958 0.017134 1.000000 0.201739 0.018154 0.275634
years_pro 0.219776 0.110978 0.224570 0.262914 0.245894 0.127607 0.213502 0.206695 0.161579 0.109273 0.130562 0.258023 0.124488 0.216941 0.156840 0.132337 0.066599 0.201057 0.658807 0.022226 0.146762 0.067028 0.201739 1.000000 0.402829 -0.098223
rookie_year 0.106663 0.068239 0.120884 0.151250 0.144353 0.077829 0.106079 0.100236 0.086646 0.061528 0.075931 0.145995 0.075410 0.117785 0.098521 0.078425 0.035838 0.098182 0.900022 0.006351 0.125697 0.021486 0.018154 0.402829 1.000000 -0.168386
jersey_number -0.147452 0.012307 -0.014604 -0.148484 -0.114200 -0.015409 -0.071535 -0.077106 -0.067285 0.040387 -0.029604 -0.122387 -0.084824 0.003703 -0.137527 -0.106122 -0.080371 -0.098600 -0.111559 0.337993 -0.075726 -0.004002 0.275634 -0.098223 -0.168386 1.000000
  • This is still a bit awkward to decode so I am going to list the same info as a list with the feature pairs that have the strongest correlation prioritiesd to the top of the list
In [10]:
df_corr_c = df[numeric_cols].corr().abs().unstack().sort_values(ascending=False)
df_corr_c[(df_corr_c < 1) & (df_corr_c >= 0.7)]# exclude below 74% correlation and 100% for matching cols in corr matrix
Out[10]:
points       fgm            0.968215
fgm          points         0.968215
fta          ftm            0.953071
ftm          fta            0.953071
defReb       totReb         0.940426
totReb       defReb         0.940426
rookie_year  age            0.900022
age          rookie_year    0.900022
points       fga            0.897463
fga          points         0.897463
fgm          fga            0.897029
fga          fgm            0.897029
tpm          tpa            0.824836
tpa          tpm            0.824836
meters       pounds         0.801738
pounds       meters         0.801738
min          fga            0.794232
fga          min            0.794232
min          points         0.744084
points       min            0.744084
min          fgm            0.728871
fgm          min            0.728871
tpa          fga            0.708842
fga          tpa            0.708842
tpm          tpp            0.705648
tpp          tpm            0.705648
ftp          ftm            0.705585
ftm          ftp            0.705585
offReb       totReb         0.704013
totReb       offReb         0.704013
dtype: float64
  • We can see 'points' (Points scored) and field goals made ('fgm') have the strongest correlation which is interesting.
  • We can see other fields of interest that have a strong correlation with 'points' are 'fga' (Field Golas Attempted), and 'min' (Minutes Played)
  • Rebounding related fields have a strong relationship (defensive, offensive and total rebounds) which makes sense
  • Weight and Height, Field Goal related, Three Pointer features (tpm, tpa), Free Throws (ftp, fta, ftm), and age / years playing professional all figure as field pairs/tuples that have strong correlation which is expected.
  • A player that scores a lot of field goals tend to score more points, more so than a higher scoring rate of free throws or three pointers
  • On the other side, if we look back at the correlation matrix: Blocks, Steals and Assists are often stand alone performance statistics that to not seem to have a strong correlation with any other feature
  • This is also true for categories that have a negative aspect to player performance such as personal founls (pFouls) and turnovers.
  • PlusMinus is a very prominent feature as it measures the positive or negative impact and contribution a player has for his team for a given game while on court, over the course of a season this is a telling statistic.
  • I would like to view a time series line plot of the avergae player weight per season
In [11]:
df_weight = df.copy()
df_weight.index = df_weight.index.to_timestamp()

df_weight = df_weight['pounds'].resample("1YE").mean()
create_ts_plot(df_weight, "Average Weight by Year", "Weight in Pounds", "darkgreen");
No description has been provided for this image
  • From this plot it appears the players are lighter overall since 2015 and trending down in weight by 5/6 lbs from 2015 - 2023
  • Lets verify this by checking the mean weight of players per the 2015 season data and again for 2023
In [12]:
avg_weight_2015 = round(df[df.index == '2015']['pounds'].mean(), 2)
print("Average weight for 2015: %s pounds" % avg_weight_2015)
Average weight for 2015: 222.31 pounds
In [13]:
avg_weight_2024 = round(df[df.index == '2024']['pounds'].mean(), 2)
print("Average weight for 2024: %s pounds" % avg_weight_2024)
Average weight for 2024: 217.21 pounds
In [14]:
print("Difference in weight from 2015 to 2024: %s lbs" % (round(avg_weight_2015 - avg_weight_2024, 2)))
Difference in weight from 2015 to 2024: 5.1 lbs
  • Heaviest Players in the NBA from 2015-2024
In [15]:
df[['player_name', 'pounds', 'meters', 'feet']].reset_index(drop=True).sort_values(by=['pounds'], ascending=False).drop_duplicates()[0:5]
Out[15]:
player_name pounds meters feet
79593 Boban Marjanovic 290.0 2.21 7"3
230879 Jusuf Nurkic 290.0 2.11 6"11
96058 Nikola Jokic 284.0 2.11 6"11
216397 Zion Williamson 284.0 1.98 6"6
247237 Joel Embiid 280.0 2.13 7"0
  • Notably Zion Williamson is quite heavy for his height (under 2 meters)
  • Lighest Players in the NBA from 2015-2024
In [16]:
df[['player_name', 'pounds', 'meters', 'feet']].reset_index(drop=True).sort_values(by=['pounds'], ascending=True).drop_duplicates()[0:5]
Out[16]:
player_name pounds meters feet
250481 Tyrell Terry 160.0 1.88 6"2
335401 Isaiah Joe 165.0 1.93 6"4
358675 Xavier Moon 165.0 1.88 6"2
184584 Kyle Guy 167.0 1.85 6"1
358408 Bones Hyland 169.0 1.88 6"2
  • View average height for players in 2017
In [17]:
print("Average height for players in 2017: %s meters" % round(df[df.index == '2017']['meters'].mean(), 3))
Average height for players in 2017: 1.998 meters
  • View average height for players in 2021
In [18]:
print("Average height for players in 2021: %s meters" % round(df[df.index == '2021']['meters'].mean(), 3))
Average height for players in 2021: 1.99 meters
  • I would like to view a time series line plot of the avergae player height per season to get a better idea of this data and trend
In [19]:
df_height = df.copy()
df_height.index = df_height.index.to_timestamp()

df_height = df_height['meters'].resample("1YE").mean()
create_ts_plot(df_height, "Average Height by Year", 
               "Height in Meters", "violet");
No description has been provided for this image
  • As opposed to the trend in average player weight (trending down), the average player height has not varied much from 2015 - 2024.
  • The average height staying at 1.99 to 2 meters (6"6 in feet and inches).
  • Next, lets get an idea of the distribution of total minutes played by NBA players from 2015-2024
  • I would like to know the if it is a normal distribution and are there any outliers
In [20]:
# create histogram visual
mins_played_summary = summary_top_players('player_name', 'min', results = 800)['sum']
h = mins_played_summary.hist(backend='plotly', labels=dict(index='Player Count', value='Minutes Played', 
                                                           variable='Total Minutes Played'))
h.layout.yaxis.title.text = 'Player Count'
h
  • Not a bell shape or the expected normal distribution, we can confirm this with the Shapiro-Wilk test where we can reject the null hypothesis when p value is less than 0.05 and say the data does not come from a normal distribution
In [21]:
test_normality = shapiro(mins_played_summary)

print(test_normality)
ShapiroResult(statistic=0.8014081288597795, pvalue=3.564594976648819e-30)
In [22]:
is_not_normal_dist = (test_normality.pvalue < 0.05) and (test_normality.statistic > 0.8)
print("Can we reject the null hypothesis and say this is not a normal distribution (bell shaped): %s" % is_not_normal_dist)
Can we reject the null hypothesis and say this is not a normal distribution (bell shaped): True
  • We can see most players fall into bins where they have played less than 5k minutes in this time (the majority have less than 500 mins playing time at a count of 340 players)
  • However, the distribution is right skewed distribution with a gap before a group of players that have between 7.5k and 15k approx. playing minutes in that time frame of 10 years, this may be due to playoff runs of teams that have smaller squads or players that seem to avoid injury more often than not.
  • Lets view the players in this group, Note: It will be interesting if they end up in the group of players with the best team impact.
In [23]:
df_mins = summary_top_players('player_name', 'min', sort_by = 'sum', results = 20)
df_mins[df_mins['sum'] > 7500]
df_mins
Out[23]:
mean sum
player_name
Taurean Prince 24.39 19028.0
Josh Hart 33.61 15258.0
Norman Powell 27.07 15105.0
Domantas Sabonis 34.75 15012.0
Pascal Siakam 34.28 14810.0
Nikola Vucevic 33.15 13858.0
Alex Caruso 24.87 12981.0
Kyrie Irving 36.13 12718.0
Kevin Durant 35.98 12450.0
Jarrett Allen 31.46 12334.0
Malik Beasley 26.21 11796.0
Tim Hardaway Jr. 26.84 11060.0
Mason Plumlee 20.11 10920.0
Terry Rozier 32.66 10844.0
Jerami Grant 33.70 10448.0
Kelly Oubre Jr. 31.12 10268.0
Russell Westbrook 25.76 10048.0
Kyle Lowry 26.95 9862.0
Eric Gordon 26.01 9260.0
Georges Niang 19.93 9090.0
  • Lets view the top players in different stat groupings by average (mean is default sort order) or total (sum when specified in sort_by parameter) where appropriate.
In [24]:
summary_top_players('player_name', 'points', sort_by = 'sum', results=5)
Out[24]:
mean sum
player_name
Kevin Durant 27.01 35170.0
Russell Westbrook 21.16 33398.0
Kyrie Irving 24.24 29574.0
Andre Drummond 12.56 27846.0
Nikola Vucevic 18.16 27172.0
In [25]:
summary_top_players('player_name', 'points', results=5)
Out[25]:
mean sum
player_name
Luka Doncic 28.23 13974.0
Kevin Durant 27.01 35170.0
Stephen Curry 26.92 19465.0
Giannis Antetokounmpo 26.45 20475.0
Joel Embiid 26.41 14181.0
In [26]:
summary_top_players('player_name', 'assists', sort_by = 'sum', results=5)
Out[26]:
mean sum
player_name
Russell Westbrook 8.58 13532.0
Kyle Lowry 6.23 9276.0
Eric Bledsoe 4.98 7404.0
James Harden 8.71 6952.0
Kyrie Irving 5.46 6658.0
In [27]:
summary_top_players('player_name', 'assists', results=5)
Out[27]:
mean sum
player_name
Trae Young 9.38 4757.0
James Harden 8.71 6952.0
John Wall 8.64 6204.0
Russell Westbrook 8.58 13532.0
Tyrese Haliburton 8.57 2802.0
In [28]:
summary_top_players('player_name', 'fgp', sort_by = 'sum', results=5)
Out[28]:
mean sum
player_name
JaVale McGee 52.96 146170.4
Mason Plumlee 57.75 134266.2
Andre Drummond 53.20 117955.2
Taurean Prince 40.98 100804.0
Doug McDermott 43.51 93841.5
In [29]:
summary_top_players('player_name', 'ftp', sort_by = 'sum', results=5)# free throw percentage
Out[29]:
mean sum
player_name
Kevin Durant 85.17 110896.6
Norman Powell 50.71 110602.5
Gordon Hayward 66.28 106380.6
Russell Westbrook 66.24 104525.6
Mason Plumlee 44.33 103078.5
In [30]:
summary_top_players('player_name', 'tpp', sort_by = 'sum', results=6)# three-point percentage
Out[30]:
mean sum
player_name
Taurean Prince 34.07 83812.8
Doug McDermott 34.46 74325.0
Norman Powell 32.36 70580.7
Evan Fournier 35.05 62773.5
Alec Burks 32.77 61352.7
Gordon Hayward 34.49 55351.2
In [31]:
summary_top_players('player_name', 'offReb', sort_by = 'sum', results=5)# offensive rebounds
Out[31]:
mean sum
player_name
Andre Drummond 3.94 8742.0
Steven Adams 3.87 5160.0
Mason Plumlee 2.10 4878.0
JaVale McGee 1.39 3848.0
Domantas Sabonis 2.62 3528.0
In [32]:
summary_top_players('player_name', 'defReb', sort_by = 'sum', results=5)# defensive rebounds
Out[32]:
mean sum
player_name
Andre Drummond 8.16 18099.0
Nikola Vucevic 8.20 12262.0
Mason Plumlee 4.74 11016.0
Russell Westbrook 6.59 10396.0
Domantas Sabonis 7.63 10276.0
In [33]:
summary_top_players('player_name', 'totReb', results=5)# Total rebounds by average
Out[33]:
mean sum
player_name
Andre Drummond 12.11 26841.0
Rudy Gobert 12.00 9447.0
Giannis Antetokounmpo 10.79 8349.0
Karl-Anthony Towns 10.74 7354.0
Anthony Davis 10.74 7369.0
In [34]:
summary_top_players('player_name', 'totReb', sort_by = 'sum', results=5)# Total rebounds by total/sum
Out[34]:
mean sum
player_name
Andre Drummond 12.11 26841.0
Mason Plumlee 6.84 15894.0
Nikola Vucevic 10.44 15614.0
Domantas Sabonis 10.26 13804.0
DeAndre Jordan 9.18 12870.0
In [35]:
summary_top_players('player_name', 'blocks', sort_by = 'sum', results=5)# blocked shots
Out[35]:
mean sum
player_name
JaVale McGee 0.96 2648.0
Andre Drummond 1.12 2478.0
Mason Plumlee 0.80 1866.0
Kevin Durant 1.26 1640.0
Rudy Gobert 2.02 1586.0
In [36]:
summary_top_players('player_name', 'blocks', results=6)# blocked shots by average
Out[36]:
mean sum
player_name
Selom Mawugbe 4.50 9.0
Victor Wembanyama 3.64 393.0
Walker Kessler 2.34 422.0
Chet Holmgren 2.32 255.0
Myles Turner 2.17 1499.0
Anthony Davis 2.15 1472.0
  • Notable inclusions here for average blocked shots instead of total/sum blocked shots in 10 years (Highly touted rookies Victor Wenbamyama & Chet Holmgren)
In [37]:
summary_top_players('player_name', 'steals', sort_by = 'sum', results=5)
Out[37]:
mean sum
player_name
Andre Drummond 1.29 2871.0
Russell Westbrook 1.50 2364.0
Eric Bledsoe 1.34 2001.0
Taurean Prince 0.80 1960.0
Kyle Lowry 1.28 1900.0
In [38]:
summary_top_players('player_name', 'fgm', sort_by = 'sum', results=5)
Out[38]:
mean sum
player_name
Kevin Durant 9.52 12392.0
Russell Westbrook 7.70 12152.0
Andre Drummond 5.22 11562.0
Nikola Vucevic 7.66 11452.0
Kyrie Irving 9.02 11004.0
In [39]:
summary_top_players('player_name', 'fgm', results=6)
Out[39]:
mean sum
player_name
DMitrik Trice 10.00 10.0
LeBron James 9.86 7521.0
Giannis Antetokounmpo 9.70 7505.0
Luka Doncic 9.64 4772.0
Kevin Durant 9.52 12392.0
Anthony Davis 9.11 6251.0
In [40]:
summary_top_players('player_name', 'fga', results=5)
Out[40]:
mean sum
player_name
Luka Doncic 20.58 10185.0
Donovan Mitchell 19.28 11493.0
Stephen Curry 18.98 13723.0
Damian Lillard 18.90 13588.0
LeBron James 18.84 14378.0
In [41]:
summary_top_players('player_name', 'fga', sort_by='sum', results=5)
Out[41]:
mean sum
player_name
Russell Westbrook 17.56 27712.0
Kevin Durant 18.35 23886.0
Nikola Vucevic 15.57 23294.0
Kyrie Irving 18.73 22846.0
Andre Drummond 9.81 21741.0
  • Lets look at fgm (Field Goals Made) a bit closer and sort by mean instead of sum for aggregation type, as it has strong correlation with points scored
In [42]:
df_field_goals_made = summary_top_players('player_name', 'fgm')
df_field_goals_made = df_field_goals_made[df_field_goals_made['sum'] > 5000]
df_field_goals_made
Out[42]:
mean sum
player_name
LeBron James 9.86 7521.0
Giannis Antetokounmpo 9.70 7505.0
Kevin Durant 9.52 12392.0
Anthony Davis 9.11 6251.0
Kyrie Irving 9.02 11004.0
Stephen Curry 8.95 6468.0
  • This is promising as I would expect these players to appear in the elite player group at the end of the analysis
  • Lets look at plusMinus score which is a good starting point to see players that contribute well to the team but maybe not in the top 5 of specific areas like points or rebounds.
In [43]:
summary_top_players('player_name', 'plusMinus', sort_by = 'sum')
Out[43]:
mean sum
player_name
Kevin Durant 6.47 8420.0
Stephen Curry 7.54 5453.0
Kyrie Irving 4.29 5234.0
Danny Green 3.85 5004.0
Kyle Lowry 3.26 4852.0
Steven Adams 3.48 4632.0
Draymond Green 5.81 4431.0
Pascal Siakam 2.92 4078.0
Nikola Jokic 4.84 4036.0
Jayson Tatum 5.75 3995.0
  • We can see players like Kyle Lowry, Pascal Siakam, and Draymond Green figure here and contribute well over time while their average/mean is not at the same level as Stephen Curry or Kevin Durant
  • I would like to see the countries represented in the NBA over the past 10 years
  • I want to visualise this information by the total minutes played per country
In [44]:
df_country_grp = df.loc[~((df['min'] == 0) | (df['min'].isna()))]

country_grp = df_country_grp.groupby('country')['min'].agg({'sum'}).sort_values('sum', ascending=False)
country_grp
Out[44]:
sum
country
USA 949700.0
Canada 28047.0
Australia 22018.0
France 21078.0
Germany 20187.0
Cameroon 19487.0
Lithuania 15012.0
Montenegro 13888.0
Bahamas 11729.0
Croatia 11405.0
Turkey 10133.0
Serbia 8326.0
Slovenia 7689.0
Dominican Republic 7636.0
Japan 6688.0
United Kingdom 6671.0
Saint Lucia 5888.0
Greece 5868.0
Latvia 5702.0
Austria 4481.0
Bosnia and Herzegovina 4312.0
Nigeria 4090.0
Jamaica 3611.0
Spain 3479.0
Ukraine 3317.0
New Zealand 3040.0
Sudan 2261.0
DRC 1766.0
Angola 1463.0
South Sudan 1248.0
Italy 715.0
Brazil 577.0
Republic of the Congo 234.0
Argentina 105.0
Gabon 28.0
Mali 20.0
  • The Bahamas shooting above its weight here (by way of population per Country) with the likes of Buddy Hield & Deandre Ayton
In [45]:
print("The number of countries with representation in the NBA over the last 10 years is: %s" % len(country_grp))# Count of countries represented
The number of countries with representation in the NBA over the last 10 years is: 36
  • Lets remove the USA because as expected they are well represented in NBA basketball, far more than any other country
In [46]:
# remove USA (heavily weighted towards players from USA)
hbar = country_grp[1:].plot(backend='plotly', kind='bar', labels = dict(variable='Minutes Played', 
                                                                        value='Total Minutes Played', index='Country'))
hbar.layout.yaxis.title.text = 'Country'
hbar

Data Verification Task¶

  • Check the player and games for Mali's total minutes played
In [47]:
df[(df['country'] == 'Mali') & (df['min'] > 0)][['player_name', 'min', 'game_id']]# data verification of minutes played for Mali
Out[47]:
player_name min game_id
season
2023 Cheick Diallo 10.0 12508
2023 Cheick Diallo 5.0 12514
2023 Cheick Diallo 5.0 12546
In [48]:
print("The number of unique NBA players listed from 2015 - 2024 seasons is: %s" % len(df['player_name'].unique()))# number of unique NBA players listed from 2015 - 2024
The number of unique NBA players listed from 2015 - 2024 seasons is: 2044
  • Time to try out some more time series visualisations
  • Lets view 3 east coast teams by plusMinus score over time
In [49]:
df[df['team_name'].isin({'Boston Celtics', 'Miami Heat', 'Milwaukee Bucks'})].groupby('team_name')['plusMinus'].plot(
    x=df.index.year, legend=True, figsize=(20, 10), use_index=False, title="Plus Minus By Team", 
    fontsize=fontsize, zorder=3)
Out[49]:
team_name
Boston Celtics     Axes(0.125,0.11;0.775x0.77)
Miami Heat         Axes(0.125,0.11;0.775x0.77)
Milwaukee Bucks    Axes(0.125,0.11;0.775x0.77)
Name: plusMinus, dtype: object
No description has been provided for this image
  • Now I want to compare LeBron vs Steph Curry plusMinus score over time as a visual
In [50]:
df[df['player_name'].isin({'LeBron James', 'Stephen Curry'})].groupby('player_name')['plusMinus'].plot(
    x=df.index.year, legend=True, figsize=(20, 10), use_index=False, title="Plus Minus By Season", 
    fontsize=fontsize, zorder=3)
Out[50]:
player_name
LeBron James     Axes(0.125,0.11;0.775x0.77)
Stephen Curry    Axes(0.125,0.11;0.775x0.77)
Name: plusMinus, dtype: object
No description has been provided for this image
  • Lets try something similar with visualising season high for points scored for a less noisy plot
In [51]:
df_pts_per_yr = df.copy()
df_pts_per_yr.index = df_pts_per_yr.index.to_timestamp()

df_pts_per_yr = df_pts_per_yr['points'].resample("1YE").max()
ax = create_ts_plot(df_pts_per_yr, "Points Scored by Season High", "Points Scored", "darkred")
No description has been provided for this image
  • Standout performances were on the up in 2023 with a season high 73 points (60-73pts was range for 10 years season high)
  • Lets do the same for assists
In [52]:
df_assts_per_yr = df.copy()
df_assts_per_yr.index = df_assts_per_yr.index.to_timestamp()

df_assts_per_yr = df_assts_per_yr['assists'].resample("1YE").max()
ax = create_ts_plot(df_assts_per_yr, "Assists Made by Season High", "Assists Maximum")
No description has been provided for this image
  • 25 assists in a single game was the peak in the past 10 seasons
  • The range for season high assists in a single game was [19, 25]
  • Lets see the players that achieved these numbers
In [53]:
df[['player_name', 'assists']].sort_values(by='assists', ascending=False)[0:9]
Out[53]:
player_name assists
season
2017 Rajon Rondo 25.0
2018 Russell Westbrook 24.0
2020 Russell Westbrook 24.0
2018 Russell Westbrook 24.0
2020 Russell Westbrook 24.0
2023 Tyrese Haliburton 23.0
2016 Russell Westbrook 22.0
2016 Russell Westbrook 22.0
2024 Trae Young 22.0
  • Russell Westbrook remarkably appearing 6 times for this milestone (Games from 2016 - 2020)
  • Lets take another view at points scored, instead of viewing by season high lets make a box plot to see the avergae, median, outliers
In [54]:
df_pts_high = df.loc[~((df['points'] == 0) | (df['points'].isna()))]
  
sns.set_theme(style="darkgrid")
palette = load_cmap("pastel").hex

plt.figure(figsize=(10, 5))
sns.boxplot(data=df_pts_high, x=df_pts_high.index, y='points', palette=palette, hue='season')
plt.title("Points Scored Distribution by Season", fontsize=fontsize-1)
plt.xlabel("Season")
plt.ylabel("Points Scored");
plt.ylim(-5, 75);
No description has been provided for this image
  • Again this confirms 22/23' were strong seasons for scoring after lows in 2020/21' (Note: 2020/2021 was a shortened season for the covid bubble where lakers won the championship 2020 championship followed by the Bucks in 2021)
  • Luka Doncic max for 2023 at 73 pts is the top scoring game over the past 10 years
  • Giannis also has his franchise record for the Bucks as a notable outlier in the box plot above and summary table below (64pts in 2023)
  • Devin Bookers 70 point game in 2016 is a notable outlier from 2015 all the way up to 20203/2024 when single game scoring started to see some further outliers
  • The median generally falling between 7-9 points.
  • The bulk of player point totals per game falling between 5 and 15 points
In [55]:
df[['player_name', 'points']].sort_values(by='points', ascending=False)[0:6]# Highest single game scorers in the past 10 years
Out[55]:
player_name points
season
2023 Luka Doncic 73.0
2022 Damian Lillard 71.0
2022 Donovan Mitchell 71.0
2016 Devin Booker 70.0
2023 Joel Embiid 70.0
2023 Giannis Antetokounmpo 64.0
  • Lets get the descriptive statitics for points and assists over 10 seasons
  • We can see the inter quartile range, min, max, standard deviation and averages for the whole data set which again confirms some of the previous analysis
In [56]:
round(df[['points', 'assists']].describe(), 2)
Out[56]:
points assists
count 374176.00 374176.00
mean 9.24 2.01
std 8.44 2.45
min 0.00 0.00
25% 2.00 0.00
50% 8.00 1.00
75% 14.00 3.00
max 73.00 25.00
In [57]:
df_tpm_per_yr = df.copy()
df_tpm_per_yr.index = df_tpm_per_yr.index.to_timestamp()
# df_tpm_per_yr.index = [dt.datetime.strptime(date, '%m/%d/%Y').date() for date in df_tpm_per_yr.index]

df_tpm_per_yr = df_tpm_per_yr['tpm'].resample("1YE").max()
ax = create_ts_plot(df_tpm_per_yr, "Three Pointers Made by Season High", "Count", "royalblue")
No description has been provided for this image
  • Lets take a look at the players that made the most 3-pointers in a single game
In [58]:
df[['player_name', 'tpm']].sort_values(by='tpm', ascending=False)[0:6]
Out[58]:
player_name tpm
season
2018 Klay Thompson 14.0
2019 Zach LaVine 13.0
2016 Stephen Curry 13.0
2022 Damian Lillard 13.0
2020 Damian Lillard 12.0
2015 Stephen Curry 12.0
  • The same analysis for three pointers made in a single game (season high) throws up some interesting results
  • Klay Thompsons 14 3pointers remains the peak for the last 9 seasons which means perhaps field goals and free throws made up a lot of the high scoring games mentioned earlier
  • The game specific stats are available from another end-point which may be worth adding as a next step to this analysis
  • Next, Create function to calculate productivity matrix to generate heat maps and charts for top players in the NBA over the last 10 seasons
  • This matrix needs to normalise all numeric features to the range [0, 1]
  • I am adding some filters related to minutes played, fouls and turnovers to rule out some players that may not have contributed well over a longer time period (rookies, players injured for long duration etc.) with notable drawbacks to their individual performance.
In [59]:
cols_pos = ['assists', 'blocks', 'steals', 'fgm', 'fgp', 'ftm', 'ftp', 'totReb', 'tpm', 'tpp', 'points', 'min', 'plusMinus']
cols_neg = ['pFouls', 'turnovers', 'min']


def calc_productivity_matrix(df_temp, cols, min_thresh = 0.75):
    # group by player and get mean / averages for 2015-2023 (9 years)
    df_prod = df_temp.groupby("player_name")[cols].mean()

    # min-max normalisation of data frame
    df_norm_prod = (df_prod - df_prod.min()) / (df_prod.max() - df_prod.min())

    # drop records with NA values
    df_prod_heatmap = round(df_norm_prod.dropna(), 2)
    
    # filter out players with less playing experience
    if 'min' in df_prod_heatmap.columns:
        df_prod_heatmap = df_prod_heatmap[df_prod_heatmap['min'] > min_thresh]

    # calculate row sum as a sort of productivity indicator (can be based on positive or negative contributions)
    df_prod_heatmap['player_sum'] = df_prod_heatmap.sum(axis=1)

    # sort and find top 15 players
    df_prod_heatmap = df_prod_heatmap.sort_values('player_sum', ascending=False)[0:24]
    return df_prod_heatmap[cols]
  • View matrix of players making a strongly positive impact for their teams
In [60]:
positive_prod = calc_productivity_matrix(df, cols_pos)
positive_prod
Out[60]:
assists blocks steals fgm fgp ftm ftp totReb tpm tpp points min plusMinus
player_name
James Harden 0.93 0.14 0.38 0.78 0.43 0.93 0.84 0.52 0.70 0.35 0.93 0.94 0.58
Luka Doncic 0.86 0.10 0.31 0.96 0.46 0.74 0.72 0.71 0.66 0.33 1.00 1.00 0.55
Kevin Durant 0.54 0.28 0.21 0.95 0.53 0.73 0.85 0.58 0.48 0.40 0.96 0.97 0.63
Stephen Curry 0.62 0.07 0.35 0.89 0.47 0.56 0.83 0.42 1.00 0.41 0.95 0.88 0.66
Joel Embiid 0.36 0.35 0.22 0.87 0.48 1.00 0.80 0.88 0.24 0.31 0.94 0.92 0.61
LeBron James 0.84 0.15 0.31 0.99 0.52 0.56 0.70 0.66 0.42 0.33 0.92 0.95 0.58
Giannis Antetokounmpo 0.57 0.29 0.29 0.97 0.55 0.79 0.67 0.89 0.16 0.23 0.94 0.91 0.59
Damian Lillard 0.72 0.07 0.25 0.83 0.43 0.80 0.87 0.36 0.73 0.35 0.93 0.96 0.55
Victor Wembanyama 0.40 0.81 0.29 0.81 0.47 0.49 0.77 0.86 0.49 0.32 0.79 0.82 0.50
Trae Young 1.00 0.03 0.26 0.78 0.42 0.81 0.82 0.29 0.57 0.33 0.87 0.94 0.49
Anthony Davis 0.29 0.48 0.32 0.91 0.51 0.74 0.75 0.89 0.14 0.22 0.88 0.93 0.54
Nikola Jokic 0.73 0.16 0.30 0.82 0.55 0.48 0.72 0.89 0.24 0.32 0.76 0.94 0.60
Kawhi Leonard 0.39 0.14 0.42 0.84 0.49 0.63 0.80 0.53 0.42 0.38 0.84 0.90 0.63
Jayson Tatum 0.40 0.15 0.28 0.80 0.45 0.58 0.75 0.61 0.55 0.36 0.82 0.99 0.62
Kyrie Irving 0.58 0.11 0.31 0.90 0.48 0.45 0.78 0.34 0.57 0.38 0.86 0.98 0.59
Paul George 0.43 0.09 0.42 0.77 0.43 0.57 0.76 0.53 0.65 0.38 0.81 0.90 0.58
Karl-Anthony Towns 0.33 0.27 0.19 0.81 0.52 0.55 0.78 0.89 0.37 0.37 0.79 0.89 0.53
Donovan Mitchell 0.49 0.07 0.34 0.86 0.44 0.51 0.75 0.35 0.65 0.36 0.86 0.94 0.58
Anthony Edwards 0.44 0.14 0.33 0.83 0.44 0.48 0.72 0.43 0.61 0.35 0.82 0.97 0.53
Shai Gilgeous-Alexander 0.52 0.18 0.35 0.80 0.49 0.69 0.77 0.39 0.28 0.34 0.80 0.93 0.54
LaMelo Ball 0.76 0.07 0.37 0.72 0.41 0.38 0.70 0.49 0.64 0.34 0.72 0.88 0.47
Jimmy Butler 0.54 0.10 0.42 0.67 0.47 0.79 0.79 0.47 0.18 0.26 0.73 0.91 0.56
Ja Morant 0.79 0.07 0.26 0.81 0.46 0.61 0.71 0.40 0.29 0.29 0.79 0.84 0.56
Devin Booker 0.52 0.06 0.21 0.82 0.44 0.63 0.78 0.32 0.45 0.33 0.83 0.96 0.51
  • View matrix of players that have a bias/weight against the positive contributions they make (fouls and turnovers can be costly especially on a regular basis)
In [61]:
negative_prod = calc_productivity_matrix(df, cols_neg)
negative_prod
Out[61]:
pFouls turnovers min
player_name
James Harden 0.42 1.00 0.94
Luka Doncic 0.38 0.93 1.00
Trae Young 0.30 0.98 0.94
Joel Embiid 0.50 0.79 0.92
Cade Cunningham 0.47 0.84 0.89
LaMelo Ball 0.52 0.78 0.88
Giannis Antetokounmpo 0.50 0.76 0.91
Devin Booker 0.48 0.68 0.96
Julius Randle 0.50 0.66 0.93
LeBron James 0.29 0.85 0.95
Nikola Jokic 0.46 0.68 0.94
Karl-Anthony Towns 0.56 0.62 0.89
Domantas Sabonis 0.52 0.59 0.94
Victor Wembanyama 0.36 0.85 0.82
Kevin Durant 0.33 0.72 0.97
Paul George 0.45 0.67 0.90
De'Aaron Fox 0.43 0.63 0.94
Donovan Mitchell 0.40 0.64 0.94
Anthony Edwards 0.35 0.66 0.97
Paolo Banchero 0.35 0.69 0.92
Stephen Curry 0.35 0.72 0.88
Jaden Ivey 0.47 0.67 0.80
Draymond Green 0.51 0.63 0.79
Damian Lillard 0.30 0.67 0.96
  • View heatmap with annotations of positive player productivity
  • Dark squares map to low scores and pale squares map to higher ranking annotation or score for the given player/feature
In [62]:
sns.heatmap(positive_prod, annot=True)
Out[62]:
<Axes: ylabel='player_name'>
No description has been provided for this image
  • View heatmap with annotations of negative player productivity
In [63]:
sns.heatmap(negative_prod, annot=True)
Out[63]:
<Axes: ylabel='player_name'>
No description has been provided for this image
  • We can see James Harden has a very high turnover rate along with Luka Doncic and Trae Young
  • Next, lets use the players appearing high up in the list for turnovers/personal fouls as another layer of selection criteria that we can use to exclude players from the top team players in the NBA over the past 10 years
In [64]:
players_to_exclude = positive_prod.index.intersection(negative_prod.index)
players_to_exclude
Out[64]:
Index(['James Harden', 'Luka Doncic', 'Kevin Durant', 'Stephen Curry',
       'Joel Embiid', 'LeBron James', 'Giannis Antetokounmpo',
       'Damian Lillard', 'Victor Wembanyama', 'Trae Young', 'Nikola Jokic',
       'Paul George', 'Karl-Anthony Towns', 'Donovan Mitchell',
       'Anthony Edwards', 'LaMelo Ball', 'Devin Booker'],
      dtype='object', name='player_name')
  • Lets view the heat map table after excluding players with some drawbacks to their general play/performance (fouls & turnovers, not enough playing time etc.)
  • James Harden, Luka Doncic, Kevin Durant, Joel Embiid, LeBron James, Giannis Antetokounmpo, Trae Young, Nikola Jokic, and Paul George all lose out here and are all household names in the NBA
In [65]:
df_team_pos_players = positive_prod[~positive_prod.index.isin(players_to_exclude)]
df_team_pos_players
Out[65]:
assists blocks steals fgm fgp ftm ftp totReb tpm tpp points min plusMinus
player_name
Anthony Davis 0.29 0.48 0.32 0.91 0.51 0.74 0.75 0.89 0.14 0.22 0.88 0.93 0.54
Kawhi Leonard 0.39 0.14 0.42 0.84 0.49 0.63 0.80 0.53 0.42 0.38 0.84 0.90 0.63
Jayson Tatum 0.40 0.15 0.28 0.80 0.45 0.58 0.75 0.61 0.55 0.36 0.82 0.99 0.62
Kyrie Irving 0.58 0.11 0.31 0.90 0.48 0.45 0.78 0.34 0.57 0.38 0.86 0.98 0.59
Shai Gilgeous-Alexander 0.52 0.18 0.35 0.80 0.49 0.69 0.77 0.39 0.28 0.34 0.80 0.93 0.54
Jimmy Butler 0.54 0.10 0.42 0.67 0.47 0.79 0.79 0.47 0.18 0.26 0.73 0.91 0.56
Ja Morant 0.79 0.07 0.26 0.81 0.46 0.61 0.71 0.40 0.29 0.29 0.79 0.84 0.56
  • This is the final elite group, I would like to display further information about these players
In [66]:
df_elite_players = df[df['player_name'].isin(df_team_pos_players.index)][['player_name', 'team_name', 'active', 'years_pro', 'affiliation', 'college', 'country', 'age', 'rookie_year']].reset_index(drop=True).drop_duplicates().sort_values('player_name')
df_elite_players.reset_index().sort_values(by=['player_name', 'index'], ascending=[True, False])
Out[66]:
index player_name team_name active years_pro affiliation college country age rookie_year
1 1919 Anthony Davis Los Angeles Lakers True 9.0 Kentucky/USA Kentucky USA 32.0 2012.0
0 0 Anthony Davis New Orleans Pelicans True 9.0 Kentucky/USA Kentucky USA 32.0 2012.0
2 2016 Ja Morant Memphis Grizzlies NaN NaN NaN NaN NaN NaN NaN
3 909 Jayson Tatum Boston Celtics True 4.0 Duke/USA Duke USA 27.0 2017.0
4 2229 Jimmy Butler Miami Heat True 10.0 Marquette/USA Marquette USA 35.0 2011.0
7 1840 Jimmy Butler Philadelphia 76ers True 10.0 Marquette/USA Marquette USA 35.0 2011.0
6 1014 Jimmy Butler Minnesota Timberwolves True 10.0 Marquette/USA Marquette USA 35.0 2011.0
5 230 Jimmy Butler Chicago Bulls True 10.0 Marquette/USA Marquette USA 35.0 2011.0
9 2146 Kawhi Leonard LA Clippers True 10.0 San Diego State/USA San Diego State USA 33.0 2011.0
10 1320 Kawhi Leonard Toronto Raptors True 10.0 San Diego State/USA San Diego State USA 33.0 2011.0
8 319 Kawhi Leonard San Antonio Spurs True 10.0 San Diego State/USA San Diego State USA 33.0 2011.0
12 3680 Kyrie Irving Dallas Mavericks True 10.0 Duke/Australia Duke Australia 33.0 2011.0
11 2094 Kyrie Irving Brooklyn Nets True 10.0 Duke/Australia Duke Australia 33.0 2011.0
13 940 Kyrie Irving Boston Celtics True 10.0 Duke/Australia Duke Australia 33.0 2011.0
14 80 Kyrie Irving Cleveland Cavaliers True 10.0 Duke/Australia Duke Australia 33.0 2011.0
15 2324 Shai Gilgeous-Alexander Oklahoma City Thunder NaN NaN NaN NaN NaN NaN NaN
16 1741 Shai Gilgeous-Alexander LA Clippers NaN NaN NaN NaN NaN NaN NaN
  • We can see all players are active, the teams they have represented (most recent record is where index value is highest), country and college info, age and when they turned professional.
  • Jimmy Butler is the oldest and most experienced player in this group followed by Kawhi Leonard.
  • We can see Jayson Tatum & Ja Morant are loyal players as they have stayed with the team that drafted them to this point.
  • Shai Gilgeous-Alexander is the only non-american on the list, a Canadian (Kyrie has dual citizenship with Austrailia).
  • Note missing data for younger players and rookies.
  • This is a bit messy for players that have represented more than one team in the 10 year timeframe so lets get the most recent team information for each of these players only (to reduce the noise / record count).
In [67]:
df_elite_players.sort_index(axis=0, ascending=False).groupby('player_name').first()
Out[67]:
team_name active years_pro affiliation college country age rookie_year
player_name
Anthony Davis Los Angeles Lakers True 9.0 Kentucky/USA Kentucky USA 32.0 2012.0
Ja Morant Memphis Grizzlies None NaN None None None NaN NaN
Jayson Tatum Boston Celtics True 4.0 Duke/USA Duke USA 27.0 2017.0
Jimmy Butler Miami Heat True 10.0 Marquette/USA Marquette USA 35.0 2011.0
Kawhi Leonard LA Clippers True 10.0 San Diego State/USA San Diego State USA 33.0 2011.0
Kyrie Irving Dallas Mavericks True 10.0 Duke/Australia Duke Australia 33.0 2011.0
Shai Gilgeous-Alexander Oklahoma City Thunder None NaN None None None NaN NaN
  • Now we see one record for each player and only for the team they currently represent
  • Ja Morant, Jayson Tatum & Shai Gilgeous-Alexander have time on their side to improve their respective rankings as the other players are professionals for 9 years or longer
  • Lets view the productivity table as a heatmap to help identify where each player excels
  • The darker the cell background, the better or the higher the player ranking for the given feature used for analysis
In [68]:
df_team_pos_players.style.background_gradient(cmap='Blues')
Out[68]:
  assists blocks steals fgm fgp ftm ftp totReb tpm tpp points min plusMinus
player_name                          
Anthony Davis 0.290000 0.480000 0.320000 0.910000 0.510000 0.740000 0.750000 0.890000 0.140000 0.220000 0.880000 0.930000 0.540000
Kawhi Leonard 0.390000 0.140000 0.420000 0.840000 0.490000 0.630000 0.800000 0.530000 0.420000 0.380000 0.840000 0.900000 0.630000
Jayson Tatum 0.400000 0.150000 0.280000 0.800000 0.450000 0.580000 0.750000 0.610000 0.550000 0.360000 0.820000 0.990000 0.620000
Kyrie Irving 0.580000 0.110000 0.310000 0.900000 0.480000 0.450000 0.780000 0.340000 0.570000 0.380000 0.860000 0.980000 0.590000
Shai Gilgeous-Alexander 0.520000 0.180000 0.350000 0.800000 0.490000 0.690000 0.770000 0.390000 0.280000 0.340000 0.800000 0.930000 0.540000
Jimmy Butler 0.540000 0.100000 0.420000 0.670000 0.470000 0.790000 0.790000 0.470000 0.180000 0.260000 0.730000 0.910000 0.560000
Ja Morant 0.790000 0.070000 0.260000 0.810000 0.460000 0.610000 0.710000 0.400000 0.290000 0.290000 0.790000 0.840000 0.560000
  • We can already see a pattern emerging for the order of the list, tpp (three-point percentage), points and plusMinus are basically going from a paler square or light blue to navy as we go to the top of the list with Anthony Davis.
  • Lets try this in a Seaborn line plot instead to see if there are any other obvious patterns
  • Heat Maps (Numbers/Annotations/Cell colours etc.) sometimes dont show the obvious gap in some areas
In [69]:
a4_dims= [11.7, 8.27]
fig, ax = plt.subplots(figsize=a4_dims)
df_team_pos_players = df_team_pos_players.rename(columns={'assists': 'Assists',
                                                          'blocks': 'Blocks',
                                                          'steals': 'Steals',
                                                          'fgm' : 'Field Goals Made',
                                                          'fgp' : 'Field Goal Percentage',
                                                          'ftm' : 'Free Throws Made',
                                                          'ftp' : 'Free Throw Percentage',
                                                          'totReb' : 'Total Rebounds',
                                                          'tpm' : 'Three Pointers Made',
                                                          'tpp' : 'Three Pointer Percentage',
                                                          'points': 'Points',
                                                          'min': 'Minutes Played',
                                                          'plusMinus': 'Plus Minus Score'})
ax = sns.lineplot(ax=ax, data=df_team_pos_players.T, markers=True, dashes=False)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1,1), fontsize=19)
plt.xticks(rotation = "vertical", fontsize=19)
plt.yticks(fontsize=19)
Out[69]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 [Text(0, 0.0, '0.0'),
  Text(0, 0.2, '0.2'),
  Text(0, 0.4, '0.4'),
  Text(0, 0.6000000000000001, '0.6'),
  Text(0, 0.8, '0.8'),
  Text(0, 1.0, '1.0'),
  Text(0, 1.2000000000000002, '1.2')])
No description has been provided for this image
  • Anthony Davis is a ntoable shot blocker here (green line).
    • This is interesting because, through previous analysis, the latest number 1 draft pick has a similar trajectory to Anthony Davis because they are both extremely good shot blockers and rebounders, However, Victor Wembanyama of France has the ability to shoot well from three point range and possibly peform even better in relation to assists and blocks.
  • I want to view this theory by displaying AD against Victor in this format (Experience Vs Youth).
In [70]:
df_matrix_70 = calc_productivity_matrix(df, cols_pos, min_thresh=0.7)# lower minustes played threshold to pull in less experienced pros and rookies that are performing well
df_victor_ad = df_matrix_70[df_matrix_70.index.isin(['Victor Wembanyama', 'Anthony Davis'])]
a4_dims= [11.7, 8.27]
fig, ax = plt.subplots(figsize=a4_dims)
df_victor_ad = df_victor_ad.rename(columns={'assists': 'Assists',
                                                          'blocks': 'Blocks',
                                                          'steals': 'Steals',
                                                          'fgm' : 'Field Goals Made',
                                                          'fgp' : 'Field Goal Percentage',
                                                          'ftm' : 'Free Throws Made',
                                                          'ftp' : 'Free Throw Percentage',
                                                          'totReb' : 'Total Rebounds',
                                                          'tpm' : 'Three Pointers Made',
                                                          'tpp' : 'Three Pointer Percentage',
                                                          'points': 'Points',
                                                          'min': 'Minutes Played',
                                                          'plusMinus': 'Plus Minus Score'})
ax = sns.lineplot(ax=ax, data=df_victor_ad.T, markers=True, dashes=False)
sns.move_legend(ax, "upper left", bbox_to_anchor=(1,1), fontsize=19)
plt.xticks(rotation = "vertical", fontsize=19)
plt.yticks(fontsize=19)
Out[70]:
(array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 [Text(0, 0.1, '0.1'),
  Text(0, 0.2, '0.2'),
  Text(0, 0.30000000000000004, '0.3'),
  Text(0, 0.4, '0.4'),
  Text(0, 0.5, '0.5'),
  Text(0, 0.6, '0.6'),
  Text(0, 0.7000000000000001, '0.7'),
  Text(0, 0.8, '0.8'),
  Text(0, 0.9, '0.9'),
  Text(0, 1.0, '1.0')])
No description has been provided for this image
  • Overall, we can see ADs experience shines through but very promising stats so far from the latest #1 draft pick
In [71]:
plt.pcolor(df_team_pos_players)
Out[71]:
<matplotlib.collections.PolyQuadMesh at 0x14a33b2f0>
No description has been provided for this image
  • We can see how yellow matches up with the pale colours below and navy above matches closely to black or low ranking squares below
  • The annotations and index labelling really is helpful for decoding and making sense of this visualisation though so it is important to try different libraries/packages when possible like plotly, mathplotlib, Seaborn, Bokeh etc.
  • Seaborn on this occasion is a better visual with more information packed in automatically, I like because it is easy to use.
In [72]:
sns.heatmap(df_team_pos_players, annot=True)
Out[72]:
<Axes: ylabel='player_name'>
No description has been provided for this image
  • After messing around with different visuals of this final elite group of players, I think an interactive stacked bar chart might be the best option for this data overall.
  • Lets view each players bar chart first after we set a color for each statistic type
In [73]:
stat_cols = ['Assists', 'Blocks', 'Steals', 'Field Goals Made', 'Field Goal Percentage', 'Free Throws Made', 
             'Free Throw Percentage', 'Total Rebounds', 'Three Pointers Made', 'Three Pointer Percentage', 'Points',
             'Minutes Played', 'Plus Minus Score']
colours = ["limegreen", "red", "navy", "lightblue", "darkorange", "pink", "lightgreen", "orange", "darkgreen",
           "purple", "blue", "darkred", "yellow"]

stats_colours = dict(zip(stat_cols, colours))
stats_colours
Out[73]:
{'Assists': 'limegreen',
 'Blocks': 'red',
 'Steals': 'navy',
 'Field Goals Made': 'lightblue',
 'Field Goal Percentage': 'darkorange',
 'Free Throws Made': 'pink',
 'Free Throw Percentage': 'lightgreen',
 'Total Rebounds': 'orange',
 'Three Pointers Made': 'darkgreen',
 'Three Pointer Percentage': 'purple',
 'Points': 'blue',
 'Minutes Played': 'darkred',
 'Plus Minus Score': 'yellow'}
In [74]:
fig = plt.figure(figsize=(22, 22))
for i, player in enumerate(df_team_pos_players.index):
    # create the sub plot
    axc = fig.add_subplot(5, 2, i+1)
    ax = df_team_pos_players[df_team_pos_players.index == player].plot.bar(ax=axc, 
                                                                           legend=False,
                                                                           y=stat_cols, 
                                                                           color=stats_colours, 
                                                                           fontsize=fontsize, 
                                                                           zorder=3)
    # configure axis ticks / labels
    xticklabels = ""
    ax.xaxis.set_major_formatter(ticker.FixedFormatter(xticklabels))
    ax.set_title("Player Stats Split by Category", fontsize=fontsize)
    ax.set_xlabel("Player: %s" % player, fontsize=fontsize)
    ax.set_ylabel("Percentile", fontsize=fontsize)
    ax.set_ylim([.0, 1.05])
    ax.yaxis.grid=True;
fig.tight_layout()                                               
No description has been provided for this image
  • Lets merge this visualisation into one interactive plot
In [75]:
result_bc = df_team_pos_players.plot(backend='plotly', kind='bar', labels=dict(variable='Stat Category', 
                                                                               value='Normalised Proportion of Productivity'))
result_bc.layout.xaxis.title.text = 'Player Name'
result_bc
  • This visual helps decode where the top players are particularly strong and possible areas of improvement.
  • This is strong evidence for players that are within or atleast close to the top 10 performing players in the NBA over the last 10 seasons
  • Notably the minutes played, points, plusMinus score, steals, and FTP (free throw percentage) area of bars is similar for all players in this group.
  • Three pointers, assists, blocks and rebounds are key areas of differentiation at the top end of the NBA judging by this visual.

3. Discussion¶

  • The data used is spread across 2 different endpoints with a limitation that only data from the years 2015-2024 is available. The volume of data involved is still quite substantial at more than 320,000 records.

  • The data used is from over 540 API requests and there was a lot of time trying to pre-process and merge this data into something useful via player ID and other features.

  • There was a lot of nested JSON objects that had features I needed to extract, parse and format into something useful for the analysis here.

  • There were other difficulties such as missing Country and College information etc. for rookies or lesser known players such as Victor Wembanyama so the API covers a lot and is being updated all the time but is not perfect.

  • Data verification tasks included searching online for the single game records for points, assists, and three pointers made in the past 10 years. In addition, I completed some verification of the data actually loaded into the notebook along with some other spot checks (ongoing task for data of this nature as it is being updated all the time).

  • The key insights were the top players by statistic (total and averges) for certain stats in the NBA along with the final stacked bar chart and heatmaps which help identify some of the most productive and high performing players over the last 10 years.

  • I was able to track down a group of players that represent an elite group that best contributed to their respective teams over a 10 year period (3 players Jimmy Butler, Shai Gilgeous-Alexander, & Ja Morant that have yet to win a championship ring but have featured in several playoff series and all star weekends etc.).

  • I think the next step would be to pull game data and statistics by game ID. We could then use this in combination with player ID and team ID (as some players have moved / were traded to several different teams over the years) to merge this data into the main DataFrame to get an idea of the win rate associated with the players here and the split of regular season statistics vs playoff games.

  • I would like to also see how player performance varies from one franchise to another, such as LeBron at Miami vs LeBron at the Lakers.

  • Right now it is hard to see which data is regular or post season and what players do not perform well in the playoffs vs regular season games. This would be another key insight to try visualise, perhaps more important than the average 'plusMinus' score or the method I have used to find the most impactful NBA players over the last 10 seasons.

  • I think a view of the final stacked bar chart per season would allow to see what players are trending up and what players are trending down in their performance levels. This may show that I wrongly excluded players like: James Harden, Luka Doncic, Kevin Durant, Joel Embiid, LeBron James, Giannis Antetokounmpo, Trae Young, Nikola Jokic, and Paul George from the final elite players group for the past 10 years. However, from the players excluded, Durant, LeBron, Giannis and Nikola are the players to be part of a winning championship team. To that end, Gianis and Nikola Jokic are under 30, while LeBron and Kevin Durant are 40 & 36 years of age respectively.

In [ ]: