Can We Predict If a PGA Tour Player Won a Tournament in a Given Year?

Golf has been picking up popularity, so I thought it would be an interesting subject for my project. I set out to find what separates the best golfers from the rest by exploring their statistics and seeing if I could predict which golfers would win in a given year. My original dataset was found on Kaggle, and the data was scraped from the PGA Tour website.

From this data, I performed an exploratory data analysis to examine how players are distributed across numerous aspects of the game, discover outliers, and explore how the game changed from 2010 to 2018. I also applied several supervised machine learning models to predict a golfer's earnings and wins.

To predict a golfer's wins, I used classification methods such as logistic regression and random forest classification. The best performance came from the random forest classifier.

  1. The Data

pgaTourData.csv contains 2312 rows and 18 columns. Each row records a golfer's performance for a given year.

- Player Name: Name of the golfer
- Rounds: The number of rounds a player played that year
- Fairway Percentage: The percentage of time a tee shot lands on the fairway
- Year: The year in which the statistic was collected
- Avg Distance: The average distance of the tee shot
- gir: Greens in regulation percentage; a green is hit in regulation if any part of the ball touches the putting surface while the number of strokes taken is at least two fewer than par
- Average Putts: The average number of strokes taken on the green
- Average Scrambling: The percentage of the time a player misses the green in regulation but still makes par or better on a hole
- Average Score: The average of all scores a player recorded that year
- Points: The number of FedExCup points a player earned that year
- Wins: The number of competitions a player won that year
- Top 10: The number of competitions where a player placed in the Top 10
- Average SG Putts: Strokes gained: putting measures how many strokes a player gains (or loses) on the greens
- Average SG Total: The sum of the off-the-tee, approach-the-green, around-the-green, and putting strokes gained statistics
- SG:OTT: Strokes gained: off-the-tee measures player performance off the tee on all par-4s and par-5s
- SG:APR: Strokes gained: approach-the-green measures player performance on approach shots
- SG:ARG: Strokes gained: around-the-green measures player performance on any shot within 30 yards of the edge of the green
- Money: The amount of prize money a player earned from tournaments

# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('pgaTourData.csv')

# Examining the first 5 rows
df.head()
Player Name Rounds Fairway Percentage Year Avg Distance gir Average Putts Average Scrambling Average Score Points Wins Top 10 Average SG Putts Average SG Total SG:OTT SG:APR SG:ARG Money
0 Henrik Stenson 60.0 75.19 2018 291.5 73.51 29.93 60.67 69.617 868 NaN 5.0 -0.207 1.153 0.427 0.960 -0.027 $2,680,487
1 Ryan Armour 109.0 73.58 2018 283.5 68.22 29.31 60.13 70.758 1,006 1.0 3.0 -0.058 0.337 -0.012 0.213 0.194 $2,485,203
2 Chez Reavie 93.0 72.24 2018 286.5 68.67 29.12 62.27 70.432 1,020 NaN 3.0 0.192 0.674 0.183 0.437 -0.137 $2,700,018
3 Ryan Moore 78.0 71.94 2018 289.2 68.80 29.17 64.16 70.015 795 NaN 5.0 -0.271 0.941 0.406 0.532 0.273 $1,986,608
4 Brian Stuard 103.0 71.44 2018 278.9 67.12 29.11 59.23 71.038 421 NaN 3.0 0.164 0.062 -0.227 0.099 0.026 $1,089,763

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312 entries, 0 to 2311
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         2312 non-null   object 
 1   Rounds              1678 non-null   float64
 2   Fairway Percentage  1678 non-null   float64
 3   Year                2312 non-null   int64  
 4   Avg Distance        1678 non-null   float64
 5   gir                 1678 non-null   float64
 6   Average Putts       1678 non-null   float64
 7   Average Scrambling  1678 non-null   float64
 8   Average Score       1678 non-null   float64
 9   Points              2296 non-null   object 
 10  Wins                293 non-null    float64
 11  Top 10              1458 non-null   float64
 12  Average SG Putts    1678 non-null   float64
 13  Average SG Total    1678 non-null   float64
 14  SG:OTT              1678 non-null   float64
 15  SG:APR              1678 non-null   float64
 16  SG:ARG              1678 non-null   float64
 17  Money               2300 non-null   object 
dtypes: float64(14), int64(1), object(3)
memory usage: 325.2+ KB

df.shape

(2312, 18)
  2. Data Cleaning

After looking at the dataframe, we can see the data needs some cleaning:

- For the columns Top 10 and Wins, convert the NaNs to 0s
- Change Top 10 and Wins into ints
- Drop rows with NaN values, i.e. players who do not have full statistics
- Change the column Rounds into an int
- Change Points to an int
- Remove the dollar sign ($) and commas in the column Money

# Replace NaN with 0 in Top 10
df['Top 10'].fillna(0, inplace=True)
df['Top 10'] = df['Top 10'].astype(int)

# Replace NaN with 0 in # of wins
df['Wins'].fillna(0, inplace=True)
df['Wins'] = df['Wins'].astype(int)

# Drop NaN values 
df.dropna(axis = 0, inplace=True)
df['Rounds'] = df['Rounds'].astype(int)

# Change Points to int 
df['Points'] = df['Points'].apply(lambda x: x.replace(',',''))
df['Points'] = df['Points'].astype(int)

# Remove the $ and commas in money 
df['Money'] = df['Money'].apply(lambda x: x.replace('$',''))
df['Money'] = df['Money'].apply(lambda x: x.replace(',',''))
df['Money'] = df['Money'].astype(float)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1674 entries, 0 to 1677
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         1674 non-null   object 
 1   Rounds              1674 non-null   int64  
 2   Fairway Percentage  1674 non-null   float64
 3   Year                1674 non-null   int64  
 4   Avg Distance        1674 non-null   float64
 5   gir                 1674 non-null   float64
 6   Average Putts       1674 non-null   float64
 7   Average Scrambling  1674 non-null   float64
 8   Average Score       1674 non-null   float64
 9   Points              1674 non-null   int64  
 10  Wins                1674 non-null   int64  
 11  Top 10              1674 non-null   int64  
 12  Average SG Putts    1674 non-null   float64
 13  Average SG Total    1674 non-null   float64
 14  SG:OTT              1674 non-null   float64
 15  SG:APR              1674 non-null   float64
 16  SG:ARG              1674 non-null   float64
 17  Money               1674 non-null   float64
dtypes: float64(12), int64(5), object(1)
memory usage: 248.5+ KB

df.describe()

Rounds Fairway Percentage Year Avg Distance gir Average Putts Average Scrambling Average Score Points Wins Top 10 Average SG Putts Average SG Total SG:OTT SG:APR SG:ARG Money
count 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1674.000000 1.674000e+03
mean 78.769415 61.448614 2014.002987 290.786081 65.667103 29.163542 58.120687 70.922877 631.125448 0.206691 2.337515 0.025408 0.147527 0.037019 0.065192 0.020192 1.488682e+06
std 14.241512 5.057758 2.609352 8.908379 2.743211 0.518966 3.386783 0.698738 452.741472 0.516601 2.060691 0.344145 0.695400 0.379702 0.380895 0.223493 1.410333e+06
min 45.000000 43.020000 2010.000000 266.400000 53.540000 27.510000 44.010000 68.698000 3.000000 0.000000 0.000000 -1.475000 -3.209000 -1.717000 -1.680000 -0.930000 2.465000e+04
25% 69.000000 57.955000 2012.000000 284.900000 63.832500 28.802500 55.902500 70.494250 322.000000 0.000000 1.000000 -0.187750 -0.260250 -0.190250 -0.180000 -0.123000 5.656412e+05
50% 80.000000 61.435000 2014.000000 290.500000 65.790000 29.140000 58.290000 70.904500 530.000000 0.000000 2.000000 0.040000 0.147000 0.055000 0.081000 0.022500 1.046144e+06
75% 89.000000 64.910000 2016.000000 296.375000 67.587500 29.520000 60.420000 71.343750 813.750000 0.000000 3.000000 0.258500 0.568500 0.287750 0.314500 0.175750 1.892478e+06
max 120.000000 76.880000 2018.000000 319.700000 73.520000 31.000000 69.330000 74.400000 4169.000000 5.000000 14.000000 1.130000 2.406000 1.485000 1.533000 0.660000 1.203046e+07
  3. Exploratory Data Analysis
# Looking at the distribution of each variable
# (histplot replaces the deprecated distplot)
f, ax = plt.subplots(nrows=6, ncols=3, figsize=(20, 20))
distribution = df.loc[:, df.columns != 'Player Name'].columns
for axis, column in zip(ax.flatten(), distribution):
    sns.histplot(df[column], kde=True, ax=axis)

From the distributions plotted, most of the variables are roughly normally distributed. However, we can observe that Money, Points, Wins, and Top 10s are all skewed to the right. This could be explained by the separation between the best players and the average PGA Tour player: the best players place in the Top 10 multiple times and win, which allows them to earn more from tournaments, while the average player has no wins and only a few Top 10 placings, which limits their earnings.
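
To put a number on that skew, pandas' built-in skewness measure works as a quick check (a sketch on the cleaned df from above; positive values indicate a long right tail):

# Quantifying the right skew observed in the plots above
print(df[['Money', 'Points', 'Wins', 'Top 10']].skew())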

# Looking at the number of players with Wins for each year 
win = df.groupby('Year')['Wins'].value_counts()
win = win.unstack()
win.fillna(0, inplace=True)

# Converting win into ints
win = win.astype(int)

print(win)

Wins    0   1  2  3  4  5
Year                     
2010  166  21  5  0  0  0
2011  156  25  5  0  0  0
2012  159  26  4  1  0  0
2013  152  24  3  0  0  1
2014  142  29  3  2  0  0
2015  150  29  2  1  1  0
2016  152  28  4  1  0  0
2017  156  30  0  3  1  0
2018  158  26  5  3  0  0

From this table, we can see that most players end the year without a win. It's pretty rare to find a player who has won more than once!
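
As a quick sanity check on that rarity, the share of player-seasons with multiple wins can be computed directly (a sketch using the cleaned df):

# Share of player-seasons with two or more wins
print('{:.1f}% of player-seasons have 2+ wins'.format((df['Wins'] >= 2).mean() * 100))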

# Percentage of players without a win each year
players = win.sum(axis=1)
percent_no_win = win[0] / players * 100
print(percent_no_win)
Year
2010    86.458333
2011    83.870968
2012    83.684211
2013    84.444444
2014    80.681818
2015    81.967213
2016    82.162162
2017    82.105263
2018    82.291667
dtype: float64
# Plotting percentage of players without a win each year 
fig, ax = plt.subplots()
bar_width = 0.8
opacity = 0.7 
index = np.arange(2010, 2019)

plt.bar(index, percent_no_win, bar_width, alpha = opacity)
plt.xticks(index)
plt.xlabel('Year')
plt.ylabel('%')
plt.title('Percentage of Players without a Win')

Text(0.5, 1.0, 'Percentage of Players without a Win')

From the bar chart above, we can observe that the percentage of players without a win hovers around 80% each year. There was very little variation in this percentage across the nine seasons from 2010 to 2018.

# Plotting the number of wins on a bar chart 
fig, ax = plt.subplots()
index = np.arange(2010, 2019)
bar_width = 0.2
opacity = 0.7 

def plot_bar(index, win, labels):
    plt.bar(index, win, bar_width, alpha=opacity, label=labels)

# Plotting the bars
rects = plot_bar(index, win[0], labels = '0 Wins')
rects1 = plot_bar(index + bar_width, win[1], labels = '1 Win')
rects2 = plot_bar(index + bar_width*2, win[2], labels = '2 Wins')
rects3 = plot_bar(index + bar_width*3, win[3], labels = '3 Wins')
rects4 = plot_bar(index + bar_width*4, win[4], labels = '4 Wins')
rects5 = plot_bar(index + bar_width*5, win[5], labels = '5 Wins')

plt.xticks(index + bar_width, index)
plt.xlabel('Year')
plt.ylabel('Number of Players')
plt.title('Distribution of Wins each Year')
plt.legend()

<matplotlib.legend.Legend at 0x7f6f3b0236d0>

By looking at the distribution of wins each year, we can see that it is rare for a player to win even one tournament on the PGA Tour. The majority of players do not win, and very few win more than once in a year.

top10 = df.groupby('Year')['Top 10'].value_counts()
top10 = top10.unstack()
top10.fillna(0, inplace=True)
players = top10.apply(lambda x: np.sum(x), axis=1)

no_top10 = top10[0]/players * 100
print(no_top10)
Year
2010    17.187500
2011    25.268817
2012    23.157895
2013    18.888889
2014    16.477273
2015    18.579235
2016    20.000000
2017    15.789474
2018    17.187500
dtype: float64

By looking at the percentage of players that did not place in the Top 10 each year, we can observe that only about 20% of players failed to record a Top 10 finish. In addition, the range of these yearly percentages is only about 9.48 percentage points, which tells us that this statistic does not vary much from year to year.
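
That range can be verified directly from the series above (a quick check):

# Spread of the yearly no-Top-10 percentages
print('Range: {:.2f} percentage points'.format(no_top10.max() - no_top10.min()))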

distance = df[['Year','Player Name','Avg Distance']].copy()
distance.sort_values(by='Avg Distance', inplace=True, ascending=False)
print(distance.head())
      Year     Player Name  Avg Distance
162   2018    Rory McIlroy         319.7
1481  2011     J.B. Holmes         318.4
174   2018   Trey Mullinax         318.3
732   2015  Dustin Johnson         317.7
350   2017    Rory McIlroy         316.7

Rory McIlroy is one of the longest hitters in the game, posting the highest average driving distance of 319.7 yards in 2018. He was also the longest hitter in 2017, averaging 316.7 yards.

money_ranking = df[['Year','Player Name','Money']].copy()
money_ranking.sort_values(by='Money', inplace=True, ascending=False)
print(money_ranking.head())
     Year     Player Name       Money
647  2015   Jordan Spieth  12030465.0
361  2017   Justin Thomas   9921560.0
303  2017   Jordan Spieth   9433033.0
729  2015       Jason Day   9403330.0
520  2016  Dustin Johnson   9365185.0

We can see that Jordan Spieth made the most money in a single year, earning just over 12 million dollars in 2015.

# Who made the most money each year
money_rank = money_ranking.groupby('Year')['Money'].max()
money_rank = pd.DataFrame(money_rank)

# Look up the player name matching each year's top earnings
names = []
for i in range(money_rank.shape[0]):
    temp = df.loc[df['Money'] == money_rank.iloc[i, 0], 'Player Name']
    names.append(str(temp.values[0]))

money_rank['Player Name'] = names
print(money_rank)

           Money     Player Name
Year                            
2010   4910477.0     Matt Kuchar
2011   6683214.0     Luke Donald
2012   8047952.0    Rory McIlroy
2013   8553439.0     Tiger Woods
2014   8280096.0    Rory McIlroy
2015  12030465.0   Jordan Spieth
2016   9365185.0  Dustin Johnson
2017   9921560.0   Justin Thomas
2018   8694821.0   Justin Thomas

With this table, we can examine the top earner in each year. Some of the most notable results are Jordan Spieth's 12-million-dollar season in 2015 and Justin Thomas earning the most money in both 2017 and 2018.
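
As an aside, the same table can be built without the manual loop by using groupby with idxmax (a sketch; the output should match the loop above):

# Alternative: select the highest-earning row per year directly
top_earners = df.loc[df.groupby('Year')['Money'].idxmax(),
                     ['Year', 'Player Name', 'Money']].set_index('Year')
print(top_earners)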

# Plot the correlation matrix between variables 
corr = df.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap='coolwarm')

<matplotlib.axes._subplots.AxesSubplot at 0x7f6f3d00e390>

df.corr()['Wins']
Rounds                0.103162
Fairway Percentage   -0.047949
Year                  0.039006
Avg Distance          0.206294
gir                   0.120340
Average Putts        -0.168764
Average Scrambling    0.125193
Average Score        -0.390254
Points                0.750110
Wins                  1.000000
Top 10                0.473453
Average SG Putts      0.149155
Average SG Total      0.384932
SG:OTT                0.232414
SG:APR                0.259363
SG:ARG                0.134948
Money                 0.721665
Name: Wins, dtype: float64

From the correlation matrix, we can observe that Money is highly correlated with Wins, along with the FedExCup Points. We can also observe that fairway percentage, year, and rounds are barely correlated with Wins.
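
Sorting those correlations makes the ranking easier to read (a quick sketch over the same numeric columns):

# Features ranked by their correlation with Wins
print(df.corr()['Wins'].drop('Wins').sort_values(ascending=False))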

  4. Machine Learning Model (Classification)

To predict winners, I used multiple machine learning models to explore which could accurately classify whether a player wins in a given year.

To evaluate the models, I used the Receiver Operating Characteristic Area Under the Curve (ROC AUC). The ROC AUC tells us how capable the model is of distinguishing players with a win from players without one. In addition, because the data is skewed, with 83% of players having no wins in a given year, ROC AUC is a much better metric than the accuracy of the model.
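
As a tiny illustration of what ROC AUC measures, consider a toy example (hypothetical labels and scores, not the PGA data): the score is the fraction of winner/non-winner pairs that the model ranks in the right order.

# Toy ROC AUC example (made-up values for illustration only)
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 0, 1, 1]             # mostly the "no win" class, like our data
y_scores = [0.1, 0.3, 0.4, 0.35, 0.8]  # hypothetical predicted win probabilities
print(roc_auc_score(y_true, y_scores)) # about 0.83: 5 of the 6 winner/non-winner pairs are ordered correctly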

# Importing the Machine Learning modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC  
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

Preparing the Data for Classification

We know from the tables above that the data for wins is skewed: even without machine learning, approximately 83% of player-seasons do not include a win (calculated below). Therefore, we will be utilizing ROC AUC as the metric for these models.

df['Winner'] = df['Wins'].apply(lambda x: 1 if x>0 else 0)

# New DataFrame 
ml_df = df.copy()

# Y value for machine learning is the Winner column
target = df['Winner']

# Removing the columns Player Name, Wins, and Winner from the dataframe to avoid leakage
ml_df.drop(['Player Name','Wins','Winner'], axis=1, inplace=True)
print(ml_df.head())
   Rounds  Fairway Percentage  Year  ...  SG:APR  SG:ARG      Money
0      60               75.19  2018  ...   0.960  -0.027  2680487.0
1     109               73.58  2018  ...   0.213   0.194  2485203.0
2      93               72.24  2018  ...   0.437  -0.137  2700018.0
3      78               71.94  2018  ...   0.532   0.273  1986608.0
4     103               71.44  2018  ...   0.099   0.026  1089763.0

[5 rows x 16 columns]
per_no_win = target.value_counts()[0] / (target.value_counts()[0] + target.value_counts()[1])
per_no_win = per_no_win.round(4)*100
print(str(per_no_win)+str('%'))
83.09%

# Function for the logistic regression
def log_reg(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 10)
    clf = LogisticRegression().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Logistic regression classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Logistic regression classifier on test set: {:.2f}'
         .format(clf.score(X_test, y_test)))
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    
    print(classification_report(y_test, y_pred))

    # (Optional) returning the 5 most important features via RFE
    # rfe = RFE(clf, n_features_to_select=5)
    # rfe = rfe.fit(X, y)
    # print('Feature Importance')
    # print(X.columns[rfe.ranking_ == 1].values)
    
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))

log_reg(ml_df, target)

Accuracy of Logistic regression classifier on training set: 0.90
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  345   8
1   28  38
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       353
           1       0.83      0.58      0.68        66

    accuracy                           0.91       419
   macro avg       0.88      0.78      0.81       419
weighted avg       0.91      0.91      0.91       419

ROC AUC Score: 0.78

From the logistic regression, we got an accuracy of 0.90 on the training set and an accuracy of 0.91 on the test set. This was surprisingly accurate for a first run. However, the ROC AUC score of 0.78 could be improved, so I decided to add more features as a way of possibly improving the model.

# Adding domain features
ml_d = ml_df.copy()

# Top 10 / Money may give us a better understanding of how well a player placed in the Top 10
ml_d['Top10perMoney'] = ml_d['Top 10'] / ml_d['Money']

# Avg Distance / Fairway Percentage gives a ratio of how far versus how accurately a player drives
ml_d['DistanceperFairway'] = ml_d['Avg Distance'] / ml_d['Fairway Percentage']

# Money / Rounds shows how much money a player makes per round on average
ml_d['MoneyperRound'] = ml_d['Money'] / ml_d['Rounds']

log_reg(ml_d, target)

Accuracy of Logistic regression classifier on training set: 0.91
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  342  11
1   27  39
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       353
           1       0.78      0.59      0.67        66

    accuracy                           0.91       419
   macro avg       0.85      0.78      0.81       419
weighted avg       0.90      0.91      0.90       419

ROC AUC Score: 0.78

# Adding Polynomial Features to the ml_df 
mldf2 = ml_df.copy()
poly = PolynomialFeatures(2)
poly = poly.fit(mldf2)
poly_feature = poly.transform(mldf2)
print(poly_feature.shape)

# Creating a DataFrame with the polynomial features 
poly_feature = pd.DataFrame(poly_feature, columns = poly.get_feature_names(ml_df.columns))
print(poly_feature.head())

(1674, 153)
     1  Rounds  Fairway Percentage  ...  SG:ARG^2  SG:ARG Money       Money^2
0  1.0    60.0               75.19  ...  0.000729    -72373.149  7.185011e+12
1  1.0   109.0               73.58  ...  0.037636    482129.382  6.176234e+12
2  1.0    93.0               72.24  ...  0.018769   -369902.466  7.290097e+12
3  1.0    78.0               71.94  ...  0.074529    542343.984  3.946611e+12
4  1.0   103.0               71.44  ...  0.000676     28333.838  1.187583e+12

[5 rows x 153 columns]

log_reg(poly_feature, target)

Accuracy of Logistic regression classifier on training set: 0.90
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  346   7
1   32  34
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       353
           1       0.83      0.52      0.64        66

    accuracy                           0.91       419
   macro avg       0.87      0.75      0.79       419
weighted avg       0.90      0.91      0.90       419

ROC AUC Score: 0.75

From feature engineering, there were no improvements in the ROC AUC score. In fact, as I added more features, the accuracy and the ROC AUC score decreased. This signals that another machine learning algorithm might predict winners better.
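
Before switching algorithms, it's worth noting that a single train/test split can be noisy. A cross-validated check gives a more stable comparison (a sketch; note that the roc_auc scorer uses predicted probabilities, so the values won't line up exactly with the hard-label scores above):

# Cross-validated ROC AUC for the baseline logistic regression (sketch)
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=1000), ml_df, target,
                         cv=5, scoring='roc_auc')
print('Mean ROC AUC: {:.2f} (+/- {:.2f})'.format(scores.mean(), scores.std()))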

## Random Forest Model

def random_forest(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 10)
    clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Random Forest classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Random Forest classifier on test set: {:.2f}'
         .format(clf.score(X_test, y_test)))
    
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    
    print(classification_report(y_test, y_pred))
    
    # Returning the 5 important features 
    rfe = RFE(clf, n_features_to_select=5)
    rfe = rfe.fit(X, y)
    print('Feature Importance')
    print(X.columns[rfe.ranking_ == 1].values)
    
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))

random_forest(ml_df, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  342  11
1   16  50
              precision    recall  f1-score   support

           0       0.96      0.97      0.96       353
           1       0.82      0.76      0.79        66

    accuracy                           0.94       419
   macro avg       0.89      0.86      0.87       419
weighted avg       0.93      0.94      0.93       419

Feature Importance
['Average Score' 'Points' 'Top 10' 'Average SG Total' 'Money']
ROC AUC Score: 0.86

random_forest(ml_d, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  343  10
1   16  50
              precision    recall  f1-score   support

           0       0.96      0.97      0.96       353
           1       0.83      0.76      0.79        66

    accuracy                           0.94       419
   macro avg       0.89      0.86      0.88       419
weighted avg       0.94      0.94      0.94       419

Feature Importance
['Average Score' 'Points' 'Average SG Total' 'Money' 'MoneyperRound']
ROC AUC Score: 0.86

random_forest(poly_feature, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  340  13
1   14  52
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       353
           1       0.80      0.79      0.79        66

    accuracy                           0.94       419
   macro avg       0.88      0.88      0.88       419
weighted avg       0.94      0.94      0.94       419

Feature Importance
['Year Points' 'Average Putts Points' 'Average Scrambling Top 10'
 'Average Score Points' 'Points^2']
ROC AUC Score: 0.88

The Random Forest model scored highly on ROC AUC, reaching 0.88 with the polynomial features. With this, we observed that the Random Forest model can classify players with and without a win noticeably better than the logistic regression.
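
One caveat: the ROC AUC values above are computed from hard 0/1 predictions. Scoring on the predicted probabilities instead is the more standard usage and typically yields a higher value (a sketch reusing the same split and model settings as above):

# ROC AUC from predicted probabilities rather than hard labels (sketch)
X_train, X_test, y_train, y_test = train_test_split(ml_df, target, random_state=10)
clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]  # estimated probability of a win
print('ROC AUC (probabilities): {:.2f}'.format(roc_auc_score(y_test, y_prob)))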

  5. Conclusion

It's been interesting to learn about the numerous aspects of the game that differentiate winners from the average PGA Tour player. For example, fairway percentage and greens in regulation do not seem to contribute much to a player's wins, while the strokes gained statistics all contribute fairly strongly. It was interesting to see which aspects of the game professionals should put their time into. This also gave me the idea of tracking my personal golf statistics so that I can compare them to the pros and find the areas of my game that need the most improvement.

On the machine learning side, I was able to take the data of PGA Tour players and classify whether a player would win in a given year. With the random forest classification model, I achieved a ROC AUC of 0.88 and an accuracy of 0.94 on the test set. This was a significant improvement over the logistic regression's ROC AUC of 0.78 and accuracy of 0.91. Because the data is skewed, with approximately 83% of players not earning a win, the primary measure of the models was ROC AUC. I was able to improve the ROC AUC from 0.78 to 0.88 by trying different models and adding domain and polynomial features.

The End!!