Can We Predict If a PGA Tour Player Won a Tournament in a Given Year?

Golf is growing in popularity, so I thought it would be interesting to focus my project here. I set out to find what sets the best golfers apart from the rest. I decided to explore their statistics and see if I could predict which golfers would win in a given year. I found the dataset on Kaggle; the data was scraped from the PGA Tour website.

From this data, I performed an exploratory data analysis to explore the distribution of players on numerous aspects of the game, discover outliers, and further explore how the game has changed from 2010 to 2018. I also utilized numerous supervised machine learning models to predict a golfer's earnings and wins.

To predict whether a golfer wins, I used classification methods such as logistic regression and Random Forest classification. The best performance came from the Random Forest classifier.

The Data

The raw pgaTourData.csv contains 2312 rows and 18 columns; 1674 rows remain after cleaning. Each row describes a golfer's performance in a given year.

# Player Name: Name of the golfer

# Rounds: The number of rounds that a player played

# Fairway Percentage: The percentage of time a tee shot lands on the fairway

# Year: The year in which the statistic was collected

# Avg Distance: The average distance of the tee-shot

# gir: Green in Regulation percentage. A green in regulation is met if any part of the ball is touching the putting surface while the number of strokes taken is at least two fewer than par

# Average Putts: The average number of strokes taken on the green

# Average Scrambling: Scrambling is when a player misses the green in regulation, but still makes par or better on a hole

# Average Score: Average Score is the average of all the scores a player has played in that year

# Points: The number of FedExCup points a player earned in that year

# Wins: The number of competitions a player has won in that year

# Top 10: The number of competitions where a player has placed in the Top 10

# Average SG Putts: Strokes gained: putting measures how many strokes a player gains (or loses) on the greens

# Average SG Total: The Off-the-tee + approach-the-green + around-the-green + putting statistics combined

# SG:OTT: Strokes gained: off-the-tee measures player performance off the tee on all par-4s and par-5s

# SG:APR: Strokes gained: approach-the-green measures player performance on approach shots

# SG:ARG: Strokes gained: around-the-green measures player performance on any shot within 30 yards of the edge of the green

# Money: The amount of prize money a player has earned from tournaments

# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('pgaTourData.csv')

# Examining the first 5 rows
df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312 entries, 0 to 2311
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         2312 non-null   object 
 1   Rounds              1678 non-null   float64
 2   Fairway Percentage  1678 non-null   float64
 3   Year                2312 non-null   int64  
 4   Avg Distance        1678 non-null   float64
 5   gir                 1678 non-null   float64
 6   Average Putts       1678 non-null   float64
 7   Average Scrambling  1678 non-null   float64
 8   Average Score       1678 non-null   float64
 9   Points              2296 non-null   object 
 10  Wins                293 non-null    float64
 11  Top 10              1458 non-null   float64
 12  Average SG Putts    1678 non-null   float64
 13  Average SG Total    1678 non-null   float64
 14  SG:OTT              1678 non-null   float64
 15  SG:APR              1678 non-null   float64
 16  SG:ARG              1678 non-null   float64
 17  Money               2300 non-null   object 
dtypes: float64(14), int64(1), object(3)
memory usage: 325.2+ KB

df.shape

(2312, 18)

Data Cleaning

After looking at the dataframe, the data needs to be cleaned:

-For the columns Top 10 and Wins, convert the NaNs to 0s

-Change Top 10 and Wins into an int

-Drop NaN values for players who do not have the full statistics

-Change the column Rounds into int

-Remove the commas in the column Points and change it to int

-Remove the dollar sign ($) and commas in the column Money

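# Replace NaN with 0 in # of Top 10 finishes
# (assigning the result back avoids relying on inplace changes to a column)
df['Top 10'] = df['Top 10'].fillna(0).astype(int)

# Replace NaN with 0 in # of wins
df['Wins'] = df['Wins'].fillna(0).astype(int)

# Drop rows for players who are missing any of the remaining statistics
df.dropna(axis = 0, inplace=True)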

df['Rounds'] = df['Rounds'].astype(int)

# Change Points to int 
df['Points'] = df['Points'].apply(lambda x: x.replace(',',''))
df['Points'] = df['Points'].astype(int)

# Remove the $ and commas in money 
df['Money'] = df['Money'].apply(lambda x: x.replace('$',''))
df['Money'] = df['Money'].apply(lambda x: x.replace(',',''))
df['Money'] = df['Money'].astype(float)
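
The string cleanup above could also be written with vectorized string methods. The following is only a sketch under assumptions: the helper name `clean_numeric_strings` and the three-row `sample` frame are made up for illustration, with values that mimic the raw Money and Points formats.

```python
import pandas as pd

def clean_numeric_strings(series):
    """Strip '$' and ',' from a string column and convert it to numbers."""
    cleaned = series.astype(str).str.replace(r'[$,]', '', regex=True)
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical stand-in frame; the values mimic the raw Money and Points formats
sample = pd.DataFrame({
    'Money': ['$2,680,487', '$2,485,203', '$1,089,763'],
    'Points': ['868', '1,006', '421'],
})
sample['Money'] = clean_numeric_strings(sample['Money'])
sample['Points'] = clean_numeric_strings(sample['Points'])
print(sample.dtypes)
```

One helper covers both columns, and unexpected values show up as NaN rather than stopping the notebook with an exception.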

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1674 entries, 0 to 1677
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         1674 non-null   object 
 1   Rounds              1674 non-null   int64  
 2   Fairway Percentage  1674 non-null   float64
 3   Year                1674 non-null   int64  
 4   Avg Distance        1674 non-null   float64
 5   gir                 1674 non-null   float64
 6   Average Putts       1674 non-null   float64
 7   Average Scrambling  1674 non-null   float64
 8   Average Score       1674 non-null   float64
 9   Points              1674 non-null   int64  
 10  Wins                1674 non-null   int64  
 11  Top 10              1674 non-null   int64  
 12  Average SG Putts    1674 non-null   float64
 13  Average SG Total    1674 non-null   float64
 14  SG:OTT              1674 non-null   float64
 15  SG:APR              1674 non-null   float64
 16  SG:ARG              1674 non-null   float64
 17  Money               1674 non-null   float64
dtypes: float64(12), int64(5), object(1)
memory usage: 248.5+ KB

df.describe()

Exploratory Data Analysis

# Looking at the distribution of data
f, ax = plt.subplots(nrows = 6, ncols = 3, figsize=(20,20))
distribution = df.loc[:,df.columns!='Player Name'].columns
rows = 0
cols = 0
for i, column in enumerate(distribution):
    p = sns.distplot(df[column], ax=ax[rows][cols])
    cols += 1
    if cols == 3:
        cols = 0
        rows += 1

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

From the distributions plotted, most of the variables are approximately normally distributed. However, Money, Points, Wins, and Top 10 are all skewed to the right. This can be explained by the gap between the best players and the average PGA Tour player: the best players have multiple Top 10 finishes and wins, which let them earn far more from tournaments, while the average player has no wins and only a few Top 10 finishes.
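
The visual impression of right skew can be backed up with a number. As a minimal sketch, pandas' `skew()` returns a positive value for a long right tail and a value near zero for a symmetric sample; the two short series below are made-up stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical win counts for eight players: most have none, one has five
wins = pd.Series([0, 0, 0, 0, 1, 1, 2, 5])
# Hypothetical symmetric values standing in for a bell-shaped statistic
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

print(wins.skew())       # positive: long right tail
print(symmetric.skew())  # approximately zero: symmetric
```

On the cleaned data, the same idea would be `df.select_dtypes('number').skew()`, which gives one skewness value per numeric column.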

# Looking at the number of players with Wins for each year 
win = df.groupby('Year')['Wins'].value_counts()
win = win.unstack()
win.fillna(0, inplace=True)

# Converting win into ints
win = win.astype(int)

print(win)

Wins    0   1  2  3  4  5
Year                     
2010  166  21  5  0  0  0
2011  156  25  5  0  0  0
2012  159  26  4  1  0  0
2013  152  24  3  0  0  1
2014  142  29  3  2  0  0
2015  150  29  2  1  1  0
2016  152  28  4  1  0  0
2017  156  30  0  3  1  0
2018  158  26  5  3  0  0

From this table, we can see that most players end the year without a win. It's pretty rare to find a player that has won more than once!

players = win.apply(lambda x: np.sum(x), axis=1)
percent_no_win = win[0]/players
percent_no_win = percent_no_win*100
print(percent_no_win)

Year
2010    86.458333
2011    83.870968
2012    83.684211
2013    84.444444
2014    80.681818
2015    81.967213
2016    82.162162
2017    82.105263
2018    82.291667
dtype: float64

# Plotting percentage of players without a win each year 
fig, ax = plt.subplots()
bar_width = 0.8
opacity = 0.7 
index = np.arange(2010, 2019)

plt.bar(index, percent_no_win, bar_width, alpha = opacity)
plt.xticks(index)
plt.xlabel('Year')
plt.ylabel('%')
plt.title('Percentage of Players without a Win')

Text(0.5, 1.0, 'Percentage of Players without a Win')

From the bar chart above, we can observe that between roughly 81% and 86% of players finish each season without a win. The percentage varied very little across the nine seasons from 2010 to 2018.
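
The same percentages can be computed without building the win-count table first, because the mean of a boolean column is the fraction of True values. The sketch below rebuilds only the 2010 and 2014 seasons from the counts printed above; the `seasons` frame is an illustrative stand-in for `df`.

```python
import pandas as pd

# Rebuild the 2010 and 2014 seasons from the win-count table above
records = (
    [(2010, 0)] * 166 + [(2010, 1)] * 21 + [(2010, 2)] * 5 +
    [(2014, 0)] * 142 + [(2014, 1)] * 29 + [(2014, 2)] * 3 + [(2014, 3)] * 2
)
seasons = pd.DataFrame(records, columns=['Year', 'Wins'])

# Mean of a boolean column is the fraction of True values
pct_no_win = seasons['Wins'].eq(0).groupby(seasons['Year']).mean() * 100
print(pct_no_win.round(2))
```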

# Plotting the number of wins on a bar chart 
fig, ax = plt.subplots()
index = np.arange(2010, 2019)
bar_width = 0.15  # six bars per year have to fit within one unit on the x-axis
opacity = 0.7 

def plot_bar(index, win, labels):
    # Return the bar container so the caller keeps a handle to the bars
    return plt.bar(index, win, bar_width, alpha=opacity, label=labels)

# Plotting the bars
rects = plot_bar(index, win[0], labels = '0 Wins')
rects1 = plot_bar(index + bar_width, win[1], labels = '1 Win')
rects2 = plot_bar(index + bar_width*2, win[2], labels = '2 Wins')
rects3 = plot_bar(index + bar_width*3, win[3], labels = '3 Wins')
rects4 = plot_bar(index + bar_width*4, win[4], labels = '4 Wins')
rects5 = plot_bar(index + bar_width*5, win[5], labels = '5 Wins')

# Center each year label under its group of six bars
plt.xticks(index + bar_width*2.5, index)
plt.xlabel('Year')
plt.ylabel('Number of Players')
plt.title('Distribution of Wins each Year')
plt.legend()

<matplotlib.legend.Legend at 0x7f6f3b0236d0>

By looking at the distribution of Wins each year, we can see that winning a tournament on the PGA Tour is rare. The majority of players do not win, and very few players win more than once a year.

top10 = df.groupby('Year')['Top 10'].value_counts()
top10 = top10.unstack()
top10.fillna(0, inplace=True)
players = top10.apply(lambda x: np.sum(x), axis=1)

no_top10 = top10[0]/players * 100
print(no_top10)

Year
2010    17.187500
2011    25.268817
2012    23.157895
2013    18.888889
2014    16.477273
2015    18.579235
2016    20.000000
2017    15.789474
2018    17.187500
dtype: float64

By looking at the percentage of players that did not place in the Top 10 by year, we can observe that only about 20% of players did not place in the Top 10. In addition, the range of this percentage is only about 9.5 percentage points, from 15.8% in 2017 to 25.3% in 2011. This tells us that the statistic does not vary much from year to year.
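
The spread and the typical level can be recomputed directly from the printed percentages. This is a small worked check in plain Python; the dictionary simply copies the output above.

```python
# Percentages copied from the output above (players with no Top 10, by year)
no_top10_pct = {
    2010: 17.187500, 2011: 25.268817, 2012: 23.157895,
    2013: 18.888889, 2014: 16.477273, 2015: 18.579235,
    2016: 20.000000, 2017: 15.789474, 2018: 17.187500,
}

highest_year = max(no_top10_pct, key=no_top10_pct.get)
lowest_year = min(no_top10_pct, key=no_top10_pct.get)
# Range = largest yearly percentage minus smallest yearly percentage
spread = no_top10_pct[highest_year] - no_top10_pct[lowest_year]
mean_pct = sum(no_top10_pct.values()) / len(no_top10_pct)

print(highest_year, lowest_year, round(spread, 2), round(mean_pct, 2))
```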

distance = df[['Year','Player Name','Avg Distance']].copy()
distance.sort_values(by='Avg Distance', inplace=True, ascending=False)
print(distance.head())

      Year     Player Name  Avg Distance
162   2018    Rory McIlroy         319.7
1481  2011     J.B. Holmes         318.4
174   2018   Trey Mullinax         318.3
732   2015  Dustin Johnson         317.7
350   2017    Rory McIlroy         316.7

Rory McIlroy is one of the longest hitters in the game, posting the highest average driving distance in the dataset at 319.7 yards in 2018. He was also the longest hitter in 2017, with an average of 316.7 yards.

money_ranking = df[['Year','Player Name','Money']].copy()
money_ranking.sort_values(by='Money', inplace=True, ascending=False)
print(money_ranking.head())

     Year     Player Name       Money
647  2015   Jordan Spieth  12030465.0
361  2017   Justin Thomas   9921560.0
303  2017   Jordan Spieth   9433033.0
729  2015       Jason Day   9403330.0
520  2016  Dustin Johnson   9365185.0

We can see that Jordan Spieth made the most money in a single year, earning just over 12 million dollars in 2015.

# Who made the most money each year
money_rank = money_ranking.groupby('Year')['Money'].max()
money_rank = pd.DataFrame(money_rank)


indexs = np.arange(2010, 2019)
names = []
for i in range(money_rank.shape[0]):
    temp = df.loc[df['Money'] == money_rank.iloc[i,0],'Player Name']
    names.append(str(temp.values[0]))

money_rank['Player Name'] = names
print(money_rank)

           Money     Player Name
Year                            
2010   4910477.0     Matt Kuchar
2011   6683214.0     Luke Donald
2012   8047952.0    Rory McIlroy
2013   8553439.0     Tiger Woods
2014   8280096.0    Rory McIlroy
2015  12030465.0   Jordan Spieth
2016   9365185.0  Dustin Johnson
2017   9921560.0   Justin Thomas
2018   8694821.0   Justin Thomas

With this table, we can see the top earner for each year. The most notable entries are Jordan Spieth's earnings of 12 million dollars in 2015 and Justin Thomas earning the most money in both 2017 and 2018.
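
The lookup loop above matches players by their Money value. An alternative sketch uses `idxmax` to fetch the row label of each year's top earner directly, which avoids matching on a float. The `earnings` frame is a stand-in built from four rows that appear in the tables above.

```python
import pandas as pd

# Stand-in rows using earnings that appear in the tables above
earnings = pd.DataFrame({
    'Year': [2015, 2015, 2017, 2017],
    'Player Name': ['Jordan Spieth', 'Jason Day', 'Justin Thomas', 'Jordan Spieth'],
    'Money': [12030465.0, 9403330.0, 9921560.0, 9433033.0],
})

# idxmax returns the row label of the largest Money value within each year
top_rows = earnings.groupby('Year')['Money'].idxmax()
top_earners = earnings.loc[top_rows, ['Year', 'Player Name', 'Money']].set_index('Year')
print(top_earners)
```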

# Plot the correlation matrix between the numeric variables
# (Player Name is text, so it is excluded before calling corr)
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap='coolwarm')

<matplotlib.axes._subplots.AxesSubplot at 0x7f6f3d00e390>

corr['Wins']

Rounds                0.103162
Fairway Percentage   -0.047949
Year                  0.039006
Avg Distance          0.206294
gir                   0.120340
Average Putts        -0.168764
Average Scrambling    0.125193
Average Score        -0.390254
Points                0.750110
Wins                  1.000000
Top 10                0.473453
Average SG Putts      0.149155
Average SG Total      0.384932
SG:OTT                0.232414
SG:APR                0.259363
SG:ARG                0.134948
Money                 0.721665
Name: Wins, dtype: float64

From the correlation matrix, we can observe that FedExCup Points (0.75) and Money (0.72) are highly correlated with Wins, which is expected because both are awarded for finishing well. Fairway Percentage, Year, and Rounds are at most weakly correlated with Wins, with absolute correlations of about 0.1 or less.
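
Ranking the correlations by absolute value makes the strongest and weakest relationships easier to read. The sketch below copies the numbers from the output above into a Series; on the real data the Series would come from the correlation matrix instead.

```python
import pandas as pd

# Correlations with Wins, copied from the output above (Wins itself omitted)
corr_with_wins = pd.Series({
    'Rounds': 0.103162, 'Fairway Percentage': -0.047949, 'Year': 0.039006,
    'Avg Distance': 0.206294, 'gir': 0.120340, 'Average Putts': -0.168764,
    'Average Scrambling': 0.125193, 'Average Score': -0.390254,
    'Points': 0.750110, 'Top 10': 0.473453, 'Average SG Putts': 0.149155,
    'Average SG Total': 0.384932, 'SG:OTT': 0.232414, 'SG:APR': 0.259363,
    'SG:ARG': 0.134948, 'Money': 0.721665,
})

# Rank by absolute value so strong negative correlations are not buried
order = corr_with_wins.abs().sort_values(ascending=False).index
ranked = corr_with_wins.reindex(order)
print(ranked.head(5))
print(ranked.tail(3))
```

Sorting this way also surfaces Average Score, whose correlation is negative because lower scores are better in golf.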

Machine Learning Model (Classification)

To predict winners, I used multiple machine learning models to explore which models could accurately classify if a player is going to win in that year.

To measure the models, I used the Receiver Operating Characteristic Area Under the Curve (ROC AUC). The ROC AUC tells us how well the model separates players with a win from players without one. Because the data is imbalanced, with 83% of players having no wins in a given year, ROC AUC is a more informative metric than the accuracy of the model.
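
To see why accuracy is misleading here, consider a "model" that predicts no win for everyone. The following sketch uses 100 made-up player seasons with the same 83/17 split; it is an illustration, not output from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 100 hypothetical player seasons: 83 without a win (0) and 17 with one (1)
y_true = np.array([0] * 83 + [1] * 17)

# A "model" that predicts no win for everyone
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
baseline_auc = roc_auc_score(y_true, y_always_zero)
print(baseline_accuracy)  # high, despite never identifying a winner
print(baseline_auc)       # chance level
```

The accuracy looks respectable even though no winner is ever found, while the ROC AUC sits at chance level.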

# Importing the Machine Learning modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC  
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

Preparing the Data for Classification

We know from the calculation above that the data for wins is imbalanced: even without machine learning, we know that approximately 83% of player seasons end without a win. Therefore, we will use ROC AUC as the metric for these models.

df['Winner'] = df['Wins'].apply(lambda x: 1 if x>0 else 0)

# New DataFrame 
ml_df = df.copy()

# Y value for machine learning is the Winner column
target = df['Winner']

# Removing the columns Player Name, Wins, and Winner from the dataframe to avoid leakage
ml_df.drop(['Player Name','Wins','Winner'], axis=1, inplace=True)
print(ml_df.head())

   Rounds  Fairway Percentage  Year  ...  SG:APR  SG:ARG      Money
0      60               75.19  2018  ...   0.960  -0.027  2680487.0
1     109               73.58  2018  ...   0.213   0.194  2485203.0
2      93               72.24  2018  ...   0.437  -0.137  2700018.0
3      78               71.94  2018  ...   0.532   0.273  1986608.0
4     103               71.44  2018  ...   0.099   0.026  1089763.0

[5 rows x 16 columns]

per_no_win = target.value_counts()[0] / (target.value_counts()[0] + target.value_counts()[1])
per_no_win = per_no_win.round(4)*100
print(str(per_no_win)+str('%'))

83.09%
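
With a split this uneven, one option worth considering is a stratified train/test split, so that both parts keep roughly the same share of winners. This is a sketch on made-up labels, not the split used in this notebook; `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same imbalance: 830 non-winners, 170 winners
y = np.array([0] * 830 + [1] * 170)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions equal in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=10)

print(y_train.mean(), y_test.mean())
```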

# Function for the logistic regression 
def log_reg(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 10)
    clf = LogisticRegression().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Logistic regression classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Logistic regression classifier on test set: {:.2f}'
         .format(clf.score(X_test, y_test)))
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    
    print(classification_report(y_test, y_pred))

     # Returning the 5 important features 
    #rfe = RFE(clf, 5)
    # rfe = rfe.fit(X, y)
    # print('Feature Importance')
    # print(X.columns[rfe.ranking_ == 1].values)
    
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))

log_reg(ml_df, target)

Accuracy of Logistic regression classifier on training set: 0.90
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  345   8
1   28  38
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       353
           1       0.83      0.58      0.68        66

    accuracy                           0.91       419
   macro avg       0.88      0.78      0.81       419
weighted avg       0.91      0.91      0.91       419

ROC AUC Score: 0.78

From the logistic regression, we got an accuracy of 0.90 on the training set and an accuracy of 0.91 on the test set. This was surprisingly accurate for a first run. However, the ROC AUC score of 0.78 could be improved. Therefore, I decided to add more features as a way of possibly improving the model.
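
One detail about the score: `log_reg` passes the hard 0/1 predictions (`y_pred`) to `roc_auc_score`. When the scores are hard labels, the ROC AUC reduces to the average of the recall on each class, also called balanced accuracy. The sketch below rebuilds the labels from the first confusion matrix above to check this.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Counts from the first confusion matrix above
# (rows are true classes, columns are predicted classes)
tn, fp, fn, tp = 345, 8, 28, 38
y_test = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

auc_from_labels = roc_auc_score(y_test, y_pred)
balanced_acc = balanced_accuracy_score(y_test, y_pred)
print(round(auc_from_labels, 2), round(balanced_acc, 2))
```

Both values round to the reported 0.78. To measure how well the model ranks players, the usual approach is to pass predicted probabilities instead, for example `roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])`; that number is not shown here because it would require rerunning the model.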

# Adding Domain Features 
ml_d = ml_df.copy()
# Top 10 / Money might give us a better understanding on how well they placed in the top 10
ml_d['Top10perMoney'] = ml_d['Top 10'] / ml_d['Money']

# Avg Distance / Fairway Percentage to give us a ratio that determines how accurate and far a player hits 
ml_d['DistanceperFairway'] = ml_d['Avg Distance'] / ml_d['Fairway Percentage']

# Money / Rounds to see on average how much money they would make playing a round of golf 
ml_d['MoneyperRound'] = ml_d['Money'] / ml_d['Rounds']

log_reg(ml_d, target)

Accuracy of Logistic regression classifier on training set: 0.91
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  342  11
1   27  39
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       353
           1       0.78      0.59      0.67        66

    accuracy                           0.91       419
   macro avg       0.85      0.78      0.81       419
weighted avg       0.90      0.91      0.90       419

ROC AUC Score: 0.78

# Adding Polynomial Features to the ml_df 
mldf2 = ml_df.copy()
poly = PolynomialFeatures(2)
poly = poly.fit(mldf2)
poly_feature = poly.transform(mldf2)
print(poly_feature.shape)

# Creating a DataFrame with the polynomial features 
# get_feature_names_out replaces get_feature_names, which newer scikit-learn releases removed
poly_feature = pd.DataFrame(poly_feature, columns = poly.get_feature_names_out(ml_df.columns))
print(poly_feature.head())

(1674, 153)
     1  Rounds  Fairway Percentage  ...  SG:ARG^2  SG:ARG Money       Money^2
0  1.0    60.0               75.19  ...  0.000729    -72373.149  7.185011e+12
1  1.0   109.0               73.58  ...  0.037636    482129.382  6.176234e+12
2  1.0    93.0               72.24  ...  0.018769   -369902.466  7.290097e+12
3  1.0    78.0               71.94  ...  0.074529    542343.984  3.946611e+12
4  1.0   103.0               71.44  ...  0.000676     28333.838  1.187583e+12

[5 rows x 153 columns]

log_reg(poly_feature, target)

Accuracy of Logistic regression classifier on training set: 0.90
Accuracy of Logistic regression classifier on test set: 0.91
     0   1
0  346   7
1   32  34
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       353
           1       0.83      0.52      0.64        66

    accuracy                           0.91       419
   macro avg       0.87      0.75      0.79       419
weighted avg       0.90      0.91      0.90       419

ROC AUC Score: 0.75

Feature engineering did not improve the ROC AUC score. Test accuracy stayed at 0.91 throughout, the domain features left the ROC AUC at 0.78, and the polynomial features lowered it to 0.75. This suggests that another machine learning algorithm might predict winners better.
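
One possible factor, offered as an assumption rather than a diagnosis, is feature scale: columns such as Money and Money^2 are many orders of magnitude larger than the strokes gained columns, and logistic regression is sensitive to that. The sketch below shows how a `MinMaxScaler` (already imported above) could be placed in front of the classifier. It uses synthetic data from `make_classification`, so the score it prints says nothing about the golf data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (not the PGA Tour data)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.83, 0.17],
                           random_state=10)
X[:, 0] *= 1e6  # mimic one column on a much larger scale, like Money

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=10)

# The scaler is fitted on the training data only, then applied to the test data
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Probability of the positive class, used for ranking-based ROC AUC
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(round(auc, 2))
```

Wrapping both steps in a pipeline keeps the scaling parameters from leaking test-set information into training.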

## Random Forest Model

def random_forest(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 10)
    clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Random Forest classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Random Forest classifier on test set: {:.2f}'
         .format(clf.score(X_test, y_test)))
    
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    
    print(classification_report(y_test, y_pred))
    
    # Returning the 5 most important features via recursive feature elimination
    rfe = RFE(clf, n_features_to_select=5)
    rfe = rfe.fit(X, y)
    print('Feature Importance')
    print(X.columns[rfe.ranking_ == 1].values)
    
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))

random_forest(ml_df, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  342  11
1   16  50
              precision    recall  f1-score   support

           0       0.96      0.97      0.96       353
           1       0.82      0.76      0.79        66

    accuracy                           0.94       419
   macro avg       0.89      0.86      0.87       419
weighted avg       0.93      0.94      0.93       419

Feature Importance
['Average Score' 'Points' 'Top 10' 'Average SG Total' 'Money']
ROC AUC Score: 0.86

random_forest(ml_d, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  343  10
1   16  50
              precision    recall  f1-score   support

           0       0.96      0.97      0.96       353
           1       0.83      0.76      0.79        66

    accuracy                           0.94       419
   macro avg       0.89      0.86      0.88       419
weighted avg       0.94      0.94      0.94       419

Feature Importance
['Average Score' 'Points' 'Average SG Total' 'Money' 'MoneyperRound']
ROC AUC Score: 0.86

random_forest(poly_feature, target)

Accuracy of Random Forest classifier on training set: 1.00
Accuracy of Random Forest classifier on test set: 0.94
     0   1
0  340  13
1   14  52
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       353
           1       0.80      0.79      0.79        66

    accuracy                           0.94       419
   macro avg       0.88      0.88      0.88       419
weighted avg       0.94      0.94      0.94       419

Feature Importance
['Year Points' 'Average Putts Points' 'Average Scrambling Top 10'
 'Average Score Points' 'Points^2']
ROC AUC Score: 0.88

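The Random Forest model scored highest on ROC AUC, reaching 0.86 with the original and domain features and 0.88 with the polynomial features, compared with 0.78 at best for logistic regression. The forest is not seeded, so these values can shift slightly between runs. With this, we observed that the Random Forest model classifies players with and without a win more accurately than logistic regression, although the gap between the training accuracy of 1.00 and the test accuracy of 0.94 shows that it fits the training data more closely than it generalizes.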
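
Two small additions could make the forest results easier to reproduce and interpret: fixing `random_state`, and reading the fitted model's `feature_importances_` as a cheaper alternative to RFE. The sketch below uses synthetic data with hypothetical column names (`feature_0` to `feature_5`), so the ranking it prints is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data with hypothetical column names
X_arr, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                               weights=[0.83, 0.17], random_state=10)
X = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

def fit_forest(seed):
    # A fixed random_state makes the fitted forest repeatable
    return RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)

forest_a = fit_forest(10)
forest_b = fit_forest(10)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(forest_a.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Two forests built with the same seed produce the same importances, which is what makes reported scores repeatable.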

Conclusion

It's been interesting to learn about the aspects of the game that differentiate winners from the average PGA Tour player. For example, fairway percentage and greens in regulation do not seem to contribute much to a player's wins. The strokes gained statistics, especially Average SG Total, are more strongly associated with wins. It was interesting to see which aspects of the game professionals should put their time into. This also gave me the idea of tracking my personal golf statistics, so that I could compare them to the pros and find the areas of my game that need the most improvement.

Machine Learning Model

I've been able to examine the data of PGA Tour players and classify whether a player will win in a given year. In the run shown above, the random forest classifier achieved an ROC AUC of 0.88 and an accuracy of 0.94 on the test set. This was a clear improvement over the logistic regression's ROC AUC of 0.78 and accuracy of 0.91. Because the data is imbalanced, with approximately 83% of players not earning a win, the primary measure of the model was the ROC AUC. I was able to improve the ROC AUC from 0.78 to 0.88 by trying two model types, logistic regression and random forest, on three feature sets: the original features, added domain features, and polynomial features.

The End!!

Output of df.describe() after cleaning:

	Rounds	Fairway Percentage	Year	Avg Distance	gir	Average Putts	Average Scrambling	Average Score	Points	Wins	Top 10	Average SG Putts	Average SG Total	SG:OTT	SG:APR	SG:ARG	Money
count	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1674.000000	1.674000e+03
mean	78.769415	61.448614	2014.002987	290.786081	65.667103	29.163542	58.120687	70.922877	631.125448	0.206691	2.337515	0.025408	0.147527	0.037019	0.065192	0.020192	1.488682e+06
std	14.241512	5.057758	2.609352	8.908379	2.743211	0.518966	3.386783	0.698738	452.741472	0.516601	2.060691	0.344145	0.695400	0.379702	0.380895	0.223493	1.410333e+06
min	45.000000	43.020000	2010.000000	266.400000	53.540000	27.510000	44.010000	68.698000	3.000000	0.000000	0.000000	-1.475000	-3.209000	-1.717000	-1.680000	-0.930000	2.465000e+04
25%	69.000000	57.955000	2012.000000	284.900000	63.832500	28.802500	55.902500	70.494250	322.000000	0.000000	1.000000	-0.187750	-0.260250	-0.190250	-0.180000	-0.123000	5.656412e+05
50%	80.000000	61.435000	2014.000000	290.500000	65.790000	29.140000	58.290000	70.904500	530.000000	0.000000	2.000000	0.040000	0.147000	0.055000	0.081000	0.022500	1.046144e+06
75%	89.000000	64.910000	2016.000000	296.375000	67.587500	29.520000	60.420000	71.343750	813.750000	0.000000	3.000000	0.258500	0.568500	0.287750	0.314500	0.175750	1.892478e+06
max	120.000000	76.880000	2018.000000	319.700000	73.520000	31.000000	69.330000	74.400000	4169.000000	5.000000	14.000000	1.130000	2.406000	1.485000	1.533000	0.660000	1.203046e+07

Output of df.head() on the raw data:

	Player Name	Rounds	Fairway Percentage	Year	Avg Distance	gir	Average Putts	Average Scrambling	Average Score	Points	Wins	Top 10	Average SG Putts	Average SG Total	SG:OTT	SG:APR	SG:ARG	Money
0	Henrik Stenson	60.0	75.19	2018	291.5	73.51	29.93	60.67	69.617	868	NaN	5.0	-0.207	1.153	0.427	0.960	-0.027	$2,680,487
1	Ryan Armour	109.0	73.58	2018	283.5	68.22	29.31	60.13	70.758	1,006	1.0	3.0	-0.058	0.337	-0.012	0.213	0.194	$2,485,203
2	Chez Reavie	93.0	72.24	2018	286.5	68.67	29.12	62.27	70.432	1,020	NaN	3.0	0.192	0.674	0.183	0.437	-0.137	$2,700,018
3	Ryan Moore	78.0	71.94	2018	289.2	68.80	29.17	64.16	70.015	795	NaN	5.0	-0.271	0.941	0.406	0.532	0.273	$1,986,608
4	Brian Stuard	103.0	71.44	2018	278.9	67.12	29.11	59.23	71.038	421	NaN	3.0	0.164	0.062	-0.227	0.099	0.026	$1,089,763