Can We Predict If a PGA Tour Player Won a Tournament in a Given Year?
Golf is picking up popularity, so I thought it would be interesting to focus my project here. I set out to find what sets apart the best golfers from the rest. I decided to explore their statistics and to see if I could predict which golfers would win in a given year. My original dataset was found on Kaggle, and the data was scraped from the PGA Tour website.
From this data, I performed an exploratory data analysis to explore the distribution of players on numerous aspects of the game, discover outliers, and further explore how the game has changed from 2010 to 2018. I also utilized numerous supervised machine learning models to predict a golfer's earnings and wins.
To predict a golfer's win, I used classification methods such as logistic regression and Random Forest Classification. The best performance came from the Random Forest Classification method.
- The Data
pgaTourData.csv contains 1674 rows and 18 columns. Each row indicates a golfer's performance for that year.
# Player Name: Name of the golfer
# Rounds: The number of games that a player played
# Fairway Percentage: The percentage of time a tee shot lands on the fairway
# Year: The year in which the statistic was collected
# Avg Distance: The average distance of the tee-shot
# gir: (Green in Regulation) is met if any part of the ball is touching the putting surface while the number of strokes taken is at least two fewer than par
# Average Putts: The average number of strokes taken on the green
# Average Scrambling: Scrambling is when a player misses the green in regulation, but still makes par or better on a hole
# Average Score: Average Score is the average of all the scores a player has played in that year
# Points: The number of FedExCup points a player earned in that year
# Wins: The number of competitions a player has won in that year
# Top 10: The number of competitions where a player has placed in the Top 10
# Average SG Putts: Strokes gained: putting measures how many strokes a player gains (or loses) on the greens
# Average SG Total: The Off-the-tee + approach-the-green + around-the-green + putting statistics combined
# SG:OTT: Strokes gained: off-the-tee measures player performance off the tee on all par-4s and par-5s
# SG:APR: Strokes gained: approach-the-green measures player performance on approach shots
# SG:ARG: Strokes gained: around-the-green measures player performance on any shot within 30 yards of the edge of the green
# Money: The amount of prize money a player has earned from tournaments
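The greens-in-regulation definition above can be illustrated with a small sketch. The function name `is_green_in_regulation` and the hole data are hypothetical and are not part of the dataset; the sketch only encodes the rule that the green must be reached in at least two strokes fewer than par.

```python
def is_green_in_regulation(par: int, strokes_to_reach_green: int) -> bool:
    """Return True if the green was reached in at least two strokes fewer than par."""
    return strokes_to_reach_green <= par - 2


# Hypothetical holes: (par, strokes taken to reach the green)
holes = [(4, 2), (4, 3), (5, 3), (3, 1), (5, 4)]
hits = [is_green_in_regulation(par, strokes) for par, strokes in holes]

# Share of holes where the green was hit in regulation, as a percentage
gir_percentage = 100 * sum(hits) / len(hits)
print(hits)
print(gir_percentage)
```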
# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('pgaTourData.csv')
# Examining the first 5 rows
df.head()
df.info()
df.shape
- Data Cleaning
After looking at the dataframe, the data needs to be cleaned:
-For the columns Top 10 and Wins, convert the NaNs to 0s
-Change Top 10 and Wins into an int
-Drop NaN values for players who do not have the full statistics
-Change the column Rounds into int
-Change points to int
-Remove the dollar sign ($) and commas in the column Money
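# Replace NaN with 0 in Top 10
# (assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about)
df['Top 10'] = df['Top 10'].fillna(0)
df['Top 10'] = df['Top 10'].astype(int)
# Replace NaN with 0 in # of wins
df['Wins'] = df['Wins'].fillna(0)
df['Wins'] = df['Wins'].astype(int)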
# Drop NaN values
df.dropna(axis = 0, inplace=True)
df['Rounds'] = df['Rounds'].astype(int)
# Change Points to int
df['Points'] = df['Points'].apply(lambda x: x.replace(',',''))
df['Points'] = df['Points'].astype(int)
# Remove the $ and commas in money
df['Money'] = df['Money'].apply(lambda x: x.replace('$',''))
df['Money'] = df['Money'].apply(lambda x: x.replace(',',''))
df['Money'] = df['Money'].astype(float)
df.info()
df.describe()
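The string cleanup for Points and Money can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a minimal sketch on a made-up frame (the values are not from pgaTourData.csv), assuming the raw columns are strings formatted like '1,234' and '$1,000,000'.

```python
import pandas as pd

# Toy frame mimicking the raw string formats described above (values are made up)
raw = pd.DataFrame({
    'Points': ['1,234', '567', '2,001'],
    'Money': ['$1,000,000', '$250,500', '$12,030,465'],
})

cleaned = raw.copy()
# regex=False treats ',' and '$' as literal characters rather than regex patterns
cleaned['Points'] = cleaned['Points'].str.replace(',', '', regex=False).astype(int)
cleaned['Money'] = (cleaned['Money']
                    .str.replace('$', '', regex=False)
                    .str.replace(',', '', regex=False)
                    .astype(float))
print(cleaned)
```

Passing `regex=False` matters for the dollar sign, since `$` is a special character in regular expressions.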
- Exploratory Data Analysis
# Looking at the distribution of data
f, ax = plt.subplots(nrows = 6, ncols = 3, figsize=(20,20))
distribution = df.loc[:,df.columns!='Player Name'].columns
rows = 0
cols = 0
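for i, column in enumerate(distribution):
    # distplot is deprecated in seaborn; histplot with a KDE overlay is the suggested replacement
    p = sns.histplot(df[column], kde=True, stat='density', ax=ax[rows][cols])
    cols += 1
    if cols == 3:
        cols = 0
        rows += 1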
From the distributions plotted, most of the variables look approximately normally distributed. However, Money, Points, Wins, and Top 10 are all skewed to the right. This could be explained by the gap between the best players and the average PGA Tour player. The best players have multiple Top 10 finishes and wins, which allow them to earn more from tournaments, while the average player has no wins and only a few Top 10 finishes, which limits their earnings.
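The visual impression of right skew can be backed up with a number. The sketch below uses `DataFrame.skew()` on made-up values (not the real data): a positive result indicates a long right tail, and a value near zero indicates a roughly symmetric distribution.

```python
import pandas as pd

# Made-up season totals: most players have 0 wins, a few have several
toy = pd.DataFrame({
    'Wins': [0, 0, 0, 0, 0, 0, 0, 1, 1, 3],
    'Fairway Percentage': [55.0, 58.0, 60.0, 61.0, 62.0, 62.0, 63.0, 64.0, 66.0, 69.0],
})

# Sample skewness per column
skewness = toy.skew()
print(skewness)
```

On the real dataframe, the same call would be `df.skew(numeric_only=True)`, so that the Player Name column is ignored.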
# Looking at the number of players with Wins for each year
win = df.groupby('Year')['Wins'].value_counts()
win = win.unstack()
win.fillna(0, inplace=True)
# Converting win into ints
win = win.astype(int)
print(win)
From this table, we can see that most players end the year without a win. It's pretty rare to find a player that has won more than once!
players = win.apply(lambda x: np.sum(x), axis=1)
percent_no_win = win[0]/players
percent_no_win = percent_no_win*100
print(percent_no_win)
# Plotting percentage of players without a win each year
fig, ax = plt.subplots()
bar_width = 0.8
opacity = 0.7
index = np.arange(2010, 2019)
plt.bar(index, percent_no_win, bar_width, alpha = opacity)
plt.xticks(index)
plt.xlabel('Year')
plt.ylabel('%')
plt.title('Percentage of Players without a Win')
From the bar chart above, we can observe that the percentage of players without a win is around 80% each year. There was very little variation in this percentage from 2010 to 2018.
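The same yearly percentage can be computed in one step, without building the full table of win counts first. This is a sketch on made-up player-seasons; it relies on the fact that the mean of a boolean Series is the fraction of True values.

```python
import pandas as pd

# Made-up player-seasons (not the real data)
toy = pd.DataFrame({
    'Year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011],
    'Wins': [0, 0, 1, 0, 0, 2, 0, 0, 0],
})

# Flag seasons without a win, then average the flags within each year
pct_no_win = toy['Wins'].eq(0).groupby(toy['Year']).mean() * 100
print(pct_no_win)
```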
# Plotting the number of wins on a bar chart
fig, ax = plt.subplots()
index = np.arange(2010, 2019)
bar_width = 0.2
opacity = 0.7
def plot_bar(index, win, labels):
    # Return the bar container so the rects variables below are not None
    return plt.bar(index, win, bar_width, alpha=opacity, label=labels)
# Plotting the bars
rects = plot_bar(index, win[0], labels = '0 Wins')
rects1 = plot_bar(index + bar_width, win[1], labels = '1 Win')
rects2 = plot_bar(index + bar_width*2, win[2], labels = '2 Wins')
rects3 = plot_bar(index + bar_width*3, win[3], labels = '3 Wins')
rects4 = plot_bar(index + bar_width*4, win[4], labels = '4 Wins')
rects5 = plot_bar(index + bar_width*5, win[5], labels = '5 Wins')
plt.xticks(index + bar_width, index)
plt.xlabel('Year')
plt.ylabel('Number of Players')
plt.title('Distribution of Wins each Year')
plt.legend()
Looking at the distribution of wins each year, we can see that it is rare for a player to win even one tournament on the PGA Tour. The majority of players do not win, and very few players win more than once a year.
top10 = df.groupby('Year')['Top 10'].value_counts()
top10 = top10.unstack()
top10.fillna(0, inplace=True)
players = top10.apply(lambda x: np.sum(x), axis=1)
no_top10 = top10[0]/players * 100
print(no_top10)
Looking at the percentage of players who did not place in the Top 10 each year, we can observe that only about 20% of players finished a season without a Top 10. In addition, this percentage has a range of only 9.47 percentage points across the years, which tells us that the statistic does not vary much from year to year.
distance = df[['Year','Player Name','Avg Distance']].copy()
distance.sort_values(by='Avg Distance', inplace=True, ascending=False)
print(distance.head())
Rory McIlroy is one of the longest hitters in the game, averaging 319.7 yards off the tee in 2018. He was also the longest hitter in 2017 with an average of 316.7 yards.
money_ranking = df[['Year','Player Name','Money']].copy()
money_ranking.sort_values(by='Money', inplace=True, ascending=False)
print(money_ranking.head())
We can see that Jordan Spieth made the most money in a single year, earning a total of 12 million dollars in 2015.
# Who made the most money each year
money_rank = money_ranking.groupby('Year')['Money'].max()
money_rank = pd.DataFrame(money_rank)
indexs = np.arange(2010, 2019)
names = []
for i in range(money_rank.shape[0]):
    # Look up the player whose earnings match that year's maximum
    temp = df.loc[df['Money'] == money_rank.iloc[i,0],'Player Name']
    names.append(str(temp.values[0]))
money_rank['Player Name'] = names
print(money_rank)
With this table, we can examine the top earner in each year. Some of the most notable were Jordan Spieth's earnings of 12 million dollars and Justin Thomas earning the most money in both 2017 and 2018.
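The loop above finds each year's top earner by matching on the Money value. An alternative sketch uses `idxmax` to get the row label of the maximum within each year, which avoids matching on a float value. The players and amounts below are made up for illustration.

```python
import pandas as pd

# Made-up earnings (not the real data)
toy = pd.DataFrame({
    'Year': [2015, 2015, 2016, 2016],
    'Player Name': ['Player A', 'Player B', 'Player A', 'Player C'],
    'Money': [12.0e6, 9.4e6, 5.5e6, 9.3e6],
})

# Row label of the highest earner within each year
top_idx = toy.groupby('Year')['Money'].idxmax()
top_earners = toy.loc[top_idx, ['Year', 'Player Name', 'Money']].set_index('Year')
print(top_earners)
```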
# Plot the correlation matrix between variables
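# numeric_only=True skips the text column Player Name, which newer pandas versions would otherwise reject
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap='coolwarm')
df.corr(numeric_only=True)['Wins']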
From the correlation matrix, we can observe that Money is highly correlated with Wins, along with FedExCup Points. We can also observe that fairway percentage, year, and rounds show little correlation with Wins.
- Machine Learning Model (Classification)
To predict winners, I used multiple machine learning models to explore which models could accurately classify if a player is going to win in that year.
To measure the models, I used the Receiver Operating Characteristic Area Under the Curve (ROC AUC). The ROC AUC tells us how capable the model is at distinguishing players with a win from players without one. In addition, because the data is skewed, with 83% of players having no wins in a given year, ROC AUC is a much better metric than the accuracy of the model.
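To see why accuracy is misleading on skewed data, consider a "model" that always predicts no win. The sketch below uses made-up labels with the same 83/17 split: accuracy looks strong, while the ROC AUC shows that the predictions carry no information.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: 83 players without a win (0) and 17 with a win (1)
y_true = np.array([0] * 83 + [1] * 17)
# A baseline that always predicts "no win"
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
baseline_auc = roc_auc_score(y_true, y_always_zero)
print(baseline_accuracy, baseline_auc)
```

The baseline reaches an accuracy of 0.83 without learning anything, while its ROC AUC is 0.5, the value expected from uninformative predictions.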
# Importing the Machine Learning modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
Preparing the Data for Classification
We know from the calculation above that the data for wins is skewed. Even without machine learning, we know that approximately 83% of the players do not win in a given year. Therefore, we will use ROC AUC as the metric for these models.
df['Winner'] = df['Wins'].apply(lambda x: 1 if x>0 else 0)
# New DataFrame
ml_df = df.copy()
# Y value for machine learning is the Winner column
target = df['Winner']
# Removing the columns Player Name, Wins, and Winner from the dataframe to avoid leakage
ml_df.drop(['Player Name','Wins','Winner'], axis=1, inplace=True)
print(ml_df.head())
per_no_win = target.value_counts()[0] / (target.value_counts()[0] + target.value_counts()[1])
per_no_win = per_no_win.round(4)*100
print(str(per_no_win)+str('%'))
# Function for the logistic regression
def log_reg(X, y):
    # Hold out a test set (train_test_split defaults to a 75/25 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state = 10)
    clf = LogisticRegression().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Logistic regression classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of Logistic regression classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    print(classification_report(y_test, y_pred))
    # Returning the 5 important features
    # rfe = RFE(clf, n_features_to_select=5)
    # rfe = rfe.fit(X, y)
    # print('Feature Importance')
    # print(X.columns[rfe.ranking_ == 1].values)
    # Note: this ROC AUC is computed from the hard 0/1 predictions, not from predicted probabilities
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))
log_reg(ml_df, target)
From the logistic regression, we got an accuracy of 0.9 on the training set and an accuracy of 0.91 on the test set. This was surprisingly accurate for a first run. However, the ROC AUC Score of 0.78 could be improved. Therefore, I decided to add more features as a way of possibly improving the model.
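One detail worth noting: the ROC AUC in `log_reg` is computed from the class labels returned by `clf.predict`. `roc_auc_score` can also take predicted probabilities, which lets it evaluate every decision threshold instead of a single one. The sketch below compares the two on synthetic data from `make_classification` (an assumption for illustration, not the PGA data), so its numbers say nothing about the scores reported in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the golf features (about 83% negatives)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.83, 0.17], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=10)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC AUC from hard 0/1 predictions (a single threshold)
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))
# ROC AUC from the predicted probability of the positive class (all thresholds)
auc_from_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print('From labels: {:.2f}'.format(auc_from_labels))
print('From probabilities: {:.2f}'.format(auc_from_proba))
```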
# Adding Domain Features
ml_d = ml_df.copy()
# Top 10 / Money might give us a better understanding on how well they placed in the top 10
ml_d['Top10perMoney'] = ml_d['Top 10'] / ml_d['Money']
# Avg Distance / Fairway Percentage to give us a ratio that determines how accurate and far a player hits
ml_d['DistanceperFairway'] = ml_d['Avg Distance'] / ml_d['Fairway Percentage']
# Money / Rounds to see on average how much money they would make playing a round of golf
ml_d['MoneyperRound'] = ml_d['Money'] / ml_d['Rounds']
log_reg(ml_d, target)
# Adding Polynomial Features to the ml_df
mldf2 = ml_df.copy()
poly = PolynomialFeatures(2)
poly = poly.fit(mldf2)
poly_feature = poly.transform(mldf2)
print(poly_feature.shape)
# Creating a DataFrame with the polynomial features
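# get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide
poly_feature = pd.DataFrame(poly_feature, columns = poly.get_feature_names_out(ml_df.columns))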
print(poly_feature.head())
log_reg(poly_feature, target)
The feature engineering did not improve the ROC AUC Score. In fact, as I added more features, the accuracy and the ROC AUC Score decreased. This could signal that another machine learning algorithm might better predict winners.
## Random Forest Model
def random_forest(X, y):
    # Hold out a test set (train_test_split defaults to a 75/25 split)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state = 10)
    clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy of Random Forest classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of Random Forest classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    cf_mat = confusion_matrix(y_test, y_pred)
    confusion = pd.DataFrame(data = cf_mat)
    print(confusion)
    print(classification_report(y_test, y_pred))
    # Returning the 5 important features
    # (n_features_to_select must be passed by keyword in recent scikit-learn versions)
    rfe = RFE(clf, n_features_to_select=5)
    rfe = rfe.fit(X, y)
    print('Feature Importance')
    print(X.columns[rfe.ranking_ == 1].values)
    # Note: this ROC AUC is computed from the hard 0/1 predictions, not from predicted probabilities
    print('ROC AUC Score: {:.2f}'.format(roc_auc_score(y_test, y_pred)))
random_forest(ml_df, target)
random_forest(ml_d, target)
random_forest(poly_feature, target)
The Random Forest Model scored highly on the ROC AUC Score, obtaining a value of 0.89. With this, we observed that the Random Forest Model could accurately classify players with and without a win.
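The `random_forest` function uses RFE to pick five features. A fitted random forest also exposes impurity-based importances through its `feature_importances_` attribute, which gives a quicker, if rougher, ranking. The sketch below runs on synthetic data with hypothetical column names, so the ranking it prints says nothing about the golf statistics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (not the PGA features)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=10)
X = pd.DataFrame(X, columns=['feature_{}'.format(i) for i in range(6)])

forest = RandomForestClassifier(n_estimators=200, random_state=10).fit(X, y)

# Impurity-based importances sum to 1 across all features
importances = pd.Series(forest.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.head(5))
```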
- Conclusion
It's been interesting to learn about the numerous aspects of the game that differentiate the winner from the average PGA Tour player. For example, we can see that fairway percentage and greens in regulation do not seem to contribute as much to a player's wins. However, all the strokes gained statistics contribute pretty highly to wins for these players. It was interesting to see which aspects of the game the professionals should put their time into. This also gave me the idea of tracking my personal golf statistics, so that I could compare them to the pros and find the areas of my game that need the most improvement.
Machine Learning Model
I've been able to examine the data of PGA Tour players and classify whether a player will win in a given year. With the random forest classification model, I was able to achieve an ROC AUC of 0.89 and an accuracy of 0.95 on the test set. This was a significant improvement over the logistic regression's ROC AUC of 0.78 and accuracy of 0.91. Because the data is skewed, with approximately 80% of players not earning a win, the primary measure of the model was the ROC AUC. I was able to improve my model from an ROC AUC score of 0.78 to a score of 0.89 by simply trying 3 different models and adding domain features and polynomial features.
The End!!