In this project we aim to demonstrate our learning and understanding of the data lifecycle by applying it to a real-world situation. Our goal in this tutorial is to try and accurately classify soccer players into positional groups based on their skills by determining the skills that are most useful for different positions.
As of 2020, soccer is by far the most watched sport in the world, with more than 4 billion viewers worldwide. The FIFA World Cup in particular is the biggest sporting event on the planet. Nothing compares to it: not the Super Bowl, not the Olympics. 2022 is a World Cup year. The tournament involves 32 nations; the last 16 play an outright knockout, with the winner of each game advancing to the next round until a champion is decided. Because this is a World Cup year, soccer was the most relatable real-world situation I could use for this project. FIFA23 is a video game by EA Sports, available on all major gaming platforms, that aims to be as realistic as possible. To rate a real-life soccer player's ability for his in-game statistics and ranking, EA Sports used Xsens motion capture suits on players and had them play games. Moreover, EA, together with data reviewers, evaluates each player's individual attributes, which are given a position-dependent coefficient and added together.
In order to obtain this information, we will be using data from the FIFA23 Ultimate Team Players database from the game FIFA23 by EA Sports.
If you are unfamiliar with the different positions in soccer, I recommend that you read this short guide. It describes the roles of different positions, which will help you understand the different attributes we will be examining and analyzing later in this tutorial. Let us first take a look at the libraries we will need to use for this tutorial.
Together with data reviewers, EA analyzes each player's individual characteristics, which are assigned position-dependent coefficients and combined. A FIFA rating is then created by multiplying this number by the player's international reputation.
You may read this article to find out more about how these ratings are determined.
EA Sports tries to make the simulation of the game as realistic as possible. When determining a player's overall rating, the characteristics that matter most for that player's position are given a greater impact. For instance, a striker's overall rating is more influenced by shooting than a defender's is.
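To make the idea of position-dependent weighting concrete, here is a minimal sketch. The coefficients and attribute names below are made up for illustration and are not EA's actual values, and `weighted_rating` is a hypothetical helper, not part of any EA tooling.

```python
# Hypothetical position-dependent coefficients (NOT EA's real values).
# Each set of weights sums to 1 so the result stays on the 0-100 scale.
POSITION_WEIGHTS = {
    "ST": {"shooting": 0.6, "pace": 0.3, "defending": 0.1},
    "CB": {"shooting": 0.1, "pace": 0.3, "defending": 0.6},
}


def weighted_rating(attributes, position):
    """Return the weighted sum of a player's attributes for one position."""
    weights = POSITION_WEIGHTS[position]
    return sum(weights[name] * attributes[name] for name in weights)


# An illustrative player who shoots well and defends poorly.
player = {"shooting": 90, "pace": 80, "defending": 40}

rating_as_striker = weighted_rating(player, "ST")
rating_as_defender = weighted_rating(player, "CB")
print(rating_as_striker, rating_as_defender)
```

With these illustrative weights, the same attributes produce a much higher rating at striker than at center back, which is the effect described above.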
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt
import plotly as pltly
import plotly.express as px
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
The dataset I will be using contains data for all 18,539 players in the game FIFA23 at the time of its launch. The dataset was found on Kaggle. You can click this link to view the database of players.
The dataset contains all the player ratings, and all the attributes that one can view in the FIFA23 Ultimate Team. The dataset contains 89 different data values for each player.
I downloaded the dataset as a csv file. Let us load in the dataset, save it as a dataframe and take a look at it.
db = 'Fifa 23 Players Data.csv'
df = pd.read_csv(db)
df.head()
Known As | Full Name | Overall | Potential | Value(in Euro) | Positions Played | Best Position | Nationality | Image Link | Age | ... | LM Rating | CM Rating | RM Rating | LWB Rating | CDM Rating | RWB Rating | LB Rating | CB Rating | RB Rating | GK Rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | L. Messi | Lionel Messi | 91 | 91 | 54000000 | RW | CAM | Argentina | https://cdn.sofifa.net/players/158/023/23_60.png | 35 | ... | 91 | 88 | 91 | 67 | 66 | 67 | 62 | 53 | 62 | 22 |
1 | K. Benzema | Karim Benzema | 91 | 91 | 64000000 | CF,ST | CF | France | https://cdn.sofifa.net/players/165/153/23_60.png | 34 | ... | 89 | 84 | 89 | 67 | 67 | 67 | 63 | 58 | 63 | 21 |
2 | R. Lewandowski | Robert Lewandowski | 91 | 91 | 84000000 | ST | ST | Poland | https://cdn.sofifa.net/players/188/545/23_60.png | 33 | ... | 86 | 83 | 86 | 67 | 69 | 67 | 64 | 63 | 64 | 22 |
3 | K. De Bruyne | Kevin De Bruyne | 91 | 91 | 107500000 | CM,CAM | CM | Belgium | https://cdn.sofifa.net/players/192/985/23_60.png | 31 | ... | 91 | 91 | 91 | 82 | 82 | 82 | 78 | 72 | 78 | 24 |
4 | K. Mbappé | Kylian Mbappé | 91 | 95 | 190500000 | ST,LW | ST | France | https://cdn.sofifa.net/players/231/747/23_60.png | 23 | ... | 92 | 84 | 92 | 70 | 66 | 70 | 66 | 57 | 66 | 21 |
5 rows × 89 columns
After a cursory review, the data appears to be rather clean, except for some missing values. One thing to note is that the dataframe has 89 columns, many of which I may not need, so I will remove the ones that are not required for this tutorial.
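One quick way to see where the missing values are is to count them per column. The sketch below uses a small toy dataframe with made-up values in place of the real file; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the FIFA data; the values are made up.
toy = pd.DataFrame({
    "Known As": ["A", "B", "C"],
    "Overall": [91, np.nan, 88],
    "Club Name": ["X", None, None],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on the full dataframe would show which columns need attention before modeling.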
In order to clean the data, I will be dropping the columns that we surely won't need in this tutorial. These include the columns that hold a rating for every position in soccer regardless of the position the player actually plays. Additionally, we won't need properties like the player's wage, value, club details, and national team details.
Each player in the dataframe has several kinds of position: Club Position (the position the player plays in for his club team), Best Position (the position that best suits the player's attributes), and National Team Position (the position the player plays in for his national team). We choose Best Position, for simplicity, since it reflects the player's attributes rather than a team role, which management might adjust for a number of reasons. Also, in the club and national team position columns, substitutes are listed as "Sub", which does not tell us where the player plays when he does play (substitutes do, of course, play; they just might not start a game). The same goes for reserve players, who are listed as "Rev".
Hence, I drop such attribute columns from the dataframe.
columns_drop = [
# 'Known As', 'Full Name', 'Overall', 'Potential',
'Value(in Euro)', 'Positions Played',
# 'Best Position',
'Nationality', 'Image Link',
# 'Age', 'Height(in cm)', 'Weight(in kg)', 'TotalStats', 'BaseStats',
'Club Name', 'Wage(in Euro)', 'Release Clause', 'Club Position',
'Contract Until', 'Club Jersey Number', 'Joined On', 'On Loan',
# 'Preferred Foot', 'Weak Foot Rating', 'Skill Moves',
'International Reputation', 'National Team Name',
'National Team Image Link', 'National Team Position',
'National Team Jersey Number',
# 'Attacking Work Rate', 'Defensive Work Rate',
'Pace Total', 'Shooting Total', 'Passing Total',
'Dribbling Total', 'Defending Total', 'Physicality Total',
# 'Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys',
# 'Dribbling', 'Curve', 'Freekick Accuracy', 'LongPassing', 'BallControl',
# 'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance',
# 'Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots',
# 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
# 'Composure', 'Marking', 'Standing Tackle', 'Sliding Tackle',
# 'Goalkeeper Diving', 'Goalkeeper Handling', ' GoalkeeperKicking',
# 'Goalkeeper Positioning', 'Goalkeeper Reflexes',
'ST Rating', 'LW Rating', 'LF Rating', 'CF Rating', 'RF Rating', 'RW Rating',
'CAM Rating', 'LM Rating', 'CM Rating', 'RM Rating', 'LWB Rating',
'CDM Rating', 'RWB Rating', 'LB Rating', 'CB Rating', 'RB Rating',
'GK Rating']
df.drop(columns=columns_drop, inplace=True)
df
Known As | Full Name | Overall | Potential | Best Position | Age | Height(in cm) | Weight(in kg) | TotalStats | BaseStats | ... | Penalties | Composure | Marking | Standing Tackle | Sliding Tackle | Goalkeeper Diving | Goalkeeper Handling | GoalkeeperKicking | Goalkeeper Positioning | Goalkeeper Reflexes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | L. Messi | Lionel Messi | 91 | 91 | CAM | 35 | 169 | 67 | 2190 | 452 | ... | 75 | 96 | 20 | 35 | 24 | 6 | 11 | 15 | 14 | 8 |
1 | K. Benzema | Karim Benzema | 91 | 91 | CF | 34 | 185 | 81 | 2147 | 455 | ... | 84 | 90 | 43 | 24 | 18 | 13 | 11 | 5 | 5 | 7 |
2 | R. Lewandowski | Robert Lewandowski | 91 | 91 | ST | 33 | 185 | 81 | 2205 | 458 | ... | 90 | 88 | 35 | 42 | 19 | 15 | 6 | 12 | 8 | 10 |
3 | K. De Bruyne | Kevin De Bruyne | 91 | 91 | CM | 31 | 181 | 70 | 2303 | 483 | ... | 83 | 89 | 68 | 65 | 53 | 15 | 13 | 5 | 10 | 13 |
4 | K. Mbappé | Kylian Mbappé | 91 | 95 | ST | 23 | 182 | 73 | 2177 | 470 | ... | 80 | 88 | 26 | 34 | 32 | 13 | 5 | 7 | 11 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18534 | D. Collins | Darren Collins | 47 | 56 | CAM | 21 | 174 | 68 | 1287 | 274 | ... | 40 | 47 | 39 | 29 | 27 | 6 | 9 | 5 | 13 | 8 |
18535 | Yang Dejiang | Dejiang Yang | 47 | 57 | CDM | 17 | 175 | 60 | 1289 | 267 | ... | 33 | 45 | 46 | 50 | 52 | 6 | 12 | 11 | 8 | 6 |
18536 | L. Mullan | Liam Mullan | 47 | 67 | RM | 18 | 170 | 65 | 1333 | 277 | ... | 43 | 59 | 39 | 37 | 48 | 11 | 12 | 8 | 7 | 12 |
18537 | D. McCallion | Daithí McCallion | 47 | 61 | CB | 17 | 178 | 65 | 1113 | 226 | ... | 37 | 41 | 50 | 54 | 54 | 8 | 14 | 13 | 7 | 8 |
18538 | N. Rabha | Nabin Rabha | 47 | 50 | LB | 25 | 176 | 66 | 1277 | 269 | ... | 35 | 32 | 47 | 44 | 43 | 13 | 13 | 6 | 14 | 14 |
18539 rows × 49 columns
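Typing out a long drop list by hand is error-prone. As an alternative sketch (not what the cell above does), part of the list could be built from the column names themselves. The toy dataframe and the `keep` set are assumptions for illustration. Note that "Weak Foot Rating" also ends in " Rating" and has to be excluded explicitly.

```python
import pandas as pd

# Toy frame with a few column names taken from the dataset; values are made up.
toy = pd.DataFrame({
    "Known As": ["A", "B"],
    "Finishing": [90, 30],
    "Weak Foot Rating": [4, 3],
    "ST Rating": [91, 40],
    "CB Rating": [45, 85],
    "Pace Total": [88, 70],
})

# "Weak Foot Rating" is a player attribute, not a per-position rating, so keep it.
keep = {"Weak Foot Rating"}

# Per-position ratings end in " Rating" and aggregate scores end in " Total".
pattern_drop = [
    col for col in toy.columns
    if col.endswith((" Rating", " Total")) and col not in keep
]

trimmed = toy.drop(columns=pattern_drop)
print(trimmed.columns.tolist())
```

Returning a new dataframe instead of using `inplace=True` also keeps the original available if a dropped column turns out to be needed later.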
df.rename(columns={'Best Position': 'player_position'}, inplace=True)
For the sake of simplicity, I decided to drop goalkeepers (GK) from the dataframe. Goalkeeping attributes are usually very low for every player who is not a goalkeeper, and outfield players almost never play in goal during their careers.
df.drop(df[df.player_position == "GK"].index, inplace=True)
Now that we have our Dataset ready, let's process and explore it!
FIFA classifies players into 17 different positions. Since goalkeepers have been removed, 16 remain. Let's examine the positions of every player in the game. A pie chart can be used to visualize the players' positions. We will use the plotly library to lay out our pie charts.
position_count = df.groupby(df['player_position']).count().reset_index()
position_count = position_count[['player_position','Known As']]
position_count.rename(columns={'player_position': 'Position',"Known As": "Count"}, inplace = True)
position_count.sort_values('Count', inplace=True, ascending=False)
fig = pltly.graph_objects.Figure(data=[pltly.graph_objects.Pie(labels=position_count['Position'], values=position_count['Count'])])
fig.update_layout(title='Number of Players at each Position')
fig.show()
As the pie chart above shows, center backs (CB), strikers (ST), and center attacking midfielders (CAM) account for the largest numbers of players. This is consistent with the fact that the average team formation tends to field more players in defense than at any other specific position.
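The per-position counts used for the pie chart can also be obtained with `value_counts`, which returns the counts already sorted in descending order. This is a sketch on a toy column with made-up values, not a replacement for the cell above.

```python
import pandas as pd

# Toy position column; the values are made up.
toy = pd.DataFrame({"player_position": ["CB", "ST", "CB", "CAM", "CB", "ST"]})

# value_counts() counts each distinct position and sorts by frequency.
position_count = (
    toy["player_position"]
    .value_counts()
    .rename_axis("Position")
    .reset_index(name="Count")
)
print(position_count)
```

The resulting `Position` and `Count` columns have the same shape as the ones passed to the pie chart.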
Players in the same positional group often have a comparable set of skills. For instance, two attacking players who play different positions will tend to have similar shooting attribute scores. To make the categorization process simpler, we shall group the players in the same way as the position guide mentioned earlier (a Bleacher Report article) does. The code that follows shows how we organize particular positions into their corresponding positional groups. After storing all of the positional groups in a dictionary, we'll examine how the pie chart changes once positions are grouped.
def get_pos_groups_dfs(df):
    # Map each position group name to the sub-dataframe of its players
    positions_groups_df = {}

    def get_pos_grp_lst(group):
        # Rows whose position belongs to the given list of positions
        return df.loc[df['player_position'].isin(group)]

    positions_groups_df['Center Backs'] = get_pos_grp_lst(['CB'])
    positions_groups_df['Wing Backs'] = get_pos_grp_lst(['RB', 'RWB', 'LB', 'LWB'])
    positions_groups_df['Center Midfielders'] = get_pos_grp_lst(['CDM', 'CM', 'CAM'])
    positions_groups_df['Midfielders'] = get_pos_grp_lst(['LM', 'LW', 'RM', 'RW'])
    positions_groups_df['Strikers'] = get_pos_grp_lst(['ST', 'CF', 'LF', 'RF'])
    return positions_groups_df
# Creating groups by Position
positions_groups = {}
positions_groups['Center Backs'] = ['CB']
positions_groups['Wing Backs'] = ['RB', 'RWB', 'LB', 'LWB']
positions_groups['Center Midfielders'] = ['CM', 'CAM', 'CDM']
positions_groups['Midfielders'] = ['LM', 'LW', 'RM', 'RW']
positions_groups['Strikers'] = ['ST', 'CF', 'LF', 'RF']
def categorize_pos(pos):
    # Return the name of the group that contains this position
    for key, value in positions_groups.items():
        if pos in value:
            return key

df['Position Group'] = df.apply(
    lambda row: categorize_pos(row['player_position']), axis=1)
# getting dictionary with dataframe for each of the groups of positions groups
positions_groups_df = get_pos_groups_dfs(df)
counts = []
for p_df in positions_groups_df.values():
    counts.append(len(p_df['Full Name']))
fig = pltly.graph_objects.Figure(data=[pltly.graph_objects.Pie(
labels=list(positions_groups_df.keys()), values=counts)])
fig.update_layout(title='Total Number of Players in Each Position Group')
fig.show()
Once again, defensive players (Center Backs and Wing Backs together) make up the largest share.
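The group assignment can also be written as a dictionary lookup. The sketch below inverts the same `positions_groups` mapping into a position-to-group dictionary and applies it with `Series.map`; the toy dataframe is made up, and the name `position_to_group` is introduced here only for illustration.

```python
import pandas as pd

positions_groups = {
    "Center Backs": ["CB"],
    "Wing Backs": ["RB", "RWB", "LB", "LWB"],
    "Center Midfielders": ["CM", "CAM", "CDM"],
    "Midfielders": ["LM", "LW", "RM", "RW"],
    "Strikers": ["ST", "CF", "LF", "RF"],
}

# Invert the mapping so each position points at its group.
position_to_group = {
    pos: group
    for group, members in positions_groups.items()
    for pos in members
}

# Toy positions, including one (GK) that belongs to no group.
toy = pd.DataFrame({"player_position": ["CB", "LWB", "CAM", "RW", "CF", "GK"]})
toy["Position Group"] = toy["player_position"].map(position_to_group)
print(toy)
```

A position that is missing from the mapping ends up as NaN, which makes it easy to spot rows that were not assigned to any group.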
It is better to make a stacked bar graph that breaks down the position groups into individual positions as well to help visualize the breakdown of positions.
all_positions_count = {'CAM': [], 'CB': [], 'CDM': [], 'CF': [], 'CM': [], 'LB': [], 'LM': [], 'LW': [], 'LWB': [], 'RB': [], 'RM': [], 'RW': [], 'RWB': [], 'ST': []}
bar_data_df = pd.DataFrame()
bar_data_df['Position Group'] = list(positions_groups_df.keys())
bar_data_df.set_index('Position Group', inplace=True)
for group, p_df in positions_groups_df.items():
    for pos, lst in all_positions_count.items():
        # Count the players in this group who play at this position
        lst.append((p_df['player_position'] == pos).sum())
for pos, lst in all_positions_count.items():
    bar_data_df[pos] = lst
bar_data_df.plot.bar(stacked=True, subplots=False, figsize=(15,10), title='Full Position Breakdown of FIFA Players')
[Figure: stacked bar chart titled "Full Position Breakdown of FIFA Players", with Position Group on the x-axis]
This stacked bar graph gives a better representation of the frequency of each position within each of the position groups. One quick thing I can notice is that center backs, strikers, and center midfielders are the most frequently occurring position groups.
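The same group-by-position table can be built in a single call with `pd.crosstab`. This is a sketch on a made-up toy dataframe that reuses the column names from this tutorial.

```python
import pandas as pd

# Toy data; the rows are made up.
toy = pd.DataFrame({
    "Position Group": ["Wing Backs", "Wing Backs", "Strikers", "Strikers", "Strikers"],
    "player_position": ["LB", "RB", "ST", "ST", "CF"],
})

# Rows are position groups, columns are individual positions, cells are counts.
breakdown = pd.crosstab(toy["Position Group"], toy["player_position"])
print(breakdown)

# breakdown.plot.bar(stacked=True) would then draw the stacked bars.
```

Combinations that never occur are filled with zeros, so the table can be plotted directly.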
In order to determine a player's overall rating, certain characteristics are weighted differently depending on the player's position. So far, I have grouped the players into positional groups. To describe their skill sets, I must now combine individual attributes into categories.
I will be using the Futbin description of player traits to decide how to group the attributes into skill sets. Below, the attributes are grouped into 6 parent categories, as follows:
phy = ['Jumping', 'Stamina', 'Strength', 'Aggression']
pas = ['Vision', 'Crossing', 'Freekick Accuracy',
'Short Passing', 'LongPassing', 'Curve']
pac = ['Acceleration', 'Sprint Speed',]
sho = ['Positioning', 'Finishing', 'Shot Power',
'Long Shots', 'Volleys', 'Penalties']
dri = ['Dribbling', 'BallControl', 'Agility',
'Reactions', 'Balance', 'Composure']
defe = ['Heading Accuracy', 'Interceptions',
'Standing Tackle', 'Sliding Tackle', 'Marking']
Let's now add these to a dataframe to keep track of these categorical attributes. I will also go ahead and calculate the aggregate rating for each of these attribute groups. This is done by simply taking the mean of the ratings of the attributes that are contained in each category, and saving them into the categories mentioned in the text above.
df['phy'] = df[phy].mean(axis=1)
df['pas'] = df[pas].mean(axis=1)
df['pac'] = df[pac].mean(axis=1)
df['sho'] = df[sho].mean(axis=1)
df['dri'] = df[dri].mean(axis=1)
df['defe'] = df[defe].mean(axis=1)
category_ratings_df = df[['Full Name', 'player_position', 'phy',
'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']]
category_ratings_df = category_ratings_df.rename(
columns={'player_position': 'position'})
category_ratings_df.head(10)
Full Name | position | phy | pas | pac | sho | dri | defe | Overall | |
---|---|---|---|---|---|---|---|---|---|
0 | Lionel Messi | CAM | 62.50 | 90.833333 | 81.5 | 87.166667 | 93.666667 | 37.8 | 91 |
1 | Karim Benzema | CF | 76.50 | 80.666667 | 79.5 | 87.166667 | 85.000000 | 42.8 | 91 |
2 | Robert Lewandowski | ST | 82.25 | 78.333333 | 75.5 | 90.333333 | 85.666667 | 47.2 | 91 |
3 | Kevin De Bruyne | CM | 75.00 | 91.000000 | 74.5 | 87.000000 | 85.333333 | 61.4 | 91 |
4 | Kylian Mbappé | ST | 76.00 | 77.666667 | 97.0 | 86.333333 | 89.833333 | 40.4 | 91 |
5 | Mohamed Salah | RW | 73.50 | 79.833333 | 90.0 | 87.166667 | 90.666667 | 47.2 | 90 |
8 | C. Ronaldo dos Santos Aveiro | ST | 77.75 | 78.500000 | 81.0 | 91.166667 | 84.333333 | 39.8 | 90 |
9 | Virgil van Dijk | CB | 85.00 | 68.833333 | 79.5 | 58.500000 | 73.166667 | 89.4 | 90 |
10 | Harry Kane | ST | 81.25 | 80.500000 | 68.0 | 90.666667 | 81.666667 | 50.6 | 89 |
11 | Neymar da Silva Santos Jr. | LW | 64.00 | 85.666667 | 87.0 | 84.333333 | 90.833333 | 39.2 | 89 |
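To show what the category scores mean, here is a small worked example on two made-up players. It uses a dictionary of categories and a loop instead of one line per category; the dictionary name `categories` is an assumption introduced for this sketch.

```python
import pandas as pd

# Two made-up players with the attributes of two categories.
toy = pd.DataFrame({
    "Acceleration": [90, 60],
    "Sprint Speed": [80, 70],
    "Jumping": [60, 80],
    "Stamina": [70, 90],
    "Strength": [50, 85],
    "Aggression": [40, 85],
})

categories = {
    "pac": ["Acceleration", "Sprint Speed"],
    "phy": ["Jumping", "Stamina", "Strength", "Aggression"],
}

# Each category score is the row-wise mean of its member attributes.
for name, cols in categories.items():
    toy[name] = toy[cols].mean(axis=1)

print(toy[["pac", "phy"]])
```

For the first player, `pac` is (90 + 80) / 2 = 85 and `phy` is (60 + 70 + 50 + 40) / 4 = 55.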
The next stage will be to determine the relationship between the attribute categories and the overall rating within each position group. As previously stated, different positions value attributes differently.
positions_groups_df = get_pos_groups_dfs(df)
For each position group, I will generate correlation matrices. This will allow us to observe how attribute categories affect a player's overall rating in different position groupings.
Hypothesis/Assumption:
plt.title("Correlation Heatmap of Overall and Attributes (Midfielders)", fontsize=12)
sns.heatmap(positions_groups_df["Midfielders"][[
'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
[Figure: Correlation Heatmap of Overall and Attributes (Midfielders)]
Observation:
Hypothesis/Assumption:
plt.title("Correlation Heatmap of Overall and Attributes (Attackers)", fontsize=12)
sns.heatmap(positions_groups_df["Strikers"][[
'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
[Figure: Correlation Heatmap of Overall and Attributes (Attackers)]
Observation:
Hypothesis/Assumption:
plt.title("Correlation Heatmap of Overall and Attributes (Defenders)", fontsize=12)
sns.heatmap(positions_groups_df["Center Backs"][[
'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
# positions_groups['Center Backs'] = ['CB']
# positions_groups['Wing Backs'] = ['RB', 'RWB', 'LB', 'LWB']
# positions_groups['Center Midfielders'] = ['CM', 'CAM', 'CDM']
# positions_groups['Midfielders'] = ['LM', 'LW', 'RM', 'RW']
# positions_groups['Strikers'] = ['ST', 'CF', 'LF', 'RF']
[Figure: Correlation Heatmap of Overall and Attributes (Defenders)]
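A heatmap is easier to interpret alongside the numbers behind it. The sketch below pulls out only the correlations with `Overall` and sorts them. The data is synthetic: it is generated so that `Overall` depends mostly on `sho`, roughly imitating a striker-like group, and it does not come from the FIFA dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic category scores (NOT real FIFA data).
sho = rng.uniform(40, 90, n)
defe = rng.uniform(40, 90, n)

# Synthetic "striker-like" overall: driven mostly by shooting, plus noise.
overall = 0.9 * sho + 0.1 * defe + rng.normal(0, 2, n)

toy = pd.DataFrame({"sho": sho, "defe": defe, "Overall": overall})

# Keep only the column of correlations with Overall, strongest first.
corr_with_overall = (
    toy.corr()["Overall"].drop("Overall").sort_values(ascending=False)
)
print(corr_with_overall)
```

Applied to each entry of `positions_groups_df`, the same expression would give one ranked list of attribute categories per position group.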
Now that we've figured out which traits are vital for players in various positions, we'll look at players' skill sets and try to estimate which position group their skill set will be most suited for. We may then compare that to the present positions in which players are playing.
In this last section, my aim is to develop a model that can accurately categorize players into positional groups based on their skillset. Because a player can be classified into one of several position groups, this is a multi-class classification problem, and Linear Discriminant Analysis (LDA) is a reasonable model for it. The task also suits LDA's setup: the true groups are already known, the independent variables (the attributes) are numerical, and the dependent variable (the positional group of a given player) is categorical.
First, we will divide our dataframe of players into features and labels.
When we try to fit this model, it is required that we split the data into a training set and test set. Our next step is to break up the players dataframe into these two groups. We will be using a test size of 0.3.
# Divide player dataframe into features and labels (features are the attribute categories and labels are the positional group)
# Columns are selected by name so the split does not depend on column order
Features = df[['phy', 'pas', 'pac', 'sho', 'dri', 'defe']].to_numpy()
labels = df['Position Group'].to_numpy()
Features_train, Features_test, labels_train, labels_test = train_test_split(
    Features, labels, test_size=0.3)
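The split above is random, so every run produces a different test set. As an optional variation (not what this tutorial does), `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the class proportions equal in both sets. The data below is a made-up toy example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 players, 2 features, 2 balanced position groups (made up).
features = np.arange(40, dtype=float).reshape(20, 2)
labels = np.array(["Strikers"] * 10 + ["Center Backs"] * 10)

# stratify keeps the class balance; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

print(len(y_train), len(y_test))
print((y_test == "Strikers").sum(), (y_test == "Center Backs").sum())
```

With a fixed `random_state`, the accuracy reported further down would be the same on every run, which makes results easier to compare.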
Now that we have our data split into testing and training sets, we can use the training data to fit our LDA model. Once fitted, we can use the model to predict position groups for players in the test set. We can then check how accurate our model is.
# Create LDA model and train it on the training dataset
lda = LinearDiscriminantAnalysis()
lda.fit(Features_train, labels_train)
# Make predictions and assess accuracy of the predictions
predictions = lda.predict(Features_test)
print('Accuracy of our LDA model is {}'.format(
accuracy_score(labels_test, predictions)))
Accuracy of our LDA model is 0.751415857605178
Quite accurate results!
We tried running this multiple times for consistency, and always received pretty accurate results, typically with an accuracy score in the mid to high 0.7's.
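Rerunning the cell by hand is one way to check consistency. Cross-validation does the same thing more systematically. The sketch below uses `cross_val_score` on synthetic, well-separated data, so the scores it prints say nothing about the FIFA dataset; it only illustrates the procedure.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Two synthetic, well-separated classes with 6 features each (NOT FIFA data).
class_a = rng.normal(loc=50, scale=5, size=(100, 6))
class_b = rng.normal(loc=70, scale=5, size=(100, 6))
features = np.vstack([class_a, class_b])
labels = np.array(["A"] * 100 + ["B"] * 100)

# Fit and score the model on 5 different train/test partitions.
scores = cross_val_score(LinearDiscriminantAnalysis(), features, labels, cv=5)
print(scores.mean(), scores.std())
```

The mean summarizes the typical accuracy, and the standard deviation shows how much it varies from one partition to another.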
5.2 Plotting to Observe Differences in Predicted and True Frequencies
Now, it's time to look at the differences in the predicted position groups. Here we will make a bar plot to show the differences in predicted and true frequencies for each positional category. The first step towards making such a plot is to create a nice and tidy dataframe containing the predicted and actual frequencies for each positional category. We will use dictionaries to keep track of the frequencies of each position group for both true and predicted values.
# Build dictionaries containing the predicted and true frequencies of positional groups
prediction_frequencies = {'Center Backs': 0, 'Wing Backs': 0, 'Center Midfielders': 0, 'Midfielders': 0, 'Strikers': 0}
true_frequencies = {'Center Backs': 0, 'Wing Backs': 0, 'Center Midfielders': 0, 'Midfielders': 0, 'Strikers': 0}
# For each instance of a positional group encountered in the predictions, increase the group's frequency by 1
for prediction in predictions:
    prediction_frequencies[prediction] += 1
# For each instance of a positional group encountered in the test labels, increase the group's frequency by 1
for label in labels_test:
    true_frequencies[label] += 1
# Make a dataframe to hold the frequencies for each position group
# With a row for the prediction and a row for the true positions
frequency_df = pd.DataFrame([prediction_frequencies, true_frequencies])
frequency_df
Center Backs | Wing Backs | Center Midfielders | Midfielders | Strikers | |
---|---|---|---|---|---|
0 | 1129 | 826 | 1429 | 796 | 764 |
1 | 1092 | 796 | 1428 | 847 | 781 |
# We need to transpose in order to get the data in a form suitable for plotting with matplotlib
transposed = frequency_df.transpose()
transposed.rename(columns={0: "Predicted Frequency",
1: "True Frequency"}, inplace=True)
# Create a bar chart to show the difference in predicted and true frequencies for positional groups
ax = transposed.plot.bar(color=["SkyBlue", "IndianRed"],
title="Predicted Versus True Frequencies of Positional Groups", figsize=(10, 10))
ax.set_xlabel("Positional Group")
ax.set_ylabel("Frequency")
matplotlib.rcParams.update({'font.size': 12})
plt.show()
The above plot shows the predicted and true frequencies for each positional group. As we can see, the differences between the predicted and actual frequencies are not very large. Keep in mind that similar frequencies do not by themselves imply correct individual predictions, since misclassifications in opposite directions can cancel out; the model's accuracy, generally in the 0.7's, is the better measure of that. The difference between predicted and true frequencies seems to be fairly similar across positional groups, though Wing Backs seem to be consistently classified the most accurately. One possible explanation is that their passing attributes differ noticeably from those of other positional groups.
# Confusion Matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
confusion_matrix(labels_test, predictions)
# Accuracy
accuracy_score(labels_test, predictions)
# Recall
recall_score(labels_test, predictions, average=None)
# Precision
precision_score(labels_test, predictions, average=None)
array([0.82816652, 0.70118964, 0.6620603 , 0.90968586, 0.67312349])
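When `average=None` is used, scikit-learn reports one score per class in sorted label order, so the array above reads Center Backs, Center Midfielders, Midfielders, Strikers, Wing Backs. A raw confusion matrix follows the same order, which is easy to misread. The sketch below attaches labels to it; the true and predicted values are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted groups for five players.
y_true = ["Strikers", "Strikers", "Wing Backs", "Center Backs", "Wing Backs"]
y_pred = ["Strikers", "Wing Backs", "Wing Backs", "Center Backs", "Wing Backs"]
group_names = ["Center Backs", "Strikers", "Wing Backs"]

# Rows are the true groups, columns are the predicted groups.
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=group_names),
    index=[f"true {g}" for g in group_names],
    columns=[f"pred {g}" for g in group_names],
)
print(cm)
```

Off-diagonal cells show which groups get confused with each other, which the frequency bar plot alone cannot reveal.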
With this tutorial, I set out to analyze the positional makeup of the FIFA players and ultimately see if I could accurately classify FIFA players based on their skillset. In order to simplify things, I based this analysis on FIFA 23 Ultimate Team, a soccer game by EA Sports that seeks to accurately model the skills of soccer players in-game.
I started out by grabbing the dataset. I then removed goalkeepers (GK) from our data, since their ratings are based on a different set of attributes from those of outfield players.
Next, I analyzed the positional breakdown of our dataset and consolidated positions into positional groups. I did so to simplify classification because positions within a positional group tend to have similar attribute profiles.
I then looked at the correlation between the attributes and overall rating for players in 5 different positional groups: Center Backs, Wing Backs, Center Midfielders, Midfielders, and Strikers. I analyzed each group separately because EA values attributes differently for different positional groups. For example, from the heat maps, it was easy to tell that a Striker's overall rating is highly correlated with their shooting and dribbling attributes, whereas for Center Backs, shooting and dribbling have very little correlation with the overall rating.
With all this information, I set out to achieve our goal of attempting to classify players based on their attributes. For this classification task, I used a Linear Discriminant Analysis model. I found that the model had an accuracy typically in the mid to high 0.7's. I then made a double bar plot of the predicted versus true frequencies of each positional group to help see if our model's classification error was greater for certain positional groups than others. Overall, the error seemed to be fairly similar across all positional groups, though Wing Backs seemed to consistently be classified the most accurately.
There is a lot of room to extend both the dataset and the project. On the dataset side, these soccer players also differ in 'Work Rate', 'Skill Moves', 'Body Type', 'Stamina Type', and 'Weak Foot'. To delve deeper into a player's potential at a particular position, one could also consider these traits and analyze the resulting multidimensional representation of the data to model players more accurately. On the project side, imagine you are the manager of a team and have to select the best possible players for a particular formation. Soccer has several formations, and certain players perform better in certain formations because of their qualities. Modeling this could be a very valuable addition to this project: it would give an idea of the traits required for each specific position in a formation and of why a certain type of player performs better there.
Assuming that the majority of players are in the positional group best suited to their attributes, the results indicate that some players may be assigned to positional groups that are not the best match for their attributes. This means that they may be able to benefit from a switch to the positional group that the model predicts is the best fit. Of course, this is tentative, as it could be that our Linear Discriminant Analysis model simply failed to find the best suited positional group for the players' attributes. There could be other unexamined skills that lead to those players being classed in a different positional group in real life. There could even be other factors! For instance, I noticed a trend where our model would predict a smaller number of Wing Backs. This could mean that in reality, players with skillsets suited for other positions such as Center Backs or Midfielders, could be trying to play out of position, because they will be more likely to make the team's roster, as teams carry more Wing Backs than other positions.
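To look at the individual players behind this observation, one could put the listed and predicted groups side by side and filter for disagreements. The sketch below uses a made-up results table; the column names `listed_group` and `predicted_group` are assumptions introduced for illustration.

```python
import pandas as pd

# Made-up results: the group each player is listed in and the model's prediction.
results = pd.DataFrame({
    "Full Name": ["Player A", "Player B", "Player C"],
    "listed_group": ["Wing Backs", "Strikers", "Midfielders"],
    "predicted_group": ["Wing Backs", "Midfielders", "Midfielders"],
})

# Players whose predicted group differs from the one they are listed in.
mismatched = results[results["listed_group"] != results["predicted_group"]]
print(mismatched)
```

In this tutorial, such a table could be built from the test-set labels and the model's predictions, provided the player names are carried through the train/test split.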