CMSC 320: Introduction to Data Science - Final Project - FIFA Positions Predictor¶

Dhairya Gandhi¶

1. Introduction¶

In this project we aim to demonstrate our learning and understanding of the data lifecycle by applying it to a real-world situation. Our goal in this tutorial is to try and accurately classify soccer players into positional groups based on their skills by determining the skills that are most useful for different positions.

1.1. Background Information¶

As of 2020, soccer is by far the most watched sport in the world, with more than 4 billion viewers worldwide. The FIFA World Cup in particular is the biggest sporting event on the planet; neither the Super Bowl nor the Olympics compares. 2022 is a World Cup year. In this tournament 32 nations compete, and the last 16 play an outright knockout in which the winner of each game advances to the next round until a champion is decided. Because this is a World Cup year, soccer was the most relatable real-world setting I could use for this project. FIFA23 is a video game by EA Sports, available on all gaming platforms, that aims to be as realistic as it can get. To translate a real-life soccer player's ability into statistics and rankings in the game, EA Sports had players wear Xsens motion capture suits while playing games. Moreover, EA, together with data reviewers, evaluates the individual attributes of the players, which are weighted by a position-dependent coefficient and added together.

In order to obtain this information, we will be using data from the FIFA23 Ultimate Team Players database from the game FIFA23 by EA Sports.

If you are unfamiliar with the different positions in soccer, I recommend that you read this short guide. It describes the roles of different positions, which will help you understand the different attributes we will be examining and analyzing later in this tutorial. Let us first take a look at the libraries we will need to use for this tutorial.

1.1.1 A look into FIFA Ratings¶

Together with data reviewers, EA analyzes each player's individual characteristics, which are weighted by a position-dependent coefficient and combined. A FIFA rating is then created by multiplying this number by the player's international reputation.

You may read this article to find out more about how these ratings are determined.

EA Sports makes the simulation of the game as realistic as possible. When determining a player's overall rating, the characteristics most relevant to that player's position have a greater impact. For instance, a striker's overall rating is more influenced by shooting than a defender's is.
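
As a rough illustration of this position-dependent weighting, the sketch below computes an overall rating as a weighted sum of attribute scores. The weights and the player are invented for illustration and are not EA's actual coefficients; only the idea that a striker's rating leans on shooting while a defender's leans on defending comes from the text.

```python
# Hypothetical position weights (NOT EA's real coefficients); each set sums to 1.
WEIGHTS = {
    "ST": {"shooting": 0.6, "dribbling": 0.3, "defending": 0.1},
    "CB": {"shooting": 0.1, "dribbling": 0.2, "defending": 0.7},
}


def overall_rating(attributes, position):
    """Weighted sum of attribute scores using the weights for `position`."""
    weights = WEIGHTS[position]
    return sum(weights[name] * attributes[name] for name in weights)


# A made-up player who shoots well but defends poorly.
player = {"shooting": 90, "dribbling": 80, "defending": 40}

print(round(overall_rating(player, "ST"), 1))  # rated as a striker
print(round(overall_rating(player, "CB"), 1))  # rated as a center back
```

The same attribute scores produce a much higher rating under the striker weights than under the center back weights, which is the effect the paragraph above describes.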

1.2. Libraries Used¶

  • Pandas: Display and organize data in dataframes
  • NumPy: Numerical arrays underlying our data
  • Seaborn: Create plots
  • Matplotlib: Format plots
  • Plotly: Create aesthetically appealing, easy-to-format pie charts
  • Scikit-learn: Create predictive model to group players into position groups
In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt
import plotly as pltly
import plotly.express as px
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

2. Data Collection¶

2.1. About the Dataset¶

The dataset I will be using contains data for all 18,539 players in the game FIFA23 at the time of its launch. The dataset was found on Kaggle. You can click this link to view the database of players.

The dataset contains all the player ratings, and all the attributes that one can view in the FIFA23 Ultimate Team. The dataset contains 89 different data values for each player.

I downloaded the dataset as a csv file. Let us load in the dataset, save it as a dataframe and take a look at it.

2.2. Load and View Data¶

In [ ]:
db = 'Fifa 23 Players Data.csv'
df = pd.read_csv(db)
df.head()
Out[ ]:
Known As Full Name Overall Potential Value(in Euro) Positions Played Best Position Nationality Image Link Age ... LM Rating CM Rating RM Rating LWB Rating CDM Rating RWB Rating LB Rating CB Rating RB Rating GK Rating
0 L. Messi Lionel Messi 91 91 54000000 RW CAM Argentina https://cdn.sofifa.net/players/158/023/23_60.png 35 ... 91 88 91 67 66 67 62 53 62 22
1 K. Benzema Karim Benzema 91 91 64000000 CF,ST CF France https://cdn.sofifa.net/players/165/153/23_60.png 34 ... 89 84 89 67 67 67 63 58 63 21
2 R. Lewandowski Robert Lewandowski 91 91 84000000 ST ST Poland https://cdn.sofifa.net/players/188/545/23_60.png 33 ... 86 83 86 67 69 67 64 63 64 22
3 K. De Bruyne Kevin De Bruyne 91 91 107500000 CM,CAM CM Belgium https://cdn.sofifa.net/players/192/985/23_60.png 31 ... 91 91 91 82 82 82 78 72 78 24
4 K. Mbappé Kylian Mbappé 91 95 190500000 ST,LW ST France https://cdn.sofifa.net/players/231/747/23_60.png 23 ... 92 84 92 70 66 70 66 57 66 21

5 rows × 89 columns

After a cursory review, the data appears to be rather clean, apart from some missing values. One thing to note is that the dataframe has 89 columns, many of which I will not need, so I will remove the ones that are not required for this tutorial.

2.3 Data Cleaning¶

To clean the data, I will drop the columns that we surely won't need in this tutorial. These include the columns that give a player's rating at every position in soccer, regardless of the position the player actually plays. Additionally, we won't need properties such as the player's wage, value, club details, and national team details.

The dataframe contains several kinds of position for each player: Club Position, the position the player plays for his club team; National Team Position, the position the player plays for his national team; Positions Played, all the positions the player can play; and Best Position, the position best suited to the player's attributes. For simplicity we choose Best Position, since it reflects the player's attributes rather than his position in a particular team, which management may adjust for a number of reasons. Also, in the club and national team position columns, substitutes are listed as "Sub" and reserves as "Res", which does not tell us where those players line up when they do play; substitutes do, of course, play and simply might not start a game.

Hence, I drop these columns from the dataframe.

In [ ]:
columns_drop = [
   #   'Known As', 'Full Name', 'Overall', 'Potential', 
    'Value(in Euro)', 'Positions Played',
   #   'Best Position', 
    'Nationality', 'Image Link', 
    #   'Age', 'Height(in cm)', 'Weight(in kg)', 'TotalStats', 'BaseStats',
       'Club Name', 'Wage(in Euro)', 'Release Clause', 'Club Position',
       'Contract Until', 'Club Jersey Number', 'Joined On', 'On Loan',
    #   'Preferred Foot', 'Weak Foot Rating', 'Skill Moves',
       'International Reputation', 'National Team Name',
       'National Team Image Link', 'National Team Position',
       'National Team Jersey Number',
    #   'Attacking Work Rate', 'Defensive Work Rate', 
       'Pace Total', 'Shooting Total', 'Passing Total',
       'Dribbling Total', 'Defending Total', 'Physicality Total',
    #    'Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys',
    #    'Dribbling', 'Curve', 'Freekick Accuracy', 'LongPassing', 'BallControl',
    #    'Acceleration', 'Sprint Speed', 'Agility', 'Reactions', 'Balance',
    #    'Shot Power', 'Jumping', 'Stamina', 'Strength', 'Long Shots',
    #    'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
    #    'Composure', 'Marking', 'Standing Tackle', 'Sliding Tackle',
    #    'Goalkeeper Diving', 'Goalkeeper Handling', ' GoalkeeperKicking',
    #    'Goalkeeper Positioning', 'Goalkeeper Reflexes', 
       'ST Rating', 'LW Rating', 'LF Rating', 'CF Rating', 'RF Rating', 'RW Rating',
       'CAM Rating', 'LM Rating', 'CM Rating', 'RM Rating', 'LWB Rating',
       'CDM Rating', 'RWB Rating', 'LB Rating', 'CB Rating', 'RB Rating',
       'GK Rating']

df.drop(columns=columns_drop, inplace=True)
df
Out[ ]:
Known As Full Name Overall Potential Best Position Age Height(in cm) Weight(in kg) TotalStats BaseStats ... Penalties Composure Marking Standing Tackle Sliding Tackle Goalkeeper Diving Goalkeeper Handling GoalkeeperKicking Goalkeeper Positioning Goalkeeper Reflexes
0 L. Messi Lionel Messi 91 91 CAM 35 169 67 2190 452 ... 75 96 20 35 24 6 11 15 14 8
1 K. Benzema Karim Benzema 91 91 CF 34 185 81 2147 455 ... 84 90 43 24 18 13 11 5 5 7
2 R. Lewandowski Robert Lewandowski 91 91 ST 33 185 81 2205 458 ... 90 88 35 42 19 15 6 12 8 10
3 K. De Bruyne Kevin De Bruyne 91 91 CM 31 181 70 2303 483 ... 83 89 68 65 53 15 13 5 10 13
4 K. Mbappé Kylian Mbappé 91 95 ST 23 182 73 2177 470 ... 80 88 26 34 32 13 5 7 11 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18534 D. Collins Darren Collins 47 56 CAM 21 174 68 1287 274 ... 40 47 39 29 27 6 9 5 13 8
18535 Yang Dejiang Dejiang Yang 47 57 CDM 17 175 60 1289 267 ... 33 45 46 50 52 6 12 11 8 6
18536 L. Mullan Liam Mullan 47 67 RM 18 170 65 1333 277 ... 43 59 39 37 48 11 12 8 7 12
18537 D. McCallion Daithí McCallion 47 61 CB 17 178 65 1113 226 ... 37 41 50 54 54 8 14 13 7 8
18538 N. Rabha Nabin Rabha 47 50 LB 25 176 66 1277 269 ... 35 32 47 44 43 13 13 6 14 14

18539 rows × 49 columns
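
The per-position rating columns in the drop list all follow the pattern "position code, space, Rating", so that part of the list could also be built programmatically. The sketch below is one possible way to do it, using a short hand-written list of column names in place of `df.columns`. Note that a plain suffix check would also catch 'Weak Foot Rating', which is an attribute we keep, so the pattern requires a 2 to 3 letter uppercase code.

```python
import re

# A few column names taken from the dataset, including one look-alike
# ('Weak Foot Rating') that is a kept attribute and must NOT be dropped.
columns = ['Known As', 'Overall', 'Weak Foot Rating', 'Skill Moves',
           'ST Rating', 'LW Rating', 'CAM Rating', 'CDM Rating', 'GK Rating']

# Position codes are 2-3 capital letters, so this pattern matches
# 'ST Rating' or 'CAM Rating' but not 'Weak Foot Rating'.
position_rating = re.compile(r'[A-Z]{2,3} Rating')

rating_columns = [c for c in columns if position_rating.fullmatch(c)]
print(rating_columns)
```

In the notebook, the resulting list could be concatenated with the remaining hand-picked column names before calling `df.drop`.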

In [ ]:
df.rename(columns={'Best Position': 'player_position'}, inplace=True)

For the sake of simplicity, I decided to drop goalkeepers (GK) from the dataframe. Goalkeeping attributes are usually very low for outfield players, and outfield players almost never play as goalkeepers during their careers.

In [ ]:
df.drop(df[df.player_position == "GK"].index, inplace=True)

Now that we have our Dataset ready, let's process and explore it!

3. Data Processing¶

3.1. Positional Breakdown¶

FIFA classifies players into 17 different positions. Since goalkeepers have been removed, 16 remain. Let's examine the positions of every player in the game. A pie chart can be used to visualize the players' positions. We will use the plotly library to lay out our pie charts.

In [ ]:
position_count = df.groupby(df['player_position']).count().reset_index()
position_count = position_count[['player_position','Known As']]
position_count.rename(columns={'player_position': 'Position',"Known As": "Count"}, inplace = True)
position_count.sort_values('Count', inplace=True, ascending=False)

fig = pltly.graph_objects.Figure(data=[pltly.graph_objects.Pie(labels=position_count['Position'], values=position_count['Count'])])
fig.update_layout(title='Number of Players at each Position')
fig.show()

The pie chart above shows that center backs (CB), strikers (ST), and center attacking midfielders (CAM) account for the most players. This is consistent with the average team carrying more players in defense than at any other specific position.

3.2. Groups by Positions¶

Players in the same positional group often have a comparable set of skills. For instance, two attacking players who play different positions will tend to have similar shooting attribute scores. To keep the categorization simple, we categorize the players in the same way as the aforementioned Bleacher Report article. The code that follows shows how we organize individual positions into their corresponding positional groups. After storing all of the positional groups in a dictionary, we'll examine how the pie chart changes once positions are grouped.

In [ ]:
def get_pos_groups_dfs(df):
    """Return a dict mapping each position group name to its sub-dataframe."""
    positions_groups_df = {}

    def get_pos_grp_lst(group):
        # Rows whose position is one of the positions in `group`
        return df.loc[df['player_position'].isin(group)]

    positions_groups_df['Center Backs'] = get_pos_grp_lst(['CB'])
    positions_groups_df['Wing Backs'] = get_pos_grp_lst(['RB', 'RWB', 'LB', 'LWB'])
    positions_groups_df['Center Midfielders'] = get_pos_grp_lst(['CDM', 'CM', 'CAM'])
    positions_groups_df['Midfielders'] = get_pos_grp_lst(['LM', 'LW', 'RM', 'RW'])
    positions_groups_df['Strikers'] = get_pos_grp_lst(['ST', 'CF', 'LF', 'RF'])
    return positions_groups_df
In [ ]:
# Creating groups by Position
positions_groups = {}
positions_groups['Center Backs'] = ['CB']
positions_groups['Wing Backs'] = ['RB', 'RWB', 'LB', 'LWB']
positions_groups['Center Midfielders'] = ['CM', 'CAM', 'CDM']
positions_groups['Midfielders'] = ['LM', 'LW', 'RM', 'RW']
positions_groups['Strikers'] = ['ST', 'CF', 'LF', 'RF']


def categorize_pos(pos):
    for key, value in positions_groups.items():
        if pos in value:
            return key


df['Position Group'] = df.apply(
    lambda row: categorize_pos(row['player_position']), axis=1)

# getting dictionary with dataframe for each of the groups of positions groups
positions_groups_df = get_pos_groups_dfs(df)

counts = []
for p_df in positions_groups_df.values():
    counts.append(len(p_df['Full Name']))

fig = pltly.graph_objects.Figure(data=[pltly.graph_objects.Pie(
    labels=list(positions_groups_df.keys()), values=counts)])
fig.update_layout(title='Total Number of Players in Each Position Group')
fig.show()

Once again, defensive players (Center Backs and Wing Backs together) account for a large share of the total.
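
The `categorize_pos` helper above scans every group for each player. An alternative, sketched below, is to invert the grouping once into a position-to-group dictionary; the name `position_to_group` is introduced here for illustration and does not appear in the notebook.

```python
# Same grouping as in the tutorial, written as group -> list of positions.
positions_groups = {
    'Center Backs': ['CB'],
    'Wing Backs': ['RB', 'RWB', 'LB', 'LWB'],
    'Center Midfielders': ['CM', 'CAM', 'CDM'],
    'Midfielders': ['LM', 'LW', 'RM', 'RW'],
    'Strikers': ['ST', 'CF', 'LF', 'RF'],
}

# Invert it once into position -> group for direct lookups.
position_to_group = {
    pos: group
    for group, members in positions_groups.items()
    for pos in members
}

print(position_to_group['CAM'])
print(position_to_group.get('GK'))  # goalkeepers belong to no group
```

With such a dictionary, the new column could be created with `df['player_position'].map(position_to_group)`, since pandas' `Series.map` accepts a dictionary.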

A stacked bar graph that breaks each position group down into its individual positions helps visualize this breakdown.

In [ ]:
all_positions_count = {'CAM': [], 'CB': [], 'CDM': [], 'CF': [], 'CM': [], 'LB': [], 'LM': [], 'LW': [], 'LWB': [], 'RB': [], 'RM': [], 'RW': [], 'RWB': [], 'ST': []}
bar_data_df = pd.DataFrame()
bar_data_df['Position Group'] = list(positions_groups_df.keys())
bar_data_df.set_index('Position Group', inplace=True)
for group, p_df in positions_groups_df.items():
    for pos, lst in all_positions_count.items():
        # Count the players of this group at position `pos` (mask built from p_df itself)
        lst.append(int((p_df['player_position'] == pos).sum()))
for pos, lst in all_positions_count.items():
    bar_data_df[pos] = lst
    
bar_data_df.plot.bar(stacked=True, subplots=False, figsize=(15,10), title='Full Position Breakdown of FIFA Players')
Out[ ]:
<AxesSubplot: title={'center': 'Full Position Breakdown of FIFA Players'}, xlabel='Position Group'>

This stacked bar graph gives a better picture of how frequent each position is within each position group. One thing I can quickly notice is that center backs, strikers, and center midfielders are among the most frequently occurring positions.
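
The group-by-position counts behind a chart like this can also be obtained in one step with `pd.crosstab`. The sketch below uses a six-row toy dataframe as a stand-in for the real one; the rows are made up, but the column names match the tutorial's.

```python
import pandas as pd

# Toy stand-in for the tutorial's dataframe (six made-up rows).
toy = pd.DataFrame({
    'player_position': ['CB', 'CB', 'LB', 'RB', 'ST', 'CF'],
    'Position Group': ['Center Backs', 'Center Backs', 'Wing Backs',
                       'Wing Backs', 'Strikers', 'Strikers'],
})

# Rows are position groups, columns are individual positions,
# and each cell counts the players with that combination.
counts = pd.crosstab(toy['Position Group'], toy['player_position'])
print(counts)
```

Calling `counts.plot.bar(stacked=True)` on the result would then draw the stacked bars directly.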

4. Exploratory Data Analysis¶

In order to determine a player's overall rating, certain characteristics are weighted differently depending on the player's position. So far, I have grouped the players into positional groups. To describe their skill sets, I must now combine individual attributes into broader categories.

4.1. Attribute Categories¶

I will use the FUTBIN description of player traits to decide how to group the attributes into skill sets. The attributes are grouped into 6 parent categories, as follows:

  • Physical: Any physical characteristic associated with fitness and strength
  • Passing: Any attribute relating to passing (long and short) or accuracy
  • Pace: Any characteristic relating to speed
  • Shooting: Any attribute relating to shooting the ball
  • Dribbling: Any attribute relating to beating defenders and moving the ball forward
  • Defending: Any attribute relating to applying pressure and blocking the ball from going inside the goal
In [ ]:
phy = ['Jumping', 'Stamina', 'Strength', 'Aggression']
pas = ['Vision', 'Crossing', 'Freekick Accuracy',
       'Short Passing', 'LongPassing', 'Curve']
pac = ['Acceleration', 'Sprint Speed',]
sho = ['Positioning', 'Finishing', 'Shot Power',
       'Long Shots', 'Volleys', 'Penalties']
dri = ['Dribbling', 'BallControl', 'Agility',
       'Reactions', 'Balance', 'Composure']
defe = ['Heading Accuracy', 'Interceptions',
        'Standing Tackle', 'Sliding Tackle', 'Marking']

Let's now add these category scores to the dataframe. For each category, I calculate an aggregate rating by taking the mean of the ratings of the attributes it contains, and store the result in a column named after the category.

In [ ]:
df['phy'] = df[phy].mean(axis=1)
df['pas'] = df[pas].mean(axis=1)
df['pac'] = df[pac].mean(axis=1)
df['sho'] = df[sho].mean(axis=1)
df['dri'] = df[dri].mean(axis=1)
df['defe'] = df[defe].mean(axis=1)
category_ratings_df = df[['Full Name', 'player_position', 'phy',
                          'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']]
category_ratings_df = category_ratings_df.rename(
    columns={'player_position': 'position'})
category_ratings_df.head(10)
Out[ ]:
Full Name position phy pas pac sho dri defe Overall
0 Lionel Messi CAM 62.50 90.833333 81.5 87.166667 93.666667 37.8 91
1 Karim Benzema CF 76.50 80.666667 79.5 87.166667 85.000000 42.8 91
2 Robert Lewandowski ST 82.25 78.333333 75.5 90.333333 85.666667 47.2 91
3 Kevin De Bruyne CM 75.00 91.000000 74.5 87.000000 85.333333 61.4 91
4 Kylian Mbappé ST 76.00 77.666667 97.0 86.333333 89.833333 40.4 91
5 Mohamed Salah RW 73.50 79.833333 90.0 87.166667 90.666667 47.2 90
8 C. Ronaldo dos Santos Aveiro ST 77.75 78.500000 81.0 91.166667 84.333333 39.8 90
9 Virgil van Dijk CB 85.00 68.833333 79.5 58.500000 73.166667 89.4 90
10 Harry Kane ST 81.25 80.500000 68.0 90.666667 81.666667 50.6 89
11 Neymar da Silva Santos Jr. LW 64.00 85.666667 87.0 84.333333 90.833333 39.2 89
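
To make the aggregation concrete, the sketch below computes two category scores for a single player in plain Python. The attribute lists are the ones defined above; the individual ratings are illustrative values chosen for this example, not rows read from the dataset.

```python
from statistics import mean

# Attribute lists for two of the six categories, as defined above.
pac = ['Acceleration', 'Sprint Speed']
phy = ['Jumping', 'Stamina', 'Strength', 'Aggression']

# Illustrative ratings for a single player (values chosen for this example).
player = {'Acceleration': 87, 'Sprint Speed': 76,
          'Jumping': 68, 'Stamina': 70, 'Strength': 68, 'Aggression': 44}


def category_score(player, attributes):
    """Unweighted mean of the player's ratings over the given attributes."""
    return mean(player[a] for a in attributes)


print(category_score(player, pac))  # (87 + 76) / 2
print(category_score(player, phy))  # (68 + 70 + 68 + 44) / 4
```

Because the mean is unweighted, every attribute in a category contributes equally; a weighted mean would be a natural variation if some attributes were considered more important.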

4.2. Correlation of Attributes for different positions¶

The next step is to determine the relationship between the attribute categories and the overall rating within each position group. As previously stated, different positions weight attributes differently.

In [ ]:
positions_groups_df = get_pos_groups_dfs(df)

For each position group, I will generate correlation matrices. This will allow us to observe how attribute categories affect a player's overall rating in different position groupings.
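
Each cell of such a matrix is a Pearson correlation coefficient, which is what pandas' `DataFrame.corr()` computes by default. The sketch below works one out by hand on made-up ratings for five players, to show what a high and a low value look like; the numbers are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up ratings for five strikers (illustrative values only).
overall = [60, 65, 70, 75, 80]
sho = [58, 66, 69, 77, 81]    # rises together with overall
defe = [35, 30, 38, 31, 36]   # roughly unrelated to overall

print(round(pearson(overall, sho), 3))
print(round(pearson(overall, defe), 3))
```

A value close to 1 means the two columns rise together, while a value near 0 means there is little linear relationship between them.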

4.2.1. Midfielders¶

Hypothesis/Assumption:

  • There is a strong correlation between a midfielder's passing, shooting, and dribbling and their overall rating, for two reasons: 1) a midfielder has to switch the game from defence to attack, and their prime role is to keep the ball moving with accurate and precise passes; and 2) a midfielder faces the most pressing from the opponent's defenders as well as their attackers, so their dribbling attributes have to be up to the task.
In [ ]:
plt.title("Correlation Heatmap of Overall and Attributes (Midfielders)", fontsize=12)
sns.heatmap(positions_groups_df["Midfielders"][[
            'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
Out[ ]:
<AxesSubplot: title={'center': 'Correlation Heatmap of Overall and Attributes (Midfielders)'}>

Observation:

  • In the heatmap, the cells relating these attributes to Overall are light in color, which on this color scale indicates a high correlation and supports my assumption.

4.2.2. Attackers¶

Hypothesis/Assumption:

  • I am assuming that a striker's overall rating is strongly correlated with their shooting and dribbling: their primary role is to score, and to do so they need better-than-average shooting, plus the dribbling to get past the opposition's defenders.
In [ ]:
plt.title("Correlation Heatmap of Overall and Attributes (Attackers)", fontsize=12)
sns.heatmap(positions_groups_df["Strikers"][[
            'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
Out[ ]:
<AxesSubplot: title={'center': 'Correlation Heatmap of Overall and Attributes (Attackers)'}>

Observation:

  • The assumption holds: the light cells in the correlation map indicate a very strong correlation between overall and shooting, and likewise between overall and dribbling.

4.2.3. Defenders¶

Hypothesis/Assumption:

  • I am assuming that a defender's overall rating is strongly correlated with their defending and physicality: their primary role is to stop the opposition from scoring, and to do so they need excellent strength, aerial attributes such as heading and jumping, tackling, and aggression. All of these attributes are part of the defending and physicality categories.
In [ ]:
plt.title("Correlation Heatmap of Overall and Attributes (Defenders)", fontsize=12)
sns.heatmap(positions_groups_df["Center Backs"][[
            'phy', 'pas', 'pac', 'sho', 'dri', 'defe', 'Overall']].corr())
# Note: only the 'Center Backs' group is plotted here; 'Wing Backs' are not included
Out[ ]:
<AxesSubplot: title={'center': 'Correlation Heatmap of Overall and Attributes (Defenders)'}>

Now that we've figured out which traits are vital for players in various positions, we'll look at players' skill sets and try to estimate which position group their skill set will be most suited for. We may then compare that to the present positions in which players are playing.

5. Machine Learning & Visualization¶

In this last section, my aim is to develop a model that can accurately categorize players into positional groups based on their skill sets. Because a player can be classified into one of several position groups, this is a multi-class problem, and a Linear Discriminant Analysis (LDA) model, which handles multiple classes natively, is a reasonable choice. Furthermore, the task has the structure LDA expects: the true groups are already known, the independent variables (the attributes) are numerical, and the dependent variable (the position group of a given player) is categorical.

5.1. Creation and Training Model¶

First, we will divide our dataframe of players into features and labels.

When we try to fit this model, it is required that we split the data into a training set and test set. Our next step is to break up the players dataframe into these two groups. We will be using a test size of 0.3.

In [ ]:
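# Divide player dataframe into features and labels
# (features are the six attribute category scores and labels are the positional group).
# Selecting by column name avoids depending on the exact column order.
Features = df[['phy', 'pas', 'pac', 'sho', 'dri', 'defe']].to_numpy()
labels = df['Position Group'].to_numpy()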

Features_train, Features_test, labels_train, labels_test = train_test_split(
    Features, labels, test_size=0.3)

Now that we have our data split into testing and training sets, we can use the training data to fit our LDA model. Once fitted, we can use the model to predict position groups for players in the test set. We can then check how accurate our model is.

In [ ]:
# Create LDA model and train it on the training dataset
lda = LinearDiscriminantAnalysis()
lda.fit(Features_train, labels_train)

# Make predictions and assess accuracy of the predictions
predictions = lda.predict(Features_test)
print('Accuracy of our LDA model is {}'.format(
    accuracy_score(labels_test, predictions)))
Accuracy of our LDA model is 0.751415857605178

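Quite accurate results! Roughly three out of four test-set players are assigned to their listed position group, well above the roughly 29% we would get by always guessing the largest group (Center Midfielders).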

We tried running this multiple times for consistency, and always received pretty accurate results, typically with an accuracy score in the mid to high 0.7's.
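
The run-to-run variation mentioned above can be measured more systematically with k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification`, which is only a stand-in for the FIFA data; in the notebook one would pass `Features` and `labels` instead, and the scores printed here say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: 6 features and 5 classes,
# mirroring the six attribute categories and five position groups.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=5,
                           n_clusters_per_class=1, random_state=0)

# 5-fold cross-validation: every sample is used for testing exactly once.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)

print(scores)                       # one accuracy value per fold
print(scores.mean(), scores.std())  # average accuracy and its spread
```

Reporting the mean together with the standard deviation gives a more stable summary than the accuracy of any single random split.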

5.2 Plotting to Observe Differences in Predicted and True Frequencies

Now, it's time to look at the differences in the predicted position groups. Here we will make a bar plot to show the differences in predicted and true frequencies for each positional category. The first step towards making such a plot is to create a nice and tidy dataframe containing the predicted and actual frequencies for each positional category. We will use dictionaries to keep track of the frequencies of each position group for both true and predicted values.

In [ ]:
# Build dictionaries containing the predicted and true frequencies of positional groups
prediction_frequencies = {'Center Backs': 0, 'Wing Backs': 0, 'Center Midfielders': 0, 'Midfielders': 0, 'Strikers': 0}

true_frequencies = {'Center Backs': 0, 'Wing Backs': 0, 'Center Midfielders': 0, 'Midfielders': 0, 'Strikers': 0}

# For each instance of a positional group encountered in the predictions, increase the group's frequency by 1
for prediction in predictions:
    prediction_frequencies[prediction] += 1

# For each instance of a positional group encountered in the test labels, increase the group's frequency by 1
for label in labels_test:
    true_frequencies[label] += 1

# Make a dataframe to hold the frequencies for each position group
# With a row for the prediction and a row for the true positions
# (built from a list of dicts; DataFrame.append is deprecated in favor of this or pandas.concat)
frequency_df = pd.DataFrame([prediction_frequencies, true_frequencies])
frequency_df
Out[ ]:
Center Backs Wing Backs Center Midfielders Midfielders Strikers
0 1129 826 1429 796 764
1 1092 796 1428 847 781
In [ ]:
# We need to transpose in order to get the data in a form suitable for plotting with matplotlib
transposed = frequency_df.transpose()
transposed.rename(columns={0: "Predicted Frequency",
                  1: "True Frequency"}, inplace=True)

# Create a bar chart to show the difference in predicted and true frequencies for positional groups
ax = transposed.plot.bar(color=["SkyBlue", "IndianRed"],
                         title="Predicted Versus True Frequencies of Positional Groups", figsize=(10, 10))
ax.set_xlabel("Positional Group")
ax.set_ylabel("Frequency")
matplotlib.rcParams.update({'font.size': 12})
plt.show()

The above plot shows the predicted and true frequencies for each positional group. As we can see, the differences are not very large: in the run shown, the largest gap is 51 players (Midfielders: 796 predicted versus 847 true) and the smallest is a single player (Center Midfielders). This lines up with our LDA model having an accuracy generally in the 0.7's, although matching totals alone do not guarantee that individual players were classified correctly, because errors in opposite directions can cancel out. The per-class precision and recall computed next give a more direct view of which groups are classified most accurately.
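
Matching frequency totals and per-player accuracy are related but not the same thing. The sketch below, using made-up labels for six players, shows a case where every group is predicted exactly as often as it truly occurs while half of the individual predictions are still wrong.

```python
from collections import Counter

# Made-up true labels for six players (illustrative values only).
labels_test = ['Strikers', 'Strikers', 'Center Backs',
               'Center Backs', 'Wing Backs', 'Wing Backs']
# Every group is predicted exactly as often as it truly occurs,
# yet only three of the six players are labelled correctly.
predictions = ['Strikers', 'Center Backs', 'Center Backs',
               'Wing Backs', 'Strikers', 'Wing Backs']

same_frequencies = Counter(predictions) == Counter(labels_test)
accuracy = sum(p == t for p, t in zip(predictions, labels_test)) / len(labels_test)

print(same_frequencies)  # True
print(accuracy)          # 0.5
```

This is why a frequency plot is best read alongside per-class metrics rather than in place of them.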

In [ ]:
# Confusion matrix and per-class metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Only the last expression of a notebook cell is displayed, so the other
# results are stored in variables and can be inspected in later cells.
conf_matrix = confusion_matrix(labels_test, predictions)
accuracy = accuracy_score(labels_test, predictions)
recall = recall_score(labels_test, predictions, average=None)
# Per-class precision (displayed as the cell's output)
precision_score(labels_test, predictions, average=None)
Out[ ]:
array([0.82816652, 0.70118964, 0.6620603 , 0.90968586, 0.67312349])
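
The array above carries no labels. The sketch below pairs each value with a group name, under the assumption that the scores follow the sorted order of the class labels, which is how scikit-learn orders classes when no explicit `labels` argument is given.

```python
# The five group names used in this tutorial.
groups = ['Center Backs', 'Wing Backs', 'Center Midfielders',
          'Midfielders', 'Strikers']

# Per-class precision values copied from the output above.
precision = [0.82816652, 0.70118964, 0.6620603, 0.90968586, 0.67312349]

# Assumption: scores follow the sorted order of the class labels.
precision_by_group = dict(zip(sorted(groups), precision))

for group, score in precision_by_group.items():
    print(f'{group:<20}{score:.3f}')
```

Under that assumption, Strikers have the highest precision in this run (about 0.91) and Midfielders the lowest (about 0.66).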

6. Conclusion¶

6.1. Recap¶

With this tutorial, I set out to analyze the positional makeup of the FIFA players and ultimately see if I could accurately classify FIFA players based on their skillset. In order to simplify things, I based this analysis on FIFA 23 Ultimate Team, a soccer game by EA Sports that seeks to accurately model the skills of soccer players in-game.

I started out by grabbing the dataset. I then removed goalkeepers (GK) from our data, since their ratings are based on a different set of attributes from those of outfield players.

Next, I analyzed the positional breakdown of our dataset and consolidated positions into positional groups. I did so to simplify classification because positions within a positional group tend to have similar attribute profiles.

I then looked at the correlation between the attribute categories and overall rating within positional groups. Of the 5 positional groups (Center Backs, Wing Backs, Center Midfielders, Midfielders, and Strikers), I plotted heat maps for Midfielders, Strikers, and Center Backs. I analyzed each group separately because FIFA values attributes differently for different positional groups. For example, from the heat maps, it was easy to tell that a Striker's overall rating is highly correlated with their shooting and dribbling attributes, but for Center Backs, shooting and dribbling have very little correlation with their overall rating.

With all this information, I set out to achieve our goal of attempting to classify players based on their attributes. For this classification task, I used a Linear Discriminant Analysis model. I found that the model had an accuracy typically in the mid to high 0.7's. I then made a double bar plot of the predicted versus true frequencies of each positional group to help see if our model's classification error was greater for certain positional groups than others. Overall, the predicted and true frequencies were close for all positional groups. In the per-class precision scores of the run shown, Strikers (about 0.91) and Center Backs (about 0.83) were classified most precisely.

6.2. Extension¶

There is a lot of room to extend both the dataset and the project. On the dataset side, soccer players also differ in 'Work Rate', 'Skill Moves', 'Body Type', 'Stamina Type' and 'Weak Foot'. To delve deeper into a player's potential at a particular position, one could include these traits and analyse the resulting multidimensional representation of the data to model players more accurately. On the project side, imagine you are the manager of a team and you have to select the best possible players for a particular formation. Soccer has several formations, and certain players perform better in certain formations depending on their qualities. Modelling this would be a very valuable addition to the project: it could give an idea of the traits required for each specific position in a formation and why a certain type of player performs better there.

6.3. Final Thoughts¶

Assuming that the majority of players are in the positional group best suited to their attributes, the results indicate that some players may be assigned to positional groups that are not the best match for their attributes. This means that they may be able to benefit from a switch to the positional group that the model predicts is the best fit. Of course, this is tentative, as it could be that our Linear Discriminant Analysis model simply failed to find the best suited positional group for the players' attributes. There could be other unexamined skills that lead to those players being classed in a different positional group in real life. There could even be other factors! For instance, in the run shown above, the model predicted fewer Midfielders (796) than the data lists (847). This could mean that some players listed as Midfielders have skill sets better suited to other position groups and are, in effect, playing out of position, perhaps because doing so makes them more likely to make the team's roster.
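
One way to follow up on this idea is to list the players whose predicted group differs from their listed one. The sketch below does so on four made-up rows; the player names and group assignments are hypothetical and only illustrate the filtering step.

```python
# Made-up test-set rows: (player, listed group, predicted group).
results = [
    ('Player A', 'Wing Backs', 'Wing Backs'),
    ('Player B', 'Midfielders', 'Strikers'),
    ('Player C', 'Center Backs', 'Center Backs'),
    ('Player D', 'Wing Backs', 'Midfielders'),
]

# Keep only the players whose predicted group differs from the listed one.
candidates = [(name, listed, predicted)
              for name, listed, predicted in results
              if listed != predicted]

for name, listed, predicted in candidates:
    print(f'{name}: listed as {listed}, model suggests {predicted}')
```

In the notebook, the player names would first have to be carried through the split, for example by passing the name column to `train_test_split` as an additional array alongside the features and labels.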