Disclaimer: I’ll be talking about how to come up with the Python code; if you want to read the actual code, please go to this repo.
Meet the Credit Score Classification Dataset
The dataset that we’ll clean comes from Kaggle; we’ll work with the `train.csv` file, but the same process could be applied to `test.csv` as well.
There are 28 columns and 100k rows in this dataset. I compiled a feature description table that you can see below.
Feature | Description |
---|---|
ID | Represents a unique identification of an entry |
Customer_ID | Represents a unique identification of a person |
Month | Represents the month of the year |
Name | Represents the name of a person |
Age | Represents the age of the person |
SSN | Represents the social security number of a person |
Occupation | Represents the occupation of the person |
Annual_Income | Represents the annual income of the person |
Monthly_Inhand_Salary | Represents the monthly base salary of a person |
Num_Bank_Accounts | Represents the number of bank accounts a person holds |
Num_Credit_Card | Represents the number of other credit cards held by a person |
Interest_Rate | Represents the interest rate on a credit card |
Num_of_Loan | Represents the number of loans taken from the bank |
Type_of_Loan | Represents the types of loan taken by a person |
Delay_from_due_date | Represents the average number of days delayed from the payment date |
Num_of_Delayed_Payment | Represents the average number of payments delayed by a person |
Changed_Credit_Limit | Represents the percentage change in credit card limit |
Num_Credit_Inquiries | Represents the number of credit card inquiries |
Credit_Mix | Represents the classification of the mix of credits |
Outstanding_Debt | Represents the remaining debt to be paid (in USD) |
Credit_Utilization_Ratio | Represents the utilization ratio of credit cards |
Credit_History_Age | Represents the age of credit history of the person |
Payment_of_Min_Amount | Represents whether only the minimum amount was paid by the person |
Total_EMI_per_month | Represents the monthly EMI payments (in USD) |
Amount_invested_monthly | Represents the monthly amount invested by the customer (in USD) |
Payment_Behaviour | Represents the payment behavior of the customer |
Monthly_Balance | Represents the monthly balance amount of the customer (in USD) |
Credit_Score | Represents the bracket of credit score (Poor, Standard, Good) |
Even though we have 100k rows, there are only 12,500 different customers within them; each customer appears 8 times (from January to August). So we can select a particular customer, look at their information, and easily spot incorrect data and adjust it.
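To get a feel for that structure, here’s a minimal sketch of loading the file and inspecting a single customer (column and file names follow the Kaggle dataset; which customer you pick is arbitrary):

```python
import pandas as pd

# Load the Kaggle training data
df = pd.read_csv("train.csv")

print(df.shape)                      # (100000, 28)
print(df["Customer_ID"].nunique())   # 12500 distinct customers

# Pick any customer and look at their 8 monthly rows side by side
some_id = df["Customer_ID"].iloc[0]
customer = df[df["Customer_ID"] == some_id]
print(customer[["Month", "Age", "Occupation", "Annual_Income"]])
```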
Cleaning Typos and Outliers
In this dataset there are a lot of typos or just straight-up nonsense. You’ll find values such as `_`, `!@9#%8`, `__10000__`, `NM` or `_______`. I believe these typos are in the dataset to mimic the messiness you may find when dealing with real-world data, and most of them simply mean that the value is null.
For a moment I thought `__10000__` might just be a typo for a real value of 10,000, but there is no genuine amount invested monthly that is over 200 dollars.
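As a sketch, the check that flags `__10000__` as a placeholder rather than a real amount could look like this (continuing with the raw `df` loaded above):

```python
# The raw column is stored as strings, so the placeholder shows up
# among the most frequent raw values
print(df["Amount_invested_monthly"].value_counts().head(10))

# Coerce to numbers: the placeholder fails to parse and becomes NaN,
# leaving only the genuine values in the distribution
invested = pd.to_numeric(df["Amount_invested_monthly"], errors="coerce")
print(invested.describe())
```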
Following this logic, I looked for nonsense values in the data frame and started replacing them with NumPy `nan`s. I also looked for outliers in the distribution of values: if a value appeared only once and was isolated from the rest, I substituted it with a null value. I didn’t base this decision on the distribution alone; I also looked up the customers that had the outlier and inspected all of their rows, and I’d see weird things. For one such customer it was clear that they didn’t make that much money annually for only one month of the year.
When you finish this search for typos and outliers, don’t forget to assign the correct data type to your features. Some features, like `Age`, have string characters mixed in with the numeric values, and because of this they are loaded as `object` instead of `int` or `float`.
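Here’s a rough sketch of that cleanup, with the garbage tokens taken from the examples above; the exact list of tokens and of columns to convert is broader in the repo, so treat both as illustrative:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# Values that clearly carry no information: treat them as missing
garbage = ["_", "!@9#%8", "__10000__", "NM", "_______"]
df = df.replace(garbage, np.nan)

# Some numeric columns load as object dtype because of stray characters;
# strip any leftover underscores and convert to numbers
for col in ["Age", "Annual_Income", "Num_of_Loan", "Outstanding_Debt"]:
    df[col] = pd.to_numeric(df[col].astype(str).str.strip("_"), errors="coerce")
```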
Filling Null Values
After dealing with all the outliers and typos, we end up with a lot of null values spread across the columns.
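Continuing with the cleaned `df` from the sketch above, the per-column null counts can be listed in one line:

```python
# Count missing values per column, largest first
print(df.isna().sum().sort_values(ascending=False))
```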
Instead of just dropping all these null values, I first tried to fill them using the information I already have. Remember that each customer has historical data for 8 months? We can use that history to fill the null values with an aggregation of our choice, filtered by customer, which will be more accurate than just using the mean of the whole dataset.
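As a minimal sketch of that idea, a numeric column can be filled with the mean of the same customer’s other months using `groupby().transform()` (the column chosen here is just an example):

```python
# Fill gaps in a numeric column with that customer's own average
col = "Monthly_Inhand_Salary"  # illustrative; this applies to several columns
df[col] = df[col].fillna(df.groupby("Customer_ID")[col].transform("mean"))
```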
I decided to use the average value for most of the numeric columns, and the last non-empty value for the rest. The reason for not using the mean everywhere is that I didn’t want to have someone be 20.5 years old, and `Occupation`, `Type_of_Loan`, and `Credit_Mix` are categorical.
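For those columns, a grouped forward fill followed by a backward fill keeps the last non-empty value within each customer (a sketch; the column list is illustrative):

```python
# Propagate each customer's last known value instead of averaging
for col in ["Age", "Occupation", "Type_of_Loan", "Credit_Mix"]:
    df[col] = df.groupby("Customer_ID")[col].transform(lambda s: s.ffill().bfill())
```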
Feature Engineering
With the clean data, we can proceed to feature engineering. First we want to change `Type_of_Loan`, because in some rows all of a customer’s loans are crammed into a single value: a string listing several loan types separated by commas and “and”.
So I’ll save all the different loan types in one list, by splitting the loans every time there is a `, ` or `, and `, as sketched below.
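Here’s one way to do that split, treating `, ` and `, and ` as the same separator via a small regular expression:

```python
import re

# Gather every distinct loan type mentioned anywhere in Type_of_Loan
loan_types = set()
for value in df["Type_of_Loan"].dropna():
    for loan in re.split(r",\s*(?:and\s+)?", value):
        loan_types.add(loan.strip())

print(sorted(loan_types))
```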
Now we can create dummy variables from these loan types, so a customer receives a 1 if they have that loan type and a 0 if they don’t.
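One way to build those 0/1 indicator columns from the set of loan types collected above (a sketch; simple substring matching is enough as long as no loan name is contained inside another):

```python
# One indicator column per loan type: 1 if the customer's Type_of_Loan
# string mentions it, 0 otherwise
for loan in loan_types:
    df[loan] = (
        df["Type_of_Loan"].fillna("").str.contains(loan, regex=False).astype(int)
    )
```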
Now I want to keep working on this dataset to make it ready for training a machine learning model. For this reason, I need to transform the remaining categorical features into numeric ones.
The feature `Credit_History_Age` stores its values as strings like “22 Years and 5 Months”. This pattern repeats for every row, so we can take advantage of it: take the number of years, multiply it by 12, and add the months, resulting in a new feature with the credit history age in months. When we are done, there will still be null values; to fill them I chose to interpolate. This works great when the missing value falls between February and July, because it interpolates within the customer’s own credit history, but it becomes a bad guess when the missing value is in January or August.
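A sketch of both steps, assuming the “NN Years and NN Months” pattern holds throughout (the new column name is just an example):

```python
# Parse "22 Years and 5 Months" into a total number of months
parts = df["Credit_History_Age"].str.extract(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")
df["Credit_History_Age_Months"] = parts[0].astype(float) * 12 + parts[1].astype(float)

# Interpolate remaining gaps within each customer's 8-month history;
# edge months (January, August) just copy the nearest known value,
# which is the "bad guess" mentioned above
df["Credit_History_Age_Months"] = df.groupby("Customer_ID")[
    "Credit_History_Age_Months"
].transform(lambda s: s.interpolate(limit_direction="both"))
```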
The month names are replaced by their number counterparts, so January becomes 1, February becomes 2, and so on. `Credit_Mix` and `Credit_Score` each have 3 ordered categories; I chose to go with -1, 0, and 1, but you can use 1, 2, 3 and it’ll produce the same result.
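A sketch of both mappings, assuming the month names are spelled out in full and `Credit_Mix` uses Bad/Standard/Good labels:

```python
# Month names to numbers (the data only covers January through August)
month_map = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
}
df["Month"] = df["Month"].map(month_map)

# Ordinal encoding for the three-level categories
df["Credit_Mix"] = df["Credit_Mix"].map({"Bad": -1, "Standard": 0, "Good": 1})
df["Credit_Score"] = df["Credit_Score"].map({"Poor": -1, "Standard": 0, "Good": 1})
```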
Don’t forget to check the GitHub Repository for the complete code mentioned here and the cleaned dataset.