Disclaimer: I’ll be talking about how to come up with the python code, if you want to read the actual code please go to this repo.

Meet the Credit Score Classification Dataset

The dataset that we’ll clean comes from kaggle, which is the train.csv dataset, but this could be used for the test.csv as well.

There are 28 columns and 100k rows in this dataset. I compiled a feature description table that you can see below.

FeatureDescription
IDRepresents a unique identification of an entry
Customer_IDRepresents a unique identification of a person
MonthRepresents the month of the year
NameRepresents the name of a person
AgeRepresents the age of the person
SSNRepresents the social security number of a person
OccupationRepresents the occupation of the person
Annual_IncomeRepresents the annual income of the person
Monthly_Inhand_SalaryRepresents the monthly base salary of a person
Num_Bank_AccountsRepresents the number of bank accounts a person holds
Num_Credit_CardRepresents the number of other credit cards held by a person
Interest_RateRepresents the interest rate on a credit card
Num_of_LoanRepresents the number of loans taken from the bank
Type_of_LoanRepresents the types of loan taken by a person
Delay_from_due_dateRepresents the average number of days delayed from the payment date
Num_of_Delayed_PaymentRepresents the average number of payments delayed by a person
Changed_Credit_LimitRepresents the percentage change in credit card limit
Num_Credit_InquiriesRepresents the number of credit card inquiries
Credit_MixRepresents the classification of the mix of credits
Outstanding_DebtRepresents the remaining debt to be paid (in USD)
Credit_Utilization_RatioRepresents the utilization ratio of credit cards
Credit_History_AgeRepresents the age of credit history of the person
Payment_of_Min_AmountRepresents whether only the minimum amount was paid by the person
Total_EMI_per_monthRepresents the monthly EMI payments (in USD)
Amount_invested_monthlyRepresents the monthly amount invested by the customer (in USD)
Payment_BehaviourRepresents the payment behavior of the customer (in USD)
Monthly_BalanceRepresents the monthly balance amount of the customer (in USD)
Credit_ScoreRepresents the bracket of credit score (Poor, Standard, Good)

Even though we have 100k rows, within these rows that are only 12,500 different customers, each customer appears 8 times (from January to August). So basically we can select a particular customer and look at their information and easily find incorrect data and be able to adjust it.

Cleaning Typos and Outliers

In this dataset that is a lot of typos or just straight-up nonsense. You’ll find some values to be: _, !@9#%8, __10000__, NM or _______. I believe these typos are in the dataset to represent the improbability that you may find when dealing with real-world data and most of them mean that this is a null value.

For a moment I thought __10000__ would just be a typo, but there is no amount invested monthly that is over 200 dollars.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__10000__             4305
0.0                    169
80.41529543900253        1
36.66235139442514        1
89.7384893604547         1
                      ... 
36.541908593249026       1
93.45116318631192        1
140.80972223052834       1
38.73937670100975        1
167.1638651610451        1
Name: Amount_invested_monthly, Length: 91049, dtype: int64

Following this logic, I looked for nonsense in the data frame and I started to replace them with numpy nan’s. I also looked for outliers by looking at the distribution of values, if there was a value that only appeared once and was isolated I substitute it for a null value. I based this decision not only on this but also when I looked for customers that had this outlier and I observed all the data from this particular customer, I’d see weird things like:

a table displaying information on a given customer and showing an outlier on its annual income

By looking at this customer is clear that he didn’t make this much money annually only one month of the year.

When you finish this search for typos and outliers don’t forget to assign the correct data type to your features. Some features like age started with string characters among the age values and because of this, it’s uploaded as an object instead of int or float.

Filling Null Values

After dealing with all the outliers and typos, we ended up with a lot of null values, as you can see:

1
df.isna().sum()[df.isna().sum() > 0]
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
Age                         2776
Occupation                  7062
Annual_Income                993
Monthly_Inhand_Salary      15002
Num_Credit_Card             2271
Interest_Rate               2034
Num_of_Loan                 4348
Type_of_Loan               11408
Num_of_Delayed_Payment      7002
Changed_Credit_Limit        2091
Num_Credit_Inquiries        1965
Credit_Mix                 20195
Credit_History_Age          9030
Payment_of_Min_Amount      12007
Amount_invested_monthly     8784
Payment_Behaviour           7600
Monthly_Balance             1200
dtype: int64

Instead of just dropping all these null values I first try to fill them using the information I already have. Remember that I said that a customer has historical data for 8 months? We can just use this historical data to fill the null values using an aggregation measurement of our choice filtering for the customer, this will be more accurate than just calculating the mean value of the database.

I decided to use the average values for the following columns:

1
2
3
4
5
mean_columns = [
    'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
    'Amount_invested_monthly', 'Monthly_Balance', 'Num_of_Loan',
    'Num_Credit_Card', 'Interest_Rate', 'Annual_Income', 'Monthly_Inhand_Salary'
    ]

And the last non-empty value for these:

1
last_columns = ['Age', 'Occupation', 'Type_of_Loan', 'Credit_Mix']

The reason for not using the mean for all my values is that I didn’t want to have someone be 20.5 years old and Occupation, Type_of_Loan, and Credit_Mix are discrete data.

Feature Engineering

With the clean data, we can proceed to feature engineering. In this case, we first want to change the Type_of_Loan, because that are some occurrences that it has all the loans in one value, as you can see:

1
2
3
4
5
6
Not Specified, Mortgage Loan, Auto Loan, and Payday Loan                                                                                         8
Payday Loan, Mortgage Loan, Debt Consolidation Loan, and Student Loan                                                                            8
Debt Consolidation Loan, Auto Loan, Personal Loan, Debt Consolidation Loan, Student Loan, and Credit-Builder Loan                                8
Student Loan, Auto Loan, Student Loan, Credit-Builder Loan, Home Equity Loan, Debt Consolidation Loan, and Debt Consolidation Loan               8
Personal Loan, Auto Loan, Mortgage Loan, Student Loan, and Student Loan                                                                          8
Name: Type_of_Loan, Length: 5380, dtype: int64

So I’ll save all the different loan types in one vector, by splitting the loans every time there is a , or , and.

1
2
3
4
5
6
loan_types = []
for index in df.index:
    temp = df.Type_of_Loan[index].replace('and ', '').split(', ')
    for i in temp: #loan in temp array
        if i not in loan_types: #if loan is not in loan_types
            loan_types.append(i) #add it

Now we can create dummy variables of these loan_types, so a customer will receive the number 1 if they have this loan or a 0 if they don’t.

1
2
3
4
5
6
for loan in loan_types:
    df[loan] = 0 #create the loan column in the df with 0
    for index in df.index:
        temp = df.Type_of_Loan[index].replace('and ', '').split(', ')
        if loan in temp:
            df.loc[index, loan] = 1

Now I want to keep working on this dataset to make it ready for training a machine learning model. For this reason, I need to transform my discrete data into numeric.

The feature Credit_History_Age has the values as strings “22 Years and 5 Months”, this pattern repeats itself, so we can take advantage of this and select the year multiplied by 12 and sum the month, resulting in a new feature with the credit history age in months. When we are done with this, there are still going to be null values, to fill them I choose to interpolate the values, this works great when the missing value is in February up until July because it interpolates with the customer’s credit history age, but it becomes a bad guessed when the missing value is in January or August.

The months’ names are going to be replaced by their number counterpart, so January is 1, February is 2, and so on. credit_mix and credit_score have 3 sequential categories, I choose to go with -1, 0, and 1, but you can use 1, 2, 3 and it’ll produce the same result.

Don’t forget to check the GitHub Repository for the complete code mentioned here and the cleaned dataset.