Random Forest on Credit Card Approval Classification

This is an excerpt from my Kaggle notebook where I used a Random Forest classifier on credit card data (https://www.kaggle.com/datasets/rohitudageri/credit-card-details/data).

Random Forest was extremely easy to use and offered great insights into the data relatively quickly.

Exploring the Data

The CSV has 1548 rows in total. The table below shows summary statistics for the numeric columns, including an additional column "Approved" that is 1 or 0.

Statistic CHILDREN Annual_income Birthday_count Employed_days Mobile_phone Work_Phone Phone EMAIL_ID Family_Members Approved
count 1548.000000 1.525000e+03 1526.000000 1548.000000 1548.0 1548.000000 1548.000000 1548.000000 1548.000000 1548.000000
mean 0.412791 1.913993e+05 -16040.342071 59364.689922 1.0 0.208010 0.309432 0.092377 2.161499 0.113049
std 0.776691 1.132530e+05 4229.503202 137808.062701 0.0 0.406015 0.462409 0.289651 0.947772 0.316755
min 0.000000 3.375000e+04 -24946.000000 -14887.000000 1.0 0.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 1.215000e+05 -19553.000000 -3174.500000 1.0 0.000000 0.000000 0.000000 2.000000 0.000000
50% 0.000000 1.665000e+05 -15661.500000 -1565.000000 1.0 0.000000 0.000000 0.000000 2.000000 0.000000
75% 1.000000 2.250000e+05 -12417.000000 -431.750000 1.0 0.000000 1.000000 0.000000 3.000000 0.000000
max 14.000000 1.575000e+06 -7705.000000 365243.000000 1.0 1.000000 1.000000 1.000000 15.000000 1.000000

Modes

The table below shows the mode (most frequent value) of each column.

Field Value
GENDER F
Car_Owner N
Propert_Owner Y
CHILDREN 0.0
Annual_income 135000.0
Type_Income Working
EDUCATION Secondary / secondary special
Marital_status Married
Housing_type House / apartment
Employed_days 365243.0
Mobile_phone 1.0
Work_Phone 0.0
Phone 0.0
Type_Occupation Laborers
Family_Members 2.0
Approved 0.0
Status Declined
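A table like this can be produced with pandas. The sketch below uses a tiny made-up frame (the rows are illustrative, not taken from the dataset) and assumes the notebook used DataFrame.mode, which the post doesn't confirm.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real CSV.
toy = pd.DataFrame({
    "GENDER": ["F", "F", "M", "F"],
    "Car_Owner": ["N", "Y", "N", "N"],
    "CHILDREN": [0, 0, 1, 2],
})

# mode() returns one row per tied mode; the first row holds the
# most frequent value of every column.
modes = toy.mode().iloc[0]
print(modes)
```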

Number of Approved vs Declined

Note that the data is heavily skewed toward "Declined": the mean of Approved is about 0.11, so only around 11% of rows are approved. This will be important later when fitting the model.

[Figure: Number of Approved vs Declined]
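Because Approved is a 0/1 column, its mean in the summary table is the fraction of approved rows, so the class counts can be recovered with a little arithmetic. The two inputs below are read from the summary table; rounding to a whole number of rows is my own step.

```python
n_rows = 1548             # count of Approved in the summary table
approved_rate = 0.113049  # mean of Approved in the summary table

# The mean of a 0/1 column times the row count gives the number of 1s.
approved = round(n_rows * approved_rate)
declined = n_rows - approved

print(approved, declined)
print(f"declined per approved: {declined / approved:.1f}")
```

That works out to roughly eight declined rows for every approved one.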

Education Distribution

Mostly high school and junior college.

[Figure: Education Distribution]

Occupation Type Distribution

[Figure: Occupation Type Distribution]

Income Type Distribution

[Figure: Income Type Distribution]

Marital Status Distribution

[Figure: Marital Status Distribution]

Employed Days Distribution

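This looks odd, but the value counts backwards from the current day (0) to the start day of the job, so employed people have negative values. A positive number means the person was unemployed at the time of collection (365243, the mode and maximum in the tables above, is one such value). Most employed people have around 5-7 years of employment.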

[Figure: Employed Days Distribution]

Housing Type Distribution

[Figure: Housing Type Distribution]

Preprocess Data

We'll first separate the categorical and continuous fields.

Categorical Continuous
GENDER CHILDREN
Car_Owner Family_Members
Propert_Owner Annual_income
Type_Income Age
EDUCATION EmployedDaysOnly
Marital_status UnemployedDaysOnly
Housing_type
Mobile_phone
Work_Phone
Phone
Type_Occupation
EMAIL_ID
  • Age is calculated in years from the Birthday_count field.
  • Two new fields are added that count the number of employed days and unemployed days for each person.
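Here is a rough sketch of how those derived fields could be built. The exact formulas from the notebook aren't shown in this excerpt, so the ones below are my assumptions: whole-year age from the negated day count, and a sign-based split of Employed_days. The rows are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the dataset.
df = pd.DataFrame({
    "Birthday_count": [-16040, -12417, -7705],
    "Employed_days": [-1565, 365243, -431],
    "GENDER": ["F", "M", "F"],
})

# Assumed formula: Birthday_count is negative days, so negate and
# floor-divide by 365 to get whole years.
df["Age"] = (-df["Birthday_count"] // 365).astype(int)

# Negative Employed_days means employed for that many days.
df["EmployedDaysOnly"] = (-df["Employed_days"]).clip(lower=0)
# Positive Employed_days means unemployed.
df["UnemployedDaysOnly"] = df["Employed_days"].clip(lower=0)

# Categorical fields become pandas categories so .cat.codes works later.
df["GENDER"] = df["GENDER"].astype("category")

print(df)
```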

Random Forest Classifier

  • Given how skewed the classes are, oversampling is needed.
  • Given how small the dataset is, undersampling won't be used.
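from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# cats / conts are the categorical and continuous field lists; dep is the target column
X, y = df[cats + conts].copy(), df[dep]

# Duplicate minority-class rows at random until both classes are the same size
X_over, y_over = RandomOverSampler().fit_resample(X, y)

X_train, X_val, y_train, y_val = train_test_split(X_over, y_over, test_size=0.25)

# Replace each categorical value with its integer category code
X_train[cats] = X_train[cats].apply(lambda x: x.cat.codes)
X_val[cats] = X_val[cats].apply(lambda x: x.cat.codes)

# 100 trees; oob_score=True also keeps an out-of-bag accuracy estimate
rf = RandomForestClassifier(100, oob_score=True)
rf.fit(X_train, y_train)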
Metric Value
MSE 0.011644832605531296
OOB 0.013598834385624037
Accuracy 0.9883551673944687
F1 Score 0.988235294117647
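The excerpt doesn't show how these numbers were computed, so here is one plausible way, run on synthetic stand-in data rather than the credit card features. Treating OOB as 1 - rf.oob_score_ is my guess. Note that MSE on 0/1 labels is just the error rate, which matches the table: MSE and Accuracy there sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook used the credit card features.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
preds = rf.predict(X_val)

metrics = {
    "MSE": mean_squared_error(y_val, preds),  # error rate for 0/1 labels
    "OOB": 1 - rf.oob_score_,                 # out-of-bag error (assumed definition)
    "Accuracy": accuracy_score(y_val, preds),
    "F1 Score": f1_score(y_val, preds),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```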

Feature Importance

And now my favorite part! This plot shows the most influential fields in the data: the ones the forest's splits rely on most. No surprise that length of employment and age are the 2 dominant factors, with annual income a close 3rd. So much data analysis can be condensed into the feature importance plot!

[Figure: feature importance plot]
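For reference, a minimal sketch of pulling a ranking like this out of a fitted forest. It assumes the plot came from scikit-learn's impurity-based feature_importances_ attribute, which the excerpt doesn't confirm, and it uses synthetic data with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
names = ["f0", "f1", "f2", "f3", "f4"]

rf = RandomForestClassifier(100, random_state=0).fit(X, y)

# feature_importances_ is impurity-based and sums to 1 across features.
importance = pd.Series(rf.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)
print(importance)
# importance.plot.barh() would draw the bar chart (needs matplotlib).
```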

Confusion

There were no cases where the model predicted "Declined" when the actual status was "Approved", so every decline the model issued was correct. It is a little too lenient, though: it predicted approval for 8 applicants who were actually declined.

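preds = rf.predict(X_val)  # predictions on the validation set
confusion = confusion_matrix(y_val, preds)  # confusion_matrix is from sklearn.metrics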
[Figure: confusion matrix]
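The confusion counts tie directly back to the metrics table. In the sketch below, the 8 false positives and 0 false negatives come from the text above, with "Approved" as the positive class. The true positive and true negative counts are not stated in the post: I back-solved them from the reported accuracy and F1, assuming a 687-row validation set (25% of the oversampled data), so treat them as an inference.

```python
# FP and FN are from the post. TP and TN are NOT stated there; they are
# back-solved from the reported accuracy and F1 under the assumption of
# a 687-row validation set.
tp, fp, fn, tn = 336, 8, 0, 343

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / total  # same as MSE when labels are 0/1

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} error={error_rate:.4f}")
```

With these counts the accuracy, F1 and error rate line up with the Accuracy, F1 Score and MSE rows reported earlier.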

It follows that the ROC curve is almost perfect, with an AUC of 0.988.

[Figure: ROC curve]
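As a reminder of what AUC measures, here is a small pure-Python sketch: it is the fraction of (approved, declined) pairs where the approved case gets the higher score, with ties counting half. The labels and scores are made up. In scikit-learn, roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]) computes the same quantity, though whether the notebook called it that way is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities of "Approved" (made up).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(roc_auc(labels, scores))
```

Here 8 of the 9 pairs are ordered correctly, so the AUC is 8/9.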

Results

This was a non-starter without oversampling. Without it, accuracy looked deceptively good at around 91%, but the confusion matrix and abysmal F1 score made it obvious that the model was awful.
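One thing worth checking with this setup: random oversampling works by duplicating minority rows, so oversampling before the split can put copies of the same row in both the training and validation sets, which tends to flatter the validation scores. A common alternative is to split first and oversample only the training rows. The sketch below does that on made-up data, using plain pandas resampling as a stand-in for RandomOverSampler; the helper name oversample_minority is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_minority(frame, target, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    counts = frame[target].value_counts()
    minority = counts.idxmin()
    extra = frame[frame[target] == minority].sample(
        n=counts.max() - counts.min(), replace=True, random_state=seed)
    return pd.concat([frame, extra], ignore_index=True)

# Made-up data: 90 declined (0) and 10 approved (1).
df = pd.DataFrame({"x": range(100), "Approved": [0] * 90 + [1] * 10})

# Split first (stratified), then oversample the training rows only.
train, val = train_test_split(
    df, test_size=0.25, stratify=df["Approved"], random_state=0)
train_over = oversample_minority(train, "Approved")

print(train_over["Approved"].value_counts().to_dict())
print(val["Approved"].value_counts().to_dict())
```

The validation set keeps its original class balance, so accuracy alone will again look flattering there and F1 or the confusion matrix is the number to watch.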

I split the Employed_days column into "unemployed" and "employed" fields. Looking at the feature importance plot, it's not surprising that unemployed days, age and income are the top contributors.

Final Results:

Score Value
ROC AUC 0.9958
MSE 0.0044
OOB 0.0141
Accuracy 0.9956
F1 Score 0.9955