
Random Forest on Credit Card Approval Classification

Adam Khatib

06 Oct 2019 • 4 min read

This is an excerpt from my Kaggle notebook where I used a Random Forest classifier on credit card data (https://www.kaggle.com/datasets/rohitudageri/credit-card-details/data).

Random Forest was extremely easy to use and offered useful insights into the data relatively quickly.

Exploring the Data

The CSV had a total of 1548 rows. The table below shows summary statistics for the numeric columns, along with an additional column, "Approved", that is 1 or 0.

	CHILDREN	Annual_income	Birthday_count	Employed_days	Mobile_phone	Work_Phone	Phone	EMAIL_ID	Family_Members	Approved
count	1548.000000	1.525000e+03	1526.000000	1548.000000	1548.0	1548.000000	1548.000000	1548.000000	1548.000000	1548.000000
mean	0.412791	1.913993e+05	-16040.342071	59364.689922	1.0	0.208010	0.309432	0.092377	2.161499	0.113049
std	0.776691	1.132530e+05	4229.503202	137808.062701	0.0	0.406015	0.462409	0.289651	0.947772	0.316755
min	0.000000	3.375000e+04	-24946.000000	-14887.000000	1.0	0.000000	0.000000	0.000000	1.000000	0.000000
25%	0.000000	1.215000e+05	-19553.000000	-3174.500000	1.0	0.000000	0.000000	0.000000	2.000000	0.000000
50%	0.000000	1.665000e+05	-15661.500000	-1565.000000	1.0	0.000000	0.000000	0.000000	2.000000	0.000000
75%	1.000000	2.250000e+05	-12417.000000	-431.750000	1.0	0.000000	1.000000	0.000000	3.000000	0.000000
max	14.000000	1.575000e+06	-7705.000000	365243.000000	1.0	1.000000	1.000000	1.000000	15.000000	1.000000

Modes

The table below shows the mode (most frequent value) of each column.

Field	Value
GENDER	F
Car_Owner	N
Propert_Owner	Y
CHILDREN	0.0
Annual_income	135000.0
Type_Income	Working
EDUCATION	Secondary / secondary special
Marital_status	Married
Housing_type	House / apartment
Employed_days	365243.0
Mobile_phone	1.0
Work_Phone	0.0
Phone	0.0
Type_Occupation	Laborers
Family_Members	2.0
Approved	0.0
Status	Declined

Number of Approved vs Declined

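Note that the data is heavily skewed toward "Declined": the mean of "Approved" in the summary table is about 0.113, so only around 11% of rows are approved. This will be important later when fitting the model.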

Education Distribution

Mostly high school and junior college.

Occupation Type Distribution

Income Type Distribution

Marital Status Distribution

Employed Days Distribution

This looks odd, but the value is the job's start day counted backwards from the day of collection (day 0), so a negative number means the person is currently employed. A positive number means the person was unemployed at the time of collection. Most people have been employed for around 5-7 years.

Housing Type Distribution

Preprocess Data

We'll first separate the categorical and continuous fields.

Categorical	Continuous
GENDER	CHILDREN
Car_Owner	Family_Members
Propert_Owner	Annual_income
Type_Income	Age
EDUCATION	EmployedDaysOnly
Marital_status	UnemployedDaysOnly
Housing_type
Mobile_phone
Work_Phone
Phone
Type_Occupation
EMAIL_ID

Age is calculated in years from the Birthday_count field.
Two new fields, EmployedDaysOnly and UnemployedDaysOnly, count the number of employed days and unemployed days for each person.
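The excerpt does not show how these fields were derived, so here is a minimal sketch of one way to do it. The made-up rows, the 365.25 divisor, and the use of `clip` are all assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Made-up rows using the dataset's column names; values are illustrative only.
df = pd.DataFrame({
    "Birthday_count": [-16040.0, -7705.0, np.nan],
    "Employed_days": [-1565, 365243, -3650],
})

DAYS_PER_YEAR = 365.25  # assumption: the notebook's exact divisor is not shown

# Birthday_count counts days backwards from day 0, so negate it to get a positive age.
df["Age"] = -df["Birthday_count"] / DAYS_PER_YEAR

# Negative Employed_days means currently employed: keep the magnitude, zero otherwise.
df["EmployedDaysOnly"] = (-df["Employed_days"]).clip(lower=0)

# Positive Employed_days means currently unemployed: keep the value, zero otherwise.
df["UnemployedDaysOnly"] = df["Employed_days"].clip(lower=0)

print(df)
```

Missing Birthday_count values simply stay missing in Age here; how to fill them is a separate decision.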

Random Forest Classifier

Given how skewed the classes are, oversampling is needed.
Given how small the dataset is, undersampling won't be used, since it would throw away most of the rows.

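from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# cats / conts: the categorical and continuous column lists from the table above;
# dep: the target column (presumably "Approved").
X, y = df[cats + conts].copy(), df[dep]

# Duplicate minority-class rows at random until both classes are the same size.
X_over, y_over = RandomOverSampler().fit_resample(X, y)

X_train, X_val, y_train, y_val = train_test_split(X_over, y_over, test_size=0.25)

# Replace each pandas categorical column with its integer codes.
X_train[cats] = X_train[cats].apply(lambda x: x.cat.codes)
X_val[cats] = X_val[cats].apply(lambda x: x.cat.codes)

# 100 trees; oob_score=True also scores each tree on the rows left out of its bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(X_train, y_train);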

Metric	Value
MSE	0.011644832605531296
OOB	0.013598834385624037
Accuracy	0.9883551673944687
F1 Score	0.988235294117647
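One thing to keep in mind when reading these scores: random oversampling duplicates minority rows, so when it runs before the split, copies of the same row can land in both the training and validation sets. A common alternative is to split first and oversample only the training portion. The sketch below shows that ordering on synthetic data, using `sklearn.utils.resample` as a stand-in for `RandomOverSampler`. It is not what the notebook did, and the column names `a` and `b` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Synthetic, imbalanced stand-in for the real data: 20 positives out of 200 rows.
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y = pd.Series([1] * 20 + [0] * 180, name="Approved")

# Split first, so validation rows are never duplicated into the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Oversample the minority class in the training set only.
train = X_train.assign(Approved=y_train)
minority = train[train["Approved"] == 1]
majority = train[train["Approved"] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up])

X_train_bal = train_bal.drop(columns="Approved")
y_train_bal = train_bal["Approved"]
print(y_train_bal.value_counts())
```

The validation set keeps the original class ratio, so scores measured on it reflect the imbalance the model would see in practice.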

Feature Importance

And now my favorite part! This plot shows the most influential fields in the data, that is, the ones the forest's splits rely on most. No surprise: length of employment and age are the two dominant factors, with annual income a close third. So much data analysis can be condensed into the feature importance plot!
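The plotting code is not part of this excerpt. Assuming the plot was built from scikit-learn's impurity-based `feature_importances_` attribute, a minimal sketch on synthetic data looks like this (the `signal` and `noise` columns are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: "signal" determines the label, "noise" is unrelated to it.
X = pd.DataFrame({"signal": rng.normal(size=500), "noise": rng.normal(size=500)})
y = (X["signal"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is impurity-based and sums to 1 across features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the sorted series would give a bar chart in the same spirit as the plot above.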

Confusion Matrix

There were no cases where the model predicted "Declined" when the actual status was "Approved", so it never wrongly declined an applicant. It is a little too lenient, though: it predicted approval for 8 applicants who were actually declined.

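preds = rf.predict(X_val)  # assumed: predictions on the validation set
confusion = confusion_matrix(y_val, preds)  # confusion_matrix is from sklearn.metrics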
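The metrics table and the confusion matrix tell the same story, and the arithmetic can be checked by hand. In the sketch below, the 8 false positives and 0 false negatives come from the text above; the true positive and true negative counts are not shown in this excerpt and are back-solved from the reported accuracy and F1, so treat them as inferred.

```python
# Inferred confusion-matrix counts ("Approved" = 1 is the positive class).
# fp and fn are stated in the text; tp and tn are back-solved from the reported scores.
tn, fp, fn, tp = 343, 8, 0, 336

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
error_rate = (fp + fn) / total  # equals MSE when labels and predictions are both 0/1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total} accuracy={accuracy:.4f} mse={error_rate:.4f} f1={f1:.4f}")
```

With 0/1 labels, MSE is just the fraction of wrong predictions, which is why MSE and accuracy in the table add up to 1.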

It follows that the ROC curve is almost perfect, with an AUC of 0.988.

Results

This was a non-starter without oversampling. Without it, accuracy looked deceptively good at around 91%, but the confusion matrix and abysmal F1 score made it obvious that the model was awful.
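To see why accuracy alone is misleading here, consider a "model" that always predicts "Declined". The class counts below are derived from the summary table (mean of Approved = 0.113049 over 1548 rows); everything else is plain arithmetic, offered as an illustration rather than a result from the notebook.

```python
# Class counts implied by the summary table.
n_total = 1548
n_approved = round(0.113049 * n_total)
n_declined = n_total - n_approved

# Always predicting Declined (0): every declined row is right, every approved row is wrong.
tp, fp, fn, tn = 0, 0, n_approved, n_declined

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# Precision is 0/0 here, so F1 for the "Approved" class is conventionally reported as 0.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
```

This do-nothing baseline already scores roughly 89% accuracy, so the 91% seen without oversampling is barely an improvement over predicting "Declined" for everyone.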

I split the Employed_days column into separate "unemployed" and "employed" fields. Looking at the feature importance plot, it's not surprising that unemployed days, age and income are the top contributors.

Final Results:

Score	Value
ROC AUC	0.9958
MSE	0.0044
OOB	0.0141
Accuracy	0.9956
F1 Score	0.9955