Credit Card Fraud Classification

Harold Zhou
5 min read · Oct 23, 2020

Overview

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not falsely charged for items they did not purchase. Therefore, credit card companies need models that can help them identify fraud from transaction information.

Data

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It contains only numeric input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not available. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent, cost-sensitive learning. The feature ‘Class’ is the response variable, and it takes value 1 in case of fraud and 0 otherwise.

Step 1 General workflow

[Code: standard input block — imports and data loading]

Step 2 Quick glance at the data

First, I’d like to check if there is any correlation among these anonymous features.

The color scale of the heatmap shows little to no correlation. Next, regarding the two known features, I'm curious whether any underlying pattern can help us distinguish the classes. Here I used scatter plots to illustrate the distribution of class against time and amount separately.
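The heatmap is just the pairwise correlation matrix of the features. A minimal sketch of the computation, using a small synthetic frame in place of the real creditcard.csv (the column names and sizes here are illustrative; in the notebook the matrix is drawn with a heatmap function such as seaborn's):

```python
import numpy as np
import pandas as pd

# Stand-in for the PCA features V1..V28 in creditcard.csv: independent
# normals, so off-diagonal correlations should be near zero, as in the post
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=[f"V{i}" for i in range(1, 6)])

corr = df.corr()
print(corr.round(2))

# In the notebook this matrix feeds a heatmap, e.g.:
# import seaborn as sns; sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
```

Because the V-features come from PCA, near-zero correlations are expected by construction.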

Hooray! We find that most fraudulent activities have amounts lower than 2.1K, but they show no special pattern in time of occurrence. For now, transaction amount is one of the features that can help classify the target well.

Step 3 Class imbalance check

We need to pick a metric for this project, but before that, we should check whether the target classes are balanced or not, since this affects which metric to choose.

These counts show that the target classes are extremely imbalanced, which is expected: after all, fraud is a small minority compared to the tremendous number of normal transactions.
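The check itself is a one-liner on the target column. A sketch using the class counts reported for this dataset (492 frauds out of 284,807 rows) in place of the real column:

```python
import pandas as pd

# Class counts as described for the dataset: 492 frauds out of 284,807 rows
y = pd.Series([0] * (284807 - 492) + [1] * 492, name="Class")

counts = y.value_counts()
fraud_share = counts[1] / len(y)
print(counts)
print(f"fraud share: {fraud_share:.4%}")  # about 0.17%
```

With positives this rare, plain accuracy is nearly useless: predicting "normal" for everything already scores above 99.8%.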

Recall that what a credit card company cares about is catching fraudulent activity, i.e. avoiding false negatives (transactions that are actually fraud/positive but that we classify as negative), rather than false positives (transactions that are actually normal/negative but that we flag as fraud/positive; this may annoy customers with incorrect warnings, but at least it causes no monetary loss). Hence, recall is the preferred metric for this problem. The formula is: Recall = TP / (TP + FN)
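As a tiny worked example of the formula (the counts here are illustrative, chosen to match the kind of confusion matrix reported later in the post):

```python
# Recall = TP / (TP + FN): the share of actual frauds that the model catches.
# Illustrative counts: 136 actual frauds, 17 of which are missed
tp, fn = 119, 17
recall = tp / (tp + fn)
print(f"recall = {recall:.3f}")  # 0.875
```

Note that recall ignores false positives entirely, which is exactly the trade-off argued for above.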

Step 4 Data preprocessing

This stage includes the data split, scaling, and oversampling. It is better to oversample the dataset so the model has more minority instances to train on; I only oversample the training data and leave the test set intact.

Here I use SMOTE oversampling to generate synthetic minority points, and set random seeds to make sure the same outcome can be reproduced later.
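In practice this is done with `SMOTE` from the imbalanced-learn package, fit on the training split only. The core idea, interpolating between a minority point and one of its nearest minority neighbours, can be sketched from scratch on stand-in data (the function name and parameters below are mine, not the library's):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Minimal SMOTE-style oversampling: create each synthetic point by
    interpolating between a random minority sample and one of its k
    nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip self (distance 0)
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between i and j
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)

X_min = np.random.default_rng(0).normal(size=(20, 3))  # minority stand-in
synthetic = smote_sketch(X_min, n_new=30)
print(synthetic.shape)  # (30, 3)
```

The real call in imbalanced-learn is a drop-in resampler (`SMOTE(random_state=...).fit_resample(X_train, y_train)`), which matches the seeding the post mentions.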

Step 5 Base model selection

For this project, the following algorithms are selected. They are tested with cross-validated recall scores during this stage, and their performance is evaluated based on the mean and standard deviation of those scores.
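The comparison loop looks roughly like the sketch below. The exact model list comes from the notebook; here three common scikit-learn classifiers stand in for it, trained on synthetic imbalanced data rather than the real (oversampled) training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(random_state=42),
    "Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    # recall is the chosen metric, per the discussion above
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Collecting the five fold scores per model is also what feeds the boxplot in the next figure.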

A boxplot makes this comparison more intuitive.

However, a boxplot alone is not sufficient for this evaluation; I also use the ROC-AUC curve and the precision-recall curve to help reveal which model performs better.

Because the target class is imbalanced, the ROC-AUC curve is less informative than the precision-recall curve, so the precision-recall curve carries more weight in the overall evaluation.
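Both summaries are one call each in scikit-learn. A sketch on synthetic imbalanced data (average precision is the standard single-number summary of the precision-recall curve):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the real transactions
X, y = make_classification(n_samples=3000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
roc = roc_auc_score(y_te, probs)
ap = average_precision_score(y_te, probs)  # summarises the PR curve
print(f"ROC-AUC={roc:.3f}  AP={ap:.3f}")
```

On heavily imbalanced data, ROC-AUC can stay high while precision is poor, because the flood of true negatives dominates the false positive rate; average precision does not have that blind spot, which is why the PR curve gets more weight here.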

It’s clear from the three plots above that the XGBoost model has the highest recall scores during training.

Step 6 Fine-tuning the model

Since GridSearchCV is very time-consuming, and XGBoost is an ensemble model that makes parameter tuning take even longer, I use Bayesian optimization instead. It not only saves time but also gives insight through its iteration table, shown below.
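The notebook tunes XGBoost with Bayesian optimization. As a runnable stand-in, the sketch below uses the same fit/score interface with scikit-learn's RandomizedSearchCV and a GradientBoostingClassifier; swapping in a Bayesian searcher (e.g. scikit-optimize's BayesSearchCV) and an XGBClassifier follows the identical pattern. All parameter ranges and data here are illustrative:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=800, weights=[0.9], random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 6),
        "learning_rate": uniform(0.01, 0.3),
    },
    n_iter=5,           # far fewer fits than an exhaustive grid
    scoring="recall",   # tune for the project's chosen metric
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The point of either smarter search is the same: evaluate a handful of well-chosen parameter settings instead of every cell of a grid.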

Step 7 Model performance assessment

Now we can see how the model performs with its tuned parameters by checking its recall score and confusion matrix on the test set.
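The evaluation itself is a few lines. A sketch on synthetic data with a RandomForest stand-in for the tuned XGBoost model (the actual numbers discussed next come from the real test set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)

cm = confusion_matrix(y_te, pred)  # rows = actual class, cols = predicted
rec = recall_score(y_te, pred)
print(cm)
print(f"recall = {rec:.3f}")
print(classification_report(y_te, pred, digits=3))
```

The classification report gives the same recall figure as the confusion matrix, just alongside precision and F1 per class.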

From the confusion matrix, we can see that the number of misclassified cases is 59 out of 85,443, which is no surprise given the class imbalance. The false negatives number 17 out of 136 actual frauds, giving a recall of 88%; we can get the same score from the classification report below.

Thanks for your time reading.

The Jupyter notebook can be found on GitHub. Enjoy!

The dataset is over GitHub's 25 MB limit, but it's easy to find online if you search based on the data description in this blog.

