Assitan Koné
Dec 29

Credit Card Fraud Detection Part 1

This tutorial will be in two parts.
You can also watch the tutorial on YouTube.
How do we know whether a credit card transaction is fraudulent or not?
This is a classification problem.
I like writing out the problem statement, so let’s put in markdown what we’ve just said: we want to predict whether a credit card transaction is fraudulent or not. Now, about the dataset. We can find it on Kaggle, where there is some information about it: the two classes (0 and 1) and the statistics. You can download the dataset there to get the archive.

We type the link in our notebook.
Don’t forget to put the CSV in your folder: to get it, just click on Download, then unzip the archive to get the CSV file.

1. Importing Libraries

We import numpy as np (for algebraic operations on arrays) and pandas as pd (for data exploration and manipulation), then the matplotlib and seaborn plotting libraries. The %matplotlib inline magic command makes the plots render directly in the notebook.
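If you want to follow along, here’s a minimal sketch of that import cell:

```python
# Arrays and dataframes
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook magic: render plots directly inside the notebook
%matplotlib inline
```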

2. Loading the dataset

We load the dataset. We create a variable named train with the file path, and a new variable named df_train to hold the dataframe.
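A minimal sketch of that cell, assuming the unzipped Kaggle file is named creditcard.csv and sits next to the notebook:

```python
# Path to the CSV downloaded from Kaggle (adjust if you stored it elsewhere)
train = "creditcard.csv"

# Read it into a dataframe
df_train = pd.read_csv(train)
```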

3. Exploratory data analysis

Let’s view the first five lines. For that, we use .head(). At first glance, we see numerical values we can’t really interpret, so the data seems to have been transformed with PCA. That’s normal, because credit card data needs to stay confidential.
Now let’s look at the shape. That’s a lot of rows.
We check the data information with .info(). It helps us interpret things a little: we have 284807 rows, 31 columns, all numerical features and no missing values.
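The three exploration calls look roughly like this (each one would be its own notebook cell):

```python
# First five rows: Time, the anonymized PCA components V1..V28, Amount and Class
df_train.head()

# Number of rows and columns
df_train.shape

# Column names, dtypes and non-null counts
df_train.info()
```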
Let’s type that.
Interpreting Data Information
We have 284807 rows; any column with fewer non-null rows would have missing values.
We have 31 columns.
There are numerical features that have data type float64.
There are numerical features that have data type int64.
There are no missing values.
The column names are written a little inconsistently, so let’s normalize everything to lowercase.
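One way to do that (a sketch; the original cell may differ slightly):

```python
# Normalize the column names: Time -> time, V1 -> v1, Amount -> amount, Class -> class
df_train.columns = df_train.columns.str.lower()
df_train.columns
```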
Let’s look at the summary statistics of the numerical variables (so, every variable here).
We can compare the mean of each column with its min/max values to check whether we might have outliers, since there is a considerable difference between the average value and the max value.
We can also compare the means and standard deviations across columns to see whether we need to normalize the data, since they differ considerably from one another.
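This is the usual describe() call:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for every column
df_train.describe()
```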
Now, we’re gonna move on to univariate analysis.
For continuous numerical variables, we can use a histogram or a scatter plot; for categorical data, we commonly prefer bar plots or pie charts.
The goal is to analyze the target variable.
We can check again for missing values. Yes, still zero! We can also check the number of unique values and the frequency distribution of the target.
Then the percentage breakdown of the target. As you can see, this is the same code as the frequency distribution, but with normalize set to True.
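A sketch of those checks, using the lowercased column names:

```python
# Missing values per column (still zero everywhere)
df_train.isnull().sum()

# Number of unique values in the target
df_train["class"].nunique()

# Frequency distribution of the target
df_train["class"].value_counts()

# Same code, but as a percentage breakdown
df_train["class"].value_counts(normalize=True)
```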
I think it would be better to plot it, so let’s do that.
The dataset is very imbalanced (roughly 99/1), so we can’t use accuracy as a metric because of the huge imbalance.
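A possible way to plot the class distribution (seaborn’s countplot is one option; the article’s chart may differ):

```python
# Bar plot of the target: the imbalance is obvious
sns.countplot(x="class", data=df_train)
plt.title("Class distribution")
plt.show()
```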
Accuracy is the number of correct predictions over all predictions. Super simple. It’s a fine metric for a balanced dataset, which is not what we have here.
The thing is, we want to create a baseline model so we have a minimum score to beat later with better models. If 99% of the values are negative, a model that always predicts the negative class already scores around 99%, so how could we beat that? It’s too high.
So, we have positive and negative values, and I think you already know about true positives, false positives and so on.
We can count how many cards are fraudulent.
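For example:

```python
# Number of fraudulent transactions (class == 1)
(df_train["class"] == 1).sum()
```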
Bivariate analysis is about studying the relationships between pairs of variables. We can use a scatter plot, a pair plot, or a correlation matrix.
We can, for example, compare amount with class.
And filter to show only the positive class. We can see that the amount varies a lot.
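A sketch of those two cells (the exact columns shown in the original screenshots may differ):

```python
# Look at amount next to the target
df_train[["amount", "class"]]

# Filter to keep only the positive (fraudulent) class
df_train[df_train["class"] == 1][["amount", "class"]]
```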
Let’s check the mean.
On average, the fraud amount is 88.
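Something along these lines:

```python
# Average amount per class (0 = legitimate, 1 = fraud)
df_train.groupby("class")["amount"].mean()
```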
Let’s check the distribution with histograms.
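As a sketch, here is a histogram of amount for all transactions and for the frauds only (the article may plot other columns too):

```python
# Distribution of the amounts, overall and for frauds only
df_train["amount"].hist(bins=50)
plt.title("Amount - all transactions")
plt.show()

df_train[df_train["class"] == 1]["amount"].hist(bins=50)
plt.title("Amount - fraudulent transactions")
plt.show()
```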
One thing we can do is to check correlations.
As usual, it’s better to plot it. We do that with a correlation matrix.
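A common way to do both steps:

```python
# Pairwise correlations between all the numerical columns
corr = df_train.corr()

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```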
Now we’re gonna check for outliers. We isolate the numerical columns, just to show you how to do that.
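A sketch of that isolation step:

```python
# Keep only the numerical columns (here, that's every column anyway)
num_cols = df_train.select_dtypes(include=np.number).columns
df_num = df_train[num_cols]
```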
Let’s look at the summary statistics of all these columns again.
We can plot the outliers with boxplots, so we do that for amount and V25.
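For example (remember the columns were lowercased, so V25 is now v25):

```python
# Boxplots to visualize the outliers of amount and v25
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(y=df_train["amount"], ax=axes[0])
sns.boxplot(y=df_train["v25"], ax=axes[1])
plt.show()
```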
We also calculate the outlier range for both.
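Assuming the outlier range here is the usual 1.5 × IQR rule, a sketch:

```python
def outlier_range(series):
    """Lower and upper bounds beyond which values count as outliers (1.5 * IQR rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(outlier_range(df_train["amount"]))
print(outlier_range(df_train["v25"]))
```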

4. Baseline model

It’s time to create our baseline model. The goal is to have a quick idea about what we can get with just a few variables.
We select 5 variables, which of course don’t include the target.
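A sketch of that selection. The five columns below (v1 to v5) are a hypothetical choice, the article’s actual selection may differ; class is kept in the subset so we can split it off in the next step:

```python
# Five hypothetical baseline features, plus the target so we can split it off later
df_small = df_train[["v1", "v2", "v3", "v4", "v5", "class"]]
df_small.head()
```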
So we create a function that we name baseline_model. The input is the dataset.
First we isolate the target (class) from the features, as usual. We get the features, and we return the features and the target separately.
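A sketch of such a function, reusing the hypothetical column choice from above:

```python
def baseline_model(df):
    """Split the baseline dataframe into features and target ('class')."""
    target = df["class"]
    features = df[["v1", "v2", "v3", "v4", "v5"]]  # hypothetical feature columns
    return features, target
```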
We create X and y.
Let’s check X.
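Which gives:

```python
X, y = baseline_model(df_small)
X.head()
```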
Then, we import the logistic regression algorithm.
We instantiate the model, and we train it.
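A minimal version of those cells (max_iter is my addition, just to make sure the solver converges):

```python
from sklearn.linear_model import LogisticRegression

# Instantiate and train the baseline model
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
```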
We make predictions.
We’re gonna use the confusion matrix. A confusion matrix is a table that is used to define the performance of a classification algorithm.
Precision-Recall is a useful measure of the success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.
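A sketch of the evaluation, assuming a recent scikit-learn (ConfusionMatrixDisplay.from_predictions needs version 1.0+). Note that, as a quick baseline, the predictions here are made on the training data itself:

```python
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Predictions of the baseline model
y_pred = model.predict(X)

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(y, y_pred)
plt.show()

# Precision and recall matter far more than accuracy on imbalanced data
print(classification_report(y, y_pred))
```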

5. Feature engineering

It’s time to do feature engineering. We start with feature scaling. Let’s use describe.
Mean, std, min and max are very different from each other, so let's standardize. We’re gonna use StandardScaler and MinMaxScaler.
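A sketch of both scalers applied to the feature columns (everything except the target):

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

feature_cols = df_train.columns.drop("class")

# StandardScaler: each feature rescaled to mean 0 and standard deviation 1
std_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df_train[feature_cols]), columns=feature_cols
)
std_scaled.describe()

# MinMaxScaler: each feature squeezed into [0, 1]
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df_train[feature_cols]), columns=feature_cols
)
minmax_scaled.describe()
```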
With MinMaxScaler everything is on the same scale.
Now we can update the model with the other features.
Let’s save the dataset, and now we have the new one.
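One way to build and save that new dataset (a sketch: the target is reattached to the MinMax-scaled features, and the file name is arbitrary):

```python
# Reattach the target to the scaled features and save the result
df_scaled = minmax_scaled.copy()
df_scaled["class"] = df_train["class"].values
df_scaled.to_csv("creditcard_scaled.csv", index=False)
```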
So we create a function named update_model. It’s pretty much the same as the first one, but this time we drop class.
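A sketch of that function:

```python
def update_model(df):
    """Same idea as baseline_model, but keep every feature and simply drop 'class'."""
    target = df["class"]
    features = df.drop(columns=["class"])
    return features, target
```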
We update X and y.
Then we fit the model.
We make predictions.
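Putting those three small cells together (reusing the names defined above):

```python
# New features and target, then retrain and predict
X, y = update_model(df_scaled)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

y_pred = model.predict(X)
```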
We plot the confusion matrix again. It seems worse.
Let’s see the coefficients (weight of variables).
We’re gonna use feature_names_in_.
We can display an array and sort it.
Let’s plot it to see better.
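A sketch of those three cells: pair the coefficients with the feature names, sort them, then plot them:

```python
# Pair each feature name with its learned coefficient (weight)
coefs = pd.Series(model.coef_[0], index=model.feature_names_in_)

# Sort to see the most discriminating features at the extremes
coefs_sorted = coefs.sort_values()
coefs_sorted

# Horizontal bar plot of the coefficients
coefs_sorted.plot(kind="barh", figsize=(8, 10))
plt.xlabel("Coefficient")
plt.show()
```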
There are many very discriminating features, especially V3, V15 and V25. amount is neutral. V7, V22 and V21 have a positive impact.
When we train our model on all the features, the bias term is -3.29. The sign of the bias term is negative because of the class imbalance, meaning that on average the probability of a transaction being non-fraudulent is high.
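The bias term is the model’s intercept:

```python
# Bias (intercept) of the logistic regression trained on all features
model.intercept_
```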
See you next time for part 2.

Author

Assitan Koné
Founder @Codistwa