Fraud Detection

Anomaly Detection Implementation (Credit Card Fraud) - Random Forest Classifier :

The credit card dataset is heavily imbalanced: more than 99% of the records are valid (non-fraudulent) transactions.

Feature Engineering :

The first step is to scale the "Time" and "Amount" columns appropriately. The Robust Scaler is used for this transformation because these columns contain outliers.

This scaler subtracts the median and scales the data according to a quantile range (by default the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
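
As a rough illustration of the computation described above, the sketch below fits a median and IQR on training values and reuses them on later data. It is a standard-library approximation of the idea, not the scaler the notebooks use; the function names and the sample amounts are hypothetical.

```python
from statistics import median, quantiles


def fit_robust(values):
    """Learn the centering and scaling statistics from training values."""
    # method="inclusive" interpolates linearly between the sorted samples.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def transform_robust(values, center, scale):
    """Apply previously stored statistics to any data, old or new."""
    return [(v - center) / scale for v in values]


# Hypothetical "Amount" values with one large outlier.
train_amounts = [1.0, 2.0, 3.0, 4.0, 1000.0]
center, scale = fit_robust(train_amounts)

scaled_train = transform_robust(train_amounts, center, scale)
# Later data reuses the stored median and IQR instead of refitting.
new_scaled = transform_robust([5.0], center, scale)
```

The outlier barely moves the median or the IQR, so the ordinary values stay in a narrow range after scaling, which is the reason for preferring this approach over mean and standard deviation here.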

Balancing the Dataset :

Because the data is so imbalanced, a model trained directly on the given dataset tends to favor the majority class and detects fraud poorly. There are two popular methods for dealing with imbalanced datasets :

Undersampling - down-sizing the majority class by removing observations until the dataset is balanced
Oversampling - over-sizing the minority class by adding observations

Undersampling, while it helps balance the data, discards a lot of it. This means that the model may lose out on valuable information. We have therefore used SMOTE to oversample the minority class.

SMOTE :

SMOTE (Synthetic Minority Over-sampling Technique) was one of the first algorithmic approaches to generating new dataset samples and is still one of the most popular. The algorithm, introduced and accessibly described in a 2002 paper, works by oversampling the underlying dataset with new synthetic points. It is parameterized by k_neighbors (the number of nearest neighbors it will consider) and the number of new points to create. Each step of the algorithm will:

Randomly select a minority point.
Randomly select any of its k_neighbors nearest neighbors belonging to the same class.
Randomly draw a lambda value in the range [0, 1].
Generate and place a new point on the line segment between the two points, located a fraction lambda of the way from the original point to the selected neighbor.
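
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library implementation used in the notebooks: the function name `smote_sample`, the Euclidean distance, and the sample points are all hypothetical.

```python
import math
import random


def smote_sample(minority, k_neighbors, rng):
    """Generate one synthetic point from a list of minority-class points."""
    # Step 1: randomly select a minority point.
    base = rng.choice(minority)
    # Step 2: randomly select one of its k nearest same-class neighbors.
    others = [p for p in minority if p is not base]
    nearest = sorted(others, key=lambda p: math.dist(base, p))[:k_neighbors]
    neighbor = rng.choice(nearest)
    # Step 3: draw lambda (random() returns a value in [0, 1)).
    lam = rng.random()
    # Step 4: place the new point a fraction lambda of the way to the neighbor.
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbor))


# Hypothetical two-feature minority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, 2, rng) for _ in range(5)]
```

Every synthetic point lies on a segment between two real minority points, so the new samples stay inside the region the minority class already occupies.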

Model :

The model is a random forest classifier with hyperparameters tuned by grid search. An imbalanced pipeline is used to oversample the data, and the oversampled data is then used to train the model. However, the oversampled values are not used for validation.
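
The point about validation can be illustrated with a small sketch: hold out the validation rows first, then balance only the training rows. To stay dependency-free, this sketch swaps in plain random duplication as a stand-in for SMOTE and uses a single hold-out split rather than a grid search; the function names and the label counts are hypothetical.

```python
import random


def stratified_split(labels, val_fraction, rng):
    """Split indices so each class keeps the same share in both parts."""
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = int(len(idx) * val_fraction)
        val_idx += idx[:n_val]
        train_idx += idx[n_val:]
    return train_idx, val_idx


def oversample_minority(train_labels, rng):
    """Stand-in for SMOTE: duplicate random fraud labels until classes match."""
    minority = [y for y in train_labels if y == 1]
    majority = [y for y in train_labels if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_labels + extra


# Hypothetical labels: 1 = fraud, 0 = valid.
labels = [1] * 10 + [0] * 990
rng = random.Random(0)

train_idx, val_idx = stratified_split(labels, 0.2, rng)
train_labels = oversample_minority([labels[i] for i in train_idx], rng)
val_labels = [labels[i] for i in val_idx]  # left untouched, still imbalanced
```

Because the validation rows never pass through the oversampler, scores measured on them reflect the class balance the model would face on real data.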

Model Performance :

With highly imbalanced datasets, accuracy is a poor measure of a model's performance because it is dominated by the majority class. Recall and F1 score are much better indicators of the model's performance.

Initial attempts to train the model on the unbalanced dataset resulted in a recall of 77%, which improved to 86% for the model trained on the oversampled data.
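
A small worked example shows why accuracy misleads here. The sketch below scores a degenerate model that labels every transaction as valid; the counts are made up for illustration and are not results from this project.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, recall and F1 for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical data: 10 fraud cases among 1000 transactions.
y_true = [1] * 10 + [0] * 990
# A model that always predicts "valid" and never flags fraud.
y_pred = [0] * 1000

metrics = classification_metrics(y_true, y_pred)
```

This model reaches 99% accuracy while catching none of the fraud, so its recall and F1 are both zero.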

abhinav2301/Fraud_Detection
