This project implements a real-time fraud detection system for banking transactions using Kafka, Spark Structured Streaming, machine learning, TimescaleDB, and Grafana. The system detects fraudulent transactions by consuming real-time transaction data from Kafka, applying a machine learning model, and storing the results in TimescaleDB for monitoring through a Grafana dashboard.
The system consists of the following components:
- Kafka Broker: Ingests real-time banking transactions and distributes them to consumers.
- Zookeeper: Manages the Kafka broker.
- Producers:
  - Normal Producer: Generates legitimate transaction data and sends it to the Kafka topic `raw_transactions`.
  - Hacker Producer: Simulates fraudulent transactions that will be flagged by the machine learning model.
- Spark Cluster:
  - Spark Master and Spark Workers: Run a Spark Structured Streaming job that reads transaction data from Kafka, applies feature extraction, and infers fraud probability using a pre-trained machine learning model.
- TimescaleDB: A PostgreSQL-based time-series database for storing the processed transaction data, including fraud predictions.
- Grafana Dashboard: Visualizes the transaction data and fraud metrics in real-time.
- Model Build: Contains the code and data to retrain the machine learning model.
Before running the project, ensure that you have the following installed:
- Docker
- Docker Compose
- Python 3.x for model retraining (optional)
Clone the repository:

```bash
git clone https://github.com/iitravindra/streaming_fraud_detection.git
cd streaming_fraud_detection
```
Use Docker Compose to build and start the containers.
```bash
docker-compose up --build -d
```

This command starts all the services defined in `docker-compose.yml`, including Kafka, Spark, TimescaleDB, and Grafana.
There are two producers for generating transaction data:
- Normal Producer: Generates regular transaction data every second.
- Hacker Producer: Generates fraudulent transactions intermittently to simulate real-world fraud attempts.
Both producers send the transaction data to the Kafka topic `raw_transactions`.
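For illustration, here is a minimal sketch of such a producer loop. It assumes the `kafka-python` package and a local broker address; the transaction fields shown are hypothetical, and the actual producers live in `data_producer/`.

```python
# Minimal producer sketch (assumes the kafka-python package; the field names
# are illustrative, not the exact schema used by the real producers).
import json
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # adjust to your broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    transaction = {
        "transaction_id": random.randint(1, 10**9),
        "account_id": random.randint(1, 1000),
        "amount": round(random.uniform(1.0, 500.0), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("raw_transactions", transaction)
    time.sleep(1)  # the normal producer emits roughly one transaction per second
```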
The Spark job processes incoming transactions from Kafka, creates features, and runs inference using a pre-trained machine learning model. The results (fraud predictions) are written to TimescaleDB for further analysis.
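A condensed sketch of what this pipeline can look like is shown below. It assumes PySpark with the Kafka connector and a PostgreSQL JDBC driver on the classpath; the schema, feature set, and connection settings are illustrative, and the real implementation is `data_processor/fraud_detector.py`.

```python
# Condensed streaming sketch (illustrative schema and connection settings;
# the actual job is data_processor/fraud_detector.py).
import pickle

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("fraud-detector").getOrCreate()

# Expected shape of the JSON messages on raw_transactions (assumed).
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

# Read the raw transaction stream from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "raw_transactions")
       .load())

transactions = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("t")
).select("t.*")

# Load the pre-trained model once on the driver and ship it to executors.
with open("fraud_detection_model.pkl", "rb") as f:
    model = pickle.load(f)
broadcast_model = spark.sparkContext.broadcast(model)

@F.udf(DoubleType())
def fraud_probability(amount):
    # Single-feature inference for brevity; the real job extracts more features.
    return float(broadcast_model.value.predict_proba([[amount]])[0][1])

scored = transactions.withColumn("fraud_probability", fraud_probability("amount"))

def write_batch(batch_df, batch_id):
    # Persist each micro-batch to TimescaleDB over JDBC.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://timescaledb:5432/postgres")
     .option("dbtable", "transactions")
     .option("user", "postgres")
     .option("password", "postgres")
     .mode("append")
     .save())

scored.writeStream.foreachBatch(write_batch).start().awaitTermination()
```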
TimescaleDB stores the processed transaction data with fraud predictions. The database schema is initialized from the SQL file located at `./monitoring/init.sql`.
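As a rough illustration of the kind of schema such a system uses (the authoritative DDL is in `init.sql`; the table and column names below are assumptions), initialization might resemble:

```python
# Illustrative schema setup (the real DDL lives in ./monitoring/init.sql;
# table and column names here are assumptions).
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS transactions (
            time              TIMESTAMPTZ NOT NULL,
            transaction_id    TEXT,
            account_id        TEXT,
            amount            DOUBLE PRECISION,
            fraud_probability DOUBLE PRECISION
        );
    """)
    # Turn the table into a TimescaleDB hypertable partitioned on time.
    cur.execute("SELECT create_hypertable('transactions', 'time', if_not_exists => TRUE);")
conn.close()
```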
Grafana is used to visualize real-time fraud metrics. Once the system is running, you can access the Grafana dashboard at:
http://localhost:3000
Login credentials (default):

- Username: `admin`
- Password: `admin`
Navigate to Dashboards and open the 'Fraud Detection Dashboard'.
Here, you'll be able to monitor transaction statistics and detect fraud trends over a 10-minute period.
Note: you may need to refresh the dashboard page once when you first open it; after that, it updates automatically every 5 seconds.
If you want to retrain the fraud detection model, follow these steps:
1. Train the Model: The `model_build` directory contains the code to generate synthetic transaction data and train a machine learning model. You can use the provided `ml_model.py` script to train a new model using synthetic data:

   ```bash
   cd model_build
   python ml_model.py
   ```

   This will generate a new model binary file, `fraud_detection_model.pkl`.

2. Update the Model: After training, move the new model binary from the `model_build` directory to the `data_processor` directory to replace the existing model used by the Spark Structured Streaming job:

   ```bash
   mv model_build/fraud_detection_model.pkl data_processor/fraud_detection_model.pkl
   ```

3. Restart the Spark Job: After updating the model, restart the Spark Structured Streaming job to load the updated model:

   ```bash
   docker-compose restart spark-worker
   ```
This process allows the system to use a new machine learning model for fraud detection.
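For reference, here is a minimal sketch of what a training script along these lines might do. It assumes scikit-learn and pandas; the column names are assumptions, and `ml_model.py` is the authoritative pipeline.

```python
# Minimal training sketch (assumes scikit-learn and pandas; column names are
# illustrative, the real pipeline is model_build/ml_model.py).
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv("synthetic_fraud_data.csv")
X = data.drop(columns=["is_fraud"])   # feature columns (assumed)
y = data["is_fraud"]                  # binary fraud label (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")

# Serialize the trained model for the Spark job to load.
with open("fraud_detection_model.pkl", "wb") as f:
    pickle.dump(model, f)
```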
```
fraud-detection-system/
├── data_producer/                 # Kafka producers (normal and hacker producers)
│   ├── Dockerfile
│   ├── hacker_transaction_producer.py
│   ├── normal_transaction_producer.py
│   ├── requirements.txt
│   └── wait-for-it.sh
├── data_processor/                # Spark Structured Streaming job
│   ├── Dockerfile.spark
│   ├── fraud_detection_model.pkl
│   └── fraud_detector.py          # The fraud detection script
├── model_build/                   # Machine learning model building scripts
│   ├── ml_model.py                # Script to train the model
│   ├── requirements.txt
│   ├── synthetic_fraud_data.csv
│   └── training_data_generation.py
├── monitoring/                    # TimescaleDB and Grafana configurations
│   ├── grafana-dashboard/
│   │   ├── dashboard-provider.yaml
│   │   ├── datasource.yaml
│   │   ├── Dockerfile             # Grafana setup
│   │   └── fraud_detection_dashboard.json
│   └── init.sql                   # SQL file to initialize TimescaleDB schema
├── venv/                          # Python virtual environment (optional)
├── .gitignore
├── docker-compose.yml             # Docker Compose configuration
├── LICENSE
└── README.md                      # Project documentation
```
- Kafka: Message broker to handle streaming data.
- Spark Structured Streaming: For real-time data processing.
- TimescaleDB: Time-series database built on PostgreSQL for storing transaction data.
- Grafana: Dashboard to visualize the real-time fraud detection metrics.
- Docker: Containerizes all components for easy deployment.
- Model Retraining Automation: Automate the process of moving the model binary to the `data_processor` directory and restarting the Spark job.
- Scaling: Introduce horizontal scaling to support larger transaction volumes.
- Enhanced Monitoring: Add more detailed metrics to Grafana for better insights into system performance.
This project is licensed under the MIT License.
Feel free to open issues or create pull requests to enhance the project.
For any inquiries, please contact [email protected].