[CCFD-RF] Credit Card Fraudulent Detection with Random Forest

This project detects fraudulent credit card transactions with a Random Forest, using Spark Structured Streaming.


In the code:

There are three ways to run CCFD-RF:

  1. Option 1: Run the job locally, reading from a file and writing to the console
  2. Option 2: Run the job locally, reading from a Kafka source and writing to a Kafka sink
  3. Option 3: Run the job on the SoftNet cluster, reading from HDFS and writing to HDFS

We recommend Option 2 because it is the easiest to test.
The attached code is set up for Option 2.
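
The three options differ only in the checkpoint location, the source, and the sink. As a rough summary, they can be collected in one small Scala structure. This is only a sketch: RunOptions, RunMode, Settings and settingsFor are hypothetical names that do not appear in the repository, while the paths and topic names are the ones quoted in this README.

```scala
// Hypothetical summary of the three run options described in this README.
// None of these names come from the project code.
object RunOptions {
  sealed trait RunMode
  case object LocalFile  extends RunMode // Option 1
  case object LocalKafka extends RunMode // Option 2
  case object Cluster    extends RunMode // Option 3

  // Where the job checkpoints, reads from, and writes to.
  final case class Settings(checkpoint: String, source: String, sink: String)

  def settingsFor(mode: RunMode): Settings = mode match {
    case LocalFile  => Settings("checkpoint_saves/", "dataset_source/", "console")
    case LocalKafka => Settings("checkpoint_saves/", "Kafka topic testSource", "Kafka topic testSink")
    case Cluster    => Settings("/user/vvittis", "/user/vvittis/numbers", "/user/vvittis/results")
  }
}
```

For example, RunOptions.settingsFor(RunOptions.LocalKafka) describes the recommended Option 2 setup.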

Configure SparkSession

Option 1 & 2 Run locally:

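In lines 25-30 [StructuredRandomForest]: Configure the SparkSession variable
    val spark = SparkSession.builder()
      .master("local[*]") // local mode; needed when not launched through spark-submit
      .config("spark.sql.streaming.checkpointLocation", "checkpoint_saves/")
      .getOrCreate()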

Option 3 Run on the cluster:

In lines 25-30 [StructuredRandomForest]: Configure the SparkSession variable
    val spark = SparkSession.builder()
      .config("spark.sql.streaming.checkpointLocation", "/user/vvittis")
      .getOrCreate()


Option 1 Read from file:

In lines 35-43 [StructuredRandomForest]: Read from Source
    val rawData = spark.readStream.text("dataset_source/")

Option 2 Read from kafka:

In lines 35-43 [StructuredRandomForest]: Read from Source
    val rawData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "testSource")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value AS STRING)")

Note: ZooKeeper and the Kafka broker must be running before you start the job.

Open two command-line windows and cd to “C:\kafka_2.12-2.3.0”
1st window
bin\windows\zookeeper-server-start.bat config\
2nd window
bin\windows\kafka-server-start.bat config\
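
With ZooKeeper and the broker running, the job still needs records on the source topic. One possible way to feed it is a small producer that sends each line of a text file to testSource. This is a sketch under assumptions, not part of the project: the object FeedTestSource, the helper producerProps, and the use of the kafka-clients library are all hypothetical choices, and the input file path is supplied by the caller. The broker address and topic name are the ones used in Option 2.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import scala.io.Source

// Hypothetical test feeder; requires the kafka-clients library on the classpath.
object FeedTestSource {
  // Producer settings for plain string keys and values.
  def producerProps(bootstrap: String): Properties = {
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap)
    props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props
  }

  // Usage: FeedTestSource <path-to-text-file>
  def main(args: Array[String]): Unit = {
    val producer = new KafkaProducer[String, String](producerProps("localhost:9092"))
    val input = Source.fromFile(args(0))
    try {
      // One Kafka record per input line, value only (no key).
      input.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("testSource", line))
      }
      producer.flush()
    } finally {
      input.close()
      producer.close()
    }
  }
}
```

Because the source is read with startingOffsets set to earliest, records sent before the job starts are still picked up on its first run.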

Option 3 Read from an HDFS file:

In lines 35-43 [StructuredRandomForest]: Read from Source
    val rawData = spark.readStream.text("/user/vvittis/numbers")

Note: /user/vvittis/numbers is a path to an HDFS folder


Option 1 Write to console:

In line 212 [StructuredRandomForest]: Write to Console
  val query = kafkaResult
      .option("truncate", "false")
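
The excerpt above shows only the truncate option. A complete console sink might look roughly like the sketch below. ConsoleSinkSketch and writeToConsole are hypothetical names, and the output mode is left at Spark's default (append) because this README does not show which mode the project sets.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper, not taken from the project code.
object ConsoleSinkSketch {
  // Prints every micro-batch of `result` to standard output.
  def writeToConsole(result: DataFrame): StreamingQuery =
    result.writeStream
      .format("console")
      .option("truncate", "false") // show full column values
      .start()
}
```

A caller would typically keep the returned query and block on it, for example with ConsoleSinkSketch.writeToConsole(kafkaResult).awaitTermination().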

Option 2 Write to kafka:

In lines 215-230 [StructuredRandomForest]: Write to Kafka sink
    val query = kafkaResult
      .selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "testSink")
      .start()
Option 3 Write to HDFS file:

In line 224-230 [StructuredRandomForest]: Write to HDFS sink
        val query = kafkaResult

Note: /user/vvittis/results is a path to an HDFS folder
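
The rest of the HDFS sink is not shown here. One way such a sink could be written is sketched below, assuming the result is stored as plain text files. FileSinkSketch and writeToFiles are hypothetical names, and the choice of the text format is an assumption rather than the project's confirmed setting. The sketch relies on the checkpoint location already configured on the SparkSession.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper, not taken from the project code.
object FileSinkSketch {
  // Appends `result` as text files under `dir` (for example an HDFS folder).
  // The text format accepts exactly one string column, hence the cast.
  def writeToFiles(result: DataFrame, dir: String): StreamingQuery =
    result.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("text")
      .option("path", dir)
      .start()
}
```

For Option 3 the call would be FileSinkSketch.writeToFiles(kafkaResult, "/user/vvittis/results"). The file sink supports only append output mode, which is also the default.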

RUN the project.

In Intellij

Step 1: Clone CCFD-RF: File > New > Project from Version Control...
Step 2: In the URL field, paste the repository URL.
        In the Directory field, choose your preferred directory.
Step 3: Click the build button or choose Build > Build Project
Step 4: Go to src > main > scala > StructuredRandomForest.scala and click Run
  • A typical console showing the state:

[Screenshot: console showing the state]

  • A typical console showing the output:

[Screenshot: console showing the output]

In Cluster

You will find the sbt folder

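Step 1: Run sbt assembly to create a .jar file
Step 2: Run
        spark-submit \
        --class StructuredRandomForest \
        --master yarn-client \
        --num-executors 10 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        /home/vvittis/StructuredRandomForest-assembly-0.1.jar
        Note: since Spark 2.0 the yarn-client master is deprecated; --master yarn --deploy-mode client is the equivalent form.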
  • A typical cluster view showing that each executor takes one Hoeffding Tree (HT) of the Random Forest.
  • This test was executed with 10 executors and 10 HT.

[Screenshot: each executor takes one Hoeffding Tree of the Random Forest]

Licensed under the MIT Licence.