# Top-Down Specialization on Apache Spark™

A proposed implementation of the top-down specialization (TDS) anonymization algorithm on Apache Spark.

Based on the following papers:

## Build

Run `sbt compile` to build the project.

## Package

Run the `sbt package` task for a local Spark build, and `sbt assembly` to produce a fat JAR for `spark-submit`.
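`sbt assembly` is provided by the sbt-assembly plugin. If your checkout does not already declare it, a minimal sketch for enabling it looks like the following (the plugin version here is an assumption; use whatever version the build expects):

```bash
# Assumption: sbt-assembly 0.14.x; adjust the version to match the build
echo 'addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")' >> project/plugins.sbt
sbt assembly   # produces target/scala-2.12/code-assembly-0.1.jar
```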

## Run

### Spark submit

#### Local

```bash
spark-submit --class TopDownSpecialization --master local[*] \
  target/scala-2.12/code-assembly-0.1.jar <pathToInputDataset> <k>
```

`<pathToInputDataset>` is the path to the input CSV and `<k>` is the desired anonymity threshold (the k in k-anonymity).
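For example, to anonymize the bundled adult dataset with k = 100 (the resource file name here is an assumption):

```bash
# Hypothetical file name for the bundled UCI adult dataset
spark-submit --class TopDownSpecialization --master local[*] \
  target/scala-2.12/code-assembly-0.1.jar src/main/resources/adult.csv 100
```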

#### Cluster

```bash
$SPARK_HOME/bin/spark-submit --deploy-mode cluster --master spark://$SPARK_MASTER_HOST:7077 \
  --class TopDownSpecialization --conf spark.sql.shuffle.partitions=$NUM_OF_CORES \
  code-assembly-0.1.jar /home/student/adult-10M.csv 100
```
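The command above assumes a standalone cluster is already running and that a few environment variables are set; a minimal sketch, with all values hypothetical:

```bash
export SPARK_HOME=$HOME/spark-2.4.2-bin-hadoop2.7   # where the Spark tarball was extracted
export SPARK_MASTER_HOST=192.168.1.10               # hypothetical master address
export NUM_OF_CORES=16                              # total cores across the workers
```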

## Cluster Installation

```bash
sudo apt update -y

# Install Java
sudo apt install default-jre -y
sudo apt install default-jdk -y
echo "JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" | sudo tee -a /etc/environment
source /etc/environment

# Install Scala
sudo apt-get remove scala-library scala
wget www.scala-lang.org/files/archive/scala-2.12.10.deb
sudo dpkg -i scala-2.12.10.deb
sudo apt-get update
sudo apt-get install scala

# Install Spark
wget https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz && tar xvf spark-2.4.2-bin-hadoop2.7.tgz
rm spark-2.4.2-bin-hadoop2.7.tgz
rm scala-2.12.10.deb

# If necessary, downgrade Java to version 8 (Spark 2.4.2 does not fully support Java 11)
sudo apt install openjdk-8-jdk -y
sudo update-alternatives --config java
```
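The steps above install the software but do not start the cluster. A minimal sketch for bringing up a standalone master and workers with Spark's bundled scripts, assuming `SPARK_HOME` points at the extracted tarball as above:

```bash
# On the master node
$SPARK_HOME/sbin/start-master.sh

# On each worker node, pointing at the master
$SPARK_HOME/sbin/start-slave.sh spark://$SPARK_MASTER_HOST:7077
```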

## Dataset expansion

The dataset in the `src/main/resources` folder comes from the University of California, Irvine's Center for Machine Learning and Intelligent Systems. However, it has only 32,561 rows, which was not enough to carry out the required performance tests, so an `ExpandDataset` Scala application is provided in the `src/main/scala` folder. To run the expansion:

```bash
$SPARK_HOME/bin/spark-submit --deploy-mode cluster --master spark://$SPARK_MASTER_HOST:7077 \
  --class ExpandDataset --conf spark.sql.shuffle.partitions=16 --conf spark.driver.memory=12g \
  code-assembly-0.1.jar /path/to/output/folder 32561 $TARGET_NUM_ROWS
```
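For example, to produce the roughly 10M-row input used in the cluster run above (the output path here is hypothetical; the argument order follows the template):

```bash
# Hypothetical values: expand the 32,561-row source to 10,000,000 rows
$SPARK_HOME/bin/spark-submit --deploy-mode cluster --master spark://$SPARK_MASTER_HOST:7077 \
  --class ExpandDataset --conf spark.sql.shuffle.partitions=16 --conf spark.driver.memory=12g \
  code-assembly-0.1.jar /home/student/expanded 32561 10000000
```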

## Structure of Project

- The source files are in `src/main/scala`, which contains two classes: the main algorithm (`TopDownSpecialization`) and the dataset-expansion utility (`ExpandDataset`).
- The `src/main/resources` folder contains the basic adult dataset as well as the taxonomy tree in JSON format.
- Unit tests are in `src/test/scala`.
- Performance-evaluation experiments are recorded in an Excel spreadsheet with three tabs at `/data/experiments.xlsx`.

## Useful documentation

- [Submitting Applications](https://spark.apache.org/docs/latest/submitting-applications.html)
- [Installing Spark Standalone to a Cluster](https://spark.apache.org/docs/latest/spark-standalone.html#installing-spark-standalone-to-a-cluster)