CAM2DistributedBackend

A distributed back-end for the CAM2 project using PySpark and HDFS.

Architecture

>> A visual presentation of the architecture, alongside the accompanying projects, is available here.

The distributed back-end uses Apache Spark for processing and the Apache Hadoop Distributed File System (HDFS) for storage. On one side, the Spark Standalone cluster consists of one master for coordination and several slaves that carry out the actual computation. On the other side, the HDFS cluster consists of one namenode that stores metadata and several datanodes that store the actual data.

The recommended cluster setup is to run the Spark master daemon and the HDFS namenode daemon on one node (manager node). Similarly, it is recommended to run the Spark slave daemon and the HDFS datanode daemon on the other nodes (worker nodes).
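As a sketch of how this layout maps onto the stock daemon scripts that ship with Spark 2.x and Hadoop 2.x (illustrative only; the CAM2StartManager and CAM2StartWorker commands described below wrap this for you, and spark://manager_host:7077 is the default Spark master URL with a placeholder host):

```shell
# Print which stock daemon-start scripts belong on each node type.
daemons_for() {
  case "$1" in
    manager)  # Spark master (coordination) + HDFS namenode (metadata)
      echo 'start-master.sh'
      echo 'hadoop-daemon.sh start namenode'
      ;;
    worker)   # Spark slave (computation) + HDFS datanode (storage)
      echo 'start-slave.sh spark://manager_host:7077'
      echo 'hadoop-daemon.sh start datanode'
      ;;
  esac
}

daemons_for manager
daemons_for worker
```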

Requirements

Requisite software

Java

  • Download and extract a release (tested with Java 8).

Apache Hadoop

  • Make sure the requisite software is installed. For example, on Debian-based Linux, issue the command:
    sudo apt install ssh pdsh
  • Download and extract a release (tested with 2.7.4).
  • Prepare its configuration in the etc/hadoop/ directory as follows:
    • In etc/hadoop/hadoop-env.sh find the line export JAVA_HOME=${JAVA_HOME} and change it to point to where Java resides.
    • In etc/hadoop/core-site.xml add the following property to the configuration tag:
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode_url:9000</value>
      </property>
      where namenode_url is the namenode host name or IP (must be reachable from the datanodes).
    • In etc/hadoop/hdfs-site.xml add the following property to the configuration tag:
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
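Put together, minimal versions of the two files might look as follows (namenode_url stays a placeholder for your namenode host; note that a dfs.replication of 1 stores each block on a single datanode, so there is no redundancy even on a multi-node cluster):

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_url:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```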

For more details, check the docs.

Apache Spark

  • Download and extract a release (tested with 2.2.0).

CAM2Environment

Create a file ~/CAM2Environment and use it to set the environment variables JAVA_HOME, HADOOP_HOME and SPARK_HOME to point to where each piece of software resides:

JAVA_HOME=/path/to/java
HADOOP_HOME=/path/to/hadoop
SPARK_HOME=/path/to/spark

Alternatively, set the environment variables manually and make sure they are available for the upcoming scripts.
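As a sketch of what such a file looks like in use, the variables can be picked up by sourcing it with auto-export enabled (a temporary path is used here so the example does not touch a real ~/CAM2Environment):

```shell
# Write an example environment file (the real one lives at ~/CAM2Environment).
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
JAVA_HOME=/path/to/java
HADOOP_HOME=/path/to/hadoop
SPARK_HOME=/path/to/spark
EOF

# 'set -a' exports every variable assigned while the file is sourced.
set -a
. "$env_file"
set +a

rm "$env_file"
echo "SPARK_HOME is $SPARK_HOME"
```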

Installation

Using pip:

pip install git+https://github.com/muhammad-alaref/CAM2DistributedBackend

Start the cluster

On the manager node:

CAM2StartManager

On the worker nodes:

CAM2StartWorker manager_host maximum_concurrent_tasks

where manager_host is the manager host name or IP (must be reachable from the workers) and maximum_concurrent_tasks is the maximum number of tasks (cameras) assigned to this worker concurrently.

Stop the cluster

On the worker nodes:

CAM2StopWorker

On the manager node:

CAM2StopManager

>> Note the reversed order.

Local setup

Issue the manager and worker commands above on the same machine, with manager_host=localhost.

Usage

The recommended way is to use the RESTful API project.
Alternatively, the CAM2DistributedBackend command can be used on any node in the cluster (preferably the manager node); its parameters are explained by CAM2DistributedBackend --help.
