A distributed back-end for the CAM2 project using PySpark and HDFS.
>> A visual presentation of the architecture, alongside the accompanying projects, is available here.
The distributed back-end uses Apache Spark for processing and the Apache Hadoop Distributed File System (HDFS) for storage. On one side, the Spark Standalone cluster consists of one master for coordination and several slaves that carry out the actual computation. On the other side, the HDFS cluster consists of one namenode that stores metadata and several datanodes that provide the actual data storage.
The recommended cluster setup is to run the Spark master daemon and the HDFS namenode daemon on one node (the manager node), and to run the Spark slave daemon and the HDFS datanode daemon on the remaining nodes (the worker nodes).
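For orientation, here is a sketch of what that layout amounts to if the daemons were started by hand with the stock Hadoop 2.x and Spark scripts. The `CAM2StartManager` and `CAM2StartWorker` commands described below automate this, so the sketch is illustrative only; `manager_host` is a placeholder for the manager's host name or IP, and default ports are assumed.

```sh
# On the manager node: HDFS namenode + Spark master
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$SPARK_HOME/sbin/start-master.sh

# On each worker node: HDFS datanode + Spark slave pointing at the master
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$SPARK_HOME/sbin/start-slave.sh spark://manager_host:7077
```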
- Download and extract a Java release (tested with Java 8).
- Make sure the requisite software is installed. For example, on Debian-based Linux, issue the command:

  ```sh
  sudo apt install ssh pdsh
  ```
- Download and extract a Hadoop release (tested with Hadoop 2.7.4).
- Prepare its configuration in the `etc/hadoop/` directory as follows:
  - In `etc/hadoop/hadoop-env.sh`, find the line `export JAVA_HOME=${JAVA_HOME}` and change it to point to where Java resides.
  - In `etc/hadoop/core-site.xml`, add the following property to the `configuration` tag, where `namenode_url` is the namenode host name or IP (it must be reachable from the datanodes); a complete minimal file is shown after this list:

    ```xml
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode_url:9000</value>
    </property>
    ```

  - In `etc/hadoop/hdfs-site.xml`, add the following property to the `configuration` tag:

    ```xml
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    ```
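As referenced above, here is what a complete minimal `etc/hadoop/core-site.xml` might look like once the property is in place; `namenode_url` is still the placeholder to substitute:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal core-site.xml: only the default file system is set here -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_url:9000</value>
  </property>
</configuration>
```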
For more details, check the docs.
- Download and extract a Spark release (tested with Spark 2.2.0).
Create a file `~/CAM2Environment` and use it to set the environment variables `JAVA_HOME`, `HADOOP_HOME`, and `SPARK_HOME` to point to where each piece of software resides:

```sh
JAVA_HOME=/path/to/java
HADOOP_HOME=/path/to/hadoop
SPARK_HOME=/path/to/spark
```
Alternatively, set the environment variables manually and make sure they are available for the upcoming scripts.
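For example, a manual setup might export the variables from the shell profile of the user that runs the scripts; the paths are placeholders, exactly as above:

```sh
# Assumption: when ~/CAM2Environment is not used, the scripts read these
# variables from the environment; substitute the real installation paths.
export JAVA_HOME=/path/to/java
export HADOOP_HOME=/path/to/hadoop
export SPARK_HOME=/path/to/spark
```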
Using pip:

```sh
pip install git+https://github.com/muhammad-alaref/CAM2DistributedBackend
```
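After installation, a quick sanity check is to confirm that pip placed the console scripts on the `PATH`; the script names are the ones used in the steps below:

```sh
# Prints the path of each script, or fails if the install did not register them
command -v CAM2StartManager CAM2StartWorker CAM2StopManager CAM2StopWorker
```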
To start the back-end, on the manager node:

```sh
CAM2StartManager
```

Then, on the worker nodes:

```sh
CAM2StartWorker manager_host maximum_concurrent_tasks
```

where `manager_host` is the manager host name or IP (it must be reachable from the workers) and `maximum_concurrent_tasks` is the maximum number of tasks (cameras) assigned to this worker concurrently.
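For example, a worker pointed at a hypothetical manager at `10.0.0.1` and limited to four concurrent cameras would be started as:

```sh
# 10.0.0.1 and 4 are example values only
CAM2StartWorker 10.0.0.1 4
```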
To stop the back-end, on the worker nodes:

```sh
CAM2StopWorker
```

then on the manager node:

```sh
CAM2StopManager
```

>> Note the reversed order: workers are stopped before the manager.
To run everything on a single machine, issue the previous (manager and worker) commands on the same machine with `manager_host=localhost`.
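A minimal single-machine sketch, where the limit of two concurrent tasks is an arbitrary example value:

```sh
CAM2StartManager
CAM2StartWorker localhost 2
```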
The recommended way to use the back-end is through the RESTful API project.
Alternatively, the `CAM2DistributedBackend` command can be used on any node of the cluster (preferably the manager node), with its parameters explained by the `CAM2DistributedBackend --help` command.
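For instance, to inspect the accepted parameters before submitting anything:

```sh
CAM2DistributedBackend --help
```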