This is an example of the GPU-accelerated PCA algorithm running on Spark.
Please refer to the README in the spark-rapids-ml GitHub repository for build instructions and API usage.
Users can also download the release jar from Maven Central. Due to incompatible CUDA libraries, we provide two jars for different CUDA environments:
- For CUDA 11.0: `rapids-4-spark-ml_2.12-21.10.0-cuda11.jar`
- For CUDA 11.1 to CUDA 11.4: `rapids-4-spark-ml_2.12-21.10.0-cuda11-2.jar`
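For example, once downloaded, the jar can be attached to a Spark shell session as sketched below (the CUDA 11.0 jar is shown; full GPU configuration flags are omitted here):

```bash
# Attach the release jar to a spark-shell session.
# CUDA 11.0 jar shown; use the cuda11-2 jar for CUDA 11.1-11.4 environments.
spark-shell --jars rapids-4-spark-ml_2.12-21.10.0-cuda11.jar
```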
You can find sample Scala code in `main.scala`. The sample code generates random data with 2048 feature dimensions and then uses PCA to reduce the number of features to 3.
Just copy the sample code into a spark-shell launched according to this section, and the REPL will print the algorithm's results.
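For reference, here is a minimal sketch of what such code does, written against the stock Spark MLlib `PCA` API for illustration; the GPU-accelerated PCA in this repository exposes a similar interface, but see `main.scala` and the README for the exact import and API. The row count below is illustrative.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val dim  = 2048 // feature dimensions, as in main.scala
val rows = 1000 // illustrative row count

// Generate a DataFrame of random dense vectors with 2048 features each.
val df = spark.sparkContext
  .parallelize(0 until rows)
  .map(_ => Tuple1(Vectors.dense(Array.fill(dim)(math.random))))
  .toDF("features")

// Fit PCA and reduce the 2048 features to 3 principal components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pca_features")
  .setK(3)

val model = pca.fit(df)
model.transform(df).select("pca_features").show(5, truncate = false)
```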
Apache Toree is required to run the PCA sample code in a Jupyter Notebook.
It is assumed that a standalone Spark cluster has been set up and that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined, pointing to the Spark master URL (e.g. `spark://localhost:7077`) and the Apache Spark home directory, respectively.
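For example (the values below are placeholders; adjust them to your cluster):

```bash
# Placeholder values; point these at your actual Spark installation and master.
export SPARK_HOME=/opt/spark
export SPARK_MASTER=spark://localhost:7077
```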
- Make sure you have Jupyter Notebook and sbt installed first.
- Build Toree locally to support Scala 2.12, and install it:

  ```bash
  # Download Toree
  wget https://github.com/apache/incubator-toree/archive/refs/tags/v0.5.0-incubating-rc4.tar.gz
  tar -xvzf v0.5.0-incubating-rc4.tar.gz

  # Build the Toree pip package
  cd incubator-toree-0.5.0-incubating-rc4
  make pip-release

  # Install Toree
  pip install dist/toree-pip/toree-0.5.0.tar.gz
  ```
- Install a new kernel with the jar built from the section *Build* (use `$RAPIDS_ML_JAR` to refer to it), and launch:

  ```bash
  jupyter toree install \
    --spark_home=${SPARK_HOME} \
    --user \
    --toree_opts='--nosparkcontext' \
    --kernel_name="spark-rapids-ml-pca" \
    --spark_opts='--master ${SPARK_MASTER} \
      --jars ${RAPIDS_ML_JAR} \
      --conf spark.driver.memory=10G \
      --conf spark.executor.memory=10G \
      --conf spark.executor.heartbeatInterval=20s \
      --conf spark.executor.extraClassPath=${RAPIDS_ML_JAR} \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
      --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
  ```
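To confirm the kernel was registered, you can list the installed kernels (a standard Jupyter command):

```bash
# "spark-rapids-ml-pca" should appear in the output.
jupyter kernelspec list
```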
Launch the notebook:

```bash
jupyter notebook
```

Then choose "spark-rapids-ml-pca" as your notebook kernel.
We also provide a spark-submit way to run the PCA example. We suggest using the provided Dockerfile to get a clean running environment:
```bash
docker build -f Dockerfile -t nvspark/pca:0.1 .
```
Then start a container from this image (`nvidia-docker` is required since we will use the GPU):

```bash
nvidia-docker run -it nvspark/pca:0.1 bash
```
This Docker image assumes the machine has 2 GPUs. If that does not match your environment, please modify `-Dspark.worker.resource.gpu.amount` in `spark-env.sh` according to your actual setup.
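For reference, a sketch of the relevant `spark-env.sh` line, assuming the worker options are set via `SPARK_WORKER_OPTS` (the discovery-script path reuses the one from the Toree launch options above and may differ in the image):

```bash
# Set gpu.amount to the number of GPUs available on the worker machine.
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=2 \
  -Dspark.worker.resource.gpu.discoveryScript=${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh"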
Then just start the standalone Spark cluster and submit the job:

```bash
./start-spark.sh
./spark-submit.sh
```
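For orientation, here is a hypothetical sketch of the kind of spark-submit invocation such a script wraps; the main class and application jar names below are placeholders, not the actual contents of `spark-submit.sh`:

```bash
# Placeholder class and jar names; see spark-submit.sh in the image for the real values.
${SPARK_HOME}/bin/spark-submit \
  --master ${SPARK_MASTER} \
  --jars ${RAPIDS_ML_JAR} \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.extraClassPath=${RAPIDS_ML_JAR} \
  --class com.example.PCAExample \
  pca-example.jar
```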