# PCA example

This is an example of the GPU-accelerated PCA algorithm running on Spark.

## Build

Please refer to the README in the spark-rapids-ml GitHub repository for build instructions and API usage.

## Get jars from Maven Central

You can also download the release jars from Maven Central. Because of incompatible CUDA libraries, we provide two jars for different CUDA environments:

- For CUDA 11.0: `rapids-4-spark-ml_2.12-21.10.0-cuda11.jar`
- For CUDA 11.1 to 11.4: `rapids-4-spark-ml_2.12-21.10.0-cuda11-2.jar`
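If you manage dependencies with sbt instead of downloading the jar by hand, the coordinates would look roughly like the sketch below. The `com.nvidia` group ID is an assumption on our part, so verify the exact coordinates on Maven Central before relying on them:

```scala
// build.sbt -- a minimal sketch; the "com.nvidia" group ID is assumed,
// so verify the coordinates on Maven Central first.
libraryDependencies +=
  "com.nvidia" % "rapids-4-spark-ml_2.12" % "21.10.0" classifier "cuda11"
```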

## Sample code

You can find the sample Scala code in main.scala. The sample generates random data with 2048 feature dimensions, then uses PCA to reduce the number of features to 3.

Just copy the sample code into a spark-shell launched according to this section, and the REPL will print the algorithm's results.
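For orientation, here is a minimal sketch of what such a run looks like in the REPL. The `com.nvidia.spark.ml.feature.PCA` class name is an assumption based on the spark-rapids-ml repository; treat main.scala as the authoritative reference.

```scala
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random

// Generate random rows with 2048 feature dimensions, as main.scala does.
val dim  = 2048
val data = (0 until 1000).map(_ => Tuple1(Vectors.dense(Array.fill(dim)(Random.nextDouble()))))
val df   = spark.createDataFrame(data).toDF("features") // `spark` is provided by spark-shell

// GPU-accelerated PCA reducing the 2048 features down to 3.
// The class name is assumed from the spark-rapids-ml repo.
val pca = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("features")
  .setOutputCol("pca_features")
  .setK(3)

val model = pca.fit(df)
model.transform(df).show(5, truncate = false)
```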

## Notebook

Apache Toree is required to run the PCA sample code in a Jupyter notebook.

It is assumed that a standalone Spark cluster has been set up and that the SPARK_MASTER and SPARK_HOME environment variables are defined, pointing to the Spark master URL (e.g. spark://localhost:7077) and the Apache Spark home directory, respectively.

  1. Make sure you have Jupyter Notebook and sbt installed first.

  2. Build Toree locally to support Scala 2.12, and install it.

    # Download toree
    wget https://github.com/apache/incubator-toree/archive/refs/tags/v0.5.0-incubating-rc4.tar.gz
    
    tar -xvzf v0.5.0-incubating-rc4.tar.gz
    
    # Build the Toree pip package.
    cd incubator-toree-0.5.0-incubating-rc4
    make pip-release
    
    # Install Toree
    pip install dist/toree-pip/toree-0.5.0.tar.gz
  3. Install a new kernel with the jar built in the Build section (referred to below as $RAPIDS_ML_JAR), and launch it:

    jupyter toree install \
      --spark_home=${SPARK_HOME} \
      --user \
      --toree_opts='--nosparkcontext' \
      --kernel_name="spark-rapids-ml-pca" \
      --spark_opts="--master ${SPARK_MASTER} \
        --jars ${RAPIDS_ML_JAR} \
        --conf spark.driver.memory=10G \
        --conf spark.executor.memory=10G \
        --conf spark.executor.heartbeatInterval=20s \
        --conf spark.executor.extraClassPath=${RAPIDS_ML_JAR} \
        --conf spark.executor.resource.gpu.amount=1 \
        --conf spark.task.resource.gpu.amount=1 \
        --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
        --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh"

    Launch the notebook:

    jupyter notebook

    Please choose "spark-rapids-ml-pca" as your notebook kernel.
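Note that the kernel was installed with --toree_opts='--nosparkcontext', so the notebook does not create a Spark session for you. A first cell along the following lines (a plain use of the standard SparkSession API) can set one up:

```scala
import org.apache.spark.sql.SparkSession

// With --nosparkcontext, Toree leaves session creation to the user,
// so build the session in the first notebook cell.
val spark = SparkSession.builder()
  .appName("spark-rapids-ml-pca")
  .getOrCreate()
```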

## Submit app jar

We also provide a spark-submit way to run the PCA example. We suggest using the Dockerfile to get a clean running environment:

docker build -f Dockerfile -t nvspark/pca:0.1 .

Then enter a container of this image (nvidia-docker is required, since the job uses GPUs):

nvidia-docker run -it nvspark/pca:0.1 bash

This Docker image assumes the machine has 2 GPUs. If that does not match your environment, modify -Dspark.worker.resource.gpu.amount in spark-env.sh accordingly.

Then just start the standalone Spark cluster and submit the job:

./start-spark.sh
./spark-submit.sh