spark-rapids-examples

A repo for Spark related utilities and examples using the Rapids Accelerator, including ETL, ML/DL, etc.

It includes utilities related to running Spark with the Rapids Accelerator, docs, and example applications that demonstrate the RAPIDS.ai GPU-accelerated Spark and Spark-ML (PCA algorithm) projects. Please see the Rapids Accelerator for Spark documentation for supported Spark versions and requirements. It is recommended to set up the Spark cluster with JDK 8.

Utilities and Examples

1. Xgboost examples

2. Microbenchmarks

3. TensorFlow training on Horovod Spark example

4. Spark-ML examples

5. NVIDIA GPU Plugin for YARN with MIG support

Getting Started Guides

1. Xgboost examples guide

Try one of the "Getting Started Guides" below. Please note that they target the Mortgage dataset as written, but with a few changes to EXAMPLE_CLASS and dataPath, they can be easily adapted to the Taxi or Agaricus datasets.

You can find small datasets for each example in the datasets folder. These datasets are provided only for convenience. To test performance, please prepare a larger dataset by following Preparing Datasets via Notebook. We also provide a larger Mortgage dataset (1 GB uncompressed), which is used in the guides below.

These examples use default parameters for demo purposes. For a full list, please see "Supported Parameters" for Scala or Python.
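As a hedged sketch of how such an adaptation might look, the snippet below swaps the example class and data path via environment variables. The class name, jar name, flag syntax, and paths are placeholders for illustration, not the repo's exact values; please take the real ones from the getting-started guides.

```shell
# Hypothetical sketch only: class, jar, and paths below are placeholders.
# Swap EXAMPLE_CLASS and dataPath to switch from Mortgage to Taxi or Agaricus.
export EXAMPLE_CLASS=com.example.mortgage.Main   # placeholder class name
export dataPath=/data/mortgage                   # placeholder dataset location

spark-submit \
  --class ${EXAMPLE_CLASS} \
  sample_xgboost_apps.jar \
  -trainDataPath=${dataPath}/train \
  -evalDataPath=${dataPath}/eval
```

The point is that only the class and the dataset paths change between the three datasets; the rest of the submission stays the same.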

2. Microbenchmark guide

The microbenchmarks for the RAPIDS Accelerator for Apache Spark identify, test, and analyze the queries that benefit most from GPU acceleration. The queries are based on several TPC-DS tables in Parquet format, with Double replacing Decimal, so that similar speedups can be reproduced by others. The microbenchmarks include commonly used Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, and run the same queries in CPU mode and GPU mode. Some queries involve data skew. Each query is highly tuned and works with an optimal configuration on an 8-node Spark standalone cluster with 128 CPU cores and 1 A100 GPU on each node.

You can generate the Parquet format dataset using this Databricks Tool. All the queries run on the SF3000 (scale factor 3000, i.e. 3 TB) dataset. You can generate it with the following command:

```shell
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /databricks-tpcds-kit-path -s 3000G -l /your-dataset-path -f parquet"
```

You will see that the RAPIDS Accelerator for Apache Spark can give speedups of up to 10x over the CPU, and in some cases up to 80x. It is easy to compare the microbenchmarks on CPU and GPU side by side. Some queries run faster on a second execution; this can be caused by many factors, such as JVM JIT compilation, initialization overhead, or input data being cached in the OS page cache. Comparing cold and warm runs gives a clear, visual impression of where the improvement comes from. The observed speedup depends on many conditions, including the dataset's scale factor and the GPU model. If the application runs for too long or even fails, run the queries on a smaller dataset.
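One common way to run the same query in both modes (a sketch of the usual RAPIDS setup, not a prescription from this repo; the jar path and application script are placeholders) is to toggle the accelerator through Spark configuration:

```shell
# Sketch: toggling the RAPIDS Accelerator per run. Jar path and script are placeholders.
# GPU run: load the plugin and enable SQL acceleration.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --jars rapids-4-spark.jar \
  microbenchmark_query.py

# CPU baseline: identical submission with GPU acceleration disabled.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=false \
  --jars rapids-4-spark.jar \
  microbenchmark_query.py
```

Keeping everything except `spark.rapids.sql.enabled` identical makes the CPU/GPU comparison fair.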

3. TensorFlow training on Horovod Spark example guide

Please follow the README guide here: README

4. PCA example guide

Please follow the README guide here: README

API

1. Xgboost examples API

Troubleshooting

Contributing

See the Contributing guide.

Contact Us

Please see the RAPIDS website for contact information.

License

This content is licensed under the Apache License 2.0.
