A repo for Spark-related utilities and examples using the RAPIDS Accelerator, including ETL, ML/DL, and more.
It includes utilities for running Spark with the RAPIDS Accelerator, along with docs and example applications that demonstrate the RAPIDS.ai GPU-accelerated Spark and Spark ML (PCA algorithm) projects. Please see the RAPIDS Accelerator for Apache Spark documentation for supported Spark versions and requirements. It is recommended to set up the Spark cluster with JDK8.
- Criteo: Python
- PCA: Scala
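As a minimal sketch of launching Spark with the accelerator enabled, the invocation below shows the typical plugin configuration; the jar file name, version placeholder, and settings are assumptions to adapt to your environment, and the RAPIDS Accelerator documentation is the authoritative reference:

    spark-shell --jars rapids-4-spark_2.12-<version>.jar \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.rapids.sql.enabled=true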
Try one of the "Getting Started Guides" below. Please note that they target the Mortgage dataset as written, but with a few changes to EXAMPLE_CLASS and dataPath, they can be easily adapted to the Taxi or Agaricus datasets.
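For example, a submission for the Mortgage example might look like the sketch below; the class name, jar name, and argument spellings are illustrative assumptions, and each getting-started guide lists the exact values to use:

    # Illustrative only; see the getting-started guides for the exact class, jar, and arguments.
    spark-submit --class com.nvidia.spark.examples.mortgage.GPUMain \
      sample_xgboost_apps-<version>-jar-with-dependencies.jar \
      -dataPath=train::/your-path/mortgage/train \
      -dataPath=trans::/your-path/mortgage/eval

Swapping the class (EXAMPLE_CLASS) and the dataPath arguments is what adapts the run to the Taxi or Agaricus datasets.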
You can find a small dataset for each example in the datasets folder. These datasets are only provided for convenience. In order to test performance, please prepare a larger dataset by following Preparing Datasets via Notebook. We also provide a larger dataset: Mortgage Dataset (1 GB uncompressed), which is used in the guides below.
- Prepare packages and dataset
- Getting started on on-premises clusters
- Getting started on cloud service providers
- Amazon AWS
- Databricks
- Getting started for Jupyter Notebook applications
These examples use default parameters for demo purposes. For a full list, please see "Supported Parameters" for Scala or Python.
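As a rough illustration of overriding the defaults, the sketch below sets a few common XGBoost training parameters through the standard xgboost4j-spark API; the parameter values and column names are assumptions, not the examples' actual defaults:

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    // Assumed values for illustration; see "Supported Parameters" for the real list.
    val classifier = new XGBoostClassifier(Map(
        "eta" -> 0.1,
        "max_depth" -> 8,
        "num_round" -> 100,
        "tree_method" -> "gpu_hist"
      ))
      .setLabelCol("label")
      .setFeaturesCol("features")
    // val model = classifier.fit(trainDf)   // trainDf is a placeholder DataFrame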
The microbenchmark on the RAPIDS Accelerator for Apache Spark is meant to identify, test, and analyze the queries that benefit most from GPU acceleration. The queries are based on several tables in TPC-DS parquet format with Double replacing Decimal, so that similar speedups can be reproduced by others. The microbenchmark includes commonly used Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, and runs the same queries in CPU mode and GPU mode. Some queries involve data skew. Each query is highly tuned and works with the optimal configuration on an 8-node Spark standalone cluster with 128 CPU cores and 1 A100 GPU on each node.
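A common way to compare CPU mode and GPU mode on the same query (a minimal sketch, assuming the RAPIDS Accelerator plugin is already loaded and `query` is a placeholder SQL string) is to toggle spark.rapids.sql.enabled between runs:

    // CPU mode: temporarily disable the RAPIDS SQL plugin
    spark.conf.set("spark.rapids.sql.enabled", "false")
    spark.sql(query).collect()

    // GPU mode: re-enable the plugin and run the same query
    spark.conf.set("spark.rapids.sql.enabled", "true")
    spark.sql(query).collect()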
You can generate the parquet format dataset using this Databricks Tool. All the queries run on the SF3000 (scale factor 3 TB) dataset. You can generate it with the following command:
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /databricks-tpcds-kit-path -s 3000G -l /your-dataset-path -f parquet"
You will see that the RAPIDS Accelerator for Apache Spark can give speedups of up to 10x over the CPU, and in some cases up to 80x. It is easy to compare the microbenchmarks on CPU and GPU side by side. You may notice that some queries are faster on the second run; this can be caused by many factors, such as JVM JIT compilation, initialization overhead, or the input data being cached in the OS page cache. Comparing a cold first run with a warm subsequent run gives a clear, visual impression of the performance improvement. The improvement depends on many conditions, including the dataset's scale factor and the GPU card. If the application runs for too long or even fails, you can run the queries on a smaller dataset.
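To record elapsed time for each run, a minimal timing helper like the following can be used (a sketch only, independent of any timing utilities the benchmark notebooks may provide; `q1` is a placeholder query string):

    // Minimal helper: run a block, force execution, and print wall-clock seconds.
    def time[T](name: String)(block: => T): T = {
      val start = System.nanoTime()
      val result = block
      println(f"$name finished in ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    time("q1 (GPU)") { spark.sql(q1).collect() }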
Please follow the README guide here: README
Please follow the README guide here: README
See the Contributing guide.
Please see the RAPIDS website for contact information.
This content is licensed under the Apache License 2.0