A repo for Spark-related utilities and examples using the RAPIDS Accelerator, including ETL, ML/DL, and more.
It includes utilities for running Spark with the RAPIDS Accelerator, along with docs and example applications that demonstrate the RAPIDS.ai GPU-accelerated Spark and Spark ML (PCA algorithm) projects. Please see the RAPIDS Accelerator for Apache Spark documentation for supported Spark versions and requirements. It is recommended to set up the Spark cluster with JDK8.
- Criteo: Python
- PCA: Scala
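As a minimal sketch of launching Spark with the accelerator enabled, the invocation below shows the typical plugin configuration; the jar file name, version placeholder, and settings are assumptions to adapt to your environment, and the RAPIDS Accelerator documentation is the authoritative reference:

    spark-shell --jars rapids-4-spark_2.12-<version>.jar \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.rapids.sql.enabled=true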
Try one of the "Getting Started Guides" below. Please note that they target the Mortgage dataset as written, but with a few changes to EXAMPLE_CLASS and dataPath, they can be easily adapted to the Taxi or Agaricus datasets.
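For example, a submission for the Mortgage example might look like the sketch below; the class name, jar name, and argument spellings are illustrative assumptions, and each getting-started guide lists the exact values to use:

    # Illustrative only; see the getting-started guides for the exact class, jar, and arguments.
    spark-submit --class com.nvidia.spark.examples.mortgage.GPUMain \
      sample_xgboost_apps-<version>-jar-with-dependencies.jar \
      -dataPath=train::/your-path/mortgage/train \
      -dataPath=trans::/your-path/mortgage/eval

Swapping the class (EXAMPLE_CLASS) and the dataPath arguments is what adapts the run to the Taxi or Agaricus datasets.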
You can find a small dataset for each example in the datasets folder. These datasets are only provided for convenience. In order to test performance, please prepare a larger dataset by following Preparing Datasets via Notebook. We also provide a larger dataset: Mortgage Dataset (1 GB uncompressed), which is used in the guides below.
- Prepare packages and dataset
- Getting started on on-premises clusters
- Getting started on cloud service providers
- Amazon AWS
- Databricks
- Getting started for Jupyter Notebook applications
These examples use default parameters for demo purposes. For a full list, please see "Supported Parameters" for Scala or Python.
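As a rough illustration of overriding the defaults, the sketch below sets a few common XGBoost training parameters through the standard xgboost4j-spark API; the parameter values and column names are assumptions, not the examples' actual defaults:

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

    // Assumed values for illustration; see "Supported Parameters" for the real list.
    val classifier = new XGBoostClassifier(Map(
        "eta" -> 0.1,
        "max_depth" -> 8,
        "num_round" -> 100,
        "tree_method" -> "gpu_hist"
      ))
      .setLabelCol("label")
      .setFeaturesCol("features")
    // val model = classifier.fit(trainDf)   // trainDf is a placeholder DataFrame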
The microbenchmark on the RAPIDS Accelerator for Apache Spark is meant to identify, test, and analyze the queries that benefit most from GPU acceleration. The queries are based on several tables in TPC-DS parquet format with Double replacing Decimal, so that similar speedups can be reproduced by others. The microbenchmark includes commonly used Spark SQL operations such as expand, hash aggregate, windowing, and cross joins, and runs the same queries in CPU mode and GPU mode. Some queries involve data skew. Each query is highly tuned and works with the optimal configuration on an 8-node Spark standalone cluster with 128 CPU cores and 1 A100 GPU on each node.
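A common way to compare CPU mode and GPU mode on the same query (a minimal sketch, assuming the RAPIDS Accelerator plugin is already loaded and `query` is a placeholder SQL string) is to toggle spark.rapids.sql.enabled between runs:

    // CPU mode: temporarily disable the RAPIDS SQL plugin
    spark.conf.set("spark.rapids.sql.enabled", "false")
    spark.sql(query).collect()

    // GPU mode: re-enable the plugin and run the same query
    spark.conf.set("spark.rapids.sql.enabled", "true")
    spark.sql(query).collect()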
You can generate the parquet format dataset using this Databricks Tool. All the queries run on the SF3000 (scale factor 3 TB) dataset. You can generate it with the following command:
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /databricks-tpcds-kit-path -s 3000G -l /your-dataset-path -f parquet"
You will see that the RAPIDS Accelerator for Apache Spark can give speedups of up to 10x over the CPU, and in some cases up to 80x. It is easy to compare the microbenchmarks on CPU and GPU side by side. You may notice that some queries are faster on the second run; this can be caused by many factors, such as JVM JIT compilation, initialization overhead, or the input data being cached in the OS page cache. Comparing a cold first run with a warm subsequent run gives a clear, visual impression of the performance improvement. The improvement depends on many conditions, including the dataset's scale factor and the GPU card. If the application runs for too long or even fails, you can run the queries on a smaller dataset.
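To record elapsed time for each run, a minimal timing helper like the following can be used (a sketch only, independent of any timing utilities the benchmark notebooks may provide; `q1` is a placeholder query string):

    // Minimal helper: run a block, force execution, and print wall-clock seconds.
    def time[T](name: String)(block: => T): T = {
      val start = System.nanoTime()
      val result = block
      println(f"$name finished in ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    time("q1 (GPU)") { spark.sql(q1).collect() }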
Please follow the README guide here: README
Please follow the README guide here: README
See the Contributing guide.
Please see the RAPIDS website for contact information.
This content is licensed under the Apache License 2.0