65 changes: 63 additions & 2 deletions docs/source/contributor-guide/benchmark-results/tpc-ds.md
@@ -19,8 +19,8 @@ under the License.

# Apache DataFusion Comet: Benchmarks Derived From TPC-DS

-The following benchmarks were performed on a two node Kubernetes cluster with
-data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments
+The following benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and
+data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments
and we encourage you to run these benchmarks in your own environments.

The tracking issue for improving TPC-DS performance is [#858](https://github.com/apache/datafusion-comet/issues/858).
@@ -43,3 +43,64 @@ The raw results of these benchmarks in JSON format are available here:

- [Spark](0.5.0/spark-tpcds.json)
- [Comet](0.5.0/comet-tpcds.json)
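
The per-query timings can be pulled straight out of these files. Below is a minimal sketch for listing the best time per query; it assumes, without having checked, that each file contains a `queries` object mapping query ids to lists of per-iteration runtimes in seconds, so the `jq` filter may need adjusting to the actual layout.

```shell
# Sketch only: best (minimum) runtime per query from one result file.
# Assumes an unverified layout: {"queries": {"1": [t1, t2, ...], ...}}.
jq -r '.queries | to_entries[] | "q\(.key)\t\(.value | min)"' 0.5.0/comet-tpcds.json
```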

# Scripts

Here are the scripts that were used to generate these results.
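
The relative paths (`tpcbench.py`, `../../tpcds/`) suggest these are run from the `runners/datafusion-comet` directory of the `datafusion-benchmarks` repository, with `SPARK_HOME`, `SPARK_MASTER`, and (for the Comet runs) `COMET_JAR` already set. A minimal setup sketch follows; the exported paths are placeholders and will differ on your machine.

```shell
# Setup sketch (not part of the original scripts); adjust paths to your environment.
git clone https://github.com/apache/datafusion-benchmarks.git
cd datafusion-benchmarks/runners/datafusion-comet

export SPARK_HOME=/opt/spark                # local Spark installation (assumed path)
export SPARK_MASTER=spark://localhost:7077  # standalone master the scripts submit to
export COMET_JAR=/path/to/comet.jar         # only needed for the Comet runs
```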

## Apache Spark

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.memory=32G \
--conf spark.executor.instances=2 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=16 \
--conf spark.eventLog.enabled=true \
tpcbench.py \
--benchmark tpcds \
--name spark \
--data /mnt/bigdata/tpcds/sf100/ \
--queries ../../tpcds/ \
--output . \
--iterations 5
```

## Apache Spark + Comet

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=16G \
--conf spark.executor.cores=8 \
--total-executor-cores=16 \
--conf spark.eventLog.enabled=true \
--conf spark.driver.maxResultSize=2G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=24g \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.comet.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.exec.replaceSortMergeJoin=false \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.comet.exec.shuffle.fallbackToColumnar=true \
--conf spark.comet.exec.shuffle.compression.codec=lz4 \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
tpcbench.py \
--name comet \
--benchmark tpcds \
--data /mnt/bigdata/tpcds/sf100/ \
--queries ../../tpcds/ \
--output . \
--iterations 5
```
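
Before launching a full run, it can be worth a quick check that the Comet plugin is actually picked up. The sketch below is not part of the benchmark setup: it runs a single `EXPLAIN` through `spark-sql` in local mode and greps for Comet operators, which typically appear in the physical plan with a `Comet` prefix (e.g. `CometScan`). The `store_sales` path is an assumed per-table directory under the data directory used above, and operator names may differ between versions.

```shell
# Rough sanity check (not part of the benchmark scripts).
$SPARK_HOME/bin/spark-sql \
  --jars $COMET_JAR \
  --conf spark.driver.extraClassPath=$COMET_JAR \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  -e "EXPLAIN SELECT COUNT(*) FROM parquet.\`/mnt/bigdata/tpcds/sf100/store_sales\`" \
  | grep -i comet
```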
71 changes: 67 additions & 4 deletions docs/source/contributor-guide/benchmark-results/tpc-h.md
@@ -25,21 +25,84 @@ and we encourage you to run these benchmarks in your own environments.

The tracking issue for improving TPC-H performance is [#391](https://github.com/apache/datafusion-comet/issues/391).

-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_allqueries.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_allqueries.png)

Here is a breakdown showing relative performance of Spark and Comet for each query.

-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_compare.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_compare.png)

The following chart shows how much Comet currently accelerates each query from the benchmark in relative terms.

-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_rel.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_rel.png)

The following chart shows how much Comet currently accelerates each query from the benchmark in absolute terms.

-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_abs.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_abs.png)

The raw results of these benchmarks in JSON format are available here:

- [Spark](0.5.0/spark-tpch.json)
- [Comet](0.5.0/comet-tpch.json)
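
A quick way to eyeball the per-query speedup from these two files is sketched below; as with the TPC-DS example, it assumes (unverified) a `queries` object mapping query ids to lists of per-iteration runtimes in seconds.

```shell
# Sketch only: best Spark time divided by best Comet time, per query.
# Assumes an unverified layout: {"queries": {"1": [t1, t2, ...], ...}}.
join \
  <(jq -r '.queries | to_entries[] | "\(.key) \(.value | min)"' 0.5.0/spark-tpch.json | sort) \
  <(jq -r '.queries | to_entries[] | "\(.key) \(.value | min)"' 0.5.0/comet-tpch.json | sort) \
  | awk '{ printf "q%s: %.2fx\n", $1, $2 / $3 }'
```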

# Scripts

Here are the scripts that were used to generate these results.

## Apache Spark

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.eventLog.enabled=true \
tpcbench.py \
--name spark \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--output . \
--iterations 5
```

## Apache Spark + Comet

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.eventLog.enabled=true \
--jars $COMET_JAR \
--driver-class-path $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.comet.exec.shuffle.fallbackToColumnar=true \
--conf spark.comet.exec.shuffle.compression.codec=lz4 \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
tpcbench.py \
--name comet \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--output . \
--iterations 5
```
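
Note that this TPC-H configuration sets `spark.comet.exec.replaceSortMergeJoin=true`, whereas the TPC-DS scripts leave it at `false`; the two benchmarks were run with different values for this setting.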
60 changes: 0 additions & 60 deletions docs/source/contributor-guide/benchmarking.md
@@ -24,66 +24,6 @@ benchmarking documentation and scripts are available in the [DataFusion Benchmar

We also have many micro benchmarks that can be run from an IDE located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark).

Here are example commands for running the benchmarks against a Spark cluster; they will need to be
adapted to the Spark environment and the location of the data files.

These commands are intended to be run from the `runners/datafusion-comet` directory in the `datafusion-benchmarks`
repository.

## Running Benchmarks Against Apache Spark

```shell
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=32G \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
tpcbench.py \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--iterations 3
```

## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled

### TPC-H

```shell
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=16G \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.comet.exec.shuffle.enableFastEncoding=true \
--conf spark.comet.exec.shuffle.fallbackToColumnar=true \
--conf spark.comet.exec.shuffle.compression.codec=lz4 \
tpcbench.py \
--benchmark tpch \
--data /mnt/bigdata/tpch/sf100/ \
--queries ../../tpch/queries \
--iterations 3
```

### TPC-DS

For TPC-DS, use `spark.comet.exec.replaceSortMergeJoin=false`.

## Current Benchmark Results

- [Benchmarks derived from TPC-H](benchmark-results/tpc-h)