diff --git a/README.md b/README.md
index c318b05354..1a6281a993 100644
--- a/README.md
+++ b/README.md
@@ -30,10 +30,12 @@ under the License.
 logo

 Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
-[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the
+[Apache DataFusion] query engine. Comet is designed to significantly enhance the
 performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
 Spark ecosystem without requiring any code changes.

+[Apache DataFusion]: https://datafusion.apache.org
+
 # Benefits of Using Comet

 ## Run Spark Queries at DataFusion Speeds
diff --git a/docs/source/_static/images/CometNativeExecution.drawio.png b/docs/source/_static/images/CometNativeExecution.drawio.png
deleted file mode 100644
index ba122a1f2c..0000000000
Binary files a/docs/source/_static/images/CometNativeExecution.drawio.png and /dev/null differ
diff --git a/docs/source/_static/images/CometNativeParquetReader.drawio b/docs/source/_static/images/CometNativeParquetReader.drawio
new file mode 100644
index 0000000000..0c7304eff9
--- /dev/null
+++ b/docs/source/_static/images/CometNativeParquetReader.drawio
@@ -0,0 +1,100 @@
+[draw.io XML source for the Comet native Parquet reader diagram; content not shown]
diff --git a/docs/source/_static/images/CometNativeParquetReader.drawio.svg b/docs/source/_static/images/CometNativeParquetReader.drawio.svg
new file mode 100644
index 0000000000..0c1f93c7b0
--- /dev/null
+++ b/docs/source/_static/images/CometNativeParquetReader.drawio.svg
@@ -0,0 +1,4 @@
+[SVG markup not shown. Diagram text labels: Spark Executor, JVM Code, Comet Parquet Reader, IO and Decompression, Native Code, Native Execution Plan, Parquet Decoding, Shuffle Files, executePlan(), CometExecIterator, next(), Spark Execution Logic, decode()]
\ No newline at end of file
diff --git a/docs/source/_static/images/CometNativeParquetScan.drawio.png b/docs/source/_static/images/CometNativeParquetScan.drawio.png
deleted file mode 100644
index 712cbae4bf..0000000000
Binary files a/docs/source/_static/images/CometNativeParquetScan.drawio.png and /dev/null differ
diff --git a/docs/source/_static/images/CometOverviewDetailed.drawio b/docs/source/_static/images/CometOverviewDetailed.drawio
new file mode 100644
index 0000000000..ff7f4c5911
--- /dev/null
+++ b/docs/source/_static/images/CometOverviewDetailed.drawio
@@ -0,0 +1,94 @@
+[draw.io XML source for the Comet overview diagram; content not shown]
diff --git a/docs/source/_static/images/CometOverviewDetailed.drawio.svg b/docs/source/_static/images/CometOverviewDetailed.drawio.svg
new file mode 100644
index 0000000000..0f29083b11
--- /dev/null
+++ b/docs/source/_static/images/CometOverviewDetailed.drawio.svg
@@ -0,0 +1,4 @@
+[SVG markup not shown. Diagram text labels: Spark Driver, Spark Executor, Spark Logical Plan, Spark Physical Plan, Comet Physical Plan, protobuf intermediate representation, Native Execution Plan, Shuffle Files]
\ No newline at end of file
diff --git a/docs/source/contributor-guide/plugin_overview.md b/docs/source/contributor-guide/plugin_overview.md
index c753829076..a211ca6b55 100644
--- a/docs/source/contributor-guide/plugin_overview.md
+++ b/docs/source/contributor-guide/plugin_overview.md
@@ -79,10 +79,10 @@ The leaf nodes in the physical plan are always `ScanExec` and these operators co
 prepared before the plan is executed. When `CometExecIterator` invokes `Native.executePlan` it passes the memory
 addresses of these Arrow arrays to the native code.

-![Diagram of Comet Native Execution](../../_static/images/CometNativeExecution.drawio.png)
+![Diagram of Comet Native Execution](../../_static/images/CometOverviewDetailed.drawio.svg)

 ## End to End Flow

 The following diagram shows the end-to-end flow.

-![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetScan.drawio.png)
+![Diagram of Comet Native Parquet Scan](../../_static/images/CometNativeParquetReader.drawio.svg)
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 4bf5d9fde3..39ad27a57c 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -42,6 +42,8 @@ as a native runtime to achieve improvement in terms of query efficiency and quer
    Comet Overview
    Installing Comet
+   Building From Source
+   Kubernetes Guide
    Supported Data Sources
    Supported Data Types
    Supported Operators
diff --git a/docs/source/user-guide/installation.md b/docs/source/user-guide/installation.md
index dc4429b8b9..343b658683 100644
--- a/docs/source/user-guide/installation.md
+++ b/docs/source/user-guide/installation.md
@@ -19,73 +19,54 @@
 # Installing DataFusion Comet

+## Prerequisites
+
 Make sure the following requirements are met and software installed on your machine.

-## Supported Platforms
+### Supported Operating Systems

 - Linux
 - Apple OSX (Intel and Apple Silicon)

-## Requirements
+### Supported Spark Versions

-- [Apache Spark supported by Comet](overview.md#supported-apache-spark-versions)
-- JDK 8 and up
-- GLIBC 2.17 (Centos 7) and up
+Comet currently supports the following versions of Apache Spark:

-## Deploying to Kubernetes
+- 3.3.x (Java 8/11/17, Scala 2.12/2.13)
+- 3.4.x (Java 8/11/17, Scala 2.12/2.13)
+- 3.5.x (Java 8/11/17, Scala 2.12/2.13)

-See the [Comet Kubernetes Guide](kubernetes.md) guide.
-
-## Using a Published JAR File
+Experimental support is provided for the following versions of Apache Spark and is intended for development/testing
+use only and should not be used in production yet.

-Pre-built jar files are available in Maven central at https://central.sonatype.com/namespace/org.apache.datafusion
+- 4.0.0-preview1 (Java 17/21, Scala 2.13)

-## Using a Published Source Release
-
-Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/
-
-```console
-# Pick the latest version
-export COMET_VERSION=0.3.0
-# Download the tarball
-curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
-# Unpack
-tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
-cd apache-datafusion-comet-$COMET_VERSION
-```
+Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
+Cloud Service Providers.
-Build
-
-```console
-make release-nogit PROFILES="-Pspark-3.4"
-```
-
-## Building from the GitHub repository
+## Using a Published JAR File

-Clone the repository:
+Comet jar files are available in [Maven Central](https://central.sonatype.com/namespace/org.apache.datafusion).

-```console
-git clone https://github.com/apache/datafusion-comet.git
-```
+Here are the direct links for downloading the Comet jar files.

-Build Comet for a specific Spark version:
+- [Comet plugin for Spark 3.3 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.12/0.3.0/comet-spark-spark3.3_2.12-0.3.0.jar)
+- [Comet plugin for Spark 3.3 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.3_2.13/0.3.0/comet-spark-spark3.3_2.13-0.3.0.jar)
+- [Comet plugin for Spark 3.4 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.12/0.3.0/comet-spark-spark3.4_2.12-0.3.0.jar)
+- [Comet plugin for Spark 3.4 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.4_2.13/0.3.0/comet-spark-spark3.4_2.13-0.3.0.jar)
+- [Comet plugin for Spark 3.5 / Scala 2.12](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.3.0/comet-spark-spark3.5_2.12-0.3.0.jar)
+- [Comet plugin for Spark 3.5 / Scala 2.13](https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.13/0.3.0/comet-spark-spark3.5_2.13-0.3.0.jar)

-```console
-cd datafusion-comet
-make release PROFILES="-Pspark-3.4"
-```
+## Building From Source

-Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:
+Refer to the [Building from Source] guide for instructions on building Comet from source, either from an official
+source release or from the latest code in the GitHub repository.

-```console
-make release PROFILES="-Pspark-3.4 -Pscala-2.13"
-```
+[Building from Source]: source.md

-To build Comet from the source distribution on an isolated environment without an access to `github.com` it is necessary to disable `git-commit-id-maven-plugin`, otherwise you will face errors that there is no access to the git during the build process. In that case you may use:
+## Deploying to Kubernetes

-```console
-make release-nogit PROFILES="-Pspark-3.4"
-```
+See the [Comet Kubernetes Guide](kubernetes.md).

 ## Run Spark Shell with Comet enabled

@@ -99,11 +80,10 @@
 $SPARK_HOME/bin/spark-shell \
     --jars $COMET_JAR \
     --conf spark.driver.extraClassPath=$COMET_JAR \
     --conf spark.executor.extraClassPath=$COMET_JAR \
     --conf spark.plugins=org.apache.spark.CometPlugin \
-    --conf spark.comet.enabled=true \
-    --conf spark.comet.exec.enabled=true \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
     --conf spark.comet.explainFallback.enabled=true \
-    --conf spark.driver.memory=1g \
-    --conf spark.executor.memory=1g
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=16g
 ```

 ### Verify Comet enabled for Spark SQL query

@@ -142,20 +122,9 @@ WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts
   - Execute InsertIntoHadoopFsRelationCommand is not supported
 ```

-### Enable Comet shuffle
+## Additional Configuration

-Comet shuffle feature is disabled by default. To enable it, please add related configs:
-
-```
---conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
---conf spark.comet.exec.shuffle.enabled=true
-```
-
-Above configs enable Comet native shuffle which only supports hash partition and single partition.
-Comet native shuffle doesn't support complex types yet.
-
-Comet doesn't have official release yet so currently the only way to test it is to build jar and include it in your
-Spark application. Depending on your deployment mode you may also need to set the driver & executor class path(s) to
+Depending on your deployment mode you may also need to set the driver & executor class path(s) to
 explicitly contain Comet otherwise Spark may use a different class-loader for the Comet components than its internal
 components which will then fail at runtime. For example:

@@ -165,11 +134,7 @@ components which will then fail at runtime. For example:
 Some cluster managers may require additional configuration, see

-To enable columnar shuffle which supports all partitioning and basic complex types, one more config is required:
-
-```
---conf spark.comet.exec.shuffle.mode=jvm
-```
-
 ### Memory tuning
-In addition to Apache Spark memory configuration parameters the Comet introduces own parameters to configure memory allocation for native execution. More [Comet Memory Tuning](./tuning.md)
+
+In addition to Apache Spark memory configuration parameters, Comet introduces additional parameters to configure memory
+allocation for native execution. See [Comet Memory Tuning](./tuning.md) for details.
diff --git a/docs/source/user-guide/overview.md b/docs/source/user-guide/overview.md
index e386aec8ca..92dfe2bb94 100644
--- a/docs/source/user-guide/overview.md
+++ b/docs/source/user-guide/overview.md
@@ -19,8 +19,14 @@
 # Comet Overview

-Comet runs Spark SQL queries using the native Apache DataFusion runtime, which is
-typically faster and more resource efficient than JVM based runtimes.
+Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
+[Apache DataFusion] query engine. Comet is designed to significantly enhance the
+performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
+Spark ecosystem without requiring any code changes.
+
+[Apache DataFusion]: https://datafusion.apache.org
+
+The following diagram provides an overview of Comet's architecture.

 ![Comet Overview](../_static/images/comet-overview.png)

@@ -34,26 +40,10 @@ Comet aims to support:

 ## Architecture

-The following diagram illustrates the architecture of Comet:
+The following diagram shows how Comet integrates with Apache Spark.

 ![Comet System Diagram](../_static/images/comet-system-diagram.png)

-## Supported Apache Spark versions
-
-Comet currently supports the following versions of Apache Spark:
-
-- 3.3.x
-- 3.4.x
-- 3.5.x
-
-Experimental support is provided for the following versions of Apache Spark and is intended for development/testing
-use only and should not be used in production yet.
-
-- 4.0.0-preview1
-
-Note that Comet may not fully work with proprietary forks of Apache Spark such as the Spark versions offered by
-Cloud Service Providers.
-
 ## Feature Parity with Apache Spark

 The project strives to keep feature parity with Apache Spark, that is,

@@ -65,3 +55,9 @@ features and fallback to Spark engine.

 To achieve this, besides unit tests within Comet itself, we also re-use Spark SQL tests
 and make sure they all pass with Comet extension enabled.
+
+## Getting Started
+
+Refer to the [Comet Installation Guide] to get started.
+
+[Comet Installation Guide]: installation.md
diff --git a/docs/source/user-guide/source.md b/docs/source/user-guide/source.md
new file mode 100644
index 0000000000..71c9060cb5
--- /dev/null
+++ b/docs/source/user-guide/source.md
@@ -0,0 +1,69 @@
+
+
+# Building Comet From Source
+
+It is sometimes preferable to build from source for a specific platform.
+
+## Using a Published Source Release
+
+Official source releases can be downloaded from https://dist.apache.org/repos/dist/release/datafusion/
+
+```console
+# Pick the latest version
+export COMET_VERSION=0.3.0
+# Download the tarball
+curl -O "https://dist.apache.org/repos/dist/release/datafusion/datafusion-comet-$COMET_VERSION/apache-datafusion-comet-$COMET_VERSION.tar.gz"
+# Unpack
+tar -xzf apache-datafusion-comet-$COMET_VERSION.tar.gz
+cd apache-datafusion-comet-$COMET_VERSION
+```
+
+Build
+
+```console
+make release-nogit PROFILES="-Pspark-3.4"
+```
+
+## Building from the GitHub repository
+
+Clone the repository:
+
+```console
+git clone https://github.com/apache/datafusion-comet.git
+```
+
+Build Comet for a specific Spark version:
+
+```console
+cd datafusion-comet
+make release PROFILES="-Pspark-3.4"
+```
+
+Note that the project builds for Scala 2.12 by default but can be built for Scala 2.13 using an additional profile:
+
+```console
+make release PROFILES="-Pspark-3.4 -Pscala-2.13"
+```
+
+To build Comet from the source distribution in an isolated environment without access to `github.com`, it is necessary
+to disable the `git-commit-id-maven-plugin`; otherwise the build will fail because it cannot access git. In that case
+you may use:
+
+```console
+make release-nogit PROFILES="-Pspark-3.4"
+```
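+
+## Using the Built JAR
+
+As a quick sanity check after either build path, the locally built jar can be passed to `spark-shell` in the same way
+as a published jar. This is a minimal sketch: the artifact path below assumes the jar is produced under `spark/target/`
+with a name matching the chosen Spark/Scala profiles and version, so verify it against your actual build output.
+
+```console
+# Path and file name are assumptions; adjust them to match your build output
+export COMET_JAR=$(pwd)/spark/target/comet-spark-spark3.4_2.12-0.3.0.jar
+
+# Launch spark-shell with the locally built Comet plugin
+$SPARK_HOME/bin/spark-shell \
+    --jars $COMET_JAR \
+    --conf spark.plugins=org.apache.spark.CometPlugin
+```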