@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 10, 2024

What changes were proposed in this pull request?

This PR aims to improve TPCDSQueryBenchmark to support other file formats.

Why are the changes needed?

Currently, parquet is hard-coded as the table format because it is the default value of spark.sql.sources.default, so the benchmark fails on data stored in other formats such as ORC.

spark.catalog.createTable(tableName, "parquet", tableColumns(tableName), options)
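
A minimal sketch of the idea, not the actual patch: instead of passing the literal "parquet", the format could be looked up under the `spark.sql.sources.default` key, falling back to `parquet` when it is unset. The helper `resolveFormat` and the plain `Map` standing in for the Spark configuration are hypothetical, chosen so the snippet runs without a Spark dependency.

```scala
object DefaultFormatSketch {
  // Configuration key named in the PR description.
  val DefaultSourceKey = "spark.sql.sources.default"

  // Hypothetical helper: return the configured format, or "parquet" when unset.
  def resolveFormat(conf: Map[String, String]): String =
    conf.getOrElse(DefaultSourceKey, "parquet")

  def main(args: Array[String]): Unit = {
    // JVM system properties stand in for the Spark configuration here
    // (assumption made for illustration only).
    val conf = sys.props.toMap
    println(s"createTable would use format: ${resolveFormat(conf)}")
  }
}
```

In the benchmark itself, the resolved value would presumably take the place of the "parquet" argument to `spark.catalog.createTable`.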

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual.

BEFORE

$ build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] 18:36:39.698 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
[info] java.lang.RuntimeException: file:/tmp/tpcds-sf-1-orc-snappy/catalog_page/part-00000-40446d2a-f814-4e26-b3e1-664b833bf041-c000.snappy.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
[info] 	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
...
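
As a quick check of the error above, the reported tail bytes can be decoded by hand. The sketch below is illustrative only (the object and value names are made up); it shows that the first three bytes are the ASCII letters of the ORC magic, while the Parquet reader expects the file to end with `PAR1`.

```scala
object MagicBytesSketch {
  // Tail bytes reported in the error message.
  val found: Array[Byte] = Array(79, 82, 67, 25)

  // Parquet files end with the ASCII magic "PAR1".
  val parquetMagic: Array[Byte] = "PAR1".getBytes("US-ASCII")

  def main(args: Array[String]): Unit = {
    // Decode the first three bytes as ASCII text.
    val prefix = new String(found.take(3), "US-ASCII")
    println(s"first three tail bytes as ASCII: $prefix")
    println(s"matches Parquet magic: ${found.sameElements(parquetMagic)}")
  }
}
```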

AFTER

$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 6 iterations, 2028 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                  305            338          24          1.5         660.4       1.0X

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Jan 10, 2024
@dongjoon-hyun
Member Author

Could you review this PR, @LuciferYang ?

Contributor

@LuciferYang LuciferYang left a comment

+1, LGTM

@dongjoon-hyun
Member Author

Thank you so much, @LuciferYang .

@dongjoon-hyun
Member Author

Since this change is irrelevant to the CI result and I verified it manually, let me merge this~

@dongjoon-hyun dongjoon-hyun deleted the SPARK-46646 branch January 10, 2024 02:52
@yaooqinn
Member

Late LGTM

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…her file formats

### What changes were proposed in this pull request?

This PR aims to improve `TPCDSQueryBenchmark` to support other file formats.

### Why are the changes needed?

Currently, `parquet` is hard-coded as the table format because it is the default value of `spark.sql.sources.default`, so the benchmark fails on data stored in other formats such as ORC.

https://github.com/apache/spark/blob/48d22e9f876f070d35ff3dd011bfbd1b6bccb4ac/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala#L77

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual.

**BEFORE**
```
$ build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] 18:36:39.698 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
[info] java.lang.RuntimeException: file:/tmp/tpcds-sf-1-orc-snappy/catalog_page/part-00000-40446d2a-f814-4e26-b3e1-664b833bf041-c000.snappy.orc is not a Parquet file. Expected magic number at tail, but found [79, 82, 67, 25]
[info] 	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
...
```

**AFTER**
```
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /tmp/tpcds-sf-1-orc-snappy/"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 6 iterations, 2028 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                  305            338          24          1.5         660.4       1.0X
```

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#44651 from dongjoon-hyun/SPARK-46646.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>