[SPARK-37220][SQL] Do not split input file for Parquet reader with aggregate push down #34498
Conversation
@huaxingao, @sunchao and @viirya - could you help take a look when you have time? Thanks.
sunchao
left a comment
Thanks @c21, looks good just one nit.
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
override def isSplitable(path: Path): Boolean = {
  // If aggregate is pushed down, only the file footer will be read once,
  // so file should not be split across multiple tasks.
  pushedAggregate.isEmpty
}
👍
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #144942 has finished for PR 34498 at commit
Test build #144944 has finished for PR 34498 at commit
viirya
left a comment
Looks okay.
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
Test build #144945 has finished for PR 34498 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144948 has finished for PR 34498 at commit
Thank you @sunchao, @huaxingao and @viirya for the review!
Quick question on: How did the existing tests pass before this PR?
HyukjinKwon
left a comment
LGTM2
override def isSplitable(path: Path): Boolean = {
  // If aggregate is pushed down, only the file footer will be read once,
  // so file should not be split across multiple tasks.
  pushedAggregate.isEmpty
}
Oh, okay. Got it now.
// footers for every split of the file. Basically if the start (the beginning of)
// the offset in PartitionedFile is 0, we will read the footer. Otherwise, it means
// that we have already read footer for that file, so we will skip reading again.
if (file.start != 0) return null
Quick question on: Existing unit test in FileSourceAggregatePushDownSuite.scala. How did the existing tests pass before this PR?
@HyukjinKwon - I think we are on the same page based on your latest comment, but just to spell it out in case anything is missing: before this PR, when a single file was split into multiple splits across multiple tasks, the logic here processed a split only when file.start == 0. So only the first split of each file did the work, and every file's footer was read exactly once. That is the trick: even before this PR, the Parquet aggregate push down logic was correct. This PR updates the logic so the unnecessary file splitting is avoided in the first place.
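The pre-PR trick can be sketched in plain Scala like this (a simplified, hypothetical model for illustration; `PartitionedFile` here is a stand-in case class, not Spark's actual class):

```scala
// Stand-in for Spark's PartitionedFile: one split of a file assigned to a task.
case class PartitionedFile(path: String, start: Long, length: Long)

// Before the PR: even if a file was split across tasks, only the split whose
// start offset is 0 read the footer, so each file's footer was read exactly once.
def splitsThatReadFooter(splits: Seq[PartitionedFile]): Seq[PartitionedFile] =
  splits.filter(_.start == 0)

val splits = Seq(
  PartitionedFile("data.parquet", start = 0L, length = 64L),
  PartitionedFile("data.parquet", start = 64L, length = 64L)
)
println(splitsThatReadFooter(splits).length) // prints 1
```

The other tasks received splits with a non-zero start and returned early, which is why the existing tests passed even without disabling splitting.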
What changes were proposed in this pull request?
As a followup of https://github.com/apache/spark/pull/34298/files#r734795801, and similar to ORC aggregate push down, we can disallow splitting input files for the Parquet reader as well. See the original comment for more details on the motivation. Also fix the string of RowDataSourceScanExec to only print out PushedAggregates and PushedGroupby, to be aligned with PushedLimit and PushedSample; since not many queries can benefit from aggregate push down, we don't need to print that information unnecessarily.
Why are the changes needed?
Avoid unnecessary file splits in multiple tasks for Parquet reader with aggregate push down.
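The change boils down to the rule quoted in the diff above; a self-contained sketch (hypothetical model classes, not the actual Spark FileScan/ParquetScan API):

```scala
// Hypothetical model of the scan: when an aggregate is pushed down, the scan
// reports the file as non-splittable, so a single task reads the whole file
// (and therefore its footer) exactly once.
case class ParquetScanModel(pushedAggregate: Option[String]) {
  def isSplitable(path: String): Boolean = pushedAggregate.isEmpty
}

// With a pushed-down aggregate, the file must not be split across tasks.
assert(!ParquetScanModel(Some("MIN(id)")).isSplitable("data.parquet"))
// Without aggregate push down, normal splitting still applies.
assert(ParquetScanModel(None).isSplitable("data.parquet"))
```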
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test in FileSourceAggregatePushDownSuite.scala.