Skip to content

Conversation

@c21
Copy link
Contributor

@c21 c21 commented Nov 6, 2021

What changes were proposed in this pull request?

As a followup of https://github.com/apache/spark/pull/34298/files#r734795801, Similar to ORC aggregate push down, we can disallow split input files for Parquet reader as well. See original comment for more details of motivation. Also fix the string of RowDataSourceScanExec to only print out PushedAggregates and PushedGroupby, to be aligned with PushedLimit and PushedSample, as there's not so many queries can benefit from aggregate push down, so we don't need to print those unnecessary information.

Why are the changes needed?

Avoid unnecessary file splits in multiple tasks for Parquet reader with aggregate push down.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit test in FileSourceAggregatePushDownSuite.scala.

@github-actions github-actions bot added the SQL label Nov 6, 2021
@c21
Copy link
Contributor Author

c21 commented Nov 6, 2021

@huaxingao, @sunchao and @viirya - could you help take a look when you have time? Thanks.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @c21, looks good just one nit.

override def isSplitable(path: Path): Boolean = {
// If aggregate is pushed down, only the file footer will be read once,
// so file should not be split across multiple tasks.
pushedAggregate.isEmpty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49415/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49416/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49415/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49416/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Test build #144942 has finished for PR 34498 at commit 4eb0e8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Test build #144944 has finished for PR 34498 at commit 94a1bd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay.

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Test build #144945 has finished for PR 34498 at commit 3c54820.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49420/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49420/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Test build #144948 has finished for PR 34498 at commit 04c3f20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao closed this in d3a1337 Nov 6, 2021
@c21
Copy link
Contributor Author

c21 commented Nov 6, 2021

Thank you @sunchao, @huaxingao and @viirya for review!

@c21 c21 deleted the agg-fix branch November 7, 2021 00:41
@HyukjinKwon
Copy link
Member

Quick question on:

Existing unit test in FileSourceAggregatePushDownSuite.scala.

How did the existing tests pass before this PR?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

override def isSplitable(path: Path): Boolean = {
// If aggregate is pushed down, only the file footer will be read once,
// so file should not be split across multiple tasks.
pushedAggregate.isEmpty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. Got it now.

// footers for every split of the file. Basically if the start (the beginning of)
// the offset in PartitionedFile is 0, we will read the footer. Otherwise, it means
// that we have already read footer for that file, so we will skip reading again.
if (file.start != 0) return null
Copy link
Contributor Author

@c21 c21 Nov 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question on: Existing unit test in FileSourceAggregatePushDownSuite.scala. How did the existing tests pass before this PR?

@HyukjinKwon - I think we are in the same page based on your latest comment, but just to be noisy here in case anything is missing. Before this PR, when a single file is split into multiple splits across multiple tasks, we have the logic here to only process the split of file if file.start == 0, so only the first split of file will be processed, and every file is processed only once. So here is the trick. Before this PR, the logic for Parquet aggregate push down was still correct. We want to avoid unnecessary file splitting so update logic in this PR here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants