Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

SupportsPushDownRequiredColumns should return the pruned schema, instead of requiring Spark to build the Scan and get readSchema from the Scan.

Why are the changes needed?

I found this problem while developing the v1 read fallback API following #25348.

The problem is that the v1 read fallback API needs to rely on ScanBuilder to do filter pushdown and column pruning, and then create a v1 BaseRelation at the end.

However, SupportsPushDownRequiredColumns is not well designed: Spark must create the v2 Scan just to get the result of column pruning (the pruned schema). This is not possible for data sources implementing the v1 read fallback API.

This change also makes it easier to implement DS v2: users don't need to implement the schema twice (in Table and in Scan) if they don't support column pruning.
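
To make the proposal concrete, here is a minimal sketch of the interface change, assuming pruneColumns is changed to return the pruned schema (the exact signature in this PR may differ). The Scan and ScanBuilder stubs are simplified stand-ins for the real v2 reader interfaces:

```java
import org.apache.spark.sql.types.StructType;

// Simplified stand-ins for the real v2 reader interfaces.
interface Scan {
  StructType readSchema();
}

interface ScanBuilder {
  Scan build();
}

// Before: pruning is a side effect, so Spark must call build() and then
// Scan.readSchema() just to learn the pruned schema.
interface SupportsPushDownRequiredColumnsBefore extends ScanBuilder {
  void pruneColumns(StructType requiredSchema);
}

// After (sketch): the builder reports the pruned schema directly, so a
// v1-fallback source never needs to build a v2 Scan.
interface SupportsPushDownRequiredColumns extends ScanBuilder {
  StructType pruneColumns(StructType requiredSchema);
}
```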

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

cc @rdblue @gengliangwang @brkyvz

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112219 has finished for PR 26150 at commit 2bba4e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The diff under review (javadoc of Scan#description):

```diff
  * A description string of this scan, which may includes information like: what filters are
  * configured for this scan, what's the value of some important options like path, etc. The
- * description doesn't need to include {@link #readSchema()}, as Spark already knows it.
+ * description doesn't need to include the schema, as Spark already knows it.
```
Contributor

I would rather not remove readSchema. The scan should be self-describing back to Spark, and the read schema is a key piece of information. In fact, I'd like to add more methods to access other things, like pushed filters and residual filters.
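
As a sketch of that direction: readSchema() and description() exist in the API today, while pushedFilters() and residualFilters() below are hypothetical additions of the kind being suggested here.

```java
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.types.StructType;

interface SelfDescribingScan {
  // Existing: the schema this scan actually reads.
  StructType readSchema();

  // Existing: a human-readable description for EXPLAIN output.
  default String description() {
    return this.getClass().toString();
  }

  // Hypothetical additions suggested in this comment:
  Filter[] pushedFilters();    // filters the source will evaluate
  Filter[] residualFilters();  // filters Spark must still evaluate
}
```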

Contributor Author

> ... like pushed filters and residual filters.

hmm, are these already available from SupportsPushDownFilters?

Contributor

They are, but pushedFilters should also be available from the Scan.
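
For reference, the existing mix-in being discussed, simplified from Spark's v2 reader API (reusing the ScanBuilder stub from the earlier sketch):

```java
import org.apache.spark.sql.sources.Filter;

// Simplified from the real interface: Spark pushes candidate filters via
// pushFilters(), which returns the ones the source cannot handle (the
// residual); pushedFilters() reports what the source accepted.
interface SupportsPushDownFilters extends ScanBuilder {
  Filter[] pushFilters(Filter[] filters);
  Filter[] pushedFilters();
}
```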

Contributor Author

I'm not sure this is the right direction. If we add more pushdown in the future (limit, aggregate, etc.), are we going to add methods to Scan every time?

@rdblue
Contributor

rdblue commented Oct 18, 2019

@cloud-fan, I don't think that removing this method is possible without correctness issues.

The read schema may be impacted by filters. For example, if I project column a and filter by column b then the read schema can be (a, b) or just a. So the read schema can't be fully determined when the method to push the projection is called.
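
Spelled out, the two possible outcomes for a hypothetical table t(a, b) look like this:

```java
// Query: SELECT a FROM t WHERE b > 0
//
// Case 1: the source cannot evaluate b > 0 itself.
//   pushFilters([b > 0]) returns [b > 0]   (residual for Spark)
//   readSchema() must be (a, b), because Spark needs b to re-apply
//   the filter on the returned rows.
//
// Case 2: the source fully evaluates b > 0.
//   pushFilters([b > 0]) returns []        (no residual)
//   readSchema() can be just (a).
//
// Either way, the final read schema depends on the filter pushdown
// result, not only on the pushed projection.
```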

Why not just create a scan and use the readSchema from the scan, but still pass the builder into v1?

@cloud-fan
Contributor Author

> The read schema may be impacted by filters. For example, if I project column a and filter by column b then the read schema can be (a, b) or just a. So the read schema can't be fully determined when the method to push the projection is called.

In fact, Spark always does column pruning at the end, so this can be determined when pushRequiredColumns is called.

@rdblue
Contributor

rdblue commented Oct 21, 2019

> In fact, Spark always does column pruning at the end, so this can be determined when pushRequiredColumns is called.

One of the main reasons for the recent refactor that introduced the scan builder was that "scan execution order is not obvious". The builder was introduced so that the order was clear: filters and projections are pushed, then the read schema is fetched. What you're suggesting, to get the read schema before building the scan, reintroduces the problem this was intended to fix by breaking the guarantee that the read schema is fetched after both projection and filters are pushed.

I'm -1 for this change.

Also, why can't the v1 fallback path build a scan, even if it isn't used?
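
For context, the ordering guarantee described above, as a simplified sketch of Spark's planner-side sequence (table, options, translatedFilters, and requiredSchema are assumed to be in scope):

```java
// Simplified planning sequence with the builder pattern:
ScanBuilder builder = table.newScanBuilder(options);

// 1. Push filters first (if the source supports it).
if (builder instanceof SupportsPushDownFilters) {
  ((SupportsPushDownFilters) builder).pushFilters(translatedFilters);
}

// 2. Then push the projection.
if (builder instanceof SupportsPushDownRequiredColumns) {
  ((SupportsPushDownRequiredColumns) builder).pruneColumns(requiredSchema);
}

// 3. Only now build the scan; readSchema() therefore reflects both the
//    pushed filters and the pushed projection.
Scan scan = builder.build();
StructType readSchema = scan.readSchema();
```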

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 31, 2020
@cloud-fan cloud-fan closed this Jan 31, 2020