Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

SupportsPushDownRequiredColumns should return the pruned schema, instead of requiring Spark to build the Scan and get readSchema from the Scan.

Why are the changes needed?

I found this problem while developing the v1 read fallback API following #25348.

The problem is that the v1 read fallback API needs to rely on ScanBuilder to do filter pushdown and column pruning, and then create a v1 BaseRelation at the end.

However, SupportsPushDownRequiredColumns is not well designed: Spark must create the v2 Scan just to get the result of column pruning (the pruned schema). This is not possible for data sources implementing the v1 read fallback API.

This change also makes it easier to implement DS v2: users don't need to implement the schema twice (in Table and in Scan) if they don't support column pruning.
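
To make the proposal concrete, here is a minimal sketch of the interface change, assuming pruneColumns is changed to return the pruned schema (the exact signature in this PR may differ). The Scan and ScanBuilder stubs are simplified stand-ins for the real v2 reader interfaces:

```java
import org.apache.spark.sql.types.StructType;

// Simplified stand-ins for the real v2 reader interfaces.
interface Scan {
  StructType readSchema();
}

interface ScanBuilder {
  Scan build();
}

// Before: pruning is a side effect, so Spark must call build() and then
// Scan.readSchema() just to learn the pruned schema.
interface SupportsPushDownRequiredColumnsBefore extends ScanBuilder {
  void pruneColumns(StructType requiredSchema);
}

// After (sketch): the builder reports the pruned schema directly, so a
// v1-fallback source never needs to build a v2 Scan.
interface SupportsPushDownRequiredColumns extends ScanBuilder {
  StructType pruneColumns(StructType requiredSchema);
}
```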

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

cc @rdblue @gengliangwang @brkyvz

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112219 has finished for PR 26150 at commit 2bba4e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The diff under review (javadoc of Scan#description):

```diff
  * A description string of this scan, which may includes information like: what filters are
  * configured for this scan, what's the value of some important options like path, etc. The
- * description doesn't need to include {@link #readSchema()}, as Spark already knows it.
+ * description doesn't need to include the schema, as Spark already knows it.
```
Contributor

I would rather not remove readSchema. The scan should be self-describing back to Spark, and the read schema is a key piece of information. In fact, I'd like to add more methods to access other things, like pushed filters and residual filters.
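
As a sketch of that direction: readSchema() and description() exist in the API today, while pushedFilters() and residualFilters() below are hypothetical additions of the kind being suggested here.

```java
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.types.StructType;

interface SelfDescribingScan {
  // Existing: the schema this scan actually reads.
  StructType readSchema();

  // Existing: a human-readable description for EXPLAIN output.
  default String description() {
    return this.getClass().toString();
  }

  // Hypothetical additions suggested in this comment:
  Filter[] pushedFilters();    // filters the source will evaluate
  Filter[] residualFilters();  // filters Spark must still evaluate
}
```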

Contributor Author

> ... like pushed filters and residual filters.

hmm, are these already available from SupportsPushDownFilters?

Contributor

They are, but pushedFilters should also be available from the Scan.
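
For reference, the existing mix-in being discussed, simplified from Spark's v2 reader API (reusing the ScanBuilder stub from the earlier sketch):

```java
import org.apache.spark.sql.sources.Filter;

// Simplified from the real interface: Spark pushes candidate filters via
// pushFilters(), which returns the ones the source cannot handle (the
// residual); pushedFilters() reports what the source accepted.
interface SupportsPushDownFilters extends ScanBuilder {
  Filter[] pushFilters(Filter[] filters);
  Filter[] pushedFilters();
}
```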

Contributor Author

I'm not sure this is the right direction. If we add more pushdown in the future (limit, aggregate, etc.), are we going to add methods to Scan every time?

@rdblue
Contributor

rdblue commented Oct 18, 2019

@cloud-fan, I don't think that removing this method is possible without correctness issues.

The read schema may be impacted by filters. For example, if I project column a and filter by column b then the read schema can be (a, b) or just a. So the read schema can't be fully determined when the method to push the projection is called.
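
Spelled out, the two possible outcomes for a hypothetical table t(a, b) look like this:

```java
// Query: SELECT a FROM t WHERE b > 0
//
// Case 1: the source cannot evaluate b > 0 itself.
//   pushFilters([b > 0]) returns [b > 0]   (residual for Spark)
//   readSchema() must be (a, b), because Spark needs b to re-apply
//   the filter on the returned rows.
//
// Case 2: the source fully evaluates b > 0.
//   pushFilters([b > 0]) returns []        (no residual)
//   readSchema() can be just (a).
//
// Either way, the final read schema depends on the filter pushdown
// result, not only on the pushed projection.
```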

Why not just create a scan and use the readSchema from the scan, but still pass the builder into v1?

@cloud-fan
Contributor Author

> The read schema may be impacted by filters. For example, if I project column a and filter by column b then the read schema can be (a, b) or just a. So the read schema can't be fully determined when the method to push the projection is called.

In fact, Spark always does column pruning at the end, so this can be determined when pushRequiredColumns is called.

@rdblue
Contributor

rdblue commented Oct 21, 2019

> In fact, Spark always does column pruning at the end, so this can be determined when pushRequiredColumns is called.

One of the main reasons for the recent refactor that introduced the scan builder was that "scan execution order is not obvious". The builder was introduced so that the order was clear: filters and projections are pushed, then the read schema is fetched. What you're suggesting, to get the read schema before building the scan, reintroduces the problem this was intended to fix by breaking the guarantee that the read schema is fetched after both projection and filters are pushed.

I'm -1 for this change.

Also, why can't the v1 fallback path build a scan, even if it isn't used?
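
For context, the ordering guarantee described above, as a simplified sketch of Spark's planner-side sequence (table, options, translatedFilters, and requiredSchema are assumed to be in scope):

```java
// Simplified planning sequence with the builder pattern:
ScanBuilder builder = table.newScanBuilder(options);

// 1. Push filters first (if the source supports it).
if (builder instanceof SupportsPushDownFilters) {
  ((SupportsPushDownFilters) builder).pushFilters(translatedFilters);
}

// 2. Then push the projection.
if (builder instanceof SupportsPushDownRequiredColumns) {
  ((SupportsPushDownRequiredColumns) builder).pruneColumns(requiredSchema);
}

// 3. Only now build the scan; readSchema() therefore reflects both the
//    pushed filters and the pushed projection.
Scan scan = builder.build();
StructType readSchema = scan.readSchema();
```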

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 31, 2020
@cloud-fan cloud-fan closed this Jan 31, 2020