[SPARK-29501][SQL] SupportsPushDownRequiredColumns should report pruned schema immediately #26150
Test build #112219 has finished for PR 26150 at commit
   * A description string of this scan, which may includes information like: what filters are
   * configured for this scan, what's the value of some important options like path, etc. The
-  * description doesn't need to include {@link #readSchema()}, as Spark already knows it.
+  * description doesn't need to include the schema, as Spark already knows it.
I would rather not remove readSchema. The scan should be self-describing back to Spark, and the read schema is a key piece of information. In fact, I'd like to add more methods to access other things, like pushed filters and residual filters.
... like pushed filters and residual filters.
hmm, are these already available from SupportsPushDownFilters?
They are, but pushedFilters should also be available from the Scan.
I'm not sure this is the right direction. If we add more pushdown in the future (limit, aggregate, etc.), are we going to add methods to Scan every time?
@cloud-fan, I don't think that removing this method is possible without correctness issues. The read schema may be impacted by filters. For example, if I project column Why not just create a scan and use the
In fact, Spark always does column pruning at the end, so this can be determined then.
One of the main reasons for the recent refactor that introduced the scan builder was that "scan execution order is not obvious". The builder was introduced so that the order was clear: filters and projections are pushed, then the read schema is fetched. What you're suggesting, to get the read schema before building the scan, reintroduces the problem this was intended to fix by breaking the guarantee that the read schema is fetched after both projection and filters are pushed. I'm -1 for this change. Also, why can't the v1 fallback path build a scan, even if it isn't used?
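A minimal sketch of the ordering guarantee described above, using simplified stand-in interfaces rather than the real Spark classes. The names `ToyScan`, `ToyScanBuilder`, `RecordingScanBuilder`, and `planReadSchema` are hypothetical, and schemas are reduced to lists of column names:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the real Spark interfaces.
interface ToyScan {
    List<String> readSchema();
}

interface ToyScanBuilder {
    void pushFilters(List<String> filters);
    void pruneColumns(List<String> requiredColumns);
    ToyScan build();
}

// A toy builder that records the order in which it is called.
class RecordingScanBuilder implements ToyScanBuilder {
    final List<String> calls = new ArrayList<>();
    private final List<String> columns;

    RecordingScanBuilder(List<String> tableColumns) {
        this.columns = new ArrayList<>(tableColumns);
    }

    @Override
    public void pushFilters(List<String> filters) {
        calls.add("pushFilters");
    }

    @Override
    public void pruneColumns(List<String> requiredColumns) {
        calls.add("pruneColumns");
        columns.retainAll(requiredColumns); // keeps table order, drops unrequested columns
    }

    @Override
    public ToyScan build() {
        calls.add("build");
        List<String> finalSchema = List.copyOf(columns);
        return () -> finalSchema; // the schema is fixed only once the scan is built
    }
}

class ScanOrderSketch {
    // Filters first, then projection, then build; the read schema is fetched last.
    static List<String> planReadSchema(ToyScanBuilder builder,
                                       List<String> filters,
                                       List<String> required) {
        builder.pushFilters(filters);
        builder.pruneColumns(required);
        return builder.build().readSchema();
    }
}
```

In this toy version the read schema is only observable through the built scan, so a caller cannot see it before both pushdown steps have run.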
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
SupportsPushDownRequiredColumns should return the pruned schema, instead of requiring Spark to build the Scan and get readSchema from it.
Why are the changes needed?
I found this problem while developing the v1 read fallback API following #25348.
The problem is that the v1 read fallback API needs to rely on ScanBuilder to do filter pushdown and column pruning, and create a v1 BaseRelation at the end.
However, SupportsPushDownRequiredColumns is not well designed. Spark must create the v2 Scan to get the result of column pruning: the pruned schema. This is not possible for data sources implementing the v1 read fallback API.
With this change, we also make it easier to implement DS v2: users don't need to implement the schema twice (in Table and Scan) if they don't support column pruning.
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests
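For illustration, the idea proposed in this description (column pruning reports the pruned schema directly, so no Scan has to be built) might look roughly like the sketch below. Everything here is an assumption made for the example: `PruningBuilder`, `NoPruningBuilder`, `ColumnarBuilder`, and `V1FallbackSketch` are hypothetical stand-ins, schemas are reduced to lists of column names, and the signatures are not taken from the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; NOT the real SupportsPushDownRequiredColumns interface.
interface PruningBuilder {
    // Prunes to the required columns and immediately returns the schema that will be read.
    List<String> pruneColumns(List<String> requiredColumns);
}

// A source without column pruning simply reports the table schema,
// so the schema is declared in one place only.
class NoPruningBuilder implements PruningBuilder {
    private final List<String> tableSchema;

    NoPruningBuilder(List<String> tableSchema) {
        this.tableSchema = List.copyOf(tableSchema);
    }

    @Override
    public List<String> pruneColumns(List<String> requiredColumns) {
        return tableSchema;
    }
}

// A source with column pruning reports only the columns it will actually read.
class ColumnarBuilder implements PruningBuilder {
    private final List<String> tableSchema;

    ColumnarBuilder(List<String> tableSchema) {
        this.tableSchema = List.copyOf(tableSchema);
    }

    @Override
    public List<String> pruneColumns(List<String> requiredColumns) {
        List<String> kept = new ArrayList<>(tableSchema);
        kept.retainAll(requiredColumns); // preserves table column order
        return List.copyOf(kept);
    }
}

class V1FallbackSketch {
    // Stand-in for creating a v1 relation: only the pruned schema is needed, no Scan is built.
    static String describeRelation(PruningBuilder builder, List<String> requiredColumns) {
        List<String> pruned = builder.pruneColumns(requiredColumns);
        return "relation(" + String.join(", ", pruned) + ")";
    }
}
```

Note that this sketch does not address the ordering concern raised in the review, namely that the read schema should be fetched only after both filters and projections have been pushed.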