-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema #24284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this PR contains too many changes, I am OK to create a separate PR for the partition value pruning.
|
Test build #104251 has finished for PR 24284 at commit
|
313eda8 to
cd236a7
Compare
|
Test build #104274 has finished for PR 24284 at commit
|
|
Test build #104269 has finished for PR 24284 at commit
|
|
retest this please. |
|
Test build #104280 has finished for PR 24284 at commit
|
docs/sql-migration-guide-upgrade.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need migration guide? it's a behavior change for file source v2, which is new in Spark 3.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with either way. Let me remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we testing v1 or v2 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
V2.
For V1 we use OrcV1PartitionDiscoverySuite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should put V2 in the test suite name as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not quite related to this PR. If we are going to use V2 by default, I think the current test suite name is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't V1 by default now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For read path, it is V2 by default now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make a followup PR to put V2 in the test suite name and do not rely on the default config values.
|
Test build #104291 has finished for PR 24284 at commit
|
docs/sql-migration-guide-upgrade.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change
…rlap with partition schema
8894d93 to
a64107d
Compare
|
Test build #104299 has finished for PR 24284 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
In the current file source V2 framework, the schema of
FileScanis not returned correctly if there are overlap columns betweendataSchemaandpartitionSchema. The actual schema should bedataSchema - overlapSchema + partitionSchema, which might have different column order from the pushed downrequiredSchemainSupportsPushDownRequiredColumns.pruneColumns.For example, if the data schema is
[a: String, b: String, c: String]and the partition schema is[b: Int, d: Int], the result schema is[a: String, b: Int, c: String, d: Int]in currentFileTableandHadoopFsRelation. while the actual scan schema is[a: String, c: String, b: Int, d: Int]inFileScan.To fix the corner case, this PR proposes that the output schema of
FileTableshould bedataSchema - overlapSchema + partitionSchema, so that the column order is consistent withFileScan.Putting all the partition columns to the end of table schema is more reasonable.
How was this patch tested?
Unit test.