[SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema #24284

gengliangwang · 2019-04-03T15:51:24Z

What changes were proposed in this pull request?

In the current file source V2 framework, the schema of FileScan is not returned correctly when dataSchema and partitionSchema have overlapping columns. The actual schema should be dataSchema - overlapSchema + partitionSchema, which can have a different column order from the requiredSchema pushed down in SupportsPushDownRequiredColumns.pruneColumns.

For example, if the data schema is [a: String, b: String, c: String] and the partition schema is [b: Int, d: Int], the result schema is [a: String, b: Int, c: String, d: Int] in the current FileTable and HadoopFsRelation, while the actual scan schema in FileScan is [a: String, c: String, b: Int, d: Int].

To fix this corner case, this PR proposes that the output schema of FileTable be dataSchema - overlapSchema + partitionSchema, so that the column order is consistent with FileScan.
Putting all the partition columns at the end of the table schema is more reasonable.
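
To make the two column orders concrete, here is a minimal sketch of both rules. It is an illustration only and not the Spark implementation: schemas are modeled as plain lists of (name, type) pairs rather than StructType, column names are assumed to match exactly (case sensitivity is ignored), and the function names `interleaved_schema` and `scan_schema` are hypothetical.

```python
# Illustrative model only: schemas are lists of (name, type) pairs,
# not Spark StructType objects. Names are matched exactly (an assumption).

def interleaved_schema(data_schema, partition_schema):
    """Order described for the current FileTable/HadoopFsRelation:
    an overlapping data column keeps its position but takes the
    partition column's type; remaining partition columns are appended."""
    partition_types = dict(partition_schema)
    data_names = {name for name, _ in data_schema}
    merged = [(name, partition_types.get(name, dtype)) for name, dtype in data_schema]
    merged += [(n, t) for n, t in partition_schema if n not in data_names]
    return merged


def scan_schema(data_schema, partition_schema):
    """dataSchema - overlapSchema + partitionSchema:
    drop overlapping columns from the data schema, then append
    every partition column at the end."""
    partition_names = {name for name, _ in partition_schema}
    non_overlapping = [(n, t) for n, t in data_schema if n not in partition_names]
    return non_overlapping + list(partition_schema)


data = [("a", "String"), ("b", "String"), ("c", "String")]
partition = [("b", "Int"), ("d", "Int")]

print(interleaved_schema(data, partition))
print(scan_schema(data, partition))
```

With the example schemas from the description, the first function yields the order a, b, c, d and the second yields a, c, b, d, which is the mismatch this PR removes by making FileTable use the second rule.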

How was this patch tested?

Unit test.

gengliangwang · 2019-04-03T15:52:51Z

@cloud-fan

gengliangwang · 2019-04-03T15:56:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala

If this PR contains too many changes, I am OK with creating a separate PR for the partition value pruning.

SparkQA · 2019-04-03T19:58:14Z

Test build #104251 has finished for PR 24284 at commit 313eda8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class FileScanBuilder(

SparkQA · 2019-04-04T07:05:02Z

Test build #104274 has finished for PR 24284 at commit 518b628.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-04T07:05:02Z

Test build #104269 has finished for PR 24284 at commit cd236a7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-04-04T07:07:38Z

retest this please.

SparkQA · 2019-04-04T11:11:21Z

Test build #104280 has finished for PR 24284 at commit 518b628.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-04T12:48:52Z

docs/sql-migration-guide-upgrade.md

do we need migration guide? it's a behavior change for file source v2, which is new in Spark 3.0.

I am OK either way. Let me remove this.

cloud-fan · 2019-04-04T12:58:17Z

...c/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcPartitionDiscoverySuite.scala

are we testing v1 or v2 here?

V2.
For V1 we use OrcV1PartitionDiscoverySuite.

maybe we should put V2 in the test suite name as well.

This is not quite related to this PR. If we are going to use V2 by default, I think the current test suite name is OK.

isn't V1 by default now?

For read path, it is V2 by default now.

let's make a followup PR to put V2 in the test suite name and stop relying on the default config values.

SparkQA · 2019-04-04T17:02:38Z

Test build #104291 has finished for PR 24284 at commit 8894d93.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-04T17:13:34Z

docs/sql-migration-guide-upgrade.md

unnecessary change


SparkQA · 2019-04-04T22:04:37Z

Test build #104299 has finished for PR 24284 at commit a64107d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-05T05:35:11Z

thanks, merging to master!


gengliangwang changed the title ~~[SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema~~ [WIP][SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema Apr 3, 2019

gengliangwang force-pushed the FixReadSchema branch from 313eda8 to cd236a7 Compare April 4, 2019 03:55

gengliangwang changed the title ~~[WIP][SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema~~ [SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema Apr 4, 2019


gengliangwang mentioned this pull request Apr 4, 2019

[SPARK-27384][SQL] File source V2: Prune unnecessary partition columns #24296

Closed


[SPARK-27356][SQL] File source V2: Fix the case that data columns ove…

a64107d

…rlap with partition schema

gengliangwang force-pushed the FixReadSchema branch from 8894d93 to a64107d Compare April 4, 2019 17:58

cloud-fan closed this in 568db94 Apr 5, 2019

gengliangwang mentioned this pull request Apr 8, 2019

[SPARK-27271][SQL] Migrate Text to File Data Source V2 #24207

Closed
