Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Apr 3, 2019

What changes were proposed in this pull request?

In the current file source V2 framework, the schema of FileScan is not returned correctly if there are overlap columns between dataSchema and partitionSchema. The actual schema should be
dataSchema - overlapSchema + partitionSchema, which might have different column order from the pushed down requiredSchema in SupportsPushDownRequiredColumns.pruneColumns.

For example, if the data schema is [a: String, b: String, c: String] and the partition schema is [b: Int, d: Int], the result schema is [a: String, b: Int, c: String, d: Int] in current FileTable and HadoopFsRelation. while the actual scan schema is [a: String, c: String, b: Int, d: Int] in FileScan.

To fix the corner case, this PR proposes that the output schema of FileTable should be dataSchema - overlapSchema + partitionSchema, so that the column order is consistent with FileScan.
Putting all the partition columns to the end of table schema is more reasonable.

How was this patch tested?

Unit test.

@gengliangwang
Copy link
Member Author

@cloud-fan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR contains too many changes, I am OK to create a separate PR for the partition value pruning.

@gengliangwang gengliangwang changed the title [SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema [WIP][SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema Apr 3, 2019
@SparkQA
Copy link

SparkQA commented Apr 3, 2019

Test build #104251 has finished for PR 24284 at commit 313eda8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileScanBuilder(

@gengliangwang gengliangwang changed the title [WIP][SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema [SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema Apr 4, 2019
@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104274 has finished for PR 24284 at commit 518b628.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104269 has finished for PR 24284 at commit cd236a7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104280 has finished for PR 24284 at commit 518b628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need migration guide? it's a behavior change for file source v2, which is new in Spark 3.0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with either way. Let me remove this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we testing v1 or v2 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V2.
For V1 we use OrcV1PartitionDiscoverySuite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should put V2 in the test suite name as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite related to this PR. If we are going to use V2 by default, I think the current test suite name is OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't V1 by default now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For read path, it is V2 by default now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make a followup PR to put V2 in the test suite name and do not rely on the default config values.

@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104291 has finished for PR 24284 at commit 8894d93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change

@SparkQA
Copy link

SparkQA commented Apr 4, 2019

Test build #104299 has finished for PR 24284 at commit a64107d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants