Skip to content

[SPARK-21463] Allow userSpecifiedSchema to override partition inference performed by MetadataLogFileIndex#18676

Closed
brkyvz wants to merge 6 commits intoapache:masterfrom
brkyvz:stream-partitioning
Closed

[SPARK-21463] Allow userSpecifiedSchema to override partition inference performed by MetadataLogFileIndex#18676
brkyvz wants to merge 6 commits intoapache:masterfrom
brkyvz:stream-partitioning

Conversation

@brkyvz
Copy link
Copy Markdown
Contributor

@brkyvz brkyvz commented Jul 18, 2017

What changes were proposed in this pull request?

When using the MetadataLogFileIndex to read back a table, we don't respect the user provided schema as the proper column types. This can lead to issues when trying to read strings that look like dates that get truncated to DateType, or longs being truncated to IntegerType, just because a long value doesn't exist.

How was this patch tested?

Unit tests and manual tests

@brkyvz
Copy link
Copy Markdown
Contributor Author

brkyvz commented Jul 18, 2017

One thing that worries me is the fact that we're paying the price of inferring partitions and reading all files from the log twice. I tried re-using the metadataLog instance in the copy of the new MetadataLogFileIndex as implemented in the first commit, but not sure if that gives us any benefit.

cc @zsxwing

@SparkQA
Copy link
Copy Markdown

SparkQA commented Jul 18, 2017

Test build #79723 has finished for PR 18676 at commit 7589b66.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Jul 18, 2017

Test build #79724 has finished for PR 18676 at commit dce4a75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Jul 18, 2017

Test build #79721 has finished for PR 18676 at commit 8e4b94f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MetadataLogFileIndex(

@SparkQA
Copy link
Copy Markdown

SparkQA commented Jul 19, 2017

Test build #79730 has finished for PR 18676 at commit 7cdc864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Copy Markdown
Member

zsxwing commented Jul 19, 2017

LGTM

@brkyvz
Copy link
Copy Markdown
Contributor Author

brkyvz commented Jul 19, 2017

Thanks! Merging to master

@asfgit asfgit closed this in 2c9d5ef Jul 19, 2017
@brkyvz brkyvz deleted the stream-partitioning branch February 3, 2019 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants