
Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

File source V2 currently keeps using cached data even after the underlying files are overwritten.
We should follow #13566 and fix this by invalidating and refreshing all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.
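
For context, a minimal sketch of the idea (not necessarily the exact code in this PR): after a successful write, ask the CacheManager to drop and lazily rebuild any cached plan that reads from the output path. This assumes the recacheByPath hook used by the V1 write path is reachable from the V2 write path as well.

import org.apache.spark.sql.SparkSession

// Sketch only: invalidate and refresh cache entries after a V2 file write.
// Assumes CacheManager.recacheByPath(SparkSession, String) is available,
// mirroring what the V1 InsertIntoHadoopFsRelationCommand does.
def refreshCacheAfterWrite(spark: SparkSession, outputPath: String): Unit = {
  // Drops cached data for any plan reading from outputPath and rebuilds
  // it lazily, so later reads see the newly written files.
  spark.sharedState.cacheManager.recacheByPath(spark, outputPath)
}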

How was this patch tested?

Unit test
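
As an illustration, a hedged sketch of the kind of regression this test guards against (row counts are illustrative; assumes the suite's withTempDir helper):

withTempDir { dir =>
  val path = dir.toString
  spark.range(10).write.mode("overwrite").orc(path)
  spark.read.orc(path).cache()

  // Overwrite the underlying files. Before the fix, a fresh read of the
  // same path would still be answered from the now-stale cache.
  spark.range(20).write.mode("overwrite").orc(path)
  assert(spark.read.orc(path).count() === 20)
}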

TableIdentifier("tmp"), ignoreIfNotExists = true, purge = false)
}

test("SPARK-15678: not use cache on overwrite") {
@gengliangwang (Member Author)

Delete the original test case for the following reasons:

  1. The cache invalidation is for all file sources.
  2. The two test cases in ParquetQuerySuite are covered by the new test cases of this PR.
  3. The Parquet data source is not migrated to V2 yet.

Member

+1

@gengliangwang (Member Author)

@cloud-fan

@SparkQA commented Apr 8, 2019

Test build #104386 has finished for PR 24318 at commit 039324d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").orc(path)
@dongjoon-hyun (Member), Apr 9, 2019

@gengliangwang. In this suite, we need to test all available data sources like the following, instead of using .orc.

Seq("csv", "orc", "text").foreach { format =>

@dongjoon-hyun (Member), Apr 9, 2019

Could you generalize this test case?
Also, please add the JIRA issue for the Parquet DSv2 migration as a TODO comment.

@gengliangwang (Member Author)

@dongjoon-hyun The cache invalidation is for all file sources, so testing ORC here is quite sufficient, just as only Parquet is tested in #13566.

@dongjoon-hyun (Member), Apr 9, 2019

If that is the logic, let's hold off on this until Parquet is ready to migrate. We don't need to move the test logic around from here to there.

We are going to migrate Parquet anyway, aren't we?

@dongjoon-hyun (Member), Apr 9, 2019

In general, this test suite was designed from the beginning to verify all data sources.

withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("append").orc(path)
Member

ditto.

@dongjoon-hyun (Member) left a comment

I'm taking back my words.

@gengliangwang. You are the only one doing this migration in the community. Given that, I don't want to block you. This might be inevitable for this kind of migration.

+1, LGTM. Merged to master.

@gengliangwang (Member Author)

Thanks, @cloud-fan @dongjoon-hyun
