[SPARK-27407][SQL] File source V2: Invalidate cache data on overwrite/append #24318
Conversation
```scala
    TableIdentifier("tmp"), ignoreIfNotExists = true, purge = false)
}

test("SPARK-15678: not use cache on overwrite") {
```
Deleted the original test case for the following reasons:
- The cache invalidation applies to all file sources.
- The two test cases in ParquetQuerySuite are covered by the new test cases in this PR.
- The Parquet data source is not migrated to V2 yet.
+1
Test build #104386 has finished for PR 24318 at commit
```scala
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").orc(path)
```
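For context, here is a hedged sketch of how this test body plausibly continues past the excerpt; the cached read and the exact assertions are assumptions inferred from the PR description, not the PR's verbatim code:

```scala
    // ...continuing inside the withTempDir block from the excerpt above:
    val df = spark.read.orc(path)
    df.cache()
    assert(df.count() === 1000)
    // Overwrite the files on disk; the cached entry for `path` must be
    // invalidated so that a fresh read reflects the new data.
    spark.range(10).write.mode("overwrite").orc(path)
    assert(spark.read.orc(path).count() === 10)
  }
}
```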
@gengliangwang, in this suite we need to test all available data sources like the following, instead of using `.orc`:

```scala
Seq("csv", "orc", "text").foreach { format =>
```
Could you generalize this test case?
Also, please add a JIRA issue for the Parquet DSv2 migration as a TODO comment.
@dongjoon-hyun The cache invalidation applies to all file sources. Testing ORC here is quite sufficient, just as only Parquet was tested in #13566.
If that is the logic, let's hold off on this until Parquet is ready to migrate; we don't need to move the test logic from here to there.
We are going to migrate Parquet anyway, aren't we?
In general, this test suite was designed from the beginning to verify all data sources.
```scala
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("append").orc(path)
```
ditto.
dongjoon-hyun left a comment:
I'm taking back my words.
@gengliangwang, you are the only one in the community who is doing this migration. Given that, I don't want to block you. This might be inevitable for this kind of migration.
+1, LGTM. Merged to master.
Thanks, @cloud-fan @dongjoon-hyun
What changes were proposed in this pull request?
File source V2 currently continues, incorrectly, to use cached data even after the underlying data has been overwritten.
We should follow #13566 and fix this by invalidating and refreshing all cached data (and the associated metadata) for any DataFrame that contains the given data source path.
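For illustration, here is a minimal sketch of the invalidation step using the public `Catalog.refreshByPath` API, whose documented behavior matches the description above. `invalidateCachedPath` is a hypothetical helper; where exactly the V2 write path triggers this is an assumption, not shown in this thread:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper invoked after an overwrite/append commits its files.
def invalidateCachedPath(spark: SparkSession, path: String): Unit = {
  // Public Catalog API: invalidates and refreshes all cached data (and the
  // associated metadata) for any Dataset that contains the given path.
  spark.catalog.refreshByPath(path)
}
```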
How was this patch tested?
Unit test