[SPARK-27407][SQL] File source V2: Invalidate cache data on overwrite/append #24318
Conversation
```scala
    TableIdentifier("tmp"), ignoreIfNotExists = true, purge = false)
}

test("SPARK-15678: not use cache on overwrite") {
```
Deleted the original test case for the following reasons:
- The cache invalidation applies to all file sources.
- The two test cases in ParquetQuerySuite are covered by the new test cases in this PR.
- The Parquet data source is not migrated to V2 yet.
+1
Test build #104386 has finished for PR 24318 at commit
```scala
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").orc(path)
```
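For context, here is a hedged sketch of how this test body plausibly continues past the excerpt; the cached read and the exact assertions are assumptions inferred from the PR description, not the PR's verbatim code:

```scala
    // ...continuing inside the withTempDir block from the excerpt above:
    val df = spark.read.orc(path)
    df.cache()
    assert(df.count() === 1000)
    // Overwrite the files on disk; the cached entry for `path` must be
    // invalidated so that a fresh read reflects the new data.
    spark.range(10).write.mode("overwrite").orc(path)
    assert(spark.read.orc(path).count() === 10)
  }
}
```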
@gengliangwang, in this suite we need to test all available data sources like the following, instead of using `.orc`:

```scala
Seq("csv", "orc", "text").foreach { format =>
```
Could you generalize this test case?
Also, please add a JIRA issue for the Parquet DSv2 migration as a TODO comment.
@dongjoon-hyun The cache invalidation applies to all file sources. Testing ORC here is quite sufficient, just as only Parquet was tested in #13566.
If that is the logic, let's hold off on this until Parquet is ready to migrate; we don't need to move the test logic from here to there.
We are going to migrate Parquet anyway, aren't we?
In general, this test suite was designed from the beginning to verify all data sources.
```scala
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("append").orc(path)
```
ditto.
dongjoon-hyun left a comment:
I'm taking back my words.
@gengliangwang, you are the only one in the community who is doing this migration. Given that, I don't want to block you. This might be inevitable for this kind of migration.
+1, LGTM. Merged to master.
Thanks, @cloud-fan @dongjoon-hyun
What changes were proposed in this pull request?
File source V2 currently continues, incorrectly, to use cached data even after the underlying data has been overwritten.
We should follow #13566 and fix this by invalidating and refreshing all cached data (and the associated metadata) for any DataFrame that contains the given data source path.
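For illustration, here is a minimal sketch of the invalidation step using the public `Catalog.refreshByPath` API, whose documented behavior matches the description above. `invalidateCachedPath` is a hypothetical helper; where exactly the V2 write path triggers this is an assumption, not shown in this thread:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper invoked after an overwrite/append commits its files.
def invalidateCachedPath(spark: SparkSession, path: String): Unit = {
  // Public Catalog API: invalidates and refreshes all cached data (and the
  // associated metadata) for any Dataset that contains the given path.
  spark.catalog.refreshByPath(path)
}
```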
How was this patch tested?
Unit test