[SPARK-27407][SQL] File source V2: Invalidate cache data on overwrite/append #24318
@@ -494,6 +494,38 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo
     }
   }
 
+  test("Do not use cache on overwrite") {
+    Seq("", "orc").foreach { useV1SourceReaderList =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
+        withTempDir { dir =>
+          val path = dir.toString
+          spark.range(1000).write.mode("overwrite").orc(path)
+          val df = spark.read.orc(path).cache()
+          assert(df.count() == 1000)
+          spark.range(10).write.mode("overwrite").orc(path)
+          assert(df.count() == 10)
+          assert(spark.read.orc(path).count() == 10)
+        }
+      }
+    }
+  }
+
+  test("Do not use cache on append") {
+    Seq("", "orc").foreach { useV1SourceReaderList =>
+      withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> useV1SourceReaderList) {
+        withTempDir { dir =>
+          val path = dir.toString
+          spark.range(1000).write.mode("append").orc(path)
Member: ditto.
+          val df = spark.read.orc(path).cache()
+          assert(df.count() == 1000)
+          spark.range(10).write.mode("append").orc(path)
+          assert(df.count() == 1010)
+          assert(spark.read.orc(path).count() == 1010)
+        }
+      }
+    }
+  }
+
   test("Return correct results when data columns overlap with partition columns") {
     Seq("parquet", "orc", "json").foreach { format =>
       withTempPath { path =>
@@ -70,30 +70,6 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       TableIdentifier("tmp"), ignoreIfNotExists = true, purge = false)
   }
 
-  test("SPARK-15678: not use cache on overwrite") {
Member, Author: Delete the original test case for the following reasons:

Member: +1
-    withTempDir { dir =>
-      val path = dir.toString
-      spark.range(1000).write.mode("overwrite").parquet(path)
-      val df = spark.read.parquet(path).cache()
-      assert(df.count() == 1000)
-      spark.range(10).write.mode("overwrite").parquet(path)
-      assert(df.count() == 10)
-      assert(spark.read.parquet(path).count() == 10)
-    }
-  }
-
-  test("SPARK-15678: not use cache on append") {
-    withTempDir { dir =>
-      val path = dir.toString
-      spark.range(1000).write.mode("append").parquet(path)
-      val df = spark.read.parquet(path).cache()
-      assert(df.count() == 1000)
-      spark.range(10).write.mode("append").parquet(path)
-      assert(df.count() == 1010)
-      assert(spark.read.parquet(path).count() == 1010)
-    }
-  }
-
   test("self-join") {
     // 4 rows, cells of column 1 of row 2 and row 4 are null
     val data = (1 to 4).map { i =>
@gengliangwang. In this suite, we need to test all available data sources like the following, instead of using `.orc`.
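A minimal sketch of what such a format-parameterized check could look like (illustrative only; the format list and the exact assertions are assumptions, since the reviewer's original snippet is not preserved in this view):

```scala
// Illustrative sketch: exercise several built-in file formats via format()/load()
// instead of the hard-coded .orc shortcut. withTempDir comes from the suite's
// SQLTestUtils helpers.
Seq("parquet", "orc", "json", "csv").foreach { format =>
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").format(format).save(path)
    val df = spark.read.format(format).load(path).cache()
    assert(df.count() == 1000)
    spark.range(10).write.mode("overwrite").format(format).save(path)
    // The overwrite should invalidate the cached data.
    assert(df.count() == 10)
    assert(spark.read.format(format).load(path).count() == 10)
  }
}
```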
Could you generalize this test case? Also, please add the JIRA issue for Parquet DSv2 Migration as a TODO comment.
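One possible shape for such a generalization, sketched under assumptions (the helper name, the placeholder JIRA id, and the omission of the USE_V1_SOURCE_READER_LIST loop are all illustrative choices, not part of the PR):

```scala
// Illustrative sketch: factor the check into one helper so more formats can be
// registered as they migrate to DSv2. SPARK-XXXXX is a placeholder, not a real
// ticket number.
private def checkCacheInvalidation(format: String, mode: String, expected: Long): Unit = {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode(mode).format(format).save(path)
    val df = spark.read.format(format).load(path).cache()
    assert(df.count() == 1000)
    spark.range(10).write.mode(mode).format(format).save(path)
    assert(df.count() == expected)
    assert(spark.read.format(format).load(path).count() == expected)
  }
}

// TODO(SPARK-XXXXX): add "parquet" here once the Parquet source is migrated to DSv2.
Seq("orc").foreach { format =>
  test(s"Do not use cache on overwrite - $format") {
    checkCacheInvalidation(format, "overwrite", expected = 10)
  }
  test(s"Do not use cache on append - $format") {
    checkCacheInvalidation(format, "append", expected = 1010)
  }
}
```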
@dongjoon-hyun The cache invalidation is for all file sources. Testing ORC here is quite sufficient, just like only Parquet is tested in #13566.
If that is the logic, let's hold off on this until Parquet is ready to migrate. We don't need to move the test logic around from here to there. We are going to migrate Parquet anyway, aren't we?
In general, this test suite has been designed to verify all data sources from the beginning.