[SPARK-34203][SQL][3.0] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in v1 In-Memory catalog
#31326
…_PARTITION__` in v1 `In-Memory` catalog
### What changes were proposed in this pull request?
In this PR, I propose converting `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing them in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in the file system but as `null` values in memory, which can cause issues such as the one in SPARK-34203.
### Why are the changes needed?
`InMemoryCatalog` stores partitions in the file system in the Hive-compatible form; for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"`, but at the same time it keeps `null` as is internally. That causes the issue demonstrated by the example below:
```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory
scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```
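To make the mismatch concrete, here is a minimal sketch of the proposed normalization. It is not the actual `InMemoryCatalog` code: the object `NullPartitionSketch`, the helper `normalize`, and the toy "catalog" (a plain `Set` of partition specs) are hypothetical names used only for illustration.

```scala
// Hypothetical sketch, not Spark's implementation.
object NullPartitionSketch {
  type PartitionSpec = Map[String, String]

  // Name used in Hive-compatible layouts for partitions whose value is null.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Replace null partition values with the default partition name.
  def normalize(spec: PartitionSpec): PartitionSpec =
    spec.map { case (k, v) => k -> (if (v == null) DefaultPartitionName else v) }

  def main(args: Array[String]): Unit = {
    // Toy "catalog": specs are normalized before they are stored.
    val stored: Set[PartitionSpec] = Set(normalize(Map("p1" -> null)))

    val lookup: PartitionSpec = Map("p1" -> null)
    // The raw spec differs from the stored, normalized one ...
    println(stored.contains(lookup))
    // ... so lookups have to be normalized the same way to find it.
    println(stored.contains(normalize(lookup)))
  }
}
```

The point of the sketch is only that storage and lookup must agree on one representation of the `null` partition; which representation the catalog picks is a separate design choice.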
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in the `In-Memory` catalog:
```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1 |
+----+----+
|0 |null|
+----+----+
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []
scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```
### How was this patch tested?
Added a new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes apache#31322 from MaxGekk/insert-overwrite-null-part.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bfc0235)
Signed-off-by: Max Gekk <[email protected]>
    val nullPartValue = if (isUsingHiveMetastore) "__HIVE_DEFAULT_PARTITION__" else null
    assert(catalog.listPartitions(tableIdent).map(_.spec).toSet ==
    Set(Map("a" -> nullPartValue, "b" -> nullPartValue)))
    Set(Map("a" -> "__HIVE_DEFAULT_PARTITION__", "b" -> "__HIVE_DEFAULT_PARTITION__")))
Now the In-Memory catalog behaves like the Hive external catalog, so we don't need to distinguish them in tests.
shall we do it in master branch as well?
In master, we already have common settings in unified tests:
Line 36 in 861f8bb
    override protected def nullPartitionValue: String = "__HIVE_DEFAULT_PARTITION__"
    sql(s"CREATE TABLE $t (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1)")
    sql(s"INSERT INTO TABLE $t PARTITION (p1 = null) SELECT 0")
    checkAnswer(sql(s"SELECT * FROM $t"), Row(0, null))
    }
A simpler way is to test DROP PARTITION here.
The new test covers both the v1 In-Memory and Hive external catalogs because it runs as part of `InMemoryCatalogedDDLSuite` and `HiveCatalogedDDLSuite`.
OK, then please forward port the new test to branch-3.1, as I used this smaller change while backporting.
Here it is: #31331
Kubernetes integration test starting

Kubernetes integration test status failure
… failure

### What changes were proposed in this pull request?
Forward port changes in tests from #31326.

### Why are the changes needed?
This fixes a test failure.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31331 from MaxGekk/insert-overwrite-null-part-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Test build #134457 has finished for PR 31326 at commit

thanks, merging to 3.0!
…FAULT_PARTITION__` in v1 `In-Memory` catalog

### What changes were proposed in this pull request?
In this PR, I propose converting `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing them in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in the file system but as `null` values in memory, which can cause issues such as the one in SPARK-34203.

### Why are the changes needed?
`InMemoryCatalog` stores partitions in the file system in the Hive-compatible form; for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"`, but at the same time it keeps `null` as is internally. That causes the issue demonstrated by the example below:
```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory
scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in the `In-Memory` catalog:
```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1 |
+----+----+
|0 |null|
+----+----+
scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []
scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```

### How was this patch tested?
Added a new test to `DDLSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Authored-by: Max Gekk <max.gekkgmail.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>
(cherry picked from commit bfc0235)
Signed-off-by: Max Gekk <max.gekkgmail.com>

Closes #31326 from MaxGekk/insert-overwrite-null-part-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>