[SPARK-34203][SQL][3.0] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog #31326

MaxGekk · 2021-01-25T18:28:21Z

What changes were proposed in this pull request?

In this PR, I propose to convert null partition values to "__HIVE_DEFAULT_PARTITION__" before storing them in the In-Memory catalog. Currently, the In-Memory catalog represents null partitions as "__HIVE_DEFAULT_PARTITION__" in the file system but as null values in memory, which can cause issues such as the one reported in SPARK-34203.
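For illustration only, the conversion described above could look roughly like the sketch below. This is not the code from the patch: `NullPartitionSketch`, `normalizeSpec`, and the local `DefaultPartitionName` constant are hypothetical names, and a partition spec is assumed to be a plain `Map[String, String]`.

```scala
object NullPartitionSketch {
  // Assumption: a partition spec maps partition column names to string values.
  type PartitionSpec = Map[String, String]

  // Local constant for the sketch; the literal is the one named in this PR.
  val DefaultPartitionName: String = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: replace null values with the Hive default partition name
  // and leave every other value untouched.
  def normalizeSpec(spec: PartitionSpec): PartitionSpec =
    spec.map { case (column, value) =>
      column -> (if (value == null) DefaultPartitionName else value)
    }
}
```

With this helper, `NullPartitionSketch.normalizeSpec(Map("p1" -> null))` yields `Map("p1" -> "__HIVE_DEFAULT_PARTITION__")`, while a spec without nulls is returned with the same entries.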

Why are the changes needed?

InMemoryCatalog stores partitions in the file system in Hive-compatible form: for instance, it converts the null partition value to "__HIVE_DEFAULT_PARTITION__". At the same time, it keeps null as is internally. This mismatch causes the issue demonstrated by the example below:

$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory

scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory

scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
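
The failure above comes down to two representations of the same partition not comparing equal. The toy model below is only a sketch of that idea: `ToyCatalog`, `normalize`, and `LookupMismatchSketch` are hypothetical, and the model does not mirror Spark's internals or say which side of the real code path holds the raw null.

```scala
import scala.collection.mutable

object LookupMismatchSketch {
  type PartitionSpec = Map[String, String]
  val DefaultPartitionName: String = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical normalization: null values become the default partition name.
  def normalize(spec: PartitionSpec): PartitionSpec =
    spec.map { case (k, v) => k -> (if (v == null) DefaultPartitionName else v) }

  // Toy catalog keyed by partition spec; `onStore` is applied before storing.
  final class ToyCatalog(onStore: PartitionSpec => PartitionSpec) {
    private val partitions = mutable.Set.empty[PartitionSpec]

    def add(spec: PartitionSpec): Unit = partitions += onStore(spec)

    // Returns true only if a partition with exactly this spec was stored.
    def drop(spec: PartitionSpec): Boolean = partitions.remove(spec)
  }

  def main(args: Array[String]): Unit = {
    // One side keeps the raw null, the other uses the default name: the lookup misses.
    val raw = new ToyCatalog(identity)
    raw.add(Map("p1" -> null))
    println(raw.drop(Map("p1" -> DefaultPartitionName))) // false

    // Both sides normalized: the same logical partition is found and dropped.
    val normalized = new ToyCatalog(normalize)
    normalized.add(Map("p1" -> null))
    println(normalized.drop(normalize(Map("p1" -> null)))) // true
  }
}
```

Normalizing at the point of storage means every later lookup only has to agree on one representation.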

Does this PR introduce any user-facing change?

Yes. After the changes, ALTER TABLE .. DROP PARTITION can drop the null partition in the In-Memory catalog:

scala> spark.table("tbl").show(false)
+----+----+
|col1|p1  |
+----+----+
|0   |null|
+----+----+


scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+

How was this patch tested?

Added a new test to DDLSuite:

$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"

Authored-by: Max Gekk [email protected]
Signed-off-by: Wenchen Fan [email protected]
(cherry picked from commit bfc0235)
Signed-off-by: Max Gekk [email protected]

…_PARTITION__` in v1 `In-Memory` catalog

Added new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes apache#31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bfc0235)
Signed-off-by: Max Gekk <[email protected]>

MaxGekk · 2021-01-25T18:30:25Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

-    val nullPartValue = if (isUsingHiveMetastore) "__HIVE_DEFAULT_PARTITION__" else null
    assert(catalog.listPartitions(tableIdent).map(_.spec).toSet ==
-      Set(Map("a" -> nullPartValue, "b" -> nullPartValue)))
+      Set(Map("a" -> "__HIVE_DEFAULT_PARTITION__", "b" -> "__HIVE_DEFAULT_PARTITION__")))


Now the In-Memory catalog behaves like the Hive external catalog, so we don't need to distinguish them in tests.

Shall we do it in the master branch as well?

In master, we already have common settings in unified tests:

spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableDropPartitionSuite.scala

Line 36 in 861f8bb

override protected def nullPartitionValue: String = "__HIVE_DEFAULT_PARTITION__"

cloud-fan · 2021-01-25T18:36:19Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

-      sql(s"CREATE TABLE $t (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1)")
-      sql(s"INSERT INTO TABLE $t PARTITION (p1 = null) SELECT 0")
-      checkAnswer(sql(s"SELECT * FROM $t"), Row(0, null))
-    }


A simpler way is to test DROP PARTITION here.

The new test covers both the v1 In-Memory and Hive external catalogs because it runs as part of InMemoryCatalogedDDLSuite and HiveCatalogedDDLSuite.

OK, then please forward port the new test to branch-3.1, as I used this smaller change while backporting.

Here it is: #31331

SparkQA · 2021-01-25T19:24:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39043/

SparkQA · 2021-01-25T20:34:18Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39043/

… failure

### What changes were proposed in this pull request?
Forward port changes in tests from #31326.

### Why are the changes needed?
This fixes a test failure.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31331 from MaxGekk/insert-overwrite-null-part-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

SparkQA · 2021-01-25T22:52:48Z

Test build #134457 has finished for PR 31326 at commit 9171b0b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-26T03:45:49Z

Thanks, merging to 3.0!

…FAULT_PARTITION__` in v1 `In-Memory` catalog

Closes #31326 from MaxGekk/insert-overwrite-null-part-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

MaxGekk mentioned this pull request Jan 25, 2021

[SPARK-34203][SQL] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in v1 In-Memory catalog #31322

Closed

MaxGekk mentioned this pull request Jan 25, 2021

[SPARK-34203][SQL][TESTS][3.1][FOLLOWUP] Fix null partition values UT failure #31331

Closed

cloud-fan closed this Jan 26, 2021
