
Conversation

@MaxGekk (Member) commented Jan 25, 2021

### What changes were proposed in this pull request?

In the PR, I propose to convert `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing them in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in the file system but as `null` values in memory, which can cause issues such as SPARK-34203.
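As a rough illustration of that normalization (a minimal sketch, not the PR's exact code; `TablePartitionSpec` is Spark's `Map[String, String]` alias and `DEFAULT_PARTITION_NAME` is the Hive default partition string):
```scala
// Sketch: replace null partition values with Hive's default partition
// name before the spec is stored in the in-memory catalog.
type TablePartitionSpec = Map[String, String]
val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

def convertNullPartitionValues(spec: TablePartitionSpec): TablePartitionSpec =
  spec.map { case (k, v) => k -> (if (v == null) DEFAULT_PARTITION_NAME else v) }
```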

### Why are the changes needed?

`InMemoryCatalog` stores partitions in the file system in the Hive-compatible form; for instance, it converts a `null` partition value to `"__HIVE_DEFAULT_PARTITION__"`, but at the same time it keeps the `null` as is internally. That causes the issue demonstrated by the example below:

```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory

scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```
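For context, the Hive-compatible layout writes each partition as a `key=value` directory, so the row inserted above lands in a directory named after the default partition string. Roughly (illustrative path, assuming the default warehouse location):
```
spark-warehouse/tbl/p1=__HIVE_DEFAULT_PARTITION__/<data files>
```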

### Does this PR introduce _any_ user-facing change?

Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in the `In-Memory` catalog:

```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1  |
+----+----+
|0   |null|
+----+----+

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```

### How was this patch tested?

Added a new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
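A hedged sketch of what such a test could look like (the test title and the helpers `withNamespaceAndTable`, `defaultUsing`, and `checkAnswer` are assumptions about the suite, not the PR's exact code):
```scala
// Hypothetical shape of the added test; helper names are assumptions.
test("SPARK-34203: drop a partition with a null value") {
  withNamespaceAndTable("ns", "tbl") { t =>
    sql(s"CREATE TABLE $t (col1 INT, p1 STRING) $defaultUsing PARTITIONED BY (p1)")
    sql(s"INSERT OVERWRITE TABLE $t VALUES (0, null)")
    sql(s"ALTER TABLE $t DROP PARTITION (p1 = null)")
    checkAnswer(spark.table(t), Seq.empty)
  }
}
```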

@MaxGekk changed the title from "[SPARK-34203][SQL] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in the In-Memory catalog" to "[SPARK-34203][SQL] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in v1 In-Memory catalog" on Jan 25, 2021.

A review comment on the changed line in `convertNullPartitionValues`:
```
def convertNullPartitionValues(spec: TablePartitionSpec): TablePartitionSpec = {
-  spec.mapValues(v => if (v == null) DEFAULT_PARTITION_NAME else v).toMap
+  spec.mapValues(v => if (v == null) DEFAULT_PARTITION_NAME else v).map(identity).toMap
```

@MaxGekk (Member, Author) commented Jan 25, 2021:

Otherwise I got the "Task is not serializable" exception.
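For background (my reading, not stated in the PR): `mapValues` returns a lazy view over the original map rather than a strict `Map`, and such views are not guaranteed to be `Serializable` (for example, Scala 2.13's `MapView` is not), so capturing one in a task closure can trigger exactly this exception. Mapping with `identity` forces a strict, serializable copy. A minimal sketch:
```scala
// Sketch: `mapValues` yields a lazy view; `.map(identity).toMap`
// materializes it into a plain immutable Map that can be shipped
// with a Spark task.
val spec: Map[String, String] = Map("p1" -> null, "p2" -> "a")
val view = spec.mapValues(v => if (v == null) "__HIVE_DEFAULT_PARTITION__" else v)
val strict = view.map(identity).toMap
// strict == Map("p1" -> "__HIVE_DEFAULT_PARTITION__", "p2" -> "a")
```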

@github-actions bot added the SQL label on Jan 25, 2021.
@yaooqinn (Member)

```
spark-sql> drop table ppp;
spark-sql> create table ppp(i string, j string) using parquet partitioned by (j);
spark-sql>
         > INSERT OVERWRITE ppp VALUES ('1', null);
21/01/25 21:55:03 WARN log: Updating partition stats fast for: ppp
21/01/25 21:55:03 WARN log: Updated size to 396
spark-sql> alter table ppp drop partition(j=null);
Error in query: No partition is dropped. One partition spec 'Map(j -> null)' does not exist in table 'ppp' database 'default';
spark-sql> select version();
3.0.1 2b147c4cd50da32fe2b4167f97c8142102a0510d
spark-sql>
```

I also tested the 3.0.1 release; might this not be limited to the in-memory catalog?

@MaxGekk (Member, Author) commented Jan 25, 2021

@yaooqinn The fix for the Hive catalog was recently merged to 3.0: 16cab5c. If you build Spark from branch-3.0, it should work correctly, at least with the Hive external catalog.

The commit 2b147c4 is pretty old.

@yaooqinn (Member)

Ah yes, I verified the case locally with the Hive catalog; it has been fixed in master.

@cloud-fan (Contributor)

GA passed, merging to master/3.1! (I didn't backport further because the in-memory catalog is for testing purposes.)

@cloud-fan closed this in bfc0235 on Jan 25, 2021.
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request on Jan 25, 2021:

[SPARK-34203][SQL] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog

Closes apache#31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bfc0235)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request on Jan 25, 2021:

[SPARK-34203][SQL] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog

Closes #31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk (Member, Author) commented Jan 25, 2021

> I didn't backport further because in-memory catalog is for testing purpose

@cloud-fan Doesn't that mean it is less risky to backport this to 3.0 than #31095 for the Hive catalog, which was already merged to 3.0?

@cloud-fan (Contributor)

It's less risky, but also less useful (as it doesn't fix a real bug). If you have time to open a backport PR for 3.0, I'm OK with that too.

@SparkQA commented Jan 25, 2021

Test build #134441 has finished for PR 31322 at commit 303ce52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk added a commit to MaxGekk/spark that referenced this pull request on Jan 25, 2021:

[SPARK-34203][SQL] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog

Closes apache#31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bfc0235)
Signed-off-by: Max Gekk <[email protected]>
@MaxGekk (Member, Author) commented Jan 25, 2021

Here is the backport to 3.0: #31326

@dongjoon-hyun (Member) commented Jan 25, 2021

Hi, this causes a branch-3.1 UT failure. Please take a look at it:

- `InMemoryCatalogedDDLSuite.alter table: drop partition (datasource table)`

cc @HyukjinKwon since he is the release manager of Apache Spark 3.1.0.

@MaxGekk (Member, Author) commented Jan 25, 2021

PR #31331 is a forward port of #31326, including a fix for the test failure.

@dongjoon-hyun (Member)

Thank you, @MaxGekk!

skestle pushed a commit to skestle/spark that referenced this pull request on Feb 3, 2021:

[SPARK-34203][SQL] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog

Closes apache#31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>