Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jan 8, 2021

What changes were proposed in this pull request?

  1. Recognize null while parsing partition specs, and put null instead of "null" as partition values.
  2. For V1 catalog: replace null by __HIVE_DEFAULT_PARTITION__.

Why are the changes needed?

Currently, null in partition specs is recognized as the "null" string which could lead to incorrect results, for example:

spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false

Even we inserted a row to the partition with the null value, the resulted table doesn't contain null.

Does this PR introduce any user-facing change?

Yes. After the changes, the example above works as expected:

spark-sql> SELECT isnull(p1) FROM tbl5;
true

How was this patch tested?

By running the affected test suites:

$ build/sbt -Phive -Phive-thriftserver "test:testOnly *SQLQuerySuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"

Authored-by: Max Gekk [email protected]
Signed-off-by: Wenchen Fan [email protected]
(cherry picked from commit 157b72a)
Signed-off-by: Max Gekk [email protected]

1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For V1 catalog: replace `null` by `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` AS IS, and let catalog implementations to decide how to handle `null`s as partition values in spec.

Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even we inserted a row to the partition with the `null` value, **the resulted table doesn't contain `null`**.

Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling by Scala 2.13:
```
$  ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes apache#30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 157b72a)
Signed-off-by: Max Gekk <[email protected]>
@SparkQA
Copy link

SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38438/

@SparkQA
Copy link

SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38438/

@SparkQA
Copy link

SparkQA commented Jan 8, 2021

Test build #133849 has finished for PR 31095 at commit 4fef8be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Jan 11, 2021
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For V1 catalog: replace `null` by `__HIVE_DEFAULT_PARTITION__`.

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even we inserted a row to the partition with the `null` value, **the resulted table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *SQLQuerySuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Authored-by: Max Gekk <max.gekkgmail.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>
(cherry picked from commit 157b72a)
Signed-off-by: Max Gekk <max.gekkgmail.com>

Closes #31095 from MaxGekk/partition-spec-value-null-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to 3.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants