Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jul 30, 2020

What changes were proposed in this pull request?

Convert NULL elements of maps, structs and arrays to the "null" string while converting maps/struct/array values to strings. The SQL config spark.sql.legacy.omitNestedNullInCast.enabled controls the behaviour. When it is true, NULL elements of structs/maps/arrays will be omitted otherwise, when it is false, NULL elements will be converted to "null".

Why are the changes needed?

  1. It is impossible to distinguish empty string and null, for instance:
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
  1. Inconsistent NULL conversions for top-level values and nested columns, for instance:
scala> sql("select named_struct('c', null), null").show
+---------------------+----+
|named_struct(c, NULL)|NULL|
+---------------------+----+
|                   []|null|
+---------------------+----+
  1. .show() is different from conversions to Hive strings, and as a consequence its output is different from spark-sql (sql tests):
spark-sql> select named_struct('c', null) as struct;
{"c":null}
scala> sql("select named_struct('c', null) as struct").show
+------+
|struct|
+------+
|    []|
+------+
  1. It is impossible to distinguish empty struct/array from struct/array with null in the current implementation:
scala> Seq[Seq[String]](Seq(), Seq(null)).toDF.show()
+-----+
|value|
+-----+
|   []|
|   []|
+-----+

Does this PR introduce any user-facing change?

Yes, before:

scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+

After:

scala> Seq(Seq(""), Seq(null)).toDF().show
+------+
| value|
+------+
|    []|
|[null]|
+------+

How was this patch tested?

By existing test suite CastSuite.

@SparkQA
Copy link

SparkQA commented Jul 31, 2020

Test build #126804 has finished for PR 29311 at commit 8e3b910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 31, 2020

@cloud-fan @HyukjinKwon @maropu Please, take a look at this.

@maropu
Copy link
Member

maropu commented Jul 31, 2020

Looks a reasonable change. Actually, hive does so;

hive> select named_struct('a', 1, 'b', null);
{"a":1,"b":null}

hive> select array(1, 2, null);
[1,2,null]

hive> select map('a', 1, 'b', null);
{"a":1,"b":null}

@SparkQA
Copy link

SparkQA commented Aug 3, 2020

Test build #126989 has finished for PR 29311 at commit 6347cda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2020

Test build #126992 has finished for PR 29311 at commit 2325c7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 4, 2020

Test build #127047 has finished for PR 29311 at commit 2a711dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk added 3 commits August 4, 2020 22:09
…-to-string

# Conflicts:
#	docs/sql-migration-guide.md
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
@SparkQA
Copy link

SparkQA commented Aug 4, 2020

Test build #127064 has finished for PR 29311 at commit a686b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 5, 2020

@cloud-fan @maropu @HyukjinKwon Please, review this PR.


- In Spark 3.1, structs and maps are wrapped by the `{}` brackets in casting them to strings. For instance, the `show()` action and the `CAST` expression use such brackets. In Spark 3.0 and earlier, the `[]` brackets are used for the same purpose. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

- In Spark 3.1, `CAST` converts NULL elements of structures, arrays and maps to "null". In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passive stuff was not recommended by IntelliJ IDEA ;-) Native speakers don't like it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2020-08-05 at 09 34 31

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but let's highlight it's for "casting to string", not for general CAST

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replaced it by your sentence already.

@SparkQA
Copy link

SparkQA commented Aug 5, 2020

Test build #127083 has finished for PR 29311 at commit 9547682.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 5, 2020

Test build #127088 has finished for PR 29311 at commit 39daa90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3a437ed Aug 5, 2020
@MaxGekk MaxGekk deleted the nested-null-to-string branch December 11, 2020 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants