
Conversation

@viirya
Member

@viirya viirya commented Jul 9, 2018

What changes were proposed in this pull request?

Spark SQL currently doesn't support encoding `Option[Product]` as a top-level row, because in Spark SQL the entire top-level row can't be null.

However, for use cases like `Aggregator`, it is reasonable to use `Option[Product]` as the buffer and output column types. Due to the above limitation, this hasn't been supported so far.

This patch proposes to encode a top-level `Option[Product]` as a single struct column, so we can work around the restriction that the entire top-level row can't be null.

To summarize the encoding of `Product` and `Option[Product]`:

For `Product`:

1. At the root level, all fields are flattened into multiple columns. The `Product` can't be null, otherwise an exception is thrown.

```scala
val df = Seq((1 -> "a"), (2 -> "b")).toDF()
df.printSchema()

root
 |-- _1: integer (nullable = false)
 |-- _2: string (nullable = true)
```
2. At a non-root level, `Product` is a struct type column.

```scala
val df = Seq((1, (1 -> "a")), (2, (2 -> "b")), (3, null)).toDF()
df.printSchema()

root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: string (nullable = true)
```

For `Option[Product]`:

1. It was not supported at the root level. After this change, it is a struct type column.

```scala
val df = Seq(Some(1 -> "a"), Some(2 -> "b"), None).toDF()
df.printSchema

root
 |-- value: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: string (nullable = true)
```
2. At a non-root level, it is also a struct type column.

```scala
val df = Seq((1, Some(1 -> "a")), (2, Some(2 -> "b")), (3, None)).toDF()
df.printSchema

root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: string (nullable = true)
```
3. For use cases like `Aggregator`, it was not supported either. After this change, we support using `Option[Product]` as the buffer/output column type.

```scala
val df = Seq(
    OptionBooleanIntData("bob", Some((true, 1))),
    OptionBooleanIntData("bob", Some((false, 2))),
    OptionBooleanIntData("bob", None)).toDF()

val group = df
    .groupBy("name")
    .agg(OptionBooleanIntAggregator("isGood").toColumn.alias("isGood"))
group.printSchema

root
 |-- name: string (nullable = true)
 |-- isGood: struct (nullable = true)
 |    |-- _1: boolean (nullable = false)
 |    |-- _2: integer (nullable = false)
```

The buffer and output types of `OptionBooleanIntAggregator` are both `Option[(Boolean, Int)]`.
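For reference, here is a minimal sketch of what the data and aggregator classes used above might look like; the real definitions live in the test suite, and the reduce/merge semantics shown here (AND the booleans, sum the ints) are an assumption for illustration:

```scala
import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

case class OptionBooleanIntData(name: String, isGood: Option[(Boolean, Int)])

// Aggregates an Option[(Boolean, Int)] struct column, treating None as a missing value.
case class OptionBooleanIntAggregator(colName: String)
    extends Aggregator[Row, Option[(Boolean, Int)], Option[(Boolean, Int)]] {

  def zero: Option[(Boolean, Int)] = None

  def reduce(buffer: Option[(Boolean, Int)], row: Row): Option[(Boolean, Int)] = {
    val index = row.fieldIndex(colName)
    val value = if (row.isNullAt(index)) {
      None
    } else {
      val nested = row.getStruct(index)
      Some((nested.getBoolean(0), nested.getInt(1)))
    }
    merge(buffer, value)
  }

  def merge(b1: Option[(Boolean, Int)], b2: Option[(Boolean, Int)]): Option[(Boolean, Int)] =
    (b1, b2) match {
      case (Some((a1, n1)), Some((a2, n2))) => Some((a1 && a2, n1 + n2))
      case (Some(v), None) => Some(v)
      case (None, Some(v)) => Some(v)
      case _ => None
    }

  def finish(reduction: Option[(Boolean, Int)]): Option[(Boolean, Int)] = reduction

  // The Option[Product] encoder this PR enables, used for both buffer and output.
  def bufferEncoder: Encoder[Option[(Boolean, Int)]] = ExpressionEncoder()
  def outputEncoder: Encoder[Option[(Boolean, Int)]] = ExpressionEncoder()
}
```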

How was this patch tested?

Added test.

@SparkQA

SparkQA commented Jul 9, 2018

Test build #92729 has finished for PR 21732 at commit e1b5dee.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 9, 2018

retest this please.

@SparkQA

SparkQA commented Jul 9, 2018

Test build #92739 has finished for PR 21732 at commit e1b5dee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 9, 2018

retest this please.

@SparkQA

SparkQA commented Jul 9, 2018

Test build #92748 has finished for PR 21732 at commit e1b5dee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 9, 2018

I'm wondering whether we should add encoders for `Option` of `Product` into object `Encoders`.

@viirya
Member Author

viirya commented Jul 9, 2018

cc @cloud-fan @hvanhovell

@cloud-fan
Contributor

How about we treat the top-level `Option[Product]` as a single struct column? Then we can get rid of this limitation entirely.

@viirya
Member Author

viirya commented Jul 10, 2018

That sounds like a much bigger behavior change?

@cloud-fan
Contributor

Yes it is, but it makes the encoder framework more consistent. And turning a failing case into a runnable one is a safe behavior change.

@viirya
Member Author

viirya commented Jul 17, 2018

Non-top-level and top-level encoders for `Option[Product]` differ slightly.

As you said, a top-level `Option[Product]` can only be encoded as a single struct column.

For the non-top-level case, we can't apply the same change, because it is already a struct column. We don't want to change its current behavior from a struct column to a struct column nested in a struct column.

This means we can remove the limitation on top-level `Option[Product]`, but it doesn't help much with the `Aggregator` issue here. We still need non-top-level `Option[Product]` encoders for the `Aggregator` case.

@cloud-fan Do you want to incorporate top-level `Option[Product]` encoders into this PR too? Or should we create another JIRA & PR for it? Thanks.

@viirya
Member Author

viirya commented Jul 19, 2018

ping @cloud-fan

@cloud-fan
Contributor

Non-top-level and top-level encoders for `Option[Product]` differ slightly.

Can we treat them the same, but flatten the `Option[Product]` at the end of encoder creation?

@viirya
Member Author

viirya commented Jul 19, 2018

At the end of encoder creation? You mean at the end of calling `ExpressionEncoder.apply()`? But that is used both for top-level encoders, e.g., `Dataset[Option[Product]]`, and non-top-level encoders, e.g., an `Aggregator`'s encoder. If we flatten it, doesn't that mean a top-level `Option[Product]` would be encoded as a row, not a struct column?

}

case class OptionBooleanIntAggregator(colName: String)
    extends Aggregator[Row, Option[(Boolean, Int)], Option[(Boolean, Int)]] {
Contributor

what's the expected schema after we apply an aggregator with Option[Product] as buffer/output?

Member Author

For a non-top-level encoder, the output schema of `Option[Product]` should be a struct column.

Contributor

Assuming non-top-level, `Option[Product]` is the same as `Product`?

Member Author

Yes. For non-top-level, `Option[Product]` is the same as `Product`. The difference is the additional `WrapOption` and `UnwrapOption` around the expressions.
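A quick way to see this from the user side (a sketch; assumes a Spark session with `spark.implicits._` in scope):

```scala
// A non-top-level Option[Product] yields the same schema as a plain Product column;
// the Option only changes how nulls are wrapped/unwrapped during encoding.
val withOption = Seq((1, Some(1 -> "a")), (2, None)).toDF()
val withoutOption = Seq((1, (1 -> "a")), (2, null)).toDF()

withOption.printSchema()
withoutOption.printSchema()
// Both print:
// root
//  |-- _1: integer (nullable = false)
//  |-- _2: struct (nullable = true)
//  |    |-- _1: integer (nullable = false)
//  |    |-- _2: string (nullable = true)
```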

@viirya
Member Author

viirya commented Jul 26, 2018

ping @cloud-fan @hvanhovell

@cloud-fan
Contributor

Again, can we always support `Option[Product]`, with some special handling for the top-level encoder expression?

@viirya
Member Author

viirya commented Jul 27, 2018

@cloud-fan We can. Just wondering if you think it is good to have that in this PR too?

@cloud-fan
Contributor

This PR is just special handling for `Option[Product]` in `Aggregator`; I think we won't need it once we have the more general solution, right?

@viirya viirya changed the title [SPARK-24762][SQL] Aggregator should be able to use Option of Product encoder [SPARK-24762][SQL] Enable Option of Product encoders Jul 30, 2018
@viirya viirya force-pushed the SPARK-24762 branch 2 times, most recently from b6c6e9f to 4f5628d on July 30, 2018 02:33
@SparkQA

SparkQA commented Jul 30, 2018

Test build #93763 has finished for PR 21732 at commit 2b73b33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type

@SparkQA

SparkQA commented Jul 30, 2018

Test build #93762 has finished for PR 21732 at commit 6e32abe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type

@SparkQA

SparkQA commented Jul 30, 2018

Test build #93764 has finished for PR 21732 at commit b6c6e9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 30, 2018

Test build #93765 has finished for PR 21732 at commit 4f5628d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type

@cloud-fan
Contributor

last comment, LGTM otherwise

@SparkQA

SparkQA commented Nov 22, 2018

Test build #99177 has finished for PR 21732 at commit 29de9e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2018

Test build #99178 has finished for PR 21732 at commit dbd8678.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2018

Test build #99222 has finished for PR 21732 at commit 62fdb17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master, great work!

* flattened to top-level row, because in Spark SQL top-level row can't be null. This method
* returns true if `T` is serialized as struct and is not `Option` type.
*/
def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && !isOptionType
Contributor

can you send a followup PR to inline isOptionType if it's only used here?

Member Author

ok.

@asfgit asfgit closed this in 6339c8c Nov 26, 2018
asfgit pushed a commit that referenced this pull request Nov 27, 2018
## What changes were proposed in this pull request?

This is a follow-up of #21732. This patch inlines the `isOptionType` method.

## How was this patch tested?

Existing tests.

Closes #23143 from viirya/SPARK-24762-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon
Member

Hm .. sorry for joining this party late. I was reading and testing it by myself.

```scala
scala> Seq((1, "a"), (2, "b")).toDF.show()
+---+---+
| _1| _2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+

scala> Seq(1, 2).toDF.show()
+-----+
|value|
+-----+
|    1|
|    2|
+-----+

scala> Seq(Some((1, "a")), Some((2, "b"))).toDF.show()
+------+
| value|
+------+
|[1, a]|
|[2, b]|
+------+

scala> Seq(Some(1), Some(2)).toDF.show()
+-----+
|value|
+-----+
|    1|
|    2|
+-----+
```

I think this behaviour can actually be controversial, if we interpret `Option` as an existent/missing value. Why did we interpret `Option` as `Tuple1`?

@viirya
Member Author

viirya commented Dec 27, 2018

@HyukjinKwon What do you mean, we interpret `Option` as `Tuple1`?

@cloud-fan
Contributor

`Seq(Some(1), Some(2))` is treated the same as `Seq(1, 2)`, because it's only a single column and we can make it nullable. So the `Some` here changes nothing but the column nullability.

`Seq((1, "a"), (2, "b"))` is special, as the dataset encoder assumes the top-level `Product` will never be null, and flattens it into 2 columns. If we add `Some` here, we can't do the flattening.
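To illustrate the nullability point (a sketch; assumes `spark.implicits._` is in scope):

```scala
// Without Option, the single column is non-nullable.
Seq(1, 2).toDF.printSchema()
// root
//  |-- value: integer (nullable = false)

// With Some, only the nullability changes.
Seq(Some(1), Some(2)).toDF.printSchema()
// root
//  |-- value: integer (nullable = true)
```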

@HyukjinKwon
Member

HyukjinKwon commented Dec 27, 2018

re: #21732 (comment) I was thinking both of the below

```scala
Seq(Some((1, "a")), Some((2, "b"))).toDF.show()
Seq((1, "a"), (2, "b")).toDF.show()
```

should produce the same result, since

```scala
Seq(Some(1), Some(2)).toDF.show()
Seq(1, 2).toDF.show()
```

produce the same results anyhow. Apparently that looks like why it was disallowed.

@viirya
Member Author

viirya commented Dec 27, 2018

Thanks for @cloud-fan's explanation. So I think @HyukjinKwon you mean: why do we interpret an `Option[Product]` like `Some((1, "a")), Some((2, "b"))` as a `Tuple2`?

For the following

```scala
Seq(Some((1, "a")), Some((2, "b"))).toDF.show()
Seq((1, "a"), (2, "b")).toDF.show()
```

we can't produce the same result, since a top-level null row is not allowed in Spark.
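For instance, after this change the `Option` rows become a single nullable struct column rather than flattened columns (output sketched; exact null rendering may vary by Spark version):

```scala
Seq(Some(1 -> "a"), Some(2 -> "b"), None).toDF.show()
// +------+
// | value|
// +------+
// |[1, a]|
// |[2, b]|
// |  null|
// +------+
```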

@HyukjinKwon
Member

HyukjinKwon commented Dec 27, 2018

Yea, then why did we allow the different result? I was thinking we were going to allow this only for aggregators.

@viirya
Member Author

viirya commented Dec 27, 2018

@HyukjinKwon there was a comment #21732 (comment) about it. This was originally meant to make `Option[Product]` work for aggregators, but that made the encoder framework inconsistent, because the created encoders could be used for aggregators but not for `Dataset`. So I think the idea is to make it consistent. Users are allowed to encode `Option[Product]` as a `Dataset`, but they should be aware of the encoding.

@HyukjinKwon
Member

Hm, for aggregators, I would consider this non-root level. Looks like they use the same encoder, but they can't be the same.

@cloud-fan
Contributor

@HyukjinKwon If we could go back, I'd say we should not have this optimization that flattens a top-level `Product` when encoding. It brings the assumption that a top-level `Product` can't be null, because in Spark the top-level `Row` can't be null.

Ideally `Option[T]` should be the same as `T` if `T` is nullable. They are just 2 different ways to represent null in Scala and Java. But because of the optimization I mentioned before, top-level `Product` is the only exception.

It's too late to revert that optimization, I think we should accept this special case.

@HyukjinKwon
Member

I didn't mean that we should revert .. was just checking the PRs in my queue and was just curious.

I mean, I understood the limitation but failed to understand why it's been allowed. We exposed `Option[T <: Product]` at the top level but the limitation is still there.

@HyukjinKwon
Member

Ah, but you're saying Product at top-level is an exception and the other cases are all coherent? hmm ... okie.

@cloud-fan
Contributor

Yes, top-level `Product` is the only exception, because of the flattening trick which has been there for years.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

This was inspired while implementing apache#21732. For now, `ScalaReflection` needs to consider how `ExpressionEncoder` uses the generated serializers and deserializers, and `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems better to refactor `ExpressionEncoder`. It should make SPARK-24762 easier to do.

To summarize the proposed changes:

1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from the Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row.

So in other words, `serializerForType` and `deserializerForType` return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value.

A serializer returned by `serializerForType` will serialize an object at `row(0)` to a corresponding Spark SQL representation, e.g. primitive type, array, map, struct.

A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object of the given type.

3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type `T`. It uses them to create the serializer and deserializer for T <-> row serialization. Now `ExpressionEncoder` doesn't need to remember whether the serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer.
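A hypothetical sketch of the construction flow described above; these are internal Catalyst APIs, and the exact signatures here are assumptions based on this commit message, not a definitive rendering:

```scala
import scala.reflect.classTag
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Data(a: Int, b: String)

// serializerForType: expression serializing an object (assumed at ordinal 0) to a Spark SQL value.
val serializer = ScalaReflection.serializerForType(ScalaReflection.localTypeOf[Data])
// deserializerForType: expression deserializing a Spark SQL value (assumed at ordinal 0) to an object.
val deserializer = ScalaReflection.deserializerForType(ScalaReflection.localTypeOf[Data])

// ExpressionEncoder is built from the serializer/deserializer pair; no `flat` flag needed.
val encoder = ExpressionEncoder[Data](serializer, deserializer, classTag[Data])
```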

## How was this patch tested?

Existing tests.

Closes apache#22749 from viirya/SPARK-24762-refactor.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#21732 from viirya/SPARK-24762.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23143 from viirya/SPARK-24762-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@viirya viirya deleted the SPARK-24762 branch December 27, 2023 18:22