[SPARK-24762][SQL] Enable Option of Product encoders #21732
Conversation
Test build #92729 has finished for PR 21732 at commit

retest this please.

Test build #92739 has finished for PR 21732 at commit

retest this please.

Test build #92748 has finished for PR 21732 at commit
I'm wondering whether we should add encoders of Option of Product into object

how about we treat the top level `Option[Product]` the same as `Product`?

That sounds like a much bigger behavior change?

Yes it is, but it makes the encoder framework more consistent. And turning a failure case into a working one is a safe behavior change.
Non top-level and top-level encoders for `Option[Product]` are different. As you said, for top-level `Option[Product]` we could make this change. For the non top-level one, we can't apply the same change because it is already a struct column; we don't want to change its current behavior from a struct column to a struct column of a struct column. This means that we can only remove the limitation on top-level `Option[Product]`. @cloud-fan Do you want to incorporate top-level `Option[Product]`?

ping @cloud-fan

Can we treat them the same, but at the end of encoder creation, flatten the top-level struct?

At the end of encoder creation? You mean at the end of calling
```scala
case class OptionBooleanIntAggregator(colName: String)
  extends Aggregator[Row, Option[(Boolean, Int)], Option[(Boolean, Int)]] {
```
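For context, here is a minimal sketch of how such an aggregator might be completed; the reduce/merge semantics below (OR the booleans, sum the ints) are illustrative assumptions, not necessarily the test's exact logic.

```scala
import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

case class OptionBooleanIntAggregator(colName: String)
  extends Aggregator[Row, Option[(Boolean, Int)], Option[(Boolean, Int)]] {

  override def zero: Option[(Boolean, Int)] = None

  override def reduce(buffer: Option[(Boolean, Int)], row: Row): Option[(Boolean, Int)] = {
    val index = row.fieldIndex(colName)
    // A null cell means None; otherwise read the (Boolean, Int) struct back out.
    val value = if (row.isNullAt(index)) {
      None
    } else {
      val nested = row.getStruct(index)
      Some((nested.getBoolean(0), nested.getInt(1)))
    }
    merge(buffer, value)
  }

  // Illustrative semantics: OR the booleans together and sum the counts.
  override def merge(
      b1: Option[(Boolean, Int)],
      b2: Option[(Boolean, Int)]): Option[(Boolean, Int)] = (b1, b2) match {
    case (Some((v1, n1)), Some((v2, n2))) => Some((v1 || v2, n1 + n2))
    case (Some(v), None) => Some(v)
    case (None, Some(v)) => Some(v)
    case (None, None) => None
  }

  override def finish(reduction: Option[(Boolean, Int)]): Option[(Boolean, Int)] = reduction

  // The point of this PR: Option[(Boolean, Int)] works directly as the buffer/output encoder.
  override def bufferEncoder: Encoder[Option[(Boolean, Int)]] = ExpressionEncoder()
  override def outputEncoder: Encoder[Option[(Boolean, Int)]] = ExpressionEncoder()
}
```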
what's the expected schema after we apply an aggregator with Option[Product] as buffer/output?
For a non top-level encoder, the output schema of `Option[Product]` should be a struct column.
assuming non top level, `Option[Product]` is the same as `Product`?
Yes. For non top level, `Option[Product]` is the same as `Product`. The difference is additional `WrapOption` and `UnwrapOption` around the expressions.
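To illustrate in spark-shell (assuming `spark.implicits._` is in scope): at non-top level, `Option[(Int, String)]` and `(Int, String)` yield the same struct column schema.

```scala
scala> Seq((1, Some((2, "a")))).toDF().printSchema()
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: string (nullable = true)

scala> Seq((1, (2, "a"))).toDF().printSchema()
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: string (nullable = true)
```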
ping @cloud-fan @hvanhovell
Again, can we always support `Option[Product]`?

@cloud-fan We can. Just wondering if you think it is good to have that in this PR too?

This PR is just a special handling for top-level `Option[Product]`.
Force-pushed from b6c6e9f to 4f5628d
Test build #93763 has finished for PR 21732 at commit

Test build #93762 has finished for PR 21732 at commit

Test build #93764 has finished for PR 21732 at commit

Test build #93765 has finished for PR 21732 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala (outdated review thread, resolved)
last comment, LGTM otherwise

Test build #99177 has finished for PR 21732 at commit

Test build #99178 has finished for PR 21732 at commit

Test build #99222 has finished for PR 21732 at commit

thanks, merging to master, great work!
```scala
 * flattened to top-level row, because in Spark SQL top-level row can't be null. This method
 * returns true if `T` is serialized as struct and is not `Option` type.
 */
def isSerializedAsStructForTopLevel: Boolean = isSerializedAsStruct && !isOptionType
```
can you send a follow-up PR to inline `isOptionType` if it's only used here?
ok.
## What changes were proposed in this pull request?

This is a follow-up of #21732. This patch inlines the `isOptionType` method.

## How was this patch tested?

Existing tests.

Closes #23143 from viirya/SPARK-24762-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
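For reference, a sketch of what the inlined method might look like after that follow-up; the `isAssignableFrom` check is an assumption about how `isOptionType` was implemented, not a quote of the actual patch.

```scala
// Hypothetical inlined form: fold the former isOptionType check into its only call site.
def isSerializedAsStructForTopLevel: Boolean = {
  isSerializedAsStruct && !classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass)
}
```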
Hm .. sorry for joining this party late. I was reading and testing it by myself.

```scala
scala> Seq((1, "a"), (2, "b")).toDF.show()
+---+---+
| _1| _2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+

scala> Seq(1, 2).toDF.show()
+-----+
|value|
+-----+
|    1|
|    2|
+-----+

scala> Seq(Some((1, "a")), Some((2, "b"))).toDF.show()
+------+
| value|
+------+
|[1, a]|
|[2, b]|
+------+

scala> Seq(Some(1), Some(2)).toDF.show()
+-----+
|value|
+-----+
|    1|
|    2|
+-----+
```

I think this behaviour can actually be controversial. If we interpret `Option` as `Tuple1`, shouldn't the `Some` cases above produce the same results as the plain ones?
@HyukjinKwon What do you mean by interpreting `Option` as `Tuple1`?
re: #21732 (comment) I was thinking both of

```scala
Seq(Some((1, "a")), Some((2, "b"))).toDF.show()
Seq((1, "a"), (2, "b")).toDF.show()
```

should produce the same result, since

```scala
Seq(Some(1), Some(2)).toDF.show()
Seq(1, 2).toDF.show()
```

produce the same results anyhow. Apparently this looks like why it has been disallowed.
Thanks for @cloud-fan's explanation. So @HyukjinKwon I think you mean: why do we interpret top-level `Option[Product]` as a struct column instead of flattening it? For the `None` case we can't produce the same result, since a top-level null row is not allowed in Spark.
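For instance, with this patch a top-level `None` should become a null struct cell rather than a (disallowed) null row; the output below is what I'd expect from a spark-shell session with this change applied.

```scala
scala> Seq(Some(1 -> "a"), None).toDF().show()
+------+
| value|
+------+
|[1, a]|
|  null|
+------+
```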
Yea, then why did we allow the different result? I was thinking we're going to allow this only for aggregators.
@HyukjinKwon there was a comment #21732 (comment) for it. This was originally to make `Option[Product]` usable for aggregator buffers and outputs.
Hm, for aggregators, I would consider this as non root level. Looks like they use the same encoder, but they can't be the same.
@HyukjinKwon If we could go back, I'd say we should not have this optimization which flattens the top-level `Product` into multiple columns. Ideally top-level and non top-level should behave the same. It's too late to revert that optimization, though, so I think we should accept this special case.
I didn't mean that we should revert .. was just checking the PRs in my queue and was just curious. I mean, I understood the limitation but failed to understand why it's been allowed. We exposed
Ah, but you're saying non top-level `Option[Product]` is already exposed as a struct column?

Yes, top-level `Option[Product]` now just follows the existing non top-level behaviour.
## What changes were proposed in this pull request?

This was inspired while implementing apache#21732. For now `ScalaReflection` needs to consider how `ExpressionEncoder` uses generated serializers and deserializers, and `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems better to refactor `ExpressionEncoder`. It should make SPARK-24762 easier to do.

To summarize the proposed changes:

1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from the Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row. In other words, `serializerForType` and `deserializerForType` return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value. A serializer returned by `serializerForType` will serialize an object at `row(0)` to a corresponding Spark SQL representation, e.g. primitive type, array, map, struct. A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object of the given type.
3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type `T`. It uses them to create the serializer and deserializer for T <-> row serialization. Now `ExpressionEncoder` doesn't need to remember whether the serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer.

## How was this patch tested?

Existing tests.

Closes apache#22749 from viirya/SPARK-24762-refactor.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
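To make the refactor concrete, here is a hedged sketch of building an encoder from the atomic serializer/deserializer pair described above. The names `serializerForType` and `deserializerForType` come from the summary; the constructor shape and the `mirror` plumbing are assumptions, not verified against the source.

```scala
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.{typeTag, TypeTag}

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Sketch: construct an ExpressionEncoder[T] from the atomic expressions.
def encoderFor[T : TypeTag]: ExpressionEncoder[T] = {
  val mirror = ScalaReflection.mirror
  val tpe = typeTag[T].in(mirror).tpe
  val cls = mirror.runtimeClass(tpe)

  // Per the summary: these serialize/deserialize the value at row(0),
  // regardless of whether T is flat or struct-like.
  val serializer = ScalaReflection.serializerForType(tpe)     // assumed signature
  val deserializer = ScalaReflection.deserializerForType(tpe) // assumed signature

  new ExpressionEncoder[T](serializer, deserializer, ClassTag[T](cls)) // assumed ctor
}
```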
## What changes were proposed in this pull request?
Spark SQL doesn't support encoding `Option[Product]` as a top-level row now, because in Spark SQL an entire top-level row can't be null.

However, for use cases like Aggregator, it is reasonable to use `Option[Product]` as the buffer and output column types. Due to the above limitation, we can't do that for now.

This patch proposes to encode `Option[Product]` at top level as a single struct column, so we can work around the issue that an entire top-level row can't be null.

To summarize the encoding of `Product` and `Option[Product]`:

For `Product`: 1. At root level, all fields are flattened into multiple columns. The `Product` can't be null, otherwise it throws an exception.
```scala
val df = Seq((1 -> "a"), (2 -> "b")).toDF()
df.printSchema()
root
|-- _1: integer (nullable = false)
|-- _2: string (nullable = true)
```
2. At non-root level, `Product` is a struct type column.
```scala
val df = Seq((1, (1 -> "a")), (2, (2 -> "b")), (3, null)).toDF()
df.printSchema()
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
For `Option[Product]`: 1. It was not supported at root level. After this change, it is a struct type column.
```scala
val df = Seq(Some(1 -> "a"), Some(2 -> "b"), None).toDF()
df.printSchema
root
|-- value: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
2. At non-root level, it is also a struct type column.
```scala
val df = Seq((1, Some(1 -> "a")), (2, Some(2 -> "b")), (3, None)).toDF()
df.printSchema
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
3. For use cases like Aggregator, it was not supported either. After this change, `Option[Product]` can be used as the buffer/output column type.
```scala
val df = Seq(
OptionBooleanIntData("bob", Some((true, 1))),
OptionBooleanIntData("bob", Some((false, 2))),
OptionBooleanIntData("bob", None)).toDF()
val group = df
.groupBy("name")
.agg(OptionBooleanIntAggregator("isGood").toColumn.alias("isGood"))
group.printSchema
root
|-- name: string (nullable = true)
|-- isGood: struct (nullable = true)
| |-- _1: boolean (nullable = false)
| |-- _2: integer (nullable = false)
```
The buffer and output types of `OptionBooleanIntAggregator` are both `Option[(Boolean, Int)]`.
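Under the illustrative merge semantics sketched earlier in this thread (OR the booleans, sum the ints), the aggregation above would yield something like the following; the exact value depends on the aggregator's real merge logic.

```scala
group.show()
// +----+---------+
// |name|   isGood|
// +----+---------+
// | bob|[true, 3]|
// +----+---------+
```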
## How was this patch tested?
Added test.
Closes apache#21732 from viirya/SPARK-24762.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>