
Conversation

@HeartSaVioR
Contributor

What changes were proposed in this pull request?

This patch proposes refactoring the serializerFor method between ScalaReflection and JavaTypeInference, making it consistent with what we refactored for deserializerFor in #23854.

This patch also extracts the logic for recording the walked type path, since that logic is duplicated across serializerFor and deserializerFor in both ScalaReflection and JavaTypeInference.
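For reference, a minimal self-contained sketch of what such an extracted path-recording helper could look like. The class and method names here are illustrative assumptions, not necessarily the actual code in this PR:

```scala
// Hypothetical sketch of an immutable walked-type-path recorder.
// Names (recordRoot, recordField, recordArray, getPaths) are assumptions.
case class WalkedTypePath(walkedPaths: Seq[String] = Nil) {
  // Prepend the newest record so the most recent step comes first.
  private def newInstance(newRecord: String): WalkedTypePath =
    WalkedTypePath(newRecord +: walkedPaths)

  def recordRoot(className: String): WalkedTypePath =
    newInstance(s"""- root class: "$className"""")

  def recordField(className: String, fieldName: String): WalkedTypePath =
    newInstance(s"""- field (class: "$className", name: "$fieldName")""")

  def recordArray(elementClassName: String): WalkedTypePath =
    newInstance(s"""- array element class: "$elementClassName"""")

  def getPaths: Seq[String] = walkedPaths

  override def toString: String = walkedPaths.mkString("\n")
}
```

With a shared recorder like this, both serializerFor and deserializerFor in ScalaReflection and JavaTypeInference can build their error-message paths through one implementation instead of duplicating the string formatting.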

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Feb 27, 2019

Test build #102825 has finished for PR 23908 at commit 9d62bc9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

I was thinking about making this a class (storing the built path in each instance), but soon realized it requires touching other things as well and feels a bit like overkill. I'm still open to making this an individual class, so please let me know if that sounds better.

Contributor

making it a class looks better, as it needs to accumulate the walked type path.

Contributor Author

Thanks for the support! Just addressed.

@fottey

fottey commented Feb 27, 2019

As an outside observer, would this refactoring allow the method ScalaReflection.serializerFor to handle arbitrary types that conform to the Java bean convention, and/or common Java-specific types, such as java.util.List?

I recently discovered that, because most of the common Scala implicit encoders reduce to ExpressionEncoder's apply method, it's very difficult to work with arbitrary Java bean types in the Dataset API.

Specifically, given a Java bean type, MyBean, and an implicit encoder for that bean type in scope, the existing Spark 2.4.0 machinery can't synthesize a valid encoder at runtime for hybrid Scala/Java types, like Seq[MyBean], or tuple types like (Int, MyBean), despite the fact that we have encoders for Seq[_], Tuple2[_, _], and MyBean available separately.

While it may be unreasonable to solve the problem generically across all potential classes, it would be really nice if ExpressionEncoder's apply method could somehow detect and support at least java beans and java.util.Lists at runtime...

See here on Stack Overflow for more details...

See the code examples below:

import com.example.MyBean

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object Example {
    case class Test()

    def main(args: Array[String]): Unit = {
        val spark: SparkSession = ???

        import spark.implicits._

        // Works today after the above implicit import
        val ds1: Dataset[Seq[Test]] = Seq(Seq(Test()), Seq(Test())).toDS

        // DOES NOT WORK:
        // ExpressionEncoder's apply method cannot handle type MyBean!
        // implicit def newMyBeanExpressionEncoder: Encoder[MyBean] = ExpressionEncoder()
        //
        // Need to do the following instead:
        implicit def newMyBeanBeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean])

        // But this only allows expressing things like this:
        val ds2: Dataset[MyBean] = Seq(new MyBean(), new MyBean()).toDS

        // Due to the above limitation we CANNOT do the following, EVEN AFTER
        // newMyBeanBeanEncoder is brought into scope!
        // DOES NOT WORK:
        // val ds3: Dataset[Seq[MyBean]] = Seq(Seq(new MyBean()), Seq(new MyBean())).toDS

        // Finally, these do not work either:

        // DOES NOT WORK:
        // val ds4: Dataset[(Int, MyBean)] = Seq((0, new MyBean()), (0, new MyBean())).toDS

        // DOES NOT WORK:
        // implicit def newMyBeanSeqEncoder: Encoder[Seq[MyBean]] = ExpressionEncoder()

        // DOES NOT WORK:
        // implicit def newMyBeanListEncoder: Encoder[java.util.List[MyBean]] = ExpressionEncoder()

        // The above samples all rely on ExpressionEncoder being able to
        // handle every type in the expression. It currently seems to work for:
        // - case classes
        // - tuples
        // - scala.Product
        // - Scala "primitives"
        // - other common types with encoders... BUT NOT Java beans
        //   or java.util.List... :'(
    }
}

@SparkQA

SparkQA commented Feb 27, 2019

Test build #102828 has finished for PR 23908 at commit 3e17117.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2019

Test build #102829 has finished for PR 23908 at commit d7b8292.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Feb 27, 2019

@fottey
Could you file a JIRA issue for your case? Refactoring is normally done without changing behavior, or at most with some minor changes, so I think it's out of scope for this PR.

Someone may want to take a look at it, or I may spend time on it myself. I'd just like to limit the scope of concerns.

@SparkQA

SparkQA commented Feb 27, 2019

Test build #102833 has finished for PR 23908 at commit 01b7c41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Some test failures occurred just because of one extra newline. Fixed.
Some other test failures complain that plans do not match, though the string representations of the two plans look the same. Need to investigate.

@HeartSaVioR
Contributor Author

The default implementation of equals in WalkedTypePath was affecting plan comparison. Fixed.

@SparkQA

SparkQA commented Feb 28, 2019

Test build #102841 has finished for PR 23908 at commit 0d42fb4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This belongs to the previous PR: why not just let the caller side create the expression and pass it to deserializerForWithNullSafety?

Contributor Author

@HeartSaVioR HeartSaVioR Feb 28, 2019

Yeah, I might have overcomplicated it. Not a big deal, and it looks simpler. Will address. Thanks!

Contributor

why not use recordRoot here?

Contributor

shall we use a mutable list for better performance?

Contributor Author

It was to address diverged paths for map key and value, but we can also copy the instance by cloning the internal list if necessary. Will address.

Contributor Author

Addressed via 90df8a3. I found it a bit complicated to maintain the list without polluting it, so please take a look at the change and let me know if you would like to roll back to the immutable one, in case the performance gain doesn't seem worth the added complexity.
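To make the trade-off being discussed concrete, here is a hypothetical, self-contained sketch of the mutable variant (not the actual implementation in the commit). The hazard is that diverging branches, such as a map's key path versus its value path, must remember to call copy(), or they mutate shared state:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical mutable walked-type-path recorder, for illustration only.
class MutableWalkedTypePath {
  private val paths = ArrayBuffer.empty[String]

  // Mutates this instance in place and returns it for chaining.
  def record(entry: String): this.type = {
    paths += entry
    this
  }

  // Defensive copy needed whenever the type walk branches.
  def copy(): MutableWalkedTypePath = {
    val cloned = new MutableWalkedTypePath
    cloned.paths ++= paths
    cloned
  }

  def getPaths: Seq[String] = paths.toList
}
```

Forgetting a single copy() at a branch point silently corrupts every path derived from the shared buffer, which is exactly the maintenance burden mentioned above.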

@SparkQA

SparkQA commented Feb 28, 2019

Test build #102842 has finished for PR 23908 at commit 8e62f4c.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WalkedTypePath(walkedPaths: Seq[String] = Nil) extends Serializable

Contributor

So this is the same as expressionWithNullSafety?

Contributor Author

Yeah, now it's the same. I've left a comment on #23916. Btw, I feel we'd be better off focusing review on one channel, either this PR or #23916.

Member

I didn't notice you also made a lot of changes to DeserializerBuildHelper in this PR. There might be conflicts if #23916 continues, so I will close it.

Contributor Author

I think expressionWithNullSafety is the more general name, so it might be the preferred one, but deserializerForWithNullSafety is also a good name because we have the related method deserializerForWithNullSafetyAndUpcast.

So it's a matter of preference and either can be removed. Which method would we prefer to keep?

Contributor Author

Thanks @viirya, please feel free to comment even if it belongs to the previous PR. Thanks again!

Member

@viirya viirya Feb 28, 2019

Why did you remove funcForCreatingNewExpr from this and switch to passing in the created expression (deserializer)?

I think the previous deserializerForWithNullSafety is more consistent with deserializerForWithNullSafetyAndUpcast.

Contributor Author

I tend to follow suggestions (#23908 (comment)) unless I have a strong opinion, as I'm fairly new to contributing to the SQL area. For consistency I agree that having the func is better, but for simplicity we can remove it, effectively inlining it. Either is reasonable.

Contributor

I think we just need to keep expressionWithNullSafety. I don't see why we have to have two methods for deserializerFor. Leaving only deserializerForWithNullSafetyAndUpcast is fine.

Contributor Author

Left deserializerForWithNullSafetyAndUpcast and expressionWithNullSafety since both are used in multiple places. Please let me know if it doesn't work.

@viirya
Member

viirya commented Feb 28, 2019

So this is a pure refactoring PR and doesn't address a bug, right?

Member

Seems these helper methods don't reduce code and just add one more wrapper around calling Invoke. Are they needed?

Contributor Author

This is not for reducing code; this is for consistency. These methods ensure we serialize/deserialize things consistently between ScalaReflection and JavaTypeInference when the type is the same.

@HeartSaVioR
Contributor Author

So this is a pure refactoring PR and doesn't address a bug, right?

Yes, and it makes things consistent between deserializerFor and serializerFor.

Member

Doesn't this revert what you did in previous PR?

Contributor Author

@HeartSaVioR HeartSaVioR Feb 28, 2019

I got a review comment for the previous PR here as well - this reflects replacing the function with the applied expression, which ends up making deserializerForWithNullSafety and expressionWithNullSafety the same.

@SparkQA

SparkQA commented Feb 28, 2019

Test build #102850 has finished for PR 23908 at commit 90df8a3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WalkedTypePath() extends Serializable
  • case class AssertNotNull(child: Expression, walkedTypePath: WalkedTypePath = WalkedTypePath())

@SparkQA

SparkQA commented Mar 1, 2019

Test build #102882 has finished for PR 23908 at commit ff7512b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 1, 2019

Nice work, @HeartSaVioR! Btw, this PR consists of the two parts you described in the PR description. If so, how about splitting it into two PRs for easier review: refactoring the code for consistency between ScalaReflection and JavaTypeInference, and then adding WalkedTypePath?

expr
} else {
AssertNotNull(expr, walkedTypePath)
AssertNotNull(expr, walkedTypePath.copy())
Contributor

We can let AssertNotNull take a Seq[String], to force us to copy the WalkedTypePath when creating AssertNotNull

case _: ArrayType => expr
case _: MapType => expr
case _ => UpCast(expr, expected, walkedTypePath)
case _ => UpCast(expr, expected, walkedTypePath.copy())
Contributor

ditto

valueType.getType.getTypeName)

val newTypePathForKey = walkedTypePath.copy()
val newTypePathForValue = walkedTypePath.copy()
Contributor

Sorry for the back and forth, but it seems better to make WalkedTypePath immutable, as there are branches. It's hard to maintain, and we can easily mess it up if we forget to call copy somewhere.

Contributor Author

@HeartSaVioR HeartSaVioR Mar 1, 2019

Yeah, same understanding. No problem! I'll revert back to making WalkedTypePath immutable.
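A minimal, self-contained illustration (with hypothetical names, not the PR's actual code) of why the immutable design is safer when the type walk branches: diverging recorders share the prefix without any defensive copy.

```scala
// Hypothetical immutable path recorder: record() returns a new value,
// so branches can never corrupt each other.
case class Path(entries: Seq[String] = Nil) {
  def record(entry: String): Path = Path(entry +: entries)
}

object BranchingDemo {
  val base: Path = Path().record("- map type")
  val keyPath: Path = base.record("- map key")     // independent branch
  val valuePath: Path = base.record("- map value") // base stays untouched
}
```

There is nothing to forget at a branch point, which is exactly the failure mode the mutable version invites.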

@HeartSaVioR
Contributor Author

@maropu I like the idea of splitting the PR, but since @cloud-fan has already provided feedback on WalkedTypePath, it might be better to hear opinions and decide. Let me first address his feedback on WalkedTypePath - even if we decide to break the PR down, that work would still be needed.

This reverts commit c67826a.

NOTE: there's a conflict which keeps the revert commit from being a clean revert, but WalkedTypePath is clearly reverted.
@cloud-fan
Contributor

I'm OK both ways. Since the PR already contains the WalkedTypePath refactor, I think it's fine to include it as it's pretty intuitive.

@HeartSaVioR
Contributor Author

Thanks! I'll keep it as it is. How about applying DeserializerBuildHelper/SerializerBuildHelper to RowEncoder? Would it be better to have that as a separate PR?

"fromPrimitiveArray",
input :: Nil,
returnNullable = false)
createSerializerForPrimitiveArray(input, dt)
Contributor

Seems this branch is missing on the Java side. We can address it in a follow-up.

Contributor Author

Raised PR: #24015

case class UpCast(
child: Expression,
dataType: DataType,
walkedTypePath: WalkedTypePath = new WalkedTypePath())
Contributor

can we keep it Seq[String]? When we reach here, the walkedTypePath is only needed for logging/error message, and we don't need the WalkedTypePath class to help accumulate the paths.

* non-null `s`, `s.i` can't be null.
*/
case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)
case class AssertNotNull(child: Expression, walkedTypePath: WalkedTypePath = new WalkedTypePath())
Contributor

ditto

@SparkQA

SparkQA commented Mar 1, 2019

Test build #102899 has finished for PR 23908 at commit 852debd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2019

Test build #102902 has finished for PR 23908 at commit 578d8fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2019

Test build #102906 has finished for PR 23908 at commit 20e8d5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)

val inputObject = BoundReference(0, ObjectType(beanClass), nullable = true)
val nullSafeInput = AssertNotNull(inputObject, Seq("top level input bean"))
val nullSafeInput = AssertNotNull(inputObject,
WalkedTypePath().recordRoot("top level input bean").getPaths)
Contributor

nit: we can keep it unchanged.

Contributor Author

Ah yes, that's not even the same. Will revert.

// For input object of Product type, we can't encode it to row if it's null, as Spark SQL
// doesn't allow top-level row to be null, only its columns can be null.
AssertNotNull(r, Seq("top level Product or row object"))
AssertNotNull(r, WalkedTypePath().recordRoot("top level Product or row object").getPaths)
Contributor

nit: we can keep it unchanged.

case class UpCast(
child: Expression,
dataType: DataType,
walkedTypePath: Seq[String] = Nil)
Contributor

can we revert the code style change?

import org.apache.spark.serializer._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection}
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection, WalkedTypePath}
Contributor

unnecessary change

@SparkQA

SparkQA commented Mar 2, 2019

Test build #102936 has finished for PR 23908 at commit 50c2ddc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil)

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 34f6066 Mar 4, 2019
@HeartSaVioR
Contributor Author

Thanks all for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-27001 branch March 4, 2019 03:04
cloud-fan pushed a commit that referenced this pull request Mar 8, 2019
## What changes were proposed in this pull request?

This is a follow-up PR which addresses a review comment in the PR for SPARK-27001:
#23908 (comment)

This patch proposes handling the primitive array type in the serializer - instead of delegating it to the generic path, Spark now handles it efficiently as a primitive array.

## How was this patch tested?

UTs modified to include primitive arrays.

Closes #24015 from HeartSaVioR/SPARK-27001-FOLLOW-UP-java-primitive-array.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>