[SPARK-27001][SQL] Refactor "serializerFor" method between ScalaReflection and JavaTypeInference #23908
Test build #102825 has finished for PR 23908 at commit
Force-pushed from 9d62bc9 to 3e17117.
I considered making this a class (storing the built path in each instance), but soon realized that it would require touching other things as well, which felt like overkill. I'm still open to making this a separate class, so please let me know if that sounds better.
making it a class looks better, as it needs to accumulate the walked type path.
Thanks for the support! Just addressed.
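To illustrate the idea of a class that accumulates the walked type path, here is a minimal sketch. `TypePath` and its step strings are hypothetical and are not Spark's actual `WalkedTypePath`; only the method names `recordRoot` and `getPaths` appear in the diff of this PR, while `recordField` and `recordArray` are assumed here for illustration.

```scala
// Hypothetical sketch, NOT Spark's WalkedTypePath.
// Every record* call returns a new instance, so the receiver is never modified.
final case class TypePath(steps: Vector[String] = Vector.empty) {
  // Starts a fresh path at the root class.
  def recordRoot(className: String): TypePath =
    TypePath(Vector(s"root class: $className"))

  // Appends a step for descending into a field (assumed method name).
  def recordField(className: String, fieldName: String): TypePath =
    TypePath(steps :+ s"field (class: $className, name: $fieldName)")

  // Appends a step for descending into array elements (assumed method name).
  def recordArray(elementClassName: String): TypePath =
    TypePath(steps :+ s"array element class: $elementClassName")

  // Exposes the accumulated steps, oldest first, as a plain sequence.
  def getPaths: Seq[String] = steps
}

object TypePathExample {
  def main(args: Array[String]): Unit = {
    val root = TypePath().recordRoot("com.example.Person")
    // Two branches derived from the same root do not affect each other.
    val name = root.recordField("java.lang.String", "name")
    val tags = root.recordField("scala.collection.Seq", "tags").recordArray("java.lang.String")
    println(name.getPaths.mkString("\n"))
    println(tags.getPaths.mkString("\n"))
  }
}
```

Because each instance carries its own steps, the path built so far can be handed to a recursive call without the caller having to pass and re-assemble a separate sequence.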
As an outside observer, would this refactoring allow the method I recently discovered that because most of the common Scala implicit encoders reduce to Specifically, given a java bean type, While it may be unreasonable to solve the problem generically across all potential classes, it would be really nice if See here on Stackoverflow for more details... See below code examples:

import com.example.MyBean
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object Example {
  case class Test()

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ???
    import spark.implicits._

    // Works today after above implicit import
    val nestedCaseClasses: Dataset[Seq[Test]] = Seq(Seq(Test()), Seq(Test()), ...).toDS

    // DOES NOT WORK
    // ExpressionEncoder's apply method cannot handle type MyBean!
    implicit def newMyBeanExpressionEncoder: Encoder[MyBean] = ExpressionEncoder()

    // Need to do the following:
    implicit def newMyBeanBeanEncoder: Encoder[MyBean] = Encoders.bean(classOf[MyBean])

    // But this only allows expressing things like this:
    val beans: Dataset[MyBean] = Seq(new MyBean(), new MyBean(), ...).toDS

    // Due to the above limitation we CANNOT do the following, EVEN AFTER
    // newMyBeanBeanEncoder is brought into scope!
    // DOES NOT WORK
    val nestedBeans: Dataset[Seq[MyBean]] = Seq(Seq(new MyBean()), Seq(new MyBean()), ...).toDS

    // Finally, these do not work:
    // DOES NOT WORK
    val tupledBeans: Dataset[(Int, MyBean)] = Seq((0, new MyBean()), (0, new MyBean()), ...).toDS
    // DOES NOT WORK
    implicit def newMyBeanSeqEncoder: Encoder[Seq[MyBean]] = ExpressionEncoder()
    // DOES NOT WORK
    implicit def newMyBeanListEncoder: Encoder[java.util.List[MyBean]] = ExpressionEncoder()

    // The above samples all rely on ExpressionEncoder
    // being able to handle every type in the expression...
    // currently seems to work for:
    // - case classes
    // - tuples
    // - scala.Product
    // - scala "primitives"
    // other common types with encoders... BUT NOT java beans or java.util.List... :'(
  }
}
Test build #102828 has finished for PR 23908 at commit
Test build #102829 has finished for PR 23908 at commit
@fottey Someone may want to take a look at it, or I may spend some time looking into it myself. I would just like to limit the scope of concerns for this PR.
Test build #102833 has finished for PR 23908 at commit
Some test failures occurred just because of one extra newline. Fixed.
The default implementation of
Test build #102841 has finished for PR 23908 at commit
This belongs to the previous PR: why not just let the caller create the expression and pass it to deserializerForWithNullSafety?
Yeah, I may have overcomplicated it. It's not a big deal, and your suggestion looks simpler. Will address. Thanks!
why not use recordRoot here?
shall we use a mutable list for better performance?
The immutable list was there to handle the diverging paths for map key and value, but we can also copy the instance by cloning the internal list where necessary. Will address.
Addressed via 90df8a3. I found it a bit complicated to maintain the list without polluting it, so please take a look at the change and let me know if you would like to roll back to the immutable one, in case the performance gain doesn't outweigh the added complexity.
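The "pollution" concern with a mutable list can be shown with a small sketch. `MutableTypePath` below is a hypothetical stand-in, not the class from this PR; it only assumes a mutable buffer plus a `copy()` method that clones it, as described in the comments above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical mutable variant, for illustration only.
final class MutableTypePath(private val steps: ArrayBuffer[String] = ArrayBuffer.empty[String]) {
  // Mutates the receiver in place and returns it.
  def record(step: String): MutableTypePath = { steps += step; this }

  // Clones the internal buffer so that the copy evolves independently.
  def copy(): MutableTypePath = new MutableTypePath(steps.clone())

  // Returns an immutable snapshot of the current steps.
  def getPaths: Seq[String] = steps.toList
}

object MutablePathPitfall {
  // Key and value branches share one instance: the value path also contains the key step.
  def sharedBranches(): Seq[String] = {
    val path = new MutableTypePath().record("root class: Map")
    path.record("map key class: String")
    path.record("map value class: Int").getPaths
  }

  // Each branch works on its own copy, so the paths stay separate.
  def copiedBranches(): (Seq[String], Seq[String]) = {
    val path = new MutableTypePath().record("root class: Map")
    val keyPath = path.copy().record("map key class: String")
    val valuePath = path.copy().record("map value class: Int")
    (keyPath.getPaths, valuePath.getPaths)
  }

  def main(args: Array[String]): Unit = {
    println(sharedBranches())
    println(copiedBranches())
  }
}
```

The mutable version avoids allocating a new collection per step, but every branch point needs an explicit `copy()`; forgetting one silently produces a wrong path rather than an error.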
Test build #102842 has finished for PR 23908 at commit
So this is the same as expressionWithNullSafety?
I didn't notice that you also made a lot of changes to DeserializerBuildHelper in this PR. There might be conflicts if I continue with #23916, so I will close it.
I think expressionWithNullSafety is the more general name, so it might be the preferred one, but deserializerForWithNullSafety is also a good name because we have the related method deserializerForWithNullSafetyAndUpcast.
So it's a matter of preference, and either can be removed. Which method would we prefer to keep?
Thanks @viirya, please feel free to comment even if it belongs to the previous PR. Thanks again!
Why did you remove funcForCreatingNewExpr from this and switch to passing in the created expression (deserializer)?
I think the previous deserializerForWithNullSafety is more consistent with deserializerForWithNullSafetyAndUpcast.
I tend to follow the suggestion (#23908 (comment)) unless I have a strong opinion, as I'm fairly new to contributing to the SQL area. I agree that keeping the function is better for consistency, but for simplicity we can remove it and apply it inline. Either is reasonable.
I think we just need to keep expressionWithNullSafety. I don't see why we have to have 2 methods for deserializeFor. Leaving only a deserializerForWithNullSafetyAndUpcast is fine.
Left deserializerForWithNullSafetyAndUpcast and expressionWithNullSafety since both are used in multiple places. Please let me know if it doesn't work.
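For readers following the naming discussion, the behaviour being debated can be sketched roughly as below. The toy `Expr` types are not Catalyst's classes, and the exact signature of `expressionWithNullSafety` is an assumption based on the diff fragments in this thread (return the expression unchanged when it is nullable, otherwise wrap it in `AssertNotNull` together with the walked type path).

```scala
// Toy expression tree; these are NOT Catalyst's Expression classes.
sealed trait Expr
final case class Column(name: String) extends Expr
final case class AssertNotNull(child: Expr, walkedTypePath: Seq[String]) extends Expr

object NullSafetySketch {
  // Assumed shape of the helper: the caller builds `expr` first and passes it in,
  // instead of passing a function that creates the expression.
  def expressionWithNullSafety(
      expr: Expr,
      nullable: Boolean,
      walkedTypePath: Seq[String]): Expr = {
    if (nullable) {
      expr
    } else {
      // Non-nullable targets get a runtime null check that can report the type path.
      AssertNotNull(expr, walkedTypePath)
    }
  }

  def main(args: Array[String]): Unit = {
    val path = Seq("root class: com.example.Person", "field: age")
    println(expressionWithNullSafety(Column("age"), nullable = false, path))
    println(expressionWithNullSafety(Column("nickname"), nullable = true, path))
  }
}
```

With the expression created by the caller, a separate deserializer-specific wrapper has nothing left to add, which is why the two methods ended up equivalent.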
So this is purely a refactoring PR and doesn't address any bug, right?
Seems these helper methods don't reduce code and just add one more wrapper around calling Invoke. Are they needed?
This is not for reducing code; it is for consistency. These methods ensure that we serialize / deserialize values consistently between ScalaReflection and JavaTypeInference when the type is the same.
Yes, and make things consistent between |
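The consistency argument can be sketched with a toy example: two front ends that inspect types differently but delegate to one shared helper, so the same type always yields the same serializer expression. All names below are hypothetical; `createSerializerForString` is only modeled on the naming of `createSerializerForPrimitiveArray` seen in this PR, and the expression classes are not Catalyst's.

```scala
// Toy expression tree; these are NOT Catalyst's Expression classes.
sealed trait SerExpr
final case class InputObject(name: String) extends SerExpr
final case class StaticCall(owner: String, method: String, arg: SerExpr) extends SerExpr

// Shared helper: the single place that decides how a String is serialized.
object SharedSerializerHelper {
  def createSerializerForString(input: SerExpr): SerExpr =
    StaticCall("UTF8String", "fromString", input)
}

// Stand-in for the Scala reflection front end (dispatches on a type name).
object ScalaFrontEnd {
  def serializerFor(typeName: String, input: SerExpr): Option[SerExpr] = typeName match {
    case "String" => Some(SharedSerializerHelper.createSerializerForString(input))
    case _ => None
  }
}

// Stand-in for the Java type inference front end (dispatches on a Class).
object JavaFrontEnd {
  def serializerFor(cls: Class[_], input: SerExpr): Option[SerExpr] =
    if (cls == classOf[String]) Some(SharedSerializerHelper.createSerializerForString(input))
    else None
}

object ConsistencyExample {
  def main(args: Array[String]): Unit = {
    val in = InputObject("value")
    // Both front ends build the expression through the same helper.
    println(ScalaFrontEnd.serializerFor("String", in))
    println(JavaFrontEnd.serializerFor(classOf[String], in))
  }
}
```

The helper is a thin wrapper and saves no lines, but a change to how a type is serialized then only has to be made in one place.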
Doesn't this revert what you did in previous PR?
I got a review comment for the previous PR here as well. This change reflects replacing the function with an already-applied expression, which ends up making deserializerForWithNullSafety and expressionWithNullSafety the same.
Test build #102850 has finished for PR 23908 at commit
Test build #102882 has finished for PR 23908 at commit
…ction and JavaTypeInference
Nice work, @HeartSaVioR! By the way, this PR consists of the two parts you described in the PR description, right? If so, how about splitting it into two PRs for easier review: one refactoring the code for consistency between ScalaReflection and JavaTypeInference, and then one adding WalkedTypePath?
      expr
    } else {
-     AssertNotNull(expr, walkedTypePath)
+     AssertNotNull(expr, walkedTypePath.copy())
We can let AssertNotNull take a Seq[String], to force us to copy the WalkedTypePath when creating AssertNotNull
  case _: ArrayType => expr
  case _: MapType => expr
- case _ => UpCast(expr, expected, walkedTypePath)
+ case _ => UpCast(expr, expected, walkedTypePath.copy())
ditto
  valueType.getType.getTypeName)

  val newTypePathForKey = walkedTypePath.copy()
  val newTypePathForValue = walkedTypePath.copy()
Sorry for the back and forth, but it seems better to make WalkedTypePath immutable, as there are branches. The mutable version is hard to maintain, and we can easily mess it up if we forget to call copy somewhere.
Yeah, same understanding. No problem! I'll revert WalkedTypePath back to the immutable version.
@maropu I like the idea of splitting the PR, but since @cloud-fan has already provided feedback on WalkedTypePath, it might be better to hear his opinion and then decide. Let me first address his feedback on WalkedTypePath; even if we decide to split the PR, that work would be needed anyway.
This reverts commit c67826a. NOTE: a conflict means this revert commit is not as clean a revert as before, but the WalkedTypePath change is fully reverted.
…f3a228e4ff6d47bfd6f0ed98ad2b964)
I'm OK both ways. Since the PR already contains the
Thanks! I'll keep it as it is. How about applying
-   "fromPrimitiveArray",
-   input :: Nil,
-   returnNullable = false)
+ createSerializerForPrimitiveArray(input, dt)
It seems this branch is missing on the Java side. We can address it in a follow-up.
Raised PR: #24015
  case class UpCast(
      child: Expression,
      dataType: DataType,
      walkedTypePath: WalkedTypePath = new WalkedTypePath())
can we keep it Seq[String]? When we reach here, the walkedTypePath is only needed for logging/error message, and we don't need the WalkedTypePath class to help accumulate the paths.
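A small sketch of why a plain `Seq[String]` is enough at this point: once the expression exists, the path is only read back to build a message. `UpCastFailure` and its message wording are hypothetical and only illustrate the idea; they are not Spark's `UpCast` or its actual error text.

```scala
// Hypothetical holder for the information needed to report a failed up-cast.
// The path is an immutable snapshot; nothing here needs to accumulate further steps.
final case class UpCastFailure(
    from: String,
    to: String,
    walkedTypePath: Seq[String] = Nil) {

  // Renders the path one step per line under a short header (assumed wording).
  def message: String =
    s"Cannot up cast from $from to $to.\n" +
      "The type path of the target object is:\n" +
      walkedTypePath.mkString("", "\n", "\n")
}

object UpCastFailureExample {
  def main(args: Array[String]): Unit = {
    val failure = UpCastFailure(
      from = "bigint",
      to = "int",
      walkedTypePath = Seq("- root class: com.example.Person", "- field: age"))
    print(failure.message)
  }
}
```

Keeping the builder class out of the expression's signature also means that a caller has to hand over a finished snapshot, which avoids sharing a path that is still being extended.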
   * non-null `s`, `s.i` can't be null.
   */
- case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)
+ case class AssertNotNull(child: Expression, walkedTypePath: WalkedTypePath = new WalkedTypePath())
ditto
Test build #102899 has finished for PR 23908 at commit
Test build #102902 has finished for PR 23908 at commit
Test build #102906 has finished for PR 23908 at commit
  val inputObject = BoundReference(0, ObjectType(beanClass), nullable = true)
- val nullSafeInput = AssertNotNull(inputObject, Seq("top level input bean"))
+ val nullSafeInput = AssertNotNull(inputObject,
+   WalkedTypePath().recordRoot("top level input bean").getPaths)
nit: we can keep it unchanged.
Ah yes, that's not even the same. Will revert.
  // For input object of Product type, we can't encode it to row if it's null, as Spark SQL
  // doesn't allow top-level row to be null, only its columns can be null.
- AssertNotNull(r, Seq("top level Product or row object"))
+ AssertNotNull(r, WalkedTypePath().recordRoot("top level Product or row object").getPaths)
nit: we can keep it unchanged.
  case class UpCast(
      child: Expression,
      dataType: DataType,
      walkedTypePath: Seq[String] = Nil)
can we revert the code style change?
  import org.apache.spark.serializer._
  import org.apache.spark.sql.Row
- import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection}
+ import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection, WalkedTypePath}
unnecessary change
Test build #102936 has finished for PR 23908 at commit
thanks, merging to master!
Thanks all for reviewing and merging!
## What changes were proposed in this pull request?

This is a follow-up PR which addresses a review comment in the PR for SPARK-27001: #23908 (comment)

This patch proposes handling primitive array types in the serializer: instead of handing them to the generic path, Spark now handles them efficiently as primitive arrays.

## How was this patch tested?

UT modified to include primitive array.

Closes #24015 from HeartSaVioR/SPARK-27001-FOLLOW-UP-java-primitive-array.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This patch proposes refactoring the serializerFor method between ScalaReflection and JavaTypeInference, being consistent with what we refactored for deserializerFor in #23854.

This patch also extracts the logic for recording the walked type path, since the logic is duplicated across serializerFor and deserializerFor with ScalaReflection and JavaTypeInference.

## How was this patch tested?

Existing tests.