-
Notifications
You must be signed in to change notification settings - Fork 29k
[WIP][SPARK-25044][SQL] Address translation of LMF closure primitive args to Object in Scala 2.12 #22063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a WIP for now. I think this almost works, but not quite. Putting it out there for comments early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the weak point: unless there is nullability info, don't do anything to the UDF plan, but, that's probably wrong in some cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach here is to capture at registration time whether the arg types are primitive, or nullable. Not a great way to record this, but might be the least hack for now
|
Test build #94535 has finished for PR 22063 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of having 2 list, shall we just keep a Seq[ScalaReflection.Schema] or Seq[(DataType, Boolean)]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that can be optimized. I'll fix the MiMa issue too by restoring a constructor.
|
the idea LGTM |
|
Test build #94561 has finished for PR 22063 at commit
|
|
Test build #94567 has finished for PR 22063 at commit
|
|
Test build #94569 has finished for PR 22063 at commit
|
|
Test build #94571 has finished for PR 22063 at commit
|
|
Test build #94592 has finished for PR 22063 at commit
|
|
Test build #94610 has finished for PR 22063 at commit
|
|
For those following along, the problem right now is this: functions like Although the internals of the I worked around a few, but am having more trouble with others. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really surprised that this worked before...
project/MimaExcludes.scala
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Wrong Scala version
|
Test build #95027 has finished for PR 22063 at commit
|
|
Test build #95031 has finished for PR 22063 at commit
|
|
Test build #95106 has finished for PR 22063 at commit
|
|
Test build #95121 has finished for PR 22063 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, and now I understand why it worked before.
Scala 2.11 somehow can generate type tag for Any, then Spark gets the input schema from type tag Try(ScalaReflection.schemaFor(typeTag[A1]).dataType :: Nil).toOption. It will fail and input schema will be None, so no type check will be applied later.
I think it makes more sense to specify the type and ask Spark to do type check.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review @cloud-fan , I could really use your input here. That's a good find. It may be that we want to explicitly support UDFs where a schema isn't available -- see below. But I agree I'd rather not. It gets kind of messy though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, but in any case the new version seems nicer :-) Both 2.11 and 2.12 will happily generate a typeTag for Any, though, so that wouldn't immediately explain it. To see what was actually inferred, you could compile with -Xprint:typer (ideally after a full compile and then just making this file recompile incrementally).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize, that's my mistake. In the end it isn't related to TypeTags for Any and that is not a difference. Thanks for your input, I think we are close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Happy to help with the 2.12 upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this workaround, but I think we should create an expression for this checkedCast. The current UDF doesn't work well with inputs of different types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't even actually work, now that I dig further. Using Number doesn't work either. There are a number of UDFs that fail now in MLlib unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to keep doing this. cc @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, messy. The constructor and class are public so felt it was worth erring on the side of compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention, everything under catalyst package is private, so compatibility is not a concern here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is safe to do. Maybe a == b is different from a.toString == b.toString.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel udf using Any as input type is rare but a valid use case. Sometimes they just want to accept any input type.
How about we create a few udfInternal methods that takes Any as inputs? e.g.
def udfInternal[R: TypeTag](f: Function1[Any, R]): UserDefinedFunction = {
val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[R]
val udf = UserDefinedFunction(f, dataType, Nil)
if (nullable) udf else udf.asNonNullable()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that the change I made here is not bulletproof. Unfortunately there are several more problems like this. Anywhere there's a UDF on Row it now fails, and workarounds are ugly.
I like your idea, let me work on that. Because the alternative I've been working on is driving me nuts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add this? UserDefinedFunction is public but its constructor is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just for MiMa, but I could also suppress the warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, can we combine it with the data types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping to minimize the change in the signature, but yeah it could be Option[Seq[(DataType, Boolean)]]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or Option[Seq[ScalaReflection.Schema]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to all your comments. I'm overhauling this whole PR and will force push with a rebase once it seems to basically work.
…nstead using info captured when UDF is registered -- capturing which types are nullable (i.e. not primitive)
|
Test build #95234 has finished for PR 22063 at commit
|
| |def udf[$typeTags](f: Function$x[$types]): UserDefinedFunction = { | ||
| | val ScalaReflection.Schema(dataType, nullable) = ScalaReflection.schemaFor[RT] | ||
| | val inputTypes = Try($inputTypes).toOption | ||
| | val inputTypes = Try($argSchema).toOption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan might be worth another look now. So, after making this change, I realize, maybe the whole reason it started failing was that I had moved the schema inference outside the Try(). Now it's back inside. Maybe that makes the whole problem go back to being silent. Did you mean you preferred tackling the problem directly and not suppressing the failure to infer a schema? I added udfInternal above for that.
But maybe this isn't the best approach as user UDFs could fail for the same reason. Maybe I need to back this whole thing out after all, now that I understand what's happening after your comments.
|
Test build #4288 has finished for PR 22063 at commit
|
|
Test build #4289 has finished for PR 22063 at commit
|
|
Test build #4293 has finished for PR 22063 at commit
|
|
See also #22259 |
…primitive args to Object in Scala 2.12 ## What changes were proposed in this pull request? Alternative take on #22063 that does not introduce udfInternal. Resolve issue with inferring func types in 2.12 by instead using info captured when UDF is registered -- capturing which types are nullable (i.e. not primitive) ## How was this patch tested? Existing tests. Closes #22259 from srowen/SPARK-25044.2. Authored-by: Sean Owen <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
| f: AnyRef, | ||
| dataType: DataType, | ||
| inputTypes: Option[Seq[DataType]]) { | ||
| inputTypes: Option[Seq[ScalaReflection.Schema]]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a stable API. Are we able to make this change in 2.4 instead of 3.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the constructor is protected[sql]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, it's a case class, so we would need to keep the def inputTypes(): Option[Seq[DataType]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW it was PR #22259 that was merged)
We can add back accessors, constructors, if it would make life easier for callers. But if this is protected, who are the callers of this code we're accommodating? maybe some hacky but important integration?
We'd have to rename inputTypes and then add back an accessor with the old type.
What changes were proposed in this pull request?
First attempt to resolve issue with inferring func types in 2.12 by instead using info captured when UDF is registered -- capturing which types are nullable (i.e. not primitive)
How was this patch tested?
Existing tests.