[SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor signature #22732
Conversation
We don't want to change this constructor, I believe. It's preserved for backwards-compatibility. It should pass an empty Seq or Nil for the new param.
Just realized this is different from your original PR already; @cloud-fan did a follow-up PR adding the "nullableTypes". I'll revert the change here.
I think we should just remove this constructor. It's weird to keep backward compatibility for a private class, and I don't think it can work anymore. It's not OK to omit the nullable info.
Maybe give a quick example to clarify? like "For example, primitive types can't take on the value null, and need special handling."
Yes, it needs a little explanation here regarding the primitive types. Since those types are not nullable, a null value usually ends up being represented as the zero value (see the sketch below). I should add that to the javadoc here.
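A minimal sketch of that zero-value behavior (illustrative only, not code from this PR; assumes a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PrimitiveNullSketch extends App {
  val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
  import spark.implicits._

  // Int is a primitive and cannot hold null: if the null check is skipped,
  // a null input is decoded as the zero value, so the UDF sees 0, not null.
  val plusOne = udf((x: Int) => x + 1)
  Seq(Some(1), None).toDF("x").select(plusOne($"x")).show()
  // With proper null handling the None row yields null; without it, 1.

  spark.stop()
}
```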
Just checking my understanding -- string gets false because it's a reference type, and a double known to not be null also doesn't need null checking, so gets false. OK that sounds right.
Yes, as explained in #22732 (comment)
If not specified, assume nothing needs special null handling. OK, check.
I am actually not so sure about this part, but this is just to be consistent with the behavior in your previous check-in. Can you give an example of such end-user cases where these flags are unavailable/not specified?
I don't know of a case. I suppose that's fine default behavior, and matches what happened in the previous change.
I actually assume the default behavior should be the other way around. If we don't know, we just do the if-else null handling, and it wouldn't do us any harm correctness-wise, right? Anyway, I'm not going to change that in this PR, but I hope we can get an idea of if and when this default behavior will happen.
Passing !nullable instead of nullable, yep, looks right. The meaning is flipped here, correct.
How about `if (handleNullForInputs.exists(_))`? If I understand the meaning right, that could be simpler: if anything is true, you have to trigger the handling.
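A tiny self-contained sketch of the suggested check (note that a bare `exists(_)` doesn't eta-expand to a `Boolean => Boolean` predicate in Scala, so `exists(identity)` is the compilable spelling):

```scala
val handleNullForInputs = Seq(false, true, false)
// Trigger the null handling if any input is flagged.
if (handleNullForInputs.exists(identity)) {
  // ... wrap the flagged inputs in null checks ...
}
```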
Good point.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
The TODO above seems to apply to the line that was removed? I'm not totally clear. Is it OK to not check `!expr.isInstanceOf[KnownNotNull]` now? You know this better than I do. Hey, if tests pass, I think that proves it.
This should answer/confirm a couple of other questions above:
Since we already have this flag handleNullForInputs in ScalaUDF, we can take advantage of it in this rule as well. Say, the first time a ScalaUDF hits this rule with handleNullForInputs as "false, true, false", then we add a null-handling (if clause) for the second input which is flagged with "true", and from this point on we are all good with all inputs, and we can flag the new ScalaUDF's handleNullForInputs as all "false". So even if the new ScalaUDF hits this rule for a second time, nothing will be done.
It should work the same way for the "TODO" above, if handleNullForInputs has a "true" flag and the corresponding expression is NOT nullable, we can skip the null handling while flagging it as "false" in the new ScalaUDF in the end.
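A hedged sketch of that fixed-point idea (simplified and illustrative; not the actual Analyzer code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, If, IsNull, Literal}

// Wrap each input flagged `true` in `if (isnull(e)) null else e`, then
// return all-false flags so a second pass over the rewritten ScalaUDF
// finds nothing left to do.
def addNullChecks(
    inputs: Seq[Expression],
    handleNullForInputs: Seq[Boolean]): (Seq[Expression], Seq[Boolean]) = {
  val wrapped = inputs.zip(handleNullForInputs).map {
    case (e, true)  => If(IsNull(e), Literal.create(null, e.dataType), e)
    case (e, false) => e
  }
  (wrapped, Seq.fill(inputs.size)(false))
}
```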
Test build #97409 has finished for PR 22732 at commit
Maybe I missed something but:
- Why don't we just merge `handleNullForInputs` and `inputTypes`?
- Why is `handleNullForInputs` required whereas `inputTypes`'s default is `Nil`?
Adding handleNullForInputs doesn't look like it reduces confusion much to me. Since this PR mainly targets refactoring to reduce confusion and improve ease of use, this concern should be addressed.
It could be merged. I guess the preference is to minimize the change required, so we just have a new field.
That said, UDFs now really require nullability information to work correctly in Scala 2.12. This is the reason the new field is required and not optional. It already caused a new test to fail, so I'm more persuaded that it's OK to make a less backwards-compatible change to make it clear to any future callers that new info is needed. It's an internal class, so reasonable to do so.
Among all other reasons, one argument for not merging the flags with the types is #22732 (comment).
As to your second question, it would be for same reason as above, plus #22259 (comment).
how about acceptNullInputs?
Makes sense. Since I'm also using this flag to replace KnownNotNull, I think inputsNullSafe would be a better name?
```diff
- Seq(false, false))
+ handleNullForInputs = Seq(false, false))
```
Don't think we need to do that for required arguments, or do we?
We need to be more conservative here. The default behavior needs to handle NULLs. We should fix it.
```scala
// Need to handle NULL when the fields can't accept NULL.
val handleNullForInputs = nullableTypes.map(_.map(!_)).getOrElse(Seq.fill(exprs.size)(true))
```
@srowen @cloud-fan This sounds like a new behavior added in the Spark 2.4 release by PR #22259, right?
It's new in the sense that Scala 2.12 forced us to change where we get type information for UDFs, specifically whether a type is allowed to be null. Before, we could infer it directly from the closure, but not in 2.12. This change is supposed to just get the info from elsewhere -- from TypeTags when the UDFs are declared. The change ought to be invisible to users and, if done correctly, should not result in a behavior change. This follow-up is more about being conservative and lessening the chance of a mistake (see the test case maryannxue found), whether internally or in a library that uses ScalaUDF directly.
For that reason, yeah I agree with your comment here. Default to assuming null checks are needed.
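For reference, a small sketch of where that nullability now comes from (illustrative; `schemaFor` is the reflection entry point quoted later in this thread):

```scala
import scala.reflect.runtime.universe.typeTag

import org.apache.spark.sql.catalyst.ScalaReflection

// Nullability is read from the declared TypeTags rather than inferred
// from the closure (which Scala 2.12 lambdas no longer allow).
ScalaReflection.schemaFor(typeTag[Int]).nullable     // false: primitive
ScalaReflection.schemaFor(typeTag[String]).nullable  // true: reference type
```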
Looks like the only place where we'd get a not-specified `inputSchemas` is when `ScalaReflection.schemaFor` doesn't recognize a type and throws an exception (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala, line 786 in 1fd59c1):

```scala
throw new UnsupportedOperationException(s"Schema for type $other is not supported")
```

```scala
val inputSchemas = Try(ScalaReflection.schemaFor(typeTag[A1]) :: ScalaReflection.schemaFor(typeTag[A2]) :: Nil).toOption
val udf = SparkUserDefinedFunction.create(f, dataType, inputSchemas)
```
It would mean that if the type of only one of the parameters is unrecognizable by ScalaReflection, we'd end up having the entire Seq as None. I think it's fine not to null-check user-defined types we don't know, since they can't be primitive types anyway, but I do think we should make the type inference of each parameter independent so we do handle the nulls that need to be taken care of (see the sketch below).
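A hedged sketch of that per-parameter independence, using concrete types in place of the builder's type parameters (illustrative only, not this PR's actual code):

```scala
import scala.util.Try
import scala.reflect.runtime.universe.typeTag

import org.apache.spark.sql.catalyst.ScalaReflection

// Wrap each parameter's reflection in its own Try, so one unsupported
// type (e.g. Any) no longer discards the schema info of the others.
val inputSchemas: Seq[Option[ScalaReflection.Schema]] =
  Try(ScalaReflection.schemaFor(typeTag[Int])).toOption ::
  Try(ScalaReflection.schemaFor(typeTag[Any])).toOption :: Nil
// The Any slot is None, while the Int slot still carries
// Schema(IntegerType, nullable = false).
```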
Yes, that's right. There are a number of UDFs in MLlib, etc. that have inputs of type "Any", which isn't great, but I wanted to work around them rather than change them for now.
In addition to what I just pointed out (getting an exception from ScalaReflection.schemaFor for unrecognized types when trying to obtain inputSchemas), there's another case where we could get an unspecified nullableTypes: when UserDefinedFunction is instantiated by calling the constructor rather than the create method.
Then I assume it's created by an earlier version, and we should use the old logic, i.e., ScalaReflection.getParameterTypes (https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2L2153) to get the correct information for nullableTypes. Is that right, @cloud-fan @srowen ?
Hm, but we can't use getParameterTypes anymore. It won't work in Scala 2.12. Where the nullability info is definitely not available, be conservative and assume it all needs null handling?
If we can't get the type information at compile time (e.g. (a: Any) => xxx), I don't think ScalaReflection.getParameterTypes can get something useful at runtime. For this case we won't add a null check anyway, so I think there is no behavior change.
Test build #97464 has finished for PR 22732 at commit
inputsNullSafe SGTM, let's also rename UserDefinedFunction.nullableTypes
There is an argument about whether #22259 introduced behavior changes. Here is my analysis. Before #22259, the type and null check was done as ...
After #22259, the type and null check is done as ...
So we may have a behavior change if users register UDFs in a weird way, e.g. they define the UDF in a form that hides the real types. I'd say this is an invalid use case, because the data type check is also lost. I think we don't need to treat it as a behavior change.
After more thought, there is one case we don't handle well. For UDFs like ...
That's a good point, but that was already an issue, right? It isn't introduced by this change, at least?
@srowen What @cloud-fan described is a change introduced in #22259. We can fix it by keeping each call to ...
At this point I think you both know more than I do about this, so go ahead. To the limits of my understanding it sounds reasonable.
this constructor is here for backward compatibility. If we have to change it, I don't think we need to keep it anymore. cc @gatorsmile
We should not break it.
We'll take the Scala 2.11 approach for places where the nullableTypes info is unavailable, so as to at least keep legacy usage of ScalaUDF working.
I like the idea. We should not break our end users if they just use Scala 2.11. In Spark 3.0, we can be more aggressive and break the binary compatibility.
SGTM
Test build #97508 has finished for PR 22732 at commit
since we are here, shall we also check exprs.length == numOfArgs?
Actually I think the "assert" would better live in ScalaUDF, but then the behavior would be slightly different from the current behavior. I'm keeping other things as they are.
nit: `udf((x: Long, y: Any) => x * 2 + (if (y == null) 1 else 0))`
nit: upper case for keywords
docs/sql-programming-guide.md
Is this targeted for backport to 2.4?
yes, this fixes a bug introduced by a commit in 2.4
Shall we explicitly return a seq of true if it's not Scala 2.11? Then the behavior is predictable rather than possibly inaccurate.
I think it simply returns if going through the code path below. I should probably make the javadoc clearer.
The argument here is that it's not necessarily wrong when using Scala 2.12: if all inputs are of boxed types, it can still be fine. I think it's enough to say "we don't support it; switch to the new interface, otherwise we can't guarantee correctness."
Force-pushed from 7a6b2e1 to 84cb456.
Test build #97578 has finished for PR 22732 at commit
Test build #97574 has finished for PR 22732 at commit
Test build #97573 has finished for PR 22732 at commit
For Scala 2.11, we should not introduce any behavior change and should also keep binary and source compatibility. For Scala 2.12, since we are unable to know the type nullability in a few APIs, we issue a warning message in these cases. Below is the kind of example that would generate a wrong answer due to the primitive data types (see the sketch that follows). In Spark 3.0, we should have a follow-up PR to block these APIs. In Spark 2.4, the support of Scala 2.12 is beta.
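A hedged reconstruction of such an example, based on the one in the PR description below (assumes an active SparkSession named `spark`):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// With a primitive Int parameter and no nullability info available,
// the null check is lost and a null input decodes to the zero value.
val f = udf((x: Int, y: Any) => x)
Seq((Some(1), "a"), (None, "b")).toDF("x", "y")
  .select(f($"x", $"y"))
  .show() // the None row yields 0 instead of null
```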
Test build #97587 has finished for PR 22732 at commit
Test build #97581 has finished for PR 22732 at commit
retest this please

LGTM pending Jenkins

Test build #97601 has finished for PR 22732 at commit

thanks, merging to master/2.4!
Closes #22732 from maryannxue/spark-25044-followup.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e816776)
## What changes were proposed in this pull request?

In apache#22732, we tried our best to keep the behavior of Scala UDF unchanged in Spark 2.4. However, since Spark 3.0, Scala 2.12 is the default. The trick that was used to keep the behavior unchanged doesn't work with Scala 2.12. This PR proposes to remove the Scala 2.11 hack, as it's not useful.

## How was this patch tested?

Existing tests.

Closes apache#23498 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should indeed be required, otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

This PR also restores the original behavior for null-handling of primitive-type input parameters, which was broken by #22259. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259.

In addition, now that we have this extra field indicating if a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied infinitely.

How was this patch tested?
Added UT in UDFSuite
Passed affected existing UTs:
AnalysisSuite
UDFSuite
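A hedged sketch of the kind of regression test the description refers to (illustrative, not the actual UDFSuite code; assumes Spark's SQL test scaffolding with `testImplicits` and `checkAnswer`):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Inside a suite extending QueryTest with SharedSQLContext:
test("SPARK-25044 Scala UDF with primitive input keeps nulls null") {
  import testImplicits._
  val f = udf((x: Int, y: Any) => x)
  val df = Seq((Some(1), "str"), (None, "str")).toDF("a", "b")
  // The null row must stay null rather than decode to the primitive zero.
  checkAnswer(df.select(f($"a", $"b")), Seq(Row(1), Row(null)))
}
```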