Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

In #22732, we tried our best to keep the behavior of Scala UDFs unchanged in Spark 2.4.

However, since Spark 3.0, Scala 2.12 is the default. The trick that was used to keep the behavior unchanged doesn't work with Scala 2.12.

This PR proposes to remove the Scala 2.11 hack, as it's no longer useful.

How was this patch tested?

Existing tests.


- Since Spark 3.0, the JSON datasource and the JSON function `schema_of_json` infer TimestampType from string values if they match the pattern defined by the JSON option `timestampFormat`. Set the JSON option `inferTimestamp` to `false` to disable such type inference.
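  A minimal sketch of this behavior, assuming a spark-shell-style `spark` session and a made-up sample value:

  ```scala
  import spark.implicits._

  val data = Seq("""{"t": "2019-01-09T12:00:00.000Z"}""").toDS()

  // Spark 3.0 default: "t" is inferred as timestamp when the value matches
  // the pattern given by the timestampFormat option.
  spark.read.json(data).printSchema()

  // Disabling inferTimestamp keeps "t" as a string, as in Spark 2.4.
  spark.read.option("inferTimestamp", "false").json(data).printSchema()
  ```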

- In Spark version 2.4 and earlier, if `org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with a primitive-type argument, the returned UDF returns null if the input value is null. Since Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, given `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and earlier if column `x` is null, and 0 in Spark 3.0.
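  A runnable sketch of that example, again assuming a spark-shell-style `spark` session (the sample data is made up):

  ```scala
  import org.apache.spark.sql.functions.udf
  import org.apache.spark.sql.types.IntegerType
  import spark.implicits._

  val f = udf((x: Int) => x, IntegerType)
  val df = Seq[Integer](1, null).toDF("x")

  // Spark 2.4 (Scala 2.11): the null row yields null.
  // Spark 3.0 (Scala 2.12): the null row yields 0, the Java default for int.
  df.select(f($"x")).show()
  ```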
Contributor Author

This migration guide entry should have been added when we switched to Scala 2.12.

Member

Should this say it is because of Scala 2.12?

@cloud-fan
Contributor Author

@HyukjinKwon
Member

Yea, I agree with this change.

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100962 has finished for PR 23498 at commit d66a9c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2019

Test build #101014 has finished for PR 23498 at commit e0053af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 10, 2019

Test build #101015 has finished for PR 23498 at commit e0053af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2019

Test build #4501 has finished for PR 23498 at commit e0053af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 11, 2019

Test build #101046 has finished for PR 23498 at commit e0053af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master!

cloud-fan closed this in 1f1d98c on Jan 11, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23498 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 21, 2020
…F by default

### What changes were proposed in this pull request?

This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

Users can still opt in to the old behavior by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?

According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, given `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.
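A sketch of the alternative being encouraged here, assuming the typed overload's null handling for primitive arguments and the legacy flag named above:

```scala
import org.apache.spark.sql.functions.udf

// Typed UDF: Spark infers the input type from the closure, so a null
// primitive input produces a null result rather than silently becoming 0.
val typed = udf((x: Int) => x)

// The untyped overload udf(AnyRef, DataType) now throws by default;
// it can still be enabled explicitly:
//   spark.conf.set("spark.sql.legacy.useUnTypedUdf.enabled", "true")
```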

### Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using an untyped UDF.

### How was this patch tested?

Added a test and updated some existing tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <[email protected]>
Co-authored-by: wuyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 21, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020