[SPARK-28323][SQL][Python] PythonUDF should be able to use in join condition #25091

viirya · 2019-07-10T04:19:26Z

What changes were proposed in this pull request?

ExtractPythonUDFs has a bug that produces the wrong result attributes. It causes a failure when the PythonUDFs in a plan are evaluated across multiple child plans, as in a join. For example, using PythonUDFs in a join condition fails:

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> left = spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, a1=2, a2=2)])
>>> right = spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=1, b1=3, b2=1)])
>>> f = udf(lambda a: a, IntegerType())
>>> df = left.join(right, [f("a") == f("b"), left.a1 == right.b1])
>>> df.collect()
19/07/10 12:20:49 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt(rows.scala:36)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt$(rows.scala:36)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.isNullAt(rows.scala:195)
        at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
        ...
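
For reference, the rows this join should return once it no longer throws can be worked out by hand. The sketch below is plain Python, not Spark: it emulates the same join condition over the same six values with a nested loop, and the `LeftRow`/`RightRow` types are hypothetical stand-ins for the two DataFrames.

```python
from collections import namedtuple

# Hypothetical stand-ins for the two DataFrames in the example above.
LeftRow = namedtuple("LeftRow", ["a", "a1", "a2"])
RightRow = namedtuple("RightRow", ["b", "b1", "b2"])

left = [LeftRow(1, 1, 1), LeftRow(2, 2, 2)]
right = [RightRow(1, 1, 1), RightRow(1, 3, 1)]


def f(x):
    """Identity function, mirroring udf(lambda a: a, IntegerType())."""
    return x


# Inner join on: f(a) == f(b) AND a1 == b1
joined = [
    lrow + rrow
    for lrow in left
    for rrow in right
    if f(lrow.a) == f(rrow.b) and lrow.a1 == rrow.b1
]

print(joined)  # [(1, 1, 1, 1, 1, 1)]
```

Only the pair with `a=1, a1=1` and `b=1, b1=1` satisfies both conditions, so a correct plan should yield a single joined row.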

How was this patch tested?

Added test.

viirya · 2019-07-10T04:23:40Z

cc @HyukjinKwon @BryanCutler

HyukjinKwon · 2019-07-10T05:00:21Z

Cool @viirya! I will take a closer look within 2 days and get this in

viirya · 2019-07-10T05:03:07Z

Thanks! @HyukjinKwon

SparkQA · 2019-07-10T07:05:02Z

Test build #107433 has finished for PR 25091 at commit 95231a6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-07-10T07:36:57Z

retest this please.

sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

SparkQA · 2019-07-10T10:47:49Z

Test build #107445 has finished for PR 25091 at commit 95231a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-10T11:45:32Z

Test build #107452 has finished for PR 25091 at commit 0c24787.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-10T15:45:14Z

Test build #107458 has finished for PR 25091 at commit fbf8be9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

LGTM

BryanCutler · 2019-07-10T20:30:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

            "Can only extract scalar vectorized udf or sql batch udf")

-          val resultAttrs = udfs.zipWithIndex.map { case (u, i) =>
+          val resultAttrs = validUdfs.zipWithIndex.map { case (u, i) =>


oof, nice catch!
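
Why this one-line change matters can be illustrated with a small model. The following is a simplified sketch in plain Python, not Spark's implementation: it assumes each UDF reads a single column, and the names `plan_child`, `result_attr_names`, and `ALL_UDFS` are hypothetical. It compares the number of output columns a child plan declares with the number of values its rows actually carry, before and after the change.

```python
# Simplified model, not Spark code: each "UDF" records the one column it reads.
ALL_UDFS = [{"name": "f(a)", "input": "a"}, {"name": "f(b)", "input": "b"}]


def result_attr_names(udfs):
    """Mimic zipWithIndex: one pythonUDF<i> attribute per UDF in the list."""
    return [f"pythonUDF{i}" for i, _ in enumerate(udfs)]


def plan_child(child_columns, all_udfs, fixed):
    """Return (declared output columns, actual row width) for one join child."""
    # Only UDFs whose input comes from this child can be evaluated here.
    valid_udfs = [u for u in all_udfs if u["input"] in child_columns]
    # Before the fix, attributes were built from all UDFs; after, from the valid ones.
    source = valid_udfs if fixed else all_udfs
    declared = child_columns + result_attr_names(source)
    # The evaluation step appends one value per UDF it actually runs.
    row_width = len(child_columns) + len(valid_udfs)
    return declared, row_width


left_columns = ["a", "a1", "a2"]
buggy_declared, buggy_width = plan_child(left_columns, ALL_UDFS, fixed=False)
fixed_declared, fixed_width = plan_child(left_columns, ALL_UDFS, fixed=True)

print(len(buggy_declared), buggy_width)  # 5 4
print(len(fixed_declared), fixed_width)  # 4 4
```

In this model the pre-fix plan declares five columns for the left child while its rows hold only four values, which is consistent with the out-of-bounds read in the stack trace; building the attributes from the valid UDFs keeps the two counts equal.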

BryanCutler · 2019-07-10T23:35:30Z

merged to master, thanks @viirya !

BryanCutler · 2019-07-10T23:37:06Z

jira seems down, so I wasn't able to resolve the issue. will try later

HyukjinKwon · 2019-07-11T00:25:53Z

Yea, LGTM too!

viirya · 2019-07-11T00:46:56Z

Thanks! @HyukjinKwon @BryanCutler

PythonUDF should be able to use in join condition.

95231a6

viirya mentioned this pull request Jul 10, 2019

[SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF test base #25090

Closed

Add scala test for query plan.

0c24787

HyukjinKwon reviewed Jul 10, 2019

sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

Add assume.

fbf8be9

dongjoon-hyun added PYSPARK SQL labels Jul 10, 2019

BryanCutler approved these changes Jul 10, 2019

BryanCutler closed this in 7858e53 Jul 10, 2019

viirya deleted the SPARK-28323 branch December 27, 2023 18:36