[SPARK-40538] [CONNECT] Improve built-in function support for Python client. #38270

Conversation
hvanhovell left a comment:
LGTM
Can you also improve the Scala side testing coverage, as you have updated the SparkConnectPlanner?
Can one of the admins verify this patch?
I'll figure out what more to add. Interestingly, the work here is already covered by your previous patch on the where clause, because we rely on the unresolved function call. But I'll add a negative test case.
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@amaliujia it would be great if you could have another look. I added additional tests on the planner side to iron out the issue with the name parts. Thanks.
```python
import unittest
import tempfile

import pandas
```
Hm, we gotta fix this or do something: pandas isn't a required library for the SQL package. We should probably skip these tests when pandas is not installed for now, until we have a clear way to handle this (see pyspark.testing.sqlutils.have_pandas and pyspark.sql.tests.test_arrow_map).
Interestingly, nothing in Spark Connect will work at the moment without pandas, because we always call toPandas when collecting the result. Let me know what you want to do.
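As a hedged sketch of the suggested guard, following the have_pandas pattern from pyspark.testing.sqlutils; the test class and test method below are hypothetical, not the actual suite:

```python
import unittest

# Probe for pandas once at import time, mirroring the
# pyspark.testing.sqlutils.have_pandas pattern.
try:
    import pandas  # noqa: F401

    have_pandas = True
except ImportError:
    have_pandas = False


# Skip the whole suite when pandas is missing, since Spark Connect
# currently calls toPandas when collecting results.
@unittest.skipIf(not have_pandas, "pandas is required to collect Spark Connect results")
class SparkConnectFunctionTests(unittest.TestCase):  # hypothetical test class
    def test_collect_requires_pandas(self) -> None:
        # A real test would build a Connect DataFrame and collect it here.
        self.assertTrue(have_pandas)


if __name__ == "__main__":
    unittest.main()
```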
```scala
 * @param args
 * @return Expression wrapping the unresolved function.
 */
def fun(nameParts: Seq[String], args: Seq[proto.Expression]): proto.Expression = {
```
I would name it `func`, as that's (much) more common in the codebase as far as I can tell.
This is just in the DSL package that is used for testing. I'll check what Catalyst is doing.
The Catalyst DSL has a similar `callFunction`.
For a bit more context: one thing is that we use snake_case to match SQL function names (see Column or functions.scala). This kind of naming rule is already mixed into our existing SQL DSL (see also org.apache.spark.sql.catalyst.package). We should probably pick one and stick to it.

In the past, we followed camelCase in the DSL, Column, and functions.scala. After that, we renamed them all to snake_case for SQL compatibility in Column and functions.scala (so the new DSL added at org.apache.spark.sql.catalyst.package follows snake_case).

Therefore, I tend to prefer snake_case in this DSL case too, but I don't object if others (or you) feel the other way is better.
Maybe name the two methods `function`?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (lines 368 to 369 in ff7ab34):

```scala
def function(exprs: Expression*): UnresolvedFunction =
  UnresolvedFunction(s, exprs, isDistinct = false)
```
I don't feel strongly about the naming.
Oh nice, I definitely missed renaming the other callFunction overload as well. TBH I'm not sure what the right approach is here, because the Catalyst DSL calls it `callFunction` instead of `call_function` 🤷
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L250-L256 — please let me know what you want to do.
```python
    return _


class ColumnRef(Expression):
```
I think we should rename this to AttributeReference to avoid confusion; e.g., it is easy to mix up with the Column interface, which is the user-facing interface. @amaliujia and @cloud-fan: it would be better to keep it matched with either Catalyst internal types or the user-facing Spark SQL interface classes.
Let's have a discussion about this, but it's unrelated to this change. I think we should probably rename Expression -> Column and ColumnRef -> AttributeReference, but it will require some more digging into what the right names should be. However, as said, that's independent of this change.
+1, I have been thinking about this ColumnRef thing. Let's revisit the naming, etc. in the future.
One comment: #38270 (comment). LGTM since we're still heavily developing, but we should probably revisit things like #38270 (comment) or #38270 (comment).
```scala
import org.apache.spark.sql.connect.dsl.expressions._
import org.apache.spark.sql.connect.dsl.plans._
```
nit: only import them once in test("UnresolvedFunction resolution.")?
Merged to master.
```scala
import org.apache.spark.sql.connect.dsl.plans._
  transform(connectTestRelation.select(callFunction(Seq("hex"), Seq("id".protoAttr))))
}
assert(validPlan.analyze != null)
```
It's better to compare it with the catalyst plan.
This is not to validate that the catalyst plan exists, but really just that existing functions are actually resolved. The `!= null` is mostly there to have some assertion; the real check is that analysis does not throw.
Late LGTM.
| """ | ||
|
|
||
| __gt__ = _bin_op(">") | ||
| __lt__ = _bin_op(">") |
_bin_op("<")
Yeah, I think this was a mistake.
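For context, here is a minimal, self-contained sketch of how a `_bin_op`-style helper wires a Python dunder method to a named Spark function, with the corrected `<` mapping. The class names and constructor shapes are illustrative stand-ins, not the actual Connect client API:

```python
from typing import Any, Callable, List


class Expression:
    """Simplified stand-in for the client-side expression base class."""


class UnresolvedFunction(Expression):
    """A function call that the server resolves purely by name."""

    def __init__(self, name: str, args: List[Expression]) -> None:
        self.name = name
        self.args = args


def _bin_op(name: str) -> Callable[["Column", Any], UnresolvedFunction]:
    """Build a method that maps a binary operator to a named function call."""

    def _(self: "Column", other: Any) -> UnresolvedFunction:
        # A real client would wrap plain Python values as literal expressions.
        return UnresolvedFunction(name, [self, other])

    return _


class Column(Expression):
    # Each dunder method delegates to the Spark function of the same name.
    __gt__ = _bin_op(">")
    __lt__ = _bin_op("<")  # the followup fix: this was mistakenly _bin_op(">")


expr = Column() < Column()
print(expr.name, len(expr.args))  # prints: < 2
```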
…onnect expression

What changes were proposed in this pull request?
This PR is a followup of #38270 that changes `__lt__` to use `<`, the less-than comparison.

Why are the changes needed?
So that the less-than comparison uses `<` properly.

Does this PR introduce _any_ user-facing change?
No, the original change is not released yet.

How was this patch tested?
A unit test was added.

Closes #38303 from HyukjinKwon/SPARK-40538-followup.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

This patch changes the way simple scalar built-in functions are resolved in the Python Spark Connect client. Previously, it was trying to manually load specific functions. With the changes in this patch, the trivial binary operators like `<`, `+`, ... are mapped to their name equivalents in Spark so that the dynamic function lookup works. In addition, it cleans up the Scala planner side to remove the now-unnecessary code translating the trivial binary expressions into their equivalent functions.
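To make the dynamic function lookup concrete, here is a toy sketch under stated assumptions: the registry and resolver below are illustrative stand-ins for Spark's function registry and analysis, not its actual API. The point is that operators are registered under their symbolic names, so the planner needs no dedicated operator translation table:

```python
from typing import Callable, Dict, List

# Toy registry standing in for Spark's function registry. Binary operators
# are registered under their symbolic names ("<", "+"), right next to
# ordinary built-ins like "hex".
REGISTRY: Dict[str, Callable[..., object]] = {
    "<": lambda a, b: a < b,
    "+": lambda a, b: a + b,
    "hex": lambda a: format(a, "x"),
}


def resolve_and_eval(name: str, args: List[object]) -> object:
    """Resolve a function purely by name, then apply it to the arguments."""
    try:
        fn = REGISTRY[name]
    except KeyError:
        # Mirrors an analysis-time "undefined function" error.
        raise ValueError(f"undefined function: {name}") from None
    return fn(*args)


print(resolve_and_eval("<", [1, 2]))    # True
print(resolve_and_eval("+", [1, 2]))    # 3
print(resolve_and_eval("hex", [255]))   # 'ff'
```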
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT, E2E

Closes #38270 from grundprinzip/spark-40538.
Authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>