[SPARK-40538] [CONNECT] Improve built-in function support for Python client. #38270

Conversation
hvanhovell left a comment:
LGTM
Can you also improve the Scala side testing coverage, as you have updated the SparkConnectPlanner?
Can one of the admins verify this patch?
I'll figure out what more to add. Interestingly, the work here is already covered by your previous patch on the where clause, because we rely on the unresolved function call. But I'll add a negative test case.
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@amaliujia it would be great if you could have another look. I added additional tests on the planner side to iron out the issue with the name parts. Thanks.
```python
import unittest
import tempfile

import pandas
```
Hm, we gotta fix this or do something: pandas isn't a required library for the SQL package. We should probably skip these tests when pandas is not installed for now, until we have a clear way to handle this (see pyspark.testing.sqlutils.have_pandas and pyspark.sql.tests.test_arrow_map).
Interestingly, nothing in Spark Connect will work at the moment without pandas, because we always call toPandas when collecting the result. Let me know what you want to do.
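As a hedged sketch of the suggested guard, following the have_pandas pattern from pyspark.testing.sqlutils; the test class and test method below are hypothetical, not the actual suite:

```python
import unittest

# Probe for pandas once at import time, mirroring the
# pyspark.testing.sqlutils.have_pandas pattern.
try:
    import pandas  # noqa: F401

    have_pandas = True
except ImportError:
    have_pandas = False


# Skip the whole suite when pandas is missing, since Spark Connect
# currently calls toPandas when collecting results.
@unittest.skipIf(not have_pandas, "pandas is required to collect Spark Connect results")
class SparkConnectFunctionTests(unittest.TestCase):  # hypothetical test class
    def test_collect_requires_pandas(self) -> None:
        # A real test would build a Connect DataFrame and collect it here.
        self.assertTrue(have_pandas)


if __name__ == "__main__":
    unittest.main()
```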
```scala
 * @param args
 * @return Expression wrapping the unresolved function.
 */
def fun(nameParts: Seq[String], args: Seq[proto.Expression]): proto.Expression = {
```
I would name it `func`, as that's (much) more common in the codebase as far as I can tell.
This is just in the DSL package that is used for testing. I'll check what Catalyst is doing.
The Catalyst DSL has a similar `callFunction`.
For a bit more context: one thing is that we use snake_case to match SQL function names (see Column or functions.scala). This kind of naming rule is already mixed into our existing SQL DSL (see also org.apache.spark.sql.catalyst.package). We should probably pick one and stick to it.

In the past, we followed camelCase in the DSL, Column, and functions.scala. After that, we renamed them all to snake_case for SQL compatibility in Column and functions.scala (so the new DSL added at org.apache.spark.sql.catalyst.package follows snake_case).

Therefore, I tend to prefer snake_case in this DSL case too, but I don't object if others (or you) feel the other way is better.
Maybe name the two methods `function`?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (lines 368 to 369 in ff7ab34):

```scala
def function(exprs: Expression*): UnresolvedFunction =
  UnresolvedFunction(s, exprs, isDistinct = false)
```
I don't feel strongly about the naming.
Oh nice, I definitely missed renaming the other callFunction overload as well. TBH I'm not sure what the right approach is here, because the Catalyst DSL calls it `callFunction` instead of `call_function` 🤷
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L250-L256 — please let me know what you want to do.
```python
    return _


class ColumnRef(Expression):
```
I think we should rename this to AttributeReference to avoid confusion; e.g., it is easy to mix up with the Column interface, which is the user-facing interface. @amaliujia and @cloud-fan: it would be better to keep it matched with either Catalyst internal types or the user-facing Spark SQL interface classes.
Let's have a discussion about this, but it's unrelated to this change. I think we should probably rename Expression -> Column and ColumnRef -> AttributeReference, but it will require some more digging into what the right names should be. However, as said, that's independent of this change.
+1, I have been thinking about this ColumnRef thing. Let's revisit the naming, etc. in the future.
One comment: #38270 (comment). LGTM since we're still heavily developing, but we should probably revisit things like #38270 (comment) or #38270 (comment).
```scala
import org.apache.spark.sql.connect.dsl.expressions._
import org.apache.spark.sql.connect.dsl.plans._
```
nit: only import them once in test("UnresolvedFunction resolution.")?
Merged to master.
```scala
import org.apache.spark.sql.connect.dsl.plans._
  transform(connectTestRelation.select(callFunction(Seq("hex"), Seq("id".protoAttr))))
}
assert(validPlan.analyze != null)
```
It's better to compare it with the catalyst plan.
This is not to validate that the catalyst plan exists, but really just that existing functions are actually resolved. The `!= null` is mostly there to have some assertion; the real check is that analysis does not throw.
Late LGTM.
| """ | ||
|
|
||
| __gt__ = _bin_op(">") | ||
| __lt__ = _bin_op(">") |
_bin_op("<")
Yeah, I think this was a mistake.
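For context, here is a minimal, self-contained sketch of how a `_bin_op`-style helper wires a Python dunder method to a named Spark function, with the corrected `<` mapping. The class names and constructor shapes are illustrative stand-ins, not the actual Connect client API:

```python
from typing import Any, Callable, List


class Expression:
    """Simplified stand-in for the client-side expression base class."""


class UnresolvedFunction(Expression):
    """A function call that the server resolves purely by name."""

    def __init__(self, name: str, args: List[Expression]) -> None:
        self.name = name
        self.args = args


def _bin_op(name: str) -> Callable[["Column", Any], UnresolvedFunction]:
    """Build a method that maps a binary operator to a named function call."""

    def _(self: "Column", other: Any) -> UnresolvedFunction:
        # A real client would wrap plain Python values as literal expressions.
        return UnresolvedFunction(name, [self, other])

    return _


class Column(Expression):
    # Each dunder method delegates to the Spark function of the same name.
    __gt__ = _bin_op(">")
    __lt__ = _bin_op("<")  # the followup fix: this was mistakenly _bin_op(">")


expr = Column() < Column()
print(expr.name, len(expr.args))  # prints: < 2
```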
…onnect expression

What changes were proposed in this pull request?
This PR is a followup of #38270 that changes `__lt__` to use `<`, the less-than comparison.

Why are the changes needed?
So that the less-than comparison uses `<` properly.

Does this PR introduce _any_ user-facing change?
No, the original change is not released yet.

How was this patch tested?
A unit test was added.

Closes #38303 from HyukjinKwon/SPARK-40538-followup.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

This patch changes the way simple scalar built-in functions are resolved in the Python Spark Connect client. Previously, it was trying to manually load specific functions. With the changes in this patch, the trivial binary operators like `<`, `+`, ... are mapped to their name equivalents in Spark so that the dynamic function lookup works. In addition, it cleans up the Scala planner side to remove the now-unnecessary code translating the trivial binary expressions into their equivalent functions.
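To make the dynamic function lookup concrete, here is a toy sketch under stated assumptions: the registry and resolver below are illustrative stand-ins for Spark's function registry and analysis, not its actual API. The point is that operators are registered under their symbolic names, so the planner needs no dedicated operator translation table:

```python
from typing import Callable, Dict, List

# Toy registry standing in for Spark's function registry. Binary operators
# are registered under their symbolic names ("<", "+"), right next to
# ordinary built-ins like "hex".
REGISTRY: Dict[str, Callable[..., object]] = {
    "<": lambda a, b: a < b,
    "+": lambda a, b: a + b,
    "hex": lambda a: format(a, "x"),
}


def resolve_and_eval(name: str, args: List[object]) -> object:
    """Resolve a function purely by name, then apply it to the arguments."""
    try:
        fn = REGISTRY[name]
    except KeyError:
        # Mirrors an analysis-time "undefined function" error.
        raise ValueError(f"undefined function: {name}") from None
    return fn(*args)


print(resolve_and_eval("<", [1, 2]))    # True
print(resolve_and_eval("+", [1, 2]))    # 3
print(resolve_and_eval("hex", [255]))   # 'ff'
```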
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT, E2E

Closes #38270 from grundprinzip/spark-40538.
Authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>