Conversation

@grundprinzip grundprinzip commented Oct 15, 2022

What changes were proposed in this pull request?

This patch changes the way simple scalar built-in functions are resolved in the Python Spark Connect client. Previously, it was trying to manually load specific functions. With the changes in this patch, the trivial binary operators like <, +, ... are mapped to their name equivalents in Spark so that the dynamic function lookup works.

In addition, it cleans up the Scala planner side to remove the now unnecessary code translating the trivial binary expressions into their equivalent functions.
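
For illustration, the mapping described above can be sketched roughly as follows. This is a simplified, hypothetical sketch, not the actual client code: each Python dunder operator emits an unresolved-function node carrying the operator's name, so the server resolves `<`, `+`, ... through the same dynamic function lookup as any other built-in.

```python
class Expression:
    """Minimal stand-in for the Connect client's expression node."""

    def __init__(self, name: str, *args: "Expression") -> None:
        self.name = name        # function name shipped to the server
        self.args = list(args)  # child expressions


def _bin_op(name: str):
    """Build a dunder method that maps a binary operator to `name`."""

    def op(self: "Expression", other: "Expression") -> "Expression":
        return Expression(name, self, other)

    return op


# Map the trivial binary operators to their Spark function names.
Expression.__lt__ = _bin_op("<")
Expression.__gt__ = _bin_op(">")
Expression.__add__ = _bin_op("+")

a, b = Expression("a"), Expression("b")
assert (a < b).name == "<"
assert (a + b).name == "+"
```

With this shape, adding support for another operator is a one-line mapping instead of a dedicated translation rule on the planner side.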

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT, E2E

@grundprinzip grundprinzip marked this pull request as ready for review October 15, 2022 15:52

@hvanhovell hvanhovell left a comment

LGTM

@amaliujia commented Oct 15, 2022

Can you also improve the Scala side testing coverage as you have updated the SparkConnectPlanner?

@AmplabJenkins

Can one of the admins verify this patch?

@grundprinzip

> Can you also improve the Scala side testing coverage as you have updated the SparkConnectPlanner?

I'll figure out how to add more. Interestingly, the work here is already covered by your previous patch on the where clause, because we rely on the unresolved function call. But I'll add a negative test case.

@grundprinzip

@amaliujia it would be great if you could have another look. I added additional tests on the planner side to iron out the issue with the name parts.

Thanks

import unittest
import tempfile

import pandas

@HyukjinKwon commented Oct 16, 2022


Hm .. we gotta fix this or do something. pandas isn't a required library for the SQL package. We should probably skip these tests when pandas is not installed for now, until we have a clear way to handle this (see pyspark.testing.sqlutils.have_pandas and pyspark.sql.tests.test_arrow_map).

Contributor Author:

Interestingly, nothing in Spark Connect will work at the moment without pandas, because we always call toPandas when collecting the result. Let me know what you want to do.
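
The skip HyukjinKwon suggested could be sketched like this (a minimal sketch; the test-class and test names are illustrative, and pyspark.testing.sqlutils.have_pandas plays the role of the local flag in the real suite):

```python
import unittest

# Detect pandas the same way pyspark.testing.sqlutils.have_pandas does.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False


@unittest.skipIf(not have_pandas, "pandas is required for Spark Connect tests")
class SparkConnectTests(unittest.TestCase):  # hypothetical test class
    def test_collect(self) -> None:
        # The real test would exercise toPandas-based result collection here.
        self.assertTrue(have_pandas)
```

The class-level skipIf keeps the whole suite green on SQL-only installs while the pandas dependency question is settled.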

* @param args
* @return Expression wrapping the unresolved function.
*/
def fun(nameParts: Seq[String], args: Seq[proto.Expression]): proto.Expression = {
Member:

I would name it func as that's (much) more common in the codebase as far as I can tell.

Contributor Author:

This is just in the DSL package that is used for testing. I'll check what Catalyst does.

Contributor Author:

The Catalyst DSL has a similar callFunction.

@HyukjinKwon commented Oct 17, 2022

For a bit more context: we use snake_case to match SQL function names (see Column or functions.scala). This kind of naming rule is already mixed into our existing SQL DSL (see also org.apache.spark.sql.catalyst.package). We should probably pick one and stick to it.

In the past, we followed camelCase in the DSL, Column, and functions.scala. We then renamed them all to snake_case for SQL compatibility in Column and functions.scala (so the newer DSL in org.apache.spark.sql.catalyst.package follows snake_case).

Therefore, I tend to use snake_case in this DSL too, but I don't object if you (or others) feel the current naming is better.

Contributor:

maybe name the two methods function?

def function(exprs: Expression*): UnresolvedFunction =
  UnresolvedFunction(s, exprs, isDistinct = false)

I don't feel strongly about the naming, though.

Contributor Author:

Oh nice, I definitely missed renaming the other callFunction overload as well. TBH I'm not sure what the right approach is here, because the Catalyst DSL calls it callFunction instead of call_function 🤷


return _


class ColumnRef(Expression):
Member:

I think we should rename this to AttributeReference to avoid confusion; e.g., it's a bit mixed up with the user-facing Column interface. @amaliujia and @cloud-fan

It would be better to keep it matched with either the Catalyst internal types or the user-facing Spark SQL interface classes.

Contributor Author:

Let's have a discussion about this, but it's unrelated to this change. I think we should probably rename Expression -> Column and ColumnRef -> AttributeReference, but it will require some more digging into what the right names should be. However, as said, that's independent of this change.

@amaliujia commented Oct 17, 2022

+1. I have been thinking about this ColumnRef thing too. Let's revisit the naming, etc. in the future.

@HyukjinKwon left a comment

One comment: #38270 (comment). LGTM since we're still heavily developing, but we should probably revisit things like #38270 (comment) or #38270 (comment).

Comment on lines +54 to +55
import org.apache.spark.sql.connect.dsl.expressions._
import org.apache.spark.sql.connect.dsl.plans._
Contributor:

nit: only import them once in test("UnresolvedFunction resolution.")?

@HyukjinKwon

Merged to master.

import org.apache.spark.sql.connect.dsl.plans._
transform(connectTestRelation.select(callFunction(Seq("hex"), Seq("id".protoAttr))))
}
assert(validPlan.analyze != null)
Contributor:

It's better to compare it with the Catalyst plan.

Contributor Author:

This is not there to validate the Catalyst plan itself, but really just that existing functions are actually resolved. The != null is mostly there to have an assertion at all; the real check is that the transform does not throw.

@cloud-fan

late LGTM

"""

__gt__ = _bin_op(">")
__lt__ = _bin_op(">")
Contributor:

_bin_op("<")

Member:

Yeah, I think this was a mistake.


HyukjinKwon added a commit that referenced this pull request Oct 19, 2022
…onnect expression

### What changes were proposed in this pull request?

This PR is a followup of #38270 that changes `__lt__` to use `<`, the less-than comparison.

### Why are the changes needed?

So that the less-than comparison uses `<` properly.

### Does this PR introduce _any_ user-facing change?

No, the original change is not released yet.

### How was this patch tested?

Unit test was added.

Closes #38303 from HyukjinKwon/SPARK-40538-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…lient

Closes apache#38270 from grundprinzip/spark-40538.

Authored-by: Martin Grund <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…onnect expression


Closes apache#38303 from HyukjinKwon/SPARK-40538-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

7 participants