-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40879][CONNECT] Support Join UsingColumns in proto #38345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
R: @cloud-fan |
|
Can one of the admins verify this patch? |
| Relation right = 2; | ||
| Expression join_condition = 3; | ||
| JoinType join_type = 4; | ||
| repeated string using_columns = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can join_condition and using_columns co-exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the current Catalyst implementation. My read is yes.
My understanding is current Catalyst implementation does not support JOIN USING col_name,... in the join_condition. It is a separate code path. Is it true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the API layer (DF and SQL), I don't think they can co-exist. E.g. the parser rule is
joinCriteria
: ON booleanExpression
| USING identifierList
;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. To match with our API (but not implementation), I added a check to make sure only one of these two will be set.
| JoinType join_type = 4; | ||
| // Optional. using_columns provides a list of columns that should present on both sides of | ||
| // the join inputs that this Join will join on. For example A JOIN B USING col_name is | ||
| // equivalent to A JOIN B on A.col_name = B.col_name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also mention that this can't co-exist with join_condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| def join( | ||
| otherPlan: proto.Relation, | ||
| joinType: JoinType = JoinType.JOIN_TYPE_INNER, | ||
| usingColumns: Seq[String] = Seq(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add an overload of def join which only accepts usingColumns, no condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like only one overloaded version can have default arguments:
[ERROR] /Users/rui.wang/Documents/open-source-spark/spark/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:176: in class DslLogicalPlan, multiple overloaded alternatives of method join define default arguments.
I ended up with adding a few more overloaded version of def join.
|
thanks, merging to master! |
### What changes were proposed in this pull request? I was working on refactoring Connect proto tests from Catalyst DSL to DataFrame API, and identified that Join in Connect does not support `UsingColumns`. This is a gap between the Connect proto and DataFrame API. This also blocks the refactoring work because without `UsingColumns`, there is no compatible DataFrame Join API that we can covert existing tests to. This PR adds the support for Join's `UsingColumns`. ### Why are the changes needed? 1. Improve API coverage. 2. Unblock testing refactoring. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT Closes apache#38345 from amaliujia/proto-join-using-columns. Authored-by: Rui Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
I was working on refactoring Connect proto tests from Catalyst DSL to DataFrame API, and identified that Join in Connect does not support
UsingColumns. This is a gap between the Connect proto and DataFrame API. This also blocks the refactoring work because withoutUsingColumns, there is no compatible DataFrame Join API that we can covert existing tests to.This PR adds the support for Join's
UsingColumns.Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT