Skip to content

Conversation

@amaliujia
Copy link
Contributor

What changes were proposed in this pull request?

I was working on refactoring Connect proto tests from Catalyst DSL to DataFrame API, and identified that Join in Connect does not support UsingColumns. This is a gap between the Connect proto and DataFrame API. This also blocks the refactoring work because without UsingColumns, there is no compatible DataFrame Join API that we can covert existing tests to.

This PR adds the support for Join's UsingColumns.

Why are the changes needed?

  1. Improve API coverage.
  2. Unblock testing refactoring.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@amaliujia
Copy link
Contributor Author

R: @cloud-fan

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Relation right = 2;
Expression join_condition = 3;
JoinType join_type = 4;
repeated string using_columns = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can join_condition and using_columns co-exist?

Copy link
Contributor Author

@amaliujia amaliujia Oct 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the current Catalyst implementation. My read is yes.

My understanding is current Catalyst implementation does not support JOIN USING col_name,... in the join_condition. It is a separate code path. Is it true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the API layer (DF and SQL), I don't think they can co-exist. E.g. the parser rule is

joinCriteria
    : ON booleanExpression
    | USING identifierList
    ;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. To match with our API (but not implementation), I added a check to make sure only one of these two will be set.

JoinType join_type = 4;
// Optional. using_columns provides a list of columns that should present on both sides of
// the join inputs that this Join will join on. For example A JOIN B USING col_name is
// equivalent to A JOIN B on A.col_name = B.col_name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also mention that this can't co-exist with join_condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

def join(
otherPlan: proto.Relation,
joinType: JoinType = JoinType.JOIN_TYPE_INNER,
usingColumns: Seq[String] = Seq(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add an overload of def join which only accepts usingColumns, no condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like only one overloaded version can have default arguments:

[ERROR] /Users/rui.wang/Documents/open-source-spark/spark/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:176: in class DslLogicalPlan, multiple overloaded alternatives of method join define default arguments.

I ended up with adding a few more overloaded version of def join.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a27ccd7 Oct 25, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

I was working on refactoring Connect proto tests from Catalyst DSL to DataFrame API, and identified that Join in Connect does not support `UsingColumns`. This is a gap between the Connect proto and DataFrame API. This also blocks the refactoring work because without `UsingColumns`, there is no compatible DataFrame Join API that we can covert existing tests to.

This PR adds the support for Join's `UsingColumns`.

### Why are the changes needed?

1. Improve API coverage.
2. Unblock testing refactoring.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

Closes apache#38345 from amaliujia/proto-join-using-columns.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants