
Conversation

@amaliujia
Contributor

@amaliujia amaliujia commented Oct 16, 2022

What changes were proposed in this pull request?

This PR supports Deduplicate to Connect proto and DSL.

Note that Deduplicate cannot be replaced by SQL's SELECT DISTINCT col_list: Deduplicate removes duplicate rows based on a set of columns but returns all the columns, whereas SQL's SELECT DISTINCT col_list returns only the col_list.
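To illustrate the difference described above, here is a minimal standalone Python sketch (no Spark involved; `dedup_by_key` and `distinct_keys` are invented names for illustration):

```python
# Standalone sketch (no Spark): rows are (key, value) tuples.

def dedup_by_key(rows):
    """Deduplicate-style: drop duplicate rows based on the key column only,
    keeping all columns of the surviving rows (first row per key wins here;
    Spark keeps an arbitrary one)."""
    seen = set()
    out = []
    for key, value in rows:
        if key not in seen:
            seen.add(key)
            out.append((key, value))
    return out

def distinct_keys(rows):
    """SQL-style SELECT DISTINCT key: only the projected column comes back."""
    out = []
    for key, _ in rows:
        if key not in out:
            out.append(key)
    return out
```

Given rows `("a", 1), ("a", 2), ("b", 3)`, the first returns full rows `("a", 1)` and `("b", 3)`, while the second returns only the keys `"a"` and `"b"`.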

Why are the changes needed?

  1. To improve proto API coverage.
  2. Deduplicate blocks [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server #38166 because we want to support Union(isAll=false), but that will return Union().Distinct() to match the existing DataFrame API. Deduplicate is needed to write test cases for Union(isAll=false).

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@amaliujia
Contributor Author

R: @cloud-fan

@amaliujia amaliujia changed the title [SPARK-40812][CONNECT]Add Deduplicate to Connect proto and DSL. [SPARK-40812][CONNECT]Add Deduplicate to Connect proto and DSL Oct 16, 2022
@HyukjinKwon HyukjinKwon changed the title [SPARK-40812][CONNECT]Add Deduplicate to Connect proto and DSL [SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL Oct 17, 2022
@AmplabJenkins

Can one of the admins verify this patch?

Contributor

12?

Suggested change:
- Deduplicate deduplicate = 13;
+ Deduplicate deduplicate = 12;

Contributor Author

@amaliujia amaliujia Oct 17, 2022

12 is taken by Sample though that PR is not merged :) #38227

This line will cause a merge conflict: whichever PR goes in first takes 12, and the other takes 13.

Comment on lines +53 to +64
Contributor

Suggested change:
- val connectPlan = {
-   import org.apache.spark.sql.connect.dsl.plans._
-   Dataset.ofRows(spark, transform(connectTestRelation.distinct()))
- }
- val sparkPlan = sparkTestRelation.distinct()
- comparePlans(connectPlan.queryExecution.analyzed, sparkPlan.queryExecution.analyzed, false)
- val connectPlan2 = {
-   import org.apache.spark.sql.connect.dsl.plans._
-   Dataset.ofRows(spark, transform(connectTestRelation.deduplicate(Seq("key", "value"))))
- }
+ import org.apache.spark.sql.connect.dsl.plans._
+ val connectPlan = Dataset.ofRows(spark, transform(connectTestRelation.distinct()))
+ val sparkPlan = sparkTestRelation.distinct()
+ comparePlans(connectPlan.queryExecution.analyzed, sparkPlan.queryExecution.analyzed, false)
+ val connectPlan2 = Dataset.ofRows(spark, transform(connectTestRelation.deduplicate(Seq("key", "value"))))

Contributor

I think there was an issue here with the way that the two implicits of Spark and Spark Connect DSL are handled.

Contributor Author

@amaliujia amaliujia Oct 17, 2022

Yeah, here is some context that people may not know:

Scala seems not to allow two implicits defined in the same scope even when there is no ambiguity; in that case Scala silently ignores one of the implementations. The workaround was to use a sub-scope to confine one implicit (the Connect one) while the parent scope imports the other.

See comment here for the context:

// TODO: Scala only allows one implicit per scope so we keep proto implicit imports in

Contributor

Is it possible to reuse SharedSparkSessionBase?

Contributor Author

@amaliujia amaliujia Oct 17, 2022

This is related to the comment above: SharedSparkSession and its base define implicits that conflict with the Catalyst implicits, so the current SparkConnectProtoSuite cannot inherit SharedSparkSession. Meanwhile, testing this Deduplicate implementation requires a session. That is why this PR refactors the test suites to separate the two.

@zhengruifeng
Contributor

please rebase to enable the codegen check

Contributor

We don't need to do name lookup. We can just create Deduplicate(allColumns, queryExecution.analyzed)

Contributor Author

done.

Contributor

Given the usage of is_all, can't this be inferred from the list of columns? If they're identical, is_all is true?

Contributor

So the question is if is_all is a nice convenience hack or strictly necessary, I'm ok with both :)

Contributor Author

@amaliujia amaliujia Oct 17, 2022

It is a convenience: a user who just wants to chain a DISTINCT does not need to know the previous schema of the LogicalPlan, and instead leaves the backend to resolve the complete schema of the previous operation. (Though a user could write val t = df.schema(); df.deduplicate(t).)

Well, from another perspective, if we want to match the existing DataFrame API (which is distinct()), this becomes a necessity.


@cloud-fan
Contributor

can you fix the conflicts?

@amaliujia amaliujia force-pushed the supportDropDuplicates branch from ad0451a to 138bc4e on October 19, 2022 20:49
@cloud-fan
Contributor

oh, conflicts again...

@amaliujia
Contributor Author

@cloud-fan let me fix. It is very easy to get conflicts on the auto-generated Python proto files (python/pyspark/sql/connect/proto/relations_pb2.py); they act as a single point of failure.

@amaliujia amaliujia force-pushed the supportDropDuplicates branch from a25a351 to 8f0ad05 on October 20, 2022 03:31
Contributor

I don't think spark connect will have backward compatibility issues. It's a new API.

Contributor Author

I see. Removed this comment.

Contributor

This makes me think that we probably shouldn't use catalyst DSL at all. The tests need Spark Connect planner, which needs SparkSession. Some tests happen to not invoke SparkSession, but it's a bit hacky to rely on this assumption. We should just compare plans produced by proto DSL and DataFrame APIs.

Contributor Author

@amaliujia amaliujia Oct 20, 2022

We decided to go with the Catalyst DSL in #37994. However, that also caused the pain of needing small scopes to avoid implicit conflicts.

Given that we need session-based tests, migrating all such tests to the same place that uses a session and the DataFrame API just makes sense. I also believe that with this approach we no longer have implicit conflicts in the same scope.

How about I send a follow-up PR for the test refactoring after this PR?

Or we can have the refactoring happen first (I prefer a separate PR) and then I rebase this one. Either way is fine with me.

Contributor

We can do it in a followup.

@amaliujia amaliujia force-pushed the supportDropDuplicates branch from 40b9c5b to bf453b3 on October 20, 2022 21:39
Contributor

seems this is not used?

Contributor Author

Oh yes, because I had to use the DataFrame API for testing this time.

Removed.

Contributor

@cloud-fan cloud-fan left a comment

@amaliujia amaliujia force-pushed the supportDropDuplicates branch from bf453b3 to 9110226 on October 21, 2022 03:07
@amaliujia
Contributor Author

I will monitor the build job; it is flakier than usual.

@cloud-fan
Contributor

thanks, merging to master!

if (rel.getAllColumnsAsKeys && rel.getColumnNamesCount > 0) {
  throw InvalidPlanInput("Cannot deduplicate on both all columns and a subset of columns")
}
if (!rel.getAllColumnsAsKeys && rel.getColumnNamesCount == 0) {
Member

Why do we need getAllColumnsAsKeys? Seems like we can just tell when the columns are not set.

Member

Also, this does not match the logical plan.

Contributor

The issue is, in Spark connect client, we only see column names, not expr IDs. If the DF has duplicated column names, then deduplicate by all columns can't work in Spark connect client.

Contributor

Actually, the column names are unknown too, as the input plan is unresolved.

Member

I mean, we don't need rel.getAllColumnsAsKeys condition because we can know that's the case when rel.getColumnNamesCount == 0.

Contributor Author

@amaliujia amaliujia Oct 24, 2022

I mean, we don't need rel.getAllColumnsAsKeys condition because we can know that's the case when rel.getColumnNamesCount == 0.

I want to clarify this specifically:

This is one of the Connect proto API design principles: we need to differentiate whether a field is set explicitly or not; to put it another way, every intention should be expressed explicitly. Ultimately, this avoids ambiguity on the API surface.

One example is Project. If we see a Project without anything in the project list, how do we interpret that? Does the user want to indicate a SELECT *? Or did the user actually generate an invalid plan? The problem is that there are two possibilities for one plan, and the worse part is that one possibility is a valid plan and the other is not. This led us to explicitly encode SELECT * into the proto in #38023.

So one of the reasons we have a bool flag here is to avoid using rel.getColumnNamesCount == 0 to infer distinct on all columns, which would cause that same ambiguity problem.

This might not be great, because a few more fields could bring another problem: what if the user sets them all? In terms of ambiguity, though, that is not an issue: we know it is an invalid plan, with no second interpretation.
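Putting the two rules together, the check quoted earlier boils down to something like this Python sketch (`DeduplicateRel` and `validate` are stand-ins invented here to mirror the generated proto message and the planner check, not actual Spark code):

```python
from dataclasses import dataclass, field

@dataclass
class DeduplicateRel:
    # Stand-ins for the proto fields all_columns_as_keys and column_names.
    all_columns_as_keys: bool = False
    column_names: list = field(default_factory=list)

def validate(rel):
    """Reject the two bad combinations: both fields set (ambiguous intent)
    and neither field set (no intent expressed)."""
    if rel.all_columns_as_keys and rel.column_names:
        raise ValueError("Cannot deduplicate on both all columns and a subset of columns")
    if not rel.all_columns_as_keys and not rel.column_names:
        raise ValueError("Deduplicate requires either all columns or a subset of columns")
    return rel
```

With the explicit flag, "deduplicate on all columns" and "no columns given at all" are distinct wire states, so the invalid plan can be rejected instead of silently reinterpreted.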

cloud-fan pushed a commit that referenced this pull request Oct 24, 2022
…on client

### What changes were proposed in this pull request?

Following up on #38276, this PR improves both the `distinct()` and `dropDuplicates` DataFrame APIs in the Python client, both of which depend on the `Deduplicate` plan in the Connect proto.

### Why are the changes needed?

Improve API coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #38327 from amaliujia/python_deduplicate.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022