[SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL #38276
@@ -0,0 +1,68 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.connect.planner

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

/**
 * [[SparkConnectPlanTestWithSparkSession]] contains a SparkSession for the connect planner.
 *
 * It is not recommended to use Catalyst DSL along with this trait because `SharedSparkSession`
 * has also defined implicits over Catalyst LogicalPlan which will cause ambiguity with the
 * implicits defined in Catalyst DSL.
 */
trait SparkConnectPlanTestWithSparkSession extends SharedSparkSession with SparkConnectPlanTest {
  override def getSession(): SparkSession = spark
}

class SparkConnectDeduplicateSuite extends SparkConnectPlanTestWithSparkSession {

  lazy val connectTestRelation = createLocalRelationProto(
    Seq(
      AttributeReference("id", IntegerType)(),
      AttributeReference("key", StringType)(),
      AttributeReference("value", StringType)()))

  lazy val sparkTestRelation = {
    spark.createDataFrame(
      new java.util.ArrayList[Row](),
      StructType(
        Seq(
          StructField("id", IntegerType),
          StructField("key", StringType),
          StructField("value", StringType))))
  }

  test("Test basic deduplicate") {
    val connectPlan = {
      import org.apache.spark.sql.connect.dsl.plans._
      Dataset.ofRows(spark, transform(connectTestRelation.distinct()))
    }

    val sparkPlan = sparkTestRelation.distinct()
    comparePlans(connectPlan.queryExecution.analyzed, sparkPlan.queryExecution.analyzed, false)

    val connectPlan2 = {
      import org.apache.spark.sql.connect.dsl.plans._
      Dataset.ofRows(spark, transform(connectTestRelation.deduplicate(Seq("key", "value"))))
    }
Suggested change (hoist the DSL import out of the sub-scopes):

import org.apache.spark.sql.connect.dsl.plans._
val connectPlan = Dataset.ofRows(spark, transform(connectTestRelation.distinct()))
val sparkPlan = sparkTestRelation.distinct()
comparePlans(connectPlan.queryExecution.analyzed, sparkPlan.queryExecution.analyzed, false)
val connectPlan2 = Dataset.ofRows(spark, transform(connectTestRelation.deduplicate(Seq("key", "value"))))
I think there was an issue here with the way that the two implicits of Spark and Spark Connect DSL are handled.
Yeah, here is some context that people may not know:
Scala seems to not allow two implicits defined in the same scope even when there is no real ambiguity; in that case Scala chooses to ignore one of the implementations. The workaround was to use a sub-scope to confine one implicit (the Connect one) while the parent scope imports the other.
See the comment here for the context (Line 42 in 4201a59):

// TODO: Scala only allows one implicit per scope so we keep proto implicit imports in
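To make the mechanics concrete, here is a minimal self-contained sketch (all names are illustrative, not from this PR): two objects each define an implicit class with the same simple name over different types, so importing both at one level breaks name resolution, while confining one import to a sub-scope keeps each usable.

object ImplicitScopingDemo {
  final case class CatalystPlan(desc: String)
  final case class ProtoRelation(id: Int)

  // Both toy "DSLs" define an implicit class named DslPlan, mirroring how
  // the Catalyst DSL and the Connect DSL each define their own DslLogicalPlan.
  object CatalystLikeDsl {
    implicit class DslPlan(val p: CatalystPlan) {
      def distinct(): CatalystPlan = CatalystPlan(s"Deduplicate(all, ${p.desc})")
    }
  }
  object ConnectLikeDsl {
    implicit class DslPlan(val r: ProtoRelation) {
      def distinct(): String = s"proto.Deduplicate(allColumnsAsKeys, rel#${r.id})"
    }
  }

  def main(args: Array[String]): Unit = {
    import CatalystLikeDsl._
    // Adding `import ConnectLikeDsl._` at this level would make the simple
    // name DslPlan ambiguous and break both extensions, even though the
    // wrapped types never overlap.
    println(CatalystPlan("scan").distinct())

    val protoSide = {
      import ConnectLikeDsl._ // confined to this sub-scope: no name clash
      ProtoRelation(42).distinct()
    }
    println(protoSide)
  }
}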
@@ -31,8 +31,11 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  * test cases.
  */
 trait SparkConnectPlanTest {
+
+  def getSession(): SparkSession = None.orNull
+
   def transform(rel: proto.Relation): LogicalPlan = {
-    new SparkConnectPlanner(rel, None.orNull).transform()
+    new SparkConnectPlanner(rel, getSession()).transform()
   }

   def readRel: proto.Relation =
@@ -72,8 +75,6 @@ trait SparkConnectPlanTest {
  */
 class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {

-  protected var spark: SparkSession = null
-
   test("Simple Limit") {
     assertThrows[IndexOutOfBoundsException] {
       new SparkConnectPlanner(
@@ -266,4 +267,26 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
         .build()))
     assert(e.getMessage.contains("DataSource requires a format"))
   }
+
+  test("Test invalid deduplicate") {
+    val deduplicate = proto.Deduplicate
+      .newBuilder()
+      .setInput(readRel)
+      .setAllColumnsAsKeys(true)
+      .addColumnNames("test")
+
+    val e = intercept[InvalidPlanInput] {
+      transform(proto.Relation.newBuilder.setDeduplicate(deduplicate).build())
+    }
+    assert(
+      e.getMessage.contains("Cannot deduplicate on both all columns and a subset of columns"))
+
+    val deduplicate2 = proto.Deduplicate
+      .newBuilder()
+      .setInput(readRel)
+    val e2 = intercept[InvalidPlanInput] {
+      transform(proto.Relation.newBuilder.setDeduplicate(deduplicate2).build())
+    }
+    assert(e2.getMessage.contains("either deduplicate on all columns or a subset of columns"))
+  }
 }
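For contrast with the two invalid shapes above, a well-formed Deduplicate sets exactly one of the two options. A minimal sketch using the same builder API as the tests (variable names are mine; this is not part of the PR's test code):

// Valid: deduplicate on all columns; only the boolean flag is set.
val dedupAll = proto.Deduplicate
  .newBuilder()
  .setInput(readRel)
  .setAllColumnsAsKeys(true)
val planAll = transform(proto.Relation.newBuilder.setDeduplicate(dedupAll).build())

// Valid: deduplicate on a subset; only column names are set.
val dedupSubset = proto.Deduplicate
  .newBuilder()
  .setInput(readRel)
  .addColumnNames("key")
  .addColumnNames("value")
val planSubset = transform(proto.Relation.newBuilder.setDeduplicate(dedupSubset).build())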
Why do we need getAllColumnsAsKeys? Seems like we can just tell when the columns are not set.
Also, this does not match the logical plan.
The issue is, in the Spark Connect client we only see column names, not expr IDs. If the DataFrame has duplicated column names, then deduplicate by all columns can't work in the Spark Connect client.
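For instance (a made-up illustration, not from the PR):

// A self cross join yields two columns both named "id"; the classic Dataset
// API tells them apart by expression ID, which a Connect client, shipping
// only names over the wire, never sees.
val df = spark.range(3).toDF("id")
val joined = df.crossJoin(df)
joined.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- id: long (nullable = false)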
Actually, the column names are unknown either way, as the input plan is unresolved.
I mean, we don't need the rel.getAllColumnsAsKeys condition because we can know that's the case when rel.getColumnNamesCount == 0.
I want to clarify this specifically:

This is one of the Connect proto API design principles: we need to differentiate whether a field is set or not set explicitly. Put another way, every intention should be expressed explicitly. Ultimately, this is to avoid ambiguity on the API surface.

One example is Project. If we see a Project without anything in the project list, how do we interpret it? Does the user want to indicate a SELECT *? Or did the user actually generate an invalid plan? The problem is that there are two possible readings of one plan, and the worse part is that one reading is a valid plan and the other is not. This led us to explicitly encode SELECT * into the proto in #38023.

So one of the reasons we have a bool flag here is to avoid using rel.getColumnNamesCount == 0 to infer distinct on all columns, which would cause the same kind of ambiguity.

This might not be great, because a few more fields could bring another problem: what if the user sets them all? In terms of ambiguity, though, this is not an issue: we know that is an invalid plan, with no second reading.