@cookiedough77 (Contributor) commented Sep 12, 2025

What changes were proposed in this pull request?

This PR updates the Spark Connect server to return resolved dataset and flow names in the responses of DefineDataset and DefineFlow RPCs.

Changes include:

  1. Adding resolved_data_name and resolved_flow_name to the respective proto response messages.
  2. Updating the RPC handlers to return the resolved identifiers in the response (sketched below).
  3. Adding unit tests in SparkDeclarativePipelinesServerSuite to validate the resolved names.
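
A rough sketch of the handler-side wiring, for orientation. `CatalogIdentifier.newBuilder()`, `setCatalogName`, `DefineDatasetResult`, and the `resolvedCatalog`/`resolvedDb`/`resolvedId` accessors all appear in snippets later in this thread; the remaining setter names and the result wiring are assumptions, not the exact patch:

```
// Sketch only: setDbName, setTableName, and setResolvedIdentifier are
// assumed method names; setCatalogName is taken from the actual diff.
val resolvedDataset = defineDataset(cmd.getDefineDataset, sessionHolder)

val identifierBuilder = CatalogIdentifier.newBuilder()
resolvedDataset.resolvedCatalog.foreach(identifierBuilder.setCatalogName)
resolvedDataset.resolvedDb.foreach(identifierBuilder.setDbName)
identifierBuilder.setTableName(resolvedDataset.resolvedId)

val result = DefineDatasetResult.newBuilder()
  .setResolvedIdentifier(identifierBuilder.build())
  .build()
```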

Why are the changes needed?

The Spark Connect client requires the resolved names for datasets and flows to support graph resolution in the LDP frontend. Returning this information from the server ensures consistent naming and proper registration.

Does this PR introduce any user-facing change?

Yes. The DefineDataset and DefineFlow RPC responses now include fully qualified names like `catalog`.`db`.`mv`. Implicit flows to temp views return unqualified names like `mv`.

How was this patch tested?

Added targeted unit tests in SparkDeclarativePipelinesServerSuite. Verified both default and custom catalog/database cases.

Run the tests with:

```
./build/sbt
> project connect
> testOnly *SparkDeclarativePipelinesServerSuite
```

Was this patch authored or co-authored using generative AI tooling?

No.

@cookiedough77 changed the title from "Jessie.luo data/spark add response" to "Adding response field for DefineDataset and DefineFlow RPC" on Sep 15, 2025
@cookiedough77 changed the title from "Adding response field for DefineDataset and DefineFlow RPC" to "Add response field for DefineDataset and DefineFlow RPC" on Sep 15, 2025
@SCHJonathan (Contributor) left a comment

Mostly LGTM

@SCHJonathan (Contributor) left a comment

Can you fix the PR description to follow the standard format?
See reference: #51590

Inline review thread on the proto diff:

```
DefineDatasetResult define_dataset_result = 2;
DefineFlowResult define_flow_result = 3;
}
message CatalogIdentifier {
```
Contributor left a comment

@gengliangwang @cloud-fan – thoughts on whether CatalogIdentifier is the right name and whether this is the right location for this message? Since this is a type that might end up useful elsewhere as well, I wonder if it would make more sense as a top-level message inside base.proto or catalog.proto.

Member left a comment

Should it just be Identifier?

Contributor left a comment

Identifier is pretty general. The Identifier class in Scala is scoped within the catalog package. If we had a similar package within the proto namespace, then an Identifier proto could make sense?

@sryza (Contributor) left a comment

I left two more comments. After addressing those, this LGTM!

Thanks for bearing with the back and forth on this – protos can't be changed after they're released, so it's important to get them right the first time.

@sryza (Contributor) left a comment

LGTM!

Going to wait a day before merging to give others a chance to look at the proto changes. cc @hvanhovell because of the Spark Connect proto changes.

@sryza requested a review from hvanhovell on September 24, 2025 21:48
Inline review thread on the handler code:

```
val resolvedDataset =
  defineDataset(cmd.getDefineDataset, sessionHolder)
val identifierBuilder = CatalogIdentifier.newBuilder()
resolvedDataset.resolvedCatalog.foreach(identifierBuilder.setCatalogName)
```
Contributor left a comment

We quote the names in CatalystIdentifier because we need to combine them into a single string for the qualified name. But the protobuf message keeps the catalog/schema/table names as separate fields, so why do we quote the names here?

Contributor left a comment

We store the raw names in each field of the trait CatalystIdentifier, and we should do the same for the corresponding protobuf message.
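
For reference, a simplified sketch of the trait shape being referenced (an approximation reconstructed from this thread, not the exact Spark source):

```
// Approximate shape of CatalystIdentifier as discussed here:
// every field stores the raw, unquoted name.
trait CatalystIdentifier {
  def identifier: String        // raw object name
  def database: Option[String]  // raw database name
  def catalog: Option[String]   // raw catalog name
}
```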

@cookiedough77 (Contributor, Author) left a comment

Hi Wenchen! I agree. I believe the current implementation is storing unquoted strings in the proto fields, as we can see in the test cases in SparkDeclarativePipelinesServerSuite. The expected values are all unquoted strings.

[screenshot: expected values in the SparkDeclarativePipelinesServerSuite test cases]

@cloud-fan (Contributor) commented Sep 28, 2025

I think there is a misunderstanding here. Looking at the code:

```
def quoteIdentifier(name: String): String = name.replace("`", "``")
...
def resolvedCatalog: Option[String] = catalog.map(quoteIdentifier)
```

this is not the raw name, and we will see the difference if the catalog name contains backticks.

This is an obvious bug. We did the quoting in CatalystIdentifier because we use it in s"`${resolvedCatalog.get}`.`${resolvedDb.get}`.`$resolvedId`". It does not make sense when we set the protobuf fields.
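
A minimal, self-contained demonstration of the difference (the catalog name is hypothetical):

```
// quoteIdentifier is copied from the snippet above; the value is made up.
object QuotingDemo extends App {
  def quoteIdentifier(name: String): String = name.replace("`", "``")

  val rawCatalog = "my`catalog"        // raw name, as stored in the identifier
  println(rawCatalog)                  // my`catalog  -- what a separate proto field should carry
  println(quoteIdentifier(rawCatalog)) // my``catalog -- escaped form, only valid inside `...` quoting
}
```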

cc @sryza

Contributor left a comment

I believe @cloud-fan is right. @cookiedough77 – are you up for creating a follow-up that addresses this? I think we can revert the changes inside identifiers.scala and populate the values of the proto with the existing identifier / database / catalog properties of CatalystIdentifier.
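
A hedged sketch of that follow-up (`setCatalogName` appears in the diff above; `RawIdent`, `setDbName`, and `setTableName` are illustrative stand-ins, not the real names):

```
// Stand-in for the raw properties of CatalystIdentifier, so the sketch
// is self-contained; the real trait lives in identifiers.scala.
final case class RawIdent(
    identifier: String,
    database: Option[String],
    catalog: Option[String])

def toProto(id: RawIdent): CatalogIdentifier = {
  val b = CatalogIdentifier.newBuilder()
  id.catalog.foreach(b.setCatalogName) // raw, unquoted catalog name
  id.database.foreach(b.setDbName)     // assumed setter name
  b.setTableName(id.identifier)        // assumed setter name
  b.build()
}
```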

@cookiedough77 (Contributor, Author) commented Sep 29, 2025

Yep, working on this, thanks for pointing it out.

@cookiedough77 (Contributor, Author) commented Sep 30, 2025

Fix PR: #52483

@sryza (Contributor) commented Sep 26, 2025

Going to merge now that @cloud-fan's feedback has been addressed

@sryza closed this in e56ab2f Sep 26, 2025
dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 27, 2025
…th `4.1.0-preview3` RC1

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1.

### Why are the changes needed?

There are many changes between Apache Spark 4.1.0-preview2 and preview3.

- apache/spark#52685
- apache/spark#52613
- apache/spark#52553
- apache/spark#52532
- apache/spark#52517
- apache/spark#52514
- apache/spark#52487
- apache/spark#52328
- apache/spark#52200
- apache/spark#52154
- apache/spark#51344

This lets us use the latest bug fixes and new messages to develop new features against `4.1.0-preview3`.

```
$ git clone -b v4.1.0-preview3 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0-preview3` (with the two SDP tests ignored).

```
$ swift test --no-parallel
...
✔ Test run with 203 tests in 21 suites passed after 19.088 seconds.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #252 from dongjoon-hyun/SPARK-54043.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…w RPC

Closes apache#52328 from cookiedough77/jessie.luo_data/spark-add-response.

Lead-authored-by: Jessie Luo <[email protected]>
Co-authored-by: cookiedough77 <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>