
Conversation

@AnishMahto (Contributor)

What changes were proposed in this pull request?

Propagate source code location details (line number and file path) end to end for declarative pipelines. That is, collect this information from the Python REPL that registers SDP datasets/flows, propagate it through the appropriate Spark Connect handlers, and associate it with the appropriate datasets/flows in pipeline events and exceptions.

Why are the changes needed?

Better observability and a better debugging experience for users. This allows users to identify the exact lines that cause a particular exception.
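As a rough illustration of the payoff, a captured origin can be rendered into failure messages so the offending line is visible. This is a minimal sketch with hypothetical names (`FlowFailure`, `describe`), not Spark's actual `QueryOrigin` or pipeline event classes:

```scala
// Hypothetical sketch: a file/line origin attached to a flow, surfaced in
// the exception message. Spark's real types live in the sql/pipelines module.
final case class QueryOrigin(filePath: Option[String], line: Option[Int]) {
  // Render "file:line" when both are known, degrading gracefully otherwise.
  def describe: String = (filePath, line) match {
    case (Some(f), Some(l)) => s"$f:$l"
    case (Some(f), None)    => f
    case _                  => "<unknown location>"
  }
}

final case class FlowFailure(flowName: String, origin: QueryOrigin, cause: String)
    extends Exception(s"Flow '$flowName' failed at ${origin.describe}: $cause")
```

With this, a failure in a flow registered at line 17 of `pipeline.py` would read `Flow 'sales_agg' failed at pipeline.py:17: unresolved column`, pointing the user at the exact registration site.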

Does this PR introduce any user-facing change?

Yes: we now populate source code information in the origin for pipeline events, which is user-facing. However, SDP has not yet been released in any Spark version.

How was this patch tested?

Added tests to `org.apache.spark.sql.connect.pipelines.PythonPipelineSuite`.

Was this patch authored or co-authored using generative AI tooling?

No

@AnishMahto (Contributor Author): @sryza

@sryza (Contributor) left a comment:

I left a few comments, but this looks close to ready to merge.

CC @gengliangwang @hvanhovell in case either of you are also interested in taking a look.

@AnishMahto requested a review from @sryza on July 1, 2025.
```scala
    .filter(_.nonEmpty),
  properties = dataset.getTablePropertiesMap.asScala.toMap,
  baseOrigin = QueryOrigin(
    filePath = Option.when(dataset.getSourceCodeLocation.hasFileName)(
```
Member:

nit: we can store filePath and line in variables or a method to avoid duplicating code
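A sketch of the suggested deduplication. The `SourceCodeLocation` class below is a hypothetical stand-in for the generated protobuf message (which exposes paired `hasX`/`getX` accessors), and `originFrom` is an illustrative helper name, not the actual code in the PR:

```scala
// Hypothetical stand-in for the generated protobuf SourceCodeLocation message.
final class SourceCodeLocation(fileName: String, lineNumber: Int) {
  def hasFileName: Boolean = fileName.nonEmpty
  def getFileName: String = fileName
  def hasLineNumber: Boolean = lineNumber > 0
  def getLineNumber: Int = lineNumber
}

final case class QueryOrigin(
    filePath: Option[String] = None,
    line: Option[Int] = None)

// One helper computes both optional fields, replacing the duplicated
// Option.when(...) expressions at each call site.
def originFrom(loc: SourceCodeLocation): QueryOrigin =
  QueryOrigin(
    filePath = Option.when(loc.hasFileName)(loc.getFileName),
    line = Option.when(loc.hasLineNumber)(loc.getLineNumber))
```

Call sites then reduce to `baseOrigin = originFrom(dataset.getSourceCodeLocation)`.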

@gengliangwang (Member):

LGTM too.

@sryza (Contributor) left a comment:

LGTM!

@sryza (Contributor) commented on Jul 7, 2025:

@anishm-db it looks like there's a failing style check. After that's fixed, I'll merge this.

@sryza (Contributor) left a comment:

Had one question, but otherwise this looks good!

It looks like there are some merge conflicts from a race condition with this PR: #52154

```diff
     spark: SparkSession,
-    externalInputs: mutable.HashSet[TableIdentifier] = mutable.HashSet.empty
+    externalInputs: mutable.HashSet[TableIdentifier] = mutable.HashSet.empty,
+    queryOrigin: QueryOrigin
```
Contributor:

Does the queryOrigin get used from here? Why does it need to be threaded through?

@AnishMahto (Contributor Author):

Yeah, good point. I'm not actually using it anywhere. We don't need to attach the query origin to the flow analysis context, given that we're already attaching it to the flow object itself. Removed.

@sryza (Contributor) left a comment:

LGTM!

@sryza closed this in 65ff85a on Oct 2, 2025.
dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 27, 2025
…th `4.1.0-preview3` RC1

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1.

### Why are the changes needed?

There are many changes between Apache Spark 4.1.0-preview2 and preview3.

- apache/spark#52685
- apache/spark#52613
- apache/spark#52553
- apache/spark#52532
- apache/spark#52517
- apache/spark#52514
- apache/spark#52487
- apache/spark#52328
- apache/spark#52200
- apache/spark#52154
- apache/spark#51344

This lets the Swift client pick up the latest bug fixes and new messages when developing features for `4.1.0-preview3`.

```
$ git clone -b v4.1.0-preview3 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0-preview3` (with the two SDP ignored tests).

```
$ swift test --no-parallel
...
✔ Test run with 203 tests in 21 suites passed after 19.088 seconds.
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #252 from dongjoon-hyun/SPARK-54043.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
(The commit message repeats the PR description above.)

Closes apache#51344 from AnishMahto/sdp-python-query-origins.

Authored-by: anishm-db <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>
4 participants