
Conversation

@AnishMahto (Contributor)

What changes were proposed in this pull request?

Propagate source code location details (line number and file path) end to end for declarative pipelines. That is, collect this information from the Python REPL that registers SDP datasets/flows, propagate it through the appropriate Spark Connect handlers, and associate it with the appropriate datasets/flows in pipeline events and exceptions.

Why are the changes needed?

Better observability and a better debugging experience for users. This allows users to identify the exact lines that cause a particular exception.
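As a rough illustration of the payoff, a captured origin can be rendered into failure messages so the offending line is visible. This is a minimal sketch with hypothetical names (`FlowFailure`, `describe`), not Spark's actual `QueryOrigin` or pipeline event classes:

```scala
// Hypothetical sketch: a file/line origin attached to a flow, surfaced in
// the exception message. Spark's real types live in the sql/pipelines module.
final case class QueryOrigin(filePath: Option[String], line: Option[Int]) {
  // Render "file:line" when both are known, degrading gracefully otherwise.
  def describe: String = (filePath, line) match {
    case (Some(f), Some(l)) => s"$f:$l"
    case (Some(f), None)    => f
    case _                  => "<unknown location>"
  }
}

final case class FlowFailure(flowName: String, origin: QueryOrigin, cause: String)
    extends Exception(s"Flow '$flowName' failed at ${origin.describe}: $cause")
```

With this, a failure in a flow registered at line 17 of `pipeline.py` would read `Flow 'sales_agg' failed at pipeline.py:17: unresolved column`, pointing the user at the exact registration site.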

Does this PR introduce any user-facing change?

Yes: we now populate source code information in the origin for pipeline events, which is user-facing. However, SDP has not yet been released in any Spark version.

How was this patch tested?

Added tests to `org.apache.spark.sql.connect.pipelines.PythonPipelineSuite`.

Was this patch authored or co-authored using generative AI tooling?

No

@AnishMahto (Contributor Author): @sryza

@sryza (Contributor) left a comment:

I left a few comments, but this looks close to ready to merge.

CC @gengliangwang @hvanhovell in case either of you are also interested in taking a look.

@AnishMahto requested a review from @sryza on July 1, 2025.
```scala
    .filter(_.nonEmpty),
  properties = dataset.getTablePropertiesMap.asScala.toMap,
  baseOrigin = QueryOrigin(
    filePath = Option.when(dataset.getSourceCodeLocation.hasFileName)(
```
Member:

nit: we can store filePath and line in variables or a method to avoid duplicating code
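A sketch of the suggested deduplication. The `SourceCodeLocation` class below is a hypothetical stand-in for the generated protobuf message (which exposes paired `hasX`/`getX` accessors), and `originFrom` is an illustrative helper name, not the actual code in the PR:

```scala
// Hypothetical stand-in for the generated protobuf SourceCodeLocation message.
final class SourceCodeLocation(fileName: String, lineNumber: Int) {
  def hasFileName: Boolean = fileName.nonEmpty
  def getFileName: String = fileName
  def hasLineNumber: Boolean = lineNumber > 0
  def getLineNumber: Int = lineNumber
}

final case class QueryOrigin(
    filePath: Option[String] = None,
    line: Option[Int] = None)

// One helper computes both optional fields, replacing the duplicated
// Option.when(...) expressions at each call site.
def originFrom(loc: SourceCodeLocation): QueryOrigin =
  QueryOrigin(
    filePath = Option.when(loc.hasFileName)(loc.getFileName),
    line = Option.when(loc.hasLineNumber)(loc.getLineNumber))
```

Call sites then reduce to `baseOrigin = originFrom(dataset.getSourceCodeLocation)`.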

@gengliangwang (Member):

LGTM too.

@sryza (Contributor) left a comment:

LGTM!

@sryza (Contributor) commented on Jul 7, 2025:

@anishm-db it looks like there's a failing style check. After that's fixed, I'll merge this.

@sryza (Contributor) left a comment:

Had one question, but otherwise this looks good!

It looks like there are some merge conflicts from a race condition with this PR: #52154

```diff
     spark: SparkSession,
-    externalInputs: mutable.HashSet[TableIdentifier] = mutable.HashSet.empty
+    externalInputs: mutable.HashSet[TableIdentifier] = mutable.HashSet.empty,
+    queryOrigin: QueryOrigin
```
Contributor:

Does the queryOrigin get used from here? Why does it need to be threaded through?

@AnishMahto (Contributor Author):

Yeah, good point. I'm not actually using it anywhere. We don't need to attach the query origin to the flow analysis context, given that we're already attaching it to the flow object itself. Removed.

@sryza (Contributor) left a comment:

LGTM!

@sryza closed this in 65ff85a on Oct 2, 2025.
dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 27, 2025
…th `4.1.0-preview3` RC1

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1.

### Why are the changes needed?

There are many changes between Apache Spark 4.1.0-preview2 and preview3.

- apache/spark#52685
- apache/spark#52613
- apache/spark#52553
- apache/spark#52532
- apache/spark#52517
- apache/spark#52514
- apache/spark#52487
- apache/spark#52328
- apache/spark#52200
- apache/spark#52154
- apache/spark#51344

This lets the Swift client pick up the latest bug fixes and new messages when developing features for `4.1.0-preview3`.

```
$ git clone -b v4.1.0-preview3 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0-preview3` (with the two SDP ignored tests).

```
$ swift test --no-parallel
...
✔ Test run with 203 tests in 21 suites passed after 19.088 seconds.
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #252 from dongjoon-hyun/SPARK-54043.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
(The commit message repeats the PR description above.)

Closes apache#51344 from AnishMahto/sdp-python-query-origins.

Authored-by: anishm-db <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>
4 participants