
Conversation

@sryza (Contributor) commented Oct 4, 2025

What changes were proposed in this pull request?

When defining a streaming table or materialized view, allow passing a string as the schema, in addition to a StructType. This mimics the flexibility of the DataFrameReader schema argument.

E.g.

```python
from pyspark.sql.functions import lit

@dp.materialized_view(schema="id LONG, name STRING")
def table_with_string_schema():
    return spark.range(5).withColumn("name", lit("test"))
```
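
For comparison, a hedged sketch of the equivalent definition using an explicit StructType rather than a DDL string (assuming the same `dp` module and `spark` session as the example above; this is an illustration, not code from this PR):

```python
from pyspark.sql.functions import lit
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Same table as above, but with the schema spelled out as a StructType
@dp.materialized_view(schema=StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
]))
def table_with_struct_schema():
    return spark.range(5).withColumn("name", lit("test"))
```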

Why are the changes needed?

For flexibility and consistency with similar args.

Does this PR introduce any user-facing change?

Makes changes to unreleased protos.

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@sryza force-pushed the dataset-schema-string branch from 50b839e to cb8ba44 on October 6, 2025 at 14:02
```scala
assert(graph.tables.size == 1)

val table = graph.table(graphIdentifier("table_with_string_schema"))
assert(table.specifiedSchema.isDefined)
```
Member

nit: shall we simply compare `table.specifiedSchema` with an expected schema (for example, `StructType.fromDDL("id LONG, name STRING")`)?
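
For reference, a minimal sketch of what the suggested assertion could look like (`table` comes from the test excerpt above; the import and exact form here are assumptions, not the change that was actually merged):

```scala
import org.apache.spark.sql.types.StructType

// Compare the parsed schema directly against a StructType built from the same DDL string
assert(table.specifiedSchema.contains(StructType.fromDDL("id LONG, name STRING")))
```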

Contributor Author

Oo yeah was not aware of that

```
- optional spark.connect.DataType schema = 7;
+ oneof schema {
+   spark.connect.DataType schema_data_type = 7;
+   string schema_string = 10;
+ }
```
Member

Since Spark 4.1 is not officially released, I wonder if we can change the sequence numbers here.

Contributor Author

Good point – fixing
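
For illustration only, a hedged sketch of what the renumbered oneof could look like once the gap is closed (the field numbers below are hypothetical; the merged .proto is the source of truth):

```
oneof schema {
  // Hypothetical, contiguous field numbers chosen for illustration;
  // check the merged proto definition for the real values.
  spark.connect.DataType schema_data_type = 7;
  string schema_string = 8;
}
```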

@sryza requested a review from gengliangwang on October 7, 2025 at 13:46
@gengliangwang
Member

Thanks, merging to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 27, 2025
…th `4.1.0-preview3` RC1

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1.

### Why are the changes needed?

There are many changes between Apache Spark 4.1.0-preview2 and preview3.

- apache/spark#52685
- apache/spark#52613
- apache/spark#52553
- apache/spark#52532
- apache/spark#52517
- apache/spark#52514
- apache/spark#52487
- apache/spark#52328
- apache/spark#52200
- apache/spark#52154
- apache/spark#51344

To use the latest bug fixes and new messages when developing new features for `4.1.0-preview3`.

```
$ git clone -b v4.1.0-preview3 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0-preview3` (with the two SDP ignored tests).

```
$ swift test --no-parallel
...
✔ Test run with 203 tests in 21 suites passed after 19.088 seconds.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #252 from dongjoon-hyun/SPARK-54043.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
… SDP tables

### What changes were proposed in this pull request?

When defining a streaming table or materialized view, allow passing a string as the schema, in addition to a StructType. This mimics the flexibility of [the `DataFrameReader` schema arg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.schema.html#pyspark.sql.streaming.DataStreamReader.schema).

E.g.
```python
from pyspark.sql.functions import lit

@dp.materialized_view(schema="id LONG, name STRING")
def table_with_string_schema():
    return spark.range(5).withColumn("name", lit("test"))
```

### Why are the changes needed?

For flexibility and consistency with similar args.

### Does this PR introduce _any_ user-facing change?

Makes changes to unreleased protos.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#52517 from sryza/dataset-schema-string.

Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>