
Conversation

@sryza (Contributor) commented Oct 4, 2025

What changes were proposed in this pull request?

When defining a streaming table or materialized view, allow passing a string as the schema, in addition to a StructType. This mimics the flexibility of the DataFrameReader schema argument.

E.g.

```python
from pyspark.sql.functions import lit

@dp.materialized_view(schema="id LONG, name STRING")
def table_with_string_schema():
    return spark.range(5).withColumn("name", lit("test"))
```
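
For comparison, a hedged sketch of the equivalent definition using an explicit StructType rather than a DDL string (assuming the same `dp` module and `spark` session as the example above; this is an illustration, not code from this PR):

```python
from pyspark.sql.functions import lit
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Same table as above, but with the schema spelled out as a StructType
@dp.materialized_view(schema=StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
]))
def table_with_struct_schema():
    return spark.range(5).withColumn("name", lit("test"))
```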

Why are the changes needed?

For flexibility and consistency with similar args.

Does this PR introduce any user-facing change?

Makes changes to unreleased protos.

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@sryza force-pushed the dataset-schema-string branch from 50b839e to cb8ba44 on October 6, 2025 at 14:02
```scala
assert(graph.tables.size == 1)

val table = graph.table(graphIdentifier("table_with_string_schema"))
assert(table.specifiedSchema.isDefined)
```
Member

nit: shall we simply compare `table.specifiedSchema` with an expected schema (for example, `StructType.fromDDL("id LONG, name STRING")`)?
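
For reference, a minimal sketch of what the suggested assertion could look like (`table` comes from the test excerpt above; the import and exact form here are assumptions, not the change that was actually merged):

```scala
import org.apache.spark.sql.types.StructType

// Compare the parsed schema directly against a StructType built from the same DDL string
assert(table.specifiedSchema.contains(StructType.fromDDL("id LONG, name STRING")))
```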

Contributor Author

Oo yeah was not aware of that

```
- optional spark.connect.DataType schema = 7;
+ oneof schema {
+   spark.connect.DataType schema_data_type = 7;
+   string schema_string = 10;
+ }
```
Member

Since Spark 4.1 is not officially released, I wonder if we can change the sequence numbers here.

Contributor Author

Good point – fixing
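
For illustration only, a hedged sketch of what the renumbered oneof could look like once the gap is closed (the field numbers below are hypothetical; the merged .proto is the source of truth):

```
oneof schema {
  // Hypothetical, contiguous field numbers chosen for illustration;
  // check the merged proto definition for the real values.
  spark.connect.DataType schema_data_type = 7;
  string schema_string = 8;
}
```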

@sryza requested a review from gengliangwang on October 7, 2025 at 13:46
@gengliangwang
Member

Thanks, merging to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 27, 2025
…th `4.1.0-preview3` RC1

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1.

### Why are the changes needed?

There are many changes between Apache Spark 4.1.0-preview2 and preview3.

- apache/spark#52685
- apache/spark#52613
- apache/spark#52553
- apache/spark#52532
- apache/spark#52517
- apache/spark#52514
- apache/spark#52487
- apache/spark#52328
- apache/spark#52200
- apache/spark#52154
- apache/spark#51344

To use the latest bug fixes and new messages when developing new features for `4.1.0-preview3`.

```
$ git clone -b v4.1.0-preview3 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs. I manually tested with `Apache Spark 4.1.0-preview3` (with the two SDP ignored tests).

```
$ swift test --no-parallel
...
✔ Test run with 203 tests in 21 suites passed after 19.088 seconds.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #252 from dongjoon-hyun/SPARK-54043.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
… SDP tables

### What changes were proposed in this pull request?

When defining a streaming table or materialized view, allow passing a string as the schema, in addition to a StructType. This mimics the flexibility of [the `DataFrameReader` schema arg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.schema.html#pyspark.sql.streaming.DataStreamReader.schema).

E.g.
```python
from pyspark.sql.functions import lit

@dp.materialized_view(schema="id LONG, name STRING")
def table_with_string_schema():
    return spark.range(5).withColumn("name", lit("test"))
```

### Why are the changes needed?

For flexibility and consistency with similar args.

### Does this PR introduce _any_ user-facing change?

Makes changes to unreleased protos.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#52517 from sryza/dataset-schema-string.

Authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>