@heyihong heyihong commented Jul 24, 2025

What changes were proposed in this pull request?

This PR starts the transition to using DataType.Array and DataType.Map for array and map literals throughout the Spark Connect codebase.

While the Spark Connect server supports both the new and the old data type fields, this change sets the new fields only in ColumnNodeToProtoConverter for the Spark Connect Scala client. All other components (e.g., ML, Python) still use the old data type fields, because literal values appear not only in requests but also in responses, which makes compatibility hard to maintain: clients on older versions may not recognize the new fields in a response. Deprecating the old fields and transitioning to the new ones therefore requires a gradual migration. The key changes include:

Protocol Buffer Updates:

  • Modified expressions.proto to add new data_type fields for Array and Map messages
  • Deprecated the existing element_type, key_type, and value_type fields in favor of the unified data_type approach (a rough sketch of both encodings follows this list)
  • Updated generated protocol buffer files (expressions_pb2.py, expressions_pb2.pyi) to reflect these changes
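
The sketch below illustrates the two encodings described above, written against the generated proto builders. It reflects my reading of the PR description only; the exact name and type of the new field (`data_type` carrying a `DataType.Array`) and the corresponding builder methods are assumptions, not a verbatim copy of expressions.proto.

```
import org.apache.spark.connect.proto.{DataType, Expression}

val intType = DataType.newBuilder()
  .setInteger(DataType.Integer.newBuilder())
  .build()

// Deprecated encoding: only the element type travels with the literal,
// so the contains_null flag is lost.
val oldStyle = Expression.Literal.Array.newBuilder()
  .setElementType(intType)
  .build()

// New encoding (assumed builder name): the full DataType.Array, including
// contains_null, travels with the literal.
val newStyle = Expression.Literal.Array.newBuilder()
  .setDataType(
    DataType.Array.newBuilder()
      .setElementType(intType)
      .setContainsNull(true))
  .build()
```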

Core Implementation Changes:

  • Enhanced LiteralValueProtoConverter.scala with new internal method toLiteralProtoBuilderInternal that accepts ToLiteralProtoOptions
  • Updated LiteralExpressionProtoConverter.scala to support inference of array and map data types
  • Modified columnNodeSupport.scala to use the new toLiteralProtoBuilderWithOptions method with useDeprecatedDataTypeFields set to false (see the sketch after this list)
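
A rough sketch of the new client-side call path, under my reading of the description above; the exact parameter list of toLiteralProtoBuilderWithOptions and where ToLiteralProtoOptions is defined are assumptions, not copied from the PR.

```
import org.apache.spark.sql.connect.common.LiteralValueProtoConverter
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// In columnNodeSupport.scala the Scala client converts a literal with the
// deprecated per-field types switched off, so the unified data_type field
// is populated instead of element_type/key_type/value_type.
val builder = LiteralValueProtoConverter.toLiteralProtoBuilderWithOptions(
  Seq(1, 2, 3),
  Some(ArrayType(IntegerType, containsNull = false)),
  LiteralValueProtoConverter.ToLiteralProtoOptions(useDeprecatedDataTypeFields = false))
```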

Why are the changes needed?

The changes are needed to improve Spark's data type handling for array and map literals:

  • Nullability of Array/Map literals is now included in DataType.Array/Map: nullability information is captured and handled within the data type structure itself.

  • Better type inference, because all type information lives in a single field: consolidating the type information makes it easier to infer data types for nested data structures (see the example after this list).
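
For example (an illustration, not code from this PR), a nested literal type carries nullability flags at several levels; a single DataType.Array/Map field can transport the whole tree, whereas the deprecated element_type/key_type/value_type fields only carried the child types:

```
import org.apache.spark.sql.types._

// The containsNull / valueContainsNull flags below are exactly the information
// that now rides along inside DataType.Array / DataType.Map.
val nested: DataType = ArrayType(
  MapType(
    StringType,
    ArrayType(IntegerType, containsNull = true),
    valueContainsNull = false),
  containsNull = true)
```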

Does this PR introduce any user-facing change?

Yes. Previously, the nullability of arrays and map values created with typedlit was not preserved (which I believe was a bug); it is now preserved. Please see the changes in ClientE2ETestSuite for details.
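
A minimal illustration of the behavior change, assuming `spark` is an active Spark Connect SparkSession; the expected schema reflects my reading of the PR description and ClientE2ETestSuite rather than a verbatim test:

```
import org.apache.spark.sql.functions.typedlit

// Seq[Option[Int]] maps to an array whose elements may be null.
val df = spark.range(1).select(typedlit(Seq(Some(1), None, Some(3))).as("xs"))

// Previously the containsNull flag was dropped on the wire; with this change
// the column should come back as ArrayType(IntegerType, containsNull = true).
df.printSchema()
```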

How was this patch tested?

```
build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite -- -z SPARK-52930"
```

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

@heyihong heyihong changed the title [SPARK-52930] Use DataType.Array/Map for Array/Map Literals [SPARK-52930][CONNECT] Use DataType.Array/Map for Array/Map Literals Jul 24, 2025
@github-actions github-actions bot removed the ML label Jul 24, 2025
@heyihong heyihong force-pushed the SPARK-52930 branch 4 times, most recently from 78f2bb7 to 4f64261 Compare July 27, 2025 20:43

I personally feel it might be better to introduce new messages for this purpose, so that we can minimize the code changes and the conversion logic on the server side can be clearer.


@heyihong heyihong Aug 25, 2025


@zhengruifeng FYI, I have simplified the implementation a bit. It is possible to minimize the code changes and make the conversion logic on the server side clearer without introducing new messages.

@heyihong heyihong force-pushed the SPARK-52930 branch 2 times, most recently from ac9d081 to 2ca98c4 Compare August 25, 2025 12:34
@heyihong heyihong requested a review from zhengruifeng August 25, 2025 12:36
@heyihong heyihong force-pushed the SPARK-52930 branch 2 times, most recently from aa2f322 to bb44b23 Compare August 25, 2025 12:59
@zhengruifeng

merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

This lets us use the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>