@heyihong heyihong commented Jul 24, 2025

What changes were proposed in this pull request?

This PR starts the transition to using DataType.Array and DataType.Map for array and map literals throughout the Spark Connect codebase.

While the Spark Connect server supports both the new and the old data type fields, this change sets the new fields only in ColumnNodeToProtoConverter for the Spark Connect Scala client. All other components (e.g., ML, Python) still use the old data type fields, because literal values appear not only in requests but also in responses, which makes compatibility hard to maintain: clients on older versions may not recognize the new fields in a response. Deprecating the old fields and transitioning to the new ones therefore requires a gradual migration. The key changes include:

Protocol Buffer Updates:

  • Modified expressions.proto to add new data_type fields for Array and Map messages
  • Deprecated the existing element_type, key_type, and value_type fields in favor of the unified data_type approach (a rough sketch of both encodings follows this list)
  • Updated generated protocol buffer files (expressions_pb2.py, expressions_pb2.pyi) to reflect these changes
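
The sketch below illustrates the two encodings described above, written against the generated proto builders. It reflects my reading of the PR description only; the exact name and type of the new field (`data_type` carrying a `DataType.Array`) and the corresponding builder methods are assumptions, not a verbatim copy of expressions.proto.

```
import org.apache.spark.connect.proto.{DataType, Expression}

val intType = DataType.newBuilder()
  .setInteger(DataType.Integer.newBuilder())
  .build()

// Deprecated encoding: only the element type travels with the literal,
// so the contains_null flag is lost.
val oldStyle = Expression.Literal.Array.newBuilder()
  .setElementType(intType)
  .build()

// New encoding (assumed builder name): the full DataType.Array, including
// contains_null, travels with the literal.
val newStyle = Expression.Literal.Array.newBuilder()
  .setDataType(
    DataType.Array.newBuilder()
      .setElementType(intType)
      .setContainsNull(true))
  .build()
```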

Core Implementation Changes:

  • Enhanced LiteralValueProtoConverter.scala with new internal method toLiteralProtoBuilderInternal that accepts ToLiteralProtoOptions
  • Updated LiteralExpressionProtoConverter.scala to support inference of array and map data types
  • Modified columnNodeSupport.scala to use the new toLiteralProtoBuilderWithOptions method with useDeprecatedDataTypeFields set to false (see the sketch after this list)
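
A rough sketch of the new client-side call path, under my reading of the description above; the exact parameter list of toLiteralProtoBuilderWithOptions and where ToLiteralProtoOptions is defined are assumptions, not copied from the PR.

```
import org.apache.spark.sql.connect.common.LiteralValueProtoConverter
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// In columnNodeSupport.scala the Scala client converts a literal with the
// deprecated per-field types switched off, so the unified data_type field
// is populated instead of element_type/key_type/value_type.
val builder = LiteralValueProtoConverter.toLiteralProtoBuilderWithOptions(
  Seq(1, 2, 3),
  Some(ArrayType(IntegerType, containsNull = false)),
  LiteralValueProtoConverter.ToLiteralProtoOptions(useDeprecatedDataTypeFields = false))
```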

Why are the changes needed?

The changes are needed to improve Spark's data type handling for array and map literals:

  • Nullability of Array/Map literals is now included in DataType.Array/Map: nullability information is captured and handled within the data type structure itself.

  • Better type inference, because all type information lives in a single field: consolidating the type information makes it easier to infer data types for nested data structures (see the example after this list).
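
For example (an illustration, not code from this PR), a nested literal type carries nullability flags at several levels; a single DataType.Array/Map field can transport the whole tree, whereas the deprecated element_type/key_type/value_type fields only carried the child types:

```
import org.apache.spark.sql.types._

// The containsNull / valueContainsNull flags below are exactly the information
// that now rides along inside DataType.Array / DataType.Map.
val nested: DataType = ArrayType(
  MapType(
    StringType,
    ArrayType(IntegerType, containsNull = true),
    valueContainsNull = false),
  containsNull = true)
```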

Does this PR introduce any user-facing change?

Yes. Previously, the nullability of arrays and map values created with typedlit was not preserved (which I believe was a bug); it is now preserved. Please see the changes in ClientE2ETestSuite for details.
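
A minimal illustration of the behavior change, assuming `spark` is an active Spark Connect SparkSession; the expected schema reflects my reading of the PR description and ClientE2ETestSuite rather than a verbatim test:

```
import org.apache.spark.sql.functions.typedlit

// Seq[Option[Int]] maps to an array whose elements may be null.
val df = spark.range(1).select(typedlit(Seq(Some(1), None, Some(3))).as("xs"))

// Previously the containsNull flag was dropped on the wire; with this change
// the column should come back as ArrayType(IntegerType, containsNull = true).
df.printSchema()
```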

How was this patch tested?

```
build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite -- -z SPARK-52930"
```

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

@heyihong heyihong changed the title [SPARK-52930] Use DataType.Array/Map for Array/Map Literals [SPARK-52930][CONNECT] Use DataType.Array/Map for Array/Map Literals Jul 24, 2025
@github-actions github-actions bot removed the ML label Jul 24, 2025
@heyihong heyihong force-pushed the SPARK-52930 branch 4 times, most recently from 78f2bb7 to 4f64261 Compare July 27, 2025 20:43

I personally feel it might be better to introduce new messages for this purpose, so that we can minimize the code changes and the conversion logic on the server side can be clearer.


@heyihong heyihong Aug 25, 2025


@zhengruifeng FYI, I have simplified the implementation a bit. It is possible to minimize the code changes and make the conversion logic on the server side clearer without introducing new messages.

@heyihong heyihong force-pushed the SPARK-52930 branch 2 times, most recently from ac9d081 to 2ca98c4 Compare August 25, 2025 12:34
@heyihong heyihong requested a review from zhengruifeng August 25, 2025 12:36
@heyihong heyihong force-pushed the SPARK-52930 branch 2 times, most recently from aa2f322 to bb44b23 Compare August 25, 2025 12:59
@zhengruifeng

merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

This lets us use the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>