
Conversation

heyihong (Contributor) commented Sep 15, 2025:

What changes were proposed in this pull request?

This PR simplifies data type handling in the Spark Connect LiteralValueProtoConverter by consolidating type information into a single data_type field at the root level of the Literal message, rather than having separate type fields within nested structures.

Key changes:

  1. Protobuf Schema Simplification:

    • Added a new data_type field (field 100) to the root Expression.Literal message
    • Removed redundant type fields from nested messages (Array.data_type, Map.data_type, Struct.data_type_struct); a short construction sketch follows this list
  2. Array Type Handling Enhancement:

    • Added special handling for ByteType arrays to convert them to Binary type in the protobuf representation
    • This addresses a specific edge case where byte arrays should be represented as binary data
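
As an illustration of both points above, here is a minimal Scala sketch of building an ARRAY<INT> literal with the consolidated field. It assumes the generated protobuf builders in org.apache.spark.connect.proto (names such as setDataType, addElements, and setElementType follow the usual protobuf-java conventions and may differ slightly from the actual generated code):

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Sketch: an ARRAY<INT> literal [1, 2]. The element type is carried only by
// the root-level data_type field (field 100), not by a nested Array.data_type.
val intType = DataType.newBuilder()
  .setInteger(DataType.Integer.newBuilder())
  .build()

val arrayLiteral = Expression.Literal.newBuilder()
  .setArray(
    Expression.Literal.Array.newBuilder()
      .addElements(Expression.Literal.newBuilder().setInteger(1))
      .addElements(Expression.Literal.newBuilder().setInteger(2)))
  .setDataType(
    DataType.newBuilder().setArray(
      DataType.Array.newBuilder()
        .setElementType(intType)
        .setContainsNull(false)))
  .build()

// An Array[Byte], by contrast, would be emitted as a binary literal
// (the Literal.binary field) rather than as an ARRAY of byte literals.
```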

Why are the changes needed?

The current data type handling in Spark Connect has several issues:

  1. Redundancy and Complexity: Type information is scattered across multiple fields in nested messages, making the protobuf schema unnecessarily complex and error-prone.

  2. Limited Extensibility: Without a single root-level data_type field, it is difficult to attach complete type information to literals. For example, it's challenging to include detailed type metadata for types like String (with collation information), YearMonthInterval, DayTimeInterval, and other types that may require additional type-specific attributes (see the sketch below).
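
For a concrete (hypothetical) example of the extensibility point, consider a collated string literal: the string value alone cannot express its collation, but a root-level data_type can carry it. The sketch below assumes the DataType.String message exposes a collation field and uses the usual generated builder names:

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Sketch: a string literal whose type carries collation information,
// which the bare string value alone cannot express.
val collatedString = Expression.Literal.newBuilder()
  .setString("hello")
  .setDataType(
    DataType.newBuilder().setString(
      DataType.String.newBuilder().setCollation("UNICODE_CI")))
  .build()
```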

Does this PR introduce any user-facing change?

No. This is an internal refactoring of the Spark Connect protobuf schema and converter logic.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

Contributor commented:

Is this an optional field? For simple literals such as byte and int, we can actually infer the data type.

heyihong (Contributor, Author) replied Sep 17, 2025:

In proto3, fields that are message types are always optional. If the data_type field is not set, we fall back to inferring the data type or using the deprecated data type fields. I will update the comment.

In my opinion, it is better to always set the data_type field, because avoiding data type inference simplifies both the protocol and its implementation across different languages.
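
A rough Scala sketch of the fallback just described (illustrative only, not the actual LiteralValueProtoConverter code; the inference step is passed in as a stand-in function):

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Illustrative fallback: prefer the explicit root-level data_type,
// otherwise fall back to inference over the literal itself.
def resolveDataType(
    lit: Expression.Literal,
    infer: Expression.Literal => DataType): DataType = {
  if (lit.hasDataType) lit.getDataType // explicitly set by the client
  else infer(lit)                      // e.g. derived from the literal kind
}
```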

Contributor commented:

Is this compatible with the old client versions?

heyihong (Contributor, Author) replied Sep 17, 2025:

Yes, the old client versions don't use these removed fields. Also, PySpark Connect still uses the old client version, and it works.

Contributor commented:

Different from your previous PRs, it seems this PR makes the proto more complex?

heyihong (Contributor, Author) replied Sep 17, 2025:

The implementation logic was simplified (262 lines added and 464 lines removed) in sql/connect/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala, as it no longer infers data types. In addition, the protobuf message schema was simplified (1 field added and 3 fields removed).

The proto message size may increase a little bit though.

Contributor commented:

To be more efficient, shall we skip the data type field for trivial type literals at the root level? Then we can keep the message unchanged for common cases.

heyihong (Contributor, Author) replied Sep 18, 2025:

Makes sense. To minimize the changes, I will only set the data_type field for the types listed below for now:

  • NULL values
  • ARRAY type
  • STRUCT type
  • MAP type

We can extend the list later.
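
A small Scala sketch of that policy using Spark's Catalyst type classes (names are illustrative, not the actual converter code):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Illustrative policy: only NULL, ARRAY, STRUCT, and MAP literals get an
// explicit root-level data_type; trivial types stay unchanged for now.
def shouldSetRootDataType(dataType: DataType, isNull: Boolean): Boolean =
  isNull || (dataType match {
    case _: ArrayType | _: MapType | _: StructType => true
    case _ => false
  })
```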

heyihong force-pushed the SPARK-53578 branch 6 times, most recently from b7fe5a7 to 2d788e6 on September 18, 2025 16:28
cloud-fan (Contributor) commented:

thanks, merging to master!

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

This allows us to use the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
(the commit message mirrors the PR description above)
Closes apache#52342 from heyihong/SPARK-53578.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>