[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312

heyihong · 2025-09-11T17:01:56Z

What changes were proposed in this pull request?

This PR optimizes the LiteralValueProtoConverter to avoid setting redundant struct data types in protobuf messages when they are not needed. The main changes include:

Modified structBuilder method signature: Added a needDataType: Boolean parameter to control whether struct data type information should be included in the protobuf message.
Conditional data type struct building: The struct builder now only creates and populates the dataTypeStruct field when needDataType is true, avoiding redundant data type information when it's not required.
Updated method calls: Modified the call to structBuilder in the main conversion logic to pass the needDataType parameter.
Added test case: Added a new test case for typedLit with tuple sequences to ensure the optimization works correctly with complex nested structures.

The key optimization is in the structBuilder method where the dataTypeStruct field is now only populated when needDataType is true, preventing unnecessary serialization of struct metadata.

Why are the changes needed?

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

Does this PR introduce any user-facing change?

No, this PR does not introduce any user-facing changes.

How was this patch tested?

build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

…ueProtoConverter

heyihong · 2025-09-11T17:07:16Z

@zhengruifeng @hvanhovell @cloud-fan

cloud-fan · 2025-09-12T09:58:33Z

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

I may miss some context. When is this information redundant or not needed? Do we have some examples?

heyihong · 2025-09-12T12:23:41Z

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

I may miss some context. When is this information redundant or not needed? Do we have some examples?

Yes, for an array of structs, only the first element needs to include the struct type information. The rest do not need to have the type information set.

cloud-fan · 2025-09-12T13:36:22Z

Literal should separate the data and metadata (data type), so array of struct literal should have the array of struct value and the array of struct data type. This is how the catalyst Litetal is designed. ArrayData represents the data of an array and it does not contain the data type.

How was the literal protobuf designed?

heyihong · 2025-09-12T15:40:52Z

Literal should separate the data and metadata (data type), so array of struct literal should have the array of struct value and the array of struct data type. This is how the catalyst Litetal is designed. ArrayData represents the data of an array and it does not contain the data type.

How was the literal protobuf designed?

The original design may be inspired by https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L993-L1033 and the type information (through type_variation_reference) is included in the literal message if I understand correctly.

Having a clear separation between the data and metadata (data type) can be handled with the current literal protobuf implementation. For example, you can have an array whose elements don’t include any data type information, while the dataType field holds the full type definition. The reason for mixing data and metadata here is to achieve a more compact format and save space by setting fewer data type fields (since they can be inferred).

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

cloud-fan · 2025-09-15T06:13:03Z

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

+1. Can we refactor it?

heyihong · 2025-09-15T13:52:14Z

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

+1. Can we refactor it?

Yes, here is the PR: #52342. This change is no longer needed.

[SPARK-53556] Avoid setting redundant struct data types in LiteralVal…

43c53a7

…ueProtoConverter

github-actions bot added SQL CONNECT labels Sep 11, 2025

heyihong changed the title ~~[SPARK-53556] Avoid setting redundant struct data types in LiteralValueProtoConverter~~ [SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter Sep 11, 2025

heyihong closed this Sep 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312

[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312

Uh oh!

heyihong commented Sep 11, 2025 •

edited

Loading

Uh oh!

heyihong commented Sep 11, 2025

Uh oh!

cloud-fan commented Sep 12, 2025

Uh oh!

heyihong commented Sep 12, 2025 •

edited

Loading

Uh oh!

cloud-fan commented Sep 12, 2025 •

edited

Loading

Uh oh!

heyihong commented Sep 12, 2025 •

edited

Loading

Uh oh!

cloud-fan commented Sep 15, 2025

Uh oh!

heyihong commented Sep 15, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312

[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312

Uh oh!

Conversation

heyihong commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

heyihong commented Sep 11, 2025

Uh oh!

cloud-fan commented Sep 12, 2025

Uh oh!

heyihong commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

heyihong commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented Sep 15, 2025

Uh oh!

heyihong commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

heyihong commented Sep 11, 2025 •

edited

Loading

heyihong commented Sep 12, 2025 •

edited

Loading

cloud-fan commented Sep 12, 2025 •

edited

Loading

heyihong commented Sep 12, 2025 •

edited

Loading

heyihong commented Sep 15, 2025 •

edited

Loading