Skip to content

Conversation

@heyihong
Copy link
Contributor

@heyihong heyihong commented Sep 11, 2025

What changes were proposed in this pull request?

This PR optimizes the LiteralValueProtoConverter to avoid setting redundant struct data types in protobuf messages when they are not needed. The main changes include:

  1. Modified structBuilder method signature: Added a needDataType: Boolean parameter to control whether struct data type information should be included in the protobuf message.

  2. Conditional data type struct building: The struct builder now only creates and populates the dataTypeStruct field when needDataType is true, avoiding redundant data type information when it's not required.

  3. Updated method calls: Modified the call to structBuilder in the main conversion logic to pass the needDataType parameter.

  4. Added test case: Added a new test case for typedLit with tuple sequences to ensure the optimization works correctly with complex nested structures.

The key optimization is in the structBuilder method where the dataTypeStruct field is now only populated when needDataType is true, preventing unnecessary serialization of struct metadata.

Why are the changes needed?

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

Does this PR introduce any user-facing change?

No, this PR does not introduce any user-facing changes.

How was this patch tested?

build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

@heyihong
Copy link
Contributor Author

@heyihong heyihong changed the title [SPARK-53556] Avoid setting redundant struct data types in LiteralValueProtoConverter [SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter Sep 11, 2025
@cloud-fan
Copy link
Contributor

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

I may miss some context. When is this information redundant or not needed? Do we have some examples?

@heyihong
Copy link
Contributor Author

heyihong commented Sep 12, 2025

The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.

I may miss some context. When is this information redundant or not needed? Do we have some examples?

Yes, for an array of structs, only the first element needs to include the struct type information. The rest do not need to have the type information set.

@cloud-fan
Copy link
Contributor

cloud-fan commented Sep 12, 2025

Literal should separate the data and metadata (data type), so array of struct literal should have the array of struct value and the array of struct data type. This is how the catalyst Litetal is designed. ArrayData represents the data of an array and it does not contain the data type.

How was the literal protobuf designed?

@heyihong
Copy link
Contributor Author

heyihong commented Sep 12, 2025

Literal should separate the data and metadata (data type), so array of struct literal should have the array of struct value and the array of struct data type. This is how the catalyst Litetal is designed. ArrayData represents the data of an array and it does not contain the data type.

How was the literal protobuf designed?

The original design may be inspired by https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L993-L1033 and the type information (through type_variation_reference) is included in the literal message if I understand correctly.

Having a clear separation between the data and metadata (data type) can be handled with the current literal protobuf implementation. For example, you can have an array whose elements don’t include any data type information, while the dataType field holds the full type definition. The reason for mixing data and metadata here is to achieve a more compact format and save space by setting fewer data type fields (since they can be inferred).

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

@cloud-fan
Copy link
Contributor

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

+1. Can we refactor it?

@heyihong
Copy link
Contributor Author

heyihong commented Sep 15, 2025

On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.

+1. Can we refactor it?

Yes, here is the PR: #52342. This change is no longer needed.

@heyihong heyihong closed this Sep 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants