[SPARK-53556][CONNECT] Avoid setting redundant struct data types in LiteralValueProtoConverter #52312
Conversation
I may be missing some context. When is this information redundant or not needed? Do we have some examples?
Yes, for an array of structs, only the first element needs to include the struct type information. The rest do not need to have the type information set.
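To make the cost of this scheme concrete, here is a minimal sketch of what a reader of such an array has to do: take the type from the first struct and apply it to the elements that omit it. The types StructType and StructLit and the function resolveTypes are hypothetical, simplified stand-ins invented for illustration; they are not the Spark Connect protobuf classes.

```scala
object FirstElementTypeSketch {
  // Hypothetical stand-ins for the protobuf messages (assumption, not Spark's API).
  final case class StructType(fieldNames: Seq[String])
  final case class StructLit(values: Seq[Any], dataType: Option[StructType])

  // Reader side: structs without a type inherit the type carried by the first element.
  def resolveTypes(elements: Seq[StructLit]): Seq[StructLit] = elements match {
    case Seq() => elements
    case first +: rest =>
      val tpe = first.dataType.getOrElse(
        throw new IllegalArgumentException("first struct element must carry its data type"))
      first +: rest.map(e => if (e.dataType.isEmpty) e.copy(dataType = Some(tpe)) else e)
  }
}
```

For example, resolving a two-element array in which only the first struct is typed yields two structs with the same type attached.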
A literal should separate the data from the metadata (data type), so an array-of-struct literal should have an array-of-struct value and an array-of-struct data type. This is how the Catalyst Literal is designed. How was the literal protobuf designed?
The original design may have been inspired by https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L993-L1033 and the type information (through
Having a clear separation between the data and metadata (data type) can be handled with the current literal protobuf implementation. For example, you can have an array whose elements don’t include any data type information, while the dataType field holds the full type definition.
The reason for mixing data and metadata here is to achieve a more compact format and save space by setting fewer data type fields (since they can be inferred).
On second thought, the space saved by having fewer data type fields may not be worth the implementation complexity of inferring the type. It may be better to have a dedicated data type field instead.
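The dedicated-data-type design discussed here can be illustrated with a small model in which values carry no type information at all and a single dataType describes the whole literal. Everything below (Value, DataType, Literal, conforms) is a hypothetical sketch for illustration only and does not mirror the actual protobuf schema.

```scala
object SeparatedLiteralSketch {
  // Hypothetical type model (assumption): the metadata side of a literal.
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  final case class StructType(fields: Seq[(String, DataType)]) extends DataType
  final case class ArrayType(elementType: DataType) extends DataType

  // Hypothetical value model (assumption): pure data, no embedded types.
  sealed trait Value
  final case class IntValue(v: Int) extends Value
  final case class StringValue(v: String) extends Value
  final case class StructValue(fields: Seq[Value]) extends Value
  final case class ArrayValue(elements: Seq[Value]) extends Value

  // A literal pairs one value with one full type definition.
  final case class Literal(value: Value, dataType: DataType)

  // Checks that a value has the shape described by the type.
  def conforms(value: Value, dataType: DataType): Boolean = (value, dataType) match {
    case (IntValue(_), IntType) => true
    case (StringValue(_), StringType) => true
    case (StructValue(vs), StructType(fs)) =>
      vs.length == fs.length && vs.zip(fs).forall { case (v, (_, t)) => conforms(v, t) }
    case (ArrayValue(es), ArrayType(t)) => es.forall(conforms(_, t))
    case _ => false
  }

  // An array of two structs; the struct type is stated once, in dataType.
  val example: Literal = Literal(
    ArrayValue(Seq(
      StructValue(Seq(IntValue(1), StringValue("a"))),
      StructValue(Seq(IntValue(2), StringValue("b"))))),
    ArrayType(StructType(Seq("_1" -> IntType, "_2" -> StringType))))
}
```

With this shape no per-element inference is needed: the type is read from one place and the values are decoded against it.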
+1. Can we refactor it?
Yes, here is the PR: #52342. This change is no longer needed.
What changes were proposed in this pull request?
This PR optimizes the LiteralValueProtoConverter to avoid setting redundant struct data types in protobuf messages when they are not needed. The main changes include:
- Modified structBuilder method signature: Added a needDataType: Boolean parameter to control whether struct data type information should be included in the protobuf message.
- Conditional data type struct building: The struct builder now only creates and populates the dataTypeStruct field when needDataType is true, avoiding redundant data type information when it's not required.
- Updated method calls: Modified the call to structBuilder in the main conversion logic to pass the needDataType parameter.
- Added test case: Added a new test case for typedLit with tuple sequences to ensure the optimization works correctly with complex nested structures.
The key optimization is in the structBuilder method, where the dataTypeStruct field is now only populated when needDataType is true, preventing unnecessary serialization of struct metadata.
Why are the changes needed?
The current implementation always sets struct data type information in protobuf messages, even when this information is redundant or not needed.
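As a rough illustration of the writer-side idea, the sketch below populates the struct type only when a needDataType flag is set, and asks only the first element of an array of structs to carry it. The names structBuilder, needDataType, and dataTypeStruct come from the description above, but the signature, the StructMsg and StructType classes, and the arrayOfStructs helper are assumptions made for this example, not the PR's actual code.

```scala
object NeedDataTypeSketch {
  // Hypothetical stand-ins for the protobuf messages (assumption, not Spark's API).
  final case class StructType(fieldNames: Seq[String])
  final case class StructMsg(values: Seq[Any], dataTypeStruct: Option[StructType])

  // Populate the type field only when the caller asks for it.
  def structBuilder(values: Seq[Any], tpe: StructType, needDataType: Boolean): StructMsg =
    StructMsg(values, if (needDataType) Some(tpe) else None)

  // Hypothetical helper: in an array of structs, only the first element carries the type.
  def arrayOfStructs(rows: Seq[Seq[Any]], tpe: StructType): Seq[StructMsg] =
    rows.zipWithIndex.map { case (row, i) => structBuilder(row, tpe, needDataType = i == 0) }
}
```

Calling arrayOfStructs on three rows produces three messages of which exactly one has dataTypeStruct set, which is the space saving the PR was after.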
Does this PR introduce any user-facing change?
No, this PR does not introduce any user-facing changes.
How was this patch tested?
build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.5.11