[SPARK-53578][CONNECT] Simplify data type handling in LiteralValueProtoConverter #52342
Conversation
Is this an optional field? For simple literals, such as byte and int, we can actually infer the data type.
In proto3, fields that are message types are always optional. If the `data_type` field is not set, we fall back to inferring the data type or using the deprecated data type fields. I will update the comment.
In my opinion, it is better to always set the `data_type` field, because avoiding data type inference simplifies both the protocol and its implementation across different languages.
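To make the fallback concrete, here is a minimal reader-side sketch, assuming the standard proto3 generated accessors (`hasDataType`/`getDataType`) for the new field and a hypothetical `inferDataType` helper; it is illustrative, not the actual merged converter code:

```scala
import org.apache.spark.connect.proto
import org.apache.spark.sql.connect.common.DataTypeProtoConverter
import org.apache.spark.sql.types.{ByteType, DataType, IntegerType}

object LiteralTypeResolution {
  // Prefer the explicit top-level data_type; fall back to inference otherwise.
  def resolveDataType(literal: proto.Expression.Literal): DataType = {
    if (literal.hasDataType) {
      // The client set the field explicitly, so no inference is needed.
      DataTypeProtoConverter.toCatalystType(literal.getDataType)
    } else {
      // Old clients (or trivial literals) may omit it: infer from the value.
      inferDataType(literal)
    }
  }

  // Hypothetical fallback, sketched for two trivial cases only.
  def inferDataType(literal: proto.Expression.Literal): DataType =
    literal.getLiteralTypeCase match {
      case proto.Expression.Literal.LiteralTypeCase.BYTE    => ByteType
      case proto.Expression.Literal.LiteralTypeCase.INTEGER => IntegerType
      case other =>
        throw new UnsupportedOperationException(s"Cannot infer type for $other")
    }
}
```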
Is this compatible with the old client versions?
Yes, the old client versions don't use these removed fields. Also, PySpark Connect still uses the old client version, and it works.
Different from your previous PRs, it seems this PR makes the proto more complex?
The implementation logic was simplified (262 lines added and 464 lines removed) in `sql/connect/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala`, as it no longer infers data types. In addition, the protobuf message schema was simplified (1 field added and 3 fields removed).
The proto message size may increase a little bit, though.
To be more efficient, shall we skip the data type field for trivial type literals at the root level? Then we can keep the message unchanged for common cases.
Makes sense. To minimize the changes, I will only set the `data_type` field for the list below for now:
- NULL values
- ARRAY type
- STRUCT type
- MAP type
We can extend the list later; see the sketch of this gating logic below.
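A minimal sketch of that gating, under the assumption of a hypothetical helper name; the merged change may express this differently:

```scala
import org.apache.spark.sql.types._

object DataTypeGating {
  // Attach the top-level data_type only when the type cannot be cheaply and
  // unambiguously inferred from the literal value itself.
  def needsExplicitDataType(value: Any, dataType: DataType): Boolean =
    value == null || (dataType match {
      case _: ArrayType | _: StructType | _: MapType => true
      case _ => false // trivial literals (byte, int, ...) keep the old shape
    })
}
```

This keeps the wire format unchanged for the common trivial-literal cases while still avoiding inference where it is genuinely ambiguous.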
thanks, merging to master!
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0:

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`:

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.
$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
[SPARK-53578][CONNECT] Simplify data type handling in LiteralValueProtoConverter

### What changes were proposed in this pull request?

This PR simplifies data type handling in the Spark Connect `LiteralValueProtoConverter` by consolidating type information into a single `data_type` field at the root level of the `Literal` message, rather than having separate type fields within nested structures.

**Key changes:**

1. **Protobuf Schema Simplification:**
   - Added a new `data_type` field (field 100) to the root `Expression.Literal` message
   - Removed redundant type fields from nested messages (`Array.data_type`, `Map.data_type`, `Struct.data_type_struct`)
2. **Array Type Handling Enhancement:**
   - Added special handling for `ByteType` arrays to convert them to `Binary` type in the protobuf representation
   - This addresses a specific edge case where byte arrays should be represented as binary data

### Why are the changes needed?

The current data type handling in Spark Connect has several issues:

1. **Redundancy and Complexity:** Type information is scattered across multiple fields in nested messages, making the protobuf schema unnecessarily complex and error-prone.
2. **Limited Extensibility:** Without this `data_type` field, it is difficult to add type information for literal types. For example, it's challenging to include detailed type metadata for types like `String` (with collation information), `YearMonthInterval`, `DayTimeInterval`, and other types that may require additional type-specific attributes.

### Does this PR introduce _any_ user-facing change?

**No.** This is an internal refactoring of the Spark Connect protobuf schema and converter logic.

### How was this patch tested?

`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`
`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"`
`build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

Closes apache#52342 from heyihong/SPARK-53578.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
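As a hedged illustration of the `ByteType`-array special case above: an `ArrayType(ByteType)` literal is carried as a `Binary` literal rather than an element-by-element `Array`. The builder calls follow standard protobuf-java conventions for the `spark.connect.Expression.Literal` message; treat the exact shape as an assumption, not the merged code:

```scala
import com.google.protobuf.ByteString
import org.apache.spark.connect.proto

object ByteArrayLiteral {
  // Encode a Scala Array[Byte] as a proto Binary literal instead of a
  // proto Array of Byte elements (the edge case this PR special-cases).
  def toProtoLiteral(bytes: Array[Byte]): proto.Expression.Literal =
    proto.Expression.Literal
      .newBuilder()
      .setBinary(ByteString.copyFrom(bytes))
      .build()
}
```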