[SPARK-52930][CONNECT] Use DataType.Array/Map for Array/Map Literals #51653
Conversation
I personally feel it might be better to introduce new messages for this purpose, so that we can minimize the code changes and keep the conversion logic on the server side clearer.
@zhengruifeng FYI, I have simplified the implementation a bit. It is possible to minimize the code changes and make the conversion logic on the server side clearer without introducing new messages.
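The point about keeping the server-side conversion logic clear without introducing new messages can be sketched as follows. This is an illustrative Python model, not the actual generated protobuf classes; all class and field names here are assumptions. The idea is that the server prefers the new unified type field and falls back to the deprecated per-field types, so both old and new clients interoperate:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the proto messages; names are assumptions,
# not the real Spark Connect generated classes.
@dataclass
class ArrayTypeInfo:
    element_type: str
    contains_null: bool

@dataclass
class ArrayLiteral:
    # Old clients set only element_type; new clients set data_type.
    element_type: Optional[str] = None          # deprecated field
    data_type: Optional[ArrayTypeInfo] = None   # new unified field

def resolve_array_type(lit: ArrayLiteral) -> ArrayTypeInfo:
    """Prefer the new unified field, fall back to the deprecated one."""
    if lit.data_type is not None:
        return lit.data_type
    if lit.element_type is not None:
        # Deprecated path: nullability is not carried in the message,
        # so a permissive default has to be assumed.
        return ArrayTypeInfo(lit.element_type, contains_null=True)
    raise ValueError("array literal carries no type information")

# New-style request: nullability travels with the type.
new_style = ArrayLiteral(data_type=ArrayTypeInfo("int", contains_null=False))
# Old-style request: only the element type is present.
old_style = ArrayLiteral(element_type="int")

print(resolve_array_type(new_style).contains_null)  # False
print(resolve_array_type(old_style).contains_null)  # True
```

A single resolution function like this keeps the compatibility handling in one place on the server, which is one way to read the "clearer conversion logic without new messages" argument.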
merged to master
What changes were proposed in this pull request?
This PR introduces a transition to use `DataType.Array` and `DataType.Map` for array and map literals throughout the Spark Connect codebase.

While the Spark Connect server supports both the new and the old data type fields, this change sets the new data type fields only in `ColumnNodeToProtoConverter` for the Spark Connect Scala client. All other components (e.g., ML, Python) still use the old data type fields, because literal values are used not only in requests but also in responses: clients on older versions may not recognize the new fields in a response, which makes compatibility difficult to maintain. Deprecation and the transition to the new fields therefore require a gradual migration.

The key changes include:
Protocol Buffer Updates:
- Modified `expressions.proto` to add new `data_type` fields for the `Array` and `Map` messages
- Deprecated the `element_type`, `key_type`, and `value_type` fields in favor of the unified `data_type` approach
- Regenerated the Python protobuf files (`expressions_pb2.py`, `expressions_pb2.pyi`) to reflect these changes

Core Implementation Changes:
- Updated `LiteralValueProtoConverter.scala` with a new internal method `toLiteralProtoBuilderInternal` that accepts `ToLiteralProtoOptions`
- Updated `LiteralExpressionProtoConverter.scala` to support inference of array and map data types
- Updated `columnNodeSupport.scala` to use the new `toLiteralProtoBuilderWithOptions` method with `useDeprecatedDataTypeFields` set to `false`

Why are the changes needed?
The changes are needed to improve Spark's data type handling for array and map literals:
- Nullability of Array/Map literals is now included in `DataType.Array`/`DataType.Map`: nullability information is captured and handled within the data type structure itself.
- Type inference works better when all type information is in one field: consolidating it into a single field makes it easier to infer data types for complex data structures.
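To see why a single field that carries nullability helps inference, consider a nested array literal: with the unified representation, the full type, including nullability at each level, can be produced in one recursive pass. A minimal Python sketch of this idea (the types and function names are illustrative, not Spark's actual API):

```python
from dataclasses import dataclass
from typing import Union

# Illustrative unified array type: nullability lives inside the type itself,
# mirroring the role of DataType.Array's containsNull.
@dataclass
class ArrayType:
    element_type: Union[str, "ArrayType"]
    contains_null: bool

def infer_type(value) -> Union[str, ArrayType]:
    """Infer a type for a (possibly nested) list literal in one pass.
    Because nullability is part of ArrayType, no side channel is needed."""
    if isinstance(value, list):
        non_null = [v for v in value if v is not None]
        elem = infer_type(non_null[0]) if non_null else "null"
        return ArrayType(elem, contains_null=any(v is None for v in value))
    return type(value).__name__

t = infer_type([[1, None], [2, 3]])
print(t.contains_null)               # False: no inner array is itself null
print(t.element_type.contains_null)  # True: inner arrays may hold nulls
```

With separate `element_type` and nullability fields, the same information would have to be threaded through the recursion out of band; bundling it into one type value keeps the inference local.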
Does this PR introduce any user-facing change?
Yes. Previously, the nullability of arrays and map values created with `typedlit` was not preserved (which I believe was a bug); it is now preserved. Please see the changes in `ClientE2ETestSuite` for details.
How was this patch tested?
```
build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite -- -z SPARK-52930"
```

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.4.5