[SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional #51473
Conversation
Force-pushed from 62372ff to 2f4c24b
Force-pushed from 9e7737d to 71814d0
Review thread on ...t/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala (outdated, resolved)
Force-pushed from 7e5b478 to da00208
cc @zhengruifeng too
Review thread on sql/connect/common/src/main/protobuf/spark/connect/expressions.proto (outdated, resolved)
cc @WeichenXu123 for the ml side
Force-pushed from c819bcb to c579c1c
friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment; I need to finish SPARK-52930 first.
Force-pushed from f2387fe to e8daef5

Force-pushed from e8daef5 to 7a6bb95
what is the diff here? why did it change the analyzed plan?
@zhengruifeng The diff is caused by an extra test case I added:
fn.typedLit(
  Seq(
    mutable.LinkedHashMap("a" -> Seq("1", "2"), "b" -> Seq("3", "4")),
    mutable.LinkedHashMap("a" -> Seq("5", "6"), "b" -> Seq("7", "8")),
    mutable.LinkedHashMap("a" -> Seq.empty[String], "b" -> Seq.empty[String])))

which shows up in the analyzed plan as:

[keys: [a,b], values: [[1,2],[3,4]],keys: [a,b], values: [[5,6],[7,8]],keys: [a,b], values: [[],[]]] AS ARRAY(MAP('a', ARRAY('1', '2'), 'b', ARRAY('3', '4')), MAP('a', ARRAY('5', '6'), 'b', ARRAY('7', '8')), MAP('a', ARRAY(), 'b', ARRAY()))#0
@heyihong please resolve the conflicts
Force-pushed from e5e59aa to dc6ed7c
…teral.Array optional
Force-pushed from 313835d to e301da3
merged to master
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0:

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

# Remove empty GRPC files
$ cd spark/connect
$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.
$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ral.Map/Array optional

Closes apache#51473 from heyihong/SPARK-52449.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
What changes were proposed in this pull request?
This PR optimizes the `LiteralValueProtoConverter` to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

1. **Optimized type inference for arrays and maps**: Modified the conversion logic to include type information only in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element (see the sketch after this list).
2. **Added `needDataType` parameter**: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.
3. **Updated protobuf documentation**: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.
4. **Improved test coverage**: Added new test cases for complex nested structures, including tuples and maps with array values.
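A minimal, self-contained sketch of the first-element typing scheme described above. This is a simplified stand-in model for illustration, not the real Spark Connect protobuf classes or the actual `LiteralValueProtoConverter` code; `SimpleLiteral`, `encodeArray`, and `decodeArray` are hypothetical names.

```scala
// Hypothetical stand-in model: a literal optionally carries its data type,
// mirroring the now-optional type fields in Expression.Literal.
sealed trait SimpleType
case object IntType extends SimpleType

final case class SimpleLiteral(value: Int, dataType: Option[SimpleType])

object FirstElementTyping {
  // Encoder: set needDataType = true only for the first element, so later
  // elements omit the redundant type information.
  def encodeArray(values: Seq[Int]): Seq[SimpleLiteral] =
    values.zipWithIndex.map { case (v, i) =>
      encodeLiteral(v, needDataType = i == 0)
    }

  private def encodeLiteral(v: Int, needDataType: Boolean): SimpleLiteral =
    SimpleLiteral(v, if (needDataType) Some(IntType) else None)

  // Decoder: infer the element type from the first element and reuse it for
  // the rest; an empty array still needs an explicit element type.
  def decodeArray(elements: Seq[SimpleLiteral]): Seq[(Int, SimpleType)] = {
    val elementType = elements.headOption.flatMap(_.dataType)
      .getOrElse(sys.error("empty arrays must carry an explicit element type"))
    elements.map(e => (e.value, e.dataType.getOrElse(elementType)))
  }

  def main(args: Array[String]): Unit = {
    val encoded = encodeArray(Seq(1, 2, 3))
    println(encoded)              // only the first element carries Some(IntType)
    println(decodeArray(encoded)) // every element decoded as IntType
  }
}
```

The same idea applies to maps, where only the first key-value pair needs to carry type information.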
Why are the changes needed?
The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.
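As a rough back-of-the-envelope illustration (assumed payload counts, not measured wire sizes), consider an array literal of n map elements with m entries each; the counting below is a hypothetical model, not taken from the PR:

```scala
// Count literals carrying explicit type information, assuming one type
// payload per typed element and two per typed key-value pair.
def typedNodesOld(n: Int, m: Int): Int = n * (1 + 2 * m) // every map and every pair
def typedNodesNew(n: Int, m: Int): Int = 1 + 2           // first map and its first pair

println(typedNodesOld(1000, 10)) // 21000 type payloads
println(typedNodesNew(1000, 10)) // 3 type payloads
```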
Does this PR introduce any user-facing change?
**No** - This PR does not introduce any user-facing changes.
The change is backward compatible and existing Spark Connect clients will continue to work unchanged.
How was this patch tested?
`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.4.5