Skip to content

Conversation

@heyihong
Copy link
Contributor

@heyihong heyihong commented Jul 14, 2025

What changes were proposed in this pull request?

This PR optimizes the LiteralValueProtoConverter to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

  1. Optimized type inference for arrays and maps: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.

  2. Added needDataType parameter: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.

  3. Updated protobuf documentation: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.

  4. Improved test coverage: Added new test cases for complex nested structures including tuples and maps with array values.

Why are the changes needed?

The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.

Does this PR introduce any user-facing change?

No - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

@heyihong heyihong changed the title [SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional Jul 14, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 62372ff to 2f4c24b Compare July 16, 2025 14:15
@heyihong heyihong changed the title [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional Jul 16, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 9e7737d to 71814d0 Compare July 16, 2025 15:28
@heyihong
Copy link
Contributor Author

@heyihong heyihong requested a review from hvanhovell July 16, 2025 19:00
@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from 7e5b478 to da00208 Compare July 16, 2025 19:57
@HyukjinKwon
Copy link
Member

cc @zhengruifeng too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @WeichenXu123 for the ml side

@heyihong heyihong changed the title [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional Jul 17, 2025
@heyihong heyihong requested a review from zhengruifeng July 17, 2025 13:41
@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from c819bcb to c579c1c Compare July 21, 2025 13:56
@heyihong
Copy link
Contributor Author

heyihong commented Jul 21, 2025

friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment — I need to finish SPARK-52930 first.

@heyihong
Copy link
Contributor Author

heyihong commented Sep 1, 2025

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the diff here? why it changed the analyzed plan?

Copy link
Contributor Author

@heyihong heyihong Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng The diff is caused by an extra test case I added:

fn.typedLit(
  Seq(
      mutable.LinkedHashMap("a" -> Seq("1", "2"), "b" -> Seq("3", "4")),
      mutable.LinkedHashMap("a" -> Seq("5", "6"), "b" -> Seq("7", "8")),
      mutable.LinkedHashMap("a" -> Seq.empty[String], "b" -> Seq.empty[String])))
[keys: [a,b], values: [[1,2],[3,4]],keys: [a,b], values: [[5,6],[7,8]],keys: [a,b], values: [[],[]]] AS ARRAY(MAP('a', ARRAY('1', '2'), 'b', ARRAY('3', '4')), MAP('a', ARRAY('5', '6'), 'b', ARRAY('7', '8')), MAP('a', ARRAY(), 'b', ARRAY()))#0

@zhengruifeng
Copy link
Contributor

@heyihong please resolve the conflicts

@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from e5e59aa to dc6ed7c Compare September 8, 2025 10:38
@zhengruifeng
Copy link
Contributor

merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…ral.Map/Array optional

### What changes were proposed in this pull request?

This PR optimizes the `LiteralValueProtoConverter` to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

1. **Optimized type inference for arrays and maps**: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.

2. **Added `needDataType` parameter**: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.

3. **Updated protobuf documentation**: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.

4. **Improved test coverage**: Added new test cases for complex nested structures including tuples and maps with array values.

### Why are the changes needed?

The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.

### Does this PR introduce any user-facing change?

**No** - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.

### How was this patch tested?

`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

Closes apache#51473 from heyihong/SPARK-52449.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants