Skip to content

Conversation

@heyihong
Copy link
Contributor

@heyihong heyihong commented Jul 18, 2025

What changes were proposed in this pull request?

This PR adds a new data_type_struct field to the protobuf definition for struct literals in Spark Connect, addressing the ambiguity issues with the existing struct_type field. The changes include:

  1. Protobuf Schema Update: Added a new data_type_struct field of type DataType.Struct to the Literal.Struct message in expressions.proto, while marking the existing struct_type field as deprecated.

  2. Enhanced Struct Conversion Logic: Updated LiteralValueProtoConverter.scala to:

    • Use the new data_type_struct field when available for more precise struct type definition
    • Maintain backward compatibility by still supporting the deprecated struct_type field
    • Add proper field metadata handling in struct conversions
    • Improve type inference for struct fields when data types can be inferred from literal values

Why are the changes needed?

The current Expression.Struct literal is somewhat overcomplicated since it duplicates most of the information its fields already have. This is bulky to send over the wire, and it can be ambiguous.

Does this PR introduce any user-facing change?

No. This PR maintains backward compatibility with existing struct literal implementations. Existing code using the deprecated struct_type field will continue to work without modification.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.2.4

@heyihong heyihong force-pushed the SPARK-52448 branch 3 times, most recently from 5d8e764 to 9d68335 Compare July 18, 2025 20:23
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this improvement. So we will contain the field name in every element, the space complexity would be O(n) compared to O(1) in the deprecated way? So what's the perf gain from this change?

Copy link
Contributor Author

@heyihong heyihong Jul 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the space complexity in the deprecated approach is also O(n), since every element includes the field name in the struct_type field, correct?

Copy link
Contributor Author

@heyihong heyihong Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it is not necessary to include the field name/data type in every element, since not setting the data_type field should be sufficient for saving space.

@heyihong
Copy link
Contributor Author

@heyihong heyihong force-pushed the SPARK-52448 branch 6 times, most recently from 1db4711 to f38223d Compare July 21, 2025 17:29
@HyukjinKwon
Copy link
Member

Merged to master.

//
// Whether data_type_struct.fields.data_type should be set depends on
// whether each field's type can be inferred from the elements field.
DataType.Struct data_type_struct = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heyihong sorry for the late comments, what about introducing a new message for this purpose?
for example, SimplifiedStruct or LightStruct, my feeling is that the conversion in the server side become too complex

Copy link
Contributor Author

@heyihong heyihong Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng FYI, I simplified the implementation in this PR: #52098, but it's not necessary to introduce a new message.

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants