
Conversation

heyihong (Contributor) commented Sep 15, 2025:

What changes were proposed in this pull request?

This PR simplifies data type handling in the Spark Connect LiteralValueProtoConverter by consolidating type information into a single data_type field at the root level of the Literal message, rather than having separate type fields within nested structures.

Key changes:

  1. Protobuf Schema Simplification:

    • Added a new data_type field (field 100) to the root Expression.Literal message
    • Removed redundant type fields from nested messages (Array.data_type, Map.data_type, Struct.data_type_struct); a short construction sketch follows this list
  2. Array Type Handling Enhancement:

    • Added special handling for ByteType arrays to convert them to Binary type in the protobuf representation
    • This addresses a specific edge case where byte arrays should be represented as binary data
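
As an illustration of both points above, here is a minimal Scala sketch of building an ARRAY<INT> literal with the consolidated field. It assumes the generated protobuf builders in org.apache.spark.connect.proto (names such as setDataType, addElements, and setElementType follow the usual protobuf-java conventions and may differ slightly from the actual generated code):

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Sketch: an ARRAY<INT> literal [1, 2]. The element type is carried only by
// the root-level data_type field (field 100), not by a nested Array.data_type.
val intType = DataType.newBuilder()
  .setInteger(DataType.Integer.newBuilder())
  .build()

val arrayLiteral = Expression.Literal.newBuilder()
  .setArray(
    Expression.Literal.Array.newBuilder()
      .addElements(Expression.Literal.newBuilder().setInteger(1))
      .addElements(Expression.Literal.newBuilder().setInteger(2)))
  .setDataType(
    DataType.newBuilder().setArray(
      DataType.Array.newBuilder()
        .setElementType(intType)
        .setContainsNull(false)))
  .build()

// An Array[Byte], by contrast, would be emitted as a binary literal
// (the Literal.binary field) rather than as an ARRAY of byte literals.
```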

Why are the changes needed?

The current data type handling in Spark Connect has several issues:

  1. Redundancy and Complexity: Type information is scattered across multiple fields in nested messages, making the protobuf schema unnecessarily complex and error-prone.

  2. Limited Extensibility: Without a single root-level data_type field, it is difficult to attach complete type information to literals. For example, it's challenging to include detailed type metadata for types like String (with collation information), YearMonthInterval, DayTimeInterval, and other types that may require additional type-specific attributes (see the sketch below).
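
For a concrete (hypothetical) example of the extensibility point, consider a collated string literal: the string value alone cannot express its collation, but a root-level data_type can carry it. The sketch below assumes the DataType.String message exposes a collation field and uses the usual generated builder names:

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Sketch: a string literal whose type carries collation information,
// which the bare string value alone cannot express.
val collatedString = Expression.Literal.newBuilder()
  .setString("hello")
  .setDataType(
    DataType.newBuilder().setString(
      DataType.String.newBuilder().setCollation("UNICODE_CI")))
  .build()
```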

Does this PR introduce any user-facing change?

No. This is an internal refactoring of the Spark Connect protobuf schema and converter logic.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

Contributor commented:

Is this an optional field? For simple literals such as byte and int, we can actually infer the data type.

heyihong (Contributor, Author) replied Sep 17, 2025:

In proto3, fields that are message types are always optional. If the data_type field is not set, we fall back to inferring the data type or using the deprecated data type fields. I will update the comment.

In my opinion, it is better to always set the data_type field, because avoiding data type inference simplifies both the protocol and its implementation across different languages.
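
A rough Scala sketch of the fallback just described (illustrative only, not the actual LiteralValueProtoConverter code; the inference step is passed in as a stand-in function):

```scala
import org.apache.spark.connect.proto.{DataType, Expression}

// Illustrative fallback: prefer the explicit root-level data_type,
// otherwise fall back to inference over the literal itself.
def resolveDataType(
    lit: Expression.Literal,
    infer: Expression.Literal => DataType): DataType = {
  if (lit.hasDataType) lit.getDataType // explicitly set by the client
  else infer(lit)                      // e.g. derived from the literal kind
}
```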

Contributor commented:

Is this compatible with the old client versions?

heyihong (Contributor, Author) replied Sep 17, 2025:

Yes, the old client versions don't use these removed fields. Also, PySpark Connect still uses the old client version, and it works.

Contributor commented:

Different from your previous PRs, it seems this PR makes the proto more complex?

heyihong (Contributor, Author) replied Sep 17, 2025:

The implementation logic was simplified (262 lines added and 464 lines removed) in sql/connect/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala, as it no longer infers data types. In addition, the protobuf message schema was simplified (1 field added and 3 fields removed).

The proto message size may increase a little bit though.

Contributor commented:

To be more efficient, shall we skip the data type field for trivial type literals at the root level? Then we can keep the message unchanged for common cases.

heyihong (Contributor, Author) replied Sep 18, 2025:

Makes sense. To minimize the changes, I will only set the data_type field for the types listed below for now:

  • NULL values
  • ARRAY type
  • STRUCT type
  • MAP type

We can extend the list later.
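
A small Scala sketch of that policy using Spark's Catalyst type classes (names are illustrative, not the actual converter code):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Illustrative policy: only NULL, ARRAY, STRUCT, and MAP literals get an
// explicit root-level data_type; trivial types stay unchanged for now.
def shouldSetRootDataType(dataType: DataType, isNull: Boolean): Boolean =
  isNull || (dataType match {
    case _: ArrayType | _: MapType | _: StructType => true
    case _ => false
  })
```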

heyihong force-pushed the SPARK-53578 branch 6 times, most recently from b7fe5a7 to 2d788e6 on September 18, 2025 16:28
cloud-fan (Contributor) commented:

thanks, merging to master!

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

This allows us to use the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
(the commit message mirrors the PR description above)
Closes apache#52342 from heyihong/SPARK-53578.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>