-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-53578][CONNECT] Simplify data type handling in LiteralValueProtoConverter #52342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -207,6 +207,13 @@ message Expression { | |
| Time time = 26; | ||
| } | ||
|
|
||
| // Data type information for the literal. | ||
| // This field is required only in the root literal message for null values or | ||
| // for data types (e.g., array, map, or struct) with non-trivial information. | ||
| // If the data_type field is not set at the root level, the data type will be | ||
| // inferred or retrieved from the deprecated data type fields using best efforts. | ||
| DataType data_type = 100; | ||
|
|
||
| message Decimal { | ||
| // the string representation. | ||
| string value = 1; | ||
|
|
@@ -230,18 +237,7 @@ message Expression { | |
| DataType element_type = 1 [deprecated = true]; | ||
|
|
||
| // The literal values that make up the array elements. | ||
| // | ||
| // For inferring the data_type.element_type, only the first element needs to | ||
| // contain the type information. | ||
| repeated Literal elements = 2; | ||
|
|
||
| // The type of the array. You don't need to set this field if the type information is not needed. | ||
| // | ||
| // If the element type can be inferred from the first element of the elements field, | ||
| // then you don't need to set data_type.element_type to save space. | ||
| // | ||
| // On the other hand, redundant type information is also acceptable. | ||
| DataType.Array data_type = 3; | ||
|
||
| } | ||
|
|
||
| message Map { | ||
|
|
@@ -257,41 +253,21 @@ message Expression { | |
| DataType value_type = 2 [deprecated = true]; | ||
|
|
||
| // The literal keys that make up the map. | ||
| // | ||
| // For inferring the data_type.key_type, only the first key needs to | ||
| // contain the type information. | ||
| repeated Literal keys = 3; | ||
|
|
||
| // The literal values that make up the map. | ||
| // | ||
| // For inferring the data_type.value_type, only the first value needs to | ||
| // contain the type information. | ||
| repeated Literal values = 4; | ||
|
|
||
| // The type of the map. You don't need to set this field if the type information is not needed. | ||
| // | ||
| // If the key/value types can be inferred from the first element of the keys/values fields, | ||
| // then you don't need to set data_type.key_type/data_type.value_type to save space. | ||
| // | ||
| // On the other hand, redundant type information is also acceptable. | ||
| DataType.Map data_type = 5; | ||
| } | ||
|
|
||
| message Struct { | ||
| // (Deprecated) The type of the struct. | ||
| // | ||
| // This field is deprecated since Spark 4.1+ because using DataType as the type of a struct | ||
| // is ambiguous. Use data_type_struct field instead. | ||
| // is ambiguous. Use data_type field instead. | ||
| DataType struct_type = 1 [deprecated = true]; | ||
|
|
||
| // (Required) The literal values that make up the struct elements. | ||
| // The literal values that make up the struct elements. | ||
| repeated Literal elements = 2; | ||
|
|
||
| // The type of the struct. You don't need to set this field if the type information is not needed. | ||
| // | ||
| // Whether data_type_struct.fields.data_type should be set depends on | ||
| // whether each field's type can be inferred from the elements field. | ||
| DataType.Struct data_type_struct = 3; | ||
| } | ||
|
|
||
| message SpecializedArray { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an optional field? For simple literals, such as byte, int, we can actually infer the data type.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In proto3, fields that are message types are always optional. If the data_type field is not set, we fall back to inferring the data type or using the deprecated data type fields. I will update the comment.
In my opinion, it is better to always set the data_type field, because avoiding data type inference simplifies both the protocol and its implementation across different languages.