[SPARK-49443][SQL][PYTHON] Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for Variant Objects #47907
I haven't generated the golden files yet as I don't remember which commands to run. I'll figure it out based on test failures.
gene-db
left a comment
@harshmotw-db Thanks for this functionality! I left a few comments.
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
Outdated
// scalastyle:off line.size.limit
@ExpressionDescription(
  usage = "_FUNC_(expr) - Convert a nested input (array/map/struct) into a variant where maps and structs are converted to variant objects which are unordered unlike SQL structs. Input maps can only have string keys.",
How do we define the element order in the resulting variant object? Random?
Yes and no. From a logical perspective, the keys should be thought of as random and the users should not assume anything about the order.
However, in the spec, the field IDs are sorted based on the lexicographic order of the keys. This is to make it possible to binary search for the required key.
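To illustrate why sorted keys help, here is a minimal Python sketch, not the actual variant binary encoding. The names `build_object` and `lookup` are hypothetical, and Python's default string ordering is assumed to stand in for the spec's key comparison. It only shows that keeping keys in sorted order allows a binary search, while callers still treat the object as unordered.

```python
from bisect import bisect_left


def build_object(fields):
    """Store keys in sorted order next to their values (illustrative layout only)."""
    keys = sorted(fields)                 # lexicographic order of the keys
    values = [fields[k] for k in keys]    # values aligned with the sorted keys
    return keys, values


def lookup(obj, key):
    """Binary-search the sorted keys; return None when the key is absent."""
    keys, values = obj
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


# Insertion order ("b", "a", "c") is not preserved; only key-based lookup matters.
obj = build_object({"b": 2, "a": 1, "c": 3})
print(lookup(obj, "a"))  # 1
print(lookup(obj, "z"))  # None
```

The sorted layout is an internal detail here: a caller of `lookup` never observes or relies on the position of a key.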
Note to reviewers: There is currently a bug when using |
This issue has been fixed.
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
Outdated
@HyukjinKwon Can you go over the Python changes in this PR?
gene-db
left a comment
@harshmotw-db Thanks! I left a few minor comments.
LGTM
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
@cloud-fan Can you go over this PR again whenever you're available?
@cloud-fan There may be an issue with the Scala linter test. It was passing earlier and is failing after a very minor commit. It says there are lint failures in the |
/**
 * Converts a column containing nested inputs (array/map/struct) into a variants where maps and
 * structs are converted to variant objects which are unordered unlike SQL structs. Input maps can
 * only have string keys.
If the input has no array/map/struct, is this function a no-op?
nvm, the type check will fail for it.
The lint failure is unrelated, thanks, merging to master!
MaxGekk
left a comment
@cloud-fan Here is the fix for formatting: #48060
What changes were proposed in this pull request?
This PR prohibits casts from data types containing structs or maps to variant and introduces a new expression, to_variant_object, which allows converting nested types to variants and retains the old functionality. This PR also changes the behavior of the schema_of_variant and schema_of_variant_agg expressions, which now print OBJECT instead of STRUCT (STRUCT is not technically correct).
Why are the changes needed?
Cast from structs to variant objects should not be legal since variant objects are unordered bags of key-value pairs while structs are ordered sets of elements of fixed types. Therefore, casts between structs and variant objects do not behave like casts between structs. Example (produced by Serge Rielau):
Casts from maps to variant objects should also not be legal since they represent completely orthogonal data types. Maps can represent a variable number of key-value pairs based on just a key type and a value type in the schema, but for objects, the schema (produced by schema_of_variant expressions) has a type corresponding to each value in the object. Objects can have values of different types while maps cannot, and objects can only have string keys while maps can also have complex keys.
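The map/object distinction can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: `type_name`, `map_schema`, and `object_schema` are hypothetical helpers, and the printed type strings only imitate the general shape of a schema string. A map schema needs one value type for all entries, while an object schema records a type per field.

```python
def type_name(value):
    """Map a Python value to an illustrative SQL-like type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, str):
        return "STRING"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def map_schema(entries):
    """A map schema has a single value type shared by every entry."""
    value_types = {type_name(v) for v in entries.values()}
    if len(value_types) != 1:
        raise TypeError("a map needs exactly one value type")
    return f"MAP<STRING, {value_types.pop()}>"


def object_schema(entries):
    """An object schema lists one type per field, so value types may differ."""
    fields = ", ".join(f"{k}: {type_name(v)}" for k, v in sorted(entries.items()))
    return f"OBJECT<{fields}>"


print(map_schema({"a": 1, "b": 2}))       # MAP<STRING, BIGINT>
print(object_schema({"b": "x", "a": 1}))  # OBJECT<a: BIGINT, b: STRING>
```

Calling `map_schema({"a": 1, "b": "x"})` raises `TypeError` in this sketch, which mirrors the point that maps cannot hold values of different types while objects can.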
We should therefore prohibit the existing behavior of allowing explicit casts from structs and maps to variants as the variant spec currently only supports an object type which is remotely compatible with structs and maps. We introduce a new expression that converts schemas containing structs and maps to variants (where these types are converted to objects). We will call it to_variant_object.
Does this PR introduce any user-facing change?
Yes, it introduces the to_variant_object expression and changes the behavior of the schema_of_variant/schema_of_variant_agg expressions.
How was this patch tested?
Several unit tests with codegen enabled/disabled.
Was this patch authored or co-authored using generative AI tooling?
Yes.
Generated-by: GitHub Copilot, perplexity.ai