@harshmotw-db
Contributor

@harshmotw-db harshmotw-db commented Aug 28, 2024

What changes were proposed in this pull request?

This PR prohibits casts to variant from data types containing structs or maps, and introduces a new expression, to_variant_object, which converts nested types to variants and so retains the old functionality. It also changes the schema_of_variant and schema_of_variant_agg expressions to print OBJECT instead of STRUCT, since STRUCT is not technically correct for variant objects.

Why are the changes needed?

Cast from structs to variant objects should not be legal since variant objects are unordered bags of key-value pairs while structs are ordered sets of elements of fixed types. Therefore, casts between structs and variant objects do not behave like casts between structs. Example (produced by Serge Rielau):

scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2') as struct<b int, c int>)").show()
+------------------------+
|named_struct(c, 1, b, 2)|
+------------------------+
|                  {1, 2}|
+------------------------+

Passing a struct into VARIANT loses the position:
scala> spark.sql("SELECT cast(named_struct('c', 1, 'b', '2')::variant as struct<b int, c int>)").show()
+-----------------------------------------+
|CAST(named_struct(c, 1, b, 2) AS VARIANT)|
+-----------------------------------------+
|                                   {2, 1}|
+-----------------------------------------+
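
The difference between the two outputs above can be sketched with a toy model in plain Python. This is only an illustration, not Spark code: the helper names (`cast_struct_to_struct`, `struct_to_variant_object`, `cast_variant_object_to_struct`) are hypothetical, and the sketch assumes that a struct-to-struct cast matches fields by position while a variant-object-to-struct cast looks fields up by name, which is consistent with the two results shown.

```python
# Toy model only: plain Python stand-ins, not Spark APIs.

def cast_struct_to_struct(fields, target_names):
    """Struct-to-struct cast: values are matched by position, names are replaced."""
    values = [int(value) for _, value in fields]
    return list(zip(target_names, values))


def struct_to_variant_object(fields):
    """A variant object modeled as an unordered bag of key-value pairs."""
    return dict(fields)


def cast_variant_object_to_struct(obj, target_names):
    """Variant-object-to-struct cast: values are looked up by field name."""
    return [(name, int(obj[name])) for name in target_names]


source = [("c", 1), ("b", "2")]  # named_struct('c', 1, 'b', '2')
target = ["b", "c"]              # struct<b int, c int>

direct = cast_struct_to_struct(source, target)
via_variant = cast_variant_object_to_struct(struct_to_variant_object(source), target)

print([value for _, value in direct])       # [1, 2]
print([value for _, value in via_variant])  # [2, 1]
```

In the direct cast the value 1 stays in the first slot even though its name changes from c to b; once the struct has passed through the object form, only the names survive, so the value 2 (originally named b) lands in the first slot.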

Casts from maps to variant objects should also not be legal, since the two are fundamentally different data types. A map can hold a variable number of key-value pairs described by a single key type and a single value type in the schema, whereas the schema of an object (as produced by the schema_of_variant expressions) has a type for each value in the object. Objects can hold values of different types while maps cannot, and objects can only have string keys while maps can also have complex keys.
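
To make the contrast concrete, here is a small plain-Python sketch of how the two kinds of schema differ. It is not Spark code: `map_schema`, `object_schema`, and the `TYPE_NAMES` table are hypothetical helpers, and the printed type names are only meant to resemble the style of schema strings, not to reproduce Spark's exact output.

```python
# Toy model only: illustrative type names, not Spark's schema inference.
TYPE_NAMES = {int: "BIGINT", str: "STRING", bool: "BOOLEAN"}


def map_schema(m):
    """A map schema is one key type plus one value type, however many entries there are."""
    key_types = {type(k) for k in m}
    value_types = {type(v) for v in m.values()}
    if len(key_types) != 1 or len(value_types) != 1:
        raise TypeError("a map needs a single key type and a single value type")
    (key_type,) = key_types
    (value_type,) = value_types
    return f"MAP<{TYPE_NAMES[key_type]}, {TYPE_NAMES[value_type]}>"


def object_schema(o):
    """An object schema lists a type for every field; keys must be strings."""
    if not all(isinstance(k, str) for k in o):
        raise TypeError("object keys must be strings")
    fields = ", ".join(f"{k}: {TYPE_NAMES[type(v)]}" for k, v in sorted(o.items()))
    return f"OBJECT<{fields}>"


print(map_schema({"a": 1, "b": 2}))       # MAP<STRING, BIGINT>
print(object_schema({"b": "2", "a": 1}))  # OBJECT<a: BIGINT, b: STRING>
```

Passing the mixed-type dictionary `{"a": 1, "b": "2"}` to `map_schema` raises a `TypeError` in this sketch, while `object_schema` accepts it, which mirrors the point that objects may mix value types and maps may not.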

We should therefore prohibit the existing behavior of allowing explicit casts from structs and maps to variants as the variant spec currently only supports an object type which is remotely compatible with structs and maps. We introduce a new expression that converts schemas containing structs and maps to variants (where these types are converted to objects). We will call it to_variant_object.

Does this PR introduce any user-facing change?

Yes, it introduces the to_variant_object expression and changes the behavior of the schema_of_variant/schema_of_variant_agg expressions.

How was this patch tested?

Several unit tests with codegen enabled/disabled.

Was this patch authored or co-authored using generative AI tooling?

Yes.
Generated-by: GitHub Copilot, perplexity.ai

@harshmotw-db harshmotw-db changed the title from "[SPARK-49443] Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for for Variant Objects" to "[SPARK-49443][SQL][PYTHON] Implement to_variant_object expression and make schema_of_variant expressions print OBJECT for for Variant Objects" on Aug 28, 2024
@harshmotw-db
Contributor Author

I haven't generated the golden files yet as I don't remember which commands to run. I'll figure it out based on test failures.

Contributor

@gene-db gene-db left a comment

@harshmotw-db Thanks for this functionality! I left a few comments.


// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(expr) - Convert a nested input (array/map/struct) into a variant where maps and structs are converted to variant objects which are unordered unlike SQL structs. Input maps can only have string keys.",
Contributor

How do we define the element order in the resulting variant object? Random?

Contributor Author

Yes and no. From a logical perspective, the key order should be treated as arbitrary, and users should not assume anything about it.
However, in the spec, the field IDs are sorted by the lexicographic order of their keys, which makes it possible to binary search for a required key.
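
As a rough illustration of why sorted keys help, the following plain-Python sketch stores an object's fields in key order and finds a field with a binary search. It is a toy model under stated assumptions: `build_object` and `get_field` are hypothetical names, and the real variant binary layout (metadata dictionary, field IDs, offsets) is not modeled here.

```python
from bisect import bisect_left


def build_object(pairs):
    """Store fields sorted by key, regardless of the order they were supplied in."""
    items = sorted(pairs.items())
    keys = [key for key, _ in items]
    values = [value for _, value in items]
    return keys, values


def get_field(obj, key):
    """Binary search the sorted keys; return None when the key is absent."""
    keys, values = obj
    index = bisect_left(keys, key)
    if index < len(keys) and keys[index] == key:
        return values[index]
    return None


obj = build_object({"c": 1, "b": "2", "a": True})
print(obj[0])               # ['a', 'b', 'c']
print(get_field(obj, "b"))  # 2
print(get_field(obj, "z"))  # None
```

The stored order here is a lookup optimization rather than something callers should depend on, which matches the "yes and no" answer above.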

@harshmotw-db
Contributor Author

harshmotw-db commented Aug 29, 2024

Note to reviewers: There is currently a bug when using UTF8_LCASE collation. I am looking into it.

scala> sql("""select to_variant_object(map("a" collate utf8_lcase, 2))""").collect()
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000

@harshmotw-db
Contributor Author

Note to reviewers: There is currently a bug when using UTF8_LCASE collation. I am looking into it.

scala> sql("""select to_variant_object(map("a" collate utf8_lcase, 2))""").collect()
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase optimization failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000

This issue has been fixed.

@harshmotw-db
Contributor Author

@HyukjinKwon Can you go over the Python changes in this PR?

Contributor

@gene-db gene-db left a comment

@harshmotw-db Thanks! I left a few minor comments.

LGTM

@harshmotw-db
Contributor Author

@cloud-fan Can you go over this PR again whenever you're available?

@harshmotw-db
Contributor Author

@cloud-fan There may be an issue with the Scala linter test. It was passing earlier and is failing after a very minor commit. It says there are lint failures in the sql/connect and connector/connect spaces which this PR is not even modifying. It recommends a command to fix these issues but that command is making several unrelated changes across the codebase.

/**
* Converts a column containing nested inputs (array/map/struct) into a variant where maps and
* structs are converted to variant objects which are unordered unlike SQL structs. Input maps can
* only have string keys.
Contributor

If the input has no array/map/struct, is this function a no-op?

Contributor

nvm, the type check will fail for it.

@cloud-fan
Contributor

The lint failure is unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in 3709c2e Sep 10, 2024
Member

@MaxGekk MaxGekk left a comment

@cloud-fan Here is the fix for formatting: #48060
