[SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations #32338
Changes from all commits
```diff
@@ -1686,6 +1686,61 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
       StructType(Seq(StructField("a", IntegerType, nullable = true))))
   }
 
+  test("SPARK-35213: chained withField operations should have correct schema for new columns") {
+    val df = spark.createDataFrame(
+      sparkContext.parallelize(Row(null) :: Nil),
+      StructType(Seq(StructField("data", NullType))))
+
```
**Contributor (Author):** Is it possible to just create an empty DataFrame with no columns in Scala? I mostly operate in Python and can just do …
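A minimal sketch of the column-less DataFrame the comment asks about (my own illustration, not part of this PR; it assumes a `SparkSession` in scope named `spark`, as in the test suite):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Both are standard SparkSession APIs that yield a DataFrame with
// zero columns and zero rows.
val emptyViaHelper = spark.emptyDataFrame
val emptyViaSchema = spark.createDataFrame(
  spark.sparkContext.emptyRDD[Row], StructType(Nil))

emptyViaHelper.printSchema()  // prints just "root", no fields
```

Note that the test above still parallelizes a single `Row(null)` so that `checkAnswer` has one output row to compare against; a zero-row DataFrame would not exercise that.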
```diff
+    checkAnswer(
+      df.withColumn("data", struct()
+        .withField("a", struct())
+        .withField("b", struct())
+        .withField("a.aa", lit("aa1"))
+        .withField("b.ba", lit("ba1"))
+        .withField("a.ab", lit("ab1"))),
+      Row(Row(Row("aa1", "ab1"), Row("ba1"))) :: Nil,
+      StructType(Seq(
+        StructField("data", StructType(Seq(
+          StructField("a", StructType(Seq(
+            StructField("aa", StringType, nullable = false),
+            StructField("ab", StringType, nullable = false)
+          )), nullable = false),
+          StructField("b", StructType(Seq(
+            StructField("ba", StringType, nullable = false)
+          )), nullable = false)
+        )), nullable = false)
```
**Comment on lines +1702 to +1711**

**Member:** nit: Using DDL might be more readable?

**Contributor (Author):** Yeah, it's kind of verbose, but I feel like for complicated things the objects are easier to understand than the DDL strings, especially with structs. I wasn't sure if there was an easier way to avoid explicitly marking everything as not nullable, at least.
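For comparison, a hedged sketch (not in the PR) of what the DDL form of the expected schema could look like, using `StructType.fromDDL`. Fields parsed from DDL come back as `nullable = true` by default, which is exactly why the test spells out `nullable = false` with explicit `StructField` objects:

```scala
import org.apache.spark.sql.types.StructType

// Same shape as the expected schema in the test, but parsed from a DDL string.
// The parsed fields default to nullable, so an exact schema assertion would
// still need the nullability adjusted by hand.
val expectedFromDdl: StructType = StructType.fromDDL(
  "data STRUCT<a: STRUCT<aa: STRING, ab: STRING>, b: STRUCT<ba: STRING>>")

expectedFromDdl.printTreeString()
```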
```diff
+      ))
+    )
+  }
+
+  test("SPARK-35213: optimized withField operations should maintain correct nested struct " +
+    "ordering") {
+    val df = spark.createDataFrame(
+      sparkContext.parallelize(Row(null) :: Nil),
+      StructType(Seq(StructField("data", NullType))))
+
+    checkAnswer(
+      df.withColumn("data", struct()
+        .withField("a", struct().withField("aa", lit("aa1")))
+        .withField("b", struct().withField("ba", lit("ba1")))
+      )
+        .withColumn("data", col("data").withField("b.bb", lit("bb1")))
+        .withColumn("data", col("data").withField("a.ab", lit("ab1"))),
+      Row(Row(Row("aa1", "ab1"), Row("ba1", "bb1"))) :: Nil,
+      StructType(Seq(
+        StructField("data", StructType(Seq(
+          StructField("a", StructType(Seq(
+            StructField("aa", StringType, nullable = false),
+            StructField("ab", StringType, nullable = false)
+          )), nullable = false),
+          StructField("b", StructType(Seq(
+            StructField("ba", StringType, nullable = false),
+            StructField("bb", StringType, nullable = false)
+          )), nullable = false)
+        )), nullable = false)
+      ))
+    )
+  }
+
   test("dropFields should throw an exception if called on a non-StructType column") {
     intercept[AnalysisException] {
```
If you check the output data type, you can see the struct type is not different. By design, `UpdateFields` will keep the order of fields in the struct expression.

But yeah, it looks better to keep the original `WithField` order.

This was just to sanity check that the `WithField` order does actually stay the same; the tests in `ColumnExpressionSuite` show how it can actually give you an incorrect schema. I don't fully know how a schema is determined (what part of the planning phase).
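As a small sketch of my own (not from the PR) of how one could eyeball the nested field ordering that these comments and the `checkAnswer` schema assertions are pinning down, assuming a `SparkSession` named `spark` as in the suite:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{NullType, StructField, StructType}

// Same setup as the optimizer-ordering test above.
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Row(null) :: Nil),
  StructType(Seq(StructField("data", NullType))))

val result = df
  .withColumn("data", struct()
    .withField("a", struct().withField("aa", lit("aa1")))
    .withField("b", struct().withField("ba", lit("ba1"))))
  .withColumn("data", col("data").withField("b.bb", lit("bb1")))
  .withColumn("data", col("data").withField("a.ab", lit("ab1")))

// Inspect the resolved field order directly from the DataFrame schema.
val dataType = result.schema("data").dataType.asInstanceOf[StructType]
println(dataType.fieldNames.mkString(", "))   // expected per the test: a, b
println(dataType("a").dataType.asInstanceOf[StructType]
  .fieldNames.mkString(", "))                 // expected per the test: aa, ab
```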