[SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations #32338

Kimahriman · 2021-04-25T15:09:29Z

What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

Why are the changes needed?

Certain withField patterns can cause exceptions or even incorrect results. This appears to result from the additional UpdateFields optimization added in #29812, which traverses fieldOps in reverse order to take the last one per field. That traversal can change the order of nested structs, which leads to mismatches between the schema and the actual data. This PR updates the optimization to maintain the initial ordering of nested structs so that it matches the generated schema.
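To make the ordering problem concrete, here is a minimal, Spark-free sketch of the two deduplication strategies. `FieldOp`, `dedupByLastPosition`, and `dedupByFirstPosition` are hypothetical names invented for this illustration, not Catalyst classes; the real optimizer works on `WithField` expressions rather than plain strings, so treat this as a model of the idea only.

```scala
object WithFieldOrderSketch {
  // Hypothetical stand-in for a WithField op: a field name plus the value written to it.
  final case class FieldOp(name: String, value: String)

  // Walk the ops in reverse and keep the first op seen per name (the last write).
  // Each field ends up at the position of its LAST write.
  def dedupByLastPosition(ops: Seq[FieldOp]): Seq[FieldOp] = {
    val seen = scala.collection.mutable.HashSet.empty[String]
    ops.reverse.filter(op => seen.add(op.name)).reverse
  }

  // Keep the last value written per name, but at the position of the FIRST write,
  // which is the order in which chained writes append new fields to a struct.
  def dedupByFirstPosition(ops: Seq[FieldOp]): Seq[FieldOp] = {
    val lastValue = ops.map(op => op.name -> op.value).toMap // later entries win
    ops.map(_.name).distinct.map(name => FieldOp(name, lastValue(name)))
  }

  def main(args: Array[String]): Unit = {
    val ops = Seq(FieldOp("a", "1"), FieldOp("b", "2"), FieldOp("a", "3"))
    println(dedupByLastPosition(ops).map(_.name))  // a moves behind b
    println(dedupByFirstPosition(ops).map(_.name)) // a stays ahead of b
  }
}
```

Both functions keep the same values; only the resulting field order differs, and that order is what has to agree with the schema.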

Does this PR introduce any user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

How was this patch tested?

Added new unit tests for these edge cases.

Kimahriman · 2021-04-25T15:11:11Z

...atalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeWithFieldsSuite.scala

      .select(
        Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out1")(),
-        Alias(UpdateFields('a, WithField("B1", Literal(5)) :: Nil), "out2")())
+        Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out2")())


One result is that for case-insensitive cases, the first casing seen for a field is maintained, rather than the last one. If this isn't what we want, I can update it to keep the last casing seen

Since this only affects the case-insensitive case, it seems like no big deal. Semantically, though, "B1" is specified later, so I guess it is more reasonable to keep the later one.

Changed it to keep the last casing instead
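As a rough illustration of the behavior settled on here, the sketch below deduplicates ops case-insensitively, keeps each field at the position of its first write, and carries over the name and value of the last write. `FieldOp` and `dedupCaseInsensitive` are hypothetical names for this example, and lower-casing with `Locale.ROOT` is an assumption standing in for Spark's case-insensitive resolver.

```scala
import java.util.Locale

object LastCasingSketch {
  // Hypothetical stand-in for a WithField op.
  final case class FieldOp(name: String, value: Int)

  // One op per field (ignoring case), at the position of the first write,
  // carrying the casing and value of the last write.
  def dedupCaseInsensitive(ops: Seq[FieldOp]): Seq[FieldOp] = {
    def key(op: FieldOp): String = op.name.toLowerCase(Locale.ROOT)
    val lastOp = ops.map(op => key(op) -> op).toMap // later entries win
    ops.map(key).distinct.map(lastOp)
  }

  def main(args: Array[String]): Unit = {
    val ops = Seq(FieldOp("b1", 5), FieldOp("a1", 1), FieldOp("B1", 6))
    // "B1" replaces "b1" but stays in the first slot, ahead of "a1".
    println(dedupCaseInsensitive(ops))
  }
}
```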

Kimahriman · 2021-04-25T15:12:51Z

sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala

+  test("SPARK-35213: chained withField operations should have correct schema for new columns") {
+    val df = spark.createDataFrame(
+      sparkContext.parallelize(Row(null) :: Nil),
+      StructType(Seq(StructField("data", NullType))))


Is it possible to just create an empty DataFrame with no columns in Scala? I mostly work in Python, where I can just do spark.createDataFrame([[]]).

Kimahriman · 2021-04-25T16:19:02Z

Tests seem to fail if there's a slash in the source branch name. Not sure if there's anything I can do other than recreate the PR with a different branch name

viirya · 2021-04-25T17:25:47Z

ok to test

viirya · 2021-04-25T17:26:52Z

Huh, let's see if Jenkins works. Otherwise, you may need to submit another PR using a different branch name.

viirya · 2021-04-25T18:04:28Z

Seems okay, Jenkins tests are running.

SparkQA · 2021-04-25T18:17:58Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42444/

SparkQA · 2021-04-25T18:17:59Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42444/

Kimahriman · 2021-04-25T18:29:47Z

Also pushed a new branch in my fork: https://github.com/Kimahriman/spark/runs/2432306637

viirya · 2021-04-25T19:52:17Z

...atalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeWithFieldsSuite.scala

+      .analyze
+
+    comparePlans(optimized, correctAnswer)
+  }


If you check the output data type, you can see the struct type is not different:

optimized: ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false))) correctAnswer: ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false)))

By design, UpdateFields will keep the order of fields in struct expression.

But yea, it looks better to keep original WithField order.

This was just a sanity check that the WithField order actually stays the same; the tests in ColumnExpressionSuite show how it can actually give you an incorrect schema. I don't fully know how a schema is determined (that is, in which part of the planning phase).
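For readers following along, a toy model of why an order change matters even when the field names and values are the same. This is only a sketch of positional reads under the assumption that the schema is fixed before the values are produced; it makes no claim about which planning phase computes the schema in Spark, and `SchemaMismatchSketch`, `evaluate`, and `read` are hypothetical names.

```scala
object SchemaMismatchSketch {
  // Field names in the order the schema declares them (fixed before evaluation).
  val schema: Seq[String] = Seq("a", "b")

  // Values are stored positionally; names are not kept alongside the data.
  def evaluate(ops: Seq[(String, Int)]): Seq[Int] = ops.map(_._2)

  // Reading a row pairs the i-th schema name with the i-th stored value.
  def read(values: Seq[Int]): Map[String, Int] = schema.zip(values).toMap

  def main(args: Array[String]): Unit = {
    val sameOrder = Seq("a" -> 3, "b" -> 2) // matches the schema order
    val reordered = Seq("b" -> 2, "a" -> 3) // same writes, different order
    println(read(evaluate(sameOrder))) // a reads 3, b reads 2
    println(read(evaluate(reordered))) // a reads 2, b reads 3: silently wrong
  }
}
```

When the two fields have different types, the same positional mismatch would plausibly surface as an exception rather than a wrong value, which matches the two failure modes named in the description.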

viirya · 2021-04-25T20:11:55Z

...atalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeWithFieldsSuite.scala

    }
  }
+
+  test("SPARK-35213: ensure optimize WithFields maintains correct struct ordering") {


struct ordering -> withfield ordering

viirya · 2021-04-25T21:23:13Z

sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala

+        StructType(Seq(
+          StructField("data", StructType(Seq(
+            StructField("a", StructType(Seq(
+              StructField("aa", StringType, nullable = false),
+              StructField("ab", StringType, nullable = false)
+            )), nullable = false),
+            StructField("b", StructType(Seq(
+              StructField("ba", StringType, nullable = false)
+            )), nullable = false)
+          )), nullable = false)


nit: Using ddl might be more readable?

Yeah, it's kinda verbose, but I feel like for complicated things the objects are easier to understand than the DDL strings, especially with structs. I wasn't sure if there was an easier way to avoid explicitly marking everything as not nullable, at least.

viirya

Thanks for catching and fixing it.

SparkQA · 2021-04-25T22:06:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42446/

SparkQA · 2021-04-25T22:06:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42446/

SparkQA · 2021-04-25T22:25:27Z

Test build #137924 has finished for PR 32338 at commit 1af0dae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-26T01:46:15Z

@Kimahriman can you retrigger https://github.com/Kimahriman/spark/actions/runs/783536510 please? The PR in Apache Spark repo runs the build in your forked repository.

SparkQA · 2021-04-26T01:55:16Z

Test build #137926 has finished for PR 32338 at commit 23ee428.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Kimahriman · 2021-04-26T01:59:54Z

@Kimahriman can you retrigger https://github.com/Kimahriman/spark/actions/runs/783536510 please? The PR in Apache Spark repo runs the build in your forked repository.

It will fail because of the slash in the branch name. It tries to check out just the part after the slash and fails. I forgot about this when I made the PR, though. I pushed to a separate branch and you can see the action here: https://github.com/Kimahriman/spark/actions/runs/783542775. It's the same commit, but the only way I can get the action to pass on this PR is to close it and open a new one from the other branch.

viirya · 2021-04-26T03:23:46Z

Jenkins tests already passed, so I think it should be fine.

viirya · 2021-04-26T06:39:16Z

Thanks. Merging to master/3.1.

…ined withField operations
Closes #32338 from Kimahriman/bug/optimize-with-fields.
Authored-by: Adam Binford
Signed-off-by: Liang-Chi Hsieh
(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh

Keep the correct ordering of nested structs in chained withField operations

1af0dae

github-actions bot added the SQL label Apr 25, 2021

viirya reviewed Apr 25, 2021

Use the last casing instead of the first for case insensitivity

23ee428

viirya reviewed Apr 25, 2021

viirya approved these changes Apr 25, 2021

viirya closed this in 74afc68 Apr 26, 2021

Kimahriman mentioned this pull request May 11, 2021

[SPARK-35290][SQL] Append new nested struct fields rather than sort for unionByName with null filling #32448

Closed
