[SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations #32338
Conversation
```diff
  .select(
    Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out1")(),
-   Alias(UpdateFields('a, WithField("B1", Literal(5)) :: Nil), "out2")())
+   Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out2")())
```
One result is that for case-insensitive cases, the first casing seen for a field is maintained, rather than the last one. If this isn't what we want, I can update it to keep the last casing seen
Since this is the case-insensitive path, it seems like no big deal. Semantically, though, "B1" is specified later, so I guess it is more reasonable to keep the later one.
Changed it to keep the last casing instead
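Per the discussion above, the chosen semantics can be sketched as follows (a hypothetical illustration, not test code from this PR; assumes `spark.sql.caseSensitive=false` and the usual `functions._` imports):

```scala
import org.apache.spark.sql.functions._

// Hypothetical sketch: with case-insensitive resolution, two withField
// calls that differ only in casing resolve to the same struct field.
// Per this PR's discussion, the casing seen last ("B1") is the one kept
// in the resulting schema.
val df  = spark.range(1).select(struct(lit(1).as("b1")).as("a"))
val out = df.select(
  col("a").withField("b1", lit(5)).withField("B1", lit(6)).as("out"))
```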
```scala
test("SPARK-35213: chained withField operations should have correct schema for new columns") {
  val df = spark.createDataFrame(
    sparkContext.parallelize(Row(null) :: Nil),
    StructType(Seq(StructField("data", NullType))))
```
Is it possible to just create an empty DataFrame with no columns in Scala? I mostly operate in Python, where I can just do spark.createDataFrame([[]])
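For reference, a couple of Scala options that roughly correspond to the Python one-liner (a sketch; the exact schema requirements depend on the test):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// spark.emptyDataFrame gives a DataFrame with no columns and no rows.
val empty = spark.emptyDataFrame

// To get a single all-null row with an explicit schema (as the test
// above does), an explicit StructType is still needed:
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(null))),
  StructType(Seq(StructField("data", NullType))))
```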
Tests seem to fail if there's a slash in the source branch name. Not sure if there's anything I can do other than recreate the PR with a different branch name.

ok to test

Huh, let's see if Jenkins works. Otherwise, you may need to submit another PR using a different branch name.

Seems okay, the Jenkins tests are running.

Kubernetes integration test starting

Kubernetes integration test status failure

Also pushed a new branch in my fork: https://github.com/Kimahriman/spark/runs/2432306637
```scala
    .analyze

  comparePlans(optimized, correctAnswer)
}
```
If you check the output data type, you can see the struct types are not different:

```
optimized:     ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false)))
correctAnswer: ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false)))
```
By design, UpdateFields will keep the order of fields in struct expression.
But yeah, it looks better to keep the original WithField order.
This was just to sanity-check that the WithField order actually stays the same; the tests in the column suite show how it can actually give you an incorrect schema. I don't fully know how a schema is determined (i.e. in what part of the planning phase).
```scala
  }
}

test("SPARK-35213: ensure optimize WithFields maintains correct struct ordering") {
```
struct ordering -> withfield ordering
done
```scala
StructType(Seq(
  StructField("data", StructType(Seq(
    StructField("a", StructType(Seq(
      StructField("aa", StringType, nullable = false),
      StructField("ab", StringType, nullable = false)
    )), nullable = false),
    StructField("b", StructType(Seq(
      StructField("ba", StringType, nullable = false)
    )), nullable = false)
  )), nullable = false)
```
nit: Using DDL might be more readable?
Yeah, it's kind of verbose, but I feel like for complicated things the objects are easier to understand than the DDL strings, especially with nested structs. I wasn't sure if there was an easier way to avoid explicitly marking everything as not nullable, at least.
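For comparison, a sketch of the DDL-string form of the same schema. Note that `StructType.fromDDL` yields nullable fields, so it is not an exact replacement for the `nullable = false` schema above; that limitation is one reason the explicit objects are used in the test.

```scala
import org.apache.spark.sql.types.StructType

// Same nested shape as the StructType above, expressed as a DDL string.
// Caveat: fields parsed from DDL come back nullable.
val schema = StructType.fromDDL(
  "data STRUCT<a: STRUCT<aa: STRING, ab: STRING>, b: STRUCT<ba: STRING>>")
```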
viirya left a comment:
Thanks for catching and fixing it.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #137924 has finished for PR 32338 at commit

@Kimahriman can you retrigger https://github.com/Kimahriman/spark/actions/runs/783536510 please? The PR in the Apache Spark repo runs the build in your forked repository.

Test build #137926 has finished for PR 32338 at commit

It will fail because of the slash in the branch name. It tries to check out just the part after the slash and fails. I forgot about this when I made the PR, though. I pushed to a separate branch and you can see the action here: https://github.com/Kimahriman/spark/actions/runs/783542775. It's the same commit, but the only way I can get the action to pass on this PR is to close it and open a new one on the differently named branch.

Jenkins tests already passed, so I think it should be fine.

Thanks. Merging to master/3.1.
Closes #32338 from Kimahriman/bug/optimize-with-fields.
Authored-by: Adam Binford <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 74afc68)
What changes were proposed in this pull request?
Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.
Why are the changes needed?
Certain withField patterns can cause exceptions or even incorrect results. This appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last operation per field, but this can cause nested structs to change order, which leads to mismatches between the schema and the actual data. This change updates the optimization to maintain the initial ordering of nested structs so that it matches the generated schema.
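A sketch of the failure mode described here (hypothetical column names mirroring the schema in the review snippets; the actual reproductions are in the new unit tests and the JIRA):

```scala
import org.apache.spark.sql.functions._

// Hypothetical repro sketch: chained withField calls that touch
// different nested structs. Before this fix, the optimizer's reverse
// traversal of fieldOps could reorder the nested structs "a" and "b"
// relative to the derived schema, producing an exception or a
// data/schema mismatch when the plan is executed.
val df = spark.range(1).select(
  struct(
    struct(lit("aa").as("aa"), lit("ab").as("ab")).as("a"),
    struct(lit("ba").as("ba")).as("b")
  ).as("data"))

val result = df.withColumn("data", col("data")
  .withField("b.ba", lit("ba2"))
  .withField("a.aa", lit("aa2")))
```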
Does this PR introduce any user-facing change?
It fixes exceptions and incorrect results for valid uses in the latest Spark release.
How was this patch tested?
Added new unit tests for these edge cases.