Skip to content

Conversation

@Kimahriman
Copy link
Contributor

What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

Does this PR introduce any user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

How was this patch tested?

Added new unit tests for these edge cases.

@github-actions github-actions bot added the SQL label Apr 25, 2021
.select(
Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out1")(),
Alias(UpdateFields('a, WithField("B1", Literal(5)) :: Nil), "out2")())
Alias(UpdateFields('a, WithField("b1", Literal(5)) :: Nil), "out2")())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One result is that for case-insensitive cases, the first casing seen for a field is maintained, rather than the last one. If this isn't what we want, I can update it to keep the last casing seen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is for case-insensitive, seems no big deal. Although for the semantics, the "B1" is specified later, so I guess it is more reasonable to keep later one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to keep the last casing instead

test("SPARK-35213: chained withField operations should have correct schema for new columns") {
val df = spark.createDataFrame(
sparkContext.parallelize(Row(null) :: Nil),
StructType(Seq(StructField("data", NullType))))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to just create an empty dataframe with no columns in Scala? I mostly operate and python and can just do spark.createDataFrame([[]])

@Kimahriman
Copy link
Contributor Author

Tests seem to fail if there's a slash in the source branch name. Not sure if there's anything I can do other than recreate the PR with a different branch name

@viirya
Copy link
Member

viirya commented Apr 25, 2021

ok to test

@viirya
Copy link
Member

viirya commented Apr 25, 2021

Huh, let's see if Jenkins works. Otherwise, you may need to submit another PR using different branch name.

@viirya
Copy link
Member

viirya commented Apr 25, 2021

Seems okay, Jenkins tests are running.

@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42444/

@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42444/

@Kimahriman
Copy link
Contributor Author

Also pushed a new branch in my fork: https://github.com/Kimahriman/spark/runs/2432306637

.analyze

comparePlans(optimized, correctAnswer)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check the output data type, you can see the struct type is not different:

optimized: ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false)))
correctAnswer: ArrayBuffer(StructType(StructField(a1,IntegerType,false), StructField(b1,IntegerType,false)))

By design, UpdateFields will keep the order of fields in struct expression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yea, it looks better to keep original WithField order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just to sanity check the WithField order does actually stay the same, the tests on the Column Suite show how it can actually give you an incorrect schema. I don't fully know how a schema is determined (what part of the planning phase)

}
}

test("SPARK-35213: ensure optimize WithFields maintains correct struct ordering") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct ordering -> withfield ordering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +1702 to +1711
StructType(Seq(
StructField("data", StructType(Seq(
StructField("a", StructType(Seq(
StructField("aa", StringType, nullable = false),
StructField("ab", StringType, nullable = false)
)), nullable = false),
StructField("b", StructType(Seq(
StructField("ba", StringType, nullable = false)
)), nullable = false)
)), nullable = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Using ddl might be more readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's kinda verbose, but I feel like for complicated things the objects are easier to understand than the DDL strings, especially with structs. Wasn't sure if there was an easier way to not have to explicitly mark everything as not nullable at least

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching and fixing it.

@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42446/

@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42446/

@SparkQA
Copy link

SparkQA commented Apr 25, 2021

Test build #137924 has finished for PR 32338 at commit 1af0dae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@Kimahriman can you retrigger https://github.com/Kimahriman/spark/actions/runs/783536510 please? The PR in Apache Spark repo runs the build in your forked repository.

@SparkQA
Copy link

SparkQA commented Apr 26, 2021

Test build #137926 has finished for PR 32338 at commit 23ee428.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Kimahriman
Copy link
Contributor Author

@Kimahriman can you retrigger https://github.com/Kimahriman/spark/actions/runs/783536510 please? The PR in Apache Spark repo runs the build in your forked repository.

It will fail because of the slash in the branch name. It tries to just checkout the part after the slash and fails. I forgot about this when I made the PR though. I pushed to a separate branch and you can see the action here: https://github.com/Kimahriman/spark/actions/runs/783542775. it's the same commit, but the only way I can get the action to pass on this PR is to close it and open a new one on the other named branch

@viirya
Copy link
Member

viirya commented Apr 26, 2021

Jenkins tests already passed, so I think it should be fine.

@viirya
Copy link
Member

viirya commented Apr 26, 2021

Thanks. Merging to master/3.1.

@viirya viirya closed this in 74afc68 Apr 26, 2021
viirya pushed a commit that referenced this pull request Apr 26, 2021
…ined withField operations

### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ined withField operations

### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in apache#29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes apache#32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ined withField operations

### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in apache#29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes apache#32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants