@viirya
Member

@viirya viirya commented Jul 7, 2020

What changes were proposed in this pull request?

This patch makes NestedColumnAliasing handle cosmetic variations when it processes nested column extractors. Currently, if the nested column extractors contain cosmetic variations, the query is not optimized.

This backports #28988 to branch-3.0.

Why are the changes needed?

If the expressions that extract nested fields have cosmetic variations, such as different qualifiers, nested column pruning currently does not work well.

For example, when a query refers to two attributes that are semantically the same, their nested column extractors are still treated as different during nested column pruning.
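
To make the idea concrete, here is a minimal sketch of why canonicalizing before deduplication matters. It is a toy model, not Spark code: `Expr`, `Attr`, `GetField`, `dedupNaive`, and `dedupCanonical` are hypothetical names that only mimic how Catalyst expressions expose a `canonicalized` form, and the choice of which fields count as cosmetic (name, qualifier, field name) is an assumption made for illustration.

```scala
object CanonicalizationSketch {
  // Hypothetical stand-ins for Catalyst expressions; not Spark's real classes.
  sealed trait Expr { def canonicalized: Expr }

  // An attribute identified by exprId; the name and qualifier are cosmetic here.
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil)
      extends Expr {
    def canonicalized: Expr = Attr("none", exprId, Nil)
  }

  // Extracts a struct field by ordinal; the optional field name is cosmetic here.
  final case class GetField(child: Expr, ordinal: Int, fieldName: Option[String] = None)
      extends Expr {
    def canonicalized: Expr = GetField(child.canonicalized, ordinal, None)
  }

  // Naive dedup: cosmetically different extractors survive as separate entries.
  def dedupNaive(extractors: Seq[Expr]): Seq[Expr] = extractors.distinct

  // Canonicalize first, then dedup: semantically equal extractors collapse to one.
  def dedupCanonical(extractors: Seq[Expr]): Seq[Expr] =
    extractors.map(_.canonicalized).distinct

  // The same nested field, referenced without and with a qualifier.
  val plain: Expr = GetField(Attr("contact", exprId = 1L), 0, Some("name"))
  val qualified: Expr =
    GetField(Attr("contact", exprId = 1L, qualifier = Seq("t")), 0, Some("name"))
  // A different nested field of the same attribute.
  val other: Expr = GetField(Attr("contact", exprId = 1L), 1, Some("address"))
}
```

In this sketch, `dedupNaive(Seq(plain, qualified))` keeps both extractors because they differ only by qualifier, while `dedupCanonical` collapses them into one and still keeps `other` separate. That is the behavior the pruning rule needs in order to see one nested field instead of two.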

Does this PR introduce any user-facing change?

Yes, fixing a bug in nested column pruning.

How was this patch tested?

Unit test.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-32163][SQL][BRANCH-3.0] Nested pruning should work even with cosmetic variations" to "[SPARK-32163][SQL][3.0] Nested pruning should work even with cosmetic variations" on Jul 7, 2020
@maropu
Member

maropu commented Jul 7, 2020

Thanks for the backport, @viirya.

if (nestedFieldToAlias.nonEmpty &&
    nestedFieldToAlias
      .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
nestedFields.map(_.canonicalized)
Member

Is this the main difference from 3.0, dedupNestedFields -> nestedFields?

Member Author

Yeah, for the change in NestedColumnAliasing. Another difference is the test: one test in the master branch cannot pass in branch-3.0.

Member

Yes. The test part looked correct because it's a subset. This part looks a little different and needs more validation.

Member Author

Thanks for checking. Yes, there is a slight difference between master and branch-3.0 here, so there is no dedupNestedFields in branch-3.0.

Member

Ah, I missed this part. I think the added test still fails if we don't have this change. Is this correct?

Member

@dongjoon-hyun dongjoon-hyun Jul 7, 2020

Yes, it does. The new test case still validates this part of the patch.

Member Author

Yes, this test fails in the current branch-3.0.

Member

Nice.

@dongjoon-hyun
Member

Merged to branch-3.0. Thank you, @viirya and @maropu.
All unit tests (including R) have already passed in the currently running Jenkins build.

dongjoon-hyun pushed a commit that referenced this pull request Jul 8, 2020
… variations

Closes #29027 from viirya/SPARK-32163-3.0.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@SparkQA

SparkQA commented Jul 8, 2020

Test build #125245 has finished for PR 29027 at commit 5e4a420.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 8, 2020

Thanks @dongjoon-hyun @maropu

@viirya viirya deleted the SPARK-32163-3.0 branch December 27, 2023 18:28