@viirya
Member

@viirya viirya commented Jul 3, 2020

What changes were proposed in this pull request?

This patch makes NestedColumnAliasing handle cosmetic variations when processing nested column extractors. Currently, if the nested column extractors contain cosmetic variations, the query is not optimized.

Why are the changes needed?

If the expressions extracting nested fields have cosmetic variations, such as different qualifiers, nested column pruning currently does not work well.

For example, when a query refers to two attributes that are semantically the same, their nested column extractors are still treated as different during nested column pruning.
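
To make the problem concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's Catalyst classes: `Attr`, `GetField`, and the `canonicalized` method are hypothetical stand-ins, and it assumes that canonicalization amounts to dropping the qualifier. It only illustrates why grouping extractors by the raw attribute can split one column into two groups, while grouping by the canonicalized attribute keeps them together.

```scala
// Toy model only: these case classes are NOT Spark's Catalyst classes.
object CosmeticVariationSketch {
  // An attribute reference; `qualifier` is the cosmetic part (e.g. a view name).
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil) {
    // Assumption: the canonical form simply drops the qualifier.
    def canonicalized: Attr = copy(qualifier = Nil)
  }

  // A nested field extractor such as `name.first`.
  final case class GetField(child: Attr, field: String)

  // The same column (same exprId), once without and once with a qualifier.
  val plain: Attr = Attr("name", 1L)
  val qualified: Attr = Attr("name", 1L, Seq("contact_alias"))

  val extractors: Seq[GetField] =
    Seq(GetField(plain, "first"), GetField(qualified, "last"))

  // Grouping by the raw attribute splits one column into two groups.
  val byRaw: Map[Attr, Seq[GetField]] = extractors.groupBy(_.child)

  // Grouping by the canonicalized attribute keeps both extractors together.
  val byCanonical: Map[Attr, Seq[GetField]] =
    extractors.groupBy(_.child.canonicalized)
}
```

In this sketch, `byRaw` has two keys for what is really one column, whereas `byCanonical` has a single key that holds both extractors.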

Does this PR introduce any user-facing change?

Yes. This fixes a bug in nested column pruning.

How was this patch tested?

Unit test.

@viirya
Member Author

viirya commented Jul 3, 2020

cc @maropu @dongjoon-hyun @frankyin-factual

@ukby1234
Contributor

ukby1234 commented Jul 3, 2020

Great! Thanks. I'm going to apply the same patch to my branch.

@viirya
Member Author

viirya commented Jul 3, 2020

You can rebase after this gets merged.

@viirya
Member Author

viirya commented Jul 3, 2020

Is Jenkins not working again?

@viirya
Member Author

viirya commented Jul 3, 2020

ok to test

  val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
    .filter(!_.references.subsetOf(exclusiveAttrSet))
-   .groupBy(_.references.head)
+   .groupBy(_.references.head.canonicalized.asInstanceOf[Attribute])
Member

Nice catch, @viirya @frankyin-factual !

testSchemaPruning("SPARK-32163: nested pruning should work even with cosmetic variations") {
  withTempView("contact_alias") {
    sql("select * from contacts")
      .repartition(100, col("name.first"), col("name.last"))
Member

Does this mean the issue cannot happen in branch-3.0?

private def canProjectPushThrough(plan: LogicalPlan) = plan match {
  case _: GlobalLimit => true
  case _: LocalLimit => true
  case _: Repartition => true
  case _: Sample => true
  case _ => false
}

Member Author

@viirya viirya Jul 5, 2020

This test case is only for the master branch. However, this issue can happen in branch-3.0 too. I added another new test here, which is for branch-3.0.

But when we backport this to branch-3.0, we need to remove the first test case, as it will fail on checkScan(query, "struct<name:struct<first:string,last:string>>") because branch-3.0 does not prune nested columns for repartition by expression.

Member

Yea, thanks for adding the test.

  if (nestedFieldToAlias.nonEmpty &&
-     nestedFieldToAlias
-       .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
+     dedupNestedFields.map(_.canonicalized.asInstanceOf[ExtractValue])
Member

@maropu maropu Jul 5, 2020

nit: we don't need the cast here, do we?

Member

Is this part related to the query failure in the test? It looks like it is just an optimization.

Member

btw, can't we avoid generating the duplicated aliases (name#1.first AS _gen_alias_52#52 and name#1.first AS _gen_alias_53#53) below? Is this technically difficult? (This is not related to this PR; it's just a question.)

scala> sql("select name.first from contact_alias").explain()
== Physical Plan ==
*(2) Project [_gen_alias_52#52 AS first#50]
+- Exchange hashpartitioning(_gen_alias_53#53, _gen_alias_54#54, 100), false, [id=#46]
   +- *(1) Project [name#1.first AS _gen_alias_52#52, name#1.first AS _gen_alias_53#53, name#1.last AS _gen_alias_54#54]
      +- FileScan parquet [name#1,p#8] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/contacts], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<first:string,last:string>>

Member Author

Because of cosmetic variations, extractors that differ only in their qualifiers, for example, are counted separately, which makes the total number of fields incorrect.
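
A minimal sketch of this miscount, in plain Scala rather than Spark code. It makes several assumptions: `Attr`, `GetField`, and `canonicalized` are hypothetical toy types, the struct is assumed to have three leaf fields, and each extractor is assumed to read exactly one leaf field, so counting fields reduces to counting distinct extractors.

```scala
// Toy model only: these case classes are NOT Spark's Catalyst classes.
object FieldCountSketch {
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil) {
    // Assumption: the canonical form simply drops the qualifier.
    def canonicalized: Attr = copy(qualifier = Nil)
  }
  final case class GetField(child: Attr, field: String) {
    def canonicalized: GetField = copy(child = child.canonicalized)
  }

  // Hypothetical struct with three leaf fields: first, middle, last.
  val totalFieldsInName: Int = 3

  val plain: Attr = Attr("name", 1L)
  val qualified: Attr = Attr("name", 1L, Seq("contact_alias"))

  // `name.first` appears twice, differing only in its qualifier.
  val extractors: Seq[GetField] = Seq(
    GetField(plain, "first"),
    GetField(qualified, "first"),
    GetField(plain, "last"))

  // Without canonicalization, the two `first` extractors count as two fields.
  val rawCount: Int = extractors.distinct.size

  // With canonicalization, they collapse into one.
  val canonicalCount: Int = extractors.map(_.canonicalized).distinct.size

  // Aliasing is only worthwhile when fewer fields are used than exist.
  val pruneWithRaw: Boolean = rawCount < totalFieldsInName
  val pruneWithCanonical: Boolean = canonicalCount < totalFieldsInName
}
```

Here the raw count reaches the struct's full field count, so the toy check concludes that every field is used and skips aliasing. The canonicalized count stays below it, so aliasing still applies.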

Member Author

We can avoid the duplicated aliases by working on canonicalized extractors when generating aliases.

But we would also need to convert incoming extractors to their canonicalized versions when looking them up in the alias map.

The code currently looks clear, and this doesn't seem like a big deal: I think it is rare to have multiple extractors that differ only cosmetically. So I am not doing that for now.
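
For illustration only, the alternative described in this comment (which this PR does not implement) could look roughly like the sketch below. It is plain Scala with hypothetical toy types, not Spark code: `Attr`, `GetField`, `buildAliases`, and `lookup` are made-up names, and the generated alias names are placeholders.

```scala
// Toy model only: these case classes are NOT Spark's Catalyst classes.
object AliasDedupSketch {
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String] = Nil) {
    // Assumption: the canonical form simply drops the qualifier.
    def canonicalized: Attr = copy(qualifier = Nil)
  }
  final case class GetField(child: Attr, field: String) {
    def canonicalized: GetField = copy(child = child.canonicalized)
  }

  // Generate one alias per canonicalized extractor (alias names are placeholders).
  def buildAliases(extractors: Seq[GetField]): Map[GetField, String] =
    extractors.map(_.canonicalized).distinct.zipWithIndex
      .map { case (extractor, i) => extractor -> s"_gen_alias_$i" }
      .toMap

  // Lookups must canonicalize the incoming extractor as well,
  // otherwise a qualified extractor would miss the map.
  def lookup(aliases: Map[GetField, String], extractor: GetField): Option[String] =
    aliases.get(extractor.canonicalized)

  val plainFirst: GetField = GetField(Attr("name", 1L), "first")
  val qualifiedFirst: GetField =
    GetField(Attr("name", 1L, Seq("contact_alias")), "first")
  val plainLast: GetField = GetField(Attr("name", 1L), "last")

  val aliases: Map[GetField, String] =
    buildAliases(Seq(plainFirst, qualifiedFirst, plainLast))
}
```

The sketch shows both halves of the trade-off mentioned above: the alias map gets one entry per distinct field, but every lookup has to go through `canonicalized`, because a direct `aliases.get(qualifiedFirst)` would miss.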

Member

Thanks for the explanation. Looks okay.

Comment on lines +514 to +516
val query2 = sql("select friends.middle, col from contact_alias")
checkScan(query2, "struct<friends:array<struct<first:string,middle:string>>>")
checkAnswer(query2, Row(Array("Z."), "Susan") :: Nil)
Member Author

@maropu This test is for branch-3.0.

Member

Thank you for the test case.

@maropu
Member

maropu commented Jul 5, 2020

Looks okay. Could anyone else check this? @dongjoon-hyun @dbtsai

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124928 has finished for PR 28988 at commit 04f6bb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 6, 2020

retest this please

@maropu
Member

maropu commented Jul 6, 2020

FYI: @zhengruifeng @ScrapCodes (not sure which one is the release manager, though) I think it would be better to include this fix in the v3.0.1 release.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124991 has finished for PR 28988 at commit 04f6bb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 6, 2020

retest this please...

@viirya
Member Author

viirya commented Jul 6, 2020

retest this please

@maropu
Member

maropu commented Jul 6, 2020

retest this please

@viirya
Member Author

viirya commented Jul 6, 2020

Jenkins looks unstable.

@maropu
Member

maropu commented Jul 6, 2020

Yeah, Shane is working hard on this issue now, so we need to wait a little until it stabilizes.

@viirya
Member Author

viirya commented Jul 7, 2020

retest this please

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125166 has finished for PR 28988 at commit 04f6bb6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125194 has finished for PR 28988 at commit 04f6bb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you so much, @viirya , @maropu , @frankyin-factual , @HyukjinKwon .
Merged to master.

The last commit is an indentation-only fix.

@dongjoon-hyun
Member

Could you adjust the test case and make a backporting PR against branch-3.0, @viirya?

cc @dbtsai

@viirya
Member Author

viirya commented Jul 7, 2020

Thanks all. Sure, let me create a backporting PR for branch-3.0.

@SparkQA

SparkQA commented Jul 7, 2020

Test build #125233 has finished for PR 28988 at commit d352dbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jul 8, 2020
… variations

### What changes were proposed in this pull request?

This patch makes `NestedColumnAliasing` handle cosmetic variations when processing nested column extractors. Currently, if the nested column extractors contain cosmetic variations, the query is not optimized.

This backports #28988 to branch-3.0.

### Why are the changes needed?

If the expressions extracting nested fields have cosmetic variations, such as different qualifiers, nested column pruning currently does not work well.

For example, when a query refers to two attributes that are semantically the same, their nested column extractors are still treated as different during nested column pruning.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug in nested column pruning.

### How was this patch tested?

Unit test.

Closes #29027 from viirya/SPARK-32163-3.0.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya deleted the SPARK-32163 branch December 27, 2023 18:28