@viirya
Member

@viirya viirya commented May 17, 2020

What changes were proposed in this pull request?

Currently we only push nested column pruning through a few operators, such as LIMIT and SAMPLE. This patch extends the feature to other operators, including RepartitionByExpression and Join.

Why are the changes needed?

Currently nested column pruning is only applied to a few operators, which limits its benefit. Extending its coverage makes the feature apply more generally across different queries.
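
As a rough illustration of the idea, here is a minimal, self-contained sketch in plain Scala. It is a toy model, not Catalyst code: `Expr`, `Attr`, `GetField`, and `prunableFields` are hypothetical names made up for this example. It assumes that nested-field candidates are gathered from both the projection list and the operator's own expressions, and that a column referenced as a whole cannot have its fields pruned.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object NestedPruningSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr                   // a whole top-level column
  final case class GetField(child: Expr, field: String) extends Expr // child.field

  // Name of the top-level column an expression is rooted at.
  def root(e: Expr): String = e match {
    case Attr(n)        => n
    case GetField(c, _) => root(c)
  }

  // Dotted path of an expression, e.g. name.middle.
  def path(e: Expr): String = e match {
    case Attr(n)        => n
    case GetField(c, f) => s"${path(c)}.$f"
  }

  // Nested fields that could be read instead of their whole parent column.
  // Candidates come from the projection list and from the operator's expressions
  // (for example, repartition expressions or a join condition).
  def prunableFields(projectList: Seq[Expr], operatorExprs: Seq[Expr]): Seq[String] = {
    val candidates = projectList ++ operatorExprs
    // Assumption: a column used as a whole must be read in full, so skip its fields.
    val wholeColumns = candidates.collect { case Attr(n) => n }.toSet
    candidates
      .collect { case g: GetField => g }
      .filterNot(g => wholeColumns.contains(root(g)))
      .map(path)
      .distinct
  }
}
```

In this model, projecting `name.middle` while repartitioning by `name.middle` yields one prunable field, whereas repartitioning by the whole `name` column yields none.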

Does this PR introduce any user-facing change?

Yes. More SQL operators are covered by nested column pruning.

How was this patch tested?

Added unit and end-to-end tests.

@SparkQA

SparkQA commented May 17, 2020

Test build #122750 has finished for PR 28556 at commit 2c95e81.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 17, 2020

retest this please

@SparkQA

SparkQA commented May 17, 2020

Test build #122752 has finished for PR 28556 at commit 2c95e81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 17, 2020

cc @maropu

case Project(projectList, child)
    if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-  getAliasSubMap(projectList)
+  val exprsToPrune = projectList ++ child.expressions
Member

nit: exprsToPrune -> exprCandidatesToPrune?

Member Author

changed.

  })
}).transformExpressions {
  case f: ExtractValue if nestedFieldToAlias.contains(f) =>
    nestedFieldToAlias(f).toAttribute
Member

@maropu maropu May 18, 2020

Is this change needed only for supporting joins?

Member Author

No, for example RepartitionByExpression also needs this change.

Member

Member Author

Actually, any operator that has expressions needs this, so that ExtractValue expressions are replaced with the nested column aliases.
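
The rewrite described here can be pictured with a small stand-alone sketch. It is plain Scala with hypothetical types (`Expr`, `Attr`, `GetField`, `EqualTo`) and a hypothetical `replaceWithAliases` function, not the Catalyst implementation. It only assumes that every nested-field access found in the alias map is swapped for the alias's attribute, wherever it appears in an operator's expressions.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object AliasReplacementSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr

  // Walk an expression tree and replace each aliased nested-field access
  // with the attribute of its alias; everything else is rebuilt unchanged.
  def replaceWithAliases(e: Expr, nestedFieldToAlias: Map[Expr, Attr]): Expr = e match {
    case f: GetField if nestedFieldToAlias.contains(f) => nestedFieldToAlias(f)
    case GetField(c, f) => GetField(replaceWithAliases(c, nestedFieldToAlias), f)
    case EqualTo(l, r) =>
      EqualTo(replaceWithAliases(l, nestedFieldToAlias), replaceWithAliases(r, nestedFieldToAlias))
    case a: Attr => a
  }
}
```

Applied to a join condition such as `name.middle = deptName`, only the `name.middle` side changes; the same function would work on repartition expressions, which is why the replacement is not specific to joins.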

val newGenerate = g.copy(generator = newGenerator)

- NestedColumnAliasing.replaceChildrenWithAliases(newGenerate, attrToAliases)
+ NestedColumnAliasing.replaceChildrenWithAliases(g, nestedFieldToAlias, attrToAliases)
Member

nit: I think we need to update the method name replaceChildrenWithAliases. We don't need Children in the name anymore, do we?

Member Author

Changed the method name.

@viirya viirya changed the title from "[SPARK-31736][SQL] Nested column aliasing for other operators" to "[SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join" on May 21, 2020
@SparkQA

SparkQA commented May 21, 2020

Test build #122916 has finished for PR 28556 at commit b77a1ba.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 21, 2020

retest this please

@viirya
Member Author

viirya commented May 21, 2020

@maropu I addressed your comments. Could you help take another look? Thanks.

@SparkQA

SparkQA commented May 21, 2020

Test build #122917 has finished for PR 28556 at commit b77a1ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("Pushing a single nested field projection - negative") {
  val ops = Seq(
    (input: LogicalPlan) => input.distribute('name)(1),
    (input: LogicalPlan) => input.distribute($"name.middle")(1),
Member

Ah, looks nice. This PR could support this case.

.analyze
comparePlans(optimized1, expected1)


Member

nit: unnecessary line break.

checkAnswer(query3, Row("abc") :: Row(null) :: Nil)
}

testSchemaPruning("select one deep nested complex field after outer join") {
Member

Thanks for adding the tests.

"struct<contactId:int>")
checkAnswer(query1, Row("X.") :: Row("Y.") :: Nil)

val query2 = sql("select contacts.name.middle from contacts, departments where " +
Member

nit: I think it's better to use uppercase for SQL keywords where possible.

Member Author

It seems all tests in this test suite use lowercase. Changing all of them seems like too much bother... :)

@maropu
Member

maropu commented May 21, 2020

I left some minor comments, but it looks okay. cc: @dongjoon-hyun @dbtsai

@SparkQA

SparkQA commented May 21, 2020

Test build #122922 has finished for PR 28556 at commit db601df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #122966 has finished for PR 28556 at commit f720bdf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 22, 2020

retest this please

@SparkQA

SparkQA commented May 22, 2020

Test build #122977 has finished for PR 28556 at commit f720bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 27, 2020

retest this please

@SparkQA

SparkQA commented May 27, 2020

Test build #123168 has finished for PR 28556 at commit f720bdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jun 2, 2020

ping @cloud-fan @dongjoon-hyun

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123785 has finished for PR 28556 at commit ce5d8dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val exprCandidatesToPrune = projectList ++ child.expressions
getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq)

case plan if SQLConf.get.nestedSchemaPruningEnabled && canPruneOn(plan) =>
Member

@HyukjinKwon HyukjinKwon Jun 12, 2020

No big deal, but I would rename plan to p to avoid shadowing the plan argument. At least my IDE complains about that.

Member Author

I will change it in another PR. Thanks.
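
For readers unfamiliar with this kind of warning, here is a small sketch of pattern-variable shadowing in plain Scala. The object and method names are hypothetical and the example is unrelated to the actual optimizer code; it only shows how reusing a parameter name as a pattern variable hides the parameter inside the case body.

```scala
// Hypothetical example; not taken from the Spark code base.
object ShadowingSketch {
  // The pattern variable `plan` shadows the method parameter `plan`:
  // inside the case body, `plan` is the upper-cased value, not the argument.
  def describeShadowed(plan: String): String = plan.toUpperCase match {
    case plan if plan.nonEmpty => plan
    case _                     => "empty"
  }

  // Renaming the pattern variable keeps both values visible and unambiguous.
  def describeRenamed(plan: String): String = plan.toUpperCase match {
    case p if p.nonEmpty => s"$plan -> $p"
    case _               => "empty"
  }
}
```

With the renamed variable, the original argument and the matched value can both be referenced, which is the readability gain the suggestion is after.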

    if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-  getAliasSubMap(projectList)
+  val exprCandidatesToPrune = projectList ++ child.expressions
+  getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq)
Member

@viirya, just to clarify: you added producedAttributes here just to be safe, and it is not related to the current changes, right? It seems Join and RepartitionByExpression have empty producedAttributes.

Member

Okay, if it's going to output, it shouldn't be pruned anyway.
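
A minimal sketch of that reasoning, in plain Scala: nested accesses rooted at an attribute in the exclusion set are dropped from the alias candidates. `NestedRef`, `aliasCandidates`, and the sample column names are hypothetical and chosen for illustration; this is not the real `getAliasSubMap`.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object ProducedAttributesSketch {
  // A nested access such as name.middle, split into its root column
  // and the field path underneath it.
  final case class NestedRef(root: String, fields: List[String]) {
    def path: String = (root :: fields).mkString(".")
  }

  // Keep only the nested accesses whose root column is NOT in the exclusion set.
  // Assumption: attributes an operator produces itself are passed as the exclusion set.
  def aliasCandidates(refs: Seq[NestedRef], exclusiveAttrs: Set[String]): Seq[NestedRef] =
    refs.filterNot(r => exclusiveAttrs.contains(r.root)).distinct
}
```

With an empty exclusion set, which is the situation described for Join and RepartitionByExpression, every candidate is kept, so the extra argument changes nothing for those operators.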

@HyukjinKwon
Member

Merged to master.

@viirya viirya deleted the others-column-pruning branch December 27, 2023 18:23