Skip to content

Conversation

@rshkv
Copy link

@rshkv rshkv commented Dec 8, 2020

Upstream

Taking fix for failures in which Spark will fail to bind a column if an aggregation contains a subquery.

What changes were proposed in this pull request?

Mostly a cherry-pick. I had to introduce the BaseAggregateExec trait (it came from a different PR) the one big method it has is cherry-picked from upstream.

How was this patch tested?

Upstream PR introduced test.

…agg contains subquery

Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression  for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`.

When planning aggregates, the partial aggregate uses agg fucs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105

For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348

This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy.

Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`.

Yes, user hit error previously but run query successfully after this change.

Added a regression test.

Closes apache#28496 from Ngone51/spark-31620.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@rshkv rshkv changed the title [SPARK-31620][SQL] Fix reference binding failure in case of an final … [SPARK-31620][SQL] Fix reference binding failure in case of an final agg contains subquery Dec 8, 2020
@rshkv rshkv merged commit 118b4c4 into master Dec 8, 2020
@rshkv rshkv deleted the wr/SPARK-31620 branch December 8, 2020 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants