-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-41985][SQL] Centralize more column resolution rules #39508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions are pulled out to ColumnResolutionHelper without actual changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the actual change. Now we resolve group by alias and group by all before resolving outer references.
|
cc @viirya |
|
There are some conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No new code, all code are moved from Analyzer.
anchovYu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented one issue with a failed case. Tricky, we need to note that once an expression is resolved to OuterReference there is no way back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests for this resolution order: 1) resolve group by alias 2) resolve group by all.
E.g. select .. as all .. group by all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, select a + 1 as b, b + 1 as c .. group by c will fail as LCA now doesn't handle grouping expr. Need a followup if we want to support above query (which I think we should).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be problematic, try with this query:
sql(s"select * from values(1) as t(something) " +
s"where exists (select salary * 1.5 as something from $testTable group by something)")
where testTable has a salary field. (I just wrote in LCA suite.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! It's important to remember that rules will be executed repeatedly and we need to make sure the rule won't conflict with its previous runs.
|
Thanks @cloud-fan for doing this as well! Very useful for the health of the analyzer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the agg exprs are not changed from here.
Shall we abstract the code as
val resolvedGroupExprs = if (resolvedAggExprsWithOuter.forall(_.resolved)) {
if (resolvedAggExprsWithOuter.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE))) {
...
} else {
...
}
} else {
...
}
a.copy(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, even use single-layer if, because all conditions are on one variable resolvedAggExprsWithOuter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use "aggregation expressions" instead of Aggregate.aggregateExpressions? The comment is using "grouping expressions" below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: column => columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for putting the outer ref at the last order. outer ref can only be used in where/having so it makes sense to resolve as the other references first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is not group by alias but group by outer reference, it works? From ResolveReferencesInAggregate seems so, just want to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually reading the code I think group by alias is triggered, because the whole resolvedAggExprsWithOuter are resolved (to outer reference)..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GROUP BY alias takes precedence than outer reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add test for group by all:
SELECT 1 AS x, x + 1 AS k FROM testData GROUP BY all;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, even use single-layer if, because all conditions are on one variable resolvedAggExprsWithOuter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add extra handling to report user-friendly error message, showing that queries of group by all and group by alias are not supported when there is LCA in the select clause? For example, check if the grouping expression is all or alias, throw exception immediately.
Currently I don't have a clear solution how to support the above case with the LCA design. So this unsupported situation may last for some time, and we better provide clean error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually reading the code I think group by alias is triggered, because the whole resolvedAggExprsWithOuter are resolved (to outer reference)..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, do we require the child of Sort to be resolved? Or is it implied somewhere? Any risk of not doing so (the outer reference thing)? Though i can't think of any negative case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we do, ResolveRefererences only invokes these sub-rules if all children are resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha I forgot ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any tests covering this order: resolveColWithAgg > resolveOrderByAll ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anchovYu can you help review the new tests? I think I've covered all the cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
sql/core/src/test/resources/sql-tests/inputs/column-resolution-aggregate.sql
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing reference won't be able to cover this case, it should only add grouping expressions to the Aggregate. v1.b is not a grouping expression, so if order by v1.b the query should throw exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test is good. just comment needs correction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha I forgot ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this extends the fix of https://issues.apache.org/jira/browse/SPARK-31670.
Alias in grouping expressions is harmful. Semantically, alias in grouping expressions should be ignored. We trim aliases in grouping expressions at the end of analysis by the rule CleanupAliases, so that CheckAnalysis can correctly match grouping expressions with aggregate expressions. However, there are rules doing the same matching during analysis, and CleanupAliases is too late for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For details in how it affects LCA:
Without the trim,
=== Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveLateralColumnAliasReference ===
!Aggregate [salary#19 AS lca#15], [salary#19 AS lca#15, (lateralAliasReference(lca#15, lca, lca#15) + 1) AS col#16] Project [lca#15, (lateralAliasReference(lca#15, lca, lca#15) + 1) AS col#16]
!+- SubqueryAlias spark_catalog.default.employee +- Aggregate [salary#19 AS lca#15], [salary#19 AS lca#15]
! +- Relation spark_catalog.default.employee[dept#17,name#18,salary#19,bonus#20,properties#21] orc +- SubqueryAlias spark_catalog.default.employee
! +- Relation spark_catalog.default.employee[dept#17,name#18,salary#19,bonus#20,properties#21] orc
ResolveLateralColumnAliasReference matches the whole salary#19 AS lca#15 same as the grouping expression and pushes it down to the Aggregate (this rule finds matches in grouping expression and aggregate functions to push down). It causes problem for later application of this rule on Project: there is no alias in the Project any more so the lateralAliasReference can't find an alias to match and be removed.
|
The pyspark failure is unrelated, I'm merging it to master/3.4 (as it fixes a regression), thanks for the review! |
### What changes were proposed in this pull request? This is a followup of #38888 . When I search for all the matching of `UnresolvedAttribute`, I found that there are still a few rules doing column resolution: 1. ResolveAggAliasInGroupBy 2. ResolveGroupByAll 3. ResolveOrderByAll 4. ResolveDefaultColumns This PR merges the first 3 into `ResolvedReferences`. The last one will be done with a separate PR, as it's more complicated. To avoid making the rule `ResolvedReferences` bigger and bigger, this PR pulls out the resolution code for `Aggregate` to a separated virtual rule (only be used by `ResolvedReferences`). The same to `Sort`. We can refactor and add more virtual rules later. ### Why are the changes needed? It's problematic to not centralize all the column resolution logic, as the execution order of the rules is not reliable. It actually leads to regression after #38888 : `select a from t where exists (select 1 as a group by a)`. The `group by a` should be resolved as `1 as a`, but now it's resolved as outer reference `a`. This is because `ResolveReferences` runs before `ResolveAggAliasInGroupBy`, and resolves outer references too early. ### Does this PR introduce _any_ user-facing change? Fixes a bug, but the bug is not released yet. ### How was this patch tested? new tests Closes #39508 from cloud-fan/column. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 40ca27c) Signed-off-by: Wenchen Fan <[email protected]>
…xpr is resolved ### What changes were proposed in this pull request? This is a followup of #39508 to fix a regression. We should not remove aliases from grouping expressions if they are not resolved, as the alias may be necessary for resolution, such as `CreateNamedStruct`. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #39867 from cloud-fan/column. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Max Gekk <[email protected]>
…xpr is resolved ### What changes were proposed in this pull request? This is a followup of #39508 to fix a regression. We should not remove aliases from grouping expressions if they are not resolved, as the alias may be necessary for resolution, such as `CreateNamedStruct`. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #39867 from cloud-fan/column. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Max Gekk <[email protected]> (cherry picked from commit 02b39f0) Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request? This is a followup of apache#38888 . When I search for all the matching of `UnresolvedAttribute`, I found that there are still a few rules doing column resolution: 1. ResolveAggAliasInGroupBy 2. ResolveGroupByAll 3. ResolveOrderByAll 4. ResolveDefaultColumns This PR merges the first 3 into `ResolvedReferences`. The last one will be done with a separate PR, as it's more complicated. To avoid making the rule `ResolvedReferences` bigger and bigger, this PR pulls out the resolution code for `Aggregate` to a separated virtual rule (only be used by `ResolvedReferences`). The same to `Sort`. We can refactor and add more virtual rules later. ### Why are the changes needed? It's problematic to not centralize all the column resolution logic, as the execution order of the rules is not reliable. It actually leads to regression after apache#38888 : `select a from t where exists (select 1 as a group by a)`. The `group by a` should be resolved as `1 as a`, but now it's resolved as outer reference `a`. This is because `ResolveReferences` runs before `ResolveAggAliasInGroupBy`, and resolves outer references too early. ### Does this PR introduce _any_ user-facing change? Fixes a bug, but the bug is not released yet. ### How was this patch tested? new tests Closes apache#39508 from cloud-fan/column. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 40ca27c) Signed-off-by: Wenchen Fan <[email protected]>
…xpr is resolved ### What changes were proposed in this pull request? This is a followup of apache#39508 to fix a regression. We should not remove aliases from grouping expressions if they are not resolved, as the alias may be necessary for resolution, such as `CreateNamedStruct`. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes apache#39867 from cloud-fan/column. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Max Gekk <[email protected]> (cherry picked from commit 02b39f0) Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?
This is a followup of #38888 .
When I search for all the matching of
UnresolvedAttribute, I found that there are still a few rules doing column resolution:This PR merges the first 3 into
ResolvedReferences. The last one will be done with a separate PR, as it's more complicated.To avoid making the rule
ResolvedReferencesbigger and bigger, this PR pulls out the resolution code forAggregateto a separated virtual rule (only be used byResolvedReferences). The same toSort. We can refactor and add more virtual rules later.Why are the changes needed?
It's problematic to not centralize all the column resolution logic, as the execution order of the rules is not reliable. It actually leads to regression after #38888 :
select a from t where exists (select 1 as a group by a). Thegroup by ashould be resolved as1 as a, but now it's resolved as outer referencea. This is becauseResolveReferencesruns beforeResolveAggAliasInGroupBy, and resolves outer references too early.Does this PR introduce any user-facing change?
Fixes a bug, but the bug is not released yet.
How was this patch tested?
new tests