[SPARK-21580][SQL]Integers in aggregation expressions are wrongly taken as group-by ordinal#18779
[SPARK-21580][SQL]Integers in aggregation expressions are wrongly taken as group-by ordinal#1877910110346 wants to merge 1 commit intoapache:masterfrom
Conversation
|
A specified title for this might be better. Such as "Integers in aggregation expressions are wrongly taken as group-by ordinal". |
There was a problem hiding this comment.
We can use pattern matching here like:
ng match {
case a: Alias => if (!isIntLiteral(a.child)) filterGroups :+= ng
case _ => filterGroups :+= ng
}
There was a problem hiding this comment.
Please also fix the indent of code here. Like:
newGroups.foreach { ng =>
ng match {
...
}
}
There was a problem hiding this comment.
Please leave a comment explaining why we need to filter the group-by exprs.
|
BTW, we should modify the PR description too. Please briefly describe the problem, what the fix is. Thanks. |
|
Thanks,i will update @viirya |
Group by ordinal|
Test build #80067 has finished for PR 18779 at commit
|
|
I checked the root cause of this and It seems the current fix is similar to |
|
@maropu I am not sure if I understand your idea correctly. Those int literals in group-by is intended to be wrapped in |
There was a problem hiding this comment.
Please also comment on why this filtering is safe from changing grouping result. Thanks.
There was a problem hiding this comment.
I think we may need to trim all possible Aliases, e.g., Alias(Alias(1, 'a'), 'b').
There was a problem hiding this comment.
Have they been trimmed in aggregateExpressions?
There was a problem hiding this comment.
This rule is before CleanupAliases.
|
@viirya sorry for my ambiguous explanation and I mean; I think replacing |
|
@maropu Thanks for clarifying it. Although they looks similar, from semantics I'd treat them different rules. However, I don't have strong opinion on this. Btw, |
There was a problem hiding this comment.
Similar to RemoveLiteralFromGroupExpressions, we shouldn't drop all grouping expressions if they are all int literals after resolved from UnresolvedOrdinal.
There was a problem hiding this comment.
It looks no problem here, although we drop all grouping expressions
There was a problem hiding this comment.
Please see the comments at
. We should not make the grouping expressions empty.There was a problem hiding this comment.
Actually, We have supported GlobalAggregates and group by null
There was a problem hiding this comment.
When the aggregates that have an empty input, if we remove all grouping expressions, this triggers the ungrouped code path (which aways returns a single row). So the query semantics are changed.
You meant this is not an issue anymore?
There was a problem hiding this comment.
If we reimplement this by using ResolvedOrdinal, we probably need not consider the empty issue here, I think.
There was a problem hiding this comment.
@maropu yap, I think so. Because RemoveLiteralFromGroupExpressions takes care of this.
There was a problem hiding this comment.
thanks, I'm not very familiar with this module , how to do it?
There was a problem hiding this comment.
@10110346 Please see previous comment: #18779 (comment)
The basic idea is, instead of resolve UnresolvedOrdinal to actual aggregate expressions, we resolve it to ResolvedOrdinal.
Then in the beginning of optimization, we replace all ResolvedOrdinal to actual aggregate expressions.
What do you think?
|
yea, just suggestion. |
|
@maropu As an alternative approach, we can avoid to resolve Instead, we can create a By doing this, we can fix this issue (because we don't replace the int literal with |
|
Aha, I feel better idea to add |
|
Test build #80070 has finished for PR 18779 at commit
|
|
Test build #80105 has finished for PR 18779 at commit
|
|
@viirya Only to |
|
@10110346 Can't we also do the same on order by ordinal? |
|
@viirya Maybe adding |
|
I have updated this PR, please help to review it again. @viirya also cc @cloud-fan @gatorsmile |
|
@10110346 Why? In the query Is there any problem? |
|
@viirya |
|
Actually I don't think |
There was a problem hiding this comment.
Please also add order-by test too. Maybe add to DataFrameSuite.
There was a problem hiding this comment.
Why do we need to have change like this?
There was a problem hiding this comment.
Do we still need to keep this change?
There was a problem hiding this comment.
it is not necessary, i will remove it.thanks
|
Test build #80237 has finished for PR 18779 at commit
|
There was a problem hiding this comment.
I've run this test. Even without transform -> resolveOperators, it still works. Can you check it again?
There was a problem hiding this comment.
OK. I see. When we input query like df.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)), the query plan looks like:
Sort [7#22 ASC NULLS FIRST, a#5 ASC NULLS FIRST, b#6 ASC NULLS FIRST], true
+- Project [7 AS 7#22, a#5, b#6]
+- Project [_1#2 AS a#5, _2#3 AS b#6]
+- LocalRelation [_1#2, _2#3]
We have a Project below Sort. The ordinal 1 be replaced with the attribute 7#22. So we won't get an int literal 7 here. That is why it passes.
Can you have a test for ordinal order-by that show different behavior? If not, I think ordinal order-by should be safe from this bug. And we don't need to add this test.
There was a problem hiding this comment.
Hmm, maybe we still can keep this test for the case that someone changes the Sort/Project relationship in the future.
There was a problem hiding this comment.
I've verified that the above tests will fail without this fix.
|
LGTM. cc @gatorsmile for final check. |
|
@10110346 Thanks for working this! Sorry I've confused you in previous comments. Current changes looks good to me. |
|
Test build #80244 has finished for PR 18779 at commit
|
|
Test build #80246 has finished for PR 18779 at commit
|
|
retest this please. |
|
Test build #80249 has finished for PR 18779 at commit
|
There was a problem hiding this comment.
One more comment. You need to use withTempView
There was a problem hiding this comment.
Nit: Please use upper cases for SQL keywords.
|
LGTM except a few minor comments. The fix looks great now! Thanks everyone! |
|
I learned a lot from you, thanks all. |
|
Test build #80277 has finished for PR 18779 at commit
|
|
Really appreciate your patience and your work! This is how we work in Spark SQL. Thanks! |
…ken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is integer, after replaced with this aggregateExpression, the groupExpression still considered as an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18779 from 10110346/groupby. (cherry picked from commit 894d5a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
|
Thanks! Merging to master/2.2 |
…ken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is integer, after replaced with this aggregateExpression, the groupExpression still considered as an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes apache#18779 from 10110346/groupby. (cherry picked from commit 894d5a4) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
select 3, 4, sum(b) from data group by 1, 2;select 3 as c, 4 as d, sum(b) from data group by c, d;When running these two cases, the following exception occurred:
Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10The cause of this failure:
If an aggregateExpression is integer, after replaced with this aggregateExpression, the
groupExpression still considered as an ordinal.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using
resolveOperatorsinSubstituteUnresolvedOrdinals.How was this patch tested?
Added unit test case