Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream #13995
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -24,8 +24,8 @@ use std::vec; | |||||||
| use crate::aggregates::group_values::{new_group_values, GroupValues}; | ||||||||
| use crate::aggregates::order::GroupOrderingFull; | ||||||||
| use crate::aggregates::{ | ||||||||
| - evaluate_group_by, evaluate_many, evaluate_optional, group_schema, AggregateMode, | ||||||||
| - PhysicalGroupBy, | ||||||||
| + create_schema, evaluate_group_by, evaluate_many, evaluate_optional, group_schema, | ||||||||
| + AggregateMode, PhysicalGroupBy, | ||||||||
| }; | ||||||||
| use crate::metrics::{BaselineMetrics, MetricBuilder, RecordOutput}; | ||||||||
| use crate::sorts::sort::sort_batch; | ||||||||
|
|
@@ -490,6 +490,31 @@ impl GroupedHashAggregateStream { | |||||||
| .collect::<Result<_>>()?; | ||||||||
|
|
||||||||
| let group_schema = group_schema(&agg.input().schema(), &agg_group_by)?; | ||||||||
|
|
||||||||
| + // fix https://github.com/apache/datafusion/issues/13949 | ||||||||
| + // Builds a **partial aggregation** schema by combining the group columns and | ||||||||
| + // the accumulator state columns produced by each aggregate expression. | ||||||||
| + // | ||||||||
| + // # Why Partial Aggregation Schema Is Needed | ||||||||
| + // | ||||||||
| + // In a multi-stage (partial/final) aggregation strategy, each partial-aggregate | ||||||||
| + // operator produces *intermediate* states (e.g., partial sums, counts) rather | ||||||||
| + // than final scalar values. These extra columns do **not** exist in the original | ||||||||
| + // input schema (which may be something like `[colA, colB, ...]`). Instead, | ||||||||
| + // each aggregator adds its own internal state columns (e.g., `[acc_state_1, acc_state_2, ...]`). | ||||||||
| + // | ||||||||
| + // Therefore, when we spill these intermediate states or pass them to another | ||||||||
| + // aggregation operator, we must use a schema that includes both the group | ||||||||
| + // columns **and** the partial-state columns. | ||||||||
| + let partial_agg_schema = create_schema( | ||||||||
| + &agg.input().schema(), | ||||||||
| + &agg_group_by, | ||||||||
| + &aggregate_exprs, | ||||||||
| + AggregateMode::Partial, | ||||||||
| + )?; | ||||||||
|
|
||||||||
| + let partial_agg_schema = Arc::new(partial_agg_schema); | ||||||||
|
|
||||||||
| let spill_expr = group_schema | ||||||||
| .fields | ||||||||
| .into_iter() | ||||||||
|
|
@@ -522,7 +547,7 @@ impl GroupedHashAggregateStream { | |||||||
| let spill_state = SpillState { | ||||||||
| spills: vec![], | ||||||||
| spill_expr, | ||||||||
| - spill_schema: Arc::clone(&agg_schema), | ||||||||
| + spill_schema: partial_agg_schema, | ||||||||
|
Contributor
It seems like the issue was related only to
Contributor
Author
Hi @korowa,
In other words, I should remove these lines, am I correct?
datafusion/datafusion/physical-plan/src/aggregates/row_hash.rs, lines 967 to 969 in 487b952
Contributor
Yes, this line seems to be redundant now -- I'd expect all aggregation modes to have the same spill schema (which is set by this PR), so it shouldn't depend on stream input anymore.
Contributor
Author
Thanks for confirming. |
||||||||
| is_stream_merging: false, | ||||||||
| merging_aggregate_arguments, | ||||||||
| merging_group_by: PhysicalGroupBy::new_single(agg_group_by.expr.clone()), | ||||||||
|
|
@@ -964,9 +989,6 @@ impl GroupedHashAggregateStream { | |||||||
| && self.update_memory_reservation().is_err() | ||||||||
| { | ||||||||
| assert_ne!(self.mode, AggregateMode::Partial); | ||||||||
| - // Use input batch (Partial mode) schema for spilling because | ||||||||
| - // the spilled data will be merged and re-evaluated later. | ||||||||
| - self.spill_state.spill_schema = batch.schema(); | ||||||||
| self.spill()?; | ||||||||
| self.clear_shrink(batch); | ||||||||
| } | ||||||||
|
|
||||||||
minor: this import reordering can be reverted to leave the file unmodified
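
For readers who want to see the column mismatch in isolation, here is a minimal, std-only sketch. It is a toy model rather than DataFusion code: `Mode`, `ToyAggregate`, `toy_schema`, and `check_spill` are hypothetical names invented for illustration, and the `avg(x)` state column names are assumptions. It only shows the idea from the code comment in this PR: spilled rows carry group columns plus accumulator state columns, so they fit a partial-aggregation schema but may not fit a final-output schema.

```rust
// Toy model only: every type and function here is a hypothetical stand-in,
// not DataFusion's real `Schema`, `create_schema`, or accumulator API.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Partial,
    Final,
}

/// A toy aggregate: one final output column and a list of intermediate state columns.
struct ToyAggregate {
    output: &'static str,
    state_fields: Vec<&'static str>,
}

/// Build column names for a mode: group columns first, then either the
/// state columns (Partial) or one output column per aggregate (Final).
fn toy_schema(group_cols: &[&str], aggs: &[ToyAggregate], mode: Mode) -> Vec<String> {
    let mut cols: Vec<String> = group_cols.iter().map(|c| c.to_string()).collect();
    for agg in aggs {
        match mode {
            Mode::Partial => cols.extend(agg.state_fields.iter().map(|s| s.to_string())),
            Mode::Final => cols.push(agg.output.to_string()),
        }
    }
    cols
}

/// Stand-in for a spill writer that rejects a batch whose column count
/// differs from the schema it was configured with.
fn check_spill(schema: &[String], batch_columns: usize) -> Result<(), String> {
    if schema.len() == batch_columns {
        Ok(())
    } else {
        Err(format!(
            "schema has {} columns but batch has {}",
            schema.len(),
            batch_columns
        ))
    }
}

fn main() {
    // Assumed state layout for an average: a count and a running sum.
    let aggs = vec![ToyAggregate {
        output: "avg(x)",
        state_fields: vec!["avg(x)[count]", "avg(x)[sum]"],
    }];

    let partial = toy_schema(&["k"], &aggs, Mode::Partial);
    let final_schema = toy_schema(&["k"], &aggs, Mode::Final);

    // A spilled batch holds the group column plus both state columns.
    let spilled_columns = 1 + aggs[0].state_fields.len();

    // The partial schema matches the spilled data; the final schema does not.
    assert!(check_spill(&partial, spilled_columns).is_ok());
    assert!(check_spill(&final_schema, spilled_columns).is_err());

    println!("partial: {:?}", partial);
    println!("final:   {:?}", final_schema);
}
```

In this toy model the spill schema is fixed once from the partial layout, which mirrors the review discussion above: it no longer needs to be reassigned from the incoming batch at spill time.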