-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Fix duplicate group keys after hash aggregation spill (#20724) #20858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
64631cc
c5eced1
31e4e15
54aaed9
4034396
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1267,6 +1267,18 @@ impl GroupedHashAggregateStream { | |
| // on the grouping columns. | ||
| self.group_ordering = GroupOrdering::Full(GroupOrderingFull::new()); | ||
|
|
||
| // Recreate group_values to use streaming mode (GroupValuesColumn<true> | ||
| // with scalarized_intern) which preserves input row order, as required | ||
| // by GroupOrderingFull. This is only needed for multi-column group by, | ||
| // since single-column uses GroupValuesPrimitive which is always safe. | ||
|
gboucher90 marked this conversation as resolved.
Outdated
|
||
| let group_schema = self | ||
| .spill_state | ||
| .merging_group_by | ||
| .group_schema(&self.spill_state.spill_schema)?; | ||
| if group_schema.fields().len() > 1 { | ||
| self.group_values = new_group_values(group_schema, &self.group_ordering)?; | ||
|
Comment on lines
+1278
to
+1279
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note:
|
||
| } | ||
|
|
||
| // Use `OutOfMemoryMode::ReportError` from this point on | ||
| // to ensure we don't spill the spilled data to disk again. | ||
| self.oom_mode = OutOfMemoryMode::ReportError; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how long our CI already runs for we probably want to avoid a new target (this will recompile a bunch of DataFusion even though it only runs one test)
Another technique we might use to add coverage / find a reproducer here is fuzzing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a PR here to propose removing the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thank you 🙇 , I was about to do it after my lunch break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged it on this branch, thanks a lot