Implement set_agg aggregate function #1511
Closed
👋 hi all - first PR here. Excited to contribute to this project.
Which issue does this PR close?
This closes #1323.
Rationale for this change
This provides an efficient way to aggregate unique values into an array. This is beneficial for aggregating low-cardinality fields, where `array_agg` may require significantly more memory than `set_agg`.
I mainly implemented this as a way to get familiar with the codebase, though I'm not 100% sure merging this actually makes sense if the goal of the project is to be as Postgres-like as possible.
`set_agg` is supported by Presto (as linked in the issue above) and other DBMSs, but Postgres supports neither `set_agg` nor `array_distinct`. The recommended approach to something like `set_agg` in Postgres is to use `array_agg`, unnest the values, select distinct, then use `array_agg` again on the output.
Edit - looks like this is the general approach for getting unique items from an array, but for everything `set_agg` would work for, `array_agg(distinct <expr>)` would work as well.
What changes are included in this PR?
This includes the implementation of `set_agg` and a couple of tests. It borrows heavily from the patch that implemented `array_agg`: #1300
Open questions
There are a couple of specific points I could use feedback on:
- Using `hashbrown::HashSet` for the accumulator instead of `std::collections::HashSet` - it seems this is preferred in the codebase.
- Testing, given that the output of `set_agg` is nondeterministic. I managed to hack it together, but I'm sure there's an easier way (for both integration and unit tests).
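For reviewers who want the idea without reading the diff, here is a minimal sketch of a set-backed accumulator and of one way to make assertions on its output order-independent. It is not the code in this PR: `SetAggSketch`, its methods, and `sorted` are hypothetical names, values are fixed to `i64`, and it uses `std::collections::HashSet` only so the snippet compiles without extra crates (the PR itself uses `hashbrown::HashSet`).

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a `set_agg` accumulator.
/// Only distinct values are kept, so memory grows with the number of
/// unique inputs rather than with the number of rows.
struct SetAggSketch {
    seen: HashSet<i64>,
}

impl SetAggSketch {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Fold one batch of input values into the set; duplicates are dropped.
    fn update_batch(&mut self, values: &[i64]) {
        self.seen.extend(values.iter().copied());
    }

    /// Merge the partial state of another accumulator into this one.
    fn merge(&mut self, other: &SetAggSketch) {
        self.seen.extend(other.seen.iter().copied());
    }

    /// Produce the aggregated values. `HashSet` iteration order is
    /// unspecified, so the order of this vector is not stable.
    fn evaluate(&self) -> Vec<i64> {
        self.seen.iter().copied().collect()
    }
}

/// Test helper (an assumption, not part of the PR): sort the result so it
/// can be compared against an expected vector regardless of set order.
fn sorted(mut values: Vec<i64>) -> Vec<i64> {
    values.sort_unstable();
    values
}
```

In a test, comparing `sorted(acc.evaluate())` against a sorted expected vector sidesteps the ordering problem; collecting both sides into sets before comparing would be an equivalent alternative.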