Skip to content

Conversation

@james727
Copy link
Contributor

@james727 james727 commented Dec 31, 2021

👋 hi all - first PR here. Excited to contribute to this project.

Which issue does this PR close?

This closes #1323.

Rationale for this change

This provides an efficient way to aggregate unique values into an array. This is beneficial for aggregating low cardinality fields where array_agg may require significantly more memory than set_agg.

I mainly implemented this as a way to get familiar the codebase. Though - I'm not 100% sure merging this actually makes sense if the goal of the project is to be as Postgres-like as possible. set_agg is supported by Presto (as linked in the issue above) and other DBMS, but Postgres neither supports set_agg nor array_distinct. The recommended approach to something like set_agg in Postgres is to use array_agg, unnest the values, select distinct, then use array_agg again on the output.

Edit - looks like this is the general approach for getting unique items from an array, but for everything set_agg would work for array_agg(distinct <expr>) would work as well.

What changes are included in this PR?

This includes the implementation of set_agg and a couple tests. It borrows heavily from the patch that implemented array_agg: #1300

Open questions

There's a couple specific points I could use feedback on:

  • Using hashbrown::HashSet for the accumulator instead of std::collections::HashSet - it seems this is preferred in the codebase.
  • Tests - this was actually the most difficult part of writing this as output ordering of set_agg is nondeterministic. I managed to hack it together but I'm sure there's an easier way (for both integration and unit tests).
  • Documentation - In general, what docs need to be updated? And given this is a divergence from Postgres, is there anywhere specific this should be called out?

@james727
Copy link
Contributor Author

james727 commented Dec 31, 2021

Closing this for now as upon further review, I think array_agg(distinct <expr>) should work fine for this use case, but there are some bugs in how Datafusion handles non-column expressions in this syntax. Will open another issue when I figure out more.

@james727 james727 closed this Dec 31, 2021
@alamb
Copy link
Contributor

alamb commented Jan 1, 2022

Thanks for the contribution @james727 -- let us know if we can help out in some way 👍

@james727
Copy link
Contributor Author

james727 commented Jan 3, 2022

Thanks @alamb - I think I have a solution for the strange behavior with complex expressions in distinct aggregations in #1519 - if you or someone else could have a look and let me know if this makes sense or is totally off base, it would be much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

implement set_agg aggregation function

2 participants