Improved performance for streaming grouping with single string columns #9195

Open
alamb opened this issue Feb 11, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Feb 11, 2024

Is your feature request related to a problem or challenge?

Follow on to #7064

The GroupValues for aggregates need to handle "emitTo" for streaming groups so that they can flush groups that have already been built but will never be seen again.

The initial implementation of the specialized accumulator for Utf8/LargeUtf8 in #8827 is inefficient in that it copies / rehashes any strings remaining in the set after emission.

This is likely not a large performance overhead in practice, as most groups should be emitted, so only a few groups will need to be rehashed. However, if it turns out to be a problem, we can implement something more optimized.
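To illustrate the cost described above, here is a minimal sketch of the "emit the first n groups, then rebuild" approach. It is not the code from #8827: `StringGroups`, `intern`, and this `emit_first_n` are hypothetical names built on plain `std` collections, and it only assumes that group indexes are assigned in insertion order and must be renumbered from zero after emission.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the string group storage (not DataFusion's actual type).
struct StringGroups {
    /// Group values, in the order their group index was assigned.
    values: Vec<String>,
    /// Maps each group value to its group index (its position in `values`).
    index: HashMap<String, usize>,
}

impl StringGroups {
    fn new() -> Self {
        Self {
            values: Vec::new(),
            index: HashMap::new(),
        }
    }

    /// Returns the group index for `value`, assigning a new index if it is unseen.
    fn intern(&mut self, value: &str) -> usize {
        if let Some(&i) = self.index.get(value) {
            return i;
        }
        let i = self.values.len();
        self.values.push(value.to_owned());
        self.index.insert(value.to_owned(), i);
        i
    }

    /// Emits the first `n` groups and renumbers the remaining groups from zero.
    fn emit_first_n(&mut self, n: usize) -> Vec<String> {
        let n = n.min(self.values.len());
        let emitted: Vec<String> = self.values.drain(..n).collect();

        // The simple strategy: discard the whole table, then copy and rehash
        // every string that was not emitted. The cost is proportional to the
        // number (and total length) of the remaining groups.
        self.index.clear();
        for (i, v) in self.values.iter().enumerate() {
            self.index.insert(v.clone(), i);
        }
        emitted
    }
}
```

In a streaming aggregation, most groups are emitted on each call, so the rebuild loop typically touches only a handful of strings. A more optimized version would avoid the copy and rehash, at the price of more intricate bookkeeping.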

Describe the solution you'd like

Optimize emitTo for binary groups

#9188

Describe alternatives you've considered

I have one proposal in #9188 (look at ArrowStringSet::emit_first_n). It works and passes tests, but I think it is very complicated, and it is hard to convince oneself that the unsafe usage is sound.

Additional context

No response
