This repository was archived by the owner on May 9, 2024. It is now read-only.

Use multifrag kernels on CPU for groupby with a big result. #505

Merged
merged 1 commit into from
May 26, 2023

Conversation

ienkovich
Contributor

The basic idea is to avoid using multiple kernels when each kernel would produce a hash table comparable in size to the whole input data. In that case execution goes in a single thread, and we avoid a very costly reduction.

I'm sure my criteria are too strict and we might benefit from a single kernel in many more cases, but this change at least allows us to run queries like H2O Q10 on big data sets until we switch to better aggregation algorithms.
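
As a rough illustration of the kind of check described above, here is a minimal Python sketch. It is not the code from this PR: the names `GroupByEstimate` and `use_single_kernel`, and the `ratio_threshold` value, are hypothetical, and the real criteria may differ. It only assumes that an estimated group count and the total input row count are available when the decision is made.

```python
from dataclasses import dataclass


@dataclass
class GroupByEstimate:
    """Hypothetical inputs to the decision (names are assumptions)."""
    input_rows: int        # total rows across all fragments
    estimated_groups: int  # estimated number of distinct group keys


def use_single_kernel(est: GroupByEstimate, ratio_threshold: float = 0.5) -> bool:
    """Return True when one kernel over all fragments looks preferable.

    If the expected hash table is comparable in size to the input, every
    per-fragment kernel would build a table of roughly that size, and
    reducing those tables would dominate the run time.
    The threshold value is an illustrative assumption, not a tuned constant.
    """
    if est.input_rows <= 0:
        return False  # nothing to aggregate, keep the default path
    return est.estimated_groups >= ratio_threshold * est.input_rows


# A high-cardinality group by (such as the H2O Q10 case mentioned above)
# takes the single-kernel path; a low-cardinality one keeps multiple kernels.
high_cardinality = GroupByEstimate(input_rows=1_000_000, estimated_groups=900_000)
low_cardinality = GroupByEstimate(input_rows=1_000_000, estimated_groups=100)
print(use_single_kernel(high_cardinality), use_single_kernel(low_cardinality))
```

A stricter threshold keeps more queries on the multi-kernel path, which matches the conservative intent stated above.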

Contributor

@alexbaden alexbaden left a comment


We discussed a bit in person yesterday -- this seems fine for now. I think it would be worthwhile to remove the extra group by buffers when we have baseline hash if we know the estimator query ran. The estimator query should tell us how many entries are in the buffer, making a group by buffer per fragment unnecessary.
We discussed adding number of fragments to the heuristic but we aren't sure it would matter.

@ienkovich ienkovich merged commit 73bb814 into main May 26, 2023
@ienkovich ienkovich deleted the ienkovich/multifrag-cpu-groupby branch May 26, 2023 17:50