Use multifrag kernels on CPU for groupby with a big result. #505

ienkovich · 2023-05-24T18:29:27Z

The basic idea is to avoid using multiple kernels when each kernel would produce a hash table comparable in size to the whole input data. In that case execution goes in a single thread with one multifrag kernel, and we avoid a very costly reduction.

I'm sure my criteria are too strict and we might benefit from a single kernel in many more cases, but this one at least would allow us to run queries like H2O Q10 on big data sets until we switch to better aggregation algorithms.
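To make the criterion concrete, here is a minimal sketch of the kind of check described above. It is not the code from this PR: the function name, its parameters, and the `ratio_threshold` default are all hypothetical, and the real check may use different inputs.

```python
def use_single_multifrag_kernel(
    input_rows: int,
    estimated_groups: int,
    ratio_threshold: float = 0.5,  # hypothetical cut-off, not taken from the PR
) -> bool:
    """Decide whether one multifrag kernel looks cheaper than per-fragment kernels.

    Assumption for this sketch: every per-fragment kernel would build a hash
    table with roughly `estimated_groups` entries, so when that count is
    comparable to the input size, the reduction of those tables costs about
    as much as the aggregation itself.
    """
    if input_rows <= 0:
        return False  # nothing to aggregate, keep the default path
    return estimated_groups >= ratio_threshold * input_rows


# Nearly every row forms its own group: prefer a single kernel.
print(use_single_multifrag_kernel(input_rows=1_000_000, estimated_groups=900_000))
# Few groups: per-fragment kernels and a cheap reduction are fine.
print(use_single_multifrag_kernel(input_rows=1_000_000, estimated_groups=100))
```

A deliberately high threshold matches the intent of keeping the criterion strict: only queries whose result is about as large as the input switch to the single-threaded path.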

Signed-off-by: ienkovich <[email protected]>

alexbaden

We discussed this a bit in person yesterday, and it seems fine for now. I think it would be worthwhile to remove the extra group by buffers when we have baseline hash, if we know the estimator query ran. The estimator query should tell us how many entries are in the buffer, making a group by buffer per fragment unnecessary.
We discussed adding the number of fragments to the heuristic, but we aren't sure it would matter.
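As a rough illustration of why a group by buffer per fragment is wasteful once the entry count is known, the sketch below compares total buffer entries in the two layouts. It assumes each buffer is sized for the full estimated entry count; the function and parameter names are hypothetical and do not come from the code base.

```python
def group_by_buffer_entries(
    estimated_entries: int, fragment_count: int, per_fragment: bool
) -> int:
    """Total entries allocated across group by buffers.

    Assumption for this sketch: each buffer holds `estimated_entries` slots,
    whether there is one shared buffer or one buffer per fragment.
    """
    buffers = fragment_count if per_fragment else 1
    return buffers * estimated_entries


# With 32 fragments, per-fragment buffers need 32x the entries of one shared buffer.
per_fragment_total = group_by_buffer_entries(100_000_000, 32, per_fragment=True)
shared_total = group_by_buffer_entries(100_000_000, 32, per_fragment=False)
print(per_fragment_total // shared_total)
```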

Use multifrag kernels on CPU for groupby with a big result.

678eb24

Signed-off-by: ienkovich <[email protected]>

ienkovich requested review from alexbaden and kurapov-peter May 24, 2023 18:29

alexbaden approved these changes May 26, 2023

ienkovich merged commit 73bb814 into main May 26, 2023

ienkovich deleted the ienkovich/multifrag-cpu-groupby branch May 26, 2023 17:50
