Rewrite groupby aggregations in cudf-polars to simplify evaluation #18369
Now we only see aggregations of columns in the groupby agg requests.
Now that we support more things, we should test them.
TomAugspurger
left a comment
Thanks for putting this up. I think I somewhat follow things and it seems to mostly make sense.
One meta question: How careful are we being with the public API for cudf_polars? If we're adding things like cudf_polars.dsl.utils.utils.apply_pre_evaluation to the public API, then I might recommend returning something like a dataclass instead of a tuple, to give us some flexibility if we need to change the return type in the future. If we don't consider that part of the public API, then maybe we make them private?
I guess it depends what you consider to be "public". cudf-polars has no user-facing API, so in that sense everything is private.
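A minimal sketch of the trade-off raised in the review above, not the code in this PR. The names `PreEvaluationResult`, `pre_evaluate_as_tuple`, and `pre_evaluate_as_dataclass`, and all field names, are hypothetical; the actual signature of `apply_pre_evaluation` is not shown in this conversation. The point is only that callers who unpack a tuple break when the return value grows, while callers who read attributes of a dataclass do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreEvaluationResult:
    """Hypothetical return type; field names are illustrative only."""

    frame: str
    aggregations: tuple
    # A field added later, with a default, does not break existing callers.
    post_aggregations: tuple = ()


def pre_evaluate_as_tuple():
    # Hypothetical helper whose return value has grown from 2 to 3 elements.
    return ("df", ("sum",), ())


def pre_evaluate_as_dataclass():
    # Hypothetical helper returning the dataclass instead.
    return PreEvaluationResult(frame="df", aggregations=("sum",))


# A caller written against the old two-element tuple now fails to unpack.
try:
    frame, aggregations = pre_evaluate_as_tuple()
    tuple_caller_ok = True
except ValueError:
    tuple_caller_ok = False

# A caller using attribute access is unaffected by the extra field.
result = pre_evaluate_as_dataclass()
dataclass_caller_ok = result.frame == "df" and result.aggregations == ("sum",)
```

If these helpers are treated as private, as the reply suggests, the tuple form costs little, since every caller lives in the same package and can be updated together.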
Although it acts pointwise, because inputs can be scalar and then broadcast, it doesn't commute through grouped aggregations.
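A small illustration of why broadcasting breaks commutation with aggregation. The comment does not name the operation in this extract, so the sketch uses addition as a stand-in, and `pointwise_add` is a hypothetical helper written for this example. Aggregating after the pointwise step counts the broadcast scalar once per row, whereas aggregating the inputs first counts it once per group.

```python
def pointwise_add(left, right):
    """Add two columns elementwise, broadcasting any length-1 input."""
    n = max(len(left), len(right))
    left = left * n if len(left) == 1 else left
    right = right * n if len(right) == 1 else right
    return [a + b for a, b in zip(left, right)]


group = [1, 2, 3]  # the rows of one group
scalar = [10]      # a scalar input that gets broadcast

# Pointwise op first, then aggregate: (1+10) + (2+10) + (3+10)
aggregate_after = sum(pointwise_add(group, scalar))

# Aggregate each input first, then apply the pointwise op: 6 + 10
aggregate_before = pointwise_add([sum(group)], [sum(scalar)])[0]
```

The two orders give 36 and 16 respectively, so the aggregation cannot simply be pushed through to the inputs.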
I think this is ready @TomAugspurger / @rjzamora
TomAugspurger
left a comment
Take my review with a grain of salt since I'm still getting up to speed on this, but it seems to make sense.
rjzamora
left a comment
Thanks Lawrence! I took a pass at everything besides python/cudf_polars/cudf_polars/dsl/utils/aggregations.py. Looks really good so far.
I also pulled this into an experimental branch and ran some multi-GPU TPC-H tests. It seems to work as expected.
Matt711
left a comment
Thanks Lawrence! Non-blocking question and minor suggestions.
…olars-rewrite-groupby
This was always a hack and now we can remove it.
/merge
Description
Since we can only aggregate expressions that produce a single column, grouped aggregations can be split into three parts: "pointwise expressions we can pre-evaluate", "aggregations on such expressions", and "pointwise expressions on aggregations". Rather than doing an ad-hoc post-aggregation in the groupby evaluation, we instead split a groupby node from polars into a groupby of "intermediate" aggregations followed by post-aggregations (if necessary).
This also simplifies the implementation for the partitioned case, and lays the groundwork for the same setup when we introduce rolling aggregations.
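The three-way split described above can be sketched in plain Python. This is a toy model under stated assumptions, not the cudf-polars implementation: `grouped_agg` and its `pre`, `agg`, and `post` parameters are hypothetical names, and columns are modelled as Python lists.

```python
from collections import defaultdict


def grouped_agg(keys, values, pre, agg, post):
    """Toy three-phase grouped aggregation (illustrative only)."""
    # Phase 1: pointwise expression, pre-evaluated on the ungrouped column.
    pre_evaluated = [pre(v) for v in values]

    # Phase 2: "intermediate" aggregation of a single column per group.
    groups = defaultdict(list)
    for key, value in zip(keys, pre_evaluated):
        groups[key].append(value)
    intermediate = {key: agg(vals) for key, vals in groups.items()}

    # Phase 3: pointwise post-aggregation on the aggregated results.
    return {key: post(a) for key, a in intermediate.items()}


keys = ["a", "b", "a", "b"]
values = [1, 2, 3, 4]

# Roughly the shape of an expression like (x + 1).sum() * 2 per group.
result = grouped_agg(
    keys,
    values,
    pre=lambda v: v + 1,
    agg=sum,
    post=lambda s: s * 2,
)
```

Here `result` is `{"a": 12, "b": 16}`. Because phase 2 only ever sees plain aggregations of columns, the groupby step itself stays simple, and phases 1 and 3 are ordinary pointwise evaluations on either side of it.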
Checklist