Skip to content

Rewrite groupby aggregations in cudf-polars to simplify evaluation#18369

Merged
rapids-bot[bot] merged 25 commits intorapidsai:branch-25.06from
wence-:wence/fea/polars-rewrite-groupby
Apr 30, 2025
Merged

Rewrite groupby aggregations in cudf-polars to simplify evaluation#18369
rapids-bot[bot] merged 25 commits intorapidsai:branch-25.06from
wence-:wence/fea/polars-rewrite-groupby

Conversation

@wence-
Copy link
Contributor

@wence- wence- commented Mar 25, 2025

Description

Since we can only aggregate expressions that produce a single column, grouped aggregations can be split into "pointwise expressions we can pre-evaluate", "aggregations on such expression", "pointwise expressions on aggregations". Rather than doing an ad-hoc post-aggregation in the groupby evaluation, instead split a groupby node from polars into groupby of "intermediate" aggregations and then post-aggregations (if necessary).

This simplifies the implementation for the partitioned case as well, and lays the groundwork for the same setup when we will introduce rolling aggregations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner March 25, 2025 18:20
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Mar 25, 2025
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 25, 2025
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this up. I think I somewhat follow things and it seems to mostly make sense.

One meta question: How careful are we being with the public API for cudf_polars? If we're adding things like cudf_polars.dsl.utils.utils.apply_pre_evaluation to the public API, then I might recommend returning something like a dataclass instead of a tuple, to give us some flexibility if we need to change the return type in the future. If we don't consider that part of the public API, then maybe we make them private?

@wence-
Copy link
Contributor Author

wence- commented Mar 26, 2025

One meta question: How careful are we being with the public API for cudf_polars? If we're adding things like cudf_polars.dsl.utils.utils.apply_pre_evaluation to the public API, then I might recommend returning something like a dataclass instead of a tuple, to give us some flexibility if we need to change the return type in the future. If we don't consider that part of the public API, then maybe we make them private?

I guess it depends what you consider to be "public". cudf-polars has no user-facing API, so in that sense everything is private.

@wence-
Copy link
Contributor Author

wence- commented Apr 9, 2025

I think this is ready @TomAugspurger / @rjzamora

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take my review with a grain of salt since I'm still getting up to speed on this, but it seems to make sense.

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Lawrence! I took a pass at everything besides python/cudf_polars/cudf_polars/dsl/utils/aggregations.py - Looks really good so far.

I also pulled this into an experimental branch and did some multi-gpu tpch tests - Seems to work as expected.

Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Lawrence! Non blocking question, minor suggestions

@wence-
Copy link
Contributor Author

wence- commented Apr 29, 2025

/merge

@rapids-bot rapids-bot bot merged commit 4c737aa into rapidsai:branch-25.06 Apr 30, 2025
119 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FEA] Broaden set of supported expressions in groupby-aggregation for cudf-polars

4 participants

Comments