Skip to content

Support multi-partition Select operations with aggregations#18492

Merged
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.06from
rjzamora:complex-aggregations-ir
Apr 29, 2025
Merged

Support multi-partition Select operations with aggregations#18492
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.06from
rjzamora:complex-aggregations-ir

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Apr 14, 2025

Description

This PR supersedes #17941

In contrast to 17941, this PR does not introduce any new task-graph logic. Instead, complex expression graphs (expression graphs containing non-pointwise nodes) are decomposed into multiple IR nodes.

The design used in this PR is probably more intuitive than FusedExpr concept.

Illustration

Aggregations_small drawio

TODO:

  • Test that performance is similar to FusedExpr.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change cudf-polars Issues specific to cudf-polars labels Apr 14, 2025
@rjzamora rjzamora self-assigned this Apr 14, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Apr 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 14, 2025
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file comes from #18369

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file comes from #18369

@rjzamora
Copy link
Member Author

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Apr 14, 2025

/ok to test

@rjzamora, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@rjzamora rjzamora marked this pull request as ready for review April 14, 2025 20:11
@rjzamora rjzamora requested a review from a team as a code owner April 14, 2025 20:11
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good I think. I have relatively minor comments

Comment on lines +106 to +109
# Try decomposing the underlying expressions
return decompose_select(
ir, child, partition_info, rec.state["config_options"]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that you could do the rewrite as a pass over the IR representation before lowering and assigning partitioning (which I think would make things a bit simpler), but I guess you don't want to do the "complicated" thing if there's only a single partition so we need to do things here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the fact that we don't handle iterative lowering at the moment, so it would take a larger diff to make something like that work. We would run into Select(Agg) nodes during the lowering stage and need to know (or deduce) that those nodes are already decomposed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I do think your right that the key problem is that you don't want to do the decomposition unless you need to.

Comment on lines 198 to 207
See Also
--------
_add_select_ir
_decompose_expr_node

Notes
-----
This function is called by ``_decompose_expr_node`` to decompose
an Agg node into multiple IR nodes. The new IR nodes are added
with ``_add_select_ir``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: As above, I am not sure these xrefs add much to locally understanding how to use this function/what it does, and would perhaps be better served by an overview module-level docstring/comment.

@rjzamora rjzamora added 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress labels Apr 25, 2025
@rjzamora rjzamora requested a review from wence- April 25, 2025 16:47
Comment on lines +240 to +255
columns, input_ir, partition_info = select(
[Cast(agg.dtype, agg)],
input_ir,
partition_info,
names=names,
repartition=True,
)

# Combined stage
(column,) = columns
columns, input_ir, partition_info = select(
[Agg(agg.dtype, "sum", None, column)],
input_ir,
partition_info,
names=names,
)
Copy link
Contributor

@wence- wence- Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To understand this, is there a reason we can't put the cast into the final sum aggregation? Ah, it's the repartitioning step?

Comment on lines +372 to +374
schema: MutableMapping[str, Any] = {}
for ir in unique_input_irs:
schema.update(ir.schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (maybe a followup): should we check that none of the column names overlap?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work here @rjzamora! Looks good to me

@wence-
Copy link
Contributor

wence- commented Apr 29, 2025

/merge

@rapids-bot rapids-bot bot merged commit c40ca62 into rapidsai:branch-25.06 Apr 29, 2025
111 checks passed
@rjzamora rjzamora deleted the complex-aggregations-ir branch April 29, 2025 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 - Needs Review Waiting for reviewer to review or respond cudf-polars Issues specific to cudf-polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants