feat(expr-ir): Support `group_by`, utilize `pyarrow.acero` #3143

dangotbanned · 2025-09-18T18:51:34Z

What type of PR is this? (check all applicable)

✨ Feature

Related issues

Child of feat(RFC): A richer Expr IR #2572

Description

This one spiralled into something pretty interesting IMO 😏

Acero is used under-the-hood for pyarrow features like:

- `Table.group_by` (most important to this PR)
  - See https://github.com/narwhals-dev/narwhals/blob/e0a3684ba4f77a59fe45d2914c74b4cff25cf344/narwhals/_plan/arrow/group_by.py
- `Table.filter`
- `Table.{join,join_asof}`

But it also has limited support for other operations like:

I've been experimenting with those a bit in (https://github.com/narwhals-dev/narwhals/blob/e0a3684ba4f77a59fe45d2914c74b4cff25cf344/narwhals/_plan/arrow/acero.py).
Anticipating that combining those nodes into one big soup will result in over(...) 😄

Tasks

Everything here seems to be working already? 😱
May as well show it off

```py
>>> df.group_by("a", nwp.nth(2, 8)).agg(nwp.mean("d", "e", "g").name.suffix("_mean"))
NotImplementedError: TODO: `GroupBy.agg` needs a `CompliantGroupBy` to dispatch to:
keys: (a=col('a'), c=col('c'), i=col('i'))
aggs: (d_mean=col('d').mean(), e_mean=col('e').mean(), g_mean=col('g').mean())
result_schema: FrozenSchema([
    ('a', String),
    ('c', Int64),
    ('i', Unknown),
    ('d_mean', Unknown),
    ('e_mean', Unknown),
    ('g_mean', Unknown),
])
```

Just pushing this as tests are working.

Useful changes to follow:

- Column renaming stuff will be avoidable
- we just use `ArrowAggExpr.output_name`
- Awkward stuff `first`, `last`, `_ensure_single_thread` can be avoided
- `use_threads` was always available on `Declaration.to_table`
- Whether we need to use can just be an `__ior__`

dangotbanned added 10 commits September 15, 2025 18:19

feat(expr-ir): Getting started on GroupBy

4d33b68

Mapping things out a bit, no compliant yet

feat(DRAFT): mock up resolve_group_by

3718690

There's a few gaps, but overall surprised how much was reusable 🥳

fix: re-sync GroupByKeys

f70c021

feat: Make rewrite_projections(keys) optional

3828ea4

feat: Add FrozenSchema.merge

feb7661

lol didn't realise it was just describing python dict behavior
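
A small sketch of the dict behavior in question, assuming the comment refers to plain dict update semantics. `merge_schemas` is a hypothetical stand-in using ordinary dicts, not the actual `FrozenSchema.merge`.

```python
def merge_schemas(left: dict, right: dict) -> dict:
    """Hypothetical helper mirroring dict update semantics.

    Existing keys keep the position where they were first inserted,
    their values are replaced by the right-hand side, and new keys
    are appended at the end.
    """
    return {**left, **right}


merged = merge_schemas(
    {"a": "String", "b": "Int64"},
    {"b": "Float64", "c": "Boolean"},
)
```

Here `b` stays in second position but takes the type from the right-hand side, and `c` lands at the end.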

feat(DRAFT): Start spec-ing CompliantGroupBy

7a811b6

Quite different to current version(s)

feat(DRAFT): Implement some of ArrowGroupBy

6d3c0a9

feat(DRAFT): Fill out more of GroupBy.agg

e71d092

Merge branch 'oh-nodes' into expr-ir/group-by

9179ec3

dangotbanned added the enhancement (New feature or request) label Sep 18, 2025

dangotbanned mentioned this pull request Sep 18, 2025

feat(RFC): A richer Expr IR #2572

Draft

75 tasks

fix: avoid typing_extensions import

8aaf9a9

oops https://github.com/narwhals-dev/narwhals/actions/runs/17838467107/job/50721552166?pr=3143

dangotbanned added the internal label Sep 18, 2025

dangotbanned added 16 commits September 18, 2025 20:40

refactor: Move ArrowGroupBy

d9b918f

Gonna need space for the mini translator

feat(DRAFT): Simple cases working?

767261c

Borrowing some ideas from #2528, #2680

feat(expr-ir): Add missing Expr.len

648d5d9

Woops, making it a separate node rather than having a flag: https://github.com/pola-rs/polars/blob/cdd247aaba8db3332be0bd031e0f31bc3fc33f77/crates/polars-plan/src/dsl/mod.rs#L872-L889

feat(expr-ir): Support nw.len()

2682b10

https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/narwhals/_arrow/group_by.py#L132-L141

feat(expr-ir): support auto-implode

e1c3145

#2660 (comment)

feat(DRAFT): Support nw.col("a").unique() in group_by

5d36607

`pyarrow` has the same behavior as `polars`

test: Port over tests/frame/group_by_test

1aa2464

Wasn't expecting so much to be working already 🥳 🥳 🥳

cov

45a816f

https://github.com/narwhals-dev/narwhals/actions/runs/17869832358/job/50820920705?pr=3143

chore: Update todo

8f2ad50

chore: Add todo for drop_null_keys=True

4650456

fix: Avoid shadowed output aggregation names

ce18f51

feat(expr-ir): Rewrite, fix ordered aggregations

16148b2

test: Port over first, last group_by tests

ce86f8f

From #2528 https://github.com/narwhals-dev/narwhals/blob/0fb045536f5b56b978f354f8178b292301e9598c/tests/frame/group_by_test.py#L686-L781

test: Add failing drop_null_keys, __iter__ tests

581e511

feat(expr-ir): Support group_by(drop_null_keys=True)

4b77500

`ArrowDataFrame.drop_nulls` is shorter and waaaaaaay more efficient than `main`

dangotbanned added 10 commits September 29, 2025 13:45

lil progress on order_by, sort_by

aae3936

In theory, we should be able to compose `over()` using combinations of:

- aggregate (both scalar and hash)
- order_by
- project
- hashjoin
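
To make the idea concrete, here is a rough pure-Python sketch of how a window-style result could fall out of those building blocks. It uses no Acero nodes at all; `mean_over` and the row layout are hypothetical and only illustrate the aggregate, then join back, then project sequence.

```python
from collections import defaultdict


def mean_over(rows, key, value, output_name):
    """Hypothetical helper: broadcast a per-group mean back onto every row."""
    # "aggregate" (hash): one running [total, count] per distinct key
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = totals[row[key]]
        acc[0] += row[value]
        acc[1] += 1
    means = {k: total / count for k, (total, count) in totals.items()}
    # "hashjoin" + "project": look up each row's group result and
    # append it as a new column, keeping the original row order
    return [{**row, output_name: means[row[key]]} for row in rows]


rows = [{"g": "x", "v": 1}, {"g": "y", "v": 10}, {"g": "x", "v": 3}]
result = mean_over(rows, "g", "v", "v_mean")
```

The output has the same number of rows as the input, which is the property that separates `over()` from a plain `group_by().agg()`.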

docs: Leave more useful notes in group_by

cb51c67

refine typing

d046981

address some acero todos

6bacae1

improve project parsing + docs

f568ce0

finish most remaining acero todos

f04146d

make __getattr__ more visible

575f07f

refactor: Move new DataFrame impls up

2e04ed5

docs: Explain new group bits

e111cef

chore: move all group_by stuff to the end

0f63dbf

Will make for an easier diff in the PR that splits up this mess

dangotbanned commented Sep 29, 2025

tests/plan/group_by_test.py (outdated, resolved)

dangotbanned added 2 commits September 29, 2025 18:36

minor nits

6b3018e

feat: Improve error reporting

e0a3684

Resolves (#3143 (comment))

dangotbanned changed the title ~~feat(expr-ir): Support group_by~~ feat(expr-ir): Support group_by, utilize pyarrow.acero Sep 29, 2025

dangotbanned added the pyarrow (Issue is related to pyarrow backend) label Sep 29, 2025

dangotbanned marked this pull request as ready for review September 29, 2025 22:05

dangotbanned mentioned this pull request Sep 29, 2025

[Python] Adopt a more pythonic method for UDF registration apache/arrow#33883

Open

dangotbanned added 2 commits September 30, 2025 13:10

feat: Improve temp.column_name error message

bfd0fe8

test: Add tests for temp.column_name

1c4210b

Towards #3143 (comment)

dangotbanned commented Sep 30, 2025

narwhals/_plan/common.py (outdated, resolved)

dangotbanned added 4 commits September 30, 2025 16:33

fix: Omit indent before 3.12

1fe1a89

Resolves #3143 (comment)

fix: welp no constructor

0c8921b

test: Add test_temp_column_names_failed_unique

8b605ed

well, move it out of the docstring is more accurate i suppose

test: More temp.column_names tests

cef3e06

dangotbanned mentioned this pull request Sep 30, 2025

GH-32609: [Python] Add type annotations to PyArrow apache/arrow#47609

Open

docs: Add meaningful examples to temp.column_name

79e49f4

Still need to do `temp.column_names` as well. That one is quite different to #3147, but was needed in this PR.

dangotbanned merged commit 8208d32 into oh-nodes Oct 1, 2025
28 of 32 checks passed

dangotbanned deleted the expr-ir/group-by branch October 1, 2025 10:46

This was referenced Oct 2, 2025

feat: Adds {Expr,Series}.{first,last} #2528

Merged

feat(expr-ir): Acero order_by, hashjoin, DataFrame.{filter,join}, Expr.is_{first,last}_distinct #3173

Merged
