@dangotbanned dangotbanned commented Sep 18, 2025

What type of PR is this? (check all applicable)

  • ✨ Feature

Related issues

Description

This one spiralled into something pretty interesting IMO 😏

Acero is used under-the-hood for pyarrow features like:

But it also has limited support for other operations like:

I've been experimenting with those a bit in (https://github.com/narwhals-dev/narwhals/blob/e0a3684ba4f77a59fe45d2914c74b4cff25cf344/narwhals/_plan/arrow/acero.py).
Anticipating that combining those nodes into one big soup will result in over(...) 😄

Tasks

Mapping things out a bit, no compliant yet
There's a few gaps, but overall surprised how much was reusable 🥳
lol didn't realise it was just describing python dict behavior
Everything here seems to be working already? 😱

May as well show it off

```py
>>> df.group_by("a", nwp.nth(2, 8)).agg(nwp.mean("d", "e", "g").name.suffix("_mean"))
NotImplementedError: TODO: `GroupBy.agg` needs a `CompliantGroupBy` to dispatch to:

keys:
(a=col('a'), c=col('c'), i=col('i'))

aggs:
(d_mean=col('d').mean(), e_mean=col('e').mean(), g_mean=col('g').mean())

result_schema:
FrozenSchema([
 ('a', String),
 ('c', Int64),
 ('i', Unknown),
 ('d_mean', Unknown),
 ('e_mean', Unknown),
 ('g_mean', Unknown),
])
```
Quite different to current version(s)
@dangotbanned dangotbanned added the enhancement New feature or request label Sep 18, 2025
@dangotbanned dangotbanned mentioned this pull request Sep 18, 2025
75 tasks
Gonna need space for the mini translator
Borrowing some ideas from #2528, #2680
`pyarrow` has the same behavior as `polars`
Wasn't expecting so much to be working already 🥳 🥳 🥳
Just pushing this as tests are working.

Useful changes to follow:
- Column renaming stuff will be avoidable
  - we just use `ArrowAggExpr.output_name`
- Awkward stuff `first`, `last`, `_ensure_single_thread` can be avoided
  - `use_threads` was always available on `Declaration.to_table`
  - Whether we need to use it can just be an `__ior__`
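
A rough sketch of the `__ior__` idea from the list above, under loud assumptions: `OrderSensitivity`, `ORDER_DEPENDENT` and `use_threads_for` are hypothetical names invented for this illustration and are not the PR's implementation. Each aggregation reports whether it depends on row order, and a small flag object accumulates that with `|=`.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical: aggregations whose result depends on row order.
ORDER_DEPENDENT = frozenset({"first", "last"})


@dataclass
class OrderSensitivity:
    """Tracks whether any aggregation seen so far needs ordered execution."""

    ordered: bool = False

    def __ior__(self, other: object) -> OrderSensitivity:
        # Once any aggregation needs ordering, the flag stays set.
        self.ordered = self.ordered or bool(other)
        return self

    def __bool__(self) -> bool:
        return self.ordered


def use_threads_for(function_names: list[str]) -> bool:
    """Return the value one might pass as `use_threads`."""
    needs_order = OrderSensitivity()
    for name in function_names:
        needs_order |= name in ORDER_DEPENDENT
    # Multi-threading is only safe when nothing depends on row order.
    return not needs_order
```

The result could then be forwarded to `Declaration.to_table(use_threads=...)`, so no separate `_ensure_single_thread` step would be needed.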
`ArrowDataFrame.drop_nulls` is shorter and waaaaaaay more efficient than `main`
@dangotbanned dangotbanned changed the title from `feat(expr-ir): Support group_by` to `feat(expr-ir): Support group_by, utilize pyarrow.acero` Sep 29, 2025
@dangotbanned dangotbanned added the pyarrow Issue is related to pyarrow backend label Sep 29, 2025
@dangotbanned dangotbanned marked this pull request as ready for review September 29, 2025 22:05
Still need to do `temp.column_names` as well
That one is quite different to #3147, but was needed in this PR
@dangotbanned dangotbanned merged commit 8208d32 into oh-nodes Oct 1, 2025
28 of 32 checks passed
@dangotbanned dangotbanned deleted the expr-ir/group-by branch October 1, 2025 10:46

Labels

enhancement (New feature or request), internal, pyarrow (Issue is related to pyarrow backend), tests
