Merged

Changes from 106 commits (156 commits total)
ff661ae
chore: Add `CompliantExpr.first`
dangotbanned May 10, 2025
1b77bd7
feat: "Implement" `PolarsExpr.First`
dangotbanned May 10, 2025
e84cba3
feat: Add `EagerExpr.first`
dangotbanned May 10, 2025
25ef241
chore: Repeat for `*Series`
dangotbanned May 10, 2025
78822aa
feat: Add `(Arrow|PandasLike)Series.first()`
dangotbanned May 10, 2025
4075c50
chore: Mark `LazyExpr.first` as `not_implemented` for now
dangotbanned May 10, 2025
45f24b9
feat: Add `SparkLikeExpr.first`
dangotbanned May 10, 2025
4041dd1
feat: Add `DuckDBExpr.first`
dangotbanned May 10, 2025
bb9912d
feat: Add `DaskExpr.first`
dangotbanned May 10, 2025
6a53aa1
revert: 4075c50f2496ab9908b25dc15e240650bc686dc0
dangotbanned May 10, 2025
4efc939
feat: Add `nw.Series.first`
dangotbanned May 10, 2025
fc149c1
test: Add `Series.first` tests
dangotbanned May 10, 2025
7489e61
fix: I guess the stubs were wrong then?
dangotbanned May 10, 2025
d2719a4
fix: Handle the out-of-bounds case
dangotbanned May 10, 2025
0af11db
fix: `polars` backcompat
dangotbanned May 10, 2025
afe20f0
docs: Add `Series.first`
dangotbanned May 10, 2025
6c0bd6f
lol version typo
dangotbanned May 10, 2025
e0fdf78
cov
dangotbanned May 10, 2025
aa7c510
chore: Add `nw.Expr.first`
dangotbanned May 11, 2025
4fdc0aa
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 11, 2025
bd4ab89
feat: Maybe `SparkLike` requires `order_by`?
dangotbanned May 11, 2025
9f7f5a9
test: Try out eager backends
dangotbanned May 11, 2025
ddb50d2
Merge branch 'main' into expr-first
dangotbanned May 11, 2025
7146f60
test: Add mostly broken lazy tests 😒
dangotbanned May 11, 2025
8c24e6e
feat: `duckdb` support?
dangotbanned May 11, 2025
54a4cb4
test: Update xfails
dangotbanned May 11, 2025
63e0459
fix: Use `head(1)` in `DaskExpr`
dangotbanned May 11, 2025
9493aad
ignore cov
dangotbanned May 11, 2025
88535a4
Apply suggestion
dangotbanned May 11, 2025
77ae9c0
test: Remove dask `xfail`
dangotbanned May 11, 2025
c1a6173
revert: Remove `dask` implementation
dangotbanned May 11, 2025
3c4ff9b
refactor(typing): Use `PythonLiteral` for `Series` return
dangotbanned May 11, 2025
696e35d
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
b2866d2
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
cd002f3
test: Add `test_group_by_agg_first`
dangotbanned May 12, 2025
1458530
feat(DRAFT): Start trying `pyarrow` `agg(first())`
dangotbanned May 12, 2025
962ebcd
fix: Maybe `pyarrow` support?
dangotbanned May 12, 2025
5d310bc
refactor: Add `ArrowGroupBy._configure_agg`
dangotbanned May 12, 2025
a417341
fix: Add `pyarrow` compat for `first`
dangotbanned May 12, 2025
354da1a
fix: Don't support below `14` ever
dangotbanned May 12, 2025
0cea41b
test: Add some `None` cases
dangotbanned May 12, 2025
5229096
feat(DRAFT): Partial support for `pandas`
dangotbanned May 12, 2025
8d3aaec
docs: Tidy error and comments
dangotbanned May 12, 2025
a62e3ef
Merge branch 'main' into expr-first
dangotbanned May 12, 2025
9c36285
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 13, 2025
ad8e3f7
test: xfail `ibis`
dangotbanned May 13, 2025
628f71e
feat: Add `IbisExpr.first`
dangotbanned May 13, 2025
deacc71
test: Don't xfail for `pandas<1.0.0`
dangotbanned May 13, 2025
5c52ee4
Merge branch 'main' into expr-first
dangotbanned May 14, 2025
eec2a4f
Merge branch 'main' into expr-first
dangotbanned May 16, 2025
e003bab
Merge branch 'main' into expr-first
dangotbanned May 16, 2025
fb2dc1c
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned May 18, 2025
211673b
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jun 3, 2025
652615f
fix: Use reverted `partition_by`, `_sort`
dangotbanned Jun 13, 2025
68fdfe8
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jun 13, 2025
ecaca9a
fix: Update `DuckDBExpr.first`
dangotbanned Jun 15, 2025
ea30f26
fix: Update `IbisExpr.first`
dangotbanned Jun 15, 2025
12987ee
fix: Update `SparkLikeExpr.first`
dangotbanned Jun 15, 2025
7d70a42
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jun 15, 2025
5446095
test: Update `pandas` xfail
dangotbanned Jun 15, 2025
b927340
Merge branch 'main' into expr-first
dangotbanned Jun 20, 2025
f62c085
test: Don't xfail for pandas `1.1.3<=...<1.1.5`
dangotbanned Jun 20, 2025
45d20c8
Merge branch 'main' into expr-first
dangotbanned Jun 21, 2025
72ab185
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jun 29, 2025
e72b115
fix: Upgrade `DuckDBExpr.first` again
dangotbanned Jun 29, 2025
fae137c
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 19, 2025
cb363be
test(DRAFT): Let's start trying to fix pandas
dangotbanned Jul 19, 2025
bc80a5f
try `pandas>=2.2.1` path
dangotbanned Jul 19, 2025
14051fa
allow very old pandas that worked?
dangotbanned Jul 19, 2025
3d42dcf
test: xfail `pandas[pyarrow]`, `modin[pyarrow]`
dangotbanned Jul 19, 2025
934d09e
Apply suggestion narwhals/_polars/series.py
dangotbanned Jul 20, 2025
3fbf6f2
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 20, 2025
801a7a8
docs: Be more explicit on WIP `pandas`
dangotbanned Jul 20, 2025
47bfaba
docs: Link to long explanation
dangotbanned Jul 20, 2025
4618d01
revert: remove lazy support
dangotbanned Jul 20, 2025
1998ad2
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 20, 2025
570cdaf
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 20, 2025
d561027
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 21, 2025
b77d2b3
try `nth` for `>=1.1.5; <2.0.0`
dangotbanned Jul 21, 2025
2b0bc16
Is this fixed?
dangotbanned Jul 21, 2025
abbb4b7
cov
dangotbanned Jul 21, 2025
ccfe532
feat: Add `(Expr|Series).last`
dangotbanned Jul 21, 2025
dd1f89e
test: Add `last_test.py`
dangotbanned Jul 22, 2025
54b3188
test: Add `test_group_by_agg_last`
dangotbanned Jul 22, 2025
5f9ff6f
fix: Add missing `PandasLikeGroupBy._REMAP_AGGS` entry
dangotbanned Jul 22, 2025
4000b25
test: Repeat `@single_cases` pattern for `first`
dangotbanned Jul 22, 2025
1c62ce2
docs: Examples for `Expr.(first|last)`
dangotbanned Jul 22, 2025
64fdf10
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 22, 2025
063e5d0
Remove `modin` todo
dangotbanned Jul 22, 2025
2e4f260
Merge branch 'main' into expr-first
dangotbanned Jul 23, 2025
65e6804
clean up and doc `pandas`
dangotbanned Jul 23, 2025
22fae20
feat: Warn on new pandas apply path
dangotbanned Jul 23, 2025
60624b9
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 23, 2025
d66fddc
cov
dangotbanned Jul 23, 2025
5e444a5
always use `apply` for `cudf` 😒
dangotbanned Jul 24, 2025
e1a9bc3
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 25, 2025
0cbe33d
Merge branch 'main' into expr-first
dangotbanned Jul 26, 2025
2960736
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Jul 28, 2025
4ede6b2
merge main
FBruzzesi Jul 29, 2025
2ae4245
special path for orderable_aggregation in over
FBruzzesi Jul 29, 2025
b8066c4
expand on comments
FBruzzesi Jul 29, 2025
2dae6ef
assign metadata in arrow
FBruzzesi Jul 29, 2025
3aa52dc
Merge branch 'main' into expr-first
FBruzzesi Aug 2, 2025
7c578c7
Merge branch 'main' into expr-first
dangotbanned Aug 5, 2025
30bad0e
Merge branch 'main' into expr-first
dangotbanned Aug 7, 2025
d269d56
Merge branch 'main' into expr-first
dangotbanned Aug 7, 2025
c0e37aa
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 8, 2025
6f5c05b
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 12, 2025
20be193
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 13, 2025
abd027a
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 14, 2025
1fd9fd3
Merge branch 'main' into expr-first
dangotbanned Aug 15, 2025
94d6b19
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 17, 2025
849a6d9
Merge branch 'main' into expr-first
dangotbanned Aug 18, 2025
476c63e
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 18, 2025
c169104
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 19, 2025
d77fcd1
Merge branch 'main' into expr-first
dangotbanned Aug 19, 2025
1f38bde
Merge branch 'main' into expr-first
dangotbanned Aug 19, 2025
3c63726
Merge branch 'main' into expr-first
dangotbanned Aug 20, 2025
6d7b09b
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 21, 2025
3b6301f
Merge branch 'main' into expr-first
dangotbanned Aug 23, 2025
bfc55c7
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 23, 2025
f47ef14
docs: Remove *Returns* from `Expr` version
dangotbanned Aug 23, 2025
b32db75
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 25, 2025
f22a497
Merge branch 'main' into expr-first
dangotbanned Aug 25, 2025
b5fe1ba
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Aug 28, 2025
0f301e7
Merge branch 'main' into expr-first
dangotbanned Aug 30, 2025
16a2762
Merge branch 'main' into expr-first
dangotbanned Sep 3, 2025
09dca76
Merge branch 'main' into expr-first
dangotbanned Sep 5, 2025
ffe7e24
Merge remote-tracking branch 'upstream/main' into expr-first
dangotbanned Sep 13, 2025
0fb0455
chore(typing): fix incompatible override
dangotbanned Sep 13, 2025
6d63ea6
simplify grouped first/last#
MarcoGorelli Oct 2, 2025
7b00310
simplify test, remove unnecessary over(order_by)
MarcoGorelli Oct 2, 2025
016abc9
combine tests
MarcoGorelli Oct 2, 2025
29d6cb7
combine tests
MarcoGorelli Oct 2, 2025
3b91e23
duckdb fix#
MarcoGorelli Oct 2, 2025
c87935d
sort out ibis
MarcoGorelli Oct 2, 2025
0393dfe
dask
MarcoGorelli Oct 2, 2025
466c922
add note to docs
MarcoGorelli Oct 2, 2025
4266e4b
remove unnecessary code
MarcoGorelli Oct 2, 2025
555098b
pyarrow
MarcoGorelli Oct 2, 2025
36e38e0
fixup
MarcoGorelli Oct 2, 2025
42d2cd6
typing
MarcoGorelli Oct 2, 2025
63f012a
dask
MarcoGorelli Oct 2, 2025
c4ac043
test and support `diff().sum().over(order_by=...)`
MarcoGorelli Oct 3, 2025
8739b6a
cross-pandas version compat
MarcoGorelli Oct 3, 2025
ff22604
make test more unusual
MarcoGorelli Oct 3, 2025
d9c4a1b
fix another pyarrow issue
MarcoGorelli Oct 3, 2025
03b7969
catch more warnings for modin
MarcoGorelli Oct 3, 2025
d01a398
factor out sql_expression, link to feature request
MarcoGorelli Oct 3, 2025
18c0861
combine first and last blocks
MarcoGorelli Oct 3, 2025
948d96d
remove more unneeded
MarcoGorelli Oct 3, 2025
8810d03
less special-casing
MarcoGorelli Oct 3, 2025
843549f
simplify further
MarcoGorelli Oct 3, 2025
d7be792
Merge remote-tracking branch 'upstream/main' into expr-first
MarcoGorelli Oct 3, 2025
363490d
typing
MarcoGorelli Oct 3, 2025
c25d649
use repeat_by instead of lit for polars
MarcoGorelli Oct 3, 2025
2 changes: 2 additions & 0 deletions docs/api-reference/expr.md
@@ -21,6 +21,8 @@
- exp
- fill_null
- filter
- first
- last
- clip
- is_between
- is_duplicated
2 changes: 2 additions & 0 deletions docs/api-reference/series.md
@@ -29,7 +29,9 @@
- exp
- fill_null
- filter
- first
- from_numpy
- last
- gather_every
- head
- hist
22 changes: 14 additions & 8 deletions narwhals/_arrow/expr.py
@@ -2,11 +2,12 @@

from typing import TYPE_CHECKING, Any

import pyarrow as pa
import pyarrow.compute as pc

from narwhals._arrow.series import ArrowSeries
from narwhals._compliant import EagerExpr
from narwhals._expression_parsing import evaluate_output_names_and_aliases
from narwhals._expression_parsing import ExprKind, evaluate_output_names_and_aliases
from narwhals._utils import (
Implementation,
generate_temporary_column_name,
@@ -113,11 +114,8 @@ def _reuse_series_extra_kwargs(
return {"_return_py_scalar": False} if returns_scalar else {}

def over(self, partition_by: Sequence[str], order_by: Sequence[str]) -> Self:
if (
partition_by
and self._metadata is not None
and not self._metadata.is_scalar_like
):
meta = self._metadata
if partition_by and meta is not None and not meta.is_scalar_like:
msg = "Only aggregation or literal operations are supported in grouped `over` context for PyArrow."
raise NotImplementedError(msg)

@@ -131,12 +129,20 @@ def func(df: ArrowDataFrame) -> Sequence[ArrowSeries]:
df = df.with_row_index(token, order_by=None).sort(
*order_by, descending=False, nulls_last=False
)
result = self(df.drop([token], strict=True))
results = self(df.drop([token], strict=True))
if meta is not None and meta.last_node is ExprKind.ORDERABLE_AGGREGATION:
# Orderable aggregations require `order_by` columns and result in a
# scalar output (in practice, a length-1 series).
# Therefore we need to broadcast the results to the original size, since
# `over` is not a length changing operation.
size = len(df)
return [s._with_native(pa.repeat(s.item(), size)) for s in results]
Comment on lines 130 to 135
Member:

ok before this PR, we need to support

df = nw.from_native(pl.DataFrame({'a': [1,2,3,4,None,None,2,None,2], 'b': [1,1,1,1,1,1,2,2,2]})).lazy('duckdb').collect('pandas')
print(df.with_columns(
    nw.col('a').diff().mean().over(order_by='b')
))

which currently raises for both pandas and pyarrow

Member Author:

What is the relation between that and this PR?

Member:

it requires the same kind of solution

the fact that it's orderable shouldn't be relevant, and it's not enough to just look at the last node

Member Author (@dangotbanned), Oct 3, 2025:

I've just tried that example out natively in polars

I'm getting the same result from both of these:

pl.col("a").diff().mean()
pl.col("a").diff().mean().over(order_by="b")

If I change the input data in either "a" or "b", the result for "a" is always the mean broadcast to the full length

Note

Update: I didn't test it here, but over does have an impact if you use .over(order_by="a")
But the result is still broadcast

Show repro

import polars as pl

import narwhals as nw

data_orig = {"a": [1, 2, 3, 4, None, None, 2, None, 2], "b": [1, 1, 1, 1, 1, 1, 2, 2, 2]}
data_b_non_asc = {
    "a": [1, 2, 3, 4, None, None, 2, None, 2],
    "b": [1, 5, 1, 1, 1, 1, 2, 2, 2],
}
data_a_varied = {
    "a": [1, 2, 5, 4, None, None, 2, 12, 2],
    "b": [1, 1, 1, 1, 3, 1, 3, 2, 2],
}
datasets = {
    "Original": data_orig,
    "`b` non-ascending": data_b_non_asc,
    "`a` varied": data_a_varied,
}

diff = pl.col("a").diff()
diff_mean = diff.mean()
diff_mean_order_b = diff_mean.over(order_by="b")

native = pl.LazyFrame(data_orig)

with pl.Config(tbl_hide_dataframe_shape=True):
    for name, data in datasets.items():
        native = pl.LazyFrame(data)
        underline = "-" * len(name)
        print(name, underline, sep="\n")
        print(diff, native.with_columns(diff).collect(), sep="\n")
        print(diff_mean, native.with_columns(diff_mean).collect(), sep="\n")
        print(
            diff_mean_order_b, native.with_columns(diff_mean_order_b).collect(), sep="\n"
        )

Show output

Original
--------
col("a").diff([dyn int: 1])
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a    ┆ b   β”‚
β”‚ ---  ┆ --- β”‚
β”‚ i64  ┆ i64 β”‚
β•žβ•β•β•β•β•β•β•ͺ═════║
β”‚ null ┆ 1   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ null ┆ 1   β”‚
β”‚ null ┆ 1   β”‚
β”‚ null ┆ 2   β”‚
β”‚ null ┆ 2   β”‚
β”‚ null ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean()
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ i64 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ i64 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
`b` non-ascending
-----------------
col("a").diff([dyn int: 1])
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a    ┆ b   β”‚
β”‚ ---  ┆ --- β”‚
β”‚ i64  ┆ i64 β”‚
β•žβ•β•β•β•β•β•β•ͺ═════║
β”‚ null ┆ 1   β”‚
β”‚ 1    ┆ 5   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ null ┆ 1   β”‚
β”‚ null ┆ 1   β”‚
β”‚ null ┆ 2   β”‚
β”‚ null ┆ 2   β”‚
β”‚ null ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean()
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ i64 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 5   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ i64 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 5   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 1   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β”‚ 1.0 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
`a` varied
----------
col("a").diff([dyn int: 1])
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a    ┆ b   β”‚
β”‚ ---  ┆ --- β”‚
β”‚ i64  ┆ i64 β”‚
β•žβ•β•β•β•β•β•β•ͺ═════║
β”‚ null ┆ 1   β”‚
β”‚ 1    ┆ 1   β”‚
β”‚ 3    ┆ 1   β”‚
β”‚ -1   ┆ 1   β”‚
β”‚ null ┆ 3   β”‚
β”‚ null ┆ 1   β”‚
β”‚ null ┆ 3   β”‚
β”‚ 10   ┆ 2   β”‚
β”‚ -10  ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean()
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   β”‚
β”‚ --- ┆ --- β”‚
β”‚ f64 ┆ i64 β”‚
β•žβ•β•β•β•β•β•ͺ═════║
β”‚ 0.6 ┆ 1   β”‚
β”‚ 0.6 ┆ 1   β”‚
β”‚ 0.6 ┆ 1   β”‚
β”‚ 0.6 ┆ 1   β”‚
β”‚ 0.6 ┆ 3   β”‚
β”‚ 0.6 ┆ 1   β”‚
β”‚ 0.6 ┆ 3   β”‚
β”‚ 0.6 ┆ 2   β”‚
β”‚ 0.6 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ a     ┆ b   β”‚
β”‚ ---   ┆ --- β”‚
β”‚ f64   ┆ i64 β”‚
β•žβ•β•β•β•β•β•β•β•ͺ═════║
β”‚ -1.75 ┆ 1   β”‚
β”‚ -1.75 ┆ 1   β”‚
β”‚ -1.75 ┆ 1   β”‚
β”‚ -1.75 ┆ 1   β”‚
β”‚ -1.75 ┆ 3   β”‚
β”‚ -1.75 ┆ 1   β”‚
β”‚ -1.75 ┆ 3   β”‚
β”‚ -1.75 ┆ 2   β”‚
β”‚ -1.75 ┆ 2   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

Member Author:

nw.col('a').diff().mean().over(order_by='b')

@MarcoGorelli was this based on something you've used in polars before?

Member Author:

I've had a look through this recent PR:

I was surprised that diff doesn't seem to have any ordering requirements πŸ€”

Some select bits from it though:

Member:

simple example where it makes a difference:

In [13]: df = pl.DataFrame({'a': [1, 2, 3], 'b': [0, 2, 1]})

In [14]: df.with_columns(c=pl.col('a').diff().mean())
Out[14]:
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   ┆ c                        β”‚
β”‚ --- ┆ --- ┆ ---                      β”‚
β”‚ i64 ┆ i64 ┆ f64                      β”‚
β•žβ•β•β•β•β•β•ͺ═════β•ͺ══════════════════════════║
β”‚ 1   ┆ 0   ┆ 1.00000000000000000000e0 β”‚
β”‚ 2   ┆ 2   ┆ 1.00000000000000000000e0 β”‚
β”‚ 3   ┆ 1   ┆ 1.00000000000000000000e0 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In [15]: df.with_columns(c=pl.col('a').diff().mean().over(order_by='b'))
Out[15]:
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a   ┆ b   ┆ c                         β”‚
β”‚ --- ┆ --- ┆ ---                       β”‚
β”‚ i64 ┆ i64 ┆ f64                       β”‚
β•žβ•β•β•β•β•β•ͺ═════β•ͺ═══════════════════════════║
β”‚ 1   ┆ 0   ┆ 5.00000000000000000000e-1 β”‚
β”‚ 2   ┆ 2   ┆ 5.00000000000000000000e-1 β”‚
β”‚ 3   ┆ 1   ┆ 5.00000000000000000000e-1 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Member Author:

1.00000000000000000000e0

What were you up to needing this much precision? πŸ˜„


# TODO(marco): is there a way to do this efficiently without
# doing 2 sorts? Here we're sorting the dataframe and then
# again calling `sort_indices`. `ArrowSeries.scatter` would also sort.
sorting_indices = pc.sort_indices(df.get_column(token).native)
return [s._with_native(s.native.take(sorting_indices)) for s in result]
return [s._with_native(s.native.take(sorting_indices)) for s in results]
else:

def func(df: ArrowDataFrame) -> Sequence[ArrowSeries]:
76 changes: 60 additions & 16 deletions narwhals/_arrow/group_by.py
@@ -9,7 +9,7 @@
from narwhals._arrow.utils import cast_to_comparable_string_types, extract_py_scalar
from narwhals._compliant import EagerGroupBy
from narwhals._expression_parsing import evaluate_output_names_and_aliases
from narwhals._utils import generate_temporary_column_name
from narwhals._utils import generate_temporary_column_name, requires

if TYPE_CHECKING:
from collections.abc import Iterator, Mapping, Sequence
@@ -39,12 +39,23 @@ class ArrowGroupBy(EagerGroupBy["ArrowDataFrame", "ArrowExpr", "Aggregation"]):
"count": "count",
"all": "all",
"any": "any",
"first": "first",
"last": "last",
}
_REMAP_UNIQUE: ClassVar[Mapping[UniqueKeepStrategy, Aggregation]] = {
"any": "min",
"first": "min",
"last": "max",
}
_OPTION_COUNT_ALL: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(
("len", "n_unique")
)
_OPTION_COUNT_VALID: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("count",))
_OPTION_ORDERED: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(
("first", "last")
)
_OPTION_VARIANCE: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("std", "var"))
_OPTION_SCALAR: ClassVar[frozenset[NarwhalsAggregation]] = frozenset(("any", "all"))

def __init__(
self,
@@ -60,12 +71,58 @@ def __init__(
self._grouped = pa.TableGroupBy(self.compliant.native, self._keys)
self._drop_null_keys = drop_null_keys

def _configure_agg(
self, grouped: pa.TableGroupBy, expr: ArrowExpr, /
) -> tuple[pa.TableGroupBy, Aggregation, AggregateOptions | None]:
option: AggregateOptions | None = None
function_name = self._leaf_name(expr)
if function_name in self._OPTION_VARIANCE:
ddof = expr._scalar_kwargs.get("ddof", 1)
option = pc.VarianceOptions(ddof=ddof)
elif function_name in self._OPTION_COUNT_ALL:
option = pc.CountOptions(mode="all")
elif function_name in self._OPTION_COUNT_VALID:
option = pc.CountOptions(mode="only_valid")
elif function_name in self._OPTION_SCALAR:
option = pc.ScalarAggregateOptions(min_count=0)
elif function_name in self._OPTION_ORDERED:
grouped, option = self._ordered_agg(grouped, function_name)
return grouped, self._remap_expr_name(function_name), option
Comment on lines +74 to +90
Member Author (@dangotbanned), Jul 22, 2025:

Possible follow-up

Do another pass on this, since it was written before the pandas refactor - and solves a similar problem in a different way

Member Author:

I should be able to upstream some version of that to main (or this PR if blocking) if there's interest

It works on all versions of pyarrow that we support and doesn't require touching pyarrow internals like this does


def _ordered_agg(
self, grouped: pa.TableGroupBy, name: NarwhalsAggregation, /
) -> tuple[pa.TableGroupBy, AggregateOptions]:
"""The default behavior of `pyarrow` raises when `first` or `last` are used.

You'd see an error like:

ArrowNotImplementedError: Using ordered aggregator in multiple threaded execution is not supported

We need to **disable** multi-threading to use them, but the ability to do so
wasn't possible before `14.0.0` ([pyarrow-36709])

[pyarrow-36709]: https://github.com/apache/arrow/issues/36709
"""
backend_version = self.compliant._backend_version
if backend_version >= (14, 0) and grouped._use_threads:
native = self.compliant.native
grouped = pa.TableGroupBy(native, grouped.keys, use_threads=False)
elif backend_version < (14, 0): # pragma: no cover
msg = (
f"Using `{name}()` in a `group_by().agg(...)` context is only available in 'pyarrow>=14.0.0', "
f"found version {requires._unparse_version(backend_version)!r}.\n\n"
f"See https://github.com/apache/arrow/issues/36709"
)
raise NotImplementedError(msg)
return grouped, pc.ScalarAggregateOptions(skip_nulls=False)

def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame:
self._ensure_all_simple(exprs)
aggs: list[tuple[str, Aggregation, AggregateOptions | None]] = []
expected_pyarrow_column_names: list[str] = self._keys.copy()
new_column_names: list[str] = self._keys.copy()
exclude = (*self._keys, *self._output_key_names)
grouped = self._grouped

for expr in exprs:
output_names, aliases = evaluate_output_names_and_aliases(
@@ -83,20 +140,7 @@ def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame:
aggs.append((self._keys[0], "count", pc.CountOptions(mode="all")))
continue

function_name = self._leaf_name(expr)
if function_name in {"std", "var"}:
assert "ddof" in expr._scalar_kwargs # noqa: S101
option: Any = pc.VarianceOptions(ddof=expr._scalar_kwargs["ddof"])
elif function_name in {"len", "n_unique"}:
option = pc.CountOptions(mode="all")
elif function_name == "count":
option = pc.CountOptions(mode="only_valid")
elif function_name in {"all", "any"}:
option = pc.ScalarAggregateOptions(min_count=0)
else:
option = None

function_name = self._remap_expr_name(function_name)
grouped, function_name, option = self._configure_agg(grouped, expr)
new_column_names.extend(aliases)
expected_pyarrow_column_names.extend(
[f"{output_name}_{function_name}" for output_name in output_names]
@@ -105,7 +149,7 @@ def agg(self, *exprs: ArrowExpr) -> ArrowDataFrame:
[(output_name, function_name, option) for output_name in output_names]
)

result_simple = self._grouped.aggregate(aggs)
result_simple = grouped.aggregate(aggs)

# Rename columns, being very careful
expected_old_names_indices: dict[str, list[int]] = collections.defaultdict(list)
9 changes: 9 additions & 0 deletions narwhals/_arrow/series.py
@@ -323,6 +323,15 @@ def filter(self, predicate: ArrowSeries | list[bool | None]) -> Self:
other_native = predicate
return self._with_native(self.native.filter(other_native))

def first(self, *, _return_py_scalar: bool = True) -> PythonLiteral:
result = self.native[0] if len(self.native) else None
return maybe_extract_py_scalar(result, _return_py_scalar)

def last(self, *, _return_py_scalar: bool = True) -> PythonLiteral:
ca = self.native
result = ca[height - 1] if (height := len(ca)) else None
return maybe_extract_py_scalar(result, _return_py_scalar)

def mean(self, *, _return_py_scalar: bool = True) -> float:
return maybe_extract_py_scalar(pc.mean(self.native), _return_py_scalar)

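The empty-series guards in `ArrowSeries.first`/`last` above boil down to the following plain-Python behaviour (first/last element, or `None` when empty):

```python
def first(values):
    # First element, or None for an empty sequence, mirroring the guard above.
    return values[0] if len(values) else None

def last(values):
    # Last element, or None for an empty sequence.
    return values[-1] if len(values) else None

print(first([1, 2, 3]), last([1, 2, 3]))  # 1 3
print(first([]), last([]))                # None None
```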
14 changes: 13 additions & 1 deletion narwhals/_compliant/expr.py
@@ -27,7 +27,7 @@
LazyExprT,
NativeExprT,
)
from narwhals._utils import _StoresCompliant
from narwhals._utils import _StoresCompliant, not_implemented
from narwhals.dependencies import get_numpy, is_numpy_array

if TYPE_CHECKING:
@@ -142,6 +142,8 @@ def cum_min(self, *, reverse: bool) -> Self: ...
def cum_max(self, *, reverse: bool) -> Self: ...
def cum_prod(self, *, reverse: bool) -> Self: ...
def is_in(self, other: Any) -> Self: ...
def first(self) -> Self: ...
def last(self) -> Self: ...
def rank(self, method: RankMethod, *, descending: bool) -> Self: ...
def replace_strict(
self,
@@ -892,6 +894,12 @@ def exp(self) -> Self:
def sqrt(self) -> Self:
return self._reuse_series("sqrt")

def first(self) -> Self:
return self._reuse_series("first", returns_scalar=True)

def last(self) -> Self:
return self._reuse_series("last", returns_scalar=True)

@property
def cat(self) -> EagerExprCatNamespace[Self]:
return EagerExprCatNamespace(self)
@@ -922,6 +930,10 @@ class LazyExpr( # type: ignore[misc]
CompliantExpr[CompliantLazyFrameT, NativeExprT],
Protocol[CompliantLazyFrameT, NativeExprT],
):
# NOTE: See https://github.com/narwhals-dev/narwhals/issues/2526#issuecomment-3019303816
first: not_implemented = not_implemented()
last: not_implemented = not_implemented()

def _with_alias_output_names(self, func: AliasNames | None, /) -> Self: ...
def alias(self, name: str) -> Self:
def fn(names: Sequence[str]) -> Sequence[str]:
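The `not_implemented` class-level marker used for `LazyExpr.first`/`last` above can be approximated with a small descriptor. This is a hedged sketch of the pattern, not narwhals' actual helper (names and details here are assumptions):

```python
class not_implemented:
    """Descriptor that raises on attribute access (sketch of the pattern)."""

    def __set_name__(self, owner, name):
        # Remember the attribute name we were assigned to.
        self._name = name

    def __get__(self, instance, owner=None):
        cls = owner.__name__ if owner is not None else type(instance).__name__
        raise NotImplementedError(f"`{self._name}` is not implemented for {cls}")


class SomeLazyExpr:  # hypothetical stand-in for a lazy backend expression
    first = not_implemented()
```

Accessing `SomeLazyExpr().first` (or `SomeLazyExpr.first`) then raises `NotImplementedError` rather than silently returning an unusable method.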
3 changes: 3 additions & 0 deletions narwhals/_compliant/series.py
@@ -49,6 +49,7 @@
MultiIndexSelector,
NonNestedLiteral,
NumericLiteral,
PythonLiteral,
RankMethod,
RollingInterpolationMethod,
SizedMultiIndexSelector,
@@ -193,6 +194,8 @@ def fill_null(
limit: int | None,
) -> Self: ...
def filter(self, predicate: Any) -> Self: ...
def first(self) -> PythonLiteral: ...
def last(self) -> PythonLiteral: ...
def gather_every(self, n: int, offset: int) -> Self: ...
def head(self, n: int) -> Self: ...
def is_between(
Expand Down
2 changes: 2 additions & 0 deletions narwhals/_compliant/typing.py
@@ -184,6 +184,8 @@ class ScalarKwargs(TypedDict, total=False):
"quantile",
"all",
"any",
"first",
"last",
]
"""`Expr` methods we aim to support in `DepthTrackingGroupBy`.

16 changes: 15 additions & 1 deletion narwhals/_pandas_like/expr.py
@@ -3,7 +3,7 @@
from typing import TYPE_CHECKING

from narwhals._compliant import EagerExpr
from narwhals._expression_parsing import evaluate_output_names_and_aliases
from narwhals._expression_parsing import ExprKind, evaluate_output_names_and_aliases
from narwhals._pandas_like.group_by import PandasLikeGroupBy
from narwhals._pandas_like.series import PandasLikeSeries
from narwhals._utils import generate_temporary_column_name
@@ -227,6 +227,20 @@ def func(df: PandasLikeDataFrame) -> Sequence[PandasLikeSeries]:
*order_by, descending=False, nulls_last=False
)
results = self(df.drop([token], strict=True))
if (
meta := self._metadata
) is not None and meta.last_node is ExprKind.ORDERABLE_AGGREGATION:
# Orderable aggregations require `order_by` columns and result in a
# scalar output (well actually in a length 1 series).
# Therefore we need to broadcast the result to the original size, since
# `over` is not a length changing operation.
index = df.native.index
ns = self._implementation.to_native_namespace()
return [
s._with_native(ns.Series(s.item(), index=index, name=s.name))
for s in results
]

sorting_indices = df.get_column(token)
for s in results:
s._scatter_in_place(sorting_indices, s)
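The pandas branch above uses the same broadcasting idea: the scalar aggregation result is re-expanded over the original index so `over` keeps the frame's length. A minimal standalone sketch (not the narwhals code itself):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
scalar = df["a"].iloc[0]  # stand-in for the first() aggregation result
# Constructing a Series from a scalar and an index broadcasts the scalar
# to every row, preserving the original length and alignment.
out = pd.Series(scalar, index=df.index, name="a")
print(out.tolist())  # [3, 3, 3]
```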