
feat: support cum_sum for lazy backends#2132

Merged
MarcoGorelli merged 71 commits into narwhals-dev:main from
MarcoGorelli:start-order-dependence
Mar 8, 2025

Conversation

@MarcoGorelli (Member) commented Mar 2, 2025

Demo of this work:

For eager backends you can keep using cum_sum liberally, just as before; this PR introduces no new restrictions:

In [16]: df
Out[16]: 
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|     a  b  c  i   |
|  0  a  1  5  0   |
|  1  a  2  4  1   |
|  2  b  3  3  2   |
|  3  b  5  2  4   |
|  4  b  3  1  3   |
└──────────────────┘

In [17]: df.with_columns(b_cum_sum = (nw.col('b')**2).cum_sum())
Out[17]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          5|
|2  b  3  3  2         14|
|3  b  5  2  4         39|
|4  b  3  1  3         48|
└────────────────────────┘

You can also optionally specify _order_by (for now private, whilst we build out the functionality):

In [18]: df.with_columns(b_cum_sum = (nw.col('b')**2).cum_sum().over(_order_by='i'))
Out[18]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          5|
|2  b  3  3  2         14|
|3  b  5  2  4         48|
|4  b  3  1  3         23|
└────────────────────────┘

You can also partition by a column, but for pandas there's the usual limitation that only elementary expressions are supported:

In [19]: df.with_columns(b_cum_sum = nw.col('b').cum_sum().over('a', _order_by='i'))
Out[19]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          3|
|2  b  3  3  2          3|
|3  b  5  2  4         11|
|4  b  3  1  3          6|
└────────────────────────┘

And now for the new functionality this unlocks: for lazy backends, the above is also supported, but specifying _order_by is required. Example using SQLFrame:

In [29]: lf
Out[29]: 
┌────────────────────────────────────────────────────────────────────┐
|                         Narwhals LazyFrame                         |
|--------------------------------------------------------------------|
|<sqlframe.duckdb.dataframe.DuckDBDataFrame object at 0x7f4f12317260>|
└────────────────────────────────────────────────────────────────────┘

In [30]: lf.to_native().show()
+---+---+---+---+
| a | b | c | i |
+---+---+---+---+
| a | 1 | 5 | 0 |
| a | 2 | 4 | 1 |
| b | 3 | 3 | 2 |
| b | 5 | 2 | 4 |
| b | 3 | 1 | 3 |
+---+---+---+---+

In [31]: lf.with_columns(b_cum_sum = nw.col('b').cum_sum().over(_order_by='i')).to_native().show()
+---+---+---+---+-----------+
| a | b | c | i | b_cum_sum |
+---+---+---+---+-----------+
| a | 1 | 5 | 0 |     1     |
| a | 2 | 4 | 1 |     3     |
| b | 3 | 3 | 2 |     6     |
| b | 3 | 1 | 3 |     9     |
| b | 5 | 2 | 4 |     14    |
+---+---+---+---+-----------+

In [32]: lf.with_columns(b_cum_sum = nw.col('b').cum_sum().over('a', _order_by='i')).to_native().show()
+---+---+---+---+-----------+
| a | b | c | i | b_cum_sum |
+---+---+---+---+-----------+
| a | 1 | 5 | 0 |     1     |
| a | 2 | 4 | 1 |     3     |
| b | 3 | 3 | 2 |     3     |
| b | 3 | 1 | 3 |     6     |
| b | 5 | 2 | 4 |     11    |
+---+---+---+---+-----------+

The implementation for spark-like looks like this:

https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L536-L550

then over just applies that window function with the given window:

https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L496-L506
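Since the linked code isn't inlined here, the semantics of that window function can be sketched in plain Python. This is an illustrative sketch only, not the narwhals implementation; the helper name cum_sum_over is hypothetical. The idea: group rows by the partition key, accumulate within each group in _order_by order, then return results in the original row order.

```python
from collections import defaultdict


def cum_sum_over(values, partition, order):
    """Cumulative sum of `values` within each `partition` group,
    accumulated in the order given by `order`, with results
    returned in the original row order."""
    result = [None] * len(values)
    groups = defaultdict(list)
    for idx, key in enumerate(partition):
        groups[key].append(idx)
    for idxs in groups.values():
        running = 0
        # Accumulate within the group, ordered by the `order` key.
        for idx in sorted(idxs, key=lambda i: order[i]):
            running += values[idx]
            result[idx] = running
    return result


# Same data as the demo frame above: columns a (partition), b (values), i (order).
b = [1, 2, 3, 5, 3]
a = ["a", "a", "b", "b", "b"]
i = [0, 1, 2, 4, 3]
print(cum_sum_over(b, a, i))  # [1, 3, 3, 11, 6], matching the b_cum_sum column
```

This mirrors the demo above: partitioning by 'a' and ordering by 'i' reproduces the b_cum_sum column from the earlier output.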

At the Narwhals level, we enforce that, for LazyFrames, window functions like cum_sum must be immediately followed by over with _order_by specified.

We should be able to adapt this fairly straightforwardly to also cover:

  • rolling_*
  • diff
  • shift
  • is_*_distinct
  • cum_*
  • rank
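Each of these is order-dependent in the same way, so the same sort-apply-restore pattern should carry over. As a rough plain-Python sketch (the helper diff_over is hypothetical, not part of narwhals), here is diff over an _order_by key:

```python
def diff_over(values, order):
    """Difference with the previous value, taken in the order given
    by `order`, with results returned in the original row order."""
    result = [None] * len(values)
    prev = None
    # Visit rows in `order`-sorted sequence, writing back to original positions.
    for idx in sorted(range(len(values)), key=lambda i: order[i]):
        result[idx] = None if prev is None else values[idx] - prev
        prev = values[idx]
    return result


print(diff_over([1, 2, 3, 5, 3], [0, 1, 2, 4, 3]))  # [None, 1, 1, 2, 0]
```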

All of these should be supportable immediately for SQLFrame / PySpark / Polars Lazy. Missing lazy backends are:

@MarcoGorelli
Member Author

thanks @FBruzzesi , excellent review!

have fixed the partition_by='' case

for nulls in order_by, I've fixed and tested. if there are duplicates in order_by, then results aren't stable (we don't provide stable sorting guarantees), so I'm not sure there's much we should do there

Comment on lines 429 to 453
def over(
    self: Self,
    partition_by: Sequence[str],
    kind: ExprKind,
    order_by: Sequence[str] | None,
) -> Self:
    if partition_by and not is_scalar_like(kind):
        msg = "Only aggregation or literal operations are supported in grouped `over` context for PyArrow."
        raise NotImplementedError(msg)

    def func(df: ArrowDataFrame) -> list[ArrowSeries]:
        output_names, aliases = evaluate_output_names_and_aliases(self, df, [])
        if overlap := set(output_names).intersection(partition_by):
            # E.g. `df.select(nw.all().sum().over('a'))`. This is well-defined,
            # we just don't support it yet.
            msg = (
                f"Column names {overlap} appear in both expression output names and in `over` keys.\n"
                "This is not yet supported."
            )
            raise NotImplementedError(msg)

    if not partition_by:
        assert order_by is not None  # help type checkers  # noqa: S101

        # This is something like `nw.col('a').cum_sum().order_by(key)`
        # which we can always easily support, as it doesn't require grouping.
        def func(df: ArrowDataFrame) -> Sequence[ArrowSeries]:
            token = generate_temporary_column_name(8, df.columns)
            df = df.with_row_index(token).sort(
                *order_by, descending=False, nulls_last=False
            )
            result = self(df)
            # TODO(marco): is there a way to do this efficiently without
            # doing 2 sorts? Here we're sorting the dataframe and then
            # again calling `sort_indices`. We can't use the same trick
            # we use in pandas as PyArrow arrays are immutable.
Member

@MarcoGorelli I feel like I'm missing something 🤔

What is the relation between ArrowExpr.over and cum_sum for lazy backends?

Member Author

sure, thanks for asking

in nw.LazyFrame, cum_sum must be followed by over - there are some examples here: #2132 (comment)

dangotbanned added a commit that referenced this pull request Mar 5, 2025
Porting over (#2051), didn't realize this was declared twice until (#2132)
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 7, 2025 16:36
MarcoGorelli pushed a commit that referenced this pull request Mar 7, 2025
* chore(typing): Add typing for `SparkLikeExpr` properties

Porting over (#2051), didn't realize this was declared twice until (#2132)

* chore: fix typing and simplify `SparkLikeExprStringNamespace.to_datetime`

Resolves (https://github.com/narwhals-dev/narwhals/actions/runs/13682912007/job/38259412675?pr=2152)

* rename function

* use single chars in set

* fix: Remove timezone offset replacement

* test: Adds `test_to_datetime_tz_aware`

Resolves #2152 (comment)

* test: possibly fix `pyarrow` in ci?

Maybe this was just a TZDATA issue locally?
https://github.com/narwhals-dev/narwhals/actions/runs/13699734154/job/38310256617?pr=2152

* test: xfail polars `3.8`, fix false positive pyarrow

https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310487932?pr=2152
https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310488783?pr=2152

* test: narrower xfail, tz-less expected?

Not even sure what `pyarrow` is doing here https://github.com/narwhals-dev/narwhals/actions/runs/13700021595/job/38311197947?pr=2152

* test: account for `pyarrow` version changes

https://github.com/narwhals-dev/narwhals/actions/runs/13700267075/job/38312036397?pr=215

* test: maybe fix `pyspark`

https://github.com/narwhals-dev/narwhals/actions/runs/13700361438/job/38312364899?pr=2152

* revert: go back to typing fixes only

Addresses #2152 (review)

* chore: ignore `format` shadowing

#2152 (review)

* keep logic the same I hope

#2152 (comment)

---------

Co-authored-by: Edoardo Abati <29585319+EdAbati@users.noreply.github.com>
@MarcoGorelli
Member Author

right, let's go ahead with this, thanks all for reviews!

@MarcoGorelli MarcoGorelli merged commit b508e63 into narwhals-dev:main Mar 8, 2025
41 of 42 checks passed

Labels

enhancement New feature or request


3 participants