
feat: support cum_sum for lazy backends#2132

Merged
MarcoGorelli merged 71 commits into narwhals-dev:main from
MarcoGorelli:start-order-dependence
Mar 8, 2025

Conversation

@MarcoGorelli (Member) commented Mar 2, 2025

Demo of this work:

For eager backends you can keep using cum_sum liberally, just as before; this PR introduces no new restrictions:

In [16]: df
Out[16]: 
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|     a  b  c  i   |
|  0  a  1  5  0   |
|  1  a  2  4  1   |
|  2  b  3  3  2   |
|  3  b  5  2  4   |
|  4  b  3  1  3   |
└──────────────────┘

In [17]: df.with_columns(b_cum_sum = (nw.col('b')**2).cum_sum())
Out[17]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          5|
|2  b  3  3  2         14|
|3  b  5  2  4         39|
|4  b  3  1  3         48|
└────────────────────────┘

You can also optionally specify _order_by (for now private, whilst we build out the functionality):

In [18]: df.with_columns(b_cum_sum = (nw.col('b')**2).cum_sum().over(_order_by='i'))
Out[18]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          5|
|2  b  3  3  2         14|
|3  b  5  2  4         48|
|4  b  3  1  3         23|
└────────────────────────┘

You can also partition by a column, but for pandas there's the usual limitation that only elementary expressions are supported:

In [19]: df.with_columns(b_cum_sum = nw.col('b').cum_sum().over('a', _order_by='i'))
Out[19]: 
┌────────────────────────┐
|   Narwhals DataFrame   |
|------------------------|
|   a  b  c  i  b_cum_sum|
|0  a  1  5  0          1|
|1  a  2  4  1          3|
|2  b  3  3  2          3|
|3  b  5  2  4         11|
|4  b  3  1  3          6|
└────────────────────────┘

And now for the new functionality this unlocks: for lazy backends, the above is also supported, but specifying _order_by is required. Example using SQLFrame:

In [29]: lf
Out[29]: 
┌────────────────────────────────────────────────────────────────────┐
|                         Narwhals LazyFrame                         |
|--------------------------------------------------------------------|
|<sqlframe.duckdb.dataframe.DuckDBDataFrame object at 0x7f4f12317260>|
└────────────────────────────────────────────────────────────────────┘

In [30]: lf.to_native().show()
+---+---+---+---+
| a | b | c | i |
+---+---+---+---+
| a | 1 | 5 | 0 |
| a | 2 | 4 | 1 |
| b | 3 | 3 | 2 |
| b | 5 | 2 | 4 |
| b | 3 | 1 | 3 |
+---+---+---+---+

In [31]: lf.with_columns(b_cum_sum = nw.col('b').cum_sum().over(_order_by='i')).to_native().show()
+---+---+---+---+-----------+
| a | b | c | i | b_cum_sum |
+---+---+---+---+-----------+
| a | 1 | 5 | 0 |     1     |
| a | 2 | 4 | 1 |     3     |
| b | 3 | 3 | 2 |     6     |
| b | 3 | 1 | 3 |     9     |
| b | 5 | 2 | 4 |     14    |
+---+---+---+---+-----------+

In [32]: lf.with_columns(b_cum_sum = nw.col('b').cum_sum().over('a', _order_by='i')).to_native().show()
+---+---+---+---+-----------+
| a | b | c | i | b_cum_sum |
+---+---+---+---+-----------+
| a | 1 | 5 | 0 |     1     |
| a | 2 | 4 | 1 |     3     |
| b | 3 | 3 | 2 |     3     |
| b | 3 | 1 | 3 |     6     |
| b | 5 | 2 | 4 |     11    |
+---+---+---+---+-----------+

The implementation for spark-like looks like this:

https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L536-L550

then over just applies that window function with the given window:

https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L496-L506
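Since the linked code isn't inlined here, the semantics of that window function can be sketched in plain Python. This is an illustrative sketch only, not the narwhals implementation; the helper name cum_sum_over is hypothetical. The idea: group rows by the partition key, accumulate within each group in _order_by order, then return results in the original row order.

```python
from collections import defaultdict


def cum_sum_over(values, partition, order):
    """Cumulative sum of `values` within each `partition` group,
    accumulated in the order given by `order`, with results
    returned in the original row order."""
    result = [None] * len(values)
    groups = defaultdict(list)
    for idx, key in enumerate(partition):
        groups[key].append(idx)
    for idxs in groups.values():
        running = 0
        # Accumulate within the group, ordered by the `order` key.
        for idx in sorted(idxs, key=lambda i: order[i]):
            running += values[idx]
            result[idx] = running
    return result


# Same data as the demo frame above: columns a (partition), b (values), i (order).
b = [1, 2, 3, 5, 3]
a = ["a", "a", "b", "b", "b"]
i = [0, 1, 2, 4, 3]
print(cum_sum_over(b, a, i))  # [1, 3, 3, 11, 6], matching the b_cum_sum column
```

This mirrors the demo above: partitioning by 'a' and ordering by 'i' reproduces the b_cum_sum column from the earlier output.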

At the Narwhals level, we enforce that, for LazyFrames, window functions like cum_sum must be immediately followed by over with _order_by specified.

We should be able to adapt this fairly straightforwardly to also cover:

  • rolling_*
  • diff
  • shift
  • is_*_distinct
  • cum_*
  • rank
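Each of these is order-dependent in the same way, so the same sort-apply-restore pattern should carry over. As a rough plain-Python sketch (the helper diff_over is hypothetical, not part of narwhals), here is diff over an _order_by key:

```python
def diff_over(values, order):
    """Difference with the previous value, taken in the order given
    by `order`, with results returned in the original row order."""
    result = [None] * len(values)
    prev = None
    # Visit rows in `order`-sorted sequence, writing back to original positions.
    for idx in sorted(range(len(values)), key=lambda i: order[i]):
        result[idx] = None if prev is None else values[idx] - prev
        prev = values[idx]
    return result


print(diff_over([1, 2, 3, 5, 3], [0, 1, 2, 4, 3]))  # [None, 1, 1, 2, 0]
```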

All of these should be supportable immediately for SQLFrame / PySpark / Polars Lazy. Missing lazy backends are:

@MarcoGorelli
Member Author

thanks @FBruzzesi , excellent review!

have fixed the partition_by='' case

for nulls in order_by, I've fixed and tested. if there are duplicates in order_by, then results aren't stable (we don't provide stable sorting guarantees), so I'm not sure there's much we should do there

Comment on lines 429 to 453
def over(
    self: Self,
    partition_by: Sequence[str],
    kind: ExprKind,
    order_by: Sequence[str] | None,
) -> Self:
    if partition_by and not is_scalar_like(kind):
        msg = "Only aggregation or literal operations are supported in grouped `over` context for PyArrow."
        raise NotImplementedError(msg)

    def func(df: ArrowDataFrame) -> list[ArrowSeries]:
        output_names, aliases = evaluate_output_names_and_aliases(self, df, [])
        if overlap := set(output_names).intersection(partition_by):
            # E.g. `df.select(nw.all().sum().over('a'))`. This is well-defined,
            # we just don't support it yet.
            msg = (
                f"Column names {overlap} appear in both expression output names and in `over` keys.\n"
                "This is not yet supported."
            )
            raise NotImplementedError(msg)

    if not partition_by:
        assert order_by is not None  # help type checkers  # noqa: S101

        # This is something like `nw.col('a').cum_sum().order_by(key)`
        # which we can always easily support, as it doesn't require grouping.
        def func(df: ArrowDataFrame) -> Sequence[ArrowSeries]:
            token = generate_temporary_column_name(8, df.columns)
            df = df.with_row_index(token).sort(
                *order_by, descending=False, nulls_last=False
            )
            result = self(df)
            # TODO(marco): is there a way to do this efficiently without
            # doing 2 sorts? Here we're sorting the dataframe and then
            # again calling `sort_indices`. We can't use the same trick
            # we use in pandas as PyArrow arrays are immutable.
Member

@MarcoGorelli I feel like I'm missing something 🤔

What is the relation between ArrowExpr.over and cum_sum for lazy backends?

Member Author

sure, thanks for asking

in nw.LazyFrame, cum_sum must be followed by over - there are some examples here: #2132 (comment)

dangotbanned added a commit that referenced this pull request Mar 5, 2025
Porting over (#2051), didn't realize this was declared twice until (#2132)
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 7, 2025 16:36
MarcoGorelli pushed a commit that referenced this pull request Mar 7, 2025
* chore(typing): Add typing for `SparkLikeExpr` properties

Porting over (#2051), didn't realize this was declared twice until (#2132)

* chore: fix typing and simplify `SparkLikeExprStringNamespace.to_datetime`

Resolves (https://github.com/narwhals-dev/narwhals/actions/runs/13682912007/job/38259412675?pr=2152)

* rename function

* use single chars in set

* fix: Remove timezone offset replacement

* test: Adds `test_to_datetime_tz_aware`

Resolves #2152 (comment)

* test: possibly fix `pyarrow` in ci?

Maybe this was just a TZDATA issue locally?
https://github.com/narwhals-dev/narwhals/actions/runs/13699734154/job/38310256617?pr=2152

* test: xfail polars `3.8`, fix false positive pyarrow

https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310487932?pr=2152
https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310488783?pr=2152

* test: narrower xfail, tz-less expected?

Not even sure what `pyarrow` is doing here https://github.com/narwhals-dev/narwhals/actions/runs/13700021595/job/38311197947?pr=2152

* test: account for `pyarrow` version changes

https://github.com/narwhals-dev/narwhals/actions/runs/13700267075/job/38312036397?pr=215

* test: maybe fix `pyspark`

https://github.com/narwhals-dev/narwhals/actions/runs/13700361438/job/38312364899?pr=2152

* revert: go back to typing fixes only

Addresses #2152 (review)

* chore: ignore `format` shadowing

#2152 (review)

* keep logic the same I hope

#2152 (comment)

---------

Co-authored-by: Edoardo Abati <29585319+EdAbati@users.noreply.github.com>
@MarcoGorelli
Member Author

right, let's go ahead with this, thanks all for reviews!

@MarcoGorelli MarcoGorelli merged commit b508e63 into narwhals-dev:main Mar 8, 2025
41 of 42 checks passed

Labels

enhancement New feature or request


3 participants