feat: support cum_sum for lazy backends#2132
Merged
MarcoGorelli merged 71 commits intonarwhals-dev:mainfrom Mar 8, 2025
Merged
feat: support cum_sum for lazy backends#2132MarcoGorelli merged 71 commits intonarwhals-dev:mainfrom
cum_sum for lazy backends#2132MarcoGorelli merged 71 commits intonarwhals-dev:mainfrom
Conversation
Member
Author
|
thanks @FBruzzesi , excellent review! have fixed the for nulls in |
MarcoGorelli
commented
Mar 5, 2025
MarcoGorelli
commented
Mar 5, 2025
MarcoGorelli
commented
Mar 5, 2025
dangotbanned
reviewed
Mar 5, 2025
narwhals/_arrow/expr.py
Outdated
Comment on lines
429
to
453
| def over( | ||
| self: Self, | ||
| partition_by: Sequence[str], | ||
| kind: ExprKind, | ||
| order_by: Sequence[str] | None, | ||
| ) -> Self: | ||
| if partition_by and not is_scalar_like(kind): | ||
| msg = "Only aggregation or literal operations are supported in grouped `over` context for PyArrow." | ||
| raise NotImplementedError(msg) | ||
|
|
||
| def func(df: ArrowDataFrame) -> list[ArrowSeries]: | ||
| output_names, aliases = evaluate_output_names_and_aliases(self, df, []) | ||
| if overlap := set(output_names).intersection(keys): | ||
| # E.g. `df.select(nw.all().sum().over('a'))`. This is well-defined, | ||
| # we just don't support it yet. | ||
| msg = ( | ||
| f"Column names {overlap} appear in both expression output names and in `over` keys.\n" | ||
| "This is not yet supported." | ||
| if not partition_by: | ||
| assert order_by is not None # help type checkers # noqa: S101 | ||
|
|
||
| # This is something like `nw.col('a').cum_sum().order_by(key)` | ||
| # which we can always easily support, as it doesn't require grouping. | ||
| def func(df: ArrowDataFrame) -> Sequence[ArrowSeries]: | ||
| token = generate_temporary_column_name(8, df.columns) | ||
| df = df.with_row_index(token).sort( | ||
| *order_by, descending=False, nulls_last=False | ||
| ) | ||
| result = self(df) | ||
| # TODO(marco): is there a way to do this efficiently without | ||
| # doing 2 sorts? Here we're sorting the dataframe and then | ||
| # again calling `sort_indices`. We can't use the same trick | ||
| # we use in pandas as PyArrow arrays are immutable. |
Member
There was a problem hiding this comment.
@MarcoGorelli I feel like I'm missing something 🤔
What is the relation between ArrowExpr.over and cum_sum for lazy backends?
Member
Author
There was a problem hiding this comment.
sure, thanks for asking
in nw.LazyFrame, cum_sum must be followed by over - there's some examples here #2132 (comment)
dangotbanned
reviewed
Mar 5, 2025
dangotbanned
reviewed
Mar 5, 2025
10 tasks
…by is not specified
MarcoGorelli
pushed a commit
that referenced
this pull request
Mar 7, 2025
* chore(typing): Add typing for `SparkLikeExpr` properties Porting over (#2051), didn't realize this was delclared twice until (#2132) * chore: fix typing and simplify `SparkLikeExprStringNamespace.to_datetime` Resolves (https://github.com/narwhals-dev/narwhals/actions/runs/13682912007/job/38259412675?pr=2152) * rename function * use single chars in set * fix: Remove timezone offset replacement * test: Adds `test_to_datetime_tz_aware` Resolves #2152 (comment) * test: possibly fix `pyarrow` in ci? Maybe this was just a TZDATA issue locally? https://github.com/narwhals-dev/narwhals/actions/runs/13699734154/job/38310256617?pr=2152 * test: xfail polars `3.8`, fix false positive pyarrow https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310487932?pr=2152 https://github.com/narwhals-dev/narwhals/actions/runs/13699804987/job/38310488783?pr=2152 * test: narrower xfail, tz-less expected? Not even sure what `pyarrow` is doing here https://github.com/narwhals-dev/narwhals/actions/runs/13700021595/job/38311197947?pr=2152 * test: account for `pyarrow` version changes https://github.com/narwhals-dev/narwhals/actions/runs/13700267075/job/38312036397?pr=215 * test: maybe fix `pyspark` https://github.com/narwhals-dev/narwhals/actions/runs/13700361438/job/38312364899?pr=2152 * revert: go back to typing fixes only Addresses #2152 (review) * chore: ignore `format` shadowing #2152 (review) * keep logic the same I hope #2152 (comment) --------- Co-authored-by: Edoardo Abati <29585319+EdAbati@users.noreply.github.com>
Member
Author
|
right, let's go ahead with this, thanks all for reviews! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Demo of this work:
For eager backends, just like now, you can keep using
cum_sumliberally, there are no new restrictions introduced by this PR:You can also optionally specify
_order_by(for now private, whilst we build out the functionality):You can also partition by a column, but for pandas there's the usual limitation that only elementary expressions are supported:
And now, for the new functionality which this unlocks. Here is where we start enabling new things. For lazy backends, the above is supported, but specifying
_order_byis required. Example using SQLFrame:The implementation for spark-like looks like this:
https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L536-L550
then
overjust applies that window function with the given window:https://github.com/MarcoGorelli/narwhals/blob/b9d4529a5756ef178e0b89cf244e786c63d2a0c8/narwhals/_spark_like/expr.py#L496-L506
At the Narwhals level, we enforce that, for lazyframes, window functions like
cum_summust be immediately followed byoverwith_order_byspecifiedWe should be able to adapt this fairly straightforwardly to also cover:
All of these should be supportable immediately for SQLFrame / PySpark / Polars Lazy. Missing lazy backends are: