
feat: add rank for Lazy backends#2310

Merged
MarcoGorelli merged 7 commits into narwhals-dev:main from raisadz:feat/rank_lazy_backends
Mar 29, 2025

Conversation

@raisadz
Contributor

@raisadz raisadz commented Mar 28, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Only min and dense methods are implemented for DuckDB and PySpark without .over().
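For reference, the semantics of the two implemented methods can be sketched in plain Python (a hypothetical helper for illustration, not the narwhals implementation):

```python
def rank(values, method="min"):
    """Rank non-null values, leaving nulls as None.

    'min'   : tied values all get the lowest position among the ties.
    'dense' : ranks increase by exactly 1 between distinct values.
    """
    non_null = [v for v in values if v is not None]
    distinct = sorted(set(non_null))
    if method == "dense":
        # dense: rank is 1 + number of distinct smaller values
        pos = {v: i + 1 for i, v in enumerate(distinct)}
    else:
        # min: rank is 1 + number of smaller values (ties included)
        srt = sorted(non_null)
        pos = {v: srt.index(v) + 1 for v in distinct}
    return [pos[v] if v is not None else None for v in values]

print(rank([3, 1, 1, None, 2], method="min"))    # [4, 1, 1, None, 3]
print(rank([3, 1, 1, None, 2], method="dense"))  # [3, 1, 1, None, 2]
```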

@raisadz raisadz marked this pull request as ready for review March 28, 2025 19:29
@FBruzzesi
Member

Hey @raisadz thanks for the PR. I didn't have the time to look at this, I just wanted to drop a comment and ping @marvinl803 who also took a look at rank and maybe can share some insights for the remaining methods 🙌🏼

Member

@MarcoGorelli MarcoGorelli left a comment


thanks for working on this 🙌 !

it's occurred to me that we class rank as ExprKind.WINDOW and that, after narwhals.stable.v1, it won't be supported to use rank if not followed by over(order_by=...)

Here it works because the import is import narwhals.stable.v1 as nw, but would raise for import narwhals as nw

🤔 gonna have to think about a solution here, not super-sure

maybe, just maybe, we may want to introduce nw.rank instead of Expr.rank, because then that would compose more naturally with over. And for eager backends, nw.col('a').rank() does the same thing as nw.rank().over('a')

🤔 i'm not totally sure, will do a bit more thinking on this one, but i think we can come up with something

@MarcoGorelli
Member

Having slept on this, I'd suggest:

nw.rank().over(order_by='a')  # equivalent to pl.col('a').rank()

An alternative could be nw.rank('a', partition_by='b'), but then what happens if a user writes nw.rank('a', partition_by='b').over('b')? I think in general, any expression that uses over internally is problematic

The only solution I can currently think of is: nw.rank().over(order_by='a'), and deprecate Expr.rank (but keep it around in stable.v1 - this is especially important here because, IIRC, we already have downstream libraries using nw.Expr.rank)

@dangotbanned
Member

#2310 (comment)

@MarcoGorelli I haven't looked into this more than your two comments, so apologies in advance if this seems too simplified of a view.

IMO, the current narwhals internal implementation of the polars API shouldn't lead the decision on the narwhals public API.
If that is the case here, the general concept (if you're unfamiliar) is called a leaky abstraction.


I'm not sure how likely this would be, but another concern might be - what if polars adds a pl.rank in the future that has different behaviour?
The list of sugar functions is still growing

@MarcoGorelli
Member

agree, i'm also not keen on implementation constraints determining the api, ideally it should be the opposite...

will think about this more, but maybe we could get to:

✅ allowed:

  • nw.col('a').rank()
  • nw.col('a').rank().over('b')

❌ disallowed:

  • nw.col('a').rank().over('b', order_by='c')

Then there's also:

  • nw.col('a').rank().abs().over('b', order_by='c')

that requires a decision

This involves more refactors, but...it's probably worth it
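As a rough illustration of what the allowed `nw.col('a').rank().over('b')` form computes, here is a pure-Python sketch of per-partition min-rank (not narwhals code, just the semantics):

```python
from collections import defaultdict

def rank_over(values, partition_by):
    """Min-rank `values` separately within each partition key."""
    groups = defaultdict(list)
    for v, key in zip(values, partition_by):
        groups[key].append(v)
    # sorted copy of each partition; index of first occurrence = min rank - 1
    sorted_groups = {k: sorted(vs) for k, vs in groups.items()}
    return [sorted_groups[key].index(v) + 1
            for v, key in zip(values, partition_by)]

# rank column 'a' within groups of column 'b'
print(rank_over([5, 3, 4, 1], ["x", "x", "y", "y"]))  # [2, 1, 2, 1]
```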

@dangotbanned
Member

#2310 (comment)

@MarcoGorelli that seems consistent with the pl.Expr.rank docs 👍

I don't know if this already plays a factor in the narwhals API.
Do you think about aiming to cover sequences of expressions that are demonstrated in the polars docs?
That seems like a good compass to me - even if we can't always get there

@MarcoGorelli
Member

MarcoGorelli commented Mar 29, 2025

Yeah i think it's good to aim for as much expressivity as Polars offers, with the restriction that for LazyFrame we only allow relational expressions - so, lf.select(nw.col('a').drop_nulls(), nw.col('b').drop_nulls()) will never be allowed, as it may result in tuples which were never part of the original relation.
The rest should in theory just be limited by how clear our mental model and implementation is
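The `drop_nulls` restriction can be made concrete with plain Python lists (a sketch, not narwhals code): dropping nulls independently per column can leave the columns with different lengths, so the result is no longer a relation.

```python
a = [1, None, 3]
b = [None, 2, None]

a_dropped = [v for v in a if v is not None]  # [1, 3]
b_dropped = [v for v in b if v is not None]  # [2]

# The two "columns" now have different lengths, so there is no
# well-defined set of rows any more - selecting both at once
# cannot produce a valid relation.
print(len(a_dropped), len(b_dropped))  # 2 1
```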

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @raisadz - I think we can ship this, we'll sort out .over behaviour later, just left a minor comment

Comment on lines +715 to +718
sql = (
    f"CASE WHEN {_input} IS NULL THEN NULL "
    f"ELSE {func_name}() OVER ({order_by_sql}) END"
)
Member


can we flip this round (case when not-null), like you did for the spark-like case?
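A minimal sketch of the flipped form the comment asks for (variable values are made up for illustration; this is not the merged code):

```python
# Stand-in values for the variables used in the snippet above.
_input = "a"
func_name = "RANK"
order_by_sql = "ORDER BY a"

# Flipped: test for NOT NULL first and rely on the implicit ELSE NULL
# of SQL's CASE expression, mirroring the spark-like branch.
sql = (
    f"CASE WHEN {_input} IS NOT NULL "
    f"THEN {func_name}() OVER ({order_by_sql}) END"
)
print(sql)
# CASE WHEN a IS NOT NULL THEN RANK() OVER (ORDER BY a) END
```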

@MarcoGorelli
Member

i'll address #2310 (comment) now, gonna start by getting this in though

@MarcoGorelli MarcoGorelli merged commit 8a79617 into narwhals-dev:main Mar 29, 2025
28 checks passed
dangotbanned added a commit that referenced this pull request Mar 30, 2025
@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Mar 31, 2025
MarcoGorelli added a commit that referenced this pull request Apr 4, 2025
* feat(typing): Add `FromNative` protocol

* chore(typing): Add `FromNative` to `CompliantSeries`

- Adding `._is_native` made `TypeVar` invariant
- Realistically, it always was, but underspecified

* feat: Implement for `(Arrow|PandasLike)Series`

* feat: Implement for `PolarsSeries`

+ get some coverage

* chore: `ArrowSeries` coverage

* chore: `PandasLikeSeries` partial coverage

* ignore coverage for now ...

* feat(typing): Add `CompliantDataFrame.from_native`

* feat: Implement for `ArrowDataFrame`

Also coverage for `ArrowSeries`

* feat: Implement for `PandasLikeDataFrame`

Loads of coverage for both `PandasLike`

* feat: Implement for `PolarsDataFrame`

* refactor: Found one more

* chore(typing): Fix missing `SQLExpression` ignore

Was missed in (#2310)

* feat: Implement `EagerNamespace.from_native`

- `Polars*` will also need to handle `LazyFrame`
- `Lazy*` has other constraints

* feat: Add `Polars(Namespace|LazyFrame).from_native`

Probably need to add a `LazyNamespace` protocol for `LazyOnly`

* chore: Ignore coverage `PandasLikeDataFrame._is_native`

Nowhere to use it yet, current stuff uses the more precise `self.native.__class__`

* feat: Add all `CompliantLazyFrame.from_native`

* feat: Add `LazyNamespace.from_native`

* refactor: Get some lazy coverage

https://github.com/narwhals-dev/narwhals/actions/runs/14157697512/job/39659084059?pr=2315

* refactor: More `polars` coverage

https://github.com/narwhals-dev/narwhals/actions/runs/14158424987/job/39660662342?pr=2315

* refactor: reuse `is_spark_like_dataframe`

* Update narwhals/_compliant/namespace.py

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

---------

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

implement remaining order-dependent lazy operations

4 participants