feat(expr-ir): Finish* implementing ArrowExpr #3325

Merged
dangotbanned merged 216 commits into oh-nodes from expr-ir/plz-finish-arrow-expr
Dec 14, 2025

@dangotbanned dangotbanned commented Nov 25, 2025

Related issues

Tracking

Description

This PR is a bit of a mixed bag.

I've really tried to focus on getting an implementation for all of the methods in the Expr namespace.
Man, there are a lot of them now 😅

I have a pretty long list, so things are grouped, with highlights on what was most interesting to me.

Tip

✨ - New feature (either for pyarrow as a backend or narwhals itself)
💾 - Refactor from main (particularly trying to avoid unconditional-dependence on numpy)

Show top-level functions

Show Expr methods

  • ceil
  • clip, clip_lower, clip_upper
  • drop_nulls
  • exp
  • fill_nan
  • fill_null
  • fill_null_with_strategy (💾)
  • floor
  • gather_every (deprecated)
  • hist (✨)
    • Almost all numpy usage avoided
    • Returns as struct when needed
  • is_{duplicated,unique} (💾, ✨)
    • Supports over(*partition_by)
  • is_not_null (✨)
  • is_not_nan (✨)
  • {kurtosis,skew} (💾, ✨)
    • Supports over(*partition_by) and group_by
  • rolling_expr (💾, cc francesco)
    • rolling_sum
    • rolling_mean
    • rolling_var
    • rolling_std
  • log
  • map_batches(is_elementwise=..., returns_scalar=...) (✨)
  • mode(keep: ModeKeepStrategy) (💾)
  • replace_strict
  • replace_strict(default=...)
  • round
  • sample (deprecated)
    • Avoids numpy when with_replacement=True
  • sqrt
  • unique

Show Expr*Namespace methods

  • cat.get_categories (💾)
  • list.contains (✨)
    • Also supports scalar Expr
  • list.get
  • list.join (✨)
  • list.len
  • list.unique (✨)
  • str.contains
  • str.len_chars
  • str.to_{upper,lower,title}case
  • str.replace(value: IntoExpr) (✨)
  • str.replace_all(value: IntoExpr) (✨)
  • str.slice
    • {head,tail} are sugar at narwhals-level
  • str.split
  • str.{starts,ends}_with (💾)
  • str.strip_chars
  • str.zfill (💾)
  • struct.field
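
The note that `{head,tail}` are sugar over `str.slice` can be sketched in plain Python. This is an illustration on single values rather than the pyarrow code in this PR: `str_slice`, `str_head` and `str_tail` are hypothetical names, and the sketch assumes polars-style `slice(offset, length)` semantics with a positive `n`.

```python
from __future__ import annotations


def str_slice(value: str | None, offset: int, length: int | None = None) -> str | None:
    """Stand-in for a backend `str.slice` applied to one value."""
    if value is None:
        return None  # nulls propagate
    if offset < 0:
        # Negative offsets count from the end, clamped to the start.
        offset = max(len(value) + offset, 0)
    stop = None if length is None else offset + length
    return value[offset:stop]


def str_head(value: str | None, n: int) -> str | None:
    """First `n` characters: a slice starting at 0 with length `n`."""
    return str_slice(value, 0, n)


def str_tail(value: str | None, n: int) -> str | None:
    """Last `n` characters: a slice starting at `-n` with no length."""
    return str_slice(value, -n)
```

Keeping the sugar at the narwhals level means a backend only needs one slicing primitive.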

Show Series methods

*Not a complete list*; a few miscellaneous others were added because they are used in the test suite on `main`:

  • All binary ops
  • cum_*
  • explode (✨)
  • fill_null(_with_strategy)
  • gather_every
  • rolling_* (💾, which led to)
    • diff(n=...) (✨)
    • shift(fill_value=...) (✨)
      • Used in an earlier version (f0d182d)
  • sample (💾)
  • zip_with
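
One common way that `rolling_*` can lead to `shift` and `diff` is to build a rolling sum from a cumulative sum and a shifted copy of it. The sketch below shows that idea on plain Python lists. It is an assumption about the general technique, not the code in this PR (the list above notes that `shift(fill_value=...)` was only used in an earlier version), and it assumes null-free input with a minimum sample count equal to the window size.

```python
from __future__ import annotations

from itertools import accumulate


def shift(values: list, n: int, fill_value=None) -> list:
    """Shift by `n` positions, filling vacated slots with `fill_value`."""
    if n == 0:
        return list(values)
    size = len(values)
    pad = [fill_value] * min(abs(n), size)
    if n > 0:
        return pad + list(values[: size - len(pad)])
    return list(values[len(pad):]) + pad


def diff(values: list, n: int = 1) -> list:
    """Difference between each element and the one `n` positions earlier."""
    return [
        None if cur is None or prev is None else cur - prev
        for cur, prev in zip(values, shift(values, n))
    ]


def rolling_sum(values: list, window_size: int) -> list:
    """Rolling sum as `cum_sum - shift(cum_sum, window_size, fill_value=0)`."""
    cum = list(accumulate(values))
    lagged = shift(cum, window_size, fill_value=0)
    return [
        # Windows that are not yet full produce null.
        None if i < window_size - 1 else total - before
        for i, (total, before) in enumerate(zip(cum, lagged))
    ]
```

Here `fill_value=0` is what lets the first complete window subtract nothing from its cumulative total.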

Show Series*Namespace methods

Choosing to leave most of these out for now; only adding things that are unique to Series and/or are depended on by the current implementation:

  • struct.field
  • struct.unnest (✨, used in Series.hist)
  • struct.schema (✨, used in unnest)

Show DataFrame methods

Show internal implementations

While working on the above, I found some functions easier to express when composed of other parts of the polars API, which narwhals doesn't support yet.
In some cases I later factored out their usage, but it was a fun experiment:

  • str.find (44a9d1e)
  • str.pad_start
  • str.splitn
  • str.join
  • implode
  • eq_missing

Show missing vs main

  • Expr.{head,tail} (deprecated)
    • could add slice?
  • is_close
  • Expr.str.to_date(time)
  • Expr.dt (whole namespace)

A general theme throughout is that function implementations go in <backend>.functions.py.
Series, Expr (and Scalar) can then use them directly, or when composing other functions, without narwhals wrapper overhead.
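
A minimal sketch of that layering, using plain Python lists in place of pyarrow arrays. Every name here (`fill_null`, `drop_nulls`, `Series`, `Expr`, `col`) is a hypothetical stand-in rather than the classes in this PR. The point is only that the logic lives in module-level functions on native data and the wrappers stay thin.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


# Stand-ins for what would live in `<backend>.functions.py`:
# plain functions that take and return native data.
def is_not_null(values: list) -> list[bool]:
    return [v is not None for v in values]


def fill_null(values: list, value) -> list:
    return [value if v is None else v for v in values]


def drop_nulls(values: list) -> list:
    # Composes another function directly, with no wrapper objects involved.
    return [v for v, keep in zip(values, is_not_null(values)) if keep]


@dataclass(frozen=True)
class Series:
    """Eager wrapper: delegates straight to the shared functions."""

    native: list

    def fill_null(self, value) -> Series:
        return Series(fill_null(self.native, value))

    def drop_nulls(self) -> Series:
        return Series(drop_nulls(self.native))


@dataclass(frozen=True)
class Expr:
    """Deferred wrapper: evaluated later against a frame (a dict of columns)."""

    evaluate: Callable[[dict[str, list]], list]

    def fill_null(self, value) -> Expr:
        return Expr(lambda frame: fill_null(self.evaluate(frame), value))


def col(name: str) -> Expr:
    return Expr(lambda frame: frame[name])
```

Both `Series([1, None]).fill_null(0)` and `col("a").fill_null(0)` end up calling the same `fill_null` function, so a fix or optimization there applies to both.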

What's next?

Throughout this PR, I've been fighting pretty hard against the urge to rewrite how this version of CompliantExpr works.

Most of what is in (https://github.com/narwhals-dev/narwhals/blob/e68d9ab9b12562848602e7a0d2f7baf80bc0576a/narwhals/_plan/arrow/expr.py) is general visitor logic which would be a slog to repeat in every backend.
It works and I'm finding it easy to reason about, but I 100% plan to put in some more design work.
Just needed to endure the pain of doing it the long way, and get a feel for where things work and where they don't 🙂
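
To illustrate what "general visitor logic" might look like when kept out of the backends, here is a small hedged sketch. The node classes, `ExprVisitor` and `ListBackend` are all hypothetical and far simpler than the linked `expr.py`. The sketch only shows how traversal and dispatch can sit in a shared base class, leaving a backend to supply primitives.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class BinaryAdd:
    left: object
    right: object


class ExprVisitor:
    """Backend-agnostic traversal: dispatches on the node's class name."""

    def visit(self, node, frame: dict[str, list]):
        method = getattr(self, f"visit_{type(node).__name__.lower()}", None)
        if method is None:
            raise NotImplementedError(type(node).__name__)
        return method(node, frame)

    def visit_binaryadd(self, node: BinaryAdd, frame: dict[str, list]):
        # The recursion is shared; only `add` is backend-specific.
        return self.add(self.visit(node.left, frame), self.visit(node.right, frame))

    def add(self, left, right):
        raise NotImplementedError


class ListBackend(ExprVisitor):
    """Toy backend whose native column type is a Python list."""

    def visit_column(self, node: Column, frame: dict[str, list]) -> list:
        return list(frame[node.name])

    def visit_literal(self, node: Literal, frame: dict[str, list]) -> list:
        # Broadcast the scalar to the frame's row count.
        n_rows = len(next(iter(frame.values()), []))
        return [node.value] * n_rows

    def add(self, left: list, right: list) -> list:
        return [a + b for a, b in zip(left, right)]
```

With this shape, `ListBackend().visit(BinaryAdd(Column("a"), Literal(10)), {"a": [1, 2, 3]})` walks the tree once, and a second backend would only need to reimplement the three leaf methods.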

LogicalPlan is the next (likely) big item I have my eye on

Quite odd behavior for scalar lol
Will come back to this later to shrink
Discovered while adding `kurtosis` test which had an empty series
Adding the `skew` test revealed this "edge case"
The rest will allow them to be used in `group_by`
Indirectly adds support for `over` too, but haven't added tests yet
- Still needs tests
- Also unsure what the scalar behavior should be for `mode_all`
I wanna try rewriting this without `numpy` after getting the tests in place
Got quite a few more ideas to experiment with
Each of these is expensive + this version is simpler
Managed to write it with one less `if_else`, but the readability suffered so this will do
I've tried adding the `not_implemented` 3 times now and kept forgetting why it wasn't there yet
`ArrowSeries.struct.unnest` depends on this for backcompat
I'd rather this was covered in all cases
TIL: `pyarrow.compute.and_not` exists
@dangotbanned dangotbanned marked this pull request as ready for review December 14, 2025 14:04
@dangotbanned dangotbanned merged commit 1618650 into oh-nodes Dec 14, 2025
37 of 38 checks passed
@dangotbanned dangotbanned deleted the expr-ir/plz-finish-arrow-expr branch December 14, 2025 14:23
dangotbanned added a commit that referenced this pull request Dec 14, 2025

Labels

internal, pyarrow (issue is related to pyarrow backend)
