@dangotbanned dangotbanned commented May 10, 2025

Will close #2526

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@dangotbanned dangotbanned added the enhancement New feature or request label May 10, 2025
@dangotbanned dangotbanned changed the title from "feat(DRAFT): Adds Expr.first()" to "feat(DRAFT): Adds (Expr|Series).first()" May 10, 2025
@dangotbanned
Member Author

dangotbanned commented May 10, 2025

Anyone feel free to hop on this - just thought I'd get something up for every backend quickly 🙂

Lack of coverage is expected for now (https://github.com/narwhals-dev/narwhals/actions/runs/14948882535/job/41995794107)

Comment on lines +810 to +819
Examples:
>>> import polars as pl
>>> import narwhals as nw
>>>
>>> s_native = pl.Series([1, 2, 3])
>>> s_nw = nw.from_native(s_native, series_only=True)
>>> s_nw.first()
1
>>> s_nw.filter(s_nw > 5).first() is None
True
Member Author

I don't like the None example, but this was the only way I saw to get a repr 😞

I think it's important to have an example for that case though - since pandas and pyarrow would raise an index error normally
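
To make the contrast concrete without depending on any backend, here is a minimal pure-Python sketch. The helper name `first_or_none` is hypothetical and is not part of narwhals; it only mirrors the behaviour shown in the docstring example, where an empty input yields `None` instead of the `IndexError` that plain positional indexing raises.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(values: Sequence[T]) -> Optional[T]:
    """Return the first element, or None when the sequence is empty.

    Hypothetical helper for illustration only (not a narwhals API).
    """
    return values[0] if len(values) > 0 else None


values = [1, 2, 3]
filtered = [v for v in values if v > 5]  # empty, like `s_nw.filter(s_nw > 5)`

print(first_or_none(values))            # 1
print(first_or_none(filtered) is None)  # True

# Plain positional indexing on the empty result raises instead.
try:
    filtered[0]
except IndexError as exc:
    print(f"IndexError: {exc}")
```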

Comment on lines 130 to 137
results = self(df.drop([token], strict=True))
if meta is not None and meta.last_node is ExprKind.ORDERABLE_AGGREGATION:
    # Orderable aggregations require `order_by` columns and results in a
    # scalar output (well actually in a length 1 series).
    # Therefore we need to broadcast the results to the original size, since
    # `over` is not a length changing operation.
    size = len(df)
    return [s._with_native(pa.repeat(s.item(), size)) for s in results]
Member

ok before this PR, we need to support

df = nw.from_native(pl.DataFrame({'a': [1,2,3,4,None,None,2,None,2], 'b': [1,1,1,1,1,1,2,2,2]})).lazy('duckdb').collect('pandas')
print(df.with_columns(
    nw.col('a').diff().mean().over(order_by='b')
))

which currently raises for both pandas and pyarrow

Member Author

What is the relation between that and this PR?

Member

it requires the same kind of solution

the fact that it's orderable shouldn't be relevant, and it's not enough to just look at the last node

Member Author

@dangotbanned dangotbanned Oct 3, 2025

I've just tried that example out natively in polars

I'm getting the same result from both of these:

pl.col("a").diff().mean()
pl.col("a").diff().mean().over(order_by="b")

If I change the input data in either "a" or "b", the resulting "a" column is always the mean, broadcast to the length of the frame

Note

Update: I didn't test it here, but over does have an impact if you use .over(order_by="a")
But the result is still broadcast

Show repro

import polars as pl

import narwhals as nw

data_orig = {"a": [1, 2, 3, 4, None, None, 2, None, 2], "b": [1, 1, 1, 1, 1, 1, 2, 2, 2]}
data_b_non_asc = {
    "a": [1, 2, 3, 4, None, None, 2, None, 2],
    "b": [1, 5, 1, 1, 1, 1, 2, 2, 2],
}
data_a_varied = {
    "a": [1, 2, 5, 4, None, None, 2, 12, 2],
    "b": [1, 1, 1, 1, 3, 1, 3, 2, 2],
}
datasets = {
    "Original": data_orig,
    "`b` non-ascending": data_b_non_asc,
    "`a` varied": data_a_varied,
}

diff = pl.col("a").diff()
diff_mean = diff.mean()
diff_mean_order_b = diff_mean.over(order_by="b")

native = pl.LazyFrame(data_orig)

with pl.Config(tbl_hide_dataframe_shape=True):
    for name, data in datasets.items():
        native = pl.LazyFrame(data)
        underline = "-" * len(name)
        print(name, underline, sep="\n")
        print(diff, native.with_columns(diff).collect(), sep="\n")
        print(diff_mean, native.with_columns(diff_mean).collect(), sep="\n")
        print(
            diff_mean_order_b, native.with_columns(diff_mean_order_b).collect(), sep="\n"
        )

Show output

Original
--------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ null ┆ 1   │
│ null ┆ 1   │
│ null ┆ 2   │
│ null ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
`b` non-ascending
-----------------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 5   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ null ┆ 1   │
│ null ┆ 1   │
│ null ┆ 2   │
│ null ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 5   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 5   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
`a` varied
----------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 1   │
│ 3    ┆ 1   │
│ -1   ┆ 1   │
│ null ┆ 3   │
│ null ┆ 1   │
│ null ┆ 3   │
│ 10   ┆ 2   │
│ -10  ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 3   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 3   │
│ 0.6 ┆ 2   │
│ 0.6 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌───────┬─────┐
│ a     ┆ b   │
│ ---   ┆ --- │
│ f64   ┆ i64 │
╞═══════╪═════╡
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 3   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 3   │
│ -1.75 ┆ 2   │
│ -1.75 ┆ 2   │
└───────┴─────┘
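
The numbers in the "`a` varied" case can be checked by hand. The following is a pure-Python sketch, not how polars or narwhals compute it: it assumes that `order_by` behaves like a stable sort on "b" before `diff`, that `diff` yields null when either neighbour is null, and that `mean` skips nulls. The function names are made up for illustration. Under those assumptions it gives 0.6 without ordering and -1.75 with ordering, both broadcast to the original length.

```python
from statistics import fmean
from typing import Optional, Sequence


def diff(values: Sequence[Optional[int]]) -> list[Optional[int]]:
    """Difference with the previous element; None if either side is missing."""
    if not values:
        return []
    out: list[Optional[int]] = [None]  # the first element has no predecessor
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def mean_ignoring_none(values: Sequence[Optional[int]]) -> Optional[float]:
    """Mean of the non-missing values, or None if there are none."""
    present = [v for v in values if v is not None]
    return fmean(present) if present else None


def diff_mean(
    values: Sequence[Optional[int]], order_by: Optional[Sequence[int]] = None
) -> list[Optional[float]]:
    """Hypothetical stand-in for `col.diff().mean()` with an optional ordering."""
    if order_by is not None:
        # Assumption: a stable sort of the row positions by the ordering column.
        order = sorted(range(len(values)), key=order_by.__getitem__)
        values = [values[i] for i in order]
    result = mean_ignoring_none(diff(values))
    # The aggregate is a single value, so repeat it to keep the original length.
    return [result] * len(values)


a = [1, 2, 5, 4, None, None, 2, 12, 2]
b = [1, 1, 1, 1, 3, 1, 3, 2, 2]

print(diff_mean(a)[0])              # 0.6
print(diff_mean(a, order_by=b)[0])  # -1.75
```

Sorting by "b" moves the 12 and the trailing 2 ahead of the rows where "b" is 3, which changes which neighbours are differenced; that is why only this dataset shows a different value under `over(order_by="b")`.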

Member Author

nw.col('a').diff().mean().over(order_by='b')

@MarcoGorelli was this based on something you've used in polars before?

Member Author

I've had a look through this recent PR:

I was surprised that diff doesn't seem to have any ordering requirements 🤔

Some select bits from it though:

Member

simple example where it makes a difference:

In [13]: df = pl.DataFrame({'a': [1, 2, 3], 'b': [0, 2, 1]})

In [14]: df.with_columns(c=pl.col('a').diff().mean())
Out[14]:
shape: (3, 3)
┌─────┬─────┬──────────────────────────┐
│ a   ┆ b   ┆ c                        │
│ --- ┆ --- ┆ ---                      │
│ i64 ┆ i64 ┆ f64                      │
╞═════╪═════╪══════════════════════════╡
│ 1   ┆ 0   ┆ 1.00000000000000000000e0 │
│ 2   ┆ 2   ┆ 1.00000000000000000000e0 │
│ 3   ┆ 1   ┆ 1.00000000000000000000e0 │
└─────┴─────┴──────────────────────────┘

In [15]: df.with_columns(c=pl.col('a').diff().mean().over(order_by='b'))
Out[15]:
shape: (3, 3)
┌─────┬─────┬───────────────────────────┐
│ a   ┆ b   ┆ c                         │
│ --- ┆ --- ┆ ---                       │
│ i64 ┆ i64 ┆ f64                       │
╞═════╪═════╪═══════════════════════════╡
│ 1   ┆ 0   ┆ 5.00000000000000000000e-1 │
│ 2   ┆ 2   ┆ 5.00000000000000000000e-1 │
│ 3   ┆ 1   ┆ 5.00000000000000000000e-1 │
└─────┴─────┴───────────────────────────┘

Member Author

1.00000000000000000000e0

What were you up to needing this much precision? 😄

Member

@MarcoGorelli MarcoGorelli left a comment

cool, i think i'm finally happy to ship this

thanks both for having got this started!

@dangotbanned
Member Author

@MarcoGorelli I've just been trying out some tests from before I removed the initial lazy support (4618d01)

I think the docstring for first, last needs to be clearer on what is/isn't allowed

import narwhals as nw

data = {"a": [1, 1, 2, 2], "b": ["foo", None, None, "baz"]}
df = nw.from_dict(data, backend="polars")

This is fine

>>> df.select(nw.col("a", "b").first())

┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  shape: (1, 2)   |
|  ┌─────┬─────┐   |
|  │ a   ┆ b   │   |
|  │ --- ┆ --- │   |
|  │ i64 ┆ str │   |
|  ╞═════╪═════╡   |
|  │ 1   ┆ foo │   |
|  └─────┴─────┘   |
└──────────────────┘

This isn't allowed, and the error message points you in the wrong direction:

>>> df.lazy().select(nw.col("a", "b").first()).collect()
InvalidOperationError: Order-dependent expressions are not supported for use in LazyFrame.

Hint: To make the expression valid, use `.over` with `order_by` specified.

For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:

   nw.col('price').cum_sum()

 with:

   nw.col('price').cum_sum().over(order_by='date')
                            ^^^^^^^^^^^^^^^^^^^^^^

See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.
>>> df.with_row_index("i").lazy().select(
    nw.col("a", "b").first().over(order_by="i")
).collect()

ShapeError: Series b, length 1 doesn't match the DataFrame height of 4

If you want expression: col("b").first().over(partition_by: [1], order_by: col("i")) to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

So we have this circular thing where we want order_by, but polars wants first 😂


I do understand you've rejected sort_by (#2534 (comment)), but it does solve this use-case (if we ever venture down that road again)

data = {"a": [1, 1, 2, 2], "b": ["foo", None, None, "baz"]}
df = pl.DataFrame(data).with_row_index("i").sort("i", descending=True)
>>> df.lazy().select(pl.col("a", "b").sort_by("i").first()).collect()

shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ foo │
└─────┴─────┘
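
For readers without polars at hand, the semantics of that `sort_by("i").first()` call can be sketched in plain Python. This is only an analogue under the assumption that "first after sorting by i" means "the row with the smallest i"; the helper `first_sorted_by` is hypothetical and is neither a polars nor a narwhals function.

```python
data = {"a": [1, 1, 2, 2], "b": ["foo", None, None, "baz"]}

# Attach a row index "i", then store the rows in descending order of "i",
# mirroring `.with_row_index("i").sort("i", descending=True)`.
rows = [
    {"i": i, "a": a, "b": b}
    for i, (a, b) in enumerate(zip(data["a"], data["b"]))
]
rows.sort(key=lambda row: row["i"], reverse=True)


def first_sorted_by(rows, key, columns):
    """Pick the row that would come first after sorting by `key`.

    Hypothetical helper for illustration only.
    """
    earliest = min(rows, key=lambda row: row[key])
    return {column: earliest[column] for column in columns}


# The physically first row is the one with the largest "i" ...
print({column: rows[0][column] for column in ("a", "b")})  # {'a': 2, 'b': 'baz'}
# ... but ordering by "i" recovers the logically first row.
print(first_sorted_by(rows, "i", ["a", "b"]))  # {'a': 1, 'b': 'foo'}
```

The point is that the result depends only on the ordering column, not on how the rows happen to be stored, which is the property a lazy backend needs.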

@MarcoGorelli
Member

I think this might be a bug in Polars, will check, but thanks for spotting it

@dangotbanned
Member Author

dangotbanned commented Oct 3, 2025

I think this might be a bug in Polars, will check, but thanks for spotting it

Thanks marco!

Just to be 100% clear, I'm not trying to relitigate sort_by;
(#2528 (comment)) is just highlighting a UX issue

I am quite keen to see a proposed API for {min,max}_by and maybe discuss in another issue (#2526 (comment)) 🙂

@MarcoGorelli
Member

Reported here: pola-rs/polars#24756

It's a fairly simple workaround, fortunately (just use pl.repeat(1, pl.len()) instead of pl.lit(1) in over)

@MarcoGorelli MarcoGorelli merged commit 053390d into main Oct 4, 2025
31 of 33 checks passed
@MarcoGorelli MarcoGorelli deleted the expr-first branch October 4, 2025 10:33

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[Enh]: add {Expr,Series}.{first,last}

4 participants