feat: Adds {Expr,Series}.{first,last}
#2528
Towards (#2526)
- Less sure about this one - `head(1)` also seemed like an option
Expr.first() (Expr|Series).first()
Anyone feel free to hop on this - just thought I'd get something up for every backend quickly. Lack of coverage is expected for now (https://github.com/narwhals-dev/narwhals/actions/runs/14948882535/job/41995794107)
Examples:
    >>> import polars as pl
    >>> import narwhals as nw
    >>>
    >>> s_native = pl.Series([1, 2, 3])
    >>> s_nw = nw.from_native(s_native, series_only=True)
    >>> s_nw.first()
    1
    >>> s_nw.filter(s_nw > 5).first() is None
    True
I don't like the None example, but this was the only way I saw to get a repr
I think it's important to have an example for that case though - since pandas and pyarrow would raise an index error normally
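To make the empty-input case concrete, here is a minimal plain-Python sketch of the behaviour the docstring example describes: return the first (or last) element, and `None` instead of raising when there is nothing to return. The helper names `first_or_none` and `last_or_none` are hypothetical and are not part of narwhals; a plain list stands in for a native series, where positional indexing into an empty container raises `IndexError`.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(values: Sequence[T]) -> Optional[T]:
    """Return the first element, or None when the sequence is empty."""
    # Hypothetical helper: guards the positional lookup that would
    # otherwise raise IndexError on an empty sequence.
    return values[0] if len(values) > 0 else None


def last_or_none(values: Sequence[T]) -> Optional[T]:
    """Return the last element, or None when the sequence is empty."""
    return values[-1] if len(values) > 0 else None


data = [1, 2, 3]
filtered = [v for v in data if v > 5]  # empty, like s_nw.filter(s_nw > 5)

first_value = first_or_none(data)
first_of_empty = first_or_none(filtered)

# Unguarded indexing on the empty list raises, which is the case to avoid.
try:
    filtered[0]
    raised = False
except IndexError:
    raised = True
```

The trade-off is that `None` is ambiguous when the data itself can contain nulls, which is one reason an explicit example for the empty case is useful.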
Still need to add `dask`, `duckdb` equivalent of (bd4ab89)
results = self(df.drop([token], strict=True))
if meta is not None and meta.last_node is ExprKind.ORDERABLE_AGGREGATION:
    # Orderable aggregations require `order_by` columns and results in a
    # scalar output (well actually in a length 1 series).
    # Therefore we need to broadcast the results to the original size, since
    # `over` is not a length changing operation.
    size = len(df)
    return [s._with_native(pa.repeat(s.item(), size)) for s in results]
ok before this PR, we need to support
df = nw.from_native(pl.DataFrame({'a': [1,2,3,4,None,None,2,None,2], 'b': [1,1,1,1,1,1,2,2,2]})).lazy('duckdb').collect('pandas')
print(df.with_columns(
    nw.col('a').diff().mean().over(order_by='b')
))
which currently raises for both pandas and pyarrow
What is the relation between that and this PR?
it requires the same kind of solution
the fact that it's orderable shouldn't be relevant, and it's not enough to just look at the last node
I've just tried that example out natively in polars
I'm getting the same result from both of these:
pl.col("a").diff().mean()
pl.col("a").diff().mean().over(order_by="b")
If I change the input data in either "a" or "b", the result of "a" is always the mean broadcast to length
Note
Update: I didn't test it here, but over does have an impact if you use .over(order_by="a")
But the result is still broadcast
Show repro
import polars as pl
import narwhals as nw

data_orig = {"a": [1, 2, 3, 4, None, None, 2, None, 2], "b": [1, 1, 1, 1, 1, 1, 2, 2, 2]}
data_b_non_asc = {
    "a": [1, 2, 3, 4, None, None, 2, None, 2],
    "b": [1, 5, 1, 1, 1, 1, 2, 2, 2],
}
data_a_varied = {
    "a": [1, 2, 5, 4, None, None, 2, 12, 2],
    "b": [1, 1, 1, 1, 3, 1, 3, 2, 2],
}
datasets = {
    "Original": data_orig,
    "`b` non-ascending": data_b_non_asc,
    "`a` varied": data_a_varied,
}
diff = pl.col("a").diff()
diff_mean = diff.mean()
diff_mean_order_b = diff_mean.over(order_by="b")
native = pl.LazyFrame(data_orig)
with pl.Config(tbl_hide_dataframe_shape=True):
    for name, data in datasets.items():
        native = pl.LazyFrame(data)
        underline = "-" * len(name)
        print(name, underline, sep="\n")
        print(diff, native.with_columns(diff).collect(), sep="\n")
        print(diff_mean, native.with_columns(diff_mean).collect(), sep="\n")
        print(
            diff_mean_order_b, native.with_columns(diff_mean_order_b).collect(), sep="\n"
        )
Show output
Original
--------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ null ┆ 1   │
│ null ┆ 1   │
│ null ┆ 2   │
│ null ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
`b` non-ascending
-----------------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 5   │
│ 1    ┆ 1   │
│ 1    ┆ 1   │
│ null ┆ 1   │
│ null ┆ 1   │
│ null ┆ 2   │
│ null ┆ 2   │
│ null ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 5   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 1.0 ┆ 1   │
│ 1.0 ┆ 5   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
│ 1.0 ┆ 2   │
└─────┴─────┘
`a` varied
----------
col("a").diff([dyn int: 1])
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ i64  ┆ i64 │
╞══════╪═════╡
│ null ┆ 1   │
│ 1    ┆ 1   │
│ 3    ┆ 1   │
│ -1   ┆ 1   │
│ null ┆ 3   │
│ null ┆ 1   │
│ null ┆ 3   │
│ 10   ┆ 2   │
│ -10  ┆ 2   │
└──────┴─────┘
col("a").diff([dyn int: 1]).mean()
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞═════╪═════╡
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 3   │
│ 0.6 ┆ 1   │
│ 0.6 ┆ 3   │
│ 0.6 ┆ 2   │
│ 0.6 ┆ 2   │
└─────┴─────┘
col("a").diff([dyn int: 1]).mean().over(partition_by: [dyn int: 1], order_by: col("b"))
┌───────┬─────┐
│ a     ┆ b   │
│ ---   ┆ --- │
│ f64   ┆ i64 │
╞═══════╪═════╡
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 3   │
│ -1.75 ┆ 1   │
│ -1.75 ┆ 3   │
│ -1.75 ┆ 2   │
│ -1.75 ┆ 2   │
└───────┴─────┘
nw.col('a').diff().mean().over(order_by='b')
@MarcoGorelli was this based on something you've used in polars before?
I've had a look through this recent PR:
I was surprised that diff doesn't seem to have any ordering requirements
Some select bits from it though:
FunctionWindowExprOutputOrder
- All the rules on aggregations
- Note-worthy: first, last and implode (#2660)
simple example where it makes a difference:
In [13]: df = pl.DataFrame({'a': [1, 2, 3], 'b': [0, 2, 1]})
In [14]: df.with_columns(c=pl.col('a').diff().mean())
Out[14]:
shape: (3, 3)
┌─────┬─────┬──────────────────────────┐
│ a   ┆ b   ┆ c                        │
│ --- ┆ --- ┆ ---                      │
│ i64 ┆ i64 ┆ f64                      │
╞═════╪═════╪══════════════════════════╡
│ 1   ┆ 0   ┆ 1.00000000000000000000e0 │
│ 2   ┆ 2   ┆ 1.00000000000000000000e0 │
│ 3   ┆ 1   ┆ 1.00000000000000000000e0 │
└─────┴─────┴──────────────────────────┘
In [15]: df.with_columns(c=pl.col('a').diff().mean().over(order_by='b'))
Out[15]:
shape: (3, 3)
┌─────┬─────┬───────────────────────────┐
│ a   ┆ b   ┆ c                         │
│ --- ┆ --- ┆ ---                       │
│ i64 ┆ i64 ┆ f64                       │
╞═════╪═════╪═══════════════════════════╡
│ 1   ┆ 0   ┆ 5.00000000000000000000e-1 │
│ 2   ┆ 2   ┆ 5.00000000000000000000e-1 │
│ 3   ┆ 1   ┆ 5.00000000000000000000e-1 │
└─────┴─────┴───────────────────────────┘
1.00000000000000000000e0
What were you up to needing this much precision?
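The two results in the simple example above (1.0 without ordering, 0.5 with `order_by='b'`) can be reproduced by hand. The following is a plain-Python sketch of the semantics only, assuming that `over(order_by=...)` sorts the rows by the ordering column before `diff` is applied and that the scalar mean is then broadcast back to the frame length. The helper names are hypothetical and this is not how Polars or narwhals implement it.

```python
from statistics import mean


def diff(values):
    """First-order difference; None where either neighbour is missing."""
    out = [None]
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def mean_ignoring_none(values):
    """Mean of the non-null entries, or None if there are none."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


def diff_mean(a, order_by=None):
    """Sketch of `col.diff().mean()` with an optional ordering column."""
    if order_by is not None:
        # Stable sort of row positions by the ordering column.
        order = sorted(range(len(a)), key=order_by.__getitem__)
        a = [a[i] for i in order]
    result = mean_ignoring_none(diff(a))
    # The aggregation is a scalar, so broadcast it to the original length.
    return [result] * len(a)


a = [1, 2, 3]
b = [0, 2, 1]
unordered = diff_mean(a)            # diff is [None, 1, 1]
ordered = diff_mean(a, order_by=b)  # rows become [1, 3, 2], diff is [None, 2, -1]
```

Under the same assumptions, the "`a` varied" dataset from the earlier repro gives 0.6 without ordering and -1.75 with `order_by="b"`, matching the printed output.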
@MarcoGorelli I've just been trying out some tests from before I removed the initial lazy support (4618d01) I think the docstring for
import narwhals as nw
data = {"a": [1, 1, 2, 2], "b": ["foo", None, None, "baz"]}
df = nw.from_dict(data, backend="polars")
This is fine
>>> df.select(nw.col("a", "b").first())
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  shape: (1, 2)   |
|  ┌─────┬─────┐   |
|  │ a   ┆ b   │   |
|  │ --- ┆ --- │   |
|  │ i64 ┆ str │   |
|  ╞═════╪═════╡   |
|  │ 1   ┆ foo │   |
|  └─────┴─────┘   |
└──────────────────┘
This isn't allowed, and the error message points you in the wrong direction:
>>> df.lazy().select(nw.col("a", "b").first()).collect()
InvalidOperationError: Order-dependent expressions are not supported for use in LazyFrame.
Hint: To make the expression valid, use `.over` with `order_by` specified.
For example, if you wrote `nw.col('price').cum_sum()` and you have a column
`'date'` which orders your data, then replace:
    nw.col('price').cum_sum()
with:
    nw.col('price').cum_sum().over(order_by='date')
                             ^^^^^^^^^^^^^^^^^^^^^^
See https://narwhals-dev.github.io/narwhals/concepts/order_dependence/.
>>> df.with_row_index("i").lazy().select(
    nw.col("a", "b").first().over(order_by="i")
).collect()
ShapeError: Series b, length 1 doesn't match the DataFrame height of 4
If you want expression: col("b").first().over(partition_by: [1], order_by: col("i")) to be broadcasted, ensure it is a scalar (for instance by adding '.first()').
So we have this circular thing where we want
I do understand you've rejected
data = {"a": [1, 1, 2, 2], "b": ["foo", None, None, "baz"]}
df = pl.DataFrame(data).with_row_index("i").sort("i", descending=True)
>>> df.lazy().select(pl.col("a", "b").sort_by("i").first()).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ foo │
└─────┴─────┘
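As a rough illustration of what an order-aware `first` is expected to return in the Polars snippet above, here is a plain-Python sketch. It assumes the intended semantics are "the value on the row with the smallest ordering key"; the helper `first_ordered` is hypothetical and is not narwhals or Polars API. The lists mirror the frame after `.with_row_index("i").sort("i", descending=True)`.

```python
def first_ordered(values, order_by):
    """Return the value on the row with the smallest ordering key.

    Hypothetical helper: returns None for empty input instead of raising.
    """
    if not values:
        return None
    # Position of the smallest key; ties resolve to the earliest row.
    idx = min(range(len(values)), key=order_by.__getitem__)
    return values[idx]


# Rows as they appear after sorting by "i" descending.
i = [3, 2, 1, 0]
a = [2, 2, 1, 1]
b = ["baz", None, None, "foo"]

result = {"a": first_ordered(a, i), "b": first_ordered(b, i)}
```

Here `result` holds `1` for `a` and `"foo"` for `b`, the same row that `sort_by("i").first()` selects in the output above, regardless of the physical row order.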
I think this might be a bug in Polars, will check, but thanks for spotting it
Thanks marco! Just to be 100% clear, I'm not trying to relitigate I am quite keen to see a proposed API for
Reported here: pola-rs/polars#24756
It's a fairly simple workaround, fortunately (just use

Will close #2526
What type of PR is this? (check all applicable)
Related issues
- {Expr,Series}.{first,last} #2526
- blocked assisted by chore: Simplify PandasLikeGroupBy #2680, {Expr,Series}.{first,last} #2528 (comment)
Checklist
If you have comments or can explain your changes, please do so below
- polars.Series.first 1.10.0
- Series.first
- Expr usage rules ({Expr,Series}.{first,last} #2526 (comment)) ({Expr,Series}.{first,last} #2528 (comment))
- group_by().agg(...)
- polars
- pyarrow (thread)
- pandas
- pandas apply special case (thread)
- apply is a trade-off for consistency
- apply warning
- (Expr|Series).last()
- Hopefully, will be simple once all the quirks of first() are handled (it was)
- narwhals-level methods
- first
- last
- Series.first
- Series.last
- Expr.first, Expr.last
- Expr.(first|last)