[Enh]: Add `Series|Expr.replace` #1223

FBruzzesi · 2024-10-18T21:43:36Z

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

This would enable plotly to do custom sorting without filtering + concatenating:

- nw.concat(
-     [df.filter(nw.col(names) == value) for value in order],
-     how="vertical"
- )
+ (df.with_columns(
+     __custom_sort_col = nw.col(names).replace({v: i for i, v in enumerate(order)})
+     )
+ .sort("__custom_sort_col")
+ .drop("__custom_sort_col")
+ )

Please describe the purpose of the new feature or describe the problem to solve.

Replicate polars Expr|Series.replace
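
For reference, here is a minimal plain-Python model of the semantics being requested. It assumes Polars' documented behaviour that `replace` leaves values without a mapping unchanged, while `replace_strict` rejects them unless a default is supplied. The helper names `replace_values` and `replace_values_strict` are hypothetical illustrations, not part of Narwhals or Polars.

```python
def replace_values(values, mapping):
    """Model of `replace`: values without a mapping pass through unchanged."""
    return [mapping.get(v, v) for v in values]


def replace_values_strict(values, mapping):
    """Model of `replace_strict`: every value must have a mapping."""
    missing = [v for v in values if v not in mapping]
    if missing:
        raise ValueError(f"no replacement for values: {missing}")
    return [mapping[v] for v in values]


order = ["foo", "quox", "bar"]
# Map each value to its position in the desired order.
rank = {value: i for i, value in enumerate(order)}

# "baz" is not in `order`, so the non-strict version keeps it as-is.
ranked = replace_values(["foo", "bar", "quox", "baz"], rank)
```

In this model the non-strict result mixes integers with the leftover string, which is one reason a strict variant may be the safer building block for a sort key.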

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

MarcoGorelli · 2024-10-19T07:52:32Z

thanks @FBruzzesi !

I think the Polars-native solution, which avoids setting any temporary columns, would be:

df.sort(pl.col(names).replace({x: i for i, x in enumerate(order)}))

That would require both DataFrame.sort accepting expressions and Expr.replace. I'll take a look.

FBruzzesi · 2024-10-19T10:21:25Z

Thanks for the feedback! Yes, I would imagine that Polars can get around that by using expressions as the sort key. However, we currently don't support expressions in these contexts, and I have the impression that it may not be trivial to allow that in our current framework.

MarcoGorelli · 2024-10-21T13:06:35Z

yup, definitely not trivial...but I think you're right, replace (or rather, replace_all?) is what we need here - and definitely an improvement over concat
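
To make the comparison concrete, here is a rough plain-Python sketch of the two strategies on rows stored as tuples. It only models the logic; the function names `sort_by_concat` and `sort_by_rank` are hypothetical, and the sample values are the ones used in the demo later in this thread.

```python
rows = list(zip(
    ["foo", "bar", "foo", "foo", "bar", "quox", "foo"],
    [1, 3, 2, 6, 3, 3, 4],
))
order = ["foo", "quox", "bar"]


def sort_by_concat(rows, order):
    # One filtering pass over all rows per value in `order`,
    # with the filtered pieces concatenated in that order.
    return [row for value in order for row in rows if row[0] == value]


def sort_by_rank(rows, order):
    # Replace each key by its position in `order`, then sort on that rank.
    rank = {value: i for i, value in enumerate(order)}
    # sorted() is stable, so rows sharing a key keep their original relative order.
    return sorted(rows, key=lambda row: rank[row[0]])


by_concat = sort_by_concat(rows, order)
by_rank = sort_by_rank(rows, order)
```

One behavioural difference in this model: the filter-and-concatenate version silently drops rows whose key is not listed in `order`, whereas the rank lookup raises `KeyError` for them.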

MarcoGorelli · 2024-10-21T15:50:40Z

what would you think about doing this using a join? Polars does a join under the hood to do this anyway

https://github.com/pola-rs/polars/blob/dbbd93fae922b94cb1b5e700472a75b5a975fc44/crates/polars-ops/src/series/ops/replace.rs#L203-L233

Example:

@nw.narwhalify(eager_only=True)
def sort_by_custom_order(df, key, order):
    order_key = generate_unique_token(8, df.columns)
    order_df = nw.from_dict(
        {key: order, order_key: range(len(order))},
        native_namespace=nw.get_native_namespace(df),
    )
    return df.join(order_df, on=key, how="left").sort(order_key).drop(order_key)

which, in the Plotly context, you could call as

args["data_frame"] = sort_by_custom_order(df, names, order)

Demo:

import polars as pl
import pandas as pd
import narwhals.stable.v1 as nw
from narwhals.utils import generate_unique_token
import pyarrow as pa

data = {'a': ['foo', 'bar', 'foo', 'foo', 'bar', 'quox', 'foo'], 'b': [1, 3,2,6,3,3,4]}
order = ['foo', 'quox', 'bar']

@nw.narwhalify(eager_only=True)
def sort_by_custom_order(df, key, order):
    order_key = generate_unique_token(8, df.columns)
    order_df = nw.from_dict({key: order, order_key: range(len(order))}, native_namespace=nw.get_native_namespace(df))
    return df.join(order_df, on=key, how='left').sort(order_key).drop(order_key)

print(sort_by_custom_order(pd.DataFrame(data), 'a', order))
print(sort_by_custom_order(pl.DataFrame(data), 'a', order))
print(sort_by_custom_order(pa.table(data), 'a', order))

outputs

      a  b
0   foo  1
2   foo  2
3   foo  6
6   foo  4
5  quox  3
1   bar  3
4   bar  3
shape: (7, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ str  ┆ i64 │
╞══════╪═════╡
│ foo  ┆ 1   │
│ foo  ┆ 2   │
│ foo  ┆ 6   │
│ foo  ┆ 4   │
│ quox ┆ 3   │
│ bar  ┆ 3   │
│ bar  ┆ 3   │
└──────┴─────┘
pyarrow.Table
a: string
b: int64
----
a: [["foo","foo","foo","foo","quox","bar","bar"]]
b: [[1,2,6,4,3,3,3]]

MarcoGorelli · 2024-10-21T16:03:43Z

🤔 Never mind, the join strategy seems to be slower than the concat strategy from the plotly PR 😳

MarcoGorelli · 2024-10-21T18:54:00Z

I made a branch in which I roughly implemented replace and replace_strict, and it looks like for both pandas and Polars, your concat solution is actually the fastest 🙌

import polars as pl
import pandas as pd
import narwhals.stable.v1 as nw
from narwhals.utils import generate_unique_token
import pyarrow as pa
import numpy as np
rng = np.random.default_rng(1)

pd.set_option('future.no_silent_downcasting', True)

data = {'a': ['foo', 'bar', 'foo', 'foo', 'bar', 'quox', 'foo'], 'b': [1, 3,2,6,3,3,4]}
order = ['foo', 'quox', 'bar']

@nw.narwhalify(eager_only=True)
def func(df, key, order):
    order_key = generate_unique_token(8, df.columns)
    order_df = nw.from_dict({key: order, order_key: range(len(order))}, native_namespace=nw.get_native_namespace(df))
    return df.join(order_df, on=key, how='left').sort(order_key).drop(order_key)

@nw.narwhalify
def func2(df, key, order):
    return nw.concat(
            [df.filter(nw.col(key) == value) for value in order], how="vertical"
        )

@nw.narwhalify
def func3(df, key, order):
    token = generate_unique_token(8, df.columns)
    return df.with_columns(nw.col(key).replace_strict({x: i for i, x in enumerate(order)}, return_dtype=nw.UInt8).alias(token)).sort(token).drop(token)

print(func(pd.DataFrame(data), 'a', order))
print(func(pl.DataFrame(data), 'a', order))
print(func2(pd.DataFrame(data), 'a', order))
print(func2(pl.DataFrame(data), 'a', order))
print(func3(pd.DataFrame(data), 'a', order))
print(func3(pl.DataFrame(data), 'a', order))

bigdata = {'a': rng.integers(0, 3, size=100_000), 'b': rng.integers(0, 3, size=100_000), 'c': rng.integers(0, 3, size=100_000)}
order = [1, 0, 2]

In [26]: %timeit _ =  func(pd.DataFrame(bigdata), 'a', order)
8.33 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %timeit _ =  func2(pd.DataFrame(bigdata), 'a', order)
3.13 ms ± 334 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [28]: %timeit _ =  func3(pd.DataFrame(bigdata), 'a', order)
7.19 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %timeit _ =  func(pl.DataFrame(bigdata), 'a', order)
12.8 ms ± 200 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: %timeit _ =  func2(pl.DataFrame(bigdata), 'a', order)
1.79 ms ± 74.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [31]: %timeit _ =  func3(pl.DataFrame(bigdata), 'a', order)
12.7 ms ± 232 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, Polars doesn't rechunk when it concatenates, so later operations on the result may end up slower. We can run more comprehensive timing tests of a full plotting function with all three approaches to see which one is best in the full context.

Interestingly enough, all of these approaches are faster than the original index-based solution in plotly:

In [18]: %timeit _ = pd.DataFrame(bigdata).set_index('a').loc[order].reset_index()
19.8 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is pleasantly surprising to me, I was expecting that we would be degrading performance here - nice!

FBruzzesi · 2024-10-21T20:59:18Z

Thanks Marco, that's definitely unexpected.
Should we also consider using group_by/partition_by instead of consecutive filtering? (I won't really be able to take a look before Wed/Thu)
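
As a rough illustration of that idea, the sketch below partitions the rows in a single pass and then stitches the partitions together in the requested order, instead of filtering the full data once per value. It is plain Python that only models the logic; `sort_by_partition` is a hypothetical name, and whether this beats consecutive filtering in pandas or Polars would need to be measured.

```python
from collections import defaultdict


def sort_by_partition(rows, order, key_index=0):
    # Single pass: bucket every row by its key value.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index]].append(row)
    # Emit the buckets in the requested order; values absent from the
    # data contribute nothing, and keys not in `order` are dropped.
    return [row for value in order for row in buckets.get(value, [])]


rows = list(zip(
    ["foo", "bar", "foo", "foo", "bar", "quox", "foo"],
    [1, 3, 2, 6, 3, 3, 4],
))
partitioned = sort_by_partition(rows, ["foo", "quox", "bar"])
```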

FBruzzesi · 2024-11-09T15:37:00Z

replace_strict has been merged. Are we planning to add replace as well, or should we close this issue?

FBruzzesi added the enhancement New feature or request label Oct 18, 2024

FBruzzesi mentioned this issue Oct 18, 2024

feat: make plotly-express dataframe agnostic via narwhals plotly/plotly.py#4790

Merged

5 tasks

MarcoGorelli added the high priority label Oct 19, 2024

MarcoGorelli added Medium priority and removed high priority labels Nov 1, 2024
