feat: "carefully" allow for dask Expr that modify index #743
base: main
Conversation
result = df.select(nw.col("a").drop_nulls(), nw.col("d").drop_nulls())
expected = {"a": [1.0, 2.0], "d": [6, 6]}
Sadly this broadcast is not working as drop_nulls does not return a scalar. I would consider this an edge case and focus on the broader support
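The failure mode can be illustrated with plain pandas (an analogy, not the narwhals code path): a scalar result broadcasts to any length, but `drop_nulls` returns a shorter Series whose index no longer lines up, so combining the results aligns on labels and introduces NaNs instead of the expected output.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 2.0], "d": [6, 6, None]})

scalar = df["a"].max()        # a scalar: broadcastable to any length
dropped_a = df["a"].dropna()  # length 2, index [0, 2]
dropped_d = df["d"].dropna()  # length 2, index [0, 1]

# Same lengths, but different indices: pandas aligns on labels and
# produces NaNs rather than {"a": [1.0, 2.0], "d": [6, 6]}.
misaligned = pd.DataFrame({"a": dropped_a, "d": dropped_d})
```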
thanks for trying this - i'll test it out and see if there's a perf impact
narwhals/_dask/dataframe.py
Outdated
col_order = list(new_series.keys())

left_most_series = next(  # pragma: no cover
this is guaranteed not to raise StopIteration: if everything were a scalar, the previous block would have been entered and returned
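The invariant described above can be sketched in plain Python (the names `is_scalar` and `pick_left_most` are illustrative, not the actual narwhals code): if every value were a scalar, an earlier branch would have returned, so by the time `next()` runs at least one value has a length and StopIteration cannot be raised.

```python
def is_scalar(value):
    # Assumption for illustration: "scalar" means anything without a length.
    return not hasattr(value, "__len__")

def pick_left_most(new_series):
    if all(is_scalar(s) for s in new_series.values()):
        # In the real code this branch returns before next() is ever reached.
        return None
    # Safe: at least one value is non-scalar, so next() always finds one.
    return next(s for s in new_series.values() if not is_scalar(s))

pick_left_most({"a": 1, "b": [1, 2, 3]})  # returns [1, 2, 3]
```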
we've got the notebooks in
Hey @MarcoGorelli, I am giving this feature another thought (I would still love to see it); here is a simple idea to get partial support without a loss of performance:
What do you think?
@MarcoGorelli I am tagging this as ready for review as I re-worked it a bit more. The TL;DR is:
Yet before developing further, I would like some feedback on how viable this approach is and whether we want to move forward with it
narwhals/_dask/expr.py
Outdated
def head(self: Self, n: int) -> Self:
    return self._from_call(
        lambda _input, _n: _input.head(_n, compute=False),
        "head",
        n,
        returns_scalar=False,
        modifies_index=True,
    )

def tail(self: Self, n: int) -> Self:
    return self._from_call(
        lambda _input, _n: _input.tail(_n, compute=False),
So... `head` has an `npartitions` param which can be set to -1 to scan all partitions, while `tail` does not. This means that if we have multiple partitions, this implementation of `tail` may not return what we expect.
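The asymmetry can be shown with a toy model, using a list of pandas frames to stand in for dask partitions (the `head`/`tail` functions below mirror the behaviour being discussed but are assumptions for illustration, not dask code):

```python
import pandas as pd

partitions = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

def head(parts, n, npartitions=1):
    # Like dask's head: scans one partition by default,
    # all of them when npartitions=-1.
    scan = parts if npartitions == -1 else parts[:npartitions]
    return pd.concat(scan).head(n)

def tail(parts, n):
    # tail has no such knob: it only ever looks at the last partition.
    return parts[-1].tail(n)

head(partitions, 3)                  # only 2 rows: first partition is exhausted
head(partitions, 3, npartitions=-1)  # all 3 requested rows
tail(partitions, 3)                  # capped at the last partition's 2 rows
```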
I haven't seen cases where users actually want more than one partition if they call `head` or `tail`, to be honest. Yes, this is technically an issue, but not something I've encountered in the wild.
thanks @FBruzzesi! To be honest I don't know about using such private methods, it makes me feel slightly uneasy - @phofl do you have time/interest in taking a look? Specifically the ... I think that for SQL engines (like duckdb, which hopefully we can get to eventually) operations like ...
narwhals/_dask/expr.py
Outdated
result = _input.to_frame(name=name).sort_values(
    by=name, ascending=ascending, na_position=na_position
)[name]
return de._collection.Series(de._expr.AssignIndex(result, _input.index))
Yeah I can second @MarcoGorelli here, please don't do this. We are sorting a Series (i.e. a single column from the df), correct?
I would:
tmp = _input.reset_index().sort_values(...)
result = tmp[_input.name]
result.index = tmp["use the index name of _input"]
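The suggestion above, spelled out with plain pandas (dask-expr's lazy API differs, but the index bookkeeping is the same idea; with an unnamed input index, `reset_index` stores the old labels in a column called `"index"`):

```python
import pandas as pd

_input = pd.Series([3, 1, 2], name="a")

tmp = _input.reset_index().sort_values("a")  # old labels kept as a column
result = tmp["a"]
result.index = tmp["index"].to_numpy()       # reattach the original labels

result.tolist()        # [1, 2, 3]
result.index.tolist()  # [1, 2, 0]: the labels travelled with the values
```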
Or do you want to keep the Index of the input?
If yes, this is fundamentally a bad idea in Dask, it will shoot you in the foot all over the place. You have zero guarantees that the partitions keep their lengths when sorting (it is a lot more likely that they do not), so this is bound to fail in all kinds of places
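A toy model of that failure mode, again with lists of pandas frames standing in for dask partitions (an assumption for illustration): a distributed sort moves rows across partition boundaries, so the per-partition lengths before and after the sort generally disagree, and any index carried over from the input no longer fits.

```python
import pandas as pd

parts = [pd.DataFrame({"a": [9, 1]}), pd.DataFrame({"a": [5]})]

before = [len(p) for p in parts]  # partition lengths before sorting: [2, 1]

# A sort repartitions along value boundaries; here we re-split the sorted
# data at a <= 4 vs a > 4, a stand-in for dask's divisions.
sorted_all = pd.concat(parts).sort_values("a")
after = [
    len(sorted_all[sorted_all["a"] <= 4]),
    len(sorted_all[sorted_all["a"] > 4]),
]  # [1, 2]: the lengths no longer match the input partitions
```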
Hey @phofl thanks for taking the time.
We are sorting a Series (i.e. a single column from the df), correct?
Yes indeed, but with the final goal of potentially adding it as a new column to the original dataframe, and that's where the index misalignment comes into play.
I will try the approach you are suggesting, which is not too far off from what is already implemented, and see if everything else falls into place.
Edit: it just ends up raising:
AssertionError: value needs to be aligned with the index
Traceback
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[60], line 3
1 tmp = df_dd["a"].to_frame(name="a").sort_values("a")
2 result = tmp["a"]
----> 3 result.index = df_dd["a"].index
655 @index.setter
656 def index(self, value):
--> 657 assert expr.are_co_aligned(
658 self.expr, value.expr
659 ), "value needs to be aligned with the index"
660 _expr = expr.AssignIndex(self, value)
661 self._expr = _expr
AssertionError: value needs to be aligned with the index
Hi @phofl, apologies for pulling you into the mix once more. I have a few questions in order to make this work and guarantee that we don't end up with a
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below.
Pretty dangerous stuff to work around the dask index.
To assess that the implementation is working as expected, I implemented both `sort` (different index but same length) and `drop_nulls` (different index due to different length)
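The two cases can be seen in plain pandas (an analogy for the dask behaviour being tested): a sort keeps the length but permutes the index, while dropping nulls shrinks it.

```python
import pandas as pd

s = pd.Series([2.0, None, 1.0])

s.sort_values().index.tolist()  # [2, 0, 1]: same length, different order
s.dropna().index.tolist()       # [0, 2]: different length
```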