chore: Simplify PandasLikeDataFrame|DaskLazyFrame.join method#2511
chore: Simplify PandasLikeDataFrame|DaskLazyFrame.join method#2511
PandasLikeDataFrame|DaskLazyFrame.join method#2511Conversation
|
@FBruzzesi would you be open to flipping the method names around? Most of the private ones that are related share a common prefix. I do understand why e.g ( DataFrame.join(how="anti") -> CompliantDataFrame._join_anti
DataFrame.join(how="left") -> CompliantDataFrame._join_leftLooking at it that way, we're just embedding the
|
|
@dangotbanned moved naming to |
Thanks @FBruzzesi will try to do a proper review today 🙏 |
This should provide the narrowing needed for `@overload`(s)
| if how == "cross": | ||
| if left_on is not None or right_on is not None or on is not None: | ||
| msg = "Can not pass `left_on`, `right_on` or `on` keys for cross join" | ||
| raise ValueError(msg) | ||
| result = compliant.join( | ||
| other, how=how, left_on=None, right_on=None, suffix=suffix | ||
| ) | ||
| ) | ||
| elif on is None: | ||
| if left_on is None or right_on is None: | ||
| msg = f"Either (`left_on` and `right_on`) or `on` keys should be specified for {how}." | ||
| raise ValueError(msg) | ||
| if len(left_on) != len(right_on): | ||
| msg = "`left_on` and `right_on` must have the same length." | ||
| raise ValueError(msg) | ||
| result = compliant.join( | ||
| other, how=how, left_on=left_on, right_on=right_on, suffix=suffix | ||
| ) | ||
| else: | ||
| if left_on is not None or right_on is not None: | ||
| msg = f"If `on` is specified, `left_on` and `right_on` should be None for {how}." | ||
| raise ValueError(msg) | ||
| result = compliant.join( | ||
| other, how=how, left_on=on, right_on=on, suffix=suffix | ||
| ) | ||
| return self._with_compliant(result) |
There was a problem hiding this comment.
We've now got 3 distinct paths, which come from:
how: Literal['cross'], on: None, left_on: None, right_on: Nonehow: Literal['inner', 'left', 'full', 'semi', 'anti'], on: None, left_on: list[str], right_on: list[str]how: Literal['inner', 'left', 'full', 'semi', 'anti'], on: list[str], left_on: None, right_on: None
Maybe we can utilize this to avoid the double-dip validation like in (#2000 (comment))?
There was a problem hiding this comment.
Edit
I need to remember my own advice 🤦♂️
me yesterday on discord 😄
Oh I see!
So a few different ways to solve:
- Mutually exclusive
@overloads
i. So you'd specifyNoneis only allowed for both*_onwhenhow="cross"
ii. However these can be tricky to get right, especially when there's a lot of parameters- Raise an error instead of asserting when an invariant is broken (529c196)
i. This is the easiest solution, but often feels like you're doing something unnecessary- Split incompatible signatures into distinct methods/functions
i. This is my preferred solution and what I'm usually hinting towards when I suggest adding multiple@classmethods
There was a problem hiding this comment.
A follow-up could look at option 3, which would be making some CompliantJoin class, then having:
CompliantJoin.leftCompliantJoin.inner- ...
Maybe the calls from BaseFrame would look like:
compliant.join(**kwds).left()
compliant.join(**kwds).inner()Or just have the one join, which makes the other calls internally, to keep it simple from the outside:
compliant.join(**kwds)
self.left()Would probably be easier to share code between pandas and dask that way as well 🙂
There was a problem hiding this comment.
A follow-up could look at option 3, which would be making some
CompliantJoinclass, then having:
CompliantJoin.leftCompliantJoin.inner- ...
Maybe the calls from
BaseFramewould look like:compliant.join(**kwds).left() compliant.join(**kwds).inner()Or just have the one
join, which makes the other calls internally, to keep it simple from the outside:compliant.join(**kwds) self.left()Would probably be easier to share code between
pandasanddaskthat way as well 🙂
This seems like an interesting idea but I am not sure I get it right and/or I would end up with the expected changes.
Especially:
Where would CompliantJoin exist and how would it interact with other compliant classes?
and
Maybe the calls from
BaseFramewould look like:compliant.join(**kwds).left() compliant.join(**kwds).inner()
Would this require some special path for polars?
There was a problem hiding this comment.
@FBruzzesi you could think of it like these:
CompliantGroupByCompliantWhen- feat: Improve
DataFrame.__getitem__consistency #2393
Or similar to how the Expr, Series namespace classes work.
Essentially, we define some common parts and then refine from there per-backend
Would this require some special path for polars?
Yeah most likely, polars gets to skip our abstractions in a lot of places and this would be one of those cases I'd assume.
Anyways, this is very much just the kernel of an idea 😅
The goal would be sharing more code between backends 😎
There was a problem hiding this comment.
Should we... keep that as a follow up then? 😅
The goal would be sharing more code between backends 😎
Yep this is much loud and clear! Pandas-like and Dask can definitly share some functionalities eventually :)
There was a problem hiding this comment.
Should we... keep that as a follow up then? 😅
Yeah for sure!
I started with that in (#2511 (comment)) but clearly got carried away with my rambling 😄
dangotbanned
left a comment
There was a problem hiding this comment.
Thanks @FBruzzesi - looking a lot better! 🥳
I've got a few suggestions on the pandas one, but I think the same would apply for dask as well
Thanks @dangotbanned - great work you've done! |
|
I was thinking about simplifying I thought I'd take a look at the They're simpler, and accept more parameters and types 🤔
Maybe they simplified it at some point? |
| def _join_semi( | ||
| self, other: Self, *, left_on: Sequence[str], right_on: Sequence[str] | ||
| ) -> Self: | ||
| other_native = self._join_filter_rename( | ||
| other=other, | ||
| columns_to_select=list(right_on), | ||
| columns_mapping=dict(zip(right_on, left_on)), | ||
| ) |
There was a problem hiding this comment.
This is probably my bad for the huge suggestion in (#2511 (comment)) 😅
So these lines I wrote as:
other_native = self._join_filter_rename(other, left_on, right_on)And then had the conversion handled inside that method:
...
return rename(
select_columns_by_name(
other.native, list(right_on), backend_version, implementation
),
columns=dict(zip(right_on, left_on)),
...
)With that small tweak, we avoid needing to keep this part synced with the same lines here 😎
narwhals/narwhals/_pandas_like/dataframe.py
Lines 668 to 672 in 622a578
There was a problem hiding this comment.
@dangotbanned please forgive me - I didn't get your point here 🥲
| def _join_filter_rename( | ||
| self, other: Self, columns_to_select: list[str], columns_mapping: dict[str, str] | ||
| ) -> pd.DataFrame: | ||
| """Helper function to avoid creating extra columns and row duplication. |
There was a problem hiding this comment.
@FBruzzesi in (#2511 (comment)) I'm saying that this version is more duplicated than what I had in the original suggestion, which was:
def _join_filter_rename(
self, other: Self, left_on: Sequence[str], right_on: Sequence[str]
) -> pd.DataFrame:
"""Rename to avoid creating extra columns in `"anti"`, `"semi"` join, and avoids potential rows duplication."""
implementation = self._implementation
backend_version = self._backend_version
return rename(
select_columns_by_name(
other.native, list(right_on), backend_version, implementation
),
columns=dict(zip(right_on, left_on)),
implementation=implementation,
backend_version=backend_version,
).drop_duplicates()The list(right_on) and dict(zip(right_on, left_on)) parts are defined in one place - rather than repeated each time you call _join_filter_rename
There was a problem hiding this comment.
Oh I see, it has been a while since this PR, I didn't spot it in the other comment.
I see what you mean, it a similar conundrum to #2495, the signature with columns_to_select and columns_mapping in my opinion is much more intuitive to understand what happens within the method without jumping into it, on the other hand, yes, we end up creating and passing these objects externally of the method itself.
I really don't know 🤔
|
@FBruzzesi you may have noticed I remembered about this PR in (#2511 (reference)) and then quickly lost track 😅 Feel free to ping me if I forget to review tomorrow! |
Was added after this PR started
|
Mostly finished on the review, just need to add a comment Long awaited review will be with you shortly 😉 |
dangotbanned
left a comment
There was a problem hiding this comment.
Thanks for the PR @FBruzzesi (and bearing with me 😄)
I've got a couple of new notes, which are what the Request changes are related to.
Regarding (#2511 (comment)), I still have a preference for my earlier suggestion.
But, both are improvements over what we have on main - so feel free to keep that part as-is 🙂
|
Thanks for all the support @dangotbanned
but as of:
for now yes I would rather keep it as is. My mind has not changed from the previous comment on it 😂 |



What type of PR is this? (check all applicable)
Related issues
*DataFrame.join#2488Checklist
If you have comments or can explain your changes, please do so below
Reduces the scope by creating
_<join_strategy>_joinmethod, for each join strategy, to be called injoin