chore: always use check_columns_exist where possible#2495
dangotbanned merged 27 commits into narwhals-dev:main
|
Removed the dependency on #2493, this is now ready for review :) |
- Avoid collecting into `list` or `set` until strictly needed
- Accept `_StoresColumns` objects in more places
- Use a consistent positional pattern of `frame, subset`
- Fix `FBT001`
- Prepare error message entirely in `ColumnNotFoundError`
Following merge (narwhals-dev@124f5a3)
|
@EdAbati I've been experimenting with #2534 a bit locally and realised some of the changes I was making to
I was finding it difficult to make those as suggestions, so I hope collecting them into 1 commit makes it easier to revert if you prefer what was there before 🙂 I made some notes on the commit description as well |
|
Thanks for the update! I think most of the changes look great to me. I'm not 100% sure about the signature of
def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...
When used it will become:
- check_columns_exist(self, subset)
+ check_columns_exist(subset, available=self.columns)
- check_columns_exist(frame, to_drop)
+ check_columns_exist(to_drop, available=frame.columns)
check_columns_exist(
- df.columns.tolist(), # type: ignore[attr-defined]
- column_names, # type: ignore[arg-type]
+ column_names,
+ available=df.columns.tolist(),
)
But this is highly subjective, maybe @FBruzzesi or @MarcoGorelli can break the tie :) On a similar note, don't you think forcing positional args in
Anyway, all minor points. If the majority likes this version let's go for it |
|
Hey everyone! I didn't go through the detailed history of what happened here
I strongly agree with this statement: considering how easy it would be to swap the two arguments I would prefer to have them as keyword only. And passing column names directly instead of the entire dataframe would make it more generic and re-usable.
Same consideration - I'd rather have them keyword-only to be explicit in this case |
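To make the keyword-only argument discussed above concrete, here is a minimal sketch. The body and the return type (a list of missing names rather than `ColumnNotFoundError | None`) are simplifications assumed for illustration, not the narwhals implementation.

```python
from __future__ import annotations

from typing import Sequence


def check_columns_exist(
    columns: Sequence[str], /, *, available: Sequence[str]
) -> list[str]:
    """Return the names in `columns` that are not in `available` (illustrative only)."""
    return [name for name in columns if name not in available]


# The call site states which argument is the reference set of columns.
missing = check_columns_exist(["a", "z"], available=["a", "b"])

# Passing both arguments positionally raises, so they cannot be swapped silently.
try:
    check_columns_exist(["a", "z"], ["a", "b"])  # type: ignore[misc]
except TypeError:
    swapped_call_rejected = True
else:
    swapped_call_rejected = False
```

With two positional `Sequence[str]` parameters, swapping them would still type-check and only show up as a wrong result, which is the risk raised in this thread.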
|
Hey @EdAbati, @FBruzzesi You'd be surprised how many thoughts your comments sparked for me 😄 As I mentioned in (#2495 (comment)), these changes are kind of a merging of ideas. Just wanted to check in and say I hear you 🙂 Hopefully will be able to collect my thoughts properly later today. |
|
i haven't checked everything but - so long as |
|
Note I spiralled a bit (you were warned) but hopefully there's some good info in here 😄
|
|
Hey @dangotbanned thanks for the thoughtful reasoning walkthrough. I agree with most of the points you made and will try to add my two cents. The reason for suggesting a verbose approach is that what we have is a utility function, which is much more generic than the use case. The answer we get is to the following question: "is X a subset of Y?", so to me it was important to make clear which is X and which is Y. Regarding
Now to the most important piece: "What do now?" - yes I really like that approach, and by moving from utility function to a method, it's much clearer who is what, so thanks for the clarity and the suggestion, I think it's really great |
|
Thanks @FBruzzesi 😍 (#2495 (comment))
Reading this back now I realised I waffled quite a lot and missed the point I wanted to land on 😭 What I proposed in (#2495 (comment)) was what I had in mind while writing (4d00461).
Important: I didn't mean we need to do any of that in this PR. That was just the direction I wanted to move in afterwards.
I'll leave this to anyone else to merge - just signing off that I'm happy with the PR as-is - or we can revert some parts. Thanks for working on this @EdAbati and @FBruzzesi for reading my blog post 😉 |
In this regard - I think it's worth addressing it here at this point since the discussion started and the idea is quite clear.
Edit: Never mind, it might not be so straightforward after all. There are many cases for which:
|
|
Hi all, sorry for the late reply. I love the idea of having this as a method in
I will have some time tonight to update this PR to see how straightforward it is. If you prefer to merge this for today's release I'll do it in a follow-up :) |
EdAbati
left a comment
Ok this was suuuuper delayed. :( I implemented some of the suggestions above.
Let me know what you think and if in general you think this PR improves things compared to before :D
narwhals/_compliant/dataframe.py
Outdated
def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
    return check_columns_exist(subset, available=list(self.columns))
added this method at compliant level as @dangotbanned suggested.
I like it way more now
Nice one @EdAbati!
One of the bits that seems to have been missed is the conversion here on available:
available=list(self.columns)
utils.check_columns_exist only requires list[str] due to the annotation changing since (4d00461) to pass the type checker.
Signature as of (750b133)
def check_columns_exist(
    subset: Sequence[str], /, *, available: list[str]  # <-- 😢
) -> ColumnNotFoundError | None:
    if missing := set(subset).difference(available):  # <-- Iterable[str]
        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing, available  # <-- Sequence[str]
        )
    return None
One way to think about typing is in terms of set logic.
Here available has two constraints:
- The set of types accepted by set.difference: Iterable[str]
- The set of types accepted by ColumnNotFoundError.from_missing_and_available_column_names(available=...): Sequence[str]
Since Sequence[str] is narrower than Iterable[str] (think subset but more fancy 😉) we can use it to satisfy both constraints
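A small sketch of the two constraints, assuming hypothetical helpers `difference` and `describe` that stand in for `set.difference` and the error constructor. Annotating `available` as `Sequence[str]` satisfies both callees, so a list or a tuple can be passed without conversion.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def difference(subset: Iterable[str], available: Iterable[str]) -> set[str]:
    # Mirrors the constraint from `set.difference`: any iterable is accepted.
    return set(subset).difference(available)


def describe(missing: Iterable[str], available: Sequence[str]) -> str:
    # Hypothetical stand-in for the error constructor, which wants a Sequence.
    return f"missing={sorted(missing)}, available={list(available)}"


def check(subset: Sequence[str], /, *, available: Sequence[str]) -> str | None:
    # `Sequence[str]` is assignable to both `Iterable[str]` and `Sequence[str]`.
    if missing := difference(subset, available):
        return describe(missing, available)
    return None


from_list = check(["a", "z"], available=["a", "b"])
from_tuple = check(["a"], available=("a", "b"))
```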
great point and awesome explanation. updated :)
def check_columns_exist(
    subset: Sequence[str], /, *, available: list[str]
) -> ColumnNotFoundError | None:
I also kept this function for the exceptions like narwhals/_pandas_like/utils.py
(Plus a function is easier to test than the method.)
Changed slightly the signature, using one short kwargs. 😇 a middle ground between the previous 2 solutions
Changed slightly the signature, using one short kwargs. 😇 a middle ground between the previous 2 solutions
No qualms from me 😄
@EdAbati should we punt the rest of (#2495 (comment)) to another issue?
I really wanna stress that I'm happy with getting these changes merged first! 😅
Happy to merge now if everyone is happy :)
To answer @MarcoGorelli's comment:
so long as .columns isn't called in cases where the code could pass without it
This PR doesn't introduce additional calls to .columns, but here and in duckdb we already called .columns to check the input in unique and in drop.
If they are not needed they should be removed in a separate PR. Hopefully they are easier to spot now? (i.e. if we remove _check_columns_exist for Compliant Lazy the code will scream in multiple ways and various lines)
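For readers skimming the thread, the method-plus-function split could be sketched roughly as below. `ToyFrame` and the simplified `ColumnNotFoundError` are hypothetical stand-ins, not narwhals classes; only the shape of the delegation follows the discussion.

```python
from __future__ import annotations

from typing import Sequence


class ColumnNotFoundError(Exception):
    """Hypothetical stand-in for the narwhals exception."""


def check_columns_exist(
    subset: Sequence[str], /, *, available: Sequence[str]
) -> ColumnNotFoundError | None:
    # Return (not raise) the error so the caller decides how to surface it.
    if missing := set(subset).difference(available):
        return ColumnNotFoundError(
            f"missing: {sorted(missing)}; available: {list(available)}"
        )
    return None


class ToyFrame:
    """Hypothetical frame; only `columns` matters for this sketch."""

    def __init__(self, columns: Sequence[str]) -> None:
        self.columns = columns

    def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
        # The frame supplies `available` itself, so callers only pass the subset.
        return check_columns_exist(subset, available=self.columns)


frame = ToyFrame(["a", "b"])
ok = frame._check_columns_exist(["a"])
error = frame._check_columns_exist(["a", "z"])
```

Because every access to `columns` goes through one method in this sketch, removing that method from a lazy backend would surface each call site, which matches the point about such calls being easier to spot.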
|
Thanks @EdAbati, I've opened (#2613) to track (#2495 (comment)) |
The type here is dependent on `check_columns_exist` See #2495 (comment)
* chore(typing): Widen `_utils` signatures
  In light of (#2706 (comment))
  Splitting out a change from https://github.com/narwhals-dev/narwhals/blob/7bb0d0df3d2f75bc903aa9fc2abf0ecc5a61f432/narwhals/_utils.py#L1194
  With that, I noticed a few others that also didn't need all the features of `Sequence`
* fix(typing): Update propagated signatures
  The type here is dependent on `check_columns_exist`
  See #2495 (comment)
* chore(typing): Remove now-unused type ignore
  `list[str] | _1DArray` is assignable to `Collection[str]`
  Previously, failed on `Sequence[str]` due to `numpy`
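A rough runtime analogy for why widening to `Collection[str]` helps, using a hypothetical `OneDArray` class in place of a real array type. The commit message is about static typing; this sketch only shows that an object can provide everything `Collection` asks for (`__len__`, `__iter__`, `__contains__`) without being a `Sequence`.

```python
from collections.abc import Collection, Sequence


class OneDArray:
    """Hypothetical stand-in for a 1D array: sized, iterable, supports `in`."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)

    def __contains__(self, item):
        return item in self._values


columns = OneDArray(["a", "b"])

# `Collection` is recognised structurally from the three methods above.
is_collection = isinstance(columns, Collection)
# `Sequence` requires inheritance or explicit registration, which this class lacks.
is_sequence = isinstance(columns, Sequence)
```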

What type of PR is this? (check all applicable)
Related issues
ColumnNotFound for duckdb and pyspark #2493
Checklist
If you have comments or can explain your changes, please do so below