chore: always use `check_columns_exist` where possible by EdAbati · Pull Request #2495 · narwhals-dev/narwhals

EdAbati · 2025-05-05T07:02:41Z

What type of PR is this? (check all applicable)

Related issues

Followup of fix: unify ColumnNotFound for duckdb and pyspark #2493

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

narwhals/_pandas_like/expr.py

narwhals/utils.py

EdAbati · 2025-05-12T19:42:53Z

Removed the dependency on #2493 , this is now ready for review :)

- Avoid collecting into `list` or `set` until strictly needed
- Accept `_StoresColumns` objects in more places
- Use a consistent positional pattern of `frame, subset`
- Fix `FBT001`
- Prepare error message entirely in `ColumnNotFoundError`

Following merge (narwhals-dev@124f5a3)

dangotbanned · 2025-05-13T14:27:17Z

@EdAbati I've been experimenting with #2534 a bit locally and realised some of the changes I was making to drop, parse_columns_to_drop might make sense to use here.

(4d00461)

I was finding it difficult to make those as suggestions, so I hope collecting them into 1 commit makes it easier to revert if you prefer what was there before 🙂

I made some notes on the commit description as well

EdAbati · 2025-05-15T06:44:10Z

Thanks for the update!

I think most of the changes look great to me.

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:

def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

When used it will become:

- check_columns_exist(self, subset)
+ check_columns_exist(subset, available=self.columns)

- check_columns_exist(frame, to_drop)
+ check_columns_exist(to_drop, available=frame.columns)

check_columns_exist(
-     df.columns.tolist(),  # type: ignore[attr-defined]
-     column_names,  # type: ignore[arg-type]
+     column_names,
+     available=df.columns.tolist(),
)
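A minimal, runnable sketch of the keyword-only signature proposed above, for illustration only. The `ColumnNotFoundError` class here is a stand-in and the message format is an assumption, not the actual narwhals implementation.

```python
from __future__ import annotations

from collections.abc import Sequence


class ColumnNotFoundError(Exception):
    """Stand-in for the narwhals exception of the same name (assumption)."""


def check_columns_exist(
    columns: Sequence[str], /, *, available: Sequence[str]
) -> ColumnNotFoundError | None:
    # Return (rather than raise) the error, so the caller decides how to raise it.
    if missing := sorted(set(columns).difference(available)):
        return ColumnNotFoundError(
            f"missing columns: {missing}; available columns: {list(available)}"
        )
    return None


ok = check_columns_exist(["a"], available=["a", "b"])
err = check_columns_exist(["a", "z"], available=["a", "b"])

try:
    # Passing both arguments positionally is rejected at call time,
    # so the two lists cannot be swapped silently.
    check_columns_exist(["a"], ["a", "b"])  # type: ignore[misc]
except TypeError:
    positional_call_rejected = True
else:
    positional_call_rejected = False
```

Because `available` is keyword-only, a call site always spells out which list is which.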

But this is highly subjective, maybe @FBruzzesi or @MarcoGorelli can break the tie :)

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky? One (if doesn't use check_columns_exist) can accidentally write ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

Anyway all minor points. If the majority likes this version let's go for it

FBruzzesi · 2025-05-15T07:56:54Z

Hey everyone! I didn't go through the detailed history of what happened here

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:
def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

I strongly agree with this statement: considering how easy it would be to swap the two arguments I would prefer to have them as keyword only. And passing column names directly instead of the entire dataframe would make it more generic and re-usable.

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky? One (if doesn't use check_columns_exist) can accidentally write ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

Same consideration - I rather have them keyword only to be explicit in this case

dangotbanned · 2025-05-15T10:50:48Z

Hey @EdAbati, @FBruzzesi

You'd be surprised how many thoughts your comments sparked for me 😄

As I mentioned in (#2495 (comment)), these changes are kind of a merging of ideas.
I now have some extra context with these 3 commits:

Just wanted to check in and say I hear you 🙂

Hopefully will be able to collect my thoughts properly later today.
But the main part is about encoding more into types/classes/methods vs utility functions that pass around state and have to be repeated in every implementation (which is where the concerns that lead to keyword-only args come from IMO)

MarcoGorelli · 2025-05-15T12:23:28Z

i haven't checked everything but - so long as .columns isn't called in cases where the code could pass without it - then I'm happy to trust/defer to you all for this

dangotbanned · 2025-05-15T14:46:35Z

Note

I spiralled a bit (you were warned) but hopefully there's some good info in here 😄

`check_columns_exist`

@EdAbati

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:

Signature

def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

When used it will become:

Diff

- check_columns_exist(self, subset)
+ check_columns_exist(subset, available=self.columns)

- check_columns_exist(frame, to_drop)
+ check_columns_exist(to_drop, available=frame.columns)

check_columns_exist(
-     df.columns.tolist(),  # type: ignore[attr-defined]
-     column_names,  # type: ignore[arg-type]
+     column_names,
+     available=df.columns.tolist(),
)

So if we put aside the keyword vs positional-only (and their positions) for a moment.
The more interesting question to me is:

Why does it accept list[str] at all?

In all but one place, the argument passed will take the _StoresColumns path.

_StoresColumns

narwhals/narwhals/utils.py

Lines 152 to 154 in 0264f98

    
    class _StoresColumns(Protocol):
        @property
        def columns(self) -> Sequence[str]: ...

That happy path includes each of these and all their descendants:

BaseFrame
CompliantDataFrame
CompliantLazyFrame

So - for every one of those - we don't need to repeat thing.columns every time we call check_columns_exist because we're passing in a thing that provides them for us on access.
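A small sketch of why a protocol like this removes the repetition. `StoresColumns`, `ToyFrame` and `missing_columns` are hypothetical names used only for illustration; the point is that any object exposing a `columns` property satisfies the protocol structurally, without inheriting from it.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Protocol


class StoresColumns(Protocol):
    @property
    def columns(self) -> Sequence[str]: ...


class ToyFrame:
    """Hypothetical frame; it never inherits from StoresColumns."""

    def __init__(self, *names: str) -> None:
        self._names = names

    @property
    def columns(self) -> Sequence[str]:
        return self._names


def missing_columns(frame: StoresColumns, subset: Sequence[str]) -> list[str]:
    # `frame.columns` is read once, here, rather than at every call site.
    return sorted(set(subset).difference(frame.columns))


frame = ToyFrame("a", "b")
result = missing_columns(frame, ["a", "z"])
```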

The ugly duckling

Our odd-one-out is also the only variant that is not calling with the context of a Compliant* class.

Important

This is the part I think needs changing

_pandas_like.utils.select_columns_by_name

https://github.com/EdAbati/narwhals/blob/f7e0bdfb237336fb3d8a3d036a7b436c419d1710/narwhals/_pandas_like/utils.py#L646-L678

def select_columns_by_name(
    df: T,
    column_names: list[str] | _1DArray,  # NOTE: Cannot be a tuple!
    backend_version: tuple[int, ...],
    implementation: Implementation,
) -> T:
    """Select columns by name.

    Prefer this over `df.loc[:, column_names]` as it's
    generally more performant.
    """
    if len(column_names) == df.shape[1] and all(column_names == df.columns):  # type: ignore[attr-defined]
        return df
    if (df.columns.dtype.kind == "b") or (  # type: ignore[attr-defined]
        implementation is Implementation.PANDAS and backend_version < (1, 5)
    ):
        # See https://github.com/narwhals-dev/narwhals/issues/1349#issuecomment-2470118122
        # for why we need this
        if error := check_columns_exist(
            df.columns.tolist(),  # type: ignore[attr-defined]
            column_names,  # type: ignore[arg-type]
        ):
            raise error
        return df.loc[:, column_names]  # type: ignore[attr-defined]
    try:
        return df[column_names]  # type: ignore[index]
    except KeyError as e:
        if error := check_columns_exist(
            df.columns.tolist(),  # type: ignore[attr-defined]
            column_names,  # type: ignore[arg-type]
        ):
            raise error from e
        raise

From my understanding, we have this as a utility function so that it can be reused in both PandasLikeDataFrame and DaskLazyFrame.
However, that has multiple downsides 😞:

(1)

We need to pass context out of the Compliant* class and then back again to create a new one.
So, selecting columns requires 4 arguments instead of the 1 we'd need if it were a method (and that context is already stored on the instance)

All of this boilerplate is repeated 11 times, and ...

It worsens the more the pattern is repeated and stacked

narwhals/narwhals/_pandas_like/dataframe.py

Lines 631 to 642 in 0264f98

    
    # rename to avoid creating extra columns in join
    other_native = rename(
        select_columns_by_name(
            other.native,
            list(right_on),
            self._backend_version,
            self._implementation,
        ),
        columns=dict(zip(right_on, left_on)),  # type: ignore[arg-type]
        implementation=self._implementation,
        backend_version=self._backend_version,
    ).drop_duplicates()

(2)

The DaskLazyFrame case also needs to check every time if it is Implementation.PANDAS.

That should never be true, but we're checking that each time we call

DaskLazyFrame.simple_select
DaskLazyFrame.select
DaskLazyFrame.join(how="anti"|"semi")
DaskLazyFrame.with_row_index
DaskExpr.is_first_distinct
DaskExpr.is_last_distinct

(3)

We're again repeating code for the 2x check_columns_exist calls inside select_columns_by_name.

With all the # type: ignore(s) all of this gets somewhat tricky to read.

`ColumnNotFoundError.from_missing_and_available_column_names`

@EdAbati

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky?
One (if doesn't use check_columns_exist) can accidentally write
ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of
ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

@FBruzzesi

Same consideration - I rather have them keyword only to be explicit in this case

This one I'm a bit confused about, check_columns_exist is the only place that calls ColumnNotFoundError.from_missing_and_available_column_names.
I think that was one of the main changes this PR added?

If we were to replace the positions, the code won't make it past the type checker ...

since available_columns rejects a set

Previously both accepted a list[str] as positional args and this is still true.

Note

Since this is only used in one place - I'm not too fussed either way 🤷‍♂️

My personal take is that both of these are waaaaaaaaaaay too verbose 😅

        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing_columns=sorted(missing), available_columns=list(available_columns)
        )

        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing, available_columns
        )

What do now?

I think some more changes like (5f367a5) would mean we can simplify things a lot more.
This could be a starting point I suppose 🙂

class CompliantDataFrame:
    def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
        available = self.columns
        if missing := set(subset).difference(available):
            return ColumnNotFoundError.from_missing_and_available_column_names(
                missing, available
            )
        return None

class EagerDataFrame(CompliantDataFrame):
    def simple_select(self, *column_names: str, validate_column_names: bool = True) -> Self: ...

class PandasLikeDataFrame(EagerDataFrame):
    def simple_select(
        self, *column_names: str, validate_column_names: bool = True # currently always False, could use that as a default instead?
    ) -> Self:
        subset = list(column_names)
        df: pd.DataFrame = self.native
        if len(subset) == df.shape[1] and all(subset == df.columns):  # type: ignore[attr-defined]
            result = df
        elif (df.columns.dtype.kind == "b") or (
            self._implementation.is_pandas() and self._backend_version < (1, 5)
        ):
            # See https://github.com/narwhals-dev/narwhals/issues/1349#issuecomment-2470118122
            # for why we need this
            if error := self._check_columns_exist(subset):
                raise error
            result = df.loc[:, subset]
        else:
            try:
                result = df[subset]
            except KeyError as e:
                if error := self._check_columns_exist(subset):
                    raise error from e
                raise
        return self._with_native(result, validate_column_names=validate_column_names)
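The shape of that proposal can be tried out without pandas. The sketch below uses a hypothetical dict-backed `ToyFrame` and a stand-in `ColumnNotFoundError`; it only illustrates the "try the fast path, validate on `KeyError`, chain the original error" pattern and is not the narwhals implementation.

```python
from __future__ import annotations

from collections.abc import Sequence


class ColumnNotFoundError(Exception):
    """Stand-in for the narwhals exception (assumption)."""


class ToyFrame:
    """Hypothetical dict-backed frame, used only for illustration."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self._data = data

    @property
    def columns(self) -> list[str]:
        return list(self._data)

    def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
        if missing := sorted(set(subset).difference(self.columns)):
            return ColumnNotFoundError(f"{missing} not found in {self.columns}")
        return None

    def simple_select(self, *column_names: str) -> ToyFrame:
        try:
            # Fast path: no up-front validation.
            return ToyFrame({name: self._data[name] for name in column_names})
        except KeyError as e:
            # Only now pay for the check, and chain the original error.
            if error := self._check_columns_exist(column_names):
                raise error from e
            raise


frame = ToyFrame({"a": [1], "b": [2]})
selected = frame.simple_select("b")

try:
    frame.simple_select("a", "z")
except ColumnNotFoundError as exc:
    caught = exc
```

As a method, the check needs only `subset`; the available columns come from the instance.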

FBruzzesi · 2025-05-17T13:39:45Z

Hey @dangotbanned thanks for the thoughtful reasoning walkthrough. I agree with most of the points you made and will try to add my two cents.

The reason for suggesting a verbose approach is that what we have is a utility function, which is much more generic than the use case. It answers the question "is X a subset of Y?", so to me it was important to make clear which is X and which is Y.

Regarding ColumnNotFoundError.from_missing_and_available_column_names - ok that's fair to avoid repeating ourselves since the order is already in the method name.

Now to the most important piece: "What do now?" - yes I really like that approach, and by moving from utility function to a method, it's much clearer who is what - so thanks for the clarity and the suggestion, I think it's really great

dangotbanned · 2025-05-17T15:54:17Z

Thanks @FBruzzesi 😍 (#2495 (comment))

Now to the most important piece: "What do now?" - yes I really like that approach, and by moving from utility function to a method, it's much clearer who is what - so thanks for the clarity and the suggestion, I think it's really great

Reading this back now I realised I waffled quite a lot and missed the point I wanted to land on 😭

What I proposed in (#2495 (comment)) was what I had in mind while writing (4d00461)

Important

I didn't mean we need to do any of that in this PR

That was just the direction I wanted to move in afterwards.

I'll leave this to anyone else to merge - just signing off that I'm happy with the PR as-is - or we can revert some parts.
But (in my mind) either way is a stepping stone to (What do now?) 🙂

Thanks for working on this @EdAbati and @FBruzzesi for reading my blog post 😉

FBruzzesi · 2025-05-17T17:25:44Z

I didn't mean we need to do any of that in this PR

That was just the direction I wanted to move in afterwards.

In this regard - I think it's worth addressing it here at this point, since the discussion has started and the idea is quite clear.
Currently we are in a situation in which:

The codebase is in a transitional state (according to Dan)
It's not verbose/clear enough (according to Edo and me)

Edit: Never mind, it might not be so straightforward after all. There are many cases in which:

check_columns_exist is not used at the compliant level but at narwhals one
or it is used without a dataframe instance

EdAbati · 2025-05-19T06:25:43Z

Hi all sorry for the late reply.

I love the idea of having this as a method in CompliantDataFrame

I will have some time tonight to update this PR to see how straightforward it is. If you prefer to merge this for today's release I'll do it in a follow up :)

EdAbati

Ok this was suuuuper delayed. :( I implemented some of the suggestions above.

Let me know what you think and if in general you think this PR improves things compared to before :D

EdAbati · 2025-05-22T19:07:51Z

narwhals/_compliant/dataframe.py

+    def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
+        return check_columns_exist(subset, available=list(self.columns))


added this method at compliant level as @dangotbanned suggested.

I like it way more now

dangotbanned · 2025-05-24

Nice one @EdAbati!

One of the bits that seems to have been missed is the conversion here on available:

available=list(self.columns)

utils.check_columns_exist only requires list[str] due to the annotation changing since (4d00461) to pass the type checker.

Signature as of (750b133)

Best practices

Code block

def check_columns_exist(
    subset: Sequence[str],
    /,
    *,
    available: list[str]  # <-- 😢
) -> ColumnNotFoundError | None:
    if missing := set(subset).difference(available):  # <-- Iterable[str]
        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing, available  # <-- Sequence[str]
        )
    return None

One way to think about typing is in terms of set logic.

Here available has two constraints:

The set of types accepted by set.difference

Iterable[str]

The set of types accepted by ColumnNotFoundError.from_missing_and_available_column_names(available=...)

Sequence[str]

Since Sequence[str] is narrower than Iterable[str] (think subset but more fancy 😉) we can use it to satisfy both constraints
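The subset relationship described here can be checked at runtime with the abstract base classes in `collections.abc`. This is only an illustration of the hierarchy; a static type checker reasons about the same relationships without running anything.

```python
from collections.abc import Iterable, Sequence

# Every Sequence is an Iterable, so a Sequence[str] argument can be handed
# to anything that asks for Iterable[str].
sequence_is_iterable = issubclass(Sequence, Iterable)

# list and tuple are both Sequences, so neither needs converting first.
list_ok = issubclass(list, Sequence)
tuple_ok = issubclass(tuple, Sequence)

# A set is Iterable but not a Sequence, so it would be rejected for `available`.
set_is_iterable = issubclass(set, Iterable)
set_is_sequence = issubclass(set, Sequence)
```

Annotating `available` as `Sequence[str]` therefore accepts `self.columns` directly, with no `list(...)` conversion at the call site.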

EdAbati · 2025-05-26

great point and awesome explanation. updated :)

dangotbanned · 2025-05-26

Thanks @EdAbati

EdAbati · 2025-05-22T19:13:21Z

narwhals/utils.py

+def check_columns_exist(
+    subset: Sequence[str], /, *, available: list[str]
+) -> ColumnNotFoundError | None:


I also kept this function for exceptions like narwhals/_pandas_like/utils.py
(Plus a function is easier to test than the method.)

Changed slightly the signature, using one short kwarg. 😇 A middle ground between the previous 2 solutions

dangotbanned · 2025-05-24

Changed slightly the signature, using one short kwarg. 😇 A middle ground between the previous 2 solutions

No qualms from me 😄

@EdAbati should we punt the rest of (#2495 (comment)) to another issue?

I really wanna stress that I'm happy with getting these changes merged first! 😅

EdAbati · 2025-05-26

Happy to merge now if everyone is happy :)

EdAbati · 2025-05-22T19:20:20Z

narwhals/_spark_like/dataframe.py

To answer @MarcoGorelli's comment

so long as .columns isn't called in cases where the code could pass without it

This PR doesn't introduce additional calls to .columns, but here and in duckdb we already called .columns to check the input in unique and in drop

If they are not needed they should be removed in a separate PR. Hopefully they are easier to spot now? (i.e. if we remove _check_columns_exist for Compliant Lazy the code will scream in multiple ways and various lines)

dangotbanned · 2025-05-26T11:48:57Z

Thanks @EdAbati, I've opened (#2613) to track (#2495 (comment))

The type here is dependent on `check_columns_exist` See #2495 (comment)

* chore(typing): Widen `_utils` signatures
  In light of (#2706 (comment))
  Splitting out a change from https://github.com/narwhals-dev/narwhals/blob/7bb0d0df3d2f75bc903aa9fc2abf0ecc5a61f432/narwhals/_utils.py#L1194
  With that, I noticed a few others that also didn't need all the features of `Sequence`
* fix(typing): Update propagated signatures
  The type here is dependent on `check_columns_exist`
  See #2495 (comment)
* chore(typing): Remove now-unused type ignore
  `list[str] | _1DArray` is assignable to `Collection[str]`
  Previously, failed on `Sequence[str]` due to `numpy`
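To illustrate the `Collection` versus `Sequence` distinction in that last commit note: `Collection` only asks for `__len__`, `__iter__` and `__contains__`, while `Sequence` additionally expects indexing and related methods. The `NamesArray` class below is a hypothetical stand-in for an array-like object (it is not numpy) that is sized, iterable and supports `in`, but is not a `Sequence`.

```python
from collections.abc import Collection, Sequence


class NamesArray:
    """Hypothetical array-like: sized, iterable, supports `in`, nothing more."""

    def __init__(self, *names: str) -> None:
        self._names = names

    def __len__(self) -> int:
        return len(self._names)

    def __iter__(self):
        return iter(self._names)

    def __contains__(self, item: object) -> bool:
        return item in self._names


names = NamesArray("a", "b")

# Collection is recognised structurally from the three methods above.
is_collection = isinstance(names, Collection)
# Sequence is not: it requires explicit inheritance or registration.
is_sequence = isinstance(names, Sequence)
```

A parameter annotated with the wider `Collection[str]` accepts both kinds of input, which is why the type ignore could be dropped.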

EdAbati added 6 commits May 4, 2025 22:45

unify ColumnNotFound

6f7a574

revert

8fe45e6

use from_missing_and_available_column_names in check_column_exists

30a7a54

rename check_columns_exists

c785d75

add from_error

32d7f0e

cleanup errors

144691e

EdAbati commented May 5, 2025

narwhals/_pandas_like/expr.py (outdated, resolved)

grammar

1218afb

EdAbati changed the title ~~chore: cleanup ColumnNotFoundError~~ chore: always use check_columns_exist where possible May 5, 2025

EdAbati mentioned this pull request May 5, 2025

fix: unify ColumnNotFound for duckdb and pyspark #2493

Merged

10 tasks

Merge branch 'main' into cleanup_columnnotfounderror

5993d11

dangotbanned reviewed May 5, 2025

narwhals/utils.py (outdated, resolved)

dangotbanned added the internal label May 6, 2025

EdAbati added 7 commits May 12, 2025 09:00

refactor

3483e50

Merge remote-tracking branch 'upstream/main' into cleanup_columnnotfo…

47dc9c8

…underror

revert sparklike and duckdb changes

9935753

update with raise

6ccb801

restore test

93c6fb5

missing raise

ae92a45

Merge branch 'main' into cleanup_columnnotfounderror

e48ad5a

EdAbati marked this pull request as ready for review May 12, 2025 19:42

EdAbati added the pyspark and pyspark-connect labels May 12, 2025

dangotbanned self-requested a review May 13, 2025 13:24

dangotbanned added 3 commits May 13, 2025 14:57

Merge remote-tracking branch 'upstream/main' into pr/EdAbati/2495

124f5a3

fix: Unbreak ibis

9b7d594

Following merge (narwhals-dev@124f5a3)

Merge branch 'main' into cleanup_columnnotfounderror

f7e0bdf

Merge branch 'main' into cleanup_columnnotfounderror

6ba5ab8

dangotbanned approved these changes May 17, 2025

EdAbati added 4 commits May 22, 2025 08:15

Merge remote-tracking branch 'upstream/main' into cleanup_columnnotfo…

7de7de7

…underror

_check_columns_exist as method

750b133

Merge remote-tracking branch 'upstream/main' into cleanup_columnnotfo…

b71693c

…underror

coverage happy

4268c3f

EdAbati commented May 22, 2025

FBruzzesi mentioned this pull request May 24, 2025

chore: Simplify PandasLikeDataFrame|DaskLazyFrame.join method #2511

Merged

10 tasks

EdAbati and others added 3 commits May 26, 2025 08:53

fix type

add92eb

Merge remote-tracking branch 'upstream/main' into cleanup_columnnotfo…

0b0f7ad

…underror

Merge branch 'main' into cleanup_columnnotfounderror

16b9430

dangotbanned mentioned this pull request May 26, 2025

Reducing _pandas_like.utils #2613

Open

dangotbanned merged commit c952d58 into narwhals-dev:main May 26, 2025
34 checks passed

EdAbati deleted the cleanup_columnnotfounderror branch May 26, 2025 11:51

dangotbanned added a commit that referenced this pull request Jun 21, 2025

fix(typing): Update propagated signatures

2aba1c8

The type here is dependent on `check_columns_exist` See #2495 (comment)

