Skip to content

chore: always use check_columns_exist where possible#2495

Merged
dangotbanned merged 27 commits intonarwhals-dev:mainfrom
EdAbati:cleanup_columnnotfounderror
May 26, 2025
Merged

chore: always use check_columns_exist where possible#2495
dangotbanned merged 27 commits intonarwhals-dev:mainfrom
EdAbati:cleanup_columnnotfounderror

Conversation

@EdAbati
Copy link
Collaborator

@EdAbati EdAbati commented May 5, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati EdAbati changed the title chore: cleanup ColumnNotFoundError chore: always use check_columns_exist where possible May 5, 2025
@EdAbati
Copy link
Collaborator Author

EdAbati commented May 12, 2025

Removed the dependency on #2493 , this is now ready for review :)

@EdAbati EdAbati marked this pull request as ready for review May 12, 2025 19:42
@EdAbati EdAbati added pyspark Issue is related to pyspark backend pyspark-connect labels May 12, 2025
@dangotbanned dangotbanned self-requested a review May 13, 2025 13:24
- Avoid collecting into `list` or `set` until strictly needed
- Accept `_StoresColumns` objects in more places
- Use a consistent positional pattern of `frame, subset`
- Fix `FBT001`
- Prepare error message entirely in `ColumnNotFoundError`
@dangotbanned
Copy link
Member

@EdAbati I've been experimenting with #2534 a bit locally and realised some of the changes I was making to drop, parse_columns_to_drop might make sense to use here.

I was finding it difficult to make those as suggestions, so I hope collecting them into 1 commit makes it easier to revert if you prefer what was there before 🙂

I made some notes on the commit description as well

@EdAbati
Copy link
Collaborator Author

EdAbati commented May 15, 2025

Thanks for the update!

I think most of thechanges looks great to me.

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:

def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

When used it will become:

- check_columns_exist(self, subset)
+ check_columns_exist(subset, available=self.columns)

- check_columns_exist(frame, to_drop)
+ check_columns_exist(to_drop, available=frame.columns)

check_columns_exist(
-     df.columns.tolist(),  # type: ignore[attr-defined]
-     column_names,  # type: ignore[arg-type]
+     column_names,
+     available=df.columns.tolist(),
)

But this is highly subjective, maybe @FBruzzesi or @MarcoGorelli can break the tie :)

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky? One (if doesn't use check_columns_exist) can accidentally write ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

Anyway all minor points. If the manjority likes this version let's go for it

@FBruzzesi
Copy link
Member

Hey everyone! I didn't go through the detailed history of what happened here

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:

def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

I strongly agree with this statement: considering how easy it would be to swap the two arguments I would prefer to have them as keyword only. And passing column names directly instead of the entire dataframe would make it more generic and re-usable.

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky? One (if doesn't use check_columns_exist) can accidentally write ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

Same consideration - I rather have them keyword only to be explicit in this case

@dangotbanned
Copy link
Member

dangotbanned commented May 15, 2025

Hey @EdAbati, @FBruzzesi

You'd be surprised how many thoughts your comments sparked for me 😄

As I mentioned in (#2495 (comment)), these changes are kind of a merging of ideas.
I now have some extra context with these 3 commits:

Just wanted to check in and say I hear you 🙂

Hopefully will be able to collect my thoughts properly later today.
But the main part is about encoding more into types/classes/methods vs utility functions that pass around state and have to be repeated in every implementation (which is where the concerns that lead to keyword-only args come from IMO)

@MarcoGorelli
Copy link
Member

i haven't checked everything but - so long as .columns doesn't called in cases where the code could pass without it - then I'm happy to trust/defer to you all for this

@dangotbanned
Copy link
Member

Note

I spiralled a bit (you were warned) but hopefully there's some good info in here 😄

check_columns_exist

@EdAbati

I'm not 100% sure about the signature of check_columns_exist, I personally tend to prefer more explicit kwargs. And IMO having the 1st arg that could be either a frame or a list of cols could be a bit less straightforward to read if one doesn't look at the signature. My vote is on something like:

Signature

def check_columns_exist(columns: Sequence[str], /, *, available: Sequence[str]) -> ColumnNotFoundError | None:
    ...

When used it will become:

Diff

- check_columns_exist(self, subset)
+ check_columns_exist(subset, available=self.columns)

- check_columns_exist(frame, to_drop)
+ check_columns_exist(to_drop, available=frame.columns)

check_columns_exist(
-     df.columns.tolist(),  # type: ignore[attr-defined]
-     column_names,  # type: ignore[arg-type]
+     column_names,
+     available=df.columns.tolist(),
)

So if we put aside the keyword vs positional-only (and their positions) for a moment.
The more interesting question to me is:

Why does it accept list[str] at all?

In all but one place, the argument passed will take the _StoresColumns path.

_StoresColumns

narwhals/narwhals/utils.py

Lines 152 to 154 in 0264f98

class _StoresColumns(Protocol):
@property
def columns(self) -> Sequence[str]: ...

That happy path includes each of these and all their descendents:

  • BaseFrame
  • CompliantDataFrame
  • CompliantLazyFrame

So - for every one of those - we don't need to repeat thing.columns every time we call check_columns_exist because we're passing in a thing that provides them for us on access.

The ugly duckling

Our odd-one-out is also the only variant that is not calling with the context of a Compliant* class.

Important

This is the part I think needs changing

_pandas_like.utils.select_columns_by_name

https://github.com/EdAbati/narwhals/blob/f7e0bdfb237336fb3d8a3d036a7b436c419d1710/narwhals/_pandas_like/utils.py#L646-L678

def select_columns_by_name(
    df: T,
    column_names: list[str] | _1DArray,  # NOTE: Cannot be a tuple!
    backend_version: tuple[int, ...],
    implementation: Implementation,
) -> T:
    """Select columns by name.

    Prefer this over `df.loc[:, column_names]` as it's
    generally more performant.
    """
    if len(column_names) == df.shape[1] and all(column_names == df.columns):  # type: ignore[attr-defined]
        return df
    if (df.columns.dtype.kind == "b") or (  # type: ignore[attr-defined]
        implementation is Implementation.PANDAS and backend_version < (1, 5)
    ):
        # See https://github.com/narwhals-dev/narwhals/issues/1349#issuecomment-2470118122
        # for why we need this
        if error := check_columns_exist(
            df.columns.tolist(),  # type: ignore[attr-defined]
            column_names,  # type: ignore[arg-type]
        ):
            raise error
        return df.loc[:, column_names]  # type: ignore[attr-defined]
    try:
        return df[column_names]  # type: ignore[index]
    except KeyError as e:
        if error := check_columns_exist(
            df.columns.tolist(),  # type: ignore[attr-defined]
            column_names,  # type: ignore[arg-type]
        ):
            raise error from e
        raise

From my understanding, we have this as a utility function so that it can be reused in both PandasLikeDataFrame and DaskLazyFrame.
However, that has multiple downsides 😞:

(1)

We need to pass context out of the Compliant* class and then back again to create a new one.
So, selecting columns requires 4 arguments instead of the 1 we'd need if it were a method (and that context is already stored on the instance)

All of this boilerplate is repeated 11 times, and ...

It worsens the more the pattern is repeated and stacked

# rename to avoid creating extra columns in join
other_native = rename(
select_columns_by_name(
other.native,
list(right_on),
self._backend_version,
self._implementation,
),
columns=dict(zip(right_on, left_on)), # type: ignore[arg-type]
implementation=self._implementation,
backend_version=self._backend_version,
).drop_duplicates()

(2)

The DaskLazyFrame case also needs to check every time if it is Implementation.PANDAS.

That should never be true, but we're checking that each time we call

  • DaskLazyFrame.simple_select
  • DaskLazyFrame.select
  • DaskLazyFrame.join(how="anti"|"semi")
  • DaskLazyFrame.with_row_index
  • DaskExpr.is_first_distinct
  • DaskExpr.is_last_distinct

(3)

We're again repeating code for the 2x check_columns_exist calls inside select_columns_by_name.

With all the # type: ignore(s) all of this gets somewhat tricky to read.

ColumnNotFoundError.from_missing_and_available_column_names

@EdAbati

On a similar note, don't you think forcing positional args in ColumnNotFoundError.from_missing_and_available_column_names could be a bit risky?
One (if doesn't use check_columns_exist) can accidentally write
ColumnNotFoundError.from_missing_and_available_column_names(available, missing) instead of
ColumnNotFoundError.from_missing_and_available_column_names(missing, available)

@FBruzzesi

Same consideration - I rather have them keyword only to be explicit in this case

This one I'm a bit confused about, check_columns_exist is the only place that calls ColumnNotFoundError.from_missing_and_available_column_names.
I think that was one of the main changes this PR added?

If we were to replace the positions, the code won't make it past the type checker ...

since available_columns rejects a set

image

Previously both accepted a list[str] as positional args and this is still true.

Note

Since this is only used in one place - I'm not too fussed either way 🤷‍♂️

My personal take is that both of these are waaaaaaaaaaay too verbose 😅

        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing_columns=sorted(missing), available_columns=list(available_columns)
        )
        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing, available_columns
        )


What do now?

I think some more changes like (5f367a5) would mean we can simplify things a lot more.
This could be a starting point I suppose 🙂

class CompliantDataFrame:
    def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
        available = self.columns
        if missing := set(subset).difference(available):
            return ColumnNotFoundError.from_missing_and_available_column_names(
                missing, available
            )
        return None

class EagerDataFrame(CompliantDataFrame):
    def simple_select(self, *column_names: str, validate_column_names: bool = True) -> Self: ...

class PandasLikeDataFrame(EagerDataFrame):
    def simple_select(
        self, *column_names: str, validate_column_names: bool = True # currently always False, could use that as a default instead?
    ) -> Self:
        subset = list(column_names)
        df: pd.DataFrame = self.native
        if len(subset) == df.shape[1] and all(subset == df.columns):  # type: ignore[attr-defined]
            result = df
        elif (df.columns.dtype.kind == "b") or (
            self._implementation.is_pandas() and self._backend_version < (1, 5)
        ):
            # See https://github.com/narwhals-dev/narwhals/issues/1349#issuecomment-2470118122
            # for why we need this
            if error := self._check_columns_exist(subset):
                raise error
            result = df.loc[:, subset]
        else:
            try:
                result = df[subset]
            except KeyError as e:
                if error := self._check_columns_exist(subset):
                    raise error from e
                raise
        return self._with_native(result, validate_column_names=validate_column_names)

@FBruzzesi
Copy link
Member

Hey @dangotbanned thanks for the thoughtful reasoning walkthrough. I agree with most of the points you made and will try to add my two cents.

The reason for suggesting a verbose approach is that what we have is a utility function, which is much more generic than the use case - the answer we get is from the following question "is X a subset of Y?" - so to me it was important to make sure who is X and who is Y.

Regarding ColumnNotFoundError.from_missing_and_available_column_names - ok that's fair to avoid repeating ourselves since the order is already in the method name.

Now to the most important piece: "What do now?" - yes I really like that approach, and by moving from utility function to a method, it's much clearer who is what - so thank for the clarity and the suggestion, I think it's really great

@dangotbanned
Copy link
Member

dangotbanned commented May 17, 2025

Thanks @FBruzzesi 😍 (#2495 (comment))

Now to the most important piece: "What do now?" - yes I really like that approach, and by moving from utility function to a method, it's much clearer who is what - so thank for the clarity and the suggestion, I think it's really great

Reading this back now I realised I waffled quite a lot and missed the point I wanted to land on 😭

What I proposed in (#2495 (comment)) was what I had in mind while writing (4d00461)

Important

I didn't mean we need to do any of that in this PR

That was just the direction I wanted to move in afterwards.

I'll leave this to anyone else to merge - just signing off that I'm happy with the PR as-is - or we can revert some parts.
But (in my mind) either way is a stepping stone to (What do now?) 🙂

Thanks for working on this @EdAbati and @FBruzzesi for reading my blog post 😉

@FBruzzesi
Copy link
Member

FBruzzesi commented May 17, 2025

I didn't mean we need to do any of that in this PR

That was just the direction I wanted to move in afterwards.

On this regard - I think it's worth addressing it here at this point since the discussion started and the idea is quite clear.
Currently we are in a situation for which:

  • The codebase is in a transitional situation (according to Dan)
  • It's not verbose/clear enough (according to Edo and I)

Edit: Never mind, it might not be so straightforward after all. There are many cases for which:

  • check_columns_exist is not used at the compliant level but at narwhals one
  • or it is used without a dataframe instance

@EdAbati
Copy link
Collaborator Author

EdAbati commented May 19, 2025

Hi all sorry for the late reply.

I love the idea of having this as a method in CompliantDataFrame

I will have some time tonight to update this PR to see how straightforward it is. If you prefer to merge this for today's release I'll do it in a follow up :)

Copy link
Collaborator Author

@EdAbati EdAbati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this was suuuuper delayed. :( I implemented some of the suggestions above.

Let me know what you think and if in general you thinkthis PR improves things compared to before :D

Comment on lines 275 to 276
def _check_columns_exist(self, subset: Sequence[str]) -> ColumnNotFoundError | None:
return check_columns_exist(subset, available=list(self.columns))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this method at compliant level as @dangotbanned suggested.

I like it way more now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one @EdAbati!

One of the bits that seems to have been missed is the conversion here on available:

available=list(self.columns)

utils.check_columns_exist only requires list[str] due to the annotation changing since (4d00461) to pass the type checker.

Signature as of (750b133)

Best practices

Code block

def check_columns_exist(
    subset: Sequence[str], /, *, available: list[str]  # <-- 😢
) -> ColumnNotFoundError | None:
    if missing := set(subset).difference(available):  # <-- Iterable[str]
        return ColumnNotFoundError.from_missing_and_available_column_names(
            missing, available  # <-- Sequence[str]
        )
    return None

One way to think about typing is in terms of set logic.

Here available has two constraints:

  1. The set of types accepted by set.difference
    • Iterable[str]
  2. The set of types accepted by ColumnNotFoundError.from_missing_and_available_column_names(available=...)
    • Sequence[str]

Since Sequence[str] is narrower than Iterable[str] (think subset but more fancy 😉) we can use it to satisfy both constraints

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point and awesome explanation. updated :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @EdAbati

Comment on lines 1540 to 1542
def check_columns_exist(
subset: Sequence[str], /, *, available: list[str]
) -> ColumnNotFoundError | None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept also this function for the exeptions like narwhals/_pandas_like/utils.py
(Plus a function is easier to test then the method.)

Changed slightly the signature, using one short kwargs. 😇 a middle ground between the previous 2 solutions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed slightly the signature, using one short kwargs. 😇 a middle ground between the previous 2 solutions

No qualms from me 😄

@EdAbati should we punt the rest of (#2495 (comment)) to another issue?

I really wanna stress that I'm happy with getting these changes merged first! 😅

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to merge now if everyone is happy :)

Copy link
Collaborator Author

@EdAbati EdAbati May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to answer at @MarcoGorelli 's comment

so long as .columns doesn't called in cases where the code could pass without it

This PR doesn't introduce additional calls to .columns but here and in duckdb we already called .column to check the input in unique and in drop

If they are not needed they should be removed in a separate PR. Hopefully they are easier to spot now? (i.e. if we remove _check_columns_exist for Compliant Lazy the code will scream in multiple ways and various lines)

@dangotbanned
Copy link
Member

Thanks @EdAbati, I've opened (#2613) to track (#2495 (comment))

@dangotbanned dangotbanned merged commit c952d58 into narwhals-dev:main May 26, 2025
34 checks passed
@EdAbati EdAbati deleted the cleanup_columnnotfounderror branch May 26, 2025 11:51
dangotbanned added a commit that referenced this pull request Jun 21, 2025
The type here is dependent on `check_columns_exist`
See #2495 (comment)
MarcoGorelli pushed a commit that referenced this pull request Jun 25, 2025
* chore(typing): Widen `_utils` signatures

In light of (#2706 (comment))

Splitting out a change from https://github.com/narwhals-dev/narwhals/blob/7bb0d0df3d2f75bc903aa9fc2abf0ecc5a61f432/narwhals/_utils.py#L1194

With that, I noticed a few others that also didn't need all the features of `Sequence`

* fix(typing): Update propagated signatures

The type here is dependent on `check_columns_exist`
See #2495 (comment)

* chore(typing): Remove now-unused type ignore

`list[str] | _1DArray` is assignable to `Collection[str]`
Previously, failed on `Sequence[str]` due to `numpy`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

internal pyspark Issue is related to pyspark backend pyspark-connect

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants