
fix: unify ColumnNotFound for duckdb and pyspark#2493

Merged
MarcoGorelli merged 65 commits into narwhals-dev:main from EdAbati:unify-column-not-found-error
Jul 16, 2025

Conversation

@EdAbati
Collaborator

@EdAbati EdAbati commented May 4, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati
Collaborator Author

EdAbati commented May 4, 2025

I think I can make some other clean-up of repetitive code. I'll try tomorrow morning

@EdAbati EdAbati marked this pull request as ready for review May 5, 2025 07:04
@EdAbati
Collaborator Author

EdAbati commented May 5, 2025

I made a followup PR #2495 with the cleanup :)

Member

@MarcoGorelli MarcoGorelli left a comment


thanks for working on this! just one comment on the .columns usage

def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
    return [col(name) for name in evaluate_column_names(df)]
    col_names = evaluate_column_names(df)
    missing_columns = [c for c in col_names if c not in df.columns]
Member


df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
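The interception pattern described here can be sketched as follows. This is a hypothetical, backend-agnostic illustration: the dict-based select and the ColumnNotFoundError class are stand-ins, not narwhals' actual implementation.

```python
class ColumnNotFoundError(KeyError):
    """Unified error raised when a selected column does not exist."""


def select(frame: dict, *names: str) -> dict:
    """Select columns, translating the backend's native error into a unified one."""
    try:
        # Stand-in for the backend's native `select`.
        return {name: frame[name] for name in names}
    except KeyError as exc:  # the backend's native "column not found" error
        msg = (
            "Selected columns not found in the DataFrame.\n\n"
            f"Hint: did you mean one of these columns: {list(frame)}?"
        )
        raise ColumnNotFoundError(msg) from exc
```

Note that the column names are only looked up in the except block, after the error has already occurred, so the happy path pays no `.columns` overhead.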

Collaborator Author

@EdAbati EdAbati May 5, 2025


Ah interesting, I was not aware 😕

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)

Member


df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it? 🙏

Member


I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:

In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% ▕████████████████████████████████████████████████████████████▏

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 μs ± 3.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩 )

The same consideration should apply to spark-like

Member


How do those timings compare to other operations/metadata lookups on the same tables?

Member


.alias for example is completely non-value-dependent, so that stays fast

In [60]: %timeit rel.alias
342 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@EdAbati EdAbati added the pyspark (Issue is related to pyspark backend), pyspark-connect, and error reporting labels May 6, 2025
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
Collaborator Author


Not 100% sure about this error message. I don't think we can access the missing column names at this level, am I missing something?

Member


I think what you've written is great - even though we can't access them, we can still try to be helpful

return df

if constructor_id == "polars[lazy]":
msg = r"^e|\"(e|f)\""
Collaborator Author

@EdAbati EdAbati May 9, 2025


Before, it was msg = "e|f". Now it is a bit stricter.
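For context, pytest.raises(..., match=msg) checks the pattern with re.search, so the stricter regex behaves roughly like this (the example strings below are illustrative, not the backends' actual error messages):

```python
import re

# The stricter pattern from the test: matches messages that start with "e",
# or that mention a quoted "e" or "f" anywhere.
msg = r"^e|\"(e|f)\""

assert re.search(msg, "e not found")                # starts with "e"
assert re.search(msg, 'column "f" does not exist')  # quoted "f"
assert not re.search(msg, "unknown column: g")      # neither
```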

Comment on lines 105 to 106
with pytest.raises(ColumnNotFoundError, match=msg):
df.select(nw.col("fdfa"))
Collaborator Author


Before, this was not tested for Polars.

constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
constructor_id = str(request.node.callspec.id)
if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
Collaborator Author

@EdAbati EdAbati May 9, 2025


sqlframe and pyspark.connect raise errors at collect. 😕

I need to double-check pyspark.connect. Currently I cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these?

@EdAbati EdAbati changed the title fix: unify ColumnNotFound for duckdb and pyspark/sqlframe fix: unify ColumnNotFound for duckdb and pyspark May 9, 2025
@EdAbati
Collaborator Author

EdAbati commented Jun 12, 2025

Found some time to update this. (sorry for the late reply)

@EdAbati IIRC there's some import-related functions in _spark_like.utils that may be helpful for you?

The problem IMO is that since sqlframe lets backends raise their own errors, we can only intercept the ones of the backends we also support (i.e. pyspark and duckdb).
I'm not sure if the best solution would be to let sqlframe do its own error handling or to intercept the errors just for the backends we support.
Maybe it should be discussed in an issue/follow-up?

@MarcoGorelli is there anything that you think is missing now? :)

@MarcoGorelli
Member

thanks! i think the logic looks right, the tests look a little complex but maybe that's ok. will get back to this shortly

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @EdAbati ! looks good to me

just a couple of merge conflicts and a suggestion on the tests, but then i'd say we can ship it 🚢

Comment on lines 315 to 318
elif "constructor_lazy" in metafunc.fixturenames:
metafunc.parametrize(
"constructor_lazy", lazy_constructors, ids=lazy_constructors_ids
)
Member


I feel slightly uneasy about adding an extra constructor just for one test. and if we need to add it, then maybe constructors could be a union of eager_constructors and lazy_constructors, rather than making all 3?

if possible, i'd suggest to leave as-is for now and see if it's possible to use constructor in the test

Collaborator Author


I updated now. I thought test_missing_columns was going to be less readable, but it doesn't make a lot of difference.

@MarcoGorelli
Member

The problem IMO is that since sqlframe lets backends raise their own errors, we can only intercept the ones of the backends we also support (i.e. pyspark and duckdb).
I'm not sure if the best solution would be to let sqlframe do its own error handling or to intercept the errors just for the backends we support.
Maybe it should be discussed in an issue/follow-up?

Yeah that's fine - I think in general it's ok to aim for "we try to unify what we can, but there may be some differences that we have no control over"

Collaborator Author

@EdAbati EdAbati left a comment


Sorry for the delay again 🥲 let me know if there is still something that looks off, I have some time today to update

and thank you for the review


Constructor: TypeAlias = Callable[[Any], "NativeLazyFrame | NativeFrame | DataFrameLike"]
ConstructorEager: TypeAlias = Callable[[Any], "NativeFrame | DataFrameLike"]
ConstructorLazy: TypeAlias = Callable[[Any], "NativeLazyFrame"]
Collaborator Author

@EdAbati EdAbati Jul 8, 2025


@MarcoGorelli do you think we should delete this too?
I think it actually makes the LAZY_CONSTRUCTORS: dict[str, ConstructorLazy] a bit more accurate/stricter

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @EdAbati ! i like the "sqlframe" check you've used in the tests, perhaps that deserves to be its own separate utility (in a separate pr) if you fancy?

just left a comment on the xfails but i think then we can ship it

Comment on lines 87 to 88
if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]", "ibis")):
# These backends raise errors at collect
Member


is it an issue that they raise errors at collect?

because below, we do

    if "polars_lazy" in str(constructor) and isinstance(df, nw.LazyFrame):
        # In the lazy case, Polars only errors when we call `collect`
        with pytest.raises(ColumnNotFoundError, match=msg):
            df.with_columns(d=nw.col("c") + 1).collect()

perhaps we could change that to just be

    if isinstance(df, nw.LazyFrame):
        # In the lazy case, Polars only errors when we call `collect`
        with pytest.raises(ColumnNotFoundError, match=msg):
            df.with_columns(d=nw.col("c") + 1).collect()

?

Collaborator Author

@EdAbati EdAbati Jul 15, 2025


Good point!

Actually, the comment here was misleading; fixed it.
At collect, sqlframe will re-surface the error of the backend it uses, so it won't be a ColumnNotFoundError (yet?). ibis was introduced after I started working on this, I think (soooorry it took soo long 🥲). Maybe we can work on ibis in a follow-up?

Also, "pyspark[connect]" should not have been an xfail; now it is tested. Thanks for the catch :)

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @EdAbati !

@MarcoGorelli MarcoGorelli merged commit 7e905f3 into narwhals-dev:main Jul 16, 2025
34 checks passed
@EdAbati
Collaborator Author

EdAbati commented Jul 16, 2025

Thank you @MarcoGorelli for the review and for the patience with this one 🥲❤️

@EdAbati EdAbati deleted the unify-column-not-found-error branch July 16, 2025 19:15
@dangotbanned
Member

@EdAbati @MarcoGorelli does this error seem related to this PR?

I haven't touched anything I swear 😂

@EdAbati
Collaborator Author

EdAbati commented Jul 17, 2025

does this error seem related to this PR?

Yeah, it looks like the "stricter" regexes (the ones that start with ^) are not OK for polars 0.20.22 :(

I have some time to update this tonight, I'll make a PR if no one does it before me :)


Labels

error reporting, pyspark (Issue is related to pyspark backend), pyspark-connect


Development

Successfully merging this pull request may close these issues.

error reporting: unify "column not found" error message for DuckDB / spark-like
