fix: unify ColumnNotFound for duckdb and pyspark #2493
MarcoGorelli merged 65 commits into narwhals-dev:main
Conversation
I think I can make some other clean-up of repetitive code. I'll try tomorrow morning

I made a follow-up PR #2495 with the cleanup :)
MarcoGorelli left a comment
thanks for working on this! just got a comment on the .columns usage
narwhals/_duckdb/expr.py
Outdated
    def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-       return [col(name) for name in evaluate_column_names(df)]
+       col_names = evaluate_column_names(df)
+       missing_columns = [c for c in col_names if c not in df.columns]
df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation
I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
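A minimal sketch of the interception pattern described above, under assumed names (`BackendError` and `select_with_unified_error` are illustrative stand-ins, not the actual Narwhals implementation):

```python
class ColumnNotFoundError(Exception):
    """Unified error for missing columns, raised regardless of backend."""


class BackendError(Exception):
    """Stand-in for a backend-specific error (e.g. PySpark's AnalysisException)."""


def select_with_unified_error(native_select, *column_names):
    """Call the backend's select; re-raise its failure as the unified error."""
    try:
        return native_select(*column_names)
    except BackendError as exc:
        # Intercept the backend message and replace it with a unified one.
        msg = "Selected columns not found in the DataFrame."
        raise ColumnNotFoundError(msg) from exc
```

The key point is that nothing like `df.columns` is evaluated on the happy path; the backend itself detects the missing column, and we only pay for the nicer message when an error actually occurs.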
Ah interesting, I was not aware 😕
What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more)
Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong
I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)
df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation
@MarcoGorelli could we add that to (#805) and put more of a focus towards it? 🙏
I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:
In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 μs ± 3.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩)
The same consideration should apply to spark-like
How do those timings compare to other operations/metadata lookups on the same tables?
.alias for example is completely non-value-dependent, so that stays fast
In [60]: %timeit rel.alias
342 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [61]: %timeit rel1.alias
393 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
narwhals/_spark_like/dataframe.py
Outdated
    try:
        return self._with_native(self.native.select(*new_columns_list))
    except AnalysisException as e:
        msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
Not 100% sure about this error message. I don't think we can access the missing column names at this level, am I missing something?
I think what you've written is great - even though we can't access them, we can still try to be helpful
tests/frame/select_test.py
Outdated
        return df


    if constructor_id == "polars[lazy]":
        msg = r"^e|\"(e|f)\""
Before it was msg = "e|f". Now it is a bit stricter
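For reference, the old and new patterns behave differently under `re.search` (which is what `pytest.raises(match=...)` uses under the hood). The sample messages below are illustrative, not actual backend output:

```python
import re

old = "e|f"
new = r"^e|\"(e|f)\""

# Both patterns match a message that quotes the missing column name.
assert re.search(new, '"e" not found')

# The new pattern also matches a message starting with the column name.
assert re.search(new, "e not found")

# But unlike the old one, it no longer matches a bare, unquoted
# name in the middle of an unrelated message.
assert re.search(old, "column f not found")
assert not re.search(new, "column f not found")
```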
tests/frame/select_test.py
Outdated
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(nw.col("fdfa"))
before this was not tested for polars
tests/frame/select_test.py
Outdated
        constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
    ) -> None:
        constructor_id = str(request.node.callspec.id)
        if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
sqlframe and pyspark.connect raise errors at collect. 😕
I need to double check pyspark.connect. Currently I cannot set it up locally... Working on it ⏳
Do you have an idea on how to deal with these?
ColumnNotFound for duckdb and pyspark/sqlframe → ColumnNotFound for duckdb and pyspark
Found some time to update this. (sorry for the late reply)

@MarcoGorelli is there anything that you think is missing now? :)

thanks! i think the logic looks right, the tests look a little complex but maybe that's ok. will get back to this shortly
MarcoGorelli left a comment
thanks @EdAbati ! looks good to me
just a couple of merge conflicts and a suggestion on the tests, but then i'd say we can ship it 🚢
tests/conftest.py
Outdated
    elif "constructor_lazy" in metafunc.fixturenames:
        metafunc.parametrize(
            "constructor_lazy", lazy_constructors, ids=lazy_constructors_ids
        )
I feel slightly uneasy about adding an extra constructor just for one test. and if we need to add it, then maybe constructors could be a union of eager_constructors and lazy_constructors, rather than making all 3?
if possible, i'd suggest to leave as-is for now and see if it's possible to use constructor in the test
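One possible shape for the union suggested above, as a sketch (the constructor ids here are hypothetical placeholders, not the actual lists in tests/conftest.py):

```python
# Hypothetical constructor ids; the real lists live in tests/conftest.py.
eager_constructors = ["pandas", "polars[eager]"]
lazy_constructors = ["polars[lazy]", "duckdb", "sqlframe"]

# Derive `constructors` instead of maintaining a third independent list,
# so the three parametrizations can never drift out of sync.
constructors = [*eager_constructors, *lazy_constructors]
```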
I updated now. I thought test_missing_columns was going to be less readable, but it doesn't make a lot of difference.
Yeah that's fine - I think in general it's ok to aim for "we try to unify what we can, but there may be some differences that we have no control over"
EdAbati left a comment
Sorry for the delay again 🥲 let me know if there is still something that looks off, I have some time today to update
and thank you for the review
    Constructor: TypeAlias = Callable[[Any], "NativeLazyFrame | NativeFrame | DataFrameLike"]
    ConstructorEager: TypeAlias = Callable[[Any], "NativeFrame | DataFrameLike"]
    ConstructorLazy: TypeAlias = Callable[[Any], "NativeLazyFrame"]
@MarcoGorelli do you think we should delete this too?
I think it actually makes the LAZY_CONSTRUCTORS: dict[str, ConstructorLazy] a bit more accurate/stricter
MarcoGorelli left a comment
thanks @EdAbati ! i like the "sqlframe" check you've used in the tests, perhaps that deserves to be its own separate utility (in a separate pr) if you fancy?
just left a comment on the xfails but i think then we can ship it
tests/frame/with_columns_test.py
Outdated
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]", "ibis")):
        # These backend raise errors at collect
is it an issue that they raise errors at collect?
because below, we do

    if "polars_lazy" in str(constructor) and isinstance(df, nw.LazyFrame):
        # In the lazy case, Polars only errors when we call `collect`
        with pytest.raises(ColumnNotFoundError, match=msg):
            df.with_columns(d=nw.col("c") + 1).collect()

perhaps we could change that to just be

    if isinstance(df, nw.LazyFrame):
        # In the lazy case, Polars only errors when we call `collect`
        with pytest.raises(ColumnNotFoundError, match=msg):
            df.with_columns(d=nw.col("c") + 1).collect()

?
Good point!
Actually the comment here was misleading, fixed it.
At collect, sqlframe will re-surface the error of the backend it uses, it won't be a ColumnNotFoundError (yet?). ibis was introduced after I started working on this I think (soooorry it took soo long 🥲), maybe we can work on ibis in a follow-up?
Also "pyspark[connect]" should not have been an xfail. Now it is tested. Thanks for the catch :)
Thank you @MarcoGorelli for the review and for the patience with this one 🥲❤️

@EdAbati @MarcoGorelli does this error seem related to this PR? I haven't touched anything I swear 😂

yeah, it looks like the "stricter" regex. I have some time to update this tonight, I'll make a PR if no one does it before me :)