feat: LazyFrame.collect with backend and **kwargs (#1734)
Conversation
Nice, I like the look of this! It may help to simplify Lines 77 to 80 in 8c9525a to just
agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow anyway?). To check my understanding then, this would be compatible with the
Yes exactly!
🤔 now that you mention (and completely unrelated from this PR), we could do the same for pyspark: see SO Ritchie answer
Yes indeed, as long as we intend to have the default be PyArrow
yes, nice! 🙌
I think this looks good, just going to give others the chance to weigh in
@MarcoGorelli I just added
```python
from narwhals.utils import Implementation
```

```python
return PolarsDataFrame(
    df=self._native_frame.pl(),
```
Unrelated but.. should we change in PolarsDataFrame:

```diff
- df: pl.DataFrame,
+ native_dataframe: pl.DataFrame,
```
narwhals/dataframe.py
```python
polars_kwargs: dict[str, Any] | None = None,
dask_kwargs: dict[str, Any] | None = None,
duckdb_kwargs: dict[str, str] | None = None,
```
These could all be TypedDicts 👀
Niiiice! 🙌🙌 A couple of questions/ideas:
Hey @EdAbati, thanks for your feedback! That's exactly the purpose of an RFC 👌
Considering the dataframe ecosystem as of today:
Not sure what we aim to support in the future 😉 but I would try not to overthink it too soon here!
Those are definitely fair concerns to think about! Thanks for pointing those out!
That's when they can branch out; we would just need to add additional branches:

```python
def collect(
    self: Self,
    *,
    polars_kwargs: dict[str, Any] | None = None,
    dask_kwargs: dict[str, Any] | None = None,
    duckdb_kwargs: dict[str, str] | None = None,
    **kwargs: Any,
):
    from narwhals.utils import Implementation

    if self.implementation is Implementation.POLARS and polars_kwargs is not None:
        kwargs_ = polars_kwargs
    elif ...:
        ...
    else:
        kwargs_ = kwargs
    return self._dataframe(
        self._compliant_frame.collect(**kwargs_),
        level="full",
    )
```
I would be ok deferring the responsibility to the users. But coming back to personal preference: if I were to use an external library and find that I have to do a lot of branching myself, I wouldn't say it is particularly ergonomic 😅
Ah! I was under the impression that there was a request to make ... Also FYI, PySpark 4.0.0 will support ...
there is indeed a request to be able to collect to specific backends (e.g. someone via work asked to be able to collect a duckdb-backed lazyframe into a polars-backed dataframe), but I think this would still be backend-specific - e.g. not all lazy backends would necessarily have a way to collect to Polars (as in ..., which might not have a way to do ...). Having said that, should we:
small note, but this can now be done as ...
Been thinking about this a bit more, and in the read functions we have ...

```python
lf.collect(collect_kwargs[lf.implementation])
```

We could also have ...
TBH, the more I think about this, the more I'd be in favour of just:

```python
def collect(self: Self, backend: ModuleType | Implementation | str | None, **kwargs) -> DataFrame[Any]: ...
def lazy(self: Self, backend: ModuleType | Implementation | str | None) -> LazyFrame[Any]: ...
```

Because then:
```python
df = (
    lf.group_by('a', 'b').agg(nw.all().mean().name.suffix('_mean'))
    .collect(engine='gpu')  # alternative is `.collect(polars_kwargs={'engine': 'gpu'})`
    .with_columns(pl.col('a').rolling_mean(2))
    .to_native()
)
```

```python
collect_kwargs = {
    nw.dependencies.get_polars(): {'engine': 'gpu'},
    nw.dependencies.get_dask(): {'optimize_graph': False},
    nw.dependencies.get_pyarrow(): {'backend': 'pyarrow'},
    my_fancy_extension_module: {'my_fancy_kwarg': my_fancy_value},
}
lazy_df = df.lazy().select(nw.all().sum()).collect(**collect_kwargs[nw.get_native_namespace(df)])
```

```python
lf = lf.group_by('foo').agg(nw.selectors.numeric.mean())
implementation = lf.implementation
df = lf.collect()
train_df, val_df = train_test_split(df)
train_lf = train_df.lazy(implementation)
val_lf = val_df.lazy(implementation)
```

I think we'd be pretty safe to use ...

Regarding ...
We should probably accept all the above in ...
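To make the idea concrete, here is a hedged sketch of how the three accepted backend specs (module, `Implementation` member, string) could be normalised into one value; `normalise_backend` and the slimmed-down enum are hypothetical names, not narwhals' actual helpers:

```python
# Hypothetical sketch: normalise a module, enum member, or string backend
# spec into a single Implementation value. Names are illustrative.
from enum import Enum, auto
from types import ModuleType


class Implementation(Enum):  # simplified stand-in for narwhals.utils.Implementation
    POLARS = auto()
    PYARROW = auto()
    DUCKDB = auto()
    UNKNOWN = auto()


def normalise_backend(backend: "ModuleType | Implementation | str") -> Implementation:
    if isinstance(backend, Implementation):
        return backend
    # a module spec is identified by its __name__; a string is used directly
    name = backend.__name__ if isinstance(backend, ModuleType) else backend
    mapping = {
        "polars": Implementation.POLARS,
        "pyarrow": Implementation.PYARROW,
        "duckdb": Implementation.DUCKDB,
    }
    return mapping.get(name, Implementation.UNKNOWN)


print(normalise_backend("polars").name)               # POLARS
print(normalise_backend(ModuleType("pyarrow")).name)  # PYARROW
```

The module branch means users can pass `polars` or `pyarrow` directly without narwhals importing them, since only `__name__` is inspected.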
Thanks @MarcoGorelli, we can do some brainstorming in the community call tomorrow maybe.
I am ok with this - it might not be the most ergonomic way, but it definitely lowers the effort a lot on our side (we would just need to pass ...
Am I correctly assuming this refers to which custom collect backend we want to allow for duckdb/pyspark? Or would you like it to be more general? Since polars and dask have a native ...
I was thinking something like:
So:
I am almost finished with the reimplementation. I have just one question left for now.
I'd say, ...
narwhals/_spark_like/dataframe.py
```python
return ArrowDataFrame(
    native_dataframe=pa.Table.from_batches(
        self._native_frame._collect_as_arrow()
    ),
```
I am defaulting to pandas instead of pyarrow because otherwise some tests will fail - specifically those that result in an empty dataframe: pyarrow's `from_batches` raises because RecordBatches cannot be empty. Maybe once pyspark 4.0 is out this could be better integrated?
this would technically be a breaking change - can we catch the exception and return an empty dataframe?
Sure, I will take a look. Notice that changing to pyarrow now is also a breaking change 🙈
oh right we currently collect into pandas?
🙈
we've documented pyspark support as "work-in-progress" so we may not need to feel too guilty about changing this 😄
```python
df=pl.from_arrow(  # type: ignore[arg-type]
    pa.Table.from_batches(self._native_frame._collect_as_arrow())
```
@MarcoGorelli I was expecting worse from CI after such an overhaul 😂
Title changed: `LazyFrame.collect` kwargs → `LazyFrame.collect` with backend and `**kwargs`
```python
mapping = {
    "pandas": Implementation.PANDAS,
    "modin": Implementation.MODIN,
    "cudf": Implementation.CUDF,
    "pyarrow": Implementation.PYARROW,
    "pyspark": Implementation.PYSPARK,
    "polars": Implementation.POLARS,
    "dask": Implementation.DASK,
    "duckdb": Implementation.DUCKDB,
    "ibis": Implementation.IBIS,
}
return mapping.get(backend_name, Implementation.UNKNOWN)
```
I was going to suggest this, after adding something similar in altair (https://github.com/vega/altair/blob/94220be0115e8b13d2ebc686552edf68fd841a54/altair/datasets/_reader.py#L493-L510)
Great to see it in narhwals 🎉
😄 nice, you're one step ahead
minor comment but I couldn't help spotting the "narhwals" typo in there (line 509) 🙈
MarcoGorelli left a comment:
Amazing, thanks so much @FBruzzesi ! I pushed a little commit to simplify backend parsing, to default to PyArrow for PySpark, and to handle the (hopefully rare) empty input cases
I noticed something here regarding collecting to different backends: going from pyspark to pandas loses the timezone-awareness of the dtype, but for pyarrow it's preserved:

```
(Pdb) p self._native_frame
DataFrame[b: timestamp]
(Pdb) p self._native_frame.toPandas()
                    b
0 2020-01-01 12:34:56
(Pdb) p self._native_frame._collect_as_arrow()
[pyarrow.RecordBatch
b: timestamp[us, tz=UTC]
----
b: [2020-01-01 12:34:56.000000Z]]
```

This helped spot that in `to_datetime` we're not parsing timezone-naive formats as timezone-naive.
To me this kinda confirms that PyArrow is probably a better default
Thanks @MarcoGorelli for all the additional improvements! I am very excited to finally ship this one and enable ...
thanks all! let's ship it then

What type of PR is this? (check all applicable)
Related issues

- `collect` for lazy-only libraries #1479

Checklist
If you have comments or can explain your changes, please do so below
This is a proposal for #1479.
As it gets more relevant now due to DuckDB support and to decide how we could collect a DuckDB table.
For polars and dask, collect kwargs would follow native `collect` and `compute` respectively. For DuckDB we could come up with our own and document it properly. Specifically, I would suggest letting the user decide which dataframe backend to collect to (`return_type`?), with Arrow as default.