
feat: LazyFrame.collect with backend and **kwargs#1734

Merged
MarcoGorelli merged 25 commits into main from feat/collect-kwargs on Feb 2, 2025

Conversation


@FBruzzesi FBruzzesi commented Jan 6, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

This is a proposal for #1479.

This is getting more relevant now due to DuckDB support, as we need to decide how we could collect a DuckDB table.

For polars and dask, collect kwargs would follow native collect and compute respectively. For DuckDB we could come up with our own and document it properly. Specifically, I would suggest letting the user decide which dataframe backend to collect to (return_type?), with Arrow as default.

@FBruzzesi FBruzzesi added the enhancement New feature or request label Jan 6, 2025

MarcoGorelli commented Jan 6, 2025

Nice, I like the look of this

it may help to simplify

narwhals/tests/utils.py, lines 77 to 80 in 8c9525a:

if result.implementation is Implementation.POLARS and os.environ.get(
    "NARWHALS_POLARS_GPU", False
):  # pragma: no cover
    result = result.to_native().collect(engine="gpu")

to just result.collect(polars_kwargs=dict(engine="gpu"))?


I would suggest to let the user decide to which dataframe backend to collect to (return_type?), with Arrow as default.

agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow regardless?). to check my understanding then, this would be compatible with the collect in #1725, and we can add extra kwargs later?

@FBruzzesi (Member Author)

to just result.collect(polars_kwargs=dict(engine="gpu"))?

Yes exactly!

agree, I think arrow's a good default for duckdb (also, as far as I can tell, collecting into Polars from duckdb requires pyarrow anyway, suggesting they first collect into pyarrow regardless?).

🤔 now that you mention it (and completely unrelated to this PR), we could do the same for pyspark: see Ritchie's answer on StackOverflow

to check my understanding then, this would be compatible with the collect in #1725, and we can add extra kwargs later?

Yes indeed as long as we intend to have the default to be PyArrow

@MarcoGorelli (Member)

we could do the same for pyspark: see SO Ritchie answer

yes, nice! 🙌

@MarcoGorelli (Member)

I think this looks good, just going to give the chance to others to weigh in


FBruzzesi commented Jan 6, 2025

@MarcoGorelli I just added duckdb_kwargs to specify DuckDB return_type in the last commit. Can revert it if we want to sleep on it

from narwhals.utils import Implementation

return PolarsDataFrame(
df=self._native_frame.pl(),
FBruzzesi (Member Author):

Unrelated, but... should we change this in PolarsDataFrame:

- df: pl.DataFrame,
+ native_dataframe: pl.DataFrame,

Comment on lines 3616 to 3618
polars_kwargs: dict[str, Any] | None = None,
dask_kwargs: dict[str, Any] | None = None,
duckdb_kwargs: dict[str, str] | None = None,
FBruzzesi (Member Author):

These could all be TypedDicts 👀


EdAbati commented Jan 9, 2025

Niiiice! 🙌🙌

A couple of questions/ideas:

  • wouldn't return_type eventually become an argument for every Lazy backend? e.g. one may want to always collect to polars regardless of which Lazy dataframe they are starting from.
    What do you think about changing the signature to:

    def collect(self, return_type, ...) -> ...
  • still not 100% sold on the idea of having one specific kwargs per Lazy backend; I think it would give us a bit less flexibility. The signature would change if we add more lazy backends (e.g. pyspark). Or I think it'd become a problem if:

    • we remove a lazy backend, for example moving it to a separate integration library
    • people want to create their own Lazy backends and want to pass args to their collect (probably a niche use case though)

    What do you think?

    Also I wouldn't find particularly ugly something like this (but that's personal preference :D ):

     lazy_df = df.lazy().select(nw_v1.all().sum())
     if lazy_df.implementation == Implementation.POLARS:
         eager_df = lazy_df.collect(no_optimization=True)
     elif lazy_df.implementation == Implementation.DASK:
         eager_df = lazy_df.collect(optimize_graph=False)
     elif lazy_df.implementation == Implementation.DUCKDB:
         eager_df = lazy_df.collect(return_type="pyarrow")


FBruzzesi commented Jan 9, 2025

Hey @EdAbati, thanks for your feedback! That's exactly the purpose of an RFC 👌

  • wouldn't return_type eventually become an argument for every Lazy backend? e.g. one may want to always collect to polars regardless of which Lazy dataframe they are starting from.

Considering the dataframe ecosystem as of today:

  • Polars would always collect to polars
  • Dask would always collect to pandas
  • DuckDB (and maybe Ibis) could collect to polars, arrow and pandas
  • Spark natively to pandas (but I see the appeal of collecting to others)

Not sure what we aim to support in the future 😉 but I would try not to overthink it too soon here!

  • still not 100% sold on the idea of having one specific kwargs per Lazy backend, I think it would give us a bit less flexibility. The signature would change if we add more lazy backends (e.g. pyspark). Or I think it'd become a problem if:

    • we remove a lazy backend, for example moving it to a separate integration library

Those are definitely fair concerns to think about! Thanks for pointing them out!

  • people want to create their own Lazy backends and want to pass args to their collect (probably a niche use case though)

That's where they can branch out; we would just need to add an additional **kwargs to be passed to .collect whenever the implementation is outside what we cover with the dedicated arguments. In code:

def collect(
    self: Self,
    *,
    polars_kwargs: dict[str, Any] | None = None,
    dask_kwargs: dict[str, Any] | None = None,
    duckdb_kwargs: dict[str, str] | None = None,
    **kwargs: Any,
):
    from narwhals.utils import Implementation

    if self.implementation is Implementation.POLARS and polars_kwargs is not None:
        kwargs_ = polars_kwargs
    elif ...:
        ...
    else:
        kwargs_ = kwargs

    return self._dataframe(
        self._compliant_frame.collect(**kwargs_),
        level="full",
    )

What do you think?
Also I wouldn't find particularly ugly something like this (but that's personal preference :D ):

 lazy_df = df.lazy().select(nw_v1.all().sum())
 if lazy_df.implementation == Implementation.POLARS:
     eager_df = lazy_df.collect(no_optimization=True)
 elif lazy_df.implementation == Implementation.DASK:
     eager_df = lazy_df.collect(optimize_graph=False)
 elif lazy_df.implementation == Implementation.DUCKDB:
     eager_df = lazy_df.collect(return_type="pyarrow")

I would be ok deferring the responsibility to the users. But coming back to personal preference: if I were to use an external library and found that I have to do a lot of branching myself, I wouldn't say it is particularly ergonomic 😅


EdAbati commented Jan 11, 2025

Not sure what we aim to support in the future 😉 but I would try not to overthink it too soon here!

Ah! I was under the impression that there was a request to make LazyFrames able to collect to any backend, similarly to what the methods to_pandas(), to_arrow() etc. do for eager frames. Maybe I misunderstood then :D

Also FYI PySpark 4.0.0 will support toArrow

@MarcoGorelli (Member)

there is indeed a request to be able to collect to specific backends (e.g. someone via work asked to be able to collect a duckdb-backed lazyframe into a polars-backed dataframe), but I think this would still be backend-specific - e.g. not all lazy backends would necessarily have a way to collect to polars

as in, we might not have a way to do lazy_df.collect(eager_backend='polars') and know that it will work for all lazy backends... but if we know that, say, duckdb supports it, we can do lazy_df.collect(duckdb_kwargs={'eager_backend': 'polars'})

Having said that, should we:

  • use eager_backend instead of return_type? return_type kind of sounds like it affects the return type of the function, although that is always nw.DataFrame
  • should we document a couple of common kwargs for polars / dask backends? e.g. streaming, engine='gpu'. then for duckdb_kwargs / pyspark_kwargs we can document 'eager_backend'

if lazy_df.implementation == Implementation.POLARS:

small note, but this can now be done as if lazy_df.implementation.is_polars()


MarcoGorelli commented Jan 16, 2025

Been thinking about this a bit more, and in the read functions we have **kwargs, perhaps we should be consistent with that? To avoid branching, users could make a collect_kwargs dictionary which maps implementations to kwargs and then do

lf.collect(**collect_kwargs[lf.implementation])

We could also have eager_backend as an argument in collect, and lazy_backend in lazy?
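The dictionary-dispatch idea above can be sketched as follows; the `Implementation` enum and `collect` function here are minimal stand-ins for the narwhals API, not its actual definitions:

```python
from enum import Enum, auto


class Implementation(Enum):
    # Stand-in for narwhals.Implementation; members are illustrative
    POLARS = auto()
    DASK = auto()
    DUCKDB = auto()


# One kwargs mapping per lazy backend, resolved once at the call site,
# with no if/elif branching
collect_kwargs = {
    Implementation.POLARS: {"no_optimization": True},
    Implementation.DASK: {"optimize_graph": False},
    Implementation.DUCKDB: {"eager_backend": "pyarrow"},
}


def collect(implementation, **kwargs):
    # Stand-in for LazyFrame.collect: just echo what would be forwarded
    return kwargs


forwarded = collect(Implementation.DASK, **collect_kwargs[Implementation.DASK])
```

The point of the pattern is that the backend-specific branching lives in one dictionary literal rather than spread across call sites.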


MarcoGorelli commented Jan 28, 2025

TBH, the more I think about this, the more I'd be in favour of just:

    def collect(self: Self, backend: ModuleType | Implementation | str | None, **kwargs: Any) -> DataFrame[Any]: ...
    def lazy(self: Self, backend: ModuleType | Implementation | str | None) -> LazyFrame[Any]: ...

Because then:

  • as a user, I could do
df = (
    lf.group_by('a', 'b').agg(nw.all().mean().name.suffix('_mean'))
    .collect(engine='gpu')  # alternative is `.collect(polars_kwargs={'engine': 'gpu'})`
    .with_columns(pl.col('a').rolling_mean(2))
    .to_native()
)
  • as a tool-builder wanting to avoid extensive if/then statement, I could do:
collect_kwargs = {
    nw.dependencies.get_polars(): {'engine': 'gpu'},
    nw.dependencies.get_dask(): {'optimize_graph': False},
    nw.dependencies.get_pyarrow(): {'backend': 'pyarrow'},
    my_fancy_extension_module: {'my_fancy_kwarg': my_fancy_value},
}
lazy_df = df.lazy().select(nw.all().sum()).collect(**collect_kwargs[nw.get_native_namespace(df)])
  • it's possible to round-trip lazy-eager-lazy, like
lf = lf.group_by('foo').agg(nw.selectors.numeric().mean())
implementation = lf.implementation
df = lf.collect()
train_df, val_df = train_test_split(df)
train_lf = train_df.lazy(implementation)
val_lf = val_df.lazy(implementation)

I think we'd be pretty safe to use backend, because Polars uses engine and I can't see why they'd add both

Regarding ModuleType | Implementation | str | None, I think we can accept all of:

  • module (e.g. polars, duckdb, ...)
  • implementation (e.g. nw.Implementation.POLARS)
  • str (e.g. 'polars')
  • None: use the default eager/lazy backend for the given module (we can document sensible defaults)

We should probably accept all of the above in from_dict and similar functions too, and perhaps deprecate native_namespace, which is annoyingly long
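A minimal sketch of how accepting `ModuleType | Implementation | str | None` could be normalized; the enum and `parse_backend` helper below are illustrative stand-ins, not narwhals' actual implementation:

```python
from enum import Enum
from types import ModuleType


class Implementation(Enum):
    # Stand-in for narwhals.Implementation; values mirror module names
    POLARS = "polars"
    PANDAS = "pandas"
    PYARROW = "pyarrow"


def parse_backend(backend, default):
    """Normalize ModuleType | Implementation | str | None to an Implementation."""
    if backend is None:
        return default  # documented per-backend default
    if isinstance(backend, Implementation):
        return backend
    if isinstance(backend, ModuleType):
        return Implementation(backend.__name__)  # e.g. the polars module itself
    return Implementation(backend)  # a string such as "polars"


fake_polars = ModuleType("polars")  # placeholder for `import polars`
resolved = parse_backend(fake_polars, default=Implementation.PYARROW)
```

Matching a module by its `__name__` is one possible design choice here; the real implementation may dispatch differently.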

@FBruzzesi (Member Author)

Thanks @MarcoGorelli, we can do some brainstorming in the community call tomorrow maybe.

  • as a tool-builder wanting to avoid extensive if/then statement, I could do:
collect_kwargs = {
    nw.dependencies.get_polars(): {'engine': 'gpu'},
    nw.dependencies.get_dask(): {'optimize_graph': False},
    nw.dependencies.get_pyarrow(): {'backend': 'pyarrow'},
    my_fancy_extension_module: {'my_fancy_kwarg': my_fancy_value},
}
lazy_df = df.lazy().select(nw.all().sum()).collect(**collect_kwargs[nw.get_native_namespace(df)])

I am ok with this - it might not be the most ergonomic way, but it definitely lowers the effort a lot on our side (we would just need to pass kwargs along).

I think we'd be pretty safe to use backend, because Polars uses engine and I can't see why they'd add both

Regarding ModuleType | Implementation | str | None, I think we can accept all of:

  • module (e.g. polars, duckdb, ...)
  • implementation (e.g. nw.Implementation.POLARS)
  • str (e.g. 'polars')
  • None: use the default eager/lazy backend for the given module (we can document sensible defaults)

Am I correctly assuming this refers to which custom collect backend we want to allow for duckdb/pyspark? Or would you like it to be more general?

Since polars and dask have a native collect backend, I would rather keep that, but I might be missing some use cases.
For pyspark and duckdb I am totally open to customizing their available kwargs to allow for backend (with pyarrow as default?)

@MarcoGorelli (Member)

I was thinking something like:

  • backend not specified: the default one is used, and we document what that means. This would be:
    • polars.LazyFrame -> polars.DataFrame
    • dask.DataFrame -> pandas.DataFrame
    • duckdb.PyRelation -> pyarrow.Table
    • pyspark -> probably pyarrow.Table?
  • backend specified: we collect into the desired eager implementation.

So:

  • if someone calls .collect('pandas') on a polars.LazyFrame-backed narwhals.LazyFrame, then we give them a pandas-backed narwhals.DataFrame
  • if someone calls .collect() on a polars.LazyFrame-backed narwhals.LazyFrame, then we give them a polars.DataFrame-backed narwhals.DataFrame
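The resolution rule described above can be sketched like this; the defaults table mirrors the bullet list, and `resolve_eager_backend` is a hypothetical helper, not narwhals API:

```python
# Illustrative defaults for backend=None, per lazy implementation:
# polars -> polars, dask -> pandas, duckdb/pyspark -> pyarrow
DEFAULT_EAGER_BACKEND = {
    "polars": "polars",
    "dask": "pandas",
    "duckdb": "pyarrow",
    "pyspark": "pyarrow",
}


def resolve_eager_backend(lazy_backend, requested=None):
    # backend specified -> collect into it; otherwise use the documented default
    if requested is not None:
        return requested
    return DEFAULT_EAGER_BACKEND[lazy_backend]
```

So `.collect('pandas')` on a polars-backed frame resolves to pandas, while a bare `.collect()` resolves to the lazy backend's own default.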

@FBruzzesi (Member Author)

I am almost finished with the reimplementation. I have just one question left for now.
Since PandasLikeDataFrame also implements a .collect() method, should we return self for backend=None, or default to pandas?

@MarcoGorelli (Member)

I'd say self, so that from_native(df).lazy().collect() round-trips when starting from cuDF, right?

Comment on lines 140 to 143
return ArrowDataFrame(
    native_dataframe=pa.Table.from_batches(
        self._native_frame._collect_as_arrow()
    ),
FBruzzesi (Member Author):

I am defaulting to pandas instead of pyarrow because otherwise some tests would fail - specifically those that result in an empty dataframe, since pyarrow's from_batches raises because RecordBatches cannot be empty. Maybe once pyspark 4.0 is out this could be better integrated?

MarcoGorelli (Member):

this would technically be a breaking change - can we catch the exception and return an empty dataframe?

FBruzzesi (Member Author) commented Feb 1, 2025:

Sure, I will take a look. Notice that changing to pyarrow now is also a breaking change 🙈

MarcoGorelli (Member):

oh right we currently collect into pandas?

🙈

MarcoGorelli (Member):

we've documented pyspark support as "work-in-progress" so we may not need to feel too guilty about changing this 😄

FBruzzesi (Member Author):

yes correct 🙃

Comment on lines +155 to +156
df=pl.from_arrow( # type: ignore[arg-type]
pa.Table.from_batches(self._native_frame._collect_as_arrow())
FBruzzesi (Member Author):

Ritchie's answer on StackOverflow

@FBruzzesi (Member Author)

@MarcoGorelli I was expecting worse from CI after such overhaul 😂

@FBruzzesi FBruzzesi changed the title from "RFC, feat: LazyFrame.collect kwargs" to "feat: LazyFrame.collect with backend and **kwargs" Jan 30, 2025
@FBruzzesi FBruzzesi added the high priority Your PR will be reviewed very quickly if you address this label Jan 30, 2025
@FBruzzesi FBruzzesi mentioned this pull request Jan 31, 2025
Comment on lines +125 to +136
mapping = {
"pandas": Implementation.PANDAS,
"modin": Implementation.MODIN,
"cudf": Implementation.CUDF,
"pyarrow": Implementation.PYARROW,
"pyspark": Implementation.PYSPARK,
"polars": Implementation.POLARS,
"dask": Implementation.DASK,
"duckdb": Implementation.DUCKDB,
"ibis": Implementation.IBIS,
}
return mapping.get(backend_name, Implementation.UNKNOWN)
Member:

I was going to suggest this, after adding something similar in https://github.com/vega/altair/blob/94220be0115e8b13d2ebc686552edf68fd841a54/altair/datasets/_reader.py#L493-L510

Great to see it in narhwals 🎉

MarcoGorelli (Member):

😄 nice, you're one step ahead

minor comment but I couldn't help spotting the "narhwals" typo in there (line 509) 🙈

Member:

ooh well spotted, thanks @MarcoGorelli

MarcoGorelli (Member) left a review:

Amazing, thanks so much @FBruzzesi ! I pushed a little commit to simplify backend parsing, to default to PyArrow for PySpark, and to handle the (hopefully rare) empty input cases

I noticed something here regarding collecting to different backends: going from pyspark to pandas loses the timezone-awareness of the dtype, but for pyarrow it's preserved:

(Pdb) p self._native_frame
DataFrame[b: timestamp]
(Pdb) p self._native_frame.toPandas()
                    b
0 2020-01-01 12:34:56
(Pdb) p self._native_frame._collect_as_arrow()
[pyarrow.RecordBatch
b: timestamp[us, tz=UTC]
----
b: [2020-01-01 12:34:56.000000Z]]

This helped spot that in to_datetime we're not parsing timezone-naive formats as timezone-naive

To me this kinda confirms that PyArrow is probably a better default


FBruzzesi commented Feb 2, 2025

Thanks @MarcoGorelli for all the additional improvements! I am very excited to finally ship this one and enable collect kwargs ✨

@MarcoGorelli (Member)

thanks all! let's ship it then

@MarcoGorelli MarcoGorelli merged commit 8ca9422 into main Feb 2, 2025
23 checks passed
@MarcoGorelli MarcoGorelli deleted the feat/collect-kwargs branch February 2, 2025 13:29



Successfully merging this pull request may close these issues.

API: collect for lazy-only libraries

4 participants