feat: Allow spark-like backends in `.lazy(backend=...)` (#3032)
Conversation
narwhals/schema.py (Outdated)

```python
import pyarrow as pa
import sqlframe.base.types as sqlframe_types

from narwhals._spark_like.utils import SparkSession
```
If we are annotating `session: SparkSession` - it needs to be defined as a structural type and not imported from `_spark_like.utils` into the narwhals level.
I'm replying from my phone, so haven't checked what that type actually is yet.
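A structural type along those lines could be sketched as below. This is a hypothetical sketch: the single `createDataFrame` method is an assumed minimal surface, and narwhals' actual protocol may require more.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SparkSession(Protocol):
    """Structural stand-in: anything with a `createDataFrame` method.

    Matches session objects from either pyspark or sqlframe without
    importing either package (hypothetical sketch).
    """

    def createDataFrame(self, *args: Any, **kwargs: Any) -> Any: ...  # noqa: N802


class FakeSession:
    def createDataFrame(self, *args: Any, **kwargs: Any) -> Any:  # noqa: N802
        return list(args)


print(isinstance(FakeSession(), SparkSession))  # True
```

Because the protocol is structural, no runtime dependency on pyspark or sqlframe is needed to type-check or even `isinstance`-check a session.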
Right so this currently might produce confusing results.
`SparkSession` is from `sqlframe`:
narwhals/narwhals/_spark_like/utils.py (line 19 in 68d762a)
narwhals/narwhals/_spark_like/utils.py (line 29 in 68d762a)
If someone has ...

- Only `sqlframe` installed, this works with `sqlframe.base.session._BaseSession` and subclasses
- Both `sqlframe` + `pyspark` installed, doesn't work with `pyspark.sql.SparkSession` or `pyspark.sql.connect.session.SparkSession`
- Only `pyspark` installed, works with anything as the symbol shows as `Unknown`
- Neither installed, works with anything as the symbol shows as `Unknown`
As someone who hasn't got pyspark installed 😭 but has got sqlframe, this is what I see in conftest.py which has a similar case
I see, yet I am not sure what's the fix for this.
In `scan_{csv,parquet}` we have `**kwargs: Any`. I think having `session` might be more useful/explicit. Depending on how complex it is to support typing, let's see if it's worth having. I imagine that a protocol might be required. Is this the right assumption 🧐?
The easiest option is to type it as `session: Any | None = None` on the public stuff and then use `SparkSession` internally
I agree it is useful to have the parameter 🙂
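A minimal sketch of that split: permissive `Any` on the public signature, narrowed internally. The function body, names, and the `cast` call are illustrative assumptions, not the PR's actual code.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, cast

if TYPE_CHECKING:
    # Internal-only structural type; this import path is an assumption.
    from narwhals._spark_like.utils import SparkSession


def lazy(backend: Any | None = None, *, session: Any | None = None) -> str:
    # Public signature stays permissive (`Any`); narrow to the internal
    # protocol only where the session is actually used.
    spark_session = cast("SparkSession | None", session)
    return f"{backend}:{'session' if spark_session is not None else 'no-session'}"


print(lazy("sqlframe", session=object()))  # sqlframe:session
print(lazy("duckdb"))  # duckdb:no-session
```

The downside, as discussed below, is that `Any` gives users no signal that spark-like backends require a session, which is what the `@overload` discussion tries to fix.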
@dangotbanned I am going to shoot myself in the foot with this comment, so please, pretty please, consider it as a follow-up if this is correct. Wouldn't the following be the proper typing?
```python
if TYPE_CHECKING:
    from pyspark.sql import SparkSession as PySparkSession
    from pyspark.sql.connect.session import SparkSession as PySparkConnectSession
    from sqlframe.base.session import _BaseSession

    from narwhals._typing import (
        Dask,
        DuckDB,
        EagerAllowed,
        Ibis,
        IntoBackend,
        Polars,
        PySpark,
        PySparkConnect,
        SQLFrame,
    )

    SQLFrameSession = _BaseSession[Any, Any, Any, Any, Any, Any, Any]


@overload
def lazy(
    self,
    backend: IntoBackend[Polars | DuckDB | Ibis | Dask],
    *,
    session: None = None,
) -> LazyFrame[Any]: ...
@overload
def lazy(
    self,
    backend: IntoBackend[PySpark],
    *,
    session: PySparkSession,
) -> LazyFrame[Any]: ...
@overload
def lazy(
    self,
    backend: IntoBackend[PySparkConnect],
    *,
    session: PySparkConnectSession,
) -> LazyFrame[Any]: ...
@overload
def lazy(
    self,
    backend: IntoBackend[SQLFrame],
    *,
    session: SQLFrameSession,
) -> LazyFrame[Any]: ...
```
> am going to shoot myself in the foot with this comment,

😂

> Wouldn't the following be the proper typing:
There are a couple of issues
```
>>> mypy
Found 84 errors in 35 files (checked 429 source files)

>>> pyright
78 errors, 0 warnings, 0 informations
```

Part of it is what I mentioned in (#3032 (comment)) - which we can totally solve later (#3016 (comment))
But there are also overloads missing (e.g. `df.lazy()`)
A future problem would be that using `IntoBackend` (a union with `ModuleType`) would create incompatible overlapping overloads if we ever want to refine `LazyFrame[Any]`.
I very much do! 😄 (#3016 (comment))
my suggestion was definitely partial in terms of `@overload` coverage. I will take a look later today, but I am also happy to keep it outside this PR
Note
TL;DR: It'll probably double the size of the PR
@FBruzzesi I 100% want `@overload`s here, but since we need to have them on all of:

- `nw.DataFrame.lazy`
- `nw.stable.v1.DataFrame.lazy`
- `nw.stable.v2.DataFrame.lazy`

And `backend`, `session` have defaults - I'd expect the diff is gonna grow a lot
This is somewhat of a starting point:

```python
class SparkSession(Protocol):
    def createDataFrame(self, *args: Any, **kwds: Any) -> Any: ...  # noqa: N802


class DataFrame(BaseFrame[DataFrameT]):
    @overload
    def lazy(self) -> LazyFrame[Any]: ...
    @overload
    def lazy(self, backend: Polars | Dask | DuckDB | Ibis) -> LazyFrame[Any]: ...
    @overload
    def lazy(
        self, backend: Polars | Dask | DuckDB | Ibis, *, session: None = ...
    ) -> LazyFrame[Any]: ...
    @overload
    def lazy(self, backend: SparkLike, *, session: SparkSession) -> LazyFrame[Any]: ...
    @overload
    def lazy(
        self, backend: IntoBackend[LazyAllowed], *, session: SparkSession | None
    ) -> LazyFrame[Any]: ...
    def lazy(
        self,
        backend: IntoBackend[LazyAllowed] | None = None,
        *,
        session: SparkSession | None = None,
    ) -> LazyFrame[Any]: ...
```

But still has loads of issues 😭
It doesn't catch this:

narwhals/tests/frame/lazy_test.py (lines 105 to 106 in 2812463)

But it would catch this:

```python
df.lazy(backend=backend)
```

This test required adding the last overload, but really `(IntoBackend[LazyAllowed], session: SparkSession | None)` shouldn't be allowed from a typing perspective:
narwhals/tests/frame/lazy_test.py (lines 80 to 89 in 2812463)
And everything that follows inside the method now needs updating to use this new `SparkSession` protocol for `session`:

narwhals/narwhals/dataframe.py (lines 800 to 805 in 2812463)
Thanks @dangotbanned - I had a long day already but let me try to reply to #3032 (comment). Isn't what you are proposing the same as using:

```python
from sqlframe.duckdb import DuckDBSession

s1 = DuckDBSession.builder.getOrCreate()
s2 = DuckDBSession.builder.getOrCreate()
s1 is s2
# True
```

I am not sure what the "omit" part in ... means. Where should we omit it? My understanding is that after we create it once, the same will be used afterwards. In case ... I would argue that ... Also I am not following on why we would want to raise an exception:
It seems I did quite a bad job explaining in (#3032 (comment)) 😅 The difference between ...

Looking back, what I suggested might be a more confusing public API:

```python
import narwhals as nw
from sqlframe.duckdb import DuckDBSession

data = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
df = nw.DataFrame.from_dict(data, backend="pandas")

df.lazy("sqlframe")  # error
df.lazy("sqlframe", session=DuckDBSession())  # ok
df.lazy("sqlframe")  # ok
```

However, here I was thinking it would be nice to resolve ...

Examples:

- narwhals/narwhals/_arrow/dataframe.py (lines 579 to 584 in 4c86fbe)
- narwhals/narwhals/_pandas_like/dataframe.py (lines 826 to 831 in 4c86fbe)
- narwhals/narwhals/_polars/dataframe.py (lines 497 to 502 in 4c86fbe)

But maybe we could just do this instead?

```python
def ensure_session(session: SparkSession | None) -> SparkSession:
    if session:
        return session
    msg = "Spark like backends require `session` to be not None."
    raise ValueError(msg)
```

I know it isn't part of the PR now, but (#3032 (comment)) would be introducing a ...
Thanks for clarifying @dangotbanned (#3032 (comment))! I somehow thought it was only related to our test suite. But now I get it.

Honestly I don't have enough experience with spark-like sessions to know if, while the session is the same, some settings can be changed (e.g. ...).

For SQLFrame I was about to ask: "which session should we even check?", but then I got very surprised to see the following:

```python
from sqlframe.standalone import StandaloneSession
from sqlframe.duckdb import DuckDBSession
from sqlframe.spark import SparkSession

StandaloneSession() is DuckDBSession() is SparkSession()
# True
```

Honestly I am not sure if this is good or bad or even expected 🙈

For now I would hold back: for me it falls into one of those cases in which a user can branch based on the implementation. For example:

```python
# Assume I have some lazyframe
some_lazy_frame = ...
nw_lf = nw.from_native(some_lazy_frame)

# Now I receive a pandas object
pd_frame = ...
nw_pd = nw.from_native(pd_frame, eager_only=True)

# and we want to align to the lazy frame:
nw_pd.lazy(
    backend=nw_lf.implementation,
    session=getattr(some_lazy_frame, "sparkSession", None),
)
```

(based on ...)

If we were to have a function to align the backend of multiple dataframes (see: #2193), then we can definitely pick up the current session ourselves
dangotbanned
left a comment
Thanks @FBruzzesi!
I guess I'll call it here on the review 😄
I've left myself a TODO, which I'll try to investigate tomorrow
```python
_LazyFrameCollectImpl: TypeAlias = Literal[_PandasImpl, _PolarsImpl, _ArrowImpl]  # noqa: PYI047
_DataFrameLazyImpl: TypeAlias = Literal[_PolarsImpl, _DaskImpl, _DuckDBImpl, _IbisImpl]  # noqa: PYI047
```

Farewell questionably-named alias 🥳
Thanks for the detailed review @dangotbanned. I solved all the suggestions in eda88aa. Regarding #3032 (comment), my suggestion was definitely partial in terms of `@overload` coverage.
```python
def test_lazy(constructor_eager: ConstructorEager, backend: LazyAllowed) -> None:
    impl = Implementation.from_backend(backend)
    pytest.importorskip(impl.name.lower())

    is_spark_connect = os.environ.get("SPARK_CONNECT", None)
    if is_spark_connect is not None and impl.is_pyspark():  # pragma: no cover
        # Workaround for impl.name.lower() being "pyspark[connect]" for
        # Implementation.PYSPARK_CONNECT, which is never installed.
        impl = Implementation.PYSPARK_CONNECT
```
Note to self, we need an `import_or_skip_backend` test util
Actually, this might be better as part of a `lazy_backend` and/or `lazy_implementation` fixture
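That could look something like the sketch below. Both `import_or_skip_backend` and `lazy_backend` are hypothetical names from this note, not utilities that exist in the PR, and the parametrized backend list is an arbitrary example.

```python
import pytest


def import_or_skip_backend(backend: str) -> str:
    """Skip the current test when the backend's package is not importable.

    Hypothetical helper; real code would map backend names like
    "pyspark[connect]" to their importable module first.
    """
    module = "pyspark" if backend.startswith("pyspark") else backend
    pytest.importorskip(module)
    return backend


@pytest.fixture(params=["polars", "duckdb", "sqlframe"])
def lazy_backend(request: pytest.FixtureRequest) -> str:
    # Each test using this fixture runs once per installed backend
    # and is skipped for the missing ones.
    return import_or_skip_backend(request.param)
```

Centralizing the skip logic in a fixture would also absorb the `SPARK_CONNECT` workaround from the test above into one place.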
dangotbanned
left a comment
Thanks @FBruzzesi!
I'm happy with where things are at, we can tackle (#3032 (comment)) after (#3016)
I think the CI failures were unrelated - but I trust your judgement 🥳

What type of PR is this? (check all applicable)

Related issues

- `DataFrame().lazy({'pyspark','sqlframe','pyspark[connect]'})` #3023

Checklist

If you have comments or can explain your changes, please do so below

TODO: