
feat: Add support for spark connect#2417

Merged
MarcoGorelli merged 22 commits into main from feat/spark-connect
Apr 28, 2025
Conversation

@FBruzzesi
Member

@FBruzzesi FBruzzesi commented Apr 22, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Opening as draft as:

  • datetime tests are failing (infer format and timezone aware)
  • cast to struct type is failing (even though the conversion internally seems fine - needs more investigation)
  • some naming/choices might need discussion (see code comments)
  • very unsure if the CI will work as expected

@FBruzzesi FBruzzesi added enhancement New feature or request pyspark-connect labels Apr 22, 2025
@dangotbanned dangotbanned added the pyspark Issue is related to pyspark backend label Apr 22, 2025
@dangotbanned
Member

Just adding the original tag as well to be careful 🙂 (#2417 (comment))

Comment on lines 182 to 188
elif (
    self._implementation is Implementation.PYSPARK_CONNECT
    and self._backend_version < (4,)
):
    import pyarrow as pa  # ignore-banned-import

    return pa.Table.from_pandas(self.native.toPandas())
Member Author

This...got me a laugh and a tear!


  def head(self, n: int) -> Self:
-     return self._with_native(self.native.limit(num=n))
+     return self._with_native(self.native.limit(n))
Member Author

No argument named num 🤷🏼‍♀️

@FBruzzesi
Member Author

Just adding the original tag as well to be careful 🙂 (#2417 (comment))

Thanks @dangotbanned - as I might need a few iterations before getting the CI right (just to make it start), I didn't want to trigger everything all the time 🙈
But yes, before making a final call on this PR, we should definitely run all the jobs

@dangotbanned
Member

dangotbanned commented Apr 22, 2025

#2417 (comment)

No worries @FBruzzesi - trust your judgement here 😄

Comment on lines 68 to 73
- name: Download Spark
  run: |
    curl -sLO https://downloads.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
    tar xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz
    echo "SPARK_HOME=$PWD/spark-${SPARK_VERSION}-bin-hadoop3" >> $GITHUB_ENV
    echo "$PWD/spark-${SPARK_VERSION}-bin-hadoop3/bin" >> $GITHUB_PATH
Member Author

So apparently this is not a good idea.

Searching for start-connect-server.sh across all public GitHub Actions (YAML files in the .github folder), here is what we get: search.

That does not look great to me, since all the repos are apache/spark forks 🥲 with one exception, which however uses that string to skip the Spark Connect test.

I will sleep on this and keep looking for alternatives 🤔

Member Author

Exciting news! It seems to work with the latest commit (05e0bcc). It just takes a while to set up

@dangotbanned
Member

Just wanna throw these into the mix again 😄

So far this PR hasn't seemed to add much branching within methods.
@FBruzzesi if you find there's more of a need for that later - then there could be more of a benefit to splitting out the classes

@FBruzzesi FBruzzesi marked this pull request as ready for review April 25, 2025 07:05
  order_by_cols = [self._F.asc_nulls_last(_input)]

- window = self._Window().orderBy(order_by_cols)
+ window = self._Window().partitionBy(self._F.lit(1)).orderBy(order_by_cols)
Member Author

Left over from #2429

.drop(index_col_name)
)

return cast("PySparkDataFrame", frame)
Member Author

@dangotbanned forgive me for my crazy life 😂

Member

Hopefully fixed in (f297ac4)

Member Author

Nice thanks! That's one more trick that I need to add to my belt!

Member

@dangotbanned dangotbanned Apr 27, 2025

Same trick as the rest of spark_like typing 😉

You need to pick one of the imports for TYPE_CHECKING, which goes first. Then the real code goes afterwards.
A type checker will view the non-TYPE_CHECKING branch as unreachable - since it treats TYPE_CHECKING as always True - whereas at runtime it is always False

@property
def _F(self):  # type: ignore[no-untyped-def] # noqa: ANN202, N802
    if TYPE_CHECKING:
        from sqlframe.base import functions

        return functions
    else:
        return import_functions(self._implementation)

@property
def _native_dtypes(self):  # type: ignore[no-untyped-def] # noqa: ANN202
    if TYPE_CHECKING:
        from sqlframe.base import types

        return types
    else:
        return import_native_dtypes(self._implementation)

@property
def _Window(self) -> type[Window]:  # noqa: N802
    if TYPE_CHECKING:
        from sqlframe.base.window import Window

        return Window
    else:
        return import_window(self._implementation)

def import_functions(implementation: Implementation, /) -> ModuleType:
    if implementation is Implementation.PYSPARK:
        from pyspark.sql import functions

        return functions
    from sqlframe.base.session import _BaseSession

    return import_module(f"sqlframe.{_BaseSession().execution_dialect_name}.functions")

def import_native_dtypes(implementation: Implementation, /) -> ModuleType:
    if implementation is Implementation.PYSPARK:
        from pyspark.sql import types

        return types
    from sqlframe.base.session import _BaseSession

    return import_module(f"sqlframe.{_BaseSession().execution_dialect_name}.types")

def import_window(implementation: Implementation, /) -> type[Any]:
    if implementation is Implementation.PYSPARK:
        from pyspark.sql import Window

        return Window
    from sqlframe.base.session import _BaseSession

    return import_module(
        f"sqlframe.{_BaseSession().execution_dialect_name}.window"
    ).Window
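To see why the trick works outside this codebase, here is a minimal, hypothetical sketch using only the standard library (the `Backend` class and `_json` property are made up for illustration; they are not part of narwhals):

```python
from importlib import import_module
from typing import TYPE_CHECKING

class Backend:
    @property
    def _json(self):  # hypothetical property, mirroring the `_F` pattern
        if TYPE_CHECKING:
            import json

            # A type checker analyzes this branch, so `backend._json`
            # is typed as the `json` module.
            return json
        else:
            # At runtime TYPE_CHECKING is False, so the dynamic import runs.
            return import_module("json")

backend = Backend()
print(backend._json.dumps({"ok": True}))
```

The type checker gets a concrete module to resolve attributes against, while the runtime path stays fully dynamic.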

@FBruzzesi FBruzzesi requested a review from EdAbati April 25, 2025 07:11
Member

@MarcoGorelli MarcoGorelli left a comment

thanks @FBruzzesi !

Member

@dangotbanned dangotbanned left a comment

Thanks @FBruzzesi!
I was summoned (#2417 (comment)) for typing - but spotted a possible performance regression that would impact other spark-like backends

Comment on lines +153 to +157
for key, value in nw_schema.items():
    try:
        native_dtype = narwhals_to_native_dtype(value, self._version)
    except Exception as exc:  # noqa: BLE001,PERF203
        native_spark_dtype = native_schema[key].dataType  # type: ignore[index]
Member

@dangotbanned dangotbanned Apr 26, 2025

@FBruzzesi Could you address this performance issue (PERF203) or leave a comment on why it is unavoidable please?

I'd personally try to also avoid (BLE001) and using exception handling entirely - but understand they were already here 🙂

Member Author

In principle we could first check whether self.collect_schema() contains any unknown type. If it doesn't, then it should be possible to call .to_arrow(); if it does, then this workaround is needed
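That pre-check idea could look something like the following hedged sketch. `to_native`, the dtype names, and the `"fallback"` value are toy stand-ins, not the real narwhals converters:

```python
UNSUPPORTED = {"unknown"}

def to_native(dtype: str) -> str:
    # Toy stand-in for narwhals_to_native_dtype: fails on unsupported dtypes.
    if dtype in UNSUPPORTED:
        raise TypeError(f"cannot convert {dtype!r}")
    return dtype.upper()

def convert_schema(nw_schema: dict[str, str]) -> dict[str, str]:
    if UNSUPPORTED.isdisjoint(nw_schema.values()):
        # Fast path: no try/except inside the loop, so no PERF203.
        return {key: to_native(value) for key, value in nw_schema.items()}
    # Slow path: only reached when an unsupported dtype is present,
    # falling back per column, as the original code does unconditionally.
    native: dict[str, str] = {}
    for key, value in nw_schema.items():
        try:
            native[key] = to_native(value)
        except TypeError:
            native[key] = "fallback"
    return native

print(convert_schema({"a": "int64", "b": "unknown"}))
```

The per-iteration exception handling then only pays its cost on schemas that actually need the workaround.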

Member Author

@dangotbanned sorry for the direct ping - let's figure out what to write - to me the explanation in ln158 is quite good.

I will need your approval to merge 😎

Member

Hey sorry I lost this @FBruzzesi

I started trying to address it, but couldn't get the tests working locally 😔

Collaborator

@EdAbati EdAbati left a comment

Thank you for doing this! Looks great to me, just left minor comments :)

Comment on lines +77 to +82
- name: Cache Spark
  id: cache-spark
  uses: actions/cache@v4
  with:
    path: /opt/spark
    key: spark-${{ env.SPARK_VERSION }}-bin-hadoop3
Collaborator

nice

Comment on lines 67 to 68
java-version: '11'
distribution: 'temurin'
Collaborator

@EdAbati EdAbati Apr 26, 2025

I have ~0 knowledge about Java. Where does this come from? Should we add the source in a comment?
I can see in their CI they use this instead:
https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/.github/workflows/build_python_connect.yml#L54-L58

Member Author

I have ~0 knowledge about Java.

That makes 2 of us

Where does this come from? Should we add the source in a comment?

I can try to bump the version or follow what they do

Collaborator

@EdAbati EdAbati left a comment

Nice! Thank you :)

@MarcoGorelli
Member

thanks @FBruzzesi ! going to go ahead and ship this

happy to discuss separately whether the PERF203 ruff code can be avoided, but I don't think it's a blocker - it was already there to begin with (probably because I added it 😳), so it's not introduced by this PR

@MarcoGorelli MarcoGorelli merged commit a260885 into main Apr 28, 2025
32 checks passed
@MarcoGorelli MarcoGorelli deleted the feat/spark-connect branch April 28, 2025 08:21
@MarcoGorelli
Member

quick note - indeed, by deleting the whole merge commit message apart from the "Co-authored-by" part, we don't make the git log unnecessarily long, and we preserve coauthors (GitHub now shows "FBruzzesi and dangotbanned" for this commit). cool, gonna add something about that to the contributing guide


Labels

enhancement New feature or request pyspark Issue is related to pyspark backend pyspark-connect

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enh]: Support for Spark Connect

4 participants