feat: Add support for spark connect #2417
Conversation
Just adding the original tag as well to be careful 🙂 (#2417 (comment))
narwhals/_spark_like/dataframe.py
```python
elif (
    self._implementation is Implementation.PYSPARK_CONNECT
    and self._backend_version < (4,)
):
    import pyarrow as pa  # ignore-banned-import

    return pa.Table.from_pandas(self.native.toPandas())
```
This...got me a laugh and a tear!
```diff
 def head(self, n: int) -> Self:
-    return self._with_native(self.native.limit(num=n))
+    return self._with_native(self.native.limit(n))
```
No argument named num 🤷🏼‍♀️
Thanks @dangotbanned, as I might need a few iterations before getting the CI right (just to make it start), I didn't want to trigger everything all the time 🙈
No worries @FBruzzesi - trust your judgement here 😄
.github/workflows/pytest-pyspark.yml
```yaml
- name: Download Spark
  run: |
    curl -sLO https://downloads.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
    tar xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz
    echo "SPARK_HOME=$PWD/spark-${SPARK_VERSION}-bin-hadoop3" >> $GITHUB_ENV
    echo "$PWD/spark-${SPARK_VERSION}-bin-hadoop3/bin" >> $GITHUB_PATH
```
So apparently this is not a good idea.
Searching for start-connect-server.sh across all public GitHub Actions (yaml files in the .github folder), here is what we get: search.
Which to me does not really look great, since all the repos are apache/spark forks 🥲 - with one exception, which however uses that string to skip the spark connect test.
I will sleep on this and keep looking for alternatives 🤔
Exciting news! It seems to work with the latest commit (05e0bcc). It just takes a while to set up
Just wanna throw these into the mix again 😄
So far this PR hasn't seemed to add much branching within methods.
```diff
 order_by_cols = [self._F.asc_nulls_last(_input)]

-window = self._Window().orderBy(order_by_cols)
+window = self._Window().partitionBy(self._F.lit(1)).orderBy(order_by_cols)
```
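Partitioning by a constant literal puts the whole frame into a single window partition before ordering. As a plain-Python analogy (itertools.groupby standing in for `Window.partitionBy`, which it is not):

```python
from itertools import groupby

# Grouping every row under the same constant key mimics partitionBy(lit(1)):
# all rows fall into one partition, which is then ordered as a single unit.
rows = [3, 1, 2]
partitions = {key: sorted(group) for key, group in groupby(rows, key=lambda _: 1)}
print(partitions)  # {1: [1, 2, 3]}
```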
tests/conftest.py
```python
    .drop(index_col_name)
)

return cast("PySparkDataFrame", frame)
```
Nice thanks! That's one more trick that I need to add to my belt!
Same trick as the rest of spark_like typing 😉
You need to pick one of the imports for TYPE_CHECKING, which goes first. Then the real code goes afterwards.
A type checker will view the non-TYPE_CHECKING branches as unreachable - since it "thinks" TYPE_CHECKING is always True - whereas at runtime it is always False.
narwhals/narwhals/_spark_like/dataframe.py
Lines 73 to 98 in 9122aef
narwhals/narwhals/_spark_like/utils.py
Lines 243 to 272 in 9122aef
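The TYPE_CHECKING trick described above can be sketched in isolation - `as_pyspark` is a hypothetical helper for illustration, not the narwhals code, and the snippet runs without pyspark installed:

```python
from typing import TYPE_CHECKING, cast

if TYPE_CHECKING:
    # The type checker takes this branch - it treats TYPE_CHECKING as
    # always True - so SparkDataFrame resolves to pyspark's type for
    # static analysis only.
    from pyspark.sql import DataFrame as SparkDataFrame


def as_pyspark(frame: object) -> "SparkDataFrame":
    # At runtime TYPE_CHECKING is False: the import above never executes,
    # and cast() is a no-op that returns `frame` unchanged.
    return cast("SparkDataFrame", frame)


result = as_pyspark({"a": [1, 2]})
print(type(result).__name__)  # dict
```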
dangotbanned
left a comment
Thanks @FBruzzesi!
I was summoned (#2417 (comment)) for typing - but spotted a possible performance regression that would impact other spark-like backends
```python
for key, value in nw_schema.items():
    try:
        native_dtype = narwhals_to_native_dtype(value, self._version)
    except Exception as exc:  # noqa: BLE001,PERF203
        native_spark_dtype = native_schema[key].dataType  # type: ignore[index]
```
@FBruzzesi Could you address this performance issue (PERF203) or leave a comment on why it is unavoidable please?
I'd personally also try to avoid (BLE001) and exception handling entirely - but I understand they were already here 🙂
In principle we could first check whether self.collect_schema() has any unknown type. If it doesn't, then it should be possible to call .to_arrow() - if it does, then this workaround is needed.
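That suggestion can be sketched like this - `SUPPORTED` and `convert_schema` are illustrative placeholders, not the narwhals API; the point is that a single upfront check replaces the per-iteration try/except that PERF203 flags:

```python
# Check the schema once for unknown dtypes; only take the fallback path
# when needed, instead of catching exceptions inside the loop.
SUPPORTED = {"Int64": "bigint", "String": "string"}


def convert_schema(nw_schema: dict) -> dict:
    if all(dtype in SUPPORTED for dtype in nw_schema.values()):
        # Happy path: every dtype converts, no exception handling at all.
        return {key: SUPPORTED[dtype] for key, dtype in nw_schema.items()}
    # Fallback path: keep unknown dtypes as-is (stands in for reading the
    # native spark dtype in the real workaround).
    return {key: SUPPORTED.get(dtype, dtype) for key, dtype in nw_schema.items()}


print(convert_schema({"a": "Int64", "b": "String"}))  # {'a': 'bigint', 'b': 'string'}
```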
@dangotbanned sorry for the direct ping - let's figure out what to write - to me the explanation in ln158 is quite good.
I will need your approval to merge 😎
Hey sorry I lost this @FBruzzesi
I started trying to address it, but couldn't get the tests working locally 😔
EdAbati
left a comment
Thank you for doing this! Looks great to me, just left minor comments :)
```yaml
- name: Cache Spark
  id: cache-spark
  uses: actions/cache@v4
  with:
    path: /opt/spark
    key: spark-${{ env.SPARK_VERSION }}-bin-hadoop3
```
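For the cache to actually save work, the download step would need to be skipped on a cache hit - a hedged sketch of one way to wire that up (this gating is an assumption, not something shown in the PR), using the `cache-hit` output that `actions/cache` exposes:

```yaml
# Hypothetical follow-up step: only download Spark when the cache missed.
- name: Download Spark
  if: steps.cache-spark.outputs.cache-hit != 'true'
  run: |
    curl -sLO https://downloads.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz
```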
.github/workflows/pytest-pyspark.yml
```yaml
java-version: '11'
distribution: 'temurin'
```
I have ~0 knowledge about Java. Where does this come from? Should we add the source in a comment?
I can see in their CI they use this instead:
https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/.github/workflows/build_python_connect.yml#L54-L58
> I have ~0 knowledge about Java.

That makes 2 of us

> Where does this come from? Should we add the source in a comment?

I can try to bump the version or follow what they do
Thanks @FBruzzesi! Going to go ahead and ship this - happy to discuss whether the
Quick note - indeed, by deleting the whole merge commit message apart from the "Co-authored-by" part, we don't make


What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
Opening as draft as: