[SPARK-41114] [CONNECT] [PYTHON] [FOLLOW-UP] Python Client support for local data #38803
```diff
@@ -17,13 +17,15 @@
 from threading import RLock
 from typing import Optional, Any, Union, Dict, cast, overload

 import pandas as pd

 import pyspark.sql.types
 from pyspark.sql.connect.client import SparkConnectClient
 from pyspark.sql.connect.dataframe import DataFrame
 from pyspark.sql.connect.plan import SQL, Range
 from pyspark.sql.connect.readwriter import DataFrameReader
 from pyspark.sql.utils import to_str
 from . import plan
 from ._typing import OptionalPrimitiveType

@@ -205,6 +207,34 @@ def __init__(self, connectionString: str, userId: Optional[str] = None):
         # Create the reader
         self.read = DataFrameReader(self)

     def createDataFrame(self, data: "pd.DataFrame") -> "DataFrame":
```
**Member:** Actually, the implementation here doesn't match the existing PySpark one. By default, the Arrow message conversion (more specifically in https://github.com/apache/spark/pull/38659/files#diff-d630cc4be6c65a3c3f7d6dbfe990f99ba992ccc26d9c3aaf6cfe46e163cb7389R514-R521) has to happen in an RDD so we can parallelize it. For a bit of history: PySpark added the initial version with RDDs first, and later added this local relation as an optimization for small datasets (see also #36683).
**Member:** I am fine with the current approach, but the main problems here are that (1) we can't stream the input, and (2) it will have a size limit (likely 4 KB). cc @hvanhovell FYI
**Contributor (Author):** It is impossible to match that implementation: in PySpark, a first serialization already happens in order to parallelize, i.e. to pass the input DataFrame to the executors. In our case, we have to serialize the data just to send it to Spark at all. That said, you're right that this currently does not support streaming the local data from the client. But the limit is not 4 KB; it is probably whatever the max message size of gRPC is, so in the megabytes. I think we need to add client-side streaming APIs at some point, but I'd like to defer that for a bit.
**Contributor:** For a large […]
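For context, here is a minimal sketch (not the PR's code, assuming `pyarrow` is available) of what serializing the local data on the client amounts to: the whole pandas DataFrame becomes one Arrow IPC payload, which is why the effective bound is the gRPC max message size rather than 4 KB.

```python
import pandas as pd
import pyarrow as pa

def serialize_local_relation(pdf: pd.DataFrame) -> bytes:
    """Serialize a pandas DataFrame into a single Arrow IPC stream."""
    table = pa.Table.from_pandas(pdf)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()

pdf = pd.DataFrame({"a": range(1000)})
payload = serialize_local_relation(pdf)
# The payload travels in one gRPC message, so it is bounded by the
# channel's max message size (a few MB by default), not by 4 KB.
print(len(payload))
```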
```python
        """
        Creates a :class:`DataFrame` from a :class:`pandas.DataFrame`.

        .. versionadded:: 3.4.0

        Parameters
        ----------
        data : :class:`pandas.DataFrame`

        Returns
        -------
        :class:`DataFrame`

        Examples
        --------
        >>> import pandas
        >>> pdf = pandas.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
        >>> self.connect.createDataFrame(pdf).collect()
        [Row(a=1, b='a'), Row(a=2, b='b'), Row(a=3, b='c')]
        """
        assert data is not None
        if len(data) == 0:
            raise ValueError("Input data cannot be empty")
```
**Contributor:** IIRC, […]
```python
        return DataFrame.withPlan(plan.LocalRelation(data), self)

    @property
    def client(self) -> "SparkConnectClient":
        """
```
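A quick usage sketch, mirroring the docstring example above; `spark` is a hypothetical name for a connected Spark Connect session, not something defined in this diff.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)  # builds a LocalRelation plan on the client
print(df.collect())
# [Row(a=1, b='a'), Row(a=2, b='b'), Row(a=3, b='c')]
```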
**Reviewer:** I am not familiar with this code, so a question: is it possible for an empty pandas DataFrame to be used here (e.g., one that has a schema but no data)? If so, maybe add a test case?
**Author:** I'll add a test for that, thanks for the proposal!
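A hypothetical sketch of such a test (the `connect_session` fixture name is a placeholder, not the PR's actual test fixture): a pandas DataFrame with typed columns but zero rows hits the `len(data) == 0` check and raises `ValueError` under the current implementation.

```python
import pandas as pd
import pytest

def test_create_dataframe_with_empty_input(connect_session):
    # A DataFrame with a schema (two typed columns) but no rows.
    empty_pdf = pd.DataFrame(
        {"a": pd.Series(dtype="int64"), "b": pd.Series(dtype="object")}
    )
    assert len(empty_pdf) == 0
    # The current implementation rejects empty input outright.
    with pytest.raises(ValueError):
        connect_session.createDataFrame(empty_pdf)
```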