[SPARK-39301][SQL][PYTHON] Leverage LocalRelation and respect Arrow batch size in createDataFrame with Arrow optimization #36683
Conversation
Force-pushed from c943c4d to 1f25a6b
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
cc @ueshin @viirya @BryanCutler FYI
Intentionally, I used an Iterator to avoid Py4J copying the Array to the Python driver side.
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
Wow, thanks for the reviews, guys. Let me think a bit more and push some changes soon.
Force-pushed from 449e7c3 to 357d9b2
The reason for doing this is to avoid having to reconfigure spark.rpc.message.maxSize. When the batch is too large, it throws an exception complaining that spark.rpc.message.maxSize is too small.
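For context, a hedged illustration of the workaround users previously needed; the value below is only an example (spark.rpc.message.maxSize is specified in MiB and defaults to 128):

```python
from pyspark.sql import SparkSession

# Example value only: raise the RPC message limit before creating a DataFrame
# from a large local pandas DataFrame.
spark = (
    SparkSession.builder
    .config("spark.rpc.message.maxSize", "512")  # MiB
    .getOrCreate()
)
```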
I thought this was to control how many partitions were in the RDD? Each partition could have multiple batches, and should probably be capped at arrowMaxRecordsPerBatch, but since it was coming from a local pandas DataFrame already in memory, that didn't seem to be a big deal.
Yeah, that's true... the perf diff seems trivial in any event, and it works around the spark.rpc.message.maxSize issue.
I believe it was like this to create the same number of partitions as when Arrow is disabled, although that might have changed since. If the DataFrame is split with arrowMaxRecordsPerBatch and a user wanted to create a certain number of partitions, would they then have to look at the size of the input and adjust arrowMaxRecordsPerBatch accordingly?
Yeah, that's true .. but I wonder if the default number of partitions is something we should consider, given that it wasn't configurable before, and SparkSession.createDataFrame does not expose the number of partitions either.
If they really need to, users might want to create an RDD with explicit parallelism .. we don't support this now though (see also #29719).
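For illustration, a hedged sketch of such a workaround today, repartitioning after creation (assumes an existing `spark` session; the numbers are arbitrary):

```python
import pandas as pd

pdf = pd.DataFrame({"a": range(1_000_000)})

# Explicitly choose the partition count after the DataFrame is created.
df = spark.createDataFrame(pdf).repartition(16)
print(df.rdd.getNumPartitions())  # 16
```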
BTW, just to clarify further: when the pandas DataFrame is small (below the threshold), the number of partitions remains the same (configured by spark.sql.leafNodeDefaultParallelism, which falls back to sparkContext.defaultParallelism if not set).
The number of partitions only differs when the input DataFrame is large, which I think makes more sense in general ..
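A small hedged check of the behavior described above (assumes an existing `spark` session; the printed value depends on the session configuration and the input size):

```python
import pandas as pd

small_pdf = pd.DataFrame({"a": [1, 2, 3]})

# Below the threshold, the partition count still follows the existing defaults.
print(spark.createDataFrame(small_pdf).rdd.getNumPartitions())
```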
Force-pushed from 7bc08ef to a88eef8
This PR is ready for a look now. TL;DR:
Maybe 32 MB? Don't have a strong preference.
Is this the max size of each batch or all batches together?
It's all together .. so pretty small.
Gentle ping for a review :-). I know it has some trade-offs, but I believe this addresses more common cases and benefits more users.
BryanCutler left a comment:
Looks like it's a good optimization for smaller files, thanks @HyukjinKwon !
cc @mengxr and @WeichenXu123 in case you guys have some comments.

Let me merge this in a few days ... assuming that we're all good. Hopefully my benchmark is good enough.

Rebased

Let me get this in. It's the early stage of Spark 3.4, so it should be good timing to merge such changes. Merged to master.
…ution.arrow.maxRecordsPerBatch

### What changes were proposed in this pull request?

This PR fixes the regression introduced by #36683.

```python
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0)
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1)
spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
```

**Before**

```
/.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false.
  range() arg 3 must not be zero
  warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame
    return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow
    pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step))
ValueError: range() arg 3 must not be zero
```

```
Empty DataFrame
Columns: [a]
Index: []
```

**After**

```
     a
0  123
```

```
     a
0  123
```

### Why are the changes needed?

It fixes a regression. This is a documented behaviour. It should be backported to branch-3.4 and branch-3.5.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a regression as described above.

### How was this patch tested?

Unittest was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45132 from HyukjinKwon/SPARK-47068.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
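A minimal sketch of the kind of guard this fix needs; this is an illustration under assumptions (the helper and its names are hypothetical), not the actual patch:

```python
import pandas as pd

def slice_pdf(pdf: pd.DataFrame, max_records_per_batch: int):
    # Clamp the slice step so a non-positive maxRecordsPerBatch falls back to a
    # single slice instead of passing 0 to range().
    step = max_records_per_batch if max_records_per_batch > 0 else len(pdf)
    step = max(step, 1)  # len(pdf) can be 0; range() step must not be zero
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]

print(len(slice_pdf(pd.DataFrame({"a": [123]}), 0)))  # 1 slice, not an error
```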
What changes were proposed in this pull request?
This PR proposes to use LocalRelation instead of LogicalRDD when creating a (small) DataFrame with Arrow optimization, which passes the data as local data on the driver side (consistent with the Scala code path). Namely:
Before
After
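A rough, hedged way to inspect which plan is used (assumes a running `spark` session and an input below the threshold; the exact plan text varies by version):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3]})
df = spark.createDataFrame(pdf)

# With this change, a small input should show LocalRelation / LocalTableScan
# in the plans instead of a scan over an existing RDD.
df.explain(True)
```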
This is controlled by a new configuration, spark.sql.execution.arrow.localRelationThreshold, defaulting to 48MB. This default was picked based on the benchmark I ran below.

In addition, this PR also fixes createDataFrame to respect the spark.sql.execution.arrow.maxRecordsPerBatch configuration when creating Arrow batches. Previously, we divided the input pandas DataFrame by the default partition number, which forced users to set spark.rpc.message.maxSize when the input pandas DataFrame was too large. See the benchmark performed below.
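For reference, a hedged example of the two configurations involved, assuming the new threshold is settable as a runtime SQL conf; the values simply restate the stated defaults:

```python
# 48MB is the stated default of the new threshold; 10000 is the usual default
# of maxRecordsPerBatch. Adjust both for your workload.
spark.conf.set("spark.sql.execution.arrow.localRelationThreshold", "48MB")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000)
```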
Why are the changes needed?
We have some nice optimizations for LocalRelation (e.g., ConvertToLocalRelation). For example, the stats are fully known when you use LocalRelation. With LogicalRDD, many optimizations cannot be applied. In some cases (e.g., executeCollect), we can even avoid creating RDDs.

For respecting spark.sql.execution.arrow.maxRecordsPerBatch: 1. we can avoid forcing users to set spark.rpc.message.maxSize, and 2. I believe the configuration is supposed to be respected by every code path that creates Arrow batches, if possible.
Does this PR introduce any user-facing change?
No, it is an optimization. The number of partitions can be different, but that should be internal.
How was this patch tested?
Benchmark 1 (best cases)
Before
10.250491698582968
After
6.004616181055705
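A rough harness of the sort that could produce numbers like these; this is a hedged sketch, not the PR's actual benchmark (data size and repetition count are assumptions; assumes an existing `spark` session):

```python
import time

import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(1_000_000, 10),
                   columns=[f"c{i}" for i in range(10)])

start = time.time()
for _ in range(10):
    # Force execution so the conversion cost is actually measured.
    spark.createDataFrame(pdf).count()
print(time.time() - start)  # wall-clock time for 10 runs
```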
Benchmark 2 (worst cases)
Before
23MB: 1.02 seconds
45MB: 1.69 seconds
90MB: 2.38 seconds
175MB: 3.19 seconds
350MB: 6.10 seconds
2GB: 43.21 seconds
5GB: X (threw an exception saying to set 'spark.rpc.message.maxSize' higher)
After
23MB: 1.31 seconds (local collection is used)
45MB: 2.47 seconds (local collection is used)
90MB: 1.79 seconds
175MB: 3.22 seconds
350MB: 6.41 seconds
2GB: 47.12 seconds
5GB: 1.29 minutes
NOTE that the performance varies depending on network stability, and the numbers above are from the second run (not the average).