Commit ecaa495
committed
[SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance
## What changes were proposed in this pull request?
When executing `toPandas` with Arrow enabled, partitions that arrive in the JVM out-of-order must be buffered before they can be send to Python. This causes an excess of memory to be used in the driver JVM and increases the time it takes to complete because data must sit in the JVM waiting for preceding partitions to come in.
This change sends un-ordered partitions to Python as soon as they arrive in the JVM, followed by a list of partition indices so that Python can assemble the data in the correct order. This way, data is not buffered at the JVM and there is no waiting on particular partitions so performance will be increased.
Followup to #21546
## How was this patch tested?
Added new test with a large number of batches per partition, and test that forces a small delay in the first partition. These test that partitions are collected out-of-order and then are are put in the correct order in Python.
## Performance Tests - toPandas
Tests run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8
measured wall clock time to execute `toPandas()` and took the average best time of 5 runs/5 loops each.
Test code
```python
df = spark.range(1 << 25, numPartitions=32).toDF("id").withColumn("x1", rand()).withColumn("x2", rand()).withColumn("x3", rand()).withColumn("x4", rand())
for i in range(5):
start = time.time()
_ = df.toPandas()
elapsed = time.time() - start
```
Spark config
```
spark.driver.memory 5g
spark.executor.memory 5g
spark.driver.maxResultSize 2g
spark.sql.execution.arrow.enabled true
```
Current Master w/ Arrow stream | This PR
---------------------|------------
5.16207 | 4.342533
5.133671 | 4.399408
5.147513 | 4.468471
5.105243 | 4.36524
5.018685 | 4.373791
Avg Master | Avg This PR
------------------|--------------
5.1134364 | 4.3898886
Speedup of **1.164821449**
Closes #22275 from BryanCutler/arrow-toPandas-oo-batches-SPARK-25274.
Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>1 parent ab76900 commit ecaa495
File tree
4 files changed
+95
-22
lines changed- python/pyspark
- sql
- tests
- sql/core/src/main/scala/org/apache/spark/sql
4 files changed
+95
-22
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
185 | 185 | | |
186 | 186 | | |
187 | 187 | | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
188 | 221 | | |
189 | 222 | | |
190 | 223 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
32 | | - | |
| 32 | + | |
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
| |||
2168 | 2168 | | |
2169 | 2169 | | |
2170 | 2170 | | |
2171 | | - | |
| 2171 | + | |
| 2172 | + | |
| 2173 | + | |
| 2174 | + | |
| 2175 | + | |
| 2176 | + | |
| 2177 | + | |
| 2178 | + | |
2172 | 2179 | | |
2173 | 2180 | | |
2174 | 2181 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
381 | 381 | | |
382 | 382 | | |
383 | 383 | | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
| 406 | + | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
| 410 | + | |
| 411 | + | |
384 | 412 | | |
385 | 413 | | |
386 | 414 | | |
| |||
Lines changed: 25 additions & 20 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
20 | | - | |
| 20 | + | |
21 | 21 | | |
22 | 22 | | |
| 23 | + | |
23 | 24 | | |
24 | 25 | | |
25 | 26 | | |
| |||
3200 | 3201 | | |
3201 | 3202 | | |
3202 | 3203 | | |
3203 | | - | |
| 3204 | + | |
| 3205 | + | |
3204 | 3206 | | |
3205 | 3207 | | |
3206 | 3208 | | |
3207 | 3209 | | |
3208 | | - | |
3209 | | - | |
3210 | | - | |
| 3210 | + | |
| 3211 | + | |
| 3212 | + | |
3211 | 3213 | | |
3212 | | - | |
| 3214 | + | |
3213 | 3215 | | |
3214 | | - | |
3215 | | - | |
| 3216 | + | |
| 3217 | + | |
3216 | 3218 | | |
3217 | | - | |
3218 | | - | |
3219 | | - | |
3220 | | - | |
3221 | | - | |
3222 | | - | |
| 3219 | + | |
| 3220 | + | |
3223 | 3221 | | |
3224 | | - | |
3225 | | - | |
3226 | | - | |
| 3222 | + | |
| 3223 | + | |
| 3224 | + | |
| 3225 | + | |
| 3226 | + | |
| 3227 | + | |
| 3228 | + | |
| 3229 | + | |
| 3230 | + | |
| 3231 | + | |
| 3232 | + | |
| 3233 | + | |
3227 | 3234 | | |
3228 | | - | |
3229 | | - | |
3230 | | - | |
| 3235 | + | |
3231 | 3236 | | |
3232 | 3237 | | |
3233 | 3238 | | |
| |||
0 commit comments