[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas #18459
Conversation
…ersion has basic data types and is working for small datasets with longs, doubles. Using Arrow 0.1.1-SNAPSHOT dependency.
Changed scope of arrow-tools dependency to test; commented out lines to Integration.compareXX that are private to arrow; closes #10
…ark script; remove arrow-tools dependency; changed zipWithIndex to while loop; modified benchmark to work with Python2 timeit; closes #13
…cala; changed tests to use existing SQLTestData and removed unused files; closes #14
…g and cleanup; closes #15
added more conversion tests; short type should have a bit-width of 16; closes #17
Move column writers to Arrow.scala; add support for more types; switch to arrow NullableVector; closes #16
added test for byte data; byte type should be signed; closes #18
…py; fix memory leaking bug; closes #19
remove unwanted changes; removed benchmark.py from repository, will attach to PR instead
added more tests and cleanup; closes #20
defined ArrowPayload and encapsulated Arrow classes in ArrowConverters; addressed some minor comments in code review; closes #21
…batches in a stream; closes #22
arrow conversion done at partition by executors; some cleanup of APIs; made tests complete for non-complex data types; closes #23
…atches not closed properly
test this please
Great, thanks @shaneknapp!
Test build #79339 has finished for PR 18459 at commit
ArrowTests are verified to be running after forcing this failure:
930d624 to 26dfc82
one quick comment... i see that these tests are using the default ivy cache. what @JoshRosen and i have set up is a per-executor ivy cache for PRB builds; if you ( @BryanCutler or @wesm ) think this will be a factor in these tests (which i feel it could be), hit me up via the contact info in the amplab jenkins wiki and i can set you up w/access to see the PRB config and get access to the workers if you need it.
Test build #79346 has finished for PR 18459 at commit
@shaneknapp this passed the ArrowTests, but looks like it failed while setting up conda for pip-tests because it couldn't acquire a lock. Is that the problem you were referring to above? cc @holdenk
hmm. i have a feeling w/o looking at the test code that we're creating lots of envs, installing things, and then moving on to a new env... which is leading to a race condition w/lockfiles. i just did a … another problem is that i'm heading out of town for the weekend, and won't be able to take a deeper look until sunday night at the earliest. :\
Ok, no prob. I'll kick off another test, maybe that was just a fluke.
i'd kick off a couple #tbh :)
jenkins retest this please
I haven't seen lock contention setting up conda environments before; if it happens again let's dig deeper, but if it's just a one-off I wouldn't be too worried.
Test build #79355 has finished for PR 18459 at commit
test this please
Test build #79363 has finished for PR 18459 at commit
jenkins retest this please
Test build #79427 has finished for PR 18459 at commit
test this please
ok, i feel confident that this PR should be g2g:
so: +1 from me for merging!
Test build #79468 has finished for PR 18459 at commit
That's great to hear @shaneknapp, thanks for all your help getting this going! @cloud-fan, @holdenk, since the environment upgrades this has passed tests 4 times in a row, and I had verified earlier that ArrowTests were being run. The worker upgrades appear to be stable and not causing any failures. Do you think this is ok to be merged back in?
I think we are indeed good to go. I'll merge this back in if no one objects before 3pm pacific today.
Merged to master. Thanks everyone (especially @shaneknapp & @BryanCutler ) :) If anyone sees anything come up in the builds we will revert, but I think the multiple runs and @shaneknapp's verification mean everything is looking good :)
Thanks @holdenk!
great work!
What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase the performance of DataFrame.toPandas. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, where they are combined and converted to a pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported; otherwise an UnsupportedOperation exception is thrown.
Additions to Spark include a Scala package-private method Dataset.toArrowPayload that converts data partitions in the executor JVM to ArrowPayloads as byte arrays so they can be easily served, and a package-private class/object ArrowConverters that provides data type mappings and conversion routines. In Python, a private method DataFrame._collectAsArrow is added to collect Arrow payloads, and a SQLConf "spark.sql.execution.arrow.enable" can be used in toPandas() to enable using Arrow (the old conversion path is used by default).
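A minimal usage sketch of the new path (the conf key and toPandas() come from this PR; the SparkSession setup and sample DataFrame below are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()

# Enable the Arrow-based conversion added by this PR (disabled by default).
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Any DataFrame with supported (non-complex) column types should work here.
df = spark.range(1000000).selectExpr("id", "id * 2.0 AS value")

# With the conf enabled, toPandas() collects Arrow payloads from the executors
# and assembles them into a single pandas DataFrame on the driver.
pdf = df.toPandas()
print(pdf.head())
```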
How was this patch tested?
Added a new test suite ArrowConvertersSuite that runs tests on conversion of Datasets to Arrow payloads for the supported types. The suite generates a Dataset and matching Arrow JSON data, then the Dataset is converted to an Arrow payload and validated against the JSON data. This ensures that the schema and data have been converted correctly.
Added PySpark tests to verify that the toPandas method produces equal DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
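As a rough sketch of what such a roundtrip check could look like (illustrative only, not the actual test code from this PR; the sample data and use of pandas' assert_frame_equal are assumptions):

```python
import pandas as pd
from pandas.testing import assert_frame_equal
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-roundtrip-sketch").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Build a pandas DataFrame directly, push it through Spark, and pull it back.
expected = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
result = spark.createDataFrame(expected).toPandas()

# The Arrow-based toPandas() should reproduce the original frame.
assert_frame_equal(expected, result)
```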