
Conversation

@BryanCutler
Member

What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase the performance of DataFrame.toPandas. This is done by using Arrow to convert data partitions on the executor JVM into Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame collects the Arrow payloads, combines them, and converts them to a Pandas DataFrame. All data types are currently supported except complex, date, timestamp, and decimal types; for those, an UnsupportedOperationException is thrown.

Additions to Spark include a package-private Scala method Dataset.toArrowPayload that converts data partitions on the executor JVM to ArrowPayloads as byte arrays so they can be easily served, and a package-private class/object ArrowConverters that provides data type mappings and conversion routines. In Python, a private method DataFrame._collectAsArrow is added to collect the Arrow payloads, and the SQLConf "spark.sql.execution.arrow.enable" can be set so that toPandas() uses Arrow (the old conversion path is used by default).
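The collection flow described above can be sketched with a dependency-free stand-in. Here json encoding replaces the real Arrow record-batch format, and partition_to_payload, collect_as_payloads, and payloads_to_table are hypothetical names for illustration only, not the actual Spark APIs:

```python
# Minimal sketch of the flow: executor partitions -> serialized byte
# payloads -> driver combines them into one table. Real Spark serializes
# Arrow record batches; json stands in here purely for illustration.
import json

def partition_to_payload(rows):
    """Executor side: encode one data partition as a byte payload."""
    return json.dumps(rows).encode("utf-8")

def collect_as_payloads(partitions):
    """Driver side: gather one payload per partition
    (conceptual stand-in for DataFrame._collectAsArrow)."""
    return [partition_to_payload(p) for p in partitions]

def payloads_to_table(payloads):
    """Combine payloads into a single table, preserving partition order
    (conceptual stand-in for the Arrow -> pandas conversion)."""
    rows = []
    for payload in payloads:
        rows.extend(json.loads(payload.decode("utf-8")))
    return rows

partitions = [[{"id": 0}, {"id": 1}], [{"id": 2}]]
table = payloads_to_table(collect_as_payloads(partitions))
# table == [{"id": 0}, {"id": 1}, {"id": 2}]
```

The key point of the design is that each executor serializes its own partition independently, so the driver only concatenates ready-made payloads instead of converting rows one by one.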

How was this patch tested?

Added a new test suite, ArrowConvertersSuite, that tests conversion of Datasets to Arrow payloads for all supported types. The suite generates a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and validated against the JSON data. This ensures that both the schema and the data have been converted correctly.

Added PySpark tests to verify that the toPandas method produces equal DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
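The equality check used in those tests can be illustrated with a minimal, dependency-free sketch. Plain dict-of-lists "frames" stand in for pandas DataFrames, and frames_equal is a hypothetical helper, not the actual test code (the real PySpark tests compare pandas DataFrames directly):

```python
# Sketch of the with-arrow vs. without-arrow equality test: column names,
# column order, and every cell value must all match for the test to pass.
def frames_equal(a, b):
    """Return True if two dict-of-lists 'frames' are identical in
    column names, column order, and values."""
    return list(a.keys()) == list(b.keys()) and all(a[k] == b[k] for k in a)

# Stand-ins for df.toPandas() with Arrow disabled and enabled.
without_arrow = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
with_arrow = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
assert frames_equal(with_arrow, without_arrow)
```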

BryanCutler and others added 30 commits February 22, 2017 16:35
…ersion has basic data types and is working for small datasets with longs, doubles. Using Arrow 0.1.1-SNAPSHOT dependency.
Changed scope of arrow-tools dependency to test

commented out lines to Integration.compareXX that are private to arrow

closes #10
…ark script

Remove arrow-tools dependency

changed zipWithIndex to while loop

modified benchmark to work with Python2 timeit

closes #13
…cala

changed tests to use existing SQLTestData and removed unused files

closes #14
added more conversion tests

short type should have a bit-width of 16

closes #17
Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes #16
added test for byte data

byte type should be signed

closes #18
remove unwanted changes

removed benchmark.py from repository, will attach to PR instead
defined ArrowPayload and encapsulated Arrow classes in ArrowConverters

addressed some minor comments in code review

closes #21
arrow conversion done at partition by executors

some cleanup of APIs, made tests complete for non-complex data types

closes #23
@shaneknapp
Contributor

test this please

@BryanCutler
Member Author

Great, thanks @shaneknapp!

@SparkQA

SparkQA commented Jul 7, 2017

Test build #79339 has finished for PR 18459 at commit 930d624.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

ArrowTests are verified to be running after forcing this failure:

======================================================================
FAIL: test_filtered_frame (pyspark.sql.tests.ArrowTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py", line 2698, in test_filtered_frame
    self.assertTrue(False)
AssertionError: False is not true

@BryanCutler BryanCutler force-pushed the toPandas_with_arrow-SPARK-13534 branch from 930d624 to 26dfc82 Compare July 7, 2017 19:46
@shaneknapp
Contributor

one quick comment... i see that these tests are using the default ivy cache of /home/jenkins/.ivy2/cache, which is dangerous as other builds and whatnot can pollute this w/jars and cause test failures.

what @JoshRosen and i have set up is a per-executor ivy cache for PRB builds in /home/sparkivy/per-executor-caches on each worker, w/a subdir for each jenkins worker's executor (1-12). the setup code for this can be seen in the spark PRB config:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/configure

if you ( @BryanCutler or @wesm ) think this will be a factor in these tests (which i feel they could be), hit me up via the contact info in the amplab jenkins wiki and i can set you up w/access to see the PRB config and get access to the workers if you need it.

https://amplab.cs.berkeley.edu/wiki/jenkins

@SparkQA

SparkQA commented Jul 7, 2017

Test build #79346 has finished for PR 18459 at commit 26dfc82.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@shaneknapp this passed the ArrowTests, but looks like it failed while setting up conda for pip-tests because it couldn't acquire a lock

Exceeded max retries, giving up
Error:     LOCKERROR: It looks like conda is already doing something.
    The lock [u'/home/sparkivy/per-executor-caches/9/.conda/envs/.pkgs/.conda_lock-80213'] was found. Wait for it to finish before continuing.
    If you are sure that conda is not running, remove it and try again.
    You can also use: $ conda clean --lock

Is that the problem you were referring to above? cc @holdenk

@shaneknapp
Contributor

hmm. i have a feeling w/o looking at the test code that we're creating lots of envs, installing things, and then moving on to a new env... which is leading to a race condition w/lockfiles.

i just did a conda clean --lock on all of the workers, but i don't think that'll fix things long term.

another problem is that i'm heading out of town for the weekend, and won't be able to take a deeper look until sunday night at the earliest. :\

@BryanCutler
Member Author

Ok, no prob. I'll kick off another test, maybe that was just a fluke.

@shaneknapp
Contributor

shaneknapp commented Jul 8, 2017 via email

@BryanCutler
Member Author

jenkins retest this please

@holdenk
Contributor

holdenk commented Jul 8, 2017

I haven't seen lock contention when setting up conda environments before. If it happens again, let's dig deeper, but if it's just a one-off I wouldn't be too worried.

@SparkQA

SparkQA commented Jul 8, 2017

Test build #79355 has finished for PR 18459 at commit 26dfc82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Jul 8, 2017

Test build #79363 has finished for PR 18459 at commit 26dfc82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

jenkins retest this please

@SparkQA

SparkQA commented Jul 10, 2017

Test build #79427 has finished for PR 18459 at commit 26dfc82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

@shaneknapp
Contributor

ok, i feel confident that this PR should be g2g:

  • i checked the workers that this PR's builds were on and they didn't leave any stray lockfiles
  • i checked ALL workers for stray lockfiles, and only found one from a month ago (which i cleaned up)
  • no other spark PRB builds are failing post-upgrade w/system-level issues

so: +1 from me for merging!

@SparkQA

SparkQA commented Jul 10, 2017

Test build #79468 has finished for PR 18459 at commit 26dfc82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

That's great to hear @shaneknapp , thanks for all your help getting this going!

@cloud-fan , @holdenk since the environment upgrades, this has passed tests 4 times in a row, and I verified earlier that the ArrowTests were being run. The worker upgrades appear to be stable and aren't causing any failures. Do you think this is ok to merge back in?

@holdenk
Contributor

holdenk commented Jul 10, 2017

I think we are indeed good to go. I'll merge this back in if no one objects before 3pm pacific today.

@holdenk
Contributor

holdenk commented Jul 10, 2017

Merged to master. Thanks everyone (especially @shaneknapp & @BryanCutler ) :) If anyone sees anything come up in the builds we will revert, but I think the multiple runs plus @shaneknapp's verification mean everything is looking good :)

@asfgit asfgit closed this in d03aebb Jul 10, 2017
@BryanCutler
Member Author

Thanks @holdenk!

@cloud-fan
Contributor

great work!

8 participants