[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas #18459
Conversation
…ersion has basic data types and is working for small datasets with longs, doubles. Using Arrow 0.1.1-SNAPSHOT dependency.
Changed scope of arrow-tools dependency to test; commented out lines to Integration.compareXX that are private to arrow; closes #10
…ark script; remove arrow-tools dependency; changed zipWithIndex to while loop; modified benchmark to work with Python2 timeit; closes #13
…cala; changed tests to use existing SQLTestData and removed unused files; closes #14
…g and cleanup; closes #15
added more conversion tests; short type should have a bit-width of 16; closes #17
Move column writers to Arrow.scala; add support for more types; switch to arrow NullableVector; closes #16
added test for byte data; byte type should be signed; closes #18
…py; fix memory leaking bug; closes #19
remove unwanted changes; removed benchmark.py from repository, will attach to PR instead
added more tests and cleanup; closes #20
defined ArrowPayload and encapsulated Arrow classes in ArrowConverters; addressed some minor comments in code review; closes #21
…batches in a stream; closes #22
arrow conversion done at partition by executors; some cleanup of APIs; made tests complete for non-complex data types; closes #23
…atches not closed properly
test this please
Great, thanks @shaneknapp!
Test build #79339 has finished for PR 18459 at commit
ArrowTests are verified to be running after forcing this failure:
930d624 to 26dfc82
one quick comment... i see that these tests are using the default ivy cache. what @JoshRosen and i have set up is a per-executor ivy cache for PRB builds; if you ( @BryanCutler or @wesm ) think this will be a factor in these tests (which i feel it could be), hit me up via the contact info in the amplab jenkins wiki and i can set you up w/access to see the PRB config and get access to the workers if you need it.
Test build #79346 has finished for PR 18459 at commit
@shaneknapp this passed the ArrowTests, but looks like it failed while setting up conda for pip-tests because it couldn't acquire a lock. Is that the problem you were referring to above? cc @holdenk
hmm. i have a feeling w/o looking at the test code that we're creating lots of envs, installing things, and then moving on to a new env... which is leading to a race condition w/lockfiles. i just did a … another problem is that i'm heading out of town for the weekend, and won't be able to take a deeper look until sunday night at the earliest. :\
Ok, no prob. I'll kick off another test, maybe that was just a fluke.
i'd kick off a couple #tbh :)
jenkins retest this please
I haven't seen lock contention setting up conda environments before; if it happens again let's dig deeper, but if it's just a one-off I wouldn't be too worried.
Test build #79355 has finished for PR 18459 at commit
test this please
Test build #79363 has finished for PR 18459 at commit
jenkins retest this please
Test build #79427 has finished for PR 18459 at commit
test this please
ok, i feel confident that this PR should be g2g:
so: +1 from me for merging!
Test build #79468 has finished for PR 18459 at commit
That's great to hear @shaneknapp, thanks for all your help getting this going! @cloud-fan, @holdenk, since the environment upgrades this has passed tests 4 times in a row, and I had verified earlier that ArrowTests were being run. The worker upgrades appear to be stable and not causing any failures. Do you think this is ok to be merged back in?
I think we are indeed good to go. I'll merge this back in if no one objects before 3pm pacific today.
Merged to master. Thanks everyone (especially @shaneknapp & @BryanCutler ) :) If anyone sees anything come up in the builds we will revert, but I think the multiple runs and @shaneknapp's verification mean everything is looking good :)
Thanks @holdenk!
great work!
What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase the performance of DataFrame.toPandas. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, where they are combined and converted to a pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported; otherwise an UnsupportedOperation exception is thrown.
Additions to Spark include a Scala package-private method Dataset.toArrowPayload that converts data partitions in the executor JVM to ArrowPayloads as byte arrays so they can be easily served, and a package-private class/object ArrowConverters that provides data type mappings and conversion routines. In Python, a private method DataFrame._collectAsArrow is added to collect Arrow payloads, and a SQLConf "spark.sql.execution.arrow.enable" can be used in toPandas() to enable using Arrow (the old conversion path is used by default).
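A minimal usage sketch of the new path (the conf key and toPandas() come from this PR; the SparkSession setup and sample DataFrame below are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()

# Enable the Arrow-based conversion added by this PR (disabled by default).
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Any DataFrame with supported (non-complex) column types should work here.
df = spark.range(1000000).selectExpr("id", "id * 2.0 AS value")

# With the conf enabled, toPandas() collects Arrow payloads from the executors
# and assembles them into a single pandas DataFrame on the driver.
pdf = df.toPandas()
print(pdf.head())
```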
How was this patch tested?
Added a new test suite ArrowConvertersSuite that runs tests on conversion of Datasets to Arrow payloads for the supported types. The suite generates a Dataset and matching Arrow JSON data, then the Dataset is converted to an Arrow payload and validated against the JSON data. This ensures that the schema and data have been converted correctly.
Added PySpark tests to verify that the toPandas method produces equal DataFrames with and without pyarrow, plus a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
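As a rough sketch of what such a roundtrip check could look like (illustrative only, not the actual test code from this PR; the sample data and use of pandas' assert_frame_equal are assumptions):

```python
import pandas as pd
from pandas.testing import assert_frame_equal
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-roundtrip-sketch").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Build a pandas DataFrame directly, push it through Spark, and pull it back.
expected = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
result = spark.createDataFrame(expected).toPandas()

# The Arrow-based toPandas() should reproduce the original frame.
assert_frame_equal(expected, result)
```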