
Conversation

@BryanCutler (Member) commented Aug 1, 2018

What changes were proposed in this pull request?

Upgrade Apache Arrow to 0.10.0

Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:

  • Allow for adding BinaryType support ARROW-2141
  • Bug fix related to array serialization ARROW-1973
  • Python2 str will be made into an Arrow string instead of bytes ARROW-2101
  • Python bytearrays are supported as input to pyarrow ARROW-2141
  • Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962
  • Cleanup pyarrow type equality checks ARROW-2423
  • ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
  • Improved low level handling of messages for RecordBatch ARROW-2704

How was this patch tested?

existing tests

@holdensmagicalunicorn

@BryanCutler, thanks! I am a bot who has found some folks who might be able to help with the review: @HyukjinKwon, @cloud-fan and @vanzin.

@BryanCutler (Member, Author) commented Aug 1, 2018

This is a WIP; Arrow 0.10.0 hasn't been released yet, but I wanted to get this up since the 2.4.0 code freeze is coming up and there might not be too much time in case some things need discussion or need to be planned out for CI support.

@SparkQA commented Aug 1, 2018

Test build #93853 has finished for PR 21939 at commit abedec3.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<commons-crypto.version>1.0.0</commons-crypto.version>
<!--
If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
./python/run-tests.py and ./python/setup.py too.

I think we should do this check?


Ah, actually I did the check in his previous attempt :-). It seems we don't need to change the minimum pyarrow version for this upgrade.

BTW, we can remove ./python/run-tests.py here in the comment and in the setup.py comment. This cleanup of ./python/run-tests.py was done by https://github.com/apache/spark/pull/21107/files#diff-871d87c62d4e9228a47145a8894b6694L172
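
For context, the pyarrow version gate referenced in these comments lives in ./python/pyspark/sql/utils.py. Below is a rough sketch of that style of check with the 0.8.0 minimum discussed here; the names and wording are approximate, not the verbatim Spark source at this revision.

# Sketch of a minimum-pyarrow-version gate in the style of
# python/pyspark/sql/utils.py (approximate, not the verbatim Spark code).
from distutils.version import LooseVersion


def require_minimum_pyarrow_version():
    """Raise ImportError if a sufficiently new pyarrow is not installed."""
    minimum_pyarrow_version = "0.8.0"
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "PyArrow >= %s must be installed; however, it was not found."
            % minimum_pyarrow_version)
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
        raise ImportError(
            "PyArrow >= %s must be installed; however, your version was %s."
            % (minimum_pyarrow_version, pyarrow.__version__))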

NullableMapVector mapVector = (NullableMapVector) vector;
accessor = new StructAccessor(mapVector);
} else if (vector instanceof StructVector) {
StructVector structVector = (StructVector) vector;
@HyukjinKwon (Member) Aug 1, 2018

two spaces here :-) "= (".

@cloud-fan (Contributor)

Allow for adding BinaryType support

Don't we already have it? ArrowColumnVector uses VarBinaryVector to implement binary type.

@HyukjinKwon (Member)

I think he meant binary support on the Python side (SPARK-23555 / ARROW-2141). Basically the same question though. I'd appreciate it if that's clarified.

@gatorsmile (Member) commented Aug 1, 2018

@BryanCutler Thanks! What is the target release date of Apache Arrow 0.10.0?

@shaneknapp (Contributor)

i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :)

@BryanCutler (Member, Author)

@cloud-fan, we have BinaryType support in Java already, but it has not been added to Python due to an issue (the related JIRAs that @HyukjinKwon mentioned). Arrow 0.10.0 has a bug fix that makes it possible to add it to Python.
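
As a hypothetical illustration of what the Python-side support would rely on once the fix is available (this snippet is not part of the PR), a pyarrow release containing the ARROW-2141 fix can convert bytes-like values, including bytearray, into an Arrow binary array:

import pyarrow as pa

# bytes and bytearray values converted into an Arrow binary array; with
# pyarrow releases prior to the ARROW-2141 fix, bytes-like input such as
# bytearray could not be handled reliably.
arr = pa.array([b"spark", bytearray(b"arrow"), None], type=pa.binary())
print(arr.type)         # binary
print(arr.to_pylist())  # [b'spark', b'arrow', None]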

@BryanCutler (Member, Author)

@gatorsmile, there is an RC1 vote up now, so it should be very soon.

@BryanCutler (Member, Author)

i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :)

Thanks @shaneknapp! So for those suggesting we keep the existing minimum pyarrow version of 0.8.0, does that mean we will need to add tests for three versions, to also cover 0.9.0 and 0.10.0?

@gatorsmile (Member)

After the code freeze, dependency changes are not allowed. Hopefully, we can make it before then.

@gatorsmile (Member)

To get this in, we might need to delay the code freeze. Can you reply to the dev list email http://apache-spark-developers-list.1001551.n3.nabble.com/code-freeze-and-branch-cut-for-Apache-Spark-2-4-td24365.html ?

@shaneknapp (Contributor)

@BryanCutler we currently test Spark against only one version of pyarrow (and against py27 and py34).

setting things up to test against a matrix of python/pyarrow versions will have to take place after the code freeze/2.4 release and it won't necessarily be straightforward due to the mechanics of how the python test-running framework is set up. see: https://github.com/apache/spark/blob/master/python/run-tests.py

@BryanCutler (Member, Author)

@shaneknapp I think we would be better off just upping the minimum version of arrow to 0.10.0 here since it's pretty involved to get a test matrix up and running and the project is still in a fair amount of flux until a stable 1.0 is released. What are your thoughts on this @HyukjinKwon @cloud-fan @holdenk ?

@cloud-fan (Contributor)

SGTM

@HyukjinKwon (Member)

Upping PyArrow to 0.10.0 within the Jenkins environment sounds fine to me, considering 2.4.0 is getting close.

We are already not testing all the combinations, and I at least manually test other combinations locally. For the minimum PyArrow upgrade for Spark itself in the code base, wouldn't it be better to do that after the 2.4.0 release and target 3.0.0?

@cloud-fan (Contributor)

We are already not testing all the combinations, and I at least manually test other combinations locally. For the minimum PyArrow upgrade for Spark itself in the code base, wouldn't it be better to do that after the 2.4.0 release and target 3.0.0?

Yea, makes sense, since we don't have time to do the follow-ups (like binary type support) within Spark 2.4.

@SparkQA commented Aug 8, 2018

Test build #94407 has finished for PR 21939 at commit 0652617.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member, Author)

retest this please

@BryanCutler (Member, Author)

Ok, so we will up the pyarrow version in Jenkins to 0.10.0, but keep the minimum version in python/setup.py as 0.8.0 for now, correct?
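
For reference, the setup.py pin being discussed is an optional "sql" extra with a minimum pyarrow version, roughly of this shape; the values and structure below are a sketch, not the verbatim python/setup.py.

from setuptools import setup

# Sketch of the optional dependency pin in the spirit of python/setup.py
# (approximate values, not the verbatim Spark source).
_minimum_pyarrow_version = "0.8.0"

setup(
    name="pyspark-sketch",   # hypothetical package name, for illustration only
    version="0.0.0",
    extras_require={
        # Arrow-based SQL features stay pinned at the 0.8.0 minimum even
        # though the Jenkins workers may run a newer pyarrow.
        "sql": ["pandas>=0.19.2", "pyarrow>=%s" % _minimum_pyarrow_version],
    },
)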

@BryanCutler (Member, Author)

Nice! Thanks for getting that running @shaneknapp. So what are people's thoughts about merging this for 2.4, since it passes the normal tests with pyarrow 0.8.0 and we've also shown it passes with 0.10.0?

@HyukjinKwon (Member)

@BryanCutler, not a big deal but why don't we link Arrow JIRA for "Allow for adding BinaryType support" too?

@BryanCutler (Member, Author)

@BryanCutler, not a big deal but why don't we link Arrow JIRA for "Allow for adding BinaryType support" too?

@HyukjinKwon I added the link, must have forgotten that from before

@cloud-fan (Contributor)

Sorry I didn't follow all the discussions here. @BryanCutler Do you mean we will upgrade Arrow to 0.10.0 on the Java side, but leave the Python side as it is? So people can still use PySpark with pyarrow 0.8.0 and Python 3.4? And if they go with Arrow 0.10.0 and Python 3.5, they get these bug fixes?

@shaneknapp (Contributor)

@cloud-fan the 0.10.0 tests are passing both on the new, temporary testing box i set up (python3.5 + arrow 0.10.0) and on the standard 3.4/0.8.0 deployments (both ubuntu and centos).

since the 0.10.0 tests are passing on the 0.8.0 workers, i think merging would be fine.

@BryanCutler (Member, Author)

Do you mean we will upgrade Arrow to 0.10.0 on the Java side, but leave the Python side as it is? So people can still use PySpark with pyarrow 0.8.0 and Python 3.4? And if they go with Arrow 0.10.0 and Python 3.5, they get these bug fixes?

@cloud-fan, that is correct. This PR updates the Java artifact, and since we are not bumping up the minimum pyarrow version, there is nothing that needs to be done in the Python code. It would be best to have pyarrow 0.10.0 in our CI, but @shaneknapp has run tests, and I have also run them locally, to be confident enough that there are no issues using Arrow Java 0.10.0 with pyarrow 0.8.0 through 0.10.0. There were also no binary-compatibility-breaking changes made in the Arrow format since 0.8.0.

@cloud-fan (Contributor)

great! looking forward to seeing arrow 0.10.0 come out.

@cloud-fan (Contributor)

LGTM

@BryanCutler (Member, Author)

great! looking forward to seeing arrow 0.10.0 come out.

@cloud-fan Arrow has already been released and the artifacts are available; sorry, I should have made a post to indicate that. This is ready to be merged (pending the latest test) as long as we are ok with not bumping up the pyarrow version in our CI for the time being. Does that sound ok to you?

@HyukjinKwon (Member) left a comment

LGTM

@gatorsmile (Member)

@BryanCutler @shaneknapp Thanks for your work!

@SparkQA commented Aug 14, 2018

Test build #94729 has finished for PR 21939 at commit ae8a6aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Aug 14, 2018

Test build #94736 has finished for PR 21939 at commit ae8a6aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Aug 14, 2018

Test build #94748 has finished for PR 21939 at commit ae8a6aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in ed075e1 on Aug 15, 2018
@BryanCutler (Member, Author) commented Aug 15, 2018

Merged to master. Thanks for your efforts on this @shaneknapp, and thanks @cloud-fan, @HyukjinKwon, @viirya and @dongjoon-hyun for reviewing!

@BryanCutler deleted the arrow-upgrade-010 branch on August 15, 2018 at 00:21
@yhuai (Contributor) commented Aug 17, 2018

@BryanCutler So, for this upgrade, even though the JVM-side dependency is 0.10, pyspark can work with any version from pyarrow 0.8 to 0.10 without problems? I am asking because, on the Java side, arrow 0.10 and 0.8 are not source compatible, and I am wondering if we have any source compatibility concerns on the Python side.

@yhuai (Contributor) commented Aug 17, 2018

@shaneknapp what was the version of pyarrow in that build? 0.8 or 0.10?

@shaneknapp (Contributor)

0.8:

-bash-4.1$ hostname
amp-jenkins-worker-03
-bash-4.1$ export PATH=/home/anaconda/envs/py3k/bin/:$PATH
-bash-4.1$ python3.4 -c "import pyarrow; print(pyarrow.__version__)"
0.8.0

@yhuai (Contributor) commented Aug 17, 2018

got it. Thank you!
