
Conversation

@BryanCutler (Member) commented Aug 1, 2018

What changes were proposed in this pull request?

Upgrade Apache Arrow to 0.10.0

Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:

  • Allow for adding BinaryType support ARROW-2141
  • Bug fix related to array serialization ARROW-1973
  • Python2 str will be made into an Arrow string instead of bytes ARROW-2101
  • Python bytearrays are supported as input to pyarrow ARROW-2141
  • Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962
  • Cleanup pyarrow type equality checks ARROW-2423
  • ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
  • Improved low level handling of messages for RecordBatch ARROW-2704

How was this patch tested?

existing tests

@holdensmagicalunicorn

@BryanCutler, thanks! I am a bot who has found some folks who might be able to help with the review: @HyukjinKwon, @cloud-fan and @vanzin.

@BryanCutler (Member, Author) commented Aug 1, 2018

This is a WIP; Arrow 0.10.0 hasn't been released yet, but I wanted to get this up since the 2.4.0 code freeze is coming up and there might not be too much time in case some things need discussion or need to be planned out for CI support.

@SparkQA commented Aug 1, 2018

Test build #93853 has finished for PR 21939 at commit abedec3.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<commons-crypto.version>1.0.0</commons-crypto.version>
<!--
If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py,
./python/run-tests.py and ./python/setup.py too.

I think we should do this check?


Ah, actually I did the check in his previous attempt :-). It seems we don't need to change the minimum pyarrow version for this upgrade.

BTW, we can remove ./python/run-tests.py here in the comment and in the setup.py comment. This cleanup of ./python/run-tests.py was done by https://github.com/apache/spark/pull/21107/files#diff-871d87c62d4e9228a47145a8894b6694L172
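
For context, the pyarrow version gate referenced in these comments lives in ./python/pyspark/sql/utils.py. Below is a rough sketch of that style of check with the 0.8.0 minimum discussed here; the names and wording are approximate, not the verbatim Spark source at this revision.

# Sketch of a minimum-pyarrow-version gate in the style of
# python/pyspark/sql/utils.py (approximate, not the verbatim Spark code).
from distutils.version import LooseVersion


def require_minimum_pyarrow_version():
    """Raise ImportError if a sufficiently new pyarrow is not installed."""
    minimum_pyarrow_version = "0.8.0"
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "PyArrow >= %s must be installed; however, it was not found."
            % minimum_pyarrow_version)
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
        raise ImportError(
            "PyArrow >= %s must be installed; however, your version was %s."
            % (minimum_pyarrow_version, pyarrow.__version__))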

NullableMapVector mapVector = (NullableMapVector) vector;
accessor = new StructAccessor(mapVector);
} else if (vector instanceof StructVector) {
StructVector structVector = (StructVector) vector;
@HyukjinKwon (Member) Aug 1, 2018

two spaces here :-) "= (".

@cloud-fan (Contributor)

Allow for adding BinaryType support

Don't we already have it? ArrowColumnVector uses VarBinaryVector to implement binary type.

@HyukjinKwon (Member)

I think he meant binary support on the Python side (SPARK-23555 / ARROW-2141). Basically the same question though. I'd appreciate it if that's clarified.

@gatorsmile (Member) commented Aug 1, 2018

@BryanCutler Thanks! What is the target release date of Apache Arrow 0.10.0?

@shaneknapp (Contributor)

i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :)

@BryanCutler (Member, Author)

@cloud-fan, we have BinaryType support in Java already, but it has not been added to Python due to an issue (the related JIRAs that @HyukjinKwon mentioned). Arrow 0.10.0 has a bug fix that makes it possible to add it to Python.
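
As a hypothetical illustration of what the Python-side support would rely on once the fix is available (this snippet is not part of the PR), a pyarrow release containing the ARROW-2141 fix can convert bytes-like values, including bytearray, into an Arrow binary array:

import pyarrow as pa

# bytes and bytearray values converted into an Arrow binary array; with
# pyarrow releases prior to the ARROW-2141 fix, bytes-like input such as
# bytearray could not be handled reliably.
arr = pa.array([b"spark", bytearray(b"arrow"), None], type=pa.binary())
print(arr.type)         # binary
print(arr.to_pylist())  # [b'spark', b'arrow', None]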

@BryanCutler (Member, Author)

@gatorsmile, there is an RC1 vote up now, so it should be very soon.

@BryanCutler (Member, Author)

i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :)

Thanks @shaneknapp! So for those suggesting we keep the existing minimum pyarrow version of 0.8.0, does that mean we will need to add tests for three versions, to also cover 0.9.0 and 0.10.0?

@gatorsmile (Member)

After the code freeze, dependency changes are not allowed. Hopefully, we can make it before then.

@gatorsmile (Member)

To get this in, we might need to delay the code freeze. Can you reply to the dev list email http://apache-spark-developers-list.1001551.n3.nabble.com/code-freeze-and-branch-cut-for-Apache-Spark-2-4-td24365.html ?

@shaneknapp (Contributor)

@BryanCutler we currently test Spark against only one version of pyarrow (and against py27 and py34).

setting things up to test against a matrix of python/pyarrow versions will have to take place after the code freeze/2.4 release and it won't necessarily be straightforward due to the mechanics of how the python test-running framework is set up. see: https://github.com/apache/spark/blob/master/python/run-tests.py

@BryanCutler (Member, Author)

@shaneknapp I think we would be better off just upping the minimum version of arrow to 0.10.0 here since it's pretty involved to get a test matrix up and running and the project is still in a fair amount of flux until a stable 1.0 is released. What are your thoughts on this @HyukjinKwon @cloud-fan @holdenk ?

@cloud-fan (Contributor)

SGTM

@HyukjinKwon (Member)

Upping PyArrow to 0.10.0 within the Jenkins environment sounds fine to me, considering 2.4.0 is getting close.

We are already not testing all the combinations, and I at least manually test other combinations locally. For the minimum PyArrow upgrade for Spark itself in the code base, wouldn't it be better to do that after the 2.4.0 release and target 3.0.0?

@cloud-fan (Contributor)

We are already not testing all the combinations, and I at least manually test other combinations locally. For the minimum PyArrow upgrade for Spark itself in the code base, wouldn't it be better to do that after the 2.4.0 release and target 3.0.0?

Yea, makes sense, since we don't have time to do the follow-ups (like binary type support) within Spark 2.4.

@SparkQA commented Aug 8, 2018

Test build #94407 has finished for PR 21939 at commit 0652617.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member, Author)

retest this please

@BryanCutler (Member, Author)

Ok, so we will up the pyarrow version in Jenkins to 0.10.0, but keep the minimum version in python/setup.py as 0.8.0 for now, correct?
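
For reference, the setup.py pin being discussed is an optional "sql" extra with a minimum pyarrow version, roughly of this shape; the values and structure below are a sketch, not the verbatim python/setup.py.

from setuptools import setup

# Sketch of the optional dependency pin in the spirit of python/setup.py
# (approximate values, not the verbatim Spark source).
_minimum_pyarrow_version = "0.8.0"

setup(
    name="pyspark-sketch",   # hypothetical package name, for illustration only
    version="0.0.0",
    extras_require={
        # Arrow-based SQL features stay pinned at the 0.8.0 minimum even
        # though the Jenkins workers may run a newer pyarrow.
        "sql": ["pandas>=0.19.2", "pyarrow>=%s" % _minimum_pyarrow_version],
    },
)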

@BryanCutler (Member, Author)

Nice! Thanks for getting that running @shaneknapp. So what are people's thoughts about merging this for 2.4, since it passes the normal tests with pyarrow 0.8.0 and we've also shown it passes with 0.10.0?

@HyukjinKwon (Member)

@BryanCutler, not a big deal but why don't we link Arrow JIRA for "Allow for adding BinaryType support" too?

@BryanCutler (Member, Author)

@BryanCutler, not a big deal but why don't we link Arrow JIRA for "Allow for adding BinaryType support" too?

@HyukjinKwon I added the link, must have forgotten that from before

@cloud-fan (Contributor)

Sorry I didn't follow all the discussions here. @BryanCutler Do you mean we will upgrade Arrow to 0.10.0 on the Java side, but leave the Python side as it is? So people can still use PySpark with pyarrow 0.8.0 and Python 3.4? And if they go with Arrow 0.10.0 and Python 3.5, they get these bug fixes?

@shaneknapp (Contributor)

@cloud-fan the 0.10.0 tests are passing both on the new, temporary testing box i set up (python3.5 + arrow 0.10.0) and on the standard 3.4/0.8.0 deployments (both ubuntu and centos).

since the 0.10.0 tests are passing on the 0.8.0 workers, i think merging would be fine.

@BryanCutler (Member, Author)

Do you mean we will upgrade Arrow to 0.10.0 on the Java side, but leave the Python side as it is? So people can still use PySpark with pyarrow 0.8.0 and Python 3.4? And if they go with Arrow 0.10.0 and Python 3.5, they get these bug fixes?

@cloud-fan, that is correct. This PR updates the Java artifact, and since we are not bumping up the minimum pyarrow version, there is nothing that needs to be done in the Python code. It would be best to have pyarrow 0.10.0 in our CI, but @shaneknapp has run tests, and I have also run them locally, to be confident enough that there are no issues using Arrow Java 0.10.0 with pyarrow 0.8.0 through 0.10.0. There were also no binary-compatibility-breaking changes made in the Arrow format since 0.8.0.

@cloud-fan (Contributor)

great! looking forward to seeing arrow 0.10.0 come out.

@cloud-fan (Contributor)

LGTM

@BryanCutler (Member, Author)

great! looking forward to seeing arrow 0.10.0 come out.

@cloud-fan Arrow has already been released and the artifacts are available; sorry, I should have made a post to indicate that. This is ready to be merged (pending the latest test) as long as we are ok with not bumping up the pyarrow version in our CI for the time being. Does that sound ok to you?

@HyukjinKwon (Member) left a comment

LGTM

@gatorsmile (Member)

@BryanCutler @shaneknapp Thanks for your work!

@SparkQA commented Aug 14, 2018

Test build #94729 has finished for PR 21939 at commit ae8a6aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Aug 14, 2018

Test build #94736 has finished for PR 21939 at commit ae8a6aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Aug 14, 2018

Test build #94748 has finished for PR 21939 at commit ae8a6aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in ed075e1 on Aug 15, 2018
@BryanCutler (Member, Author) commented Aug 15, 2018

Merged to master. Thanks for your efforts on this @shaneknapp, and thanks @cloud-fan, @HyukjinKwon, @viirya and @dongjoon-hyun for reviewing!

@BryanCutler deleted the arrow-upgrade-010 branch on August 15, 2018 at 00:21
@yhuai (Contributor) commented Aug 17, 2018

@BryanCutler So, for this upgrade, even though the JVM-side dependency is 0.10, pyspark can work with any version from pyarrow 0.8 to 0.10 without problems? I am asking because, on the Java side, arrow 0.10 and 0.8 are not source compatible, and I am wondering if we have any source compatibility concerns on the Python side.

@yhuai (Contributor) commented Aug 17, 2018

@shaneknapp what was the version of pyarrow in that build? 0.8 or 0.10?

@shaneknapp (Contributor)

0.8:

-bash-4.1$ hostname
amp-jenkins-worker-03
-bash-4.1$ export PATH=/home/anaconda/envs/py3k/bin/:$PATH
-bash-4.1$ python3.4 -c "import pyarrow; print(pyarrow.__version__)"
0.8.0

@yhuai (Contributor) commented Aug 17, 2018

got it. Thank you!
