[SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10.0 #21939
Conversation
|
@BryanCutler, thanks! I am a bot who has found some folks who might be able to help with the review:@HyukjinKwon, @cloud-fan and @vanzin |
|
This is a WIP. Arrow 0.10.0 hasn't been released yet, but I wanted to get this up since the 2.4.0 code freeze is coming up and there might not be much time left in case some things need to be discussed or planned out for CI support. |
|
Test build #93853 has finished for PR 21939 at commit
|
| <commons-crypto.version>1.0.0</commons-crypto.version> | ||
| <!-- | ||
| If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py, | ||
| ./python/run-tests.py and ./python/setup.py too. |
I think we should do this check?
Ah, actually I did the check instead in his previous attempt :-). Seems we don't need to change the minimum pyarrow version by this upgrade.
BTW, we can remove ./python/run-tests.py. here in the comment and in setup.py comment. This cleanup in ./python/run-tests.py. was done by https://github.com/apache/spark/pull/21107/files#diff-871d87c62d4e9228a47145a8894b6694L172
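The minimum-version check being discussed can be illustrated with a small sketch. This is not the code in ./python/pyspark/sql/utils.py; the helper names `parse_version` and `check_minimum_pyarrow` are hypothetical, and the 0.8.0 minimum is taken from this thread. The version is passed in as a string so the sketch runs without pyarrow installed. It also shows why versions should be compared as numbers: as plain strings, "0.10.0" sorts before "0.8.0".

```python
# Hypothetical sketch of a minimum-version guard; not the actual PySpark code.

# Assumption: 0.8.0 is the minimum pyarrow version, as stated in this thread.
MINIMUM_PYARROW_VERSION = "0.8.0"


def parse_version(version):
    """Turn a dotted version string such as '0.10.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_minimum_pyarrow(installed, minimum=MINIMUM_PYARROW_VERSION):
    """Raise ImportError if `installed` is older than `minimum`."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            "pyarrow >= %s must be installed; found %s" % (minimum, installed)
        )


# Comparing raw strings gets the order wrong, because "1" sorts before "8".
string_order_is_misleading = "0.10.0" < "0.8.0"

check_minimum_pyarrow("0.10.0")  # passes silently
```

In a real check the installed version would presumably come from `pyarrow.__version__`; that lookup is left out here to keep the sketch dependency-free.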
| NullableMapVector mapVector = (NullableMapVector) vector; | ||
| accessor = new StructAccessor(mapVector); | ||
| } else if (vector instanceof StructVector) { | ||
| StructVector structVector = (StructVector) vector; |
two spaces here :-) "= (".
Don't we already have it? |
|
I think he meant binary support on the Python side (SPARK-23555 / ARROW-2141). Basically the same question, though. I'd appreciate it if that's clarified. |
|
@BryanCutler Thanks! What is the target release date of Apache Arrow 0.10.0? |
|
i'm ready to pull the trigger on the update to arrow... i'd much prefer a pip dist, but would be ok w/a conda package. :) |
|
@cloud-fan , we have BinaryType support in Java already, but it has not been added to Python due to an issue - the related jiras that @HyukjinKwon mentioned. So Arrow 0.10.0 has a bug fix that makes it possible to add it to Python. |
|
@gatorsmile , there is an RC1 vote up now, so it should be very soon |
Thanks @shaneknapp ! So for those suggesting we keep the existing minimum pyarrow version of 0.8.0, does that mean we will need to add triple tests to support 0.9.0 and 0.10.0? |
|
After the code freeze, dependency changes are not allowed. Hopefully, we can make it before that. |
|
To get this in, we might need to delay the code freeze. Can you reply to the dev list email http://apache-spark-developers-list.1001551.n3.nabble.com/code-freeze-and-branch-cut-for-Apache-Spark-2-4-td24365.html ? |
|
It sounds like the vote can pass soon. https://lists.apache.org/thread.html/9900da1540be5aafce27691fd40395bb53f465302db29979c154d99a@%3Cdev.arrow.apache.org%3E |
|
@BryanCutler we currently test spark against only one version of pyarrow (and against py27 and py34).. setting things up to test against a matrix of python/pyarrow versions will have to take place after the code freeze/2.4 release and it won't necessarily be straightforward due to the mechanics of how the python test-running framework is set up. see: https://github.com/apache/spark/blob/master/python/run-tests.py |
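To make the size of such a test matrix concrete, here is a minimal sketch that enumerates the combinations. It is not part of python/run-tests.py; the function `build_test_matrix` is hypothetical, and the version lists are assumptions taken from the versions named in this thread (py27 and py34; pyarrow 0.8.0, 0.9.0 and 0.10.0).

```python
# Hypothetical sketch: enumerate a python/pyarrow test matrix.
from itertools import product

# Assumption: the interpreter and pyarrow versions mentioned in this thread.
PYTHON_VERSIONS = ["2.7", "3.4"]
PYARROW_VERSIONS = ["0.8.0", "0.9.0", "0.10.0"]


def build_test_matrix(python_versions, pyarrow_versions):
    """Return one dict per (python, pyarrow) combination to be tested."""
    return [
        {"python": py, "pyarrow": pa}
        for py, pa in product(python_versions, pyarrow_versions)
    ]


matrix = build_test_matrix(PYTHON_VERSIONS, PYARROW_VERSIONS)
for combo in matrix:
    print("python %s + pyarrow %s" % (combo["python"], combo["pyarrow"]))
```

Every additional pyarrow version adds one test run per Python version, which is part of why keeping a single tested version in CI is the cheaper option before the release.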
|
@shaneknapp I think we would be better off just upping the minimum version of arrow to 0.10.0 here since it's pretty involved to get a test matrix up and running and the project is still in a fair amount of flux until a stable 1.0 is released. What are your thoughts on this @HyukjinKwon @cloud-fan @holdenk ? |
|
SGTM |
|
Upping PyArrow to 0.10.0 sounds fine to me within the Jenkins environment, considering 2.4.0 is close. We are already not testing all the combinations, and at least I manually test other combinations locally. As for raising the minimum PyArrow version for Spark itself in the code base, wouldn't it be better to do that after the 2.4.0 release and target 3.0.0? |
Yea, makes sense, since we don't have time to do the follow-ups (like binary type support) within Spark 2.4. |
|
Test build #94407 has finished for PR 21939 at commit
|
|
retest this please |
|
Ok, so we will up the pyarrow version in Jenkins to 0.10.0, but keep the minimum version in python/setup.py as 0.8.0 for now, correct? |
|
Nice! Thanks for getting that running @shaneknapp . So what are people's thoughts about merging this for 2.4, since it passes normal tests with pyarrow 0.8.0 and we've also shown it passes with 0.10.0? |
|
@BryanCutler, not a big deal but why don't we link Arrow JIRA for "Allow for adding BinaryType support" too? |
@HyukjinKwon I added the link, must have forgotten that from before |
|
Sorry I didn't follow all the discussions here. @BryanCutler Do you mean we will upgrade Arrow to 0.10.0 on the Java side, but leave the Python side as it is? So people can still use PySpark with pyarrow 0.8.0 and Python 3.4? And if they go with Arrow 0.10.0 and Python 3.5, they get these bug fixes? |
|
@cloud-fan the 0.10.0 tests are passing both on the new, temporary testing box i set up (python3.5 + arrow 0.10.0), as well as the standard 3.4/0.8.0 deployments (both ubuntu and centos). since the 0.10.0 tests are passing on the 0.8.0 workers, i think merging would be fine. |
@cloud-fan , that is correct. This PR updates the Java artifact, and since we are not bumping up the minimum pyarrow version, there is nothing that needs to be done in the Python code. It would be best to have pyarrow 0.10.0 in our CI, but @shaneknapp has run tests and I have also run them locally, so I am confident enough that there are no issues using Arrow Java 0.10.0 with pyarrow 0.8.0 to 0.10.0. There were also no binary-compatibility-breaking changes made to the Arrow format since 0.8.0. |
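The compatibility claim above (Arrow Java 0.10.0 working with pyarrow 0.8.0 to 0.10.0) amounts to a tested version range. A rough sketch of classifying an installed pyarrow version against that range follows. The function `classify_pyarrow` and its labels are hypothetical, and the range bounds are assumptions taken from this comment, not from Spark's code.

```python
# Hypothetical sketch: classify a pyarrow version against the tested range.

# Assumption: the range reported as tested in this thread.
MINIMUM_PYARROW = "0.8.0"
NEWEST_TESTED_PYARROW = "0.10.0"


def as_tuple(version):
    """Convert '0.10.0' to (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def classify_pyarrow(installed):
    """Return 'unsupported', 'tested', or 'untested' for a version string."""
    current = as_tuple(installed)
    if current < as_tuple(MINIMUM_PYARROW):
        return "unsupported"  # older than the declared minimum
    if current <= as_tuple(NEWEST_TESTED_PYARROW):
        return "tested"  # inside the range exercised by the tests
    return "untested"  # newer than anything reported in this thread


for candidate in ["0.7.1", "0.8.0", "0.9.0", "0.10.0", "0.11.0"]:
    print(candidate, classify_pyarrow(candidate))
```

Anything labeled "untested" is not known to be broken; it simply falls outside what was reported as verified here.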
|
great! looking forward to seeing arrow 0.10.0 come out. |
|
LGTM |
@cloud-fan Arrow has already been released and the artifacts are available - sorry I should have made a post to indicate that. This is ready to be merged (pending latest test) as long as we are ok with not bumping up the pyarrow version in our CI for the time being. Does that sound ok with you? |
HyukjinKwon
left a comment
LGTM
|
@BryanCutler @shaneknapp Thanks for your work! |
|
Test build #94729 has finished for PR 21939 at commit
|
|
retest this please |
|
Test build #94736 has finished for PR 21939 at commit
|
|
retest this please |
|
Retest this please. |
|
Test build #94748 has finished for PR 21939 at commit
|
|
merged to master, thanks for your efforts on this @shaneknapp , and thanks @cloud-fan @HyukjinKwon @viirya and @dongjoon-hyun for reviewing! |
|
@BryanCutler So, for this upgrade, even though the JVM-side dependency is 0.10, PySpark can work with any pyarrow version from 0.8 to 0.10 without problems? I am asking because on the Java side, Arrow 0.10 and 0.8 are not source compatible, and I am wondering whether we have any source compatibility concerns on the Python side. |
|
@shaneknapp what was the version of pyarrow in that build? 0.8 or 0.10? |
|
0.8: |
|
got it. Thank you! |
Upgrade Apache Arrow to 0.10.0

Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:
* Allow for adding BinaryType support ARROW-2141
* Bug fix related to array serialization ARROW-1973
* Python2 str will be made into an Arrow string instead of bytes ARROW-2101
* Python bytearrays are supported in as input to pyarrow ARROW-2141
* Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
* Cleanup pyarrow type equality checks ARROW-2423
* ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
* Improved low level handling of messages for RecordBatch ARROW-2704

existing tests

Author: Bryan Cutler <[email protected]>

Closes apache#21939 from BryanCutler/arrow-upgrade-010.

(cherry picked from commit ed075e1)
What changes were proposed in this pull request?
Upgrade Apache Arrow to 0.10.0
Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:
How was this patch tested?
existing tests