-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1149: [Plasma] Create Cython client library for Plasma #797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@pcmoritz I'm sorry about the delay, I will have a closer look most likely tomorrow |
|
Thanks, looking forward to it! The central question is how much mixing between the plasma client and pyarrow we want and also how the directory structure should look like. I kept them pretty separate for now. |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! As far as the packaging, it's a little more work to add a FindPlasma.cmake file to the pyarrow build system, but for packaging simplicity (for pip and conda) it would be easiest to maintain this as pyarrow/plasma.pyx and have a single build system and package artifact
I left minor comments about proper usage of nogil (you have to have the with nogil: block for it to release the GIL; the nogil annotation in the cdef extern blocks tells Cython that the functions are safe to use in a nogil block, but the actual release must be explicit)
The other thing is the docstrings so that things will render properly when we add these APIs to the Arrow Python documentation (http://arrow.apache.org/docs/python/api.html -- it's in need of some love as you can see)
cpp/src/plasma/plasma.h
Outdated
| #ifdef PyMODINIT_FUNC | ||
| #undef PyMODINIT_FUNC | ||
| #endif | ||
| #define PyMODINIT_FUNC extern "C" __attribute__((visibility ("default"))) PyObject* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very odd, will have to have a closer look at what is going on; I've never had any issues like this
cpp/src/plasma/plasma/plasma.pyx
Outdated
|
|
||
| cdef extern from "plasma/common.h" nogil: | ||
|
|
||
| cdef cppclass CUniqueID" plasma::UniqueID": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixed indentation -- use 4 spaces everywhere?
cpp/src/plasma/plasma/plasma.pyx
Outdated
| self.data = CUniqueID.from_binary(object_id) | ||
|
|
||
| def __richcmp__(ObjectID self, ObjectID object_id, operation): | ||
| assert operation == 2, "only equality implemented so far" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise a ValueError here instead
cpp/src/plasma/plasma/plasma.pyx
Outdated
| for object_id in object_ids: | ||
| ids.push_back(object_id.data) | ||
| result[0].resize(ids.size()) | ||
| check_status(self.client.get().Get(ids.data(), ids.size(), timeout_ms, result[0].data())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add with nogil: here to release the GIL
cpp/src/plasma/plasma/plasma.pyx
Outdated
| release_delay (int): The maximum number of objects that the client will | ||
| keep and delay releasing (for caching reasons). | ||
| """ | ||
| check_status(self.client.get().Connect(store_socket_name.encode(), manager_socket_name.encode(), release_delay)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nogil: here (may require some unboxing of the Python objects outside the with nogil block)
cpp/src/plasma/plasma/plasma.pyx
Outdated
|
|
||
| def release(self, ObjectID object_id): | ||
| """Notify Plasma that the object is no longer needed.""" | ||
| check_status(self.client.get().Release(object_id.data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nogil
cpp/src/plasma/plasma/plasma.pyx
Outdated
| object_id (str): A string used to identify an object. | ||
| """ | ||
| cdef c_bool is_contained | ||
| check_status(self.client.get().Contains(object_id.data, &is_contained)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nogil
cpp/src/plasma/plasma/plasma.pyx
Outdated
| store, the string will have length zero. | ||
| """ | ||
| cdef c_vector[uint8_t] digest = c_vector[uint8_t](kDigestSize) | ||
| check_status(self.client.get().Hash(object_id.data, digest.data())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nogil
cpp/src/plasma/plasma/plasma.pyx
Outdated
| num_bytes (int): The number of bytes to attempt to recover. | ||
| """ | ||
| cdef int64_t num_bytes_evicted = -1 | ||
| check_status(self.client.get().Evict(num_bytes, num_bytes_evicted)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nogil, and the other client calls below
cpp/src/plasma/test/test.py
Outdated
|
|
||
| USE_VALGRIND = False | ||
|
|
||
| def random_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent with 4 spaces in this file
|
Thanks a lot for the review. The packaging simplicity is a good argument, let's make it part of pyarrow. I'm traveling right now but hope to get to it this weekend. |
ci/travis_script_plasma.sh
Outdated
| } | ||
|
|
||
| # run tests for python 2.7 and 3.6 | ||
| # python_version_tests 2.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably comment this in?
| read_multiple_files(mixed_paths) | ||
|
|
||
|
|
||
| @parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably need an extra newline before @parquet
abc5600 to
48b6f90
Compare
|
@pcmoritz please ping me when I should review or you run into problems. I'm ignoring this pull request for now as there's a lot of activity in here with commits. |
|
@xhochy I'm wrapping it up right now, the last thing I'm really blocked on is the error in https://travis-ci.org/pcmoritz/arrow/jobs/253597521 in the manylinux build, see the error message If you have an idea please let me know! I did add -DARROW_PLASMA=ON to Other than that there are some doc fixes I need to do, I'll let you know when it is ready. Thanks for your help :) |
.travis.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about not also testing this on OSX, but I agree it might be too burdensome on Travis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added tests for mac os now, if it becomes too burdensome, we can switch it off
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two newlines before and after this class
python/pyarrow/plasma.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Ray, this used to look more like
if self.plasma_client.alive:
libplasma.release(self.plasma_client.conn, self.plasma_id)I think the motivation for this was that Ray (on a single machine) by default kills the plasma store manually at the end of a script. After this, a PlasmaBuffer can still go out of scope and try to send a release message to the store, and the check_status will raise an exception leaving some ugly error or warning messages.
Or is this avoided in some other way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now done in the C++ client; the file descriptor is set to -1 on disconnect, our old shutdown is disconnect and release checks for the file descriptor being -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that looks good.
6f8ed1c to
0c65b4c
Compare
cpp/src/plasma/CMakeLists.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also move these cmake_modules to the top-level so that they are available to any arrow-cpp component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's where they belong. Shall we do this as a separate PR to make it a little more visible? It might break other people's scripts.
python/manylinux1/build_arrow.sh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a check in line 67 below that plasma can be imported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for the pointer!
cpp/src/plasma/test/test.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please specify valgrind and plasma options as part of the pytest call, see the --parquet option in pyarrow/tests/conftest.py
8f7b360 to
dc3fecb
Compare
de14fa9 to
1609f48
Compare
|
The build is now green and it's ready to merge. One possible reservation is that travis times are quite long now, any thoughts on that? |
|
I will give this a last look and merge later today assuming no issues now that the 0.5.0 release vote has passed. I will also look at the build times to see if there's anything that needs to be done |
|
Great, thanks! Next steps from our side is a little bit more high level documentation (I'll create a PR once this is merged) and then the blog post. |
|
Sounds good. We can probably publish the blog post right after the IP clearance process is concluded |
|
There's a significant amount of redundant work happening in the build. We should probably only be testing the built-in third-party toolchain (the ExternalProjects) in one entry in the build matrix and using toolchain libraries in all the others. This would probably be simplest to tackle in a follow up patch. I can take a look |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments -- sorry to delay getting this in, but I think we should remove the MutableBuffer classes and instead have a mutable property that forwards buffer->is_mutable(). I can look into speeding up the builds in a separate patch
python/doc/source/api.rst
Outdated
| ObjectID | ||
| PlasmaClient | ||
| PlasmaBuffer | ||
| MutablePlasmaBufferMutablePlasmaBuffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo. Can address when working on documentation generally
python/pyarrow/includes/libarrow.pxd
Outdated
| int64_t GetExtentBytesWritten() | ||
|
|
||
| cdef cppclass CFixedSizeBufferWrite" arrow::io::FixedSizeBufferWriter"(WriteableFile): | ||
| CFixedSizeBufferWrite(const shared_ptr[CBuffer]& buffer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo
python/pyarrow/io.pxi
Outdated
| return self.size | ||
|
|
||
|
|
||
| cdef class MutableBuffer(Buffer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since buffers have an is_mutable() function, is this needed (versus adding a mutable property to Buffer)?
python/pyarrow/plasma.pyx
Outdated
| self.wr_file.reset(new CFixedSizeBufferWrite(buffer.buffer)) | ||
| self.is_readable = 0 | ||
| self.is_writeable = 1 | ||
| self.is_open = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to move this to io.pxi
python/pyarrow/plasma.pyx
Outdated
| self.client.release(self.object_id) | ||
|
|
||
|
|
||
| cdef class MutablePlasmaBuffer(MutableBuffer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if this can be collapsed per the discussion above with MutableBuffer
python/pyarrow/tests/test_plasma.py
Outdated
| use_valgrind=os.getenv("PLASMA_VALGRIND") == "1") | ||
| # Connect to Plasma. | ||
| self.plasma_client = plasma.PlasmaClient() | ||
| self.plasma_client.connect(plasma_store_name, "", 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about adding a top level function pyarrow.plasma.connect that returns a fully-baked client? This API is a slight rough edge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right, that seems better!
| self.with_plasma = strtobool( | ||
| os.environ.get('PYARROW_WITH_PLASMA', '0')) | ||
| if self.with_plasma: | ||
| if self.with_plasma and "plasma" not in self.CYTHON_MODULE_NAMES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function initialize_options seems to have been called twice and plasma was added twice to CYTHON_MODULE_NAMES, which broke the "python setup.py develop". If there is a better fix let me know!
| def _failure_permitted(self, name): | ||
| if name == '_parquet' and not self.with_parquet: | ||
| return True | ||
| if name == 'plasma' and not self.with_plasma: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this for good measure, not sure where/if it is necessary
| c_string manager_socket_name | ||
|
|
||
| def __cinit__(self): | ||
| def __cinit__(self, store_socket_name, manager_socket_name, int release_delay): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As one rough edge in Cython, if you call the ctor with the wrong parameters you will generally segfault, which is why factory methods are usually safer. We can tackle this in a follow up patch
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks @pcmoritz!!
|
Thanks for merging! If you could look into the ExternalProjects build times that'd be great (I'm not sure what is involved there). I can see if we can make the actual plasma tests a little faster (and can also change the PlasmaClient -> connect API). |
This PR introduces a Cython API to Plasma, a FindPlasma.cmake to make it easier to integrate Plasma with CMake projects and sets up packaging with pyarrow.