Skip to content

Conversation

@pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Jun 30, 2017

This PR introduces a Cython API to Plasma, a FindPlasma.cmake to make it easier to integrate Plasma with CMake projects and sets up packaging with pyarrow.

@wesm
Copy link
Member

wesm commented Jul 1, 2017

@pcmoritz I'm sorry about the delay, I will have a closer look most likely tomorrow

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jul 1, 2017

Thanks, looking forward to it! The central question is how much mixing between the plasma client and pyarrow we want and also how the directory structure should look like. I kept them pretty separate for now.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! As far as the packaging, it's a little more work to add a FindPlasma.cmake file to the pyarrow build system, but for packaging simplicity (for pip and conda) it would be easiest to maintain this as pyarrow/plasma.pyx and have a single build system and package artifact

I left minor comments about proper usage of nogil (you have to have the with nogil: block for it to release the GIL; the nogil annotation in the cdef extern blocks tells Cython that the functions are safe to use in a nogil block, but the actual release must be explicit)

The other thing is the docstrings so that things will render properly when we add these APIs to the Arrow Python documentation (http://arrow.apache.org/docs/python/api.html -- it's in need of some love as you can see)

#ifdef PyMODINIT_FUNC
#undef PyMODINIT_FUNC
#endif
#define PyMODINIT_FUNC extern "C" __attribute__((visibility ("default"))) PyObject*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very odd, will have to have a closer look at what is going on; I've never had any issues like this


cdef extern from "plasma/common.h" nogil:

cdef cppclass CUniqueID" plasma::UniqueID":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed indentation -- use 4 spaces everywhere?

self.data = CUniqueID.from_binary(object_id)

def __richcmp__(ObjectID self, ObjectID object_id, operation):
assert operation == 2, "only equality implemented so far"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raise a ValueError here instead

for object_id in object_ids:
ids.push_back(object_id.data)
result[0].resize(ids.size())
check_status(self.client.get().Get(ids.data(), ids.size(), timeout_ms, result[0].data()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add with nogil: here to release the GIL

release_delay (int): The maximum number of objects that the client will
keep and delay releasing (for caching reasons).
"""
check_status(self.client.get().Connect(store_socket_name.encode(), manager_socket_name.encode(), release_delay))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil: here (may require some unboxing of the Python objects outside the with nogil block)


def release(self, ObjectID object_id):
"""Notify Plasma that the object is no longer needed."""
check_status(self.client.get().Release(object_id.data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

object_id (str): A string used to identify an object.
"""
cdef c_bool is_contained
check_status(self.client.get().Contains(object_id.data, &is_contained))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

store, the string will have length zero.
"""
cdef c_vector[uint8_t] digest = c_vector[uint8_t](kDigestSize)
check_status(self.client.get().Hash(object_id.data, digest.data()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

num_bytes (int): The number of bytes to attempt to recover.
"""
cdef int64_t num_bytes_evicted = -1
check_status(self.client.get().Evict(num_bytes, num_bytes_evicted))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil, and the other client calls below


USE_VALGRIND = False

def random_name():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent with 4 spaces in this file

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jul 4, 2017

Thanks a lot for the review. The packaging simplicity is a good argument, let's make it part of pyarrow. I'm traveling right now but hope to get to it this weekend.

}

# run tests for python 2.7 and 3.6
# python_version_tests 2.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably comment this in?

read_multiple_files(mixed_paths)


@parquet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need an extra newline before @parquet

@pcmoritz pcmoritz force-pushed the plasma-cython branch 7 times, most recently from abc5600 to 48b6f90 Compare July 14, 2017 12:36
@xhochy
Copy link
Member

xhochy commented Jul 14, 2017

@pcmoritz please ping me when I should review or you run into problems. I'm ignoring this pull request for now as there's a lot of activity in here with commits.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jul 14, 2017

@xhochy I'm wrapping it up right now, the last thing I'm really blocked on is the error in https://travis-ci.org/pcmoritz/arrow/jobs/253597521 in the manylinux build, see the error message
"ImportError: libplasma.so.0: cannot open shared object file: No such file or directory"

If you have an idea please let me know! I did add -DARROW_PLASMA=ON to python/manylinux1/build_arrow.sh and FindPlasma.cmake seems to find libplasma.so at least.

Other than that there are some doc fixes I need to do, I'll let you know when it is ready.

Thanks for your help :)

.travis.yml Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned about not also testing this on OSX, but I agree it might be too burdensome on Travis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for mac os now, if it becomes too burdensome, we can switch it off

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two newlines before and after this class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Ray, this used to look more like

if self.plasma_client.alive:
    libplasma.release(self.plasma_client.conn, self.plasma_id)

I think the motivation for this was that Ray (on a single machine) by default kills the plasma store manually at the end of a script. After this, a PlasmaBuffer can still go out of scope and try to send a release message to the store, and the check_status will raise an exception leaving some ugly error or warning messages.

Or is this avoided in some other way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done in the C++ client; the file descriptor is set to -1 on disconnect, our old shutdown is disconnect and release checks for the file descriptor being -1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also move these cmake_modules to the top-level so that they are available to any arrow-cpp component.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's where they belong. Shall we do this as a separate PR to make it a little more visible? It might break other people's scripts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a check in line 67 below that plasma can be imported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for the pointer!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify valgrind and plasma options as part of the pytest call, see the --parquet option in pyarrow/tests/conftest.py

@pcmoritz pcmoritz force-pushed the plasma-cython branch 8 times, most recently from 8f7b360 to dc3fecb Compare July 19, 2017 07:38
@pcmoritz pcmoritz force-pushed the plasma-cython branch 4 times, most recently from de14fa9 to 1609f48 Compare July 19, 2017 18:41
@pcmoritz
Copy link
Contributor Author

The build is now green and it's ready to merge. One possible reservation is that travis times are quite long now, any thoughts on that?

@wesm
Copy link
Member

wesm commented Jul 23, 2017

I will give this a last look and merge later today assuming no issues now that the 0.5.0 release vote has passed. I will also look at the build times to see if there's anything that needs to be done

@pcmoritz
Copy link
Contributor Author

Great, thanks! Next steps from our side is a little bit more high level documentation (I'll create a PR once this is merged) and then the blog post.

@wesm
Copy link
Member

wesm commented Jul 23, 2017

Sounds good. We can probably publish the blog post right after the IP clearance process is concluded

@wesm
Copy link
Member

wesm commented Jul 24, 2017

There's a significant amount of redundant work happening in the build. We should probably only be testing the built-in third-party toolchain (the ExternalProjects) in one entry in the build matrix and using toolchain libraries in all the others. This would probably be simplest to tackle in a follow up patch. I can take a look

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments -- sorry to delay getting this in, but I think we should remove the MutableBuffer classes and instead have a mutable property that forwards buffer->is_mutable(). I can look into speeding up the builds in a separate patch

ObjectID
PlasmaClient
PlasmaBuffer
MutablePlasmaBufferMutablePlasmaBuffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo. Can address when working on documentation generally

int64_t GetExtentBytesWritten()

cdef cppclass CFixedSizeBufferWrite" arrow::io::FixedSizeBufferWriter"(WriteableFile):
CFixedSizeBufferWrite(const shared_ptr[CBuffer]& buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo

return self.size


cdef class MutableBuffer(Buffer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since buffers have an is_mutable() function, is this needed (versus adding a mutable property to Buffer)?

self.wr_file.reset(new CFixedSizeBufferWrite(buffer.buffer))
self.is_readable = 0
self.is_writeable = 1
self.is_open = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to move this to io.pxi

self.client.release(self.object_id)


cdef class MutablePlasmaBuffer(MutableBuffer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if this can be collapsed per the discussion above with MutableBuffer

use_valgrind=os.getenv("PLASMA_VALGRIND") == "1")
# Connect to Plasma.
self.plasma_client = plasma.PlasmaClient()
self.plasma_client.connect(plasma_store_name, "", 64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding a top level function pyarrow.plasma.connect that returns a fully-baked client? This API is a slight rough edge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right, that seems better!

self.with_plasma = strtobool(
os.environ.get('PYARROW_WITH_PLASMA', '0'))
if self.with_plasma:
if self.with_plasma and "plasma" not in self.CYTHON_MODULE_NAMES:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function initialize_options seems to have been called twice and plasma was added twice to CYTHON_MODULE_NAMES, which broke the "python setup.py develop". If there is a better fix let me know!

def _failure_permitted(self, name):
if name == '_parquet' and not self.with_parquet:
return True
if name == 'plasma' and not self.with_plasma:
Copy link
Contributor Author

@pcmoritz pcmoritz Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this for good measure, not sure where/if it is necessary

c_string manager_socket_name

def __cinit__(self):
def __cinit__(self, store_socket_name, manager_socket_name, int release_delay):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As one rough edge in Cython, if you call the ctor with the wrong parameters you will generally segfault, which is why factory methods are usually safer. We can tackle this in a follow up patch

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @pcmoritz!!

@asfgit asfgit closed this in a94f471 Jul 24, 2017
@pcmoritz
Copy link
Contributor Author

Thanks for merging! If you could look into the ExternalProjects build times that'd be great (I'm not sure what is involved there). I can see if we can make the actual plasma tests a little faster (and can also change the PlasmaClient -> connect API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants