Add custom serialization support for pyarrow #2115
Conversation
Tests incoming - just wanted to get this up here for consideration in the meantime.
To be explicit, the objective would be to get something like a pyarrow RecordBatch or Table round-tripping cleanly through Dask's serialization machinery.
So if I were to do this for a pandas dataframe I would probably pull the underlying bytes out and pass them along as frames.
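For context, here is a minimal sketch of the kind of serialize/deserialize pair this PR adds for pyarrow record batches. The function bodies follow the snippets reviewed below; the `register_serialization` hook and the use of `getvalue()` (spelled `get_result()` in the pyarrow of the day) are assumptions about the surrounding API, not code from the PR itself.

```python
import pyarrow as pa
from distributed.protocol.serialize import register_serialization


def serialize_batch(batch):
    # Write the batch into an in-memory Arrow IPC stream.
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()  # older pyarrow spells this sink.get_result()
    header = {}
    frames = [buf]
    return header, frames


def deserialize_batch(header, frames):
    # Read the single batch back out of the first frame.
    reader = pa.RecordBatchStreamReader(pa.BufferReader(frames[0]))
    return reader.read_next_batch()


# Assumed registration hook; distributed exposed register_serialization
# for plugging custom types into its protocol around this time.
register_serialization(pa.RecordBatch, serialize_batch, deserialize_batch)
```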
So the solution where you scatter explicitly should work fine, but when working with coroutines you'll need to yield the scatter result. This shouldn't be an issue if you use the normal API. The rest of the tests here are related to #2110. It's also worth noting that this shouldn't be an issue for worker-to-worker transfers. So if your tasks generate and then pass around arrow objects then things will also be fine with this change. This is only an issue when you push data into Dask.
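To illustrate the coroutine caveat above, a hedged sketch (the client setup and the toy compute function are illustrative, not from the PR; the client is assumed to be created with `Client(asynchronous=True)`):

```python
from tornado import gen
import pyarrow as pa
from distributed import Client


@gen.coroutine
def scatter_batch(client, batch):
    # Yield the scatter result so the data actually lands on a worker
    # before the resulting future is handed to submit().
    [future] = yield client.scatter([batch])
    nrows = yield client.submit(lambda b: b.num_rows, future)
    return nrows
```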
In the future, we will support pickling these objects. The return values of …
I think this is good to go now. All the tests, including the scatter test, pass for me locally. If I've implemented it correctly I'd hope that it might be more efficient than pickling, at least until PEP 574 is passed and apache/arrow#2161 is merged.
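As background on the PEP 574 reference (not part of this PR): pickle protocol 5 later added out-of-band buffers, which is roughly the mechanism that would make plain pickling competitive for large binary payloads. A small illustration, assuming Python 3.8+:

```python
import pickle

payload = bytearray(b"x" * 1_000_000)  # stand-in for a large Arrow buffer
out_of_band = []
data = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=out_of_band.append)
# `data` is now a small reference; the payload travels separately,
# so a transport can send it without an extra in-memory copy.
restored = pickle.loads(data, buffers=out_of_band)
assert bytes(restored) == bytes(payload)
```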
If you don't mind giving me a chance to review, I'll have a closer look tomorrow.
wesm left a comment:
LGTM aside from questions around sending pyarrow.Buffer directly
distributed/protocol/arrow.py (outdated)

    writer.close()
    buf = sink.get_result()
    header = {}
    frames = [buf.to_pybytes()]
This causes an extra memory copy. Can frames contain objects exporting the buffer protocol?
That's a good question. It looks like it can, with a small modification (pushed). Presumably this means that PyArrow also works with sockets and Tornado IOStreams. Hooray for consistent use of protocols.
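Roughly, the modification discussed above amounts to putting the pyarrow Buffer into the frames list directly instead of materializing a bytes copy; a sketch of the before/after (variable names follow the snippet above):

```python
# Before: an extra copy into a Python bytes object.
frames = [buf.to_pybytes()]

# After: pass the pyarrow.Buffer straight through; it exports the
# buffer protocol, so the comm layer can write it without copying.
frames = [buf]
```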
    def deserialize_batch(header, frames):
        import pyarrow as pa
        blob = frames[0]
        reader = pa.RecordBatchStreamReader(pa.BufferReader(blob))
I opened ARROW-2859 to see if we can get rid of this pa.BufferReader detail
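For what it's worth, later pyarrow releases let the stream reader accept buffer-like objects directly (for example through pa.ipc.open_stream), which is the direction ARROW-2859 points at; a hedged sketch of what the deserializer could then look like:

```python
import pyarrow as pa


def deserialize_batch(header, frames):
    # No explicit pa.BufferReader wrapper needed; open_stream accepts
    # bytes/buffer-like objects directly in newer pyarrow.
    reader = pa.ipc.open_stream(frames[0])
    return reader.read_next_batch()
```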
distributed/protocol/arrow.py (outdated)

    writer.close()
    buf = sink.get_result()
    header = {}
    frames = [buf.to_pybytes()]
Awesome. Is there something we can do to make the Python API for Buffer more conforming?
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
One improvement on the Arrow side would be if RecordBatchStreamWriter were a context manager, as that would avoid the need for an explicit close.
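For illustration, this is what the write path could look like if the writer supported the context-manager protocol (newer pyarrow does grow this support; `getvalue()` here stands in for the older `get_result()`):

```python
import pyarrow as pa


def serialize_batch(batch):
    sink = pa.BufferOutputStream()
    # The with-block closes the writer, replacing the explicit close().
    with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return {}, [sink.getvalue()]
```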
Merging this in a few hours if there are no further comments.
No, this was due to ugliness on the Dask side.
Thanks @dhirschfeld! Merged.
