Conversation

@jakirkham jakirkham commented May 26, 2020

Subclasses `Serializable` for Dask-cuDF objects and implements `serialize` and `deserialize` methods (replacing `__getstate__` and `__setstate__`). This should allow more efficient serialization when possible; otherwise it can still fall back to pickle.
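The pattern described above can be illustrated with a minimal, dependency-free sketch (the names here are illustrative, not the actual dask_cudf implementation): objects expose a `serialize()` returning a small metadata header plus a list of byte frames, a `deserialize()` classmethod rebuilds the object, and pickle is routed through the same path as a fallback.

```python
import pickle


class SerializableFrame:
    """Toy stand-in for a Serializable Dask-cuDF collection."""

    def __init__(self, name, data):
        self.name = name
        self.data = data  # stand-in for device buffers

    def serialize(self):
        # Header: small, cheap-to-encode metadata.
        # Frames: large byte buffers a smarter transport can move efficiently.
        header = {"name": self.name}
        frames = [bytes(self.data)]
        return header, frames

    @classmethod
    def deserialize(cls, header, frames):
        return cls(header["name"], frames[0])

    # Pickle fallback routed through the same serialize/deserialize pair.
    def __reduce__(self):
        header, frames = self.serialize()
        return (_rebuild, (header, frames))


def _rebuild(header, frames):
    return SerializableFrame.deserialize(header, frames)


obj = SerializableFrame("x", b"\x00\x01\x02")

# Explicit round trip
restored = SerializableFrame.deserialize(*obj.serialize())

# Pickle fallback round trip
pickled = pickle.loads(pickle.dumps(obj))
```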

Note: Adding PR ( dask/distributed#3832 ) upstream to make sure Dask registers Dask-cuDF object serialization methods.

cc @rjzamora

@rjzamora
Member

Thanks for the work here @jakirkham. It definitely makes sense to take advantage of the simplifications you made in #5139.

I know there is not much serialization test coverage in `dask_cudf` at the moment. Perhaps this would be a good time to add a simple test for `_Frame`-level serialization?

@jakirkham
Member Author

Yeah, that makes sense. Do you have an example of where `_Frame` would show up in serialization? That should help me figure out some appropriate tests 🙂

@rjzamora
Member

> Yeah that makes sense. Do you have an example of where `_Frame` would show up in serialization? That should help me figure out some appropriate tests

Good question! Perhaps an appropriate test would be to take a simple `dask_cudf` DataFrame like `dask_cudf.from_cudf(cudf.datasets.timeseries().iloc[:12], npartitions=3)` (or even simpler) and make sure you get the same dataframe when you do `dask_cudf.core._Frame.deserialize(*df.serialize())` or `pickle.loads(pickle.dumps(df))`.

Member

Suggested change:
-    return cls(dsk=dsk, name=name, meta=meta, divisions=divisions)
+    return dd.core.new_dd_object(dsk=dsk, name=name, meta=meta, divisions=divisions)

Member

I think you will run into problems relying on `__init__` here; `new_dd_object` should smooth things over.

Member Author

So `cls` here should be whatever is subclassing `_Frame` (like `DataFrame`, `Series`, etc.). Is it not safe to call their constructors?

@rjzamora rjzamora May 27, 2020

I would have done the same as you, but noticed that the constructor assumes `self._partition_type` is defined elsewhere (without doing anything like `super().__init__()`). I haven't looked carefully into the possibility that this is an oversight. However, I do know that we typically use `new_dd_object` in both `dask_cudf` and `dask.dataframe` to create new `_Frame` objects.
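For intuition, a helper like `new_dd_object` can choose the collection class from the type of the `meta` object rather than trusting a `cls` argument. Here is a hypothetical, dependency-free sketch of that dispatch pattern (toy classes stand in for cudf/dask_cudf types; this is not dask's actual code):

```python
# Hypothetical sketch of meta-type dispatch, illustrating how a
# new_dd_object-style helper can pick the right collection class from
# the metadata alone.
class MetaDataFrame:  # stands in for an empty cudf.DataFrame
    pass


class MetaSeries:  # stands in for an empty cudf.Series
    pass


class DaskDataFrame:
    _partition_type = MetaDataFrame

    def __init__(self, dsk, name, meta, divisions):
        self.dask, self._name = dsk, name
        self._meta, self.divisions = meta, divisions


class DaskSeries:
    _partition_type = MetaSeries

    def __init__(self, dsk, name, meta, divisions):
        self.dask, self._name = dsk, name
        self._meta, self.divisions = meta, divisions


# Registry mapping meta types to collection classes, checked in order.
_COLLECTIONS = [(MetaDataFrame, DaskDataFrame), (MetaSeries, DaskSeries)]


def new_dd_object_sketch(dsk, name, meta, divisions):
    for meta_type, collection_cls in _COLLECTIONS:
        if isinstance(meta, meta_type):
            return collection_cls(dsk, name, meta, divisions)
    raise TypeError(f"Unsupported meta type: {type(meta)}")


obj = new_dd_object_sketch({}, "x", MetaSeries(), (None, None))
```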

@jakirkham jakirkham May 27, 2020

Good catch! It seems like all of the subclasses implement `_partition_type`. So I think we are safe.

That said, I don't have strong feelings either way. Do you know how `new_dd_object` knows what type it is creating? Is there some other metadata or context we should be giving it?

Member

> It seems like all of the subclasses implement `_partition_type`. So I think we are safe.

Ah, good point. If it works for both `Series` and `DataFrame`, I am fine with using `cls` (it seems more intuitive anyway).

@jakirkham jakirkham force-pushed the use_serializable_dask_cudf branch from c967c95 to 4b8f6bf May 27, 2020 04:18
def serialize(self):
    meta_header, meta_frames = self._meta.serialize()
    header = {
        "dask": pickle.dumps(self.dask),
Member Author

We might be able to get away with not pickling this and just adding it to the header as-is, though I am a little worried that issues like ( dask/distributed#3716 ) might bite us. That said, that bug seems solvable now. I suppose we could just try it and fix any bugs that crop up. WDYT?

@rjzamora rjzamora May 27, 2020

Seems reasonable to try. So, is the header typically serialized with msgpack? Isn't it possible for the user to inject some messy types into the graph that msgpack will choke on? (Sorry for my lack of knowledge here.)

Member Author

Yeah, I think the header is required to be msgpack serializable.

Good point! I suppose we can call `serialize(...)` on the graph. Or are there other pitfalls you imagine here?
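To make the concern above concrete, here is a small, dependency-free sketch (a hypothetical helper, not part of dask/distributed) showing why a raw task graph is not msgpack-friendly while its pickled bytes are: msgpack only handles string keys and simple scalar/bytes values natively, whereas a Dask graph contains tuple keys and callables.

```python
import pickle

# Types msgpack can encode natively (hypothetical, simplified check).
MSGPACK_SAFE = (str, bytes, int, float, bool, type(None))


def header_is_msgpack_safe(obj):
    """Recursively check that obj only contains msgpack-friendly types."""
    if isinstance(obj, dict):
        return all(
            isinstance(k, str) and header_is_msgpack_safe(v)
            for k, v in obj.items()
        )
    if isinstance(obj, (list, tuple)):
        return all(header_is_msgpack_safe(v) for v in obj)
    return isinstance(obj, MSGPACK_SAFE)


# A task graph has tuple keys and callables: msgpack would choke on it.
graph = {("x", 0): (sum, [1, 2, 3])}

raw_header = {"dask": graph}                  # messy types leak into header
pickled_header = {"dask": pickle.dumps(graph)}  # opaque bytes: msgpack-safe
```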

@harrism harrism added the "2 - In Progress" (Currently a work in progress), "Python" (Affects Python cuDF API), and "dask-cudf" labels Jun 16, 2020
@kkraus14
Collaborator

Given this has gone stale, I'm going to close this PR, but we can reopen it in the future.

@kkraus14 kkraus14 closed this Aug 17, 2020
@jakirkham jakirkham deleted the use_serializable_dask_cudf branch August 17, 2020 22:40
@vyasr vyasr added the "dask" (Dask issue) label and removed the "dask-cudf" label Feb 23, 2024