-
Notifications
You must be signed in to change notification settings - Fork 1k
WIP: Implement Serializable for Dask-cuDF objects
#5294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Implement Serializable for Dask-cuDF objects
#5294
Conversation
|
Thanks for the work here @jakirkham - It definitely makes sense to take advantage of the simplifications you made in #5139 I know there is not much serialization test coverage in dask_cudf at the moment. Perhaps this would be a good time to add a simple test for |
|
Yeah that makes sense. Do you have an example of where |
Good question! Perhaps an appropriate test would be to take a simple dask_cudf DataFrame like |
python/dask_cudf/dask_cudf/core.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return cls(dsk=dsk, name=name, meta=meta, divisions=divisions) | |
| return dd.core.new_dd_object(dsk=dsk, name=name, meta=meta, divisions=divisions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you will run into problems relying on __init__ here - new_dd_object should smooth things over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So cls here should be whatever is subclassing _Frame (like DataFrame, Series, etc.). Is it not safe to call their constructors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have done the same as you - but noticed that the constructor assumes self._partition_type is defined elsewhere (without doing anything like super().__init__). I haven't looked carefully into the possibility that this is an oversight. However, I do know that we typically use new_dd_object in both dask_cudf and dask.dataframe to create new _Frame objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! It seems like all of the subclasses implement _partition_type. So I think we are safe.
That said, I don't have strong feelings either way. Do you know how new_dd_object knows what type it is creating? Is there some other metadata or context we should be giving it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like all of the subclasses implement _partition_type. So I think we are safe.
Ah good point - If it works for both Series and DataFrame I am fine with using cls (seems more intuitive anyway)
c967c95 to
4b8f6bf
Compare
| def serialize(self): | ||
| meta_header, meta_frames = self._meta.serialize() | ||
| header = { | ||
| "dask": pickle.dumps(self.dask), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to get away with not pickling this and just adding it to the header as-is. Though am a little worried that issues like ( dask/distributed#3716 ) might bite us. Though that bug seems solvable now. I suppose we could just try it and fix any bugs that crop up. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to try - So, is the header typically serialized with msgpack? Isn't it possible for the user to inject some messy types into the graph that msgpack will choke on? (sorry for my lack of knowledge here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think the header is required to be msgpack serializable.
Good point! I suppose we can call serialize(...) on the graph. Or are there other pitfalls you imagine here?
|
Given this has gone stale I'm going to close this PR, but we can reopen in the future. |
Subclasses
Serializablefor Dask-cuDF objects and implementsserializeanddeserializemethods (replacing__getstate__and__setstate__). Should allow more efficient serialization when possible. Otherwise it can still fallback to pickle.Note: Adding PR ( dask/distributed#3832 ) upstream to make sure Dask registers Dask-cuDF object serialization methods.
cc @rjzamora