diff --git a/rfcs/20200902-pickle-for-keras.md b/rfcs/20200902-pickle-for-keras.md new file mode 100644 index 000000000..e2afa4db1 --- /dev/null +++ b/rfcs/20200902-pickle-for-keras.md @@ -0,0 +1,292 @@ +# Support for Pickle, Python's serialization protocol + +| Status | Proposed | +:-------------- |:---------------------------------------------------- | +| **RFC #** | [286](https://github.com/tensorflow/community/pull/286) | +| **Author(s)** | Adrian Garcia Badaracco ({firstname}@{firstname}gb.com), Scott Sievert (tf-rfc@stsievert.com) | +| **Sponsor** | Mihai Maruseac (mihaimaruseac@google.com) | +| **Updated** | 2020-09-02 | + +## Objective + +Implement support for Pickle, Python's serialization protocol within Keras. + +## Motivation + +> *Why this is a valuable problem to solve? What background information is +> needed to show how this design addresses the problem?* + +The specific motivation for this RFC: we want to use Keras models in Dask-ML's +and Ray's hyperparameter optimization. More generally, support for serialization +with the Pickle protocol will enable: + +* Using Keras with other parallelization libraries like Python's + `multiprocessing`, Dask, Ray or IPython parallel. +* Saving Keras models to disk with custom serialization libraries like Joblib + or Dill. This is common when using a Keras model as part of a Scikit-Learn + pipeline or with their hyperparameter searches. +* Copying Keras models with Python's built-in `copy.deepcopy`. + +Supporting Pickle will enable wider usage in the Python ecosystem because +Python's ecosystems of libraries depend strongly on the presence of protocols. +Without these protocols, it's necessary for each library to implement a custom +serialization method for every other library. For example, Dask Distributed has +a custom serialization method for Keras at [distributed/protocol/keras.py]. +See "[Pickle isn't slow, it's a protocol]" for more detail (notably, this post +focuses on having an efficient Pickle implementation for PyTorch). + +[distributed/protocol/keras.py]:https://github.com/dask/distributed/blob/73fa9bd1bd7dcb4ceed72cdbdc6dd4b92f887521/distributed/protocol/keras.py + +This request is *not* advocating for use of Pickle while saving or sharing +Keras models. We believe the efficient, secure and stable methods in TF should +be used for that. Instead, we are proposing to add a Pickle implementation to +support wider usage in the Python ecosystem. + +[Pickle isn't slow, it's a protocol]:https://blog.dask.org/2018/07/23/protocols-pickle + +> *Which users are affected by the problem? Why is it a problem? What data +> supports this? What related work exists?* + +Users trying to use distributed systems (e.g, Ray or Dask) with Keras are +affected. In our experience, this is common in hyperparameter optimization. In +general, having Pickle support means a better experience, especially when using +Keras with other libraries. Briefly, implementation of this RFC will make the +following possible: + +* Saving a Scikit-Learn pipeline to disk if it includes a Keras model +* Using custom parallelization like Joblib or Dask. + +More use cases and examples are give in "User Benefit." + +Related work is in [SciKeras], which brings a Scikit-Learn API +to Keras. Pickle is relevant because Scikit-Learn requires that estimators must be able to be pickled ([source][skp]). +As such, SciKeras has an implementation of `__reduce__`, which is also in +[tensorflow#39609]. + +[dask-ml#534]:https://github.com/dask/dask-ml/issues/534 +[SO#51110834]:https://stackoverflow.com/questions/51110834/cannot-pickle-dill-a-keras-object +[SO#54070845]:https://stackoverflow.com/questions/54070845/how-to-pickle-keras-custom-layer +[SO#59872509]:https://stackoverflow.com/questions/59872509/how-to-export-a-model-created-from-kerasclassifier-and-gridsearchcv-using-joblib +[SO#37984304]:https://stackoverflow.com/questions/37984304/how-to-save-a-scikit-learn-pipline-with-keras-regressor-inside-to-disk +[SO#48295661]:https://stackoverflow.com/questions/48295661/how-to-pickle-keras-model +[skper]:https://scikit-learn.org/stable/modules/model_persistence.html#persistence-example +[TF#33204]:https://github.com/tensorflow/tensorflow/issues/33204 +[TF#34697]:https://github.com/tensorflow/tensorflow/issues/34697 + +[tensorflow#39609]:https://github.com/tensorflow/tensorflow/pull/39609 +[SciKeras]:https://github.com/adriangb/scikeras +[skp]:https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/utils/estimator_checks.py#L1523-L1524 + + + + +## User Benefit + +> How will users (or other contributors) benefit from this work? What would be the headline in the release notes or blog post? + +One blog post headline: "Keras models can be used with the advanced +hyperparameter optimization techniques found in Dask-ML and Ray Tune." This has +already been mentioned in "Framework support" of [a Dask blog post][dbp] +comparing Dask-ML's hyperparameter optimization with Ray's tune-sklearn. + +[dbp]:https://blog.dask.org/2020/08/06/ray-tune#framework-support + +Users will also benefit with easier usage; they won't run into any of these +errors: + +* People try to save Scikit-Learn meta-estimators with Keras components using + the serialization libraries Joblib or Dill. + This fails because Keras models can not be serialized without a custom + method. Examples include [SO#59872509], [SO#37984304] and + [SO#48295661], and [SO#51110834]. +* Using custom parallelization strategies requires serialization support through + Pickle; however, many parallelization libraries don't + special case Keras models (e.g, Joblib). Relevant errors are most common in hyperparameter + optimization with Scikit-Learn's parallelization through Joblib + ([TF#33204] and [TF#34697]) or parallelization through Dask ([dask-ml#534]). +* Lack of Pickle support can complicate saving training history like in + (the poorly asked) [SO#54070845]. + + +This RFC would resolve these issues. + +## Design Proposal + +We propose implementing the Pickle protocol using the existing Keras +saving functionality as a backend. For example, adding pickle support to TF Metrics +is as simple as the following: + +``` python +# tensorflow/python/keras/metrics.py + +@keras_export('keras.metrics.Metric') # line 80 +@six.add_metaclass(abc.ABCMeta) +class Metric(base_layer.Layer): + ... + + def __reduce__(self, protocol): + # the deserialized/serialize functions are defined in this file + return deserialize, (serialize(self),) +``` + +This implementation adds support for the Pickle protocol, which supports serialization +to arbitrary IO, either memory or disk. The `__reduce__` special method can return +the string that would have been written to disk and the function to load that string into memory ([docs][reduce_docs]). + +[reduce_docs]:https://docs.python.org/3/library/pickle.html#object.__reduce__ + +For `tf.keras.Model`, we can use `SaveModel` as the backend for `__reduce__`: + +``` python +# tensorflow/python/keras/engine/training.py +... +from tesorflow.python.keras.models import load_model + +class Model(base_layer.Layer, version_utils.ModelVersionSelector): # line 131 + ... + + def __reduce__(self, protocol): + temp_ram_location = f"ram://tmp/saving/{id(self)}" + self.save(temp_ram_location) + b = tf.io.gfile.read_folder(temp_ram_location) + return self._reconstruct_pickle, (np.asarray(memoryview(b)), ) + + @classmethod + def _reconstruct_pickle(cls, obj): + temp_ram_location = f"ram://tmp/saving/{id(obj)}" + tf.io.gfile.write_folder(temp_ram_location, b) + return load_model(temp_ram_location) +``` + + +This almost exactly mirrors the PyTorch +implementation of Pickle support in [pytorch#9184] +as mentioned in "[Pickle isn't slow, it's a protocol]." +In addition, small augmentations to TensorFlow's IO module will be required (as discussed in [tensorflow#39609]). + +By wrapping the pickled object within a Numpy array, pickling will support +pickle protocol 5 for zero-copy pickling. This provides an immediate +performance improvement for many use cases. + +[pytorch#9184]:https://github.com/pytorch/pytorch/pull/9184 + +### Alternatives Considered + +Of course, one method is to ask users to monkey-patch Keras models themselves. +This would hold for libraries too. Clearly, this is unreasonable. Regardless, +some libraries like Dask Distributed have already implemented custom serialization +protocols ([distributed/protocol/keras.py]). + +#### Other pickle implementations + +The Pickle protocol supports two features: + +1. In-memory copying of live objects: via Python's `copy` module. This falls back to (2) below. +2. Serialization to arbitrary IO (memory or disk): via Python's `pickle` module. + +This proposal seeks to take the conservative approach at least initially and +only implement (2) above since (1) can always fall back to (2) and using only +(2) alleviates any concerns around references to freed memory in the C++ +portions of TF and other such bugs. + +This said, for situations where the user is making an in-memory copy of an object and it might +even be okay to keep around references to non-Python objects, a separate approach that optimizes +(1) would be warranted. This RFC does not seek to address this problem. Hence +this RFC is generally not concerned with: + +* Issues arising from C++ references. These cannot be kept around when + serializing to a binary file stream. +* Performance of the serialization/deserialization. + +### Performance Implications + +* The performance should be the same as the underlying backend that is already + implemented in TF. +* For cases where the user was going to pickle anyway, this will be faster + because it uses TF's methods instead of letting Python deal with it naively. +* Tests will consist of running `new_model = pickle.loads(pickle.dumps(model))` + and then doing checks on `new_model`. + +### Dependencies + +* Dependencies: does this proposal add any new dependencies to TensorFlow? + * No +* Dependent projects: are there other areas of TensorFlow or things that use + TensorFlow (TFX/pipelines, TensorBoard, etc.) that this affects? + * This should not affect those libraries. It will affect libraries + further downstream like Dask-ML and Ray Tune. + +### Engineering Impact + +* Do you expect changes to binary size / startup time / build time / test + times? + * No +* Who will maintain this code? Is this code in its own buildable unit? Can this + code be tested in its own? Is visibility suitably restricted to only a small + API surface for others to use? + * This code depends on existing Keras/TF methods. This code will not break + presuming they are maintained (the new API surface area is very small). + +### Platforms and Environments + +* Platforms: does this work on all platforms supported by TensorFlow? If not, + why is that ok? Will it work on embedded/mobile? Does it impact automatic + code generation or mobile stripping tooling? Will it work with transformation + tools? + * Yes, as long as a Python backend is available. +* Execution environments (Cloud services, accelerator hardware): what impact do + you expect and how will you confirm? + * We don't see any impact. + +### Best Practices + +* Does this proposal change best practices for some aspect of using/developing + TensorFlow? How will these changes be communicated/enforced? + * No + +### Tutorials and Examples + +There are plenty of examples of how this can and would be used within all of the issues above, in addition to the linked notebook +([link again](https://colab.research.google.com/drive/14ECRN8ZQDa1McKri2dctlV_CaPkE574I?authuser=1#scrollTo=qlXDfJObNXVf)) which has +end to end implementations and tests for all of this. + +### Compatibility + +* Does the design conform to the backwards & forwards compatibility + [requirements](https://www.tensorflow.org/programmers_guide/version_compat)? + * Yes + +> *How will this proposal interact with other parts of the TensorFlow Ecosystem?* + +* How will it work with TFLite? + * N/A +* How will it work with distribution strategies? + * This enables use of other serialization libraries, which might enable support for other distribution strategies. +* How will it interact with tf.function? + * N/A +* Will this work on GPU/TPU? + * N/A +* How will it serialize to a SavedModel? + * Not applicable, and almost a circular question. + +### User Impact + +* What are the user-facing changes? How will this feature be rolled out? + +There are no user-facing changes: this is a backend change to private methods. + +Rolling out only involves testing. It will not require any documentation +changes to advertise this features: `Model.save` should still be used for users +simply trying to save their model to disk. + +## Questions and Discussion Topics + +Seed this with open questions you require feedback on from the RFC process. diff --git a/rfcs/20200902-pickle-for-keras/README.md b/rfcs/20200902-pickle-for-keras/README.md deleted file mode 100644 index 847ce227e..000000000 --- a/rfcs/20200902-pickle-for-keras/README.md +++ /dev/null @@ -1,176 +0,0 @@ -# Title of RFC - -| Status | (Proposed / Accepted / Implemented / Obsolete) | -:-------------- |:---------------------------------------------------- | -| **RFC #** | [286](https://github.com/tensorflow/community/pull/286) -| **Author(s)** | Adrian Garcia Badaracco ({firstname}@{firstname}gb.com) -| **Sponsor** | Mihai Maruseac (mihaimaruseac@google.com) | -| **Updated** | 2020-09-02 | - -## Objective - -Implement support for Python's Pickle protocol within Keras. - -## Motivation - -The pickle protocol is used extensively -within the Python ecosystem, including by [Dask](https://github.com/dask/dask), -[Scikit-Learn](https://github.com/scikit-learn/scikit-learn) and several other -popular machine learning libraries. These libraries rely on the pickle protocol and -cannot work without it. This hinders what would otherwise be great uses of Keras. - -Pickle and `copy` (referring to the specific Python module, it can use -the pickle protocol as it's backend) are also the _only_ universal way to -save Python objects. This means that this is what most users try first. -As evidenced by several of the StackOverflow/TensorFlow issues below, even if there are -TensorFlow specific ways to copy things in memory or disk, users will probably -try pickle or copy first and be confused by cryptic errors. - -Here are several examples -of Keras users running into issues because pickle is not supported. - -GH issues where `Model.save` would not work: - -* [TF#34697](https://github.com/tensorflow/tensorflow/issues/34697) -* [TF#33204](https://github.com/tensorflow/tensorflow/issues/33204) -* TODO: issues from Dask - -StackOverflow questions where `Model.save` would not work: - -* [SO#59872509](https://stackoverflow.com/questions/59872509/how-to-export-a-model-created-from-kerasclassifier-and-gridsearchcv-using-joblib) -* [SO#37984304](https://stackoverflow.com/questions/37984304/how-to-save-a-scikit-learn-pipline-with-keras-regressor-inside-to-disk) -* [SO#51110834](https://stackoverflow.com/questions/51110834/cannot-pickle-dill-a-keras-object) -* [SO#54070845](https://stackoverflow.com/questions/54070845/how-to-pickle-keras-custom-layer) -* [SO#40396042](https://stackoverflow.com/questions/40396042/how-to-save-scikit-learn-keras-model-into-a-persistence-file-pickle-hd5-json-ya) -* [SO#48295661](https://stackoverflow.com/questions/48295661/how-to-pickle-keras-model) - -Examples that could be resolved using `Model.save` (but the user tried pickle first): - -* [SO #51878627](https://stackoverflow.com/questions/51878627/pickle-keras-ann) - -Here is an post describing how PyTorch ran into similar issues and how they resolved them: [link](https://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle) - -## User Benefit - -* Lessen the learning curve for new Keras/TF users since they will be able to -use entry points they already know. -* Improve compatibility with libraries like Scikit-Learn and Dask. - -## Design Proposal - -The pickle protocol supports two distinct functions: - -1. In-memory copying of live objects: via Python's `copy` module. This falls back to (2) below. -2. Serialization to arbitrary IO (memory or disk): via Python's `pickle` module. - -This proposal seeks to take the conservative approach at least initially and only -implement (2) above since (1) can always fall back to (2) and using only (2) alleviates -any concerns around references to freed memory in the C++ portions of TF and other such bugs. - -This said, for situations where the user is making an in-memory copy of an object and it might -even be okay to keep around references to non-Python objects, a separate approach that optimizes -(1) would be warranted. This RFC does not seek to address this problem. Hence this RFC is generally -not concerned with: - -* Issues arising from C++ references. These cannot be kept around when serializing to a binary file stream. -* Performance of the serialization/deserialization. - -The general proposal is to implement the pickle protocol using existing Keras saving functionality -as a backend. For example, adding pickle/copy support to Metrics is as simple as: - -```python3 -class Metric: # in tf.keras.metrics - - def __reduce_ex__(self, protocol): - return deserialize, (serialize(metric),) # where deserialize is tf.keras.metrics.deserialize -``` - -Documentation for how to use `__reduce__ex__` and other alternatives that allow implementing of -the pickle protocol can be found [here](https://docs.python.org/3/library/pickle.html) and -[here](https://docs.python.org/3/library/copyreg.html) (official Python docs). - -For more complex objects (namely `tf.keras.Model`) we can either: - -1. Implement a similar approach, but we would need to save the weights and config separately. See [this notebook](https://colab.research.google.com/drive/14ECRN8ZQDa1McKri2dctlV_CaPkE574I?authuser=1#scrollTo=qlXDfJObNXVf) for an example. -2. Use `Model.save` as the backend. This would require implementing support for serializing to memory in `Model.save`, but is overall a better solution (since `Model.save` is the official way to save models in `tf.keras`) - -Solution (2) would look something like this (assuming `Model.save` worked with `io.BytesIO()`): - -```python3 -class Model: - - def __reduce_ex__(self, protocol): - b = io.BytesIO() # python built in library - self.save(b) # i.e. tf.keras.Model.save - return load_model, (b.get_value()),) # where load_model is tf.keras.models.load_model -``` - -Note how this exactly mirrors the aforementioned blog post regarding PyTorch ([link](https://matthewrocklin.com/blog/work/2018/07/23/protocols-pickle)). - -By implementing this in all of Keras' base classes, things will automatically work -with custom metrics and subclassed models. - -### Alternatives Considered - -The only real alternative is to: - -1. Ask all libraries that currently use pickle to make a special case for each Keras object and figure out how each Keras object prefers to be serialized (see use of `serialize` vs `Model.save` above). -2. Ask all users to learn the above as well. -3. Override `__reduce_ex__` to give a user friendly warning instead of failing cryptically. - -### Performance Implications - -* The performance should be the same as the underlying backend that is already implemented in TF. -* For cases where the user was going to pickle anyway, this will be faster because it uses TF's methods instead of letting Python deal with it naively. -* Tests will consist of running `new_model = pickle.loads(pickle.loads(model))` and then doing checks on `new_model`. - -### Dependencies - -* Dependencies: does this proposal add any new dependencies to TensorFlow? **NO** -* Dependent projects: are there other areas of TensorFlow or things that use TensorFlow (TFX/pipelines, TensorBoard, etc.) that this affects? **This should not affect anything** - -### Engineering Impact - -* Do you expect changes to binary size / startup time / build time / test times? **NO** -* Who will maintain this code? Is this code in its own buildable unit? Can this code be tested in its own? Is visibility suitably restricted to only a small API surface for others to use? - -This code depends on existing Keras/TF methods. As long as those are maintained and don't break, this code will not break. The new API surface area is very small. - -### Platforms and Environments - -* Platforms: does this work on all platforms supported by TensorFlow? If not, why is that ok? Will it work on embedded/mobile? Does it impact automatic code generation or mobile stripping tooling? Will it work with transformation tools? -* Execution environments (Cloud services, accelerator hardware): what impact do you expect and how will you confirm? - -This will work on anything that is running Python >= 2.7 (as far as I can tell, the pickle protocol has not changed since then). - -### Best Practices - -* Does this proposal change best practices for some aspect of using/developing TensorFlow? How will these changes be communicated/enforced? **NO** - -### Tutorials and Examples - -There are plenty of examples of how this can and would be used within all of the issues above, in addition to the linked notebook -([link again](https://colab.research.google.com/drive/14ECRN8ZQDa1McKri2dctlV_CaPkE574I?authuser=1#scrollTo=qlXDfJObNXVf)) which has -end to end implementations and tests for all of this. - -### Compatibility - -* Does the design conform to the backwards & forwards compatibility [requirements](https://www.tensorflow.org/programmers_guide/version_compat)? **YES** - -* How will this proposal interact with other parts of the TensorFlow Ecosystem? - - How will it work with TFLite? *N/A* - - How will it work with distribution strategies? *N/A* - - How will it interact with tf.function? *N/A* - - Will this work on GPU/TPU? *N/A* - - How will it serialize to a SavedModel? *Circular question...* - -### User Impact - -* What are the user-facing changes? How will this feature be rolled out? - -We implement it and are done. I do not think there is any need to edit the docs to advertise this feature. We still want -users to use `Model.save` when they are trying to save their model. - -## Questions and Discussion Topics - -Seed this with open questions you require feedback on from the RFC process. \ No newline at end of file