Checkpoint stateful handlers and metrics #966

amatsukawa · 2020-04-22T23:18:27Z

🚀 Feature

Things that are attached to the Engine might have state that would ideally be checkpointed and restored as part of the Engine's state_dict.

An example is a Checkpoint handler with a score_function. Currently, the priorities the Checkpoint class stores are not saved anywhere. It cannot recover gracefully from a failure without manual intervention: parsing the checkpoint path names and directly setting the internals of the class.

Handlers and Metrics should have state_dict and load_state_dict methods (empty by default), and I think it should be possible for their state to be automatically included in, and restored from, the Engine's state_dict when they are attached to an Engine.
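To make the proposal concrete, here is a minimal sketch of what this could look like. It is not the library's implementation: `StatefulHandler`, `BestScoreTracker`, and `ToyEngine` are hypothetical names invented for illustration, and the tracker is only a simplified stand-in for a Checkpoint with a score_function.

```python
from typing import Any, Dict, List, Tuple


class StatefulHandler:
    """Hypothetical base class: empty state by default, as proposed above."""

    def state_dict(self) -> Dict[str, Any]:
        return {}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        pass


class BestScoreTracker(StatefulHandler):
    """Hypothetical stand-in for a score-based checkpoint handler."""

    def __init__(self, n_saved: int = 2) -> None:
        self.n_saved = n_saved
        self.saved: List[Tuple[float, str]] = []  # (score, filename), best first

    def __call__(self, score: float, filename: str) -> None:
        # Keep only the n_saved highest-scoring entries.
        self.saved.append((score, filename))
        self.saved.sort(key=lambda item: item[0], reverse=True)
        del self.saved[self.n_saved:]

    def state_dict(self) -> Dict[str, Any]:
        return {"saved": list(self.saved)}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.saved = [tuple(item) for item in state["saved"]]


class ToyEngine:
    """Hypothetical engine that folds attached handlers' state into its own."""

    def __init__(self) -> None:
        self.epoch = 0
        self.handlers: Dict[str, StatefulHandler] = {}

    def attach(self, name: str, handler: StatefulHandler) -> None:
        self.handlers[name] = handler

    def state_dict(self) -> Dict[str, Any]:
        return {
            "epoch": self.epoch,
            "handlers": {n: h.state_dict() for n, h in self.handlers.items()},
        }

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.epoch = state["epoch"]
        for name, handler_state in state["handlers"].items():
            self.handlers[name].load_state_dict(handler_state)
```

With something like this, a restarted run would attach the same handlers under the same names and call `load_state_dict` once, instead of parsing checkpoint file names by hand.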


vfdev-5 · 2020-04-22T23:23:41Z

@amatsukawa thanks for FR! Yes, it definitely makes sense for handlers and metrics with internal state 👍

amatsukawa · 2020-08-12T15:15:52Z

FWIW, I'm completely happy with #1156 and the way things work now.

With complications I didn't think about, e.g. some handlers needing to go on the validation engine and handlers needing to run in a specific order (Checkpoint needs to run last, after all other stateful handlers), perhaps automatically adding things to the engine's checkpoints is not trivial and might make things harder to reason about.
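The ordering concern can be illustrated with a toy example. This is only a sketch under the assumption that handlers fire in the order they were registered; `Counter`, `Saver`, and `fire` are hypothetical names, not library APIs.

```python
from typing import Any, Callable, Dict, List, Optional


class Counter:
    """Hypothetical stateful handler that counts how often it was called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> None:
        self.calls += 1

    def state_dict(self) -> Dict[str, Any]:
        return {"calls": self.calls}


class Saver:
    """Hypothetical checkpoint handler: snapshots objects passed in explicitly."""

    def __init__(self, to_save: Dict[str, Any]) -> None:
        self.to_save = to_save
        self.last: Optional[Dict[str, Any]] = None

    def __call__(self) -> None:
        self.last = {name: obj.state_dict() for name, obj in self.to_save.items()}


def fire(handlers: List[Callable[[], None]]) -> None:
    """Run handlers in registration order (an assumption of this sketch)."""
    for handler in handlers:
        handler()


# Saver registered last: the snapshot includes the counter's update.
counter_a = Counter()
saver_a = Saver({"counter": counter_a})
fire([counter_a, saver_a])

# Saver registered first: the snapshot is one step stale.
counter_b = Counter()
saver_b = Saver({"counter": counter_b})
fire([saver_b, counter_b])
```

Here `saver_a.last` records one call while `saver_b.last` records none, even though both counters were called once. Passing the objects to save explicitly keeps that ordering decision visible to the user.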

H4dr1en · 2021-06-23T15:19:03Z

Bringing here a specific use case that this FR could solve:

Letting the RunningAverage metric restore its state would allow it to not "forget" the scores of previous iterations when resuming an experiment. Otherwise the metric can show "peaks" when resuming an experiment, as can be seen in the figure attached to this comment.
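A small sketch of the effect, using a toy exponential moving average rather than the library's RunningAverage. `ToyRunningAverage`, its `alpha` parameter, and the numbers are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


class ToyRunningAverage:
    """Toy exponential moving average with a saveable state (hypothetical)."""

    def __init__(self, alpha: float = 0.9) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        # The first observation initializes the average; later ones are blended in.
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * self.value + (1.0 - self.alpha) * x
        return self.value

    def state_dict(self) -> Dict[str, Any]:
        return {"value": self.value}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.value = state["value"]


before_crash = [1.0] * 5   # losses seen before the interruption
after_resume = [0.2, 0.2]  # losses seen after resuming

# Reference: one uninterrupted run.
uninterrupted = ToyRunningAverage()
for loss in before_crash + after_resume:
    uninterrupted.update(loss)

# Interrupted run whose metric state is saved at the crash point.
first_half = ToyRunningAverage()
for loss in before_crash:
    first_half.update(loss)

restored = ToyRunningAverage()
restored.load_state_dict(first_half.state_dict())
fresh = ToyRunningAverage()  # resumed without restoring the metric state
for loss in after_resume:
    restored.update(loss)
    fresh.update(loss)
```

The restored metric ends at the same value as the uninterrupted run, while the fresh one jumps straight to the latest raw loss, which is the kind of discontinuity described above.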

vfdev-5 added the enhancement and help wanted labels Apr 22, 2020

This was referenced Apr 30, 2020

Added Serializable in mixins #1000

Merged

Issue 966 #1008

Closed

This was referenced Jun 22, 2020

Getting iterations in Checkpoint is wrong for global_step_transform #1148

Closed

Stateful handlers #1156

Merged

vfdev-5 added the Hacktoberfest label Aug 27, 2020

vfdev-5 added the PyDataGlobal label and removed the Hacktoberfest label Oct 31, 2020

vfdev-5 added the module: metrics label Jan 18, 2021

vfdev-5 mentioned this issue May 13, 2022

How to resume learning? #2569

Open
