Add documentation for preemption support #6403

Merged
merged 1 commit into from
Apr 10, 2023
11 changes: 11 additions & 0 deletions docs/source/core/exp_manager.rst
@@ -173,9 +173,20 @@ and stability. To use EMA, simply set the following via YAML or :class:`~nemo.ut
every_n_steps: 1 # How often to update EMA weights
validate_original_weights: False # Whether to use original weights for validation calculation or EMA weights

Support for Preemption
----------------------

.. _exp_manager_preemption_support-label:

NeMo provides a callback that handles preemption when models are trained on clusters. When a run is preempted, the callback saves the current training state to a ``.ckpt``
file and then exits the run gracefully. The checkpoint saved upon preemption has the ``*last.ckpt`` suffix and replaces any previously saved last checkpoint.
This feature is useful for increasing utilization on clusters.
The ``PreemptionCallback`` is enabled by default. To disable it, add ``create_preemption_callback: False`` under ``exp_manager`` in the config YAML file.
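
As a minimal sketch, the fragment below shows where the flag would sit in a config YAML file. It assumes the rest of the ``exp_manager`` section is left as it already is in your config; only the ``create_preemption_callback`` key comes from the description above.

.. code-block:: yaml

    exp_manager:
        # ... other exp_manager settings stay unchanged ...
        create_preemption_callback: False  # Disable the PreemptionCallback (it is enabled by default)

Omitting the key keeps the default behavior, so the callback remains active.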


.. _nemo_multirun-label:


Hydra Multi-Run with NeMo
-------------------------
