Implement async distributed checkpoint save (#9028)
* Prevent duplicated checkpoints
Signed-off-by: Mikołaj Błaż <[email protected]>
* Introduce DistributedCheckpointIO
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix DistCkptIO usage
Signed-off-by: Mikołaj Błaż <[email protected]>
* Use NeMo logger
Signed-off-by: Mikołaj Błaż <[email protected]>
* [DCIO] Fix save_to dist ckpt path
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add versioning to save_to
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add versioning logic to all .nemo files
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add versioning test
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add dist-ckpt test
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Mikołaj Błaż <[email protected]>
* Rename existing ckpts instead of using different name
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add comment
Signed-off-by: Mikołaj Błaż <[email protected]>
* Use dist ckpt flag in all methods
Signed-off-by: Mikołaj Błaż <[email protected]>
* Improve error msg
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add dist ckpt unit tests
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix load_checkpoint
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix auto-issues
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix ckpt_dir var
Signed-off-by: Mikołaj Błaż <[email protected]>
* Restore skipping behavior
The fix from prevent-duplicated-checkpoints is required to skip the checkpoints
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix steps on single-GPU machine
Signed-off-by: Mikołaj Błaż <[email protected]>
* Run dist-ckpt test on GPU
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add docs
Signed-off-by: Mikołaj Błaż <[email protected]>
* Apply black
Signed-off-by: Mikołaj Błaż <[email protected]>
* Prevent saving last for non-equal val intervals
Signed-off-by: Mikołaj Błaż <[email protected]>
* Move checkpoint on rank 0
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix num steps in tests
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add async ckpt implementation
Signed-off-by: Mikołaj Błaż <[email protected]>
* Abstract AsyncFinalizableCheckpointIO away
Signed-off-by: Mikołaj Błaż <[email protected]>
* Change async_save flag location
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add debug info
Signed-off-by: Mikołaj Błaż <[email protected]>
* Apply formatting
Signed-off-by: Mikołaj Błaż <[email protected]>
* Handle multiple async saves
Signed-off-by: Mikołaj Błaż <[email protected]>
* Apply formatting
Signed-off-by: Mikołaj Błaż <[email protected]>
* Move finalization calls to a callback
Signed-off-by: Mikołaj Błaż <[email protected]>
* Avoid deadlock in teardown
Signed-off-by: Mikołaj Błaż <[email protected]>
* Adjust to MCore implementation
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add notes and copyrights
Signed-off-by: Mikołaj Błaż <[email protected]>
* Apply formatting
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix async_request attribute
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add MCore import guards
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add async test
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix finalize_fn arg
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add docs
Signed-off-by: Mikołaj Błaż <[email protected]>
* Remove checkpoints from accurate steps
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix MCore class usage
Signed-off-by: Mikołaj Błaż <[email protected]>
* Update docs
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix logger usage
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix rebase
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix code scan issues
Signed-off-by: Mikołaj Błaż <[email protected]>
* Remove unused import
Signed-off-by: Mikołaj Błaż <[email protected]>
* Use dist-ckpt for Bert
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix load checkpoint return val
Signed-off-by: Mikołaj Błaż <[email protected]>
* Use dist-ckpt based on sharded_state_dict
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add async logging
Signed-off-by: Mikołaj Błaż <[email protected]>
* Remove deprecated argument
Signed-off-by: Mikołaj Błaż <[email protected]>
* Use correct checkpoint_io
Signed-off-by: Mikołaj Błaż <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix bad merge
Signed-off-by: Mikołaj Błaż <[email protected]>
* Improve debug msg
Signed-off-by: Mikołaj Błaż <[email protected]>
* Run async test on GPU
Signed-off-by: Mikołaj Błaż <[email protected]>
* Fix async ckpt unit test
Signed-off-by: Mikołaj Błaż <[email protected]>
* Apply isort and black reformatting
Signed-off-by: mikolajblaz <[email protected]>
* Clarify async logs
Signed-off-by: Mikołaj Błaż <[email protected]>
* Add schema print
Signed-off-by: Mikołaj Błaż <[email protected]>
---------
Signed-off-by: Mikołaj Błaż <[email protected]>
Signed-off-by: mikolajblaz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>