Implement DistributedCheckpointIO #9016

Merged: 28 commits, May 3, 2024

@mikolajblaz (Collaborator) commented on Apr 23, 2024

What does this PR do?

This PR lays the groundwork for more advanced distributed checkpointing features, such as fully parallel save/load and async save, which will be configured and implemented as part of the DistributedCheckpointIO plugin. It doesn't introduce any new features; it only refactors existing behavior.

Collection: NLP

Changelog

  • Implement the DistributedCheckpointIO plugin for saving and loading distributed checkpoints

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
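
The PR description does not include a usage snippet, so the following is only a rough, self-contained sketch of what a checkpoint IO plugin does in general. It assumes the plugin follows the three-method shape used by PyTorch Lightning's `CheckpointIO` plugins (`save_checkpoint`, `load_checkpoint`, `remove_checkpoint`). The names `CheckpointIOSketch`, `DirectoryCheckpointIO`, and `roundtrip_demo` are hypothetical, and the one-JSON-file-per-key layout is a toy stand-in, not the format or API that DistributedCheckpointIO actually uses.

```python
import json
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict


class CheckpointIOSketch(ABC):
    """Hypothetical stand-in for a checkpoint IO plugin interface."""

    @abstractmethod
    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        ...

    @abstractmethod
    def load_checkpoint(self, path: str) -> Dict[str, Any]:
        ...

    @abstractmethod
    def remove_checkpoint(self, path: str) -> None:
        ...


class DirectoryCheckpointIO(CheckpointIOSketch):
    """Toy plugin: a checkpoint is a directory with one JSON file per top-level key."""

    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        ckpt_dir = Path(path)
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        for key, value in checkpoint.items():
            (ckpt_dir / f"{key}.json").write_text(json.dumps(value))

    def load_checkpoint(self, path: str) -> Dict[str, Any]:
        ckpt_dir = Path(path)
        if not ckpt_dir.is_dir():
            raise FileNotFoundError(f"Checkpoint directory not found: {path}")
        # Rebuild the dictionary from the per-key files.
        return {f.stem: json.loads(f.read_text()) for f in sorted(ckpt_dir.glob("*.json"))}

    def remove_checkpoint(self, path: str) -> None:
        shutil.rmtree(path, ignore_errors=True)


def roundtrip_demo() -> Dict[str, Any]:
    """Save, load, and remove a small checkpoint in a temporary directory."""
    io = DirectoryCheckpointIO()
    with tempfile.TemporaryDirectory() as tmp:
        path = str(Path(tmp) / "step=10")
        io.save_checkpoint({"state_dict": {"w": [1.0, 2.0]}, "global_step": 10}, path)
        restored = io.load_checkpoint(path)
        io.remove_checkpoint(path)
        assert not Path(path).exists()
    return restored
```

Keeping all storage logic behind one such object is what would let features like parallel or async save be added later without touching the training code that calls it.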

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions bot added the NLP label on Apr 23, 2024
@mikolajblaz changed the title from "Mblaz/dist ckpt io" to "Implement DistributedCheckpointIO" on Apr 23, 2024
@github-actions bot added the core (Changes to NeMo Core) label on Apr 23, 2024
@mikolajblaz marked this pull request as ready for review on April 23, 2024, 13:39
@mikolajblaz self-assigned this on Apr 23, 2024
@dimapihtar (Collaborator) left a comment

LGTM. Thank you!

@dimapihtar merged commit f28773f into main on May 3, 2024
129 checks passed
@dimapihtar deleted the mblaz/dist-ckpt-io branch on May 3, 2024, 17:29
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request on Jun 25, 2024:
* Introduce DistributedCheckpointIO

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix DistCkptIO usage

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use NeMo logger

Signed-off-by: Mikołaj Błaż <[email protected]>

* [DCIO] Fix save_to dist ckpt path

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use dist ckpt flag in all methods

Signed-off-by: Mikołaj Błaż <[email protected]>

* Improve error msg

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add dist ckpt unit tests

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load_checkpoint

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix auto-issues

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix ckpt_dir var

Signed-off-by: Mikołaj Błaż <[email protected]>

* Restore skipping behavior

The fix from prevent-duplicated-checkpoints is required to skip the checkpoints

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix steps on single-GPU machine

Signed-off-by: Mikołaj Błaż <[email protected]>

* Add docs

Signed-off-by: Mikołaj Błaż <[email protected]>

* Apply black

Signed-off-by: Mikołaj Błaż <[email protected]>

* Fix num steps in tests

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use dist-ckpt for Bert

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load checkpoint return val

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use dist-ckpt based on sharded_state_dict

Signed-off-by: Mikołaj Błaż <[email protected]>

* Use correct checkpoint_io

Signed-off-by: Mikołaj Błaż <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mikołaj Błaż <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Labels: core (Changes to NeMo Core), NLP, Run CICD