[Core] Support pre-extracted nemo checkpoint for restoration #4061
Signed-off-by: smajumdar <[email protected]>
/blossom-ci
if save_restore_connector.model_extracted_dir is None:
    restore_path = os.path.abspath(os.path.expanduser(restore_path))
else:
    restore_path = os.path.abspath(os.path.expanduser(save_restore_connector.model_extracted_dir))
What if the user provides an absolute path outside of their home directory?
expanduser() only expands ~ in a user-provided path, so if an absolute path is given, then as long as there's no ~ in it, the expansion will be correct.
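A quick way to check the behaviour described in this reply is the short sketch below. The paths are hypothetical and exist only for illustration. It also shows that `os.path.expanduser()` acts on a leading `~` only, so a `~` appearing later in the path is left untouched.

```python
import os


def resolve(path: str) -> str:
    """Same normalisation as in the diff: expand a leading ~, then make the path absolute."""
    return os.path.abspath(os.path.expanduser(path))


# Hypothetical absolute path outside the home directory: no leading "~",
# so expanduser() returns it unchanged.
abs_path = "/data/checkpoints/extracted_model"
unchanged = os.path.expanduser(abs_path)

# A "~" that is not the first character is not expanded either.
mid_tilde = "/data/~backup/model"
mid_unchanged = os.path.expanduser(mid_tilde)

# A leading "~" is replaced with the current user's home directory
# (when the home directory can be determined).
home_relative = os.path.expanduser(os.path.join("~", "extracted_model"))

print(unchanged, mid_unchanged, home_relative, sep="\n")
```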
@@ -523,3 +541,67 @@ def test_mock_model_model_collision(self):
        with pytest.raises(ValueError, match="Creating model config node is forbidden"):
            model = MockModel(cfg=cfg.model, trainer=None)  # type: MockModel
            model = model.to('cpu')

    @pytest.mark.unit
    def test_restore_from_save_restore_connector_extracted_dir(self):
Thanks for adding unit tests!
What does this PR do ?
Adds support to SaveRestoreConnector for using a pre-extracted .nemo directory as the restoration directory.
This is required for restoring extremely large language models, whose checkpoints cannot be extracted partially during multi-node inference.
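As a rough illustration of the selection logic in the diff, the sketch below uses a hypothetical stand-in class, `ConnectorSketch`, rather than the real SaveRestoreConnector. It models only the `model_extracted_dir` attribute, and the helper name `resolve_restore_path` is likewise an assumption made for this example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectorSketch:
    """Hypothetical stand-in for SaveRestoreConnector (only one attribute modelled)."""
    model_extracted_dir: Optional[str] = None


def resolve_restore_path(restore_path: str, connector: ConnectorSketch) -> str:
    """Use the pre-extracted directory when one is set, else the given restore path."""
    if connector.model_extracted_dir is None:
        chosen = restore_path
    else:
        chosen = connector.model_extracted_dir
    # Same normalisation as in the diff: expand a leading ~, then make absolute.
    return os.path.abspath(os.path.expanduser(chosen))


# Hypothetical paths, for illustration only.
from_archive = resolve_restore_path("/tmp/model.nemo", ConnectorSketch())
from_extracted = resolve_restore_path(
    "/tmp/model.nemo", ConnectorSketch(model_extracted_dir="/tmp/model_extracted")
)
print(from_archive)
print(from_extracted)
```

When `model_extracted_dir` is set, the path passed as `restore_path` is ignored for path resolution, which is why an already unpacked checkpoint can be reused without extracting it again.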
Collection: [Core, Megatron]
Changelog
Note
connector.model_extracted_dir
inside AppState.
Usage
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
@okuchaiev @ericharper