-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support nested NeMo models #5671
Conversation
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
The PR is still actual. Need a review / discussion (maybe after 9 Jan) |
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
@titu1994, I reworked the approach to handle nested submodules artifacts automatically, I added tests for:
I'm not sure about model-parallel serialization. Do we have any examples to start from? Also, I'm not sure about the |
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments here and there, but overall incredible work !
Would not have thought of this solution, but it works very very well.
@ericharper after your review, we can merge this for Non-MP models. I think the refactor for MP models would be only minor changes which can be done in a subsequent PR (this PR is a blocker for two others)
@@ -37,7 +39,7 @@ def __init__(self) -> None: | |||
self._model_weights_ckpt = "model_weights.ckpt" | |||
self._model_extracted_dir = None | |||
|
|||
def save_to(self, model, save_path: str): | |||
def save_to(self, model: nemo_classes.ModelPT, save_path: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use string here, do not use the actual class just for typing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid quotes, and use PEP563 as a more stable approach, since quotes are intended to use with forward declarations by original PEP484, and there is no forward declarations in most cases when quotes are used in NeMo.
But there is a problem with partial initialization here, so I used from __future__ import annotations
. This is fully compatible with Python3.7+, widely recommended and adopted.
E.g. see in mypy, pydantic and so on.
You can't see it in old code (due to Python3.7+ limitations), but in new code it is also widely use it, e.g.:
To avoid long discussions here, I added quotes.
But I will appreciate if you look at alternative approach and discuss it (maybe separately).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use string name here. Pep is not an enforcement it is a suggestion. It does not make sense to mess around internals of Nemo just for an annotation.
self._unpack_nemo_file(path2file=model_metadata.restoration_path, out_folder=archive_dir) | ||
# unpack all restorations paths (nemo checkpoints) | ||
# in nemo checkpoints all resources contain hash in name, so there should be no collisions | ||
for path in restoration_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restoration paths were tempdirs, do we recreate those tempdirs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restoration_paths
are only paths to .nemo
checkpoints, since we don't use .nemo
files inside parent .nemo
file. Using .nemo
in .nemo
will still break the code and should be avoided.
I changed it to a set
to unpack each checkpoint only once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add a test where we attempt to register a Nemo file. Maybe check and raise an error ?
tests/core/test_save_restore.py
Outdated
# path in config should be set to `None` to restore model from config after saving | ||
self.cfg.child1_model_path = None | ||
# config for child model should be stored in base model config | ||
self.cfg.child1_model = self.child1_model.cfg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for demonstration or do we need to ask model devs to do this? If so, can we automate it somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the model is loaded from checkpoint (see a comment about 2 ways to construct child models), the developer should assign a config.
I also thought about doing this automatically (since we can use __getattr__
on ModelPT
to do this), but:
- there are cases when the model will be directly constructed from config, so we don't need to assign it once more
- this will break existing code: there is existing NLP code, which uses additional NeMo models as attributes, but doesn't save them (actually I had some errors when tried to do everything recursively). E.g.
examples/nlp/token_classification/token_classification_train.py
->NLPModel.bert_model
is assigned, but there is some tricky code to handle this submodel - if the model is not needed to be saved (assigned as attribute, but then deleted), the developer will have to delete
cfg.child1_model
directly after 'magic' construction of this attribute.
So, I think it's better to ask developers to do it manually if constructing child models from .nemo
checkpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not modify getattr. Very annoying to do it correctly and brittle for future.
However it is also very easy to forget this step and completely break your implementation.
Add a test that checks this - a model which is nested but "forgets" to set the config. Raise an error.
What about doing this - inside of save_to, check if any module is ModelPT and if so, check the config is available other wise raise error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do this - inside save to check if the config is set, if yes don't do anything else otherwise use self.nested_module_i.cfg and set the config automatically inside of self.cfg
def __init__(self, cfg, trainer=None): | ||
super().__init__(cfg=cfg, trainer=trainer) | ||
|
||
# child 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you register the model file? Or is that no longer necessary cause internally it will handle the module traversal ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 ways to instantiate child model.
I added comments to MockModelWithChildren
.
Variant 1 for creating nested NeMo model:
- create a child model from .nemo checkpoint (from
*_model_path
) - assign config to indicate that this model should be saved/restored
- when model is restored, parent and children will be restored from a signle parent
.nemo
checkpoint
Variant 2 for creating nested NeMo model: load directly from config (can be instantiated via MyModel(child_cfg)
or universally ModelPT.from_config_dict(child_cfg)
(if _target_
is provided).
In both cases no .nemo
file inside parent .nemo
model will be saved (this is still incompatible with NeMo and global AppState
!). So, after loading submodels from .nemo
checkpoints, a solid model is constructed.
Keep in mind, that synchronization is only done in save_to
:
- aggregate and update subconfigs
- aggregate and save artifacts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then add a check inside of register_artifact that checks if it's a .Nemo file and raise an error saying that Nemo file itsel cannot be registered. Add a test case for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then add a check inside of register_artifact that checks if it's a .Nemo file and raise an error saying that Nemo file itsel cannot be registered. Add a test case for this
Firstly added, but then reverted, since such a check breaks existing code. See
tests/collections/asr/test_asr_rnnt_encoder_model_bpe.py::TestEncDecRNNTBPEModel::test_save_restore_nested_model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add this check, and either remove or update the faulty RNNT test. That one was my initial attempt to hack together nested model support and is no longer in the correct format of this PR
parent = ModelPT.restore_from(parent_path) | ||
# check model is transparent, child models can be accessed and can be saved/restored separately | ||
_ = self.__test_restore_elsewhere(parent.child1_model, map_location='cpu') | ||
child2 = self.__test_restore_elsewhere(parent.child2_model, map_location='cpu') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing test, incredible work here !
tests/core/test_save_restore.py
Outdated
if cfg.get('child1_model_path') is None and cfg.get('child1_model') is None: | ||
self.child1_model = None | ||
else: | ||
# if `child1_model_path` is set, model should be restored from nemo checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here "nemo checkpoint" -> "nemo model"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a checkpoint, so I changed it to ".nemo model checkpoint"
I also added a comment above to clarify options how child model is constructed.
tests/core/test_save_restore.py
Outdated
if cfg.get('child2_model_path') is None and cfg.get('child2_model') is None: | ||
self.child2_model = None | ||
else: | ||
# if `child2_model_path` is set, model should be restored from nemo checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"nemo checkpoint" -> "nemo model" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a checkpoint, so I changed it to ".nemo model checkpoint"
tests/core/test_save_restore.py
Outdated
child2_path = os.path.join(tmpdir_child, 'child2.nemo') | ||
child2.save_to(child2_path) | ||
|
||
# create model with children using saved "nemo" checkpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"nemo" checkpoints -> nemo models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
".nemo model checkpoints"
Signed-off-by: Vladimir Bataev <[email protected]>
Enforce this check, remove/update that test. That test is a proxy of the old way I was trying to hack together multi model support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ready to merge (after adding back the nemo file checks).
@ericharper for final review
Amazing work !
docs/source/core/core.rst
Outdated
self.register_nemo_submodule( | ||
"child_model", | ||
config_field="child_model", | ||
model=ChildModel(self.cfg.child_model), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the trainer=trainer part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/source/core/core.rst
Outdated
# either if config provided initially, or automatically | ||
# after model restoration | ||
self.register_nemo_submodule( | ||
"child_model", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using kwargs, add the key for the first arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
def __init__(self, cfg, trainer=None): | ||
super().__init__(cfg=cfg, trainer=trainer) | ||
|
||
# child 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add this check, and either remove or update the faulty RNNT test. That one was my initial attempt to hack together nested model support and is no longer in the correct format of this PR
Signed-off-by: Vladimir Bataev <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Bataev <[email protected]>
…o support_nested_models
I added the check and removed the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for many changes, iterations, and documentation!
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
@titu1994, @ericharper I finally updated the documentation + fixed the faulty test for NestedRNNTModel in I think now it's ready to be merged) |
LGTM as well, absolutely bonkers you made this work! I hope this promotes new awesome clean multi-model pipelines in NeMo 🚀 |
Co-authored-by: Sean Naren <[email protected]> Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: Vladimir Bataev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Nested NeMo models support Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sean Naren <[email protected]>
Nested NeMo models support Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sean Naren <[email protected]>
Nested NeMo models support Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sean Naren <[email protected]>
Nested NeMo models support Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sean Naren <[email protected]>
Nested NeMo models support Signed-off-by: Vladimir Bataev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sean Naren <[email protected]>
What does this PR do ?
Adds support for nested NeMo models (with resources).
It's possible to instantiate and save models with NeMo submodules. Artifacts are correctly saved and restored.
Collection: [core]
Changelog
Usage
Example usage:
child_model_path
with.nemo
checkpoint path to load the modelBefore your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information