Support nested NeMo models #5671

Merged on Jan 23, 2023 (44 commits)

Collaborator

@artbataev artbataev commented Dec 19, 2022

What does this PR do ?

Adds support for nested NeMo models (with resources).

With this change, models that contain NeMo submodules can be instantiated and saved. Artifacts are saved and restored correctly.

Collection: [core]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

Example usage:

  • 3 ways to instantiate child models:
    • use subconfig directly
    • use child_model_path with .nemo checkpoint path to load the model
    • use pretrained model
  • child model can contain artifacts
  • child model config can be changed after parent model instantiation. Config will be saved when saving parent model
from typing import Optional

from nemo.core.classes import ModelPT

class ChildModel(ModelPT):
    ...  # implement necessary methods

class ParentModel(ModelPT):
    def __init__(self, cfg, trainer=None):
        super().__init__(cfg=cfg, trainer=trainer)

        # optionally annotate type for IDE autocompletion and type checking
        self.child_model: Optional[ChildModel]
        if cfg.get("child_model") is not None:
            # load directly from config
            # either if config provided initially, or automatically
            # after model restoration
            self.register_nemo_submodule(
                "child_model",
                config_field="child_model",
                model=ChildModel(self.cfg.child_model),
            )
        elif cfg.get('child_model_path') is not None:
            # load from .nemo model checkpoint
            # while saving, config will be automatically assigned/updated
            # in cfg.child_model
            self.register_nemo_submodule(
                "child_model",
                config_field="child_model",
                model=ChildModel.restore_from(self.cfg.child_model_path),
            )
        elif cfg.get('child_model_name') is not None:
            # load from pretrained model
            # while saving, config will be automatically assigned/updated
            # in cfg.child_model
            self.register_nemo_submodule(
                "child_model",
                config_field="child_model",
                model=ChildModel.from_pretrained(self.cfg.child_model_name),
            )
        else:
            self.child_model = None
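
The three branches above are checked in a fixed order: an inline `child_model` subconfig takes precedence over `child_model_path`, which takes precedence over `child_model_name`. A minimal sketch of that precedence, using a plain dictionary in place of the model config; the helper name `select_child_source` is hypothetical and not part of NeMo:

```python
from typing import Any, Mapping, Optional


def select_child_source(cfg: Mapping[str, Any]) -> Optional[str]:
    """Return which branch of ParentModel.__init__ would run for `cfg`.

    Hypothetical helper for illustration; it only mirrors the if/elif order above.
    """
    if cfg.get("child_model") is not None:
        return "config"  # ChildModel(cfg.child_model)
    if cfg.get("child_model_path") is not None:
        return "checkpoint"  # ChildModel.restore_from(cfg.child_model_path)
    if cfg.get("child_model_name") is not None:
        return "pretrained"  # ChildModel.from_pretrained(cfg.child_model_name)
    return None  # no child model


# Assumption for illustration: a saved parent config that carries the child
# subconfig and still mentions the original checkpoint path.
restored_cfg = {"child_model": {"hidden": 8}, "child_model_path": "child.nemo"}
selected = select_child_source(restored_cfg)
```

Because the subconfig is written into `cfg.child_model` on save, a restored parent would take the "config" branch in this sketch regardless of how the child was created originally.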

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the core Changes to NeMo Core label Dec 19, 2022
@artbataev artbataev mentioned this pull request Dec 19, 2022
14 tasks
@github-actions
Contributor

github-actions bot commented Jan 6, 2023

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jan 6, 2023
@artbataev
Collaborator Author

> This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

The PR is still relevant. It needs a review / discussion (maybe after 9 Jan).

@artbataev artbataev removed the stale label Jan 6, 2023
@artbataev
Collaborator Author

@titu1994, I reworked the approach so that artifacts of nested submodules are handled automatically; register_submodule_artifacts is no longer required.
This is because nested models can change their config + artifacts, e.g., change_vocabulary for BPE models changes model resources. Since we can't track changes in submodules from the parent module, I check submodules for changes when saving the model.

I added tests for:

  • serialization + deserialization
  • multi-nested model
  • multiple save-restore passes
  • cases when the child model changes its config

I'm not sure about model-parallel serialization. Do we have any examples to start from?

Also, I'm not sure about the is_model_being_restored flag.
I can add a simple context manager to preserve this flag when loading the child NeMo model. Is it a good solution? Do we have any tests for this flag?
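
One possible shape for the context manager mentioned above, sketched with a stand-in state object. `FakeAppState` and `preserve_restoration_flag` are hypothetical names; only the `is_model_being_restored` flag name comes from this discussion, and this is not necessarily the code that was merged:

```python
from contextlib import contextmanager
from typing import Iterator


class FakeAppState:
    """Stand-in for a global state object holding the restoration flag (assumption)."""

    def __init__(self) -> None:
        self.is_model_being_restored = False


@contextmanager
def preserve_restoration_flag(state: FakeAppState) -> Iterator[None]:
    """Remember the flag on entry and put it back on exit, even if loading fails."""
    saved = state.is_model_being_restored
    try:
        yield
    finally:
        state.is_model_being_restored = saved


state = FakeAppState()
state.is_model_being_restored = True  # the parent model is being restored

with preserve_restoration_flag(state):
    # Simulate a child restore that sets the flag and clears it when done.
    state.is_model_being_restored = True
    state.is_model_being_restored = False

flag_after_child_load = state.is_model_being_restored
```

In this sketch the flag is back to its pre-load value after the `with` block, so the parent still sees itself as being restored.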

Collaborator

@titu1994 titu1994 left a comment

Minor comments here and there, but overall incredible work !
Would not have thought of this solution, but it works very very well.
@ericharper after your review, we can merge this for Non-MP models. I think the refactor for MP models would be only minor changes which can be done in a subsequent PR (this PR is a blocker for two others)

@@ -37,7 +39,7 @@ def __init__(self) -> None:
         self._model_weights_ckpt = "model_weights.ckpt"
         self._model_extracted_dir = None

-    def save_to(self, model, save_path: str):
+    def save_to(self, model: nemo_classes.ModelPT, save_path: str):
Collaborator

Just use string here, do not use the actual class just for typing

Collaborator Author

@artbataev artbataev Jan 16, 2023

I think we should avoid quotes and use PEP 563 as a more stable approach: per the original PEP 484, quotes are intended for forward declarations, and in most places where NeMo uses quotes there are no forward declarations.

But there is a problem with partial initialization here, so I used from __future__ import annotations. This is fully compatible with Python 3.7+, widely recommended and adopted.

For example, see mypy, pydantic, and so on.

You won't see it in older code (it requires Python 3.7+), but it is also widely used in new code.

To avoid long discussions here, I added quotes.

But I would appreciate it if you looked at the alternative approach and discussed it (maybe separately).

Collaborator

Please use the string name here. A PEP is a suggestion, not an enforcement. It does not make sense to mess with NeMo internals just for an annotation.
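
The string-annotation variant requested here can be sketched as follows. The `TYPE_CHECKING` guard is an addition for illustration (an assumption, not something taken from this PR): it keeps the import visible to type checkers without importing the class at runtime. `ConnectorSketch` is a hypothetical stand-in for the connector class:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime,
    # so the module is not imported when this file is loaded.
    from nemo.core.classes import ModelPT


class ConnectorSketch:
    """Hypothetical stand-in for the save/restore connector."""

    def save_to(self, model: "ModelPT", save_path: str) -> str:
        # The quoted annotation is stored as a plain string and is not evaluated here.
        return save_path


annotation = ConnectorSketch.save_to.__annotations__["model"]
```

At runtime the annotation is just the string `"ModelPT"`, which is why this form sidesteps the partial-initialization problem discussed above.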

nemo/core/classes/modelPT.py (resolved)
nemo/core/connectors/save_restore_connector.py (outdated, resolved)
self._unpack_nemo_file(path2file=model_metadata.restoration_path, out_folder=archive_dir)
# unpack all restorations paths (nemo checkpoints)
# in nemo checkpoints all resources contain hash in name, so there should be no collisions
for path in restoration_paths:
Collaborator

Restoration paths were tempdirs, do we recreate those tempdirs?

Collaborator Author

restoration_paths contains only paths to .nemo checkpoints, since we don't store .nemo files inside the parent .nemo file. Putting a .nemo inside a .nemo will still break the code and should be avoided.

I changed it to a set so that each checkpoint is unpacked only once.
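
A small illustration of why a set helps here: when several children were loaded from the same checkpoint, that path is unpacked once. The inner `unpack` function is a hypothetical stand-in that only records calls; it is not the connector's real unpacking routine:

```python
from typing import Iterable, List, Set


def unpack_once(restoration_paths: Iterable[str]) -> List[str]:
    """Unpack every distinct checkpoint path exactly once and return what was unpacked."""
    unpacked: List[str] = []

    def unpack(path: str) -> None:
        # Stand-in for extracting a .nemo archive into the working directory.
        unpacked.append(path)

    unique_paths: Set[str] = set(restoration_paths)
    for path in sorted(unique_paths):  # sorted only to make the order deterministic
        unpack(path)
    return unpacked


# Two children restored from the same checkpoint, plus one from another file.
result = unpack_once(["asr.nemo", "lm.nemo", "asr.nemo"])
```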

Collaborator

We need to add a test where we attempt to register a .nemo file. Maybe check and raise an error?

nemo/core/connectors/save_restore_connector.py (outdated, resolved)
# path in config should be set to `None` to restore model from config after saving
self.cfg.child1_model_path = None
# config for child model should be stored in base model config
self.cfg.child1_model = self.child1_model.cfg
Collaborator

Is this for demonstration, or do we need to ask model devs to do this? If so, can we automate it somehow?

Collaborator Author

@artbataev artbataev Jan 16, 2023

If the model is loaded from checkpoint (see a comment about 2 ways to construct child models), the developer should assign a config.

I also thought about doing this automatically (since we can use __getattr__ on ModelPT to do this), but:

  • there are cases when the model will be directly constructed from config, so we don't need to assign it once more
  • this will break existing code: there is existing NLP code that uses additional NeMo models as attributes but doesn't save them (I actually got some errors when I tried to do everything recursively). E.g. examples/nlp/token_classification/token_classification_train.py -> NLPModel.bert_model is assigned, but there is some tricky code to handle this submodel
  • if the model does not need to be saved (assigned as an attribute, but then deleted), the developer will have to delete cfg.child1_model directly after the 'magic' construction of this attribute.

So, I think it's better to ask developers to do it manually if constructing child models from .nemo checkpoints.

Collaborator

Let's not modify getattr. Very annoying to do it correctly and brittle for future.

However it is also very easy to forget this step and completely break your implementation.

Add a test that checks this - a model which is nested but "forgets" to set the config. Raise an error.

What about doing this: inside save_to, check whether any module is a ModelPT and, if so, check that its config is available; otherwise raise an error.

Collaborator

Or do this: inside save_to, check whether the config is set. If it is, do nothing else; otherwise take self.nested_module_i.cfg and set the config automatically inside self.cfg.
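
The idea of filling in child configs at save time could look roughly like this. Everything here is a toy built on plain dictionaries: `ToyModel`, `children`, and `sync_child_configs` are hypothetical names, and this variant always refreshes the stored subconfig (which also covers a child that changed after construction) rather than reproducing what `save_to` does in NeMo:

```python
from typing import Any, Dict, Optional


class ToyModel:
    """Toy stand-in for a model with a config and named child models."""

    def __init__(self, cfg: Optional[Dict[str, Any]] = None) -> None:
        self.cfg: Dict[str, Any] = dict(cfg or {})
        self.children: Dict[str, "ToyModel"] = {}  # config field -> child model

    def sync_child_configs(self) -> None:
        """Copy each child's current config into the parent config, depth first."""
        for field, child in self.children.items():
            child.sync_child_configs()  # nested children first
            self.cfg[field] = dict(child.cfg)


parent = ToyModel({"lr": 0.1})
child = ToyModel({"vocab_size": 128})
parent.children["child_model"] = child  # developer "forgot" to set the subconfig

child.cfg["vocab_size"] = 256  # child changes its config after construction
parent.sync_child_configs()  # what a save step could do before writing the config
```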

def __init__(self, cfg, trainer=None):
super().__init__(cfg=cfg, trainer=trainer)

# child 1
Collaborator

Where do you register the model file? Or is that no longer necessary because the module traversal is handled internally?

Collaborator Author

There are 2 ways to instantiate child model.

I added comments to MockModelWithChildren.

Variant 1 for creating a nested NeMo model:

  • create a child model from a .nemo checkpoint (from *_model_path)
  • assign the config to indicate that this model should be saved/restored
  • when the model is restored, the parent and its children are restored from a single parent .nemo checkpoint

Variant 2 for creating a nested NeMo model: load directly from the config. The child can be instantiated via MyModel(child_cfg) or, universally, via ModelPT.from_config_dict(child_cfg) (if _target_ is provided).

In both cases, no .nemo file is saved inside the parent .nemo model (this is still incompatible with NeMo and the global AppState!). So, after submodels are loaded from .nemo checkpoints, a single self-contained model is constructed.

Keep in mind that synchronization is only done in save_to:

  • aggregate and update subconfigs
  • aggregate and save artifacts
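
The artifact half of that synchronization can be pictured as a walk over the parent and its children that gathers every artifact path before the archive is written. This is a toy sketch under stated assumptions: `ToyModule`, `artifacts`, and `collect_artifacts` are hypothetical, the file names are made up, and real artifact handling in NeMo involves more than collecting paths:

```python
from typing import Dict, List


class ToyModule:
    """Toy stand-in for a model that owns artifact files and child models."""

    def __init__(self, artifacts: List[str]) -> None:
        self.artifacts = list(artifacts)
        self.children: Dict[str, "ToyModule"] = {}


def collect_artifacts(model: ToyModule) -> List[str]:
    """Return the artifacts of `model` and all nested children, without duplicates."""
    seen: List[str] = []
    stack = [model]
    while stack:
        current = stack.pop()
        for path in current.artifacts:
            if path not in seen:
                seen.append(path)
        stack.extend(current.children.values())
    return sorted(seen)


grandchild = ToyModule(["abc123_tokenizer.model"])
child = ToyModule(["def456_vocab.txt"])
child.children["inner"] = grandchild
parent = ToyModule(["789aaa_labels.txt"])
parent.children["child_model"] = child

all_artifacts = collect_artifacts(parent)
```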

Collaborator

Then add a check inside of register_artifact that checks if it's a .nemo file and raises an error saying that a .nemo file itself cannot be registered. Add a test case for this.
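
Such a guard could be as small as the following sketch. The function name and the choice of `ValueError` are assumptions for illustration; this discussion does not say which exception the real check raises:

```python
from pathlib import Path


def ensure_not_nemo_file(src: str) -> str:
    """Reject .nemo archives as artifacts; return the path unchanged otherwise."""
    if Path(src).suffix.lower() == ".nemo":
        raise ValueError(
            f"Registering a .nemo file as an artifact is not supported: {src}"
        )
    return src


ok_path = ensure_not_nemo_file("tokenizer/vocab.txt")

try:
    ensure_not_nemo_file("models/child.nemo")
    rejected = False
except ValueError:
    rejected = True
```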

Collaborator Author

> Then add a check inside of register_artifact that checks if it's a .nemo file and raises an error saying that a .nemo file itself cannot be registered. Add a test case for this.

I added it at first, but then reverted it, since such a check breaks existing code. See
tests/collections/asr/test_asr_rnnt_encoder_model_bpe.py::TestEncDecRNNTBPEModel::test_save_restore_nested_model

Collaborator

Let's add this check, and either remove or update the faulty RNNT test. That one was my initial attempt to hack together nested model support and no longer matches the format used in this PR.

parent = ModelPT.restore_from(parent_path)
# check model is transparent, child models can be accessed and can be saved/restored separately
_ = self.__test_restore_elsewhere(parent.child1_model, map_location='cpu')
child2 = self.__test_restore_elsewhere(parent.child2_model, map_location='cpu')
Collaborator

Amazing test, incredible work here !

if cfg.get('child1_model_path') is None and cfg.get('child1_model') is None:
self.child1_model = None
else:
# if `child1_model_path` is set, model should be restored from nemo checkpoint
Collaborator

here "nemo checkpoint" -> "nemo model"?

Collaborator Author

@artbataev artbataev Jan 16, 2023

This is a checkpoint, so I changed it to ".nemo model checkpoint"

I also added a comment above to clarify the options for constructing the child model.

if cfg.get('child2_model_path') is None and cfg.get('child2_model') is None:
self.child2_model = None
else:
# if `child2_model_path` is set, model should be restored from nemo checkpoint
Collaborator

"nemo checkpoint" -> "nemo model" ?

Collaborator Author

@artbataev artbataev Jan 16, 2023

This is a checkpoint, so I changed it to ".nemo model checkpoint"

child2_path = os.path.join(tmpdir_child, 'child2.nemo')
child2.save_to(child2_path)

# create model with children using saved "nemo" checkpoints
Collaborator

@ericharper ericharper Jan 14, 2023

"nemo" checkpoints -> nemo models

Collaborator Author

".nemo model checkpoints"

tests/core/test_save_restore.py (fixed, resolved)
Signed-off-by: Vladimir Bataev
@titu1994
Collaborator

> Disallowing registering .nemo file as an artifact breaks existing code, so I reverted changes related to this check. See tests/collections/asr/test_asr_rnnt_encoder_model_bpe.py::TestEncDecRNNTBPEModel::test_save_restore_nested_model

Enforce this check, remove/update that test. That test is a proxy of the old way I was trying to hack together multi model support.

Collaborator

@titu1994 titu1994 left a comment

This is ready to merge (after adding back the nemo file checks).
@ericharper for final review

Amazing work !

docs/source/core/core.rst (resolved)
self.register_nemo_submodule(
"child_model",
config_field="child_model",
model=ChildModel(self.cfg.child_model),
Collaborator

add the trainer=trainer part

Collaborator Author

done

# either if config provided initially, or automatically
# after model restoration
self.register_nemo_submodule(
"child_model",
Collaborator

Since we are using kwargs, add the key for the first arg

Collaborator Author

done

docs/source/core/core.rst (resolved)
nemo/core/classes/modelPT.py (resolved)

@artbataev
Collaborator Author

> > Disallowing registering .nemo file as an artifact breaks existing code, so I reverted changes related to this check. See tests/collections/asr/test_asr_rnnt_encoder_model_bpe.py::TestEncDecRNNTBPEModel::test_save_restore_nested_model
>
> Enforce this check, remove/update that test. That test is a proxy of the old way I was trying to hack together multi model support.

I added the check and removed the test

ericharper previously approved these changes Jan 20, 2023
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks for many changes, iterations, and documentation!

Signed-off-by: Vladimir Bataev
Signed-off-by: Vladimir Bataev
@artbataev
Collaborator Author

@titu1994, @ericharper
Thank you so much for your comments and discussion!

I have now updated the documentation and fixed the faulty test for NestedRNNTModel in tests/collections/asr/test_asr_rnnt_encoder_model_bpe.py instead of removing it.

I think it's now ready to be merged.

@SeanNaren
Collaborator

LGTM as well, absolutely bonkers you made this work! I hope this promotes new awesome clean multi-model pipelines in NeMo 🚀

artbataev and others added 2 commits January 23, 2023 20:18
Co-authored-by: Sean Naren
Signed-off-by: Vladimir Bataev
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks!

@artbataev artbataev merged commit 97973c5 into NVIDIA:main Jan 23, 2023
Kipok pushed a commit to Kipok/NeMo that referenced this pull request Jan 31, 2023
Nested NeMo models support

Signed-off-by: Vladimir Bataev

Co-authored-by: Oleksii Kuchaiev
Co-authored-by: Eric Harper
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sean Naren
ericharper added a commit that referenced this pull request Jan 31, 2023
ericharper added a commit that referenced this pull request Jan 31, 2023
Kipok pushed a commit to Kipok/NeMo that referenced this pull request Jan 31, 2023
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023