Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Codec] Update codec checkpoint config #7835

Merged
merged 1 commit into from
Nov 16, 2023

Conversation

anteju
Copy link
Collaborator

@anteju anteju commented Oct 30, 2023

What does this PR do ?

Update codec configs to always save NeMo checkpoint with best val_loss.

Collection: TTS

Changelog

  • Updated checkpoint_callback_params to always save NeMo checkpoint with best val_loss and to save 5 best checkpoints
  • Set create_wandb_logger to true, since model.log_config.log_wandb is set to true by default

Usage

N/A

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

nithinraok
nithinraok previously approved these changes Nov 1, 2023
examples/tts/conf/audio_codec/audio_codec_24000.yaml Outdated Show resolved Hide resolved
create_wandb_logger: true
wandb_logger_kwargs:
name: ${name}
project: ???
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the convention that all ??? be at the top of the file? I think TTS is the only domain which does it currently.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference.
We could move to the top, set it to project: audio_codec or possibly remove altogether (wandb.init API indicates it's optional, but haven't tested it).

Copy link
Collaborator

@rlangman rlangman Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for wandb the "name" field is supposed to be the name of a specific run to track. "project" is the category to sort runs by.

So project could be set to ${name} or some other hardcoded value. Name can be removed, assuming user will override it if they want to organize their runs (if not provided then wandb generates a random name like happy-giraffe-42)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can set both to null, seems to be common across other recipes:

  wandb_logger_kwargs:
    name: null
    project: null

@anteju anteju force-pushed the pr/codec-checkpoint-callback-params branch 2 times, most recently from 03e4712 to a2f521e Compare November 9, 2023 02:33
@anteju
Copy link
Collaborator Author

anteju commented Nov 9, 2023

jenkins

@anteju
Copy link
Collaborator Author

anteju commented Nov 9, 2023

CI failing with an unrelated error

FAILED tests/core/test_save_restore.py::TestSaveRestore::test_hf_model_info_with_card_data - AssertionError: assert not True

 +  where True = hasattr(ModelInfo(id='Koleshjr/stt_sw_40k_10k_3e_4_5_epoch', author='Koleshjr', sha='bb602f31a8bdd466af474b8bfb735ebf26bded51', last_modified=datetime.datetime(2023, 11, 8, 12, 41, 13, tzinfo=datetime.timezone.utc), private=False, gated=None, disabled=None, downloads=92, likes=0, library_name='nemo', tags=['nemo', 'region:us'], pipeline_tag=None, mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=[RepoSibling(rfilename='.gitattributes', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='readme_template.md', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='stt_sw_40k_10k_3e_4_5_epoch.nemo', size=None, blob_id=None, lfs=None)], spaces=None, safetensors=None), 'cardData')

= 1 failed, 1562 passed, 267 skipped, 22 deselected, 530 warnings in 834.39s (0:13:54) =

script returned exit code 1

@anteju anteju force-pushed the pr/codec-checkpoint-callback-params branch from a2f521e to f99f5a1 Compare November 9, 2023 23:40
Copy link
Collaborator

@nithinraok nithinraok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ante. LGTM.

@anteju
Copy link
Collaborator Author

anteju commented Nov 10, 2023

CI failing on test-nlp-imports, seems unrelated and other PRs are failing atm.

@nithinraok
Copy link
Collaborator

They are coming from megatron.
ModuleNotFoundError: No module named 'transformer_engine

@anteju anteju force-pushed the pr/codec-checkpoint-callback-params branch from f99f5a1 to 8f012da Compare November 14, 2023 23:14
@anteju
Copy link
Collaborator Author

anteju commented Nov 14, 2023

Rebasing onto main to hopefully resolve CI failure.

@anteju anteju force-pushed the pr/codec-checkpoint-callback-params branch from 8f012da to 9ba6d81 Compare November 15, 2023 22:01
@anteju
Copy link
Collaborator Author

anteju commented Nov 15, 2023

jenkins

@anteju anteju merged commit b4438a3 into NVIDIA:main Nov 16, 2023
11 checks passed
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants