Fix issue with Segfault in ASR models #3956

Merged
titu1994 merged 3 commits into NVIDIA:r1.8.0 from fix_skip_nan_grad on Apr 8, 2022

Conversation

titu1994
Collaborator

@titu1994 titu1994 commented Apr 8, 2022

Signed-off-by: smajumdar [email protected]

What does this PR do?

Fixes a segfault in NeMo ASR models caused by Hydra / OmegaConf config access during DDP training.

Collection: [ASR]

Changelog

  • Add a new flag and a method that pre-allocate the value of the check, so the config is not read during training.
  • Update the skip_nan_grad code to support global access across DDP ranks.
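
The two changelog items can be illustrated with a small, pure-Python sketch. It is not the code from this PR: `TinyModel`, `setup_flags`, `local_grads_valid`, and the `all_rank_flags` argument are hypothetical names, gradients are plain floats, and the cross-rank reduction is simulated with `min` over a list. In a real DDP run that step would presumably be a collective such as `torch.distributed.all_reduce` with `ReduceOp.MIN`.

```python
import math


class TinyModel:
    """Hypothetical stand-in for a model; not NeMo's actual class."""

    def __init__(self, cfg):
        self.cfg = cfg            # stands in for an OmegaConf config object
        self.grads = []           # stands in for parameter gradients
        self._skip_nan_grad = False

    def setup_flags(self):
        # Read the config once, before training starts, and keep a plain bool.
        # Training hooks then never touch the config object again.
        self._skip_nan_grad = bool(self.cfg.get("skip_nan_grad", False))

    def local_grads_valid(self):
        # 1 if every local gradient is finite (no NaN or Inf), else 0.
        return int(all(math.isfinite(g) for g in self.grads))

    def on_after_backward(self, all_rank_flags):
        # all_rank_flags simulates the validity flags gathered from every rank.
        # Returns True when the step is skipped (gradients zeroed).
        if not self._skip_nan_grad:
            return False
        if min(all_rank_flags) < 1:
            # At least one rank saw a bad gradient, so every rank skips.
            self.grads = [0.0] * len(self.grads)
            return True
        return False


# Two simulated ranks; only rank 1 produces a NaN gradient.
ranks = [TinyModel({"skip_nan_grad": True}) for _ in range(2)]
for model in ranks:
    model.setup_flags()
ranks[0].grads = [0.1, -0.2]
ranks[1].grads = [float("nan"), 0.3]

flags = [model.local_grads_valid() for model in ranks]
skipped = [model.on_after_backward(flags) for model in ranks]
```

Because the decision is taken on the minimum of all flags, both simulated ranks skip the step together, which keeps them in sync even though only one of them saw a NaN.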

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Collaborator

@VahidooX VahidooX left a comment

LGTM!

@titu1994 titu1994 merged commit a3fb54c into NVIDIA:r1.8.0 Apr 8, 2022
@titu1994 titu1994 deleted the fix_skip_nan_grad branch April 8, 2022 20:06
ericharper pushed a commit that referenced this pull request Apr 8, 2022
* Fix issue with Segfault in ASR models

Signed-off-by: smajumdar <[email protected]>

* Add docstring

Signed-off-by: smajumdar <[email protected]>
titu1994 added a commit that referenced this pull request Apr 9, 2022
* update version

Signed-off-by: ericharper <[email protected]>

* Stateless timer fix for PTL 1.6 (#3925)

* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <[email protected]>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <[email protected]>

* Fix year

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Remove unused imports

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* GPU test

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* clean import

Signed-off-by: ericharper <[email protected]>

Co-authored-by: ericharper <[email protected]>

* fix save_best missing chpt bug, update for setup_tokenizer() changes (#3932)

* fix save_best missing chpt bug, update for setup_tokenizer() changes

Signed-off-by: ekmb <[email protected]>

* style fix

Signed-off-by: ekmb <[email protected]>

* Fix divide by world size (#3941)

Signed-off-by: MaximumEntropy <[email protected]>

* remove old doc (#3946)

Signed-off-by: ekmb <[email protected]>

* Fix issues with librosa deprecations (#3950)

Signed-off-by: smajumdar <[email protected]>

* Fix issue with Segfault in ASR models (#3956)

* Fix issue with Segfault in ASR models

Signed-off-by: smajumdar <[email protected]>

* Add docstring

Signed-off-by: smajumdar <[email protected]>

* Fix notebook bugs for branch r1.8.0 (#3948)

* load the model from ngc

Signed-off-by: Yi Dong <[email protected]>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <[email protected]>

* fix the typos

Signed-off-by: Yi Dong <[email protected]>

* remove output

Signed-off-by: Yi Dong <[email protected]>

* fix isort

Signed-off-by: Yi Dong <[email protected]>

* fix merge error

Signed-off-by: Yi Dong <[email protected]>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <[email protected]>

* fix unit test

Signed-off-by: Yi Dong <[email protected]>

* fix ci

Signed-off-by: Yi Dong <[email protected]>

* fix ci bert pretraining

Signed-off-by: Yi Dong <[email protected]>

* make it compatible with main

Signed-off-by: Yi Dong <[email protected]>

* add the teste for biomegatron ner

Signed-off-by: Yi Dong <[email protected]>

* fix argument

Signed-off-by: Yi Dong <[email protected]>

* fix usablity issue

Signed-off-by: Yi Dong <[email protected]>

* work around

Signed-off-by: Yi Dong <[email protected]>

Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix global batch fit loop (#3936)

* add lightning module hooks for global batch

Signed-off-by: ericharper <[email protected]>

* clean scripts

Signed-off-by: ericharper <[email protected]>

* style

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* DP=1 fix

Signed-off-by: MaximumEntropy <[email protected]>

* set num dataset workers to 2

Signed-off-by: ericharper <[email protected]>

* update validation_loop with GlobalDataFetcher

Signed-off-by: ericharper <[email protected]>

* add test global data fetcher

Signed-off-by: ericharper <[email protected]>

* Drop last for test ds

Signed-off-by: MaximumEntropy <[email protected]>

* Fix test epoch end

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Fix eval

Signed-off-by: MaximumEntropy <[email protected]>

* Fix reconfigure microbatch in the complete method

Signed-off-by: MaximumEntropy <[email protected]>

* add comments

Signed-off-by: MaximumEntropy <[email protected]>

* Set init consumed samples

Signed-off-by: MaximumEntropy <[email protected]>

* fix shuffle

Signed-off-by: MaximumEntropy <[email protected]>

* add save_restore_connector arg

Signed-off-by: ericharper <[email protected]>

* Fix padding for labels and loss mask

Signed-off-by: MaximumEntropy <[email protected]>

* GLUE/XNLI CI tests

Signed-off-by: MaximumEntropy <[email protected]>

* limit val batches in hydra fix

Signed-off-by: MaximumEntropy <[email protected]>

* Restart CI

Signed-off-by: MaximumEntropy <[email protected]>

* Fix unittest

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: MaximumEntropy <[email protected]>

* Update max_epochs on megatron configs (#3958)

* update config

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* update branch

Signed-off-by: ericharper <[email protected]>

* update version

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Sandeep Subramanian <[email protected]>
Co-authored-by: Evelina <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Yi Dong <[email protected]>