Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved random seed configuration for Lhotse dataloaders with docs #9001

Merged
merged 7 commits into from
Apr 26, 2024

Conversation

pzelasko
Copy link
Collaborator

What does this PR do ?

  • Comprehensive documentation about random seed configuration, behavior, and effects on dataloading.
  • Backward-compatible changes to random seed configuration in training.
  • Allows more precise control over randomness; this work was triggered by the need to support mixed data-parallel (DP) and tensor-parallel (TP) training. Before this PR, it wasn't possible to configure the dataloader to yield identical data on identical DP ranks but different TP ranks, causing deadlocks. Now it's possible but the user has to be careful about manually setting different random seeds on training continuations (this is documented).
  • Unit tests to ensure the desired behavior in DDP setups.

Collection: ASR

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Copy link
Collaborator

@zhehuaichen zhehuaichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side.
Thanks for the quick fix!


* This setup guarantees 100% dataloading reproducibility.

* Resuming training without changing of the ``seed`` value will cause the model to train on data it has already seen. For large data setups, not managing the ``seed`` may cause the model to never be trained on a majority of data. This is why this mode is not the default.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the recommended practice on managing the "seed" on large data setup?

Copy link
Collaborator Author

@pzelasko pzelasko Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally every time you resume, you'd provide a different value to model.train_ds.seed=<val>. A true-enough random seed can be obtained on most systems by reading /dev/urandom, e.g. uint32 seed: RSEED=$(od -An -N4 -tu4 < /dev/urandom | tr -d ' '). If you have some sort of "launcher script" that queues multiple jobs, this would be the right place to use this. Let me update the docs with this example.

Ideally we'd be able to automate this seed management thing by keeping some state in the checkpoints, but at this point it'd be a scope creep.

Copy link
Collaborator

@erastorgueva-nv erastorgueva-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend to get rid of the tabs before the bullet points. Currently, with the tabs, the bullet points are placed inside block quotes

docs/source/asr/datasets.rst Outdated Show resolved Hide resolved
docs/source/asr/datasets.rst Outdated Show resolved Hide resolved
docs/source/asr/datasets.rst Outdated Show resolved Hide resolved
docs/source/asr/datasets.rst Outdated Show resolved Hide resolved
Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
@pzelasko
Copy link
Collaborator Author

Thanks @erastorgueva-nv, for some reason I thought it's required by sphinx. It should look much better indeed!

@erastorgueva-nv
Copy link
Collaborator

Docs LGTM

Copy link
Collaborator

@zhehuaichen zhehuaichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my side

@pzelasko pzelasko merged commit b8ad0a8 into main Apr 26, 2024
128 checks passed
@pzelasko pzelasko deleted the lhotse-random-seeds branch April 26, 2024 21:10
suiyoubi pushed a commit that referenced this pull request May 2, 2024
…9001)

* Improving RNG seeding with Lhotse dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Add documentation about random seeds

Signed-off-by: Piotr Żelasko <[email protected]>

* Add doc about managing random seed

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…VIDIA#9001)

* Improving RNG seeding with Lhotse dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Add documentation about random seeds

Signed-off-by: Piotr Żelasko <[email protected]>

* Add doc about managing random seed

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply suggestions from code review

Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants