Improved random seed configuration for Lhotse dataloaders with docs #9001

pzelasko · 2024-04-22T19:01:46Z

What does this PR do?

- Adds comprehensive documentation on random seed configuration, its behavior, and its effects on dataloading.
- Makes backward-compatible changes to random seed configuration in training.
- Allows more precise control over randomness. This work was triggered by the need to support mixed data-parallel (DP) and tensor-parallel (TP) training. Before this PR, the dataloader could not be configured to yield identical data on ranks that share a DP rank but differ in TP rank, which caused deadlocks. This is now possible, but the user must take care to manually set a different random seed on each training continuation (this is documented).
- Adds unit tests to ensure the desired behavior in DDP setups.
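
To make the DP/TP requirement above concrete, here is a minimal Python sketch of the intended behavior. It is not the NeMo or Lhotse implementation. The helpers `dp_rank_of` and `sample_order` are hypothetical, and the rank layout (each TP group made of consecutive global ranks, no pipeline parallelism) is an assumption for illustration only.

```python
import random


def dp_rank_of(global_rank: int, tp_size: int) -> int:
    """Map a global rank to its data-parallel rank.

    Assumption (illustration only): each TP group consists of
    `tp_size` consecutive global ranks.
    """
    return global_rank // tp_size


def sample_order(base_seed: int, global_rank: int, tp_size: int, num_items: int = 16) -> list:
    """Return the order in which a rank would visit `num_items` examples.

    The RNG is seeded from the base seed and the DP rank only, so the TP
    rank has no influence on the resulting order.
    """
    dp_rank = dp_rank_of(global_rank, tp_size)
    rng = random.Random(f"{base_seed}:{dp_rank}")
    order = list(range(num_items))
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    # 4 GPUs with tp_size=2: ranks 0 and 1 share DP rank 0, ranks 2 and 3 share DP rank 1.
    for rank in range(4):
        print(rank, dp_rank_of(rank, tp_size=2), sample_order(1234, rank, tp_size=2))
```

Ranks inside one TP group see the same data order because only the DP rank enters the seed, while different DP ranks draw from differently seeded generators.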

Collection: ASR

Signed-off-by: Piotr Żelasko <[email protected]>

zhehuaichen

Looks good from my side.
Thanks for the quick fix!

zhehuaichen · 2024-04-23T05:24:19Z

docs/source/asr/datasets.rst

+
+    * This setup guarantees 100% dataloading reproducibility.
+
+    * Resuming training without changing of the ``seed`` value will cause the model to train on data it has already seen. For large data setups, not managing the ``seed`` may cause the model to never be trained on a majority of data. This is why this mode is not the default.


What's the recommended practice for managing the "seed" in a large data setup?

Generally, every time you resume you'd provide a different value via model.train_ds.seed=<val>. A true-enough random seed can be obtained on most systems by reading /dev/urandom, e.g. for a uint32 seed: RSEED=$(od -An -N4 -tu4 < /dev/urandom | tr -d ' '). If you have some sort of "launcher script" that queues multiple jobs, that would be the right place to use it. Let me update the docs with this example.

Ideally we'd be able to automate this seed management by keeping some state in the checkpoints, but at this point that would be scope creep.
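
For launcher scripts written in Python rather than shell, the same idea can be sketched as follows. This is a minimal illustration, not part of the PR: `fresh_seed` and `seed_override` are hypothetical helper names, and only the `model.train_ds.seed=<val>` override comes from the discussion above.

```python
import os


def fresh_seed() -> int:
    """Draw a uint32 seed from the OS entropy source.

    os.urandom reads from the same kind of source as /dev/urandom on
    Linux, so this mirrors the `od -An -N4 -tu4 < /dev/urandom` one-liner.
    """
    return int.from_bytes(os.urandom(4), byteorder="little")


def seed_override(seed: int) -> str:
    """Format the command-line override mentioned in the discussion."""
    return f"model.train_ds.seed={seed}"


if __name__ == "__main__":
    # A launcher would call this once per queued job (i.e. per training
    # continuation) and append the result to that job's command line.
    print(seed_override(fresh_seed()))
```

Four bytes are used so that the value stays within the uint32 range of the shell example.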

Signed-off-by: Piotr Żelasko <[email protected]>

erastorgueva-nv

I recommend removing the tabs before the bullet points. Currently, with the tabs, the bullet points are rendered inside block quotes.

docs/source/asr/datasets.rst

Co-authored-by: Elena Rastorgueva <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko · 2024-04-24T16:37:44Z

Thanks @erastorgueva-nv, for some reason I thought it was required by Sphinx. It should look much better indeed!

erastorgueva-nv · 2024-04-24T16:58:05Z

Docs LGTM

zhehuaichen

LGTM from my side

…9001)

* Improving RNG seeding with Lhotse dataloading
  Signed-off-by: Piotr Żelasko <[email protected]>
* Fix
  Signed-off-by: Piotr Żelasko <[email protected]>
* Add documentation about random seeds
  Signed-off-by: Piotr Żelasko <[email protected]>
* Add doc about managing random seed
  Signed-off-by: Piotr Żelasko <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Elena Rastorgueva <[email protected]>
  Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Signed-off-by: Ao Tang <[email protected]>

pzelasko added 4 commits April 16, 2024 16:42

Improving RNG seeding with Lhotse dataloading

7b03d3b

Signed-off-by: Piotr Żelasko <[email protected]>

Fix

08d8a84

Signed-off-by: Piotr Żelasko <[email protected]>

Merge branch 'main' into lhotse-random-seeds

250266b

Add documentation about random seeds

45ab785

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko requested review from krishnacpuvvada and zhehuaichen April 22, 2024 19:01

github-actions bot added ASR common labels Apr 22, 2024

pzelasko requested a review from erastorgueva-nv April 22, 2024 19:02

zhehuaichen reviewed Apr 23, 2024

pzelasko added 2 commits April 23, 2024 09:36

Add doc about managing random seed

7add45b

Signed-off-by: Piotr Żelasko <[email protected]>

Merge branch 'main' into lhotse-random-seeds

9905eda

erastorgueva-nv requested changes Apr 23, 2024

Apply suggestions from code review

ad895d6

Co-authored-by: Elena Rastorgueva <[email protected]> Signed-off-by: Piotr Żelasko <[email protected]>

erastorgueva-nv approved these changes Apr 24, 2024

zhehuaichen approved these changes Apr 25, 2024

pzelasko merged commit b8ad0a8 into main Apr 26, 2024
128 checks passed

pzelasko deleted the lhotse-random-seeds branch April 26, 2024 21:10
