fix tarred dataset len when num shards is not divisible by workers #4553
Conversation
Signed-off-by: Iztok Lebar Bajec <[email protected]>
Nice find. Thanks for fixing this! Looks good to me for NMT and Language Modeling, will let @ekmb and @PeganovAnton review for Punctuation and Capitalization.
Review thread on nemo/collections/nlp/data/token_classification/punctuation_capitalization_tarred_dataset.py (resolved)
Review thread on nemo/collections/nlp/data/token_classification/punctuation_capitalization_tarred_dataset.py (outdated, resolved)
Great fix! Reviewed the punctuation part.
While running NLP/PC on 8 GPUs I've noticed that the `max_steps` computation in nemo/core/optim/lr_scheduler.py (lines 861 to 882 at fea3775) is driven by the dataloader length. But the length used at nemo/core/optim/lr_scheduler.py (line 792 at fea3775) leads to a faulty computation of `max_steps`, i.e. instead of `max_epochs` it will optimise for `max_epochs/num_workers`.
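To make the failure mode concrete, here is a minimal sketch (not NeMo's actual scheduler code; the function, names, and numbers are all illustrative) of how an under-reported dataloader length shrinks `max_steps` by roughly the worker count, matching the `max_epochs/num_workers` effect described above:

```python
import math

def compute_max_steps(max_epochs: int, num_samples: int, batch_size: int, world_size: int) -> int:
    # max_steps is derived from steps-per-epoch, which in turn comes from
    # the dataset length reported by the dataloader.
    steps_per_epoch = math.ceil(num_samples / (batch_size * world_size))
    return max_epochs * steps_per_epoch

# Hypothetical numbers: if the dataset under-reports its length by a factor
# of world_size, the scheduler finishes its decay world_size times too early.
print(compute_max_steps(max_epochs=10, num_samples=146_000, batch_size=32, world_size=8))       # 5710
print(compute_max_steps(max_epochs=10, num_samples=146_000 // 8, batch_size=32, world_size=8))  # 720
```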
Since prior to this PR … Multiplying the …
@titu1994 I would like your thoughts on what …
ASR tarred datasets always return the total number of samples in the manifest - https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/data/audio_to_text.py#L765. For tarred datasets we then explicitly limit the train dataset to a fixed number of steps per epoch - https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/ctc_models.py#L420-L433
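A hedged sketch of that ASR pattern (the helper below is illustrative, not NeMo's actual code; `limit_train_batches` is the standard PyTorch Lightning trainer attribute):

```python
from math import ceil

def cap_train_batches(trainer, total_manifest_samples: int, batch_size: int, world_size: int) -> None:
    # The tarred dataset reports the full manifest size, so the number of
    # batches drawn per epoch is bounded explicitly on the trainer instead
    # of relying on the dataset's __len__.
    trainer.limit_train_batches = ceil(total_manifest_samples / (batch_size * world_size))
```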
Thanks @titu1994. Note though that when the number of shards is not divisible by `world_size`, the scatter logic in nemo/collections/asr/data/audio_to_text.py (lines 186 to 200 at fea3775) drops the trailing shards.
Assume you have 146 tarred files, each containing 1000 samples, and a `world_size` of 8: each worker takes 146 // 8 = 18 shards, so only 144 of the 146 tar files (144,000 of 146,000 samples) are ever used.
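Spelling out the arithmetic of that example (plain Python, not the actual dataset code):

```python
num_shards = 146         # tar files, 1000 samples each
samples_per_shard = 1000
world_size = 8           # GPUs / workers

shards_per_worker = num_shards // world_size    # 18
shards_used = shards_per_worker * world_size    # 144 of 146
samples_seen = shards_used * samples_per_shard  # 144000 of 146000
print(shards_per_worker, num_shards - shards_used, samples_seen)  # 18 2 144000
```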
In practice it doesn't matter, we shuffle tarfiles before each draw of samples anyway.
Review threads on nemo/collections/nlp/data/token_classification/punctuation_capitalization_tarred_dataset.py (outdated, resolved)
Review thread on nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py (outdated, resolved)
Review thread on nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py (resolved)
@titu1994 Correct me if I'm wrong, but based on the way it is implemented, even in ASR shuffling occurs after the WebDataset is built from the scattered subset of shards. Hence, if my understanding is correct, in the above case of 8 GPUs and 146 tar files with 1000 samples each, only the first 144 of 146 tar files will ever be used. A larger number of samples in … What I am seeing in NLP/PC is that in the above case: …

I can: …

Let me know which solution you prefer. Personally, I would opt for the second, especially since some collections use tarred datasets also for validation, where exclusion of data may be acceptable, but oversampling may be less so.
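For reference, a minimal sketch of the two `shard_strategy` behaviours under discussion (illustrative only; the real datasets wrap the selected shards in a WebDataset pipeline):

```python
def select_shards(shard_paths: list, shard_strategy: str, global_rank: int, world_size: int) -> list:
    if shard_strategy == "scatter":
        # Disjoint, equally sized slice per worker; when the shard count is
        # not divisible by world_size, the trailing remainder is dropped.
        shards_per_worker = len(shard_paths) // world_size
        begin = shards_per_worker * global_rank
        return shard_paths[begin:begin + shards_per_worker]
    if shard_strategy == "replicate":
        # Every worker sees every shard: nothing is excluded, but data is
        # effectively oversampled across workers.
        return list(shard_paths)
    raise ValueError(f"Invalid shard strategy: {shard_strategy}")
```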
No, WebDataset shuffles the tarfile list every time `__iter__` is called, which is after every epoch - https://github.com/webdataset/webdataset/blob/0.1.65/webdataset/dataset.py#L118
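That behaviour can be pictured with a small generic sketch (a toy IterableDataset, not WebDataset's actual implementation): the DataLoader re-creates the iterator at the start of each epoch, so the shard order changes every epoch.

```python
import random
from torch.utils.data import IterableDataset

class ReshufflingShardDataset(IterableDataset):
    def __init__(self, shard_paths, seed=0):
        self.shard_paths = list(shard_paths)
        self.rng = random.Random(seed)

    def __iter__(self):
        # Called once per epoch by the DataLoader; reshuffle the shard list
        # so every epoch visits shards in a new order.
        shards = self.shard_paths[:]
        self.rng.shuffle(shards)
        for shard in shards:
            yield from self._read_shard(shard)

    def _read_shard(self, shard):
        # Placeholder: a real implementation would open the tar file and
        # yield decoded samples from it.
        yield shard
```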
Force-pushed from f7b21dc to cd25832
@titu1994 The DCO required me to do a rebase, which I followed to the letter; hope this didn't mess things up.
DCO Rebase always messes things up. I think you need to …
I'm a bit lost here; I don't have much experience with rebasing. What I'm seeing is that under the … I have a backup of the repo prior to doing the rebase, if this helps. Please advise on what I should do, or let me know if the status is acceptable the way it is.
Skip DCO for now, we will override it.
To fix this, ignore the DCO bot and rebase onto the main branch using an IDE in interactive mode; there shouldn't be any conflicts. Finally, force-push. DCO messed up here and made you sign commits you didn't touch.
The way it is right now is definitely not what it's supposed to be before merging; reverting back to what you had before the DCO-related commits might be the best thing to do, and we can set DCO to pass.
Agreed, let's get the PR back to its original state and override DCO.
Force-pushed from b577da4 to f7b21dc
OK. This should be it then.
@titu1994 should I wait until DCO is resolved?
I just set DCO to pass. @PeganovAnton please take a final look.
Review thread on nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py (outdated, resolved)
Review thread on nemo/collections/nlp/data/token_classification/punctuation_capitalization_tarred_dataset.py (outdated, resolved)
Signed-off-by: Iztok Lebar Bajec <[email protected]>
Signed-off-by: Iztok Lebar Bajec <[email protected]>
@MaximumEntropy you'll probably have to once again set DCO to pass.
@titu1994 @MaximumEntropy I guess one of you will need to set DCO to pass. If I follow its instructions and rebase, it will mess things up.
fix tarred dataset len when num shards is not divisible by workers (NVIDIA#4553)

* fix tarred dataset len when num shards is not divisible by workers
* update error reporting on invalid `shard_strategy`
* update NLP/PC tarred dataset docstring
* add `shard_strategy` to NLP/PC `@dataclass`
* update NLP/PC tarred dataset docstring
* add `shard_strategy` to NLP/PC docs
* revert test with DataLoader returning the actual data length
* make dataloader return the actual number of samples, set `limit_train_batches` on `setup_*`
* update `shard_strategy` docstrings
* update `tarred_dataset` documentation
* fix style
* update documentation
* updated docstrings

Signed-off-by: Iztok Lebar Bajec <[email protected]>
Co-authored-by: PeganovAnton <[email protected]>
Signed-off-by: David Mosallanezhad <[email protected]>
Signed-off-by: Iztok Lebar Bajec <[email protected]>
What does this PR do?
Fixes #4522. When using tarred datasets and the number of shards is not divisible by `world_size`, NeMo logs a warning, but training continues. However, in this case the computation of `DataLoader.__len__()` becomes incorrect. This leads to an invalid `lr_scheduler.max_steps`, and other issues down the road, one example being the validation check not running at the end of an epoch.

Collection: [nlp]
Changelog
In case of `shard_strategy='scatter'` the `__len__()` is computed based on the number of shards actually taken, assuming the total `num_batches` are equally distributed over all shards, i.e. that each shard contains the same number of batches (see the sketch below). In addition, following other implementations, this PR adds `shard_strategy` support to the NLP/PC tarred datasets.
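A sketch of that length computation (illustrative names and simplified logic, not the exact code added by this PR):

```python
def scatter_len(total_num_batches: int, num_shards: int, world_size: int) -> int:
    # Assumes batches are spread evenly over shards, i.e. every shard holds
    # total_num_batches // num_shards batches, and that under 'scatter' each
    # worker takes num_shards // world_size shards.
    batches_per_shard = total_num_batches // num_shards
    shards_taken_per_worker = num_shards // world_size
    return batches_per_shard * shards_taken_per_worker
```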
There are two additional places where tarred datasets are used:

- nemo/collections/asr/data/audio_to_text.py (line 186 at fa2e55e), which, based on the implementation, does not seem to be affected by this issue, and
- nemo/collections/nlp/data/language_modeling/l2r_lm_dataset.py (lines 169 to 184 at fa2e55e), which might be, but I chose to leave it as is, as its computation of `__len__()` (lines 239 to 240 at fa2e55e) greatly differs from the others.
PR Type:
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Note to @PeganovAnton for NLP/PC.
Additional Information