Issues with long cutset during zipformer training #1844

Open
bsshruthi22 opened this issue Dec 20, 2024 · 8 comments

@bsshruthi22

Hello all,
We are training a Zipformer model on about 3400 hours of Tamil data. This is in reference to #1751.
We have an NVIDIA A6000 GPU (48 GB; the log below reports 47.54 GiB total capacity) and are getting the error below:
2024-12-19 16:44:26,604 INFO [asr_datamodule.py:375] About to create dev dataloader
2024-12-19 16:44:26,604 INFO [train.py:1326] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic.py:342: UserWarning: We have exceeded the max_duration constraint during sampling but have only 1 cut. This is likely because max_duration was set to a very low value ~10s, or you're using a CutSet with very long cuts (e.g. 100s of seconds long).
warnings.warn(
2024-12-19 16:58:57,481 ERROR [train.py:1345] Your GPU ran out of memory with the current max_duration setting. We recommend decreasing max_duration and trying again.
Failing criterion: single_longest_cut (=162.26) ...

2024-12-19 16:58:57,482 INFO [train.py:1304] Saving batch to zipformer/exp/batch-6c307511-b2b9-437a-28df-6ec4ce4a2bbd.pt
2024-12-19 16:59:01,314 INFO [train.py:1310] features shape: torch.Size([7, 16226, 80])
2024-12-19 16:59:01,315 INFO [train.py:1314] num tokens: 817
Traceback (most recent call last):
  File "./zipformer/train.py", line 1380, in <module>
    main()
  File "./zipformer/train.py", line 1373, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1225, in run
    scan_pessimistic_batches_for_oom(
  File "./zipformer/train.py", line 1341, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/armuser/k2_249/icefall/egs/tamil/ASR/zipformer/scaling.py", line 313, in backward
    x_grad = x_grad - ans * x_grad.sum(dim=ctx.dim, keepdim=True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 37.04 GiB already allocated; 3.94 GiB free; 41.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Training command: ./zipformer/train.py --world-size 1 --num-epochs 30 --start-batch 336000 --use-fp16 1 --exp-dir zipformer/exp --max-duration 100
We had initially set the max duration to 150.
Training completed 4 epochs, and then we hit the above issue. Loading the saved batch .pt file showed the following:
'sequence_idx': tensor([0, 1, 2, 3, 4, 5, 6], dtype=torch.int32),
'start_frame': tensor([0, 0, 0, 0, 0, 0, 0], dtype=torch.int32),
'num_frames': tensor([16226, 2022, 1926, 1744, 1716, 1676, 1613], dtype=torch.int32),
'cut': [MonoCut(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, supervisions=[SupervisionSegment(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, text='அவ்வையார் விருது தமிழ்நாட்டில் சமூகநலப் பணிகளை அரப்பணிப்புடன் செயலாற்றியதாக 2020ஆம் ஆண்டிற்கான அவ்வையார் விருதுக்கு தேர்வு செய்யப்பட்ட திருவண்ணாமலையைச் சேர்ந்த சமூக சேவகி திருமதி', language=None, speaker='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', gender=None, custom={'origin': 'giga'}, alignment=None)], features=Features(type='kaldifeat-fbank', num_frames=16226, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0, duration=162.26, storage_type='lilcom_chunky', storage_path='/home/armuser/10TBHDD/CUDA_11.6/icefall/egs/tamil/ASR/data/fbank/train_split/tamil_feats_train_00032581.lca', storage_key='964876,45872,45111,44652,45255,45498,44806,45091,45317,44865,45016,44804,44720,44784,44749,45046,44983,44943,45297,44866,45335,45125,45507,44978,44909,44841,44914,44718,44569,45297,44670,45390,44619,20203', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', channels=0), recording=Recording(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', sources=[AudioSource(type='file', channels=[0], source='/media/ASR_database/shruthilipi_data/tamil/newsonair_renamed/Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97.wav')], sampling_rate=8000, num_samples=1298080, duration=162.26, channel_ids=[0], transforms=None),
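
To make the effect of the single 162.26 s cut concrete, here is a small arithmetic sketch based only on the numbers in the dump above. It assumes the batch is padded to its longest cut, which is what the logged feature shape torch.Size([7, 16226, 80]) suggests. The variable names are illustrative and not taken from train.py.

```python
# Frame counts copied from the 'num_frames' tensor in the saved batch.
num_frames = [16226, 2022, 1926, 1744, 1716, 1676, 1613]
frame_shift = 0.01  # seconds per frame, from Features(frame_shift=0.01)

# Frames that carry real audio versus frames in the padded tensor
# (assumption: every cut is padded up to the longest one in the batch).
real_frames = sum(num_frames)
padded_frames = len(num_frames) * max(num_frames)
padding_fraction = 1 - real_frames / padded_frames

print(f"real audio:    {real_frames} frames ({real_frames * frame_shift:.2f} s)")
print(f"padded tensor: {padded_frames} frames ({padded_frames * frame_shift:.2f} s)")
print(f"share of the tensor that is padding: {padding_fraction:.1%}")
```

Under that assumption, most of the tensor the model processes for this batch is padding introduced by the one long cut.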

Kindly suggest how to go about this issue.

@csukuangfj
Collaborator

We have a function in train.py that removes long and short utterances; it is enabled by default.
Please don't disable it.

@bsshruthi22
Author

@csukuangfj in train.py there is train_cuts = train_cuts.filter(remove_short_utt). I was not able to find any option for long utt.

@csukuangfj
Collaborator

> @csukuangfj in train.py there is train_cuts = train_cuts.filter(remove_short_utt). I was not able to find any option for long utt.

Which file are you referring to?

Please recheck.

@bsshruthi22
Author

@csukuangfj
Collaborator

Please refer to the librispeech recipe.

@csukuangfj
Collaborator

def remove_short_and_long_utt(c: Cut):
    # Keep only utterances with duration between 1 second and 20 seconds
    #
    # Caution: There is a reason to select 20.0 here. Please see
    # ../local/display_manifest_statistics.py
    #
    # You should use ../local/display_manifest_statistics.py to get
    # an utterance duration distribution for your dataset to select
    # the threshold

@bsshruthi22

Please read the comment in train.py carefully.
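
As a rough illustration of the duration filter that the quoted comment describes, the sketch below keeps only cuts between 1 and 20 seconds. It is not the recipe's actual function body, which the excerpt above does not show. FakeCut is a hypothetical stand-in for a lhotse Cut so that the example runs without lhotse, and the 0.5 s entry is made up for illustration; the other durations are derived from the num_frames values in the saved batch.

```python
from dataclasses import dataclass


@dataclass
class FakeCut:
    """Hypothetical stand-in for a lhotse Cut; only id and duration are modelled."""
    id: str
    duration: float


# Thresholds from the quoted comment; pick your own from your duration distribution.
MIN_DURATION = 1.0
MAX_DURATION = 20.0


def remove_short_and_long_utt(c) -> bool:
    # Return True to keep the cut, False to drop it.
    return MIN_DURATION <= c.duration <= MAX_DURATION


cuts = [
    FakeCut("long", 162.26),
    FakeCut("b", 20.22),
    FakeCut("c", 19.26),
    FakeCut("d", 17.44),
    FakeCut("e", 17.16),
    FakeCut("f", 16.76),
    FakeCut("g", 16.13),
    FakeCut("short", 0.5),  # made-up example of a too-short cut
]

kept = [c for c in cuts if remove_short_and_long_utt(c)]
dropped = [c for c in cuts if not remove_short_and_long_utt(c)]

print("longest duration before filtering:", max(c.duration for c in cuts))
print("kept:", [c.id for c in kept])
print("dropped:", [c.id for c in dropped])
```

With a real CutSet, the same predicate would be applied the way the thread already shows for remove_short_utt, that is, train_cuts = train_cuts.filter(remove_short_and_long_utt).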

@bsshruthi22
Author

@csukuangfj OK, thanks for your suggestion. The training has now resumed. Hopefully it completes without any errors.

@bsshruthi22
Author

bsshruthi22 commented Dec 29, 2024

@csukuangfj Is there any way to retain audio that is longer than 20 s or shorter than 1 s by modifying the cuts, so that training does not fail?
