Issues with long cutset during zipformer training #1844

bsshruthi22 · 2024-12-20T16:56:56Z

Hello all,
We are training a zipformer model on about 3400 hours of Tamil data. This is in reference to #1751.
We have an NVIDIA A6000 50GB GPU and are getting the error below:
2024-12-19 16:44:26,604 INFO [asr_datamodule.py:375] About to create dev dataloader
2024-12-19 16:44:26,604 INFO [train.py:1326] Sanity check -- see if any of the batches in epoch 1 would cause OOM.
/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/dynamic.py:342: UserWarning: We have exceeded the max_duration constraint during sampling but have only 1 cut. This is likely because max_duration was set to a very low value ~10s, or you're using a CutSet with very long cuts (e.g. 100s of seconds long).
warnings.warn(
2024-12-19 16:58:57,481 ERROR [train.py:1345] Your GPU ran out of memory with the current max_duration setting. We recommend decreasing max_duration and trying again.
Failing criterion: single_longest_cut (=162.26) ...
2024-12-19 16:58:57,482 INFO [train.py:1304] Saving batch to zipformer/exp/batch-6c307511-b2b9-437a-28df-6ec4ce4a2bbd.pt
2024-12-19 16:59:01,314 INFO [train.py:1310] features shape: torch.Size([7, 16226, 80])
2024-12-19 16:59:01,315 INFO [train.py:1314] num tokens: 817
Traceback (most recent call last):
File "./zipformer/train.py", line 1380, in <module>
main()
File "./zipformer/train.py", line 1373, in main
run(rank=0, world_size=1, args=args)
File "./zipformer/train.py", line 1225, in run
scan_pessimistic_batches_for_oom(
File "./zipformer/train.py", line 1341, in scan_pessimistic_batches_for_oom
loss.backward()
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/armuser/anaconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/armuser/k2_249/icefall/egs/tamil/ASR/zipformer/scaling.py", line 313, in backward
x_grad = x_grad - ans * x_grad.sum(dim=ctx.dim, keepdim=True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.86 GiB (GPU 0; 47.54 GiB total capacity; 37.04 GiB already allocated; 3.94 GiB free; 41.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Training command: ./zipformer/train.py --world-size 1 --num-epochs 30 --start-batch 336000 --use-fp16 1 --exp-dir zipformer/exp --max-duration 100
Initially we had set the max duration to 150.
Training had completed 4 epochs before the above issue occurred. We loaded the batch.pt file and got the following:
'sequence_idx': tensor([0, 1, 2, 3, 4, 5, 6], dtype=torch.int32), 'start_frame': tensor([0, 0, 0, 0, 0, 0, 0], dtype=torch.int32), 'num_frames': tensor([16226, 2022, 1926, 1744, 1716, 1676, 1613], dtype=torch.int32), 'cut': [MonoCut(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, supervisions=[SupervisionSegment(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', start=0.0, duration=162.26, channel=0, text='அவ்வையார் விருது தமிழ்நாட்டில் சமூகநலப் பணிகளை அரப்பணிப்புடன் செயலாற்றியதாக 2020ஆம் ஆண்டிற்கான அவ்வையார் விருதுக்கு தேர்வு செய்யப்பட்ட திருவண்ணாமலையைச் சேர்ந்த சமூக சேவகி திருமதி', language=None, speaker='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', gender=None, custom={'origin': 'giga'}, alignment=None)], features=Features(type='kaldifeat-fbank', num_frames=16226, num_features=80, frame_shift=0.01, sampling_rate=8000, start=0, duration=162.26, storage_type='lilcom_chunky', storage_path='/home/armuser/10TBHDD/CUDA_11.6/icefall/egs/tamil/ASR/data/fbank/train_split/tamil_feats_train_00032581.lca', storage_key='964876,45872,45111,44652,45255,45498,44806,45091,45317,44865,45016,44804,44720,44784,44749,45046,44983,44943,45297,44866,45335,45125,45507,44978,44909,44841,44914,44718,44569,45297,44670,45390,44619,20203', recording_id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', channels=0), recording=Recording(id='Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97', sources=[AudioSource(type='file', channels=[0], source='/media/ASR_database/shruthilipi_data/tamil/newsonair_renamed
/Regional-Tiruchirapalli-Tamil-1345-202039142853_sent_97.wav')], sampling_rate=8000, num_samples=1298080, duration=162.26, channel_ids=[0], transforms=None),
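The numbers in the dump above can be checked with a little arithmetic. The following is a minimal sketch that only uses the `num_frames` values and the `frame_shift` shown in the dump; the variable names are illustrative and not part of icefall or lhotse. It shows that one 162.26 s cut forces every other cut in the batch to be padded to 16226 frames, which matches the logged features shape of [7, 16226, 80].

```python
# Values copied from the 'num_frames' tensor and 'frame_shift' in the dump above.
num_frames = [16226, 2022, 1926, 1744, 1716, 1676, 1613]
frame_shift = 0.01  # seconds per frame

# Duration of each cut in seconds.
durations = [round(n * frame_shift, 2) for n in num_frames]

# The batch is padded to the longest cut, so every row has max(num_frames) frames.
padded_frames = len(num_frames) * max(num_frames)
real_frames = sum(num_frames)

print("durations (s):", durations)
print("padded frames:", padded_frames)
print("real frames:", real_frames)
print("fraction of real frames: %.3f" % (real_frames / padded_frames))
```

Under these numbers, fewer than a quarter of the frames in the padded batch carry real audio; the rest is padding caused by the single long cut.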

Kindly suggest how to go about this issue.

csukuangfj · 2024-12-20T18:56:05Z

We have a function in train.py that removes long and short utterances. It is enabled by default.
Please don't disable it.

bsshruthi22 · 2024-12-21T04:45:53Z

@csukuangfj in train.py there is train_cuts = train_cuts.filter(remove_short_utt). I was not able to find any option for long utt.

csukuangfj · 2024-12-21T04:54:46Z

> @csukuangfj in train.py there is train_cuts = train_cuts.filter(remove_short_utt). I was not able to find any option for long utt.

Which file are you referring to?

Please recheck.

bsshruthi22 · 2024-12-21T05:21:02Z

@csukuangfj I am using this file : https://github.com/k2-fsa/icefall/blob/master/egs/gigaspeech/ASR/zipformer/train.py

csukuangfj · 2024-12-21T09:24:04Z

Please refer to the librispeech recipe.

csukuangfj · 2024-12-24T02:25:20Z

icefall/egs/librispeech/ASR/zipformer/train.py

Lines 1377 to 1385 in ad966fb

    
    def remove_short_and_long_utt(c: Cut):
        # Keep only utterances with duration between 1 second and 20 seconds
        #
        # Caution: There is a reason to select 20.0 here. Please see
        # ../local/display_manifest_statistics.py
        #
        # You should use ../local/display_manifest_statistics.py to get
        # an utterance duration distribution for your dataset to select
        # the threshold

@bsshruthi22

Please read the comment in train.py carefully.
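For readers following along, the duration check described in that comment can be sketched as below. This is a minimal illustration, not the recipe's actual code: `FakeCut` is a hypothetical stand-in for a lhotse `Cut` so the example runs without lhotse, the `min_duration` and `max_duration` parameters are added here for illustration, and the function in the librispeech recipe may contain additional checks.

```python
from dataclasses import dataclass


@dataclass
class FakeCut:
    """Hypothetical stand-in for a lhotse Cut; only the fields used here."""

    id: str
    duration: float  # seconds


def remove_short_and_long_utt(
    c, min_duration: float = 1.0, max_duration: float = 20.0
) -> bool:
    """Return True to keep the cut, False to drop it."""
    return min_duration <= c.duration <= max_duration


# Illustrative cuts: one too short, one in range, one as long as the failing cut.
cuts = [FakeCut("a", 0.4), FakeCut("b", 7.5), FakeCut("c", 162.26)]
kept = [c for c in cuts if remove_short_and_long_utt(c)]
print([c.id for c in kept])
```

In a recipe, a predicate like this would be applied the same way as the existing call mentioned earlier in the thread, i.e. `train_cuts = train_cuts.filter(remove_short_and_long_utt)`.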

bsshruthi22 · 2024-12-24T06:43:23Z

@csukuangfj OK, thanks for your suggestion. The training has now resumed. Hopefully it completes without any errors.

bsshruthi22 · 2024-12-29T10:00:49Z

@csukuangfj Is there any way to retain audio longer than 20 s or shorter than 1 s by modifying the cuts so that it does not cause this error?
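One option consistent with the train.py comment quoted earlier in this thread is to choose the upper bound from the dataset's own duration distribution rather than keeping 20.0 as is. The sketch below is only an illustration under stated assumptions: the durations are made up, `duration_percentile` is a hypothetical helper (not an icefall or lhotse function), and the 90th percentile is an arbitrary choice. A larger upper bound also means longer cuts in a batch, so `--max-duration` may need to be lowered to avoid the OOM reported above.

```python
import math
from typing import List


def duration_percentile(durations: List[float], pct: float) -> float:
    """Nearest-rank percentile of a list of durations in seconds."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up durations for illustration only, not taken from the Tamil dataset.
durations = [2.1, 3.4, 5.0, 6.2, 7.7, 9.3, 12.8, 16.1, 19.5, 162.26]

# Pick the upper bound so that roughly 90% of utterances are kept.
upper = duration_percentile(durations, 90)
dropped = [d for d in durations if d > upper]

print("upper bound:", upper)
print("dropped:", dropped)
```

With real data, the durations would come from the training manifest, for example via the statistics script named in the train.py comment.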
