Zipformer2 with CTC is hard to train #1352
The fact that the validation loss is inf right from the start suggests to me that you may have utterances that have more symbols in their transcript than they have frames; is that possible? I don't recall whether we check for that somehow.
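For anyone wanting to run this check themselves, here is a minimal sketch of the idea (not icefall code; `cuts`, the SentencePiece model `sp`, the 10 ms frame shift and the 4x encoder subsampling are assumptions to adjust to your setup):

```python
# Hypothetical helper: list utterances whose transcript has more tokens than
# the encoder will produce output frames for.
def find_too_long_transcripts(cuts, sp, frame_shift=0.01, subsampling=4):
    bad = []
    for cut in cuts:
        text = cut.supervisions[0].text
        num_tokens = len(sp.encode(text, out_type=str))
        num_out_frames = int(cut.duration / frame_shift) // subsampling
        if num_tokens > num_out_frames:
            bad.append((cut.id, num_tokens, num_out_frames))
    return bad
```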
Thank you! I will try to check that. In the meantime I found that, for now, world-size 1 and --max-duration 400 works. If I increase either one, it dies with the errors above. Would that mean that it's probably not the symbols? This is my current status:
If you are not using fp16, make sure the grad-scaling code is deactivated.
EDITED: sorry, I was wrong.
I have the remove_short_and_long_utt function in train.py, but I don't have that warning in the log files.
Look at this also
Thanks @armusc!
It's the (num-symbols / length) ratio that would be the issue for this scenario. It would make sense to check it before computing the CTC loss.
@armusc I tried it, nothing is getting filtered.
I mean, even better, you are not losing any data, I guess.
@armusc
2023-10-30 18:43:56,867 INFO [scaling.py:979] (5/8) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=8.22
This happens right before the problem occurs.
I launched a training of the large model, and I am indeed having the same issue. Before it occurred, the training looked to be converging as expected (i.e. loss values pretty much comparable to completed trainings of smaller models).
The message:
will indicate that there are NaN's appearing in the forward() function. This could be related to fp16 or amp training; I assume you must be using this, because there are messages about grad_scale in the previous message you posted. I suspect the original problem, before this, was that you started getting bad gradients because of grad_scale getting too small; and grad_scale was getting too small probably because you had too-long transcripts for some utterances, and this generated inf's in the loss, eventually generating nan grads by some mechanism.
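As an aside, one generic way to keep an inf loss from ever reaching the GradScaler is to skip the step when the loss is non-finite. This is only a sketch of the mechanism described above, not the icefall train.py logic; `compute_loss`, `model` and `batch` are placeholders:

```python
import torch

def training_step(model, batch, optimizer, scaler, compute_loss):
    loss = compute_loss(model, batch)
    if not torch.isfinite(loss):
        # Skip the step: inf/nan never reach the gradients, so the
        # GradScaler does not keep shrinking grad_scale towards zero.
        optimizer.zero_grad(set_to_none=True)
        return None
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```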
@danpovey this last run where the metric is nan was without fp16. I will try the inf-check=true and the assertion.
I never have inf in the ctc_loss but rather nan; I have inf in the other losses. The above is the first batch where I can see that.
This is the output with inf-check=true in compute_loss:
module.output[2] is not finite: (tensor(23070.8750, device='cuda:7', grad_fn=), tensor(29302.8672, device='cuda:7', grad_fn=), tensor(inf, device='cuda:7', grad_fn=))
I added some more debugging; at some point there is one ctc_loss that is inf. The next log for the batch will then have tot_loss ctc inf. The ctc_loss is inf because of x_covarsq_mean_diag: tensor(nan, device='cuda:0', grad_fn=). When I resume from a checkpoint, I suppose it starts from batch 0 again? It will break after the same number of batches as where it stopped last time. (It doesn't crash immediately at this nan, but keeps going.) I also see the losses slowly go up after resuming from the checkpoint. The checkpoint is from when I stopped it manually after 200,000+ batches that trained properly on the same card, with the same settings. Can you tell me what to look for next?
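A rough way to find which utterance in a bad batch produces the inf is to compute the CTC loss per sample with reduction="none". This sketch assumes `log_probs` shaped (T, N, C) and the usual torch.nn.functional.ctc_loss conventions:

```python
import torch
import torch.nn.functional as F

def per_sample_ctc(log_probs, targets, input_lengths, target_lengths, blank=0):
    # One loss value per utterance instead of a single batch sum/mean.
    losses = F.ctc_loss(
        log_probs, targets, input_lengths, target_lengths,
        blank=blank, reduction="none",
    )
    for i, value in enumerate(losses.tolist()):
        if not torch.isfinite(losses[i]):
            print(f"sample {i}: ctc_loss={value}, "
                  f"T={int(input_lengths[i])}, U={int(target_lengths[i])}")
    return losses
```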
2023-11-01 01:37:03,879 INFO [train.py:1034] Epoch 1, batch 1550, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9476.00 frames. ], tot_loss[loss=nan, simple_loss=i
2023-11-01 01:36:42,503 INFO [train.py:1034] Epoch 1, batch 1500, loss[loss=0.2885, simple_loss=0.3269, pruned_loss=0.09162, ctc_loss=0.1671, over 9095.00 frames. ], tot_loss[loss=0.331
The last print is from
This suggests that there are some utterances where the label sequence is longer than the frame sequence. You earlier mentioned that you have this filter in place but it does not filter anything? You should check whether there is some issue in applying the filter or elsewhere.
This makes me think that you might be able to help the issue by updating icefall. For some time now, this has been a warning, not an error. Over time we have resolved issues that might lead to infinities. It's better if you use the latest zipformer, the one in the zipformer/ directory.
Thank you for the help! @danpovey The icefall checkout is from last week; this is the one that hits (I added some extra comments to make sure it was using the proper file).
This is in the forward_hook; the backward_hook is the one that was changed to a warning in this commit: 6031712
I can't find now where you showed us the ValueError message about the nan or inf. Didn't it have a stack trace printed also?
BTW, we should figure out why exactly the inf's are being generated (e.g. it might be utterances with too many symbols, and that should be checked for), but if you apply #1359 it might prevent it from ruining the model; it changes the gradient-clipping logic in the optimizer to get rid of infinities and nan's in the gradients.
@danpovey this is the output:
Traceback (most recent call last):
I will try the patch and let you know, thank you!!
I applied the patch, I'm still getting this: It didn't crash yet, but I suppose it will eventually. It doesn't crash, but it does not train either. I will let it run for a bit to see if it dies eventually or recovers. (Update: it doesn't crash. I see occasional messages about the symbols exceeding the length, but that is long after the metric becomes nan.)
If it is consistently giving nan's and inf's in the loss, then the model is corrupted. If you do not filter out symbol sequences that are longer than the num-frames, it is expected to generate inf's in the CTC loss. If you have applied the optim.py patch and train.py has "clipping_scale=2.0" passed into ScaledAdam, then I am a little confused why the optimizer is putting the nan's/inf's into the model. Note, it should print something like "Scaling gradients by 0.0... " when it gets inf/nan gradients and should otherwise continue fairly normally.

Also, if you are seeing these problems after resuming from a checkpoint, it is possible that the inf's or nan's are already in the model. If so, you would see inf's/nan's right from the start of the training after resuming. In that case it's a bad checkpoint and you cannot use it. You could perhaps run right from the start with --inf-check=True.

It is strange that you got inf's even in the first validation set that it tested, presumably with a fresh model. That might indicate an issue with data lengths. We need to debug right from the very first point that inf's or nan's appear, with --inf-check=True. If you have a model that's trained for longer and eventually goes bad, the output of the --print-diagnostics=True option would be interesting to see as well, e.g. attach a file. You could do that on a trained model before it goes bad, to see if any values are getting way too large.
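For reference, the behaviour described above ("Scaling gradients by 0.0") amounts to zeroing all gradients whenever any of them is non-finite, so the step becomes a no-op instead of writing nan/inf into the model. A hedged sketch of that idea, not the actual #1359 code:

```python
import torch

def sanitize_gradients(model) -> bool:
    """Return True if the gradients were zeroed because of inf/nan."""
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        print("Scaling gradients by 0.0 due to non-finite values")
        for g in grads:
            g.zero_()
        return True
    return False
```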
What I found so far: but when I add this in model.py I will dig deeper tomorrow (it takes a while to get to the problematic batch, so progress is rather slow).
OK, thanks! Let's persevere with finding the underlying problem. Oh, wait... I see from your "quartiles" message that because over half the grads were nan, the median gradient was nan and the grad-clipping cutoff became nan. This is a case that we didn't consider. (Incidentally, this means that a lot of your data may have bad lengths, which could indicate some kind of problem, or maybe you need to subsample less, if they are speaking super fast.) @zr_jin can you change the code to detect this and exit? E.g.: so that if the middle quartile is nan, it goes back to the original
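The failure mode described here is easy to reproduce: if more than half of the per-parameter gradient norms are nan, any quartile-based clipping cutoff is nan as well. A small illustrative sketch of the kind of guard being requested (not the ScaledAdam code itself):

```python
import torch

def clipping_cutoff(grad_norms: torch.Tensor, fallback: float) -> float:
    median = grad_norms.median()
    if not torch.isfinite(median):
        # Over half the gradient norms are nan/inf, so the quartile-based
        # cutoff is unusable; fall back to the original value (or exit).
        return fallback
    return float(median)
```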
Thanks! I will certainly give it a try by Monday. In the meantime, some more observations:
Some batches have inf ctc loss, but when I calculate the ctc loss for the individual samples, sometimes none of them have inf ctc loss. In that case, if I hardcode the ctc_loss to be something fixed (like 500), the metric does not end up with nan and I can train successfully. (I ran it for 8 hours on 8 GPUs with a large max-duration and it survived and trained.) @dan the --inf-check is giving me some deprecation (iirc) warnings (I don't have the logs with me while writing this text); they are flooding the terminal, do you happen to know how I can disable them?
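Note that, instead of hardcoding the loss, PyTorch's CTC loss has a zero_infinity flag that replaces infinite losses (and their gradients) with zero; whether that is appropriate here is a judgment call, since it silently hides the too-long utterances. A minimal sketch:

```python
import torch.nn.functional as F

def safe_ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0):
    # zero_infinity=True replaces infinite per-sample losses (and their
    # gradients) with zero, so one impossible alignment cannot poison the
    # whole batch.
    return F.ctc_loss(
        log_probs, targets, input_lengths, target_lengths,
        blank=blank, reduction="sum", zero_infinity=True,
    )
```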
Is it a streaming model?
I'm using the train.py from the zipformer2 in the librispeech egs, with the large values from the librispeech zipformer2 results document, but with CTC and transducer instead of only transducer. The only changes are related to the different dataset (I use a modified gigaspeech dataloader) and some debugging code related to this issue. It's the offline model, no streaming.
... also, let us know what warning message the inf-check is giving.
It seems to work, thank you! No more nan. I still check that the batch loss is not infinite, and if it is, I remove any samples where the ctc loss is inf (because of the symbol length versus frames). How could I filter those samples earlier? The dataloader check doesn't seem to work for me.
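One hedged way to drop such samples before they ever reach the model is to filter the CutSet itself. This assumes SentencePiece tokens, 10 ms frames and 4x encoder subsampling; `sp` and `train_cuts` are placeholders to adapt to your recipe:

```python
def keep_cut(cut) -> bool:
    # Keep only cuts whose token sequence fits in the encoder output frames.
    tokens = sp.encode(cut.supervisions[0].text, out_type=str)
    num_out_frames = int(cut.duration / 0.01) // 4
    return len(tokens) < num_out_frames

train_cuts = train_cuts.filter(keep_cut)
```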
Hello, the inf appears after batch 4000:
It is CTC-only training, fp32, base_lr is an item from (0.04, 0.004, 0.0004). The default data filtering rule is: Could that be the source? I.e. the edge case of
Best regards,
The batch was saved to disk so you may be able to inspect it to see if there was anything odd there.
Hi Dan,
For some reason, for CTC there has to be no more than
Does it also make sense theoretically?
K.
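For what it's worth, my reading of the constraint being discussed (a general property of CTC, not a quote from icefall): a valid alignment exists only when the number of frames is at least the number of labels plus the number of adjacent repeated labels, since repeats must be separated by a blank. A tiny sketch:

```python
from typing import List

def ctc_feasible(num_frames: int, labels: List[int]) -> bool:
    # Adjacent identical labels need an extra blank frame between them.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return num_frames >= len(labels) + repeats
```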
@KarelVesely84 I have vague memories that I managed to make it work by ignoring batches that lead to an inf loss.
@joazoa
Great! (Sent you an invite on LinkedIn, btw :)
Created a PR for that: #1713
I am playing a bit with the CTC option in the zipformer 2, with the largest model from the documentation.
It trained well for a first dataset but when I try another dataset, the training stops.
I have tried reducing the LR, increasing the warmup period, disabling FP16, changing the max duration, removing any augmentations in lhotse, reducing the maximum file duration, removing specaugment and musan, and changing the world size.
The same dataset works fine with the zipformer2 transducer, but does not work for zipformer2 with only CTC.
Works fine for zipformer-ctc as well.
Do you have any suggestions on what I could try next ?
2023-10-29 23:40:46,833 INFO [train.py:1034] (2/8) Epoch 1, batch 0, loss[loss=8.228, simple_loss=7.432, pruned_loss=7.424, ctc_loss=5.236, over 20568.00 frames. ], tot_loss[loss=8.228, simple_loss=7.432, pruned_loss=7.424, ctc_loss=5.236, over 20568.00 frames. ], batch size: 95, lr: 2.25e-02, grad_scale: 1.0
2023-10-29 23:40:46,833 INFO [train.py:1057] (2/8) Computing validation loss
2023-10-29 23:40:54,294 INFO [train.py:1066] (2/8) Epoch 1, validation: loss=inf, simple_loss=7.453, pruned_loss=7.416, ctc_loss=inf, over 901281.00 frames.
2023-10-29 23:40:54,295 INFO [train.py:1067] (2/8) Maximum memory allocated so far is 22148MB
2023-10-29 23:40:59,516 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.26 vs. limit=5.0
2023-10-29 23:41:11,380 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=0.0, ans=0.9
2023-10-29 23:41:16,030 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=106.66666666666667, ans=0.2016
2023-10-29 23:41:26,737 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106.66666666666667, ans=0.29893333333333333
2023-10-29 23:41:40,934 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=7.54
2023-10-29 23:41:50,438 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=315.95 vs. limit=7.58
2023-10-29 23:42:01,307 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=768, metric=17.16 vs. limit=4.085333333333334
2023-10-29 23:42:09,032 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=37.41 vs. limit=7.62
2023-10-29 23:42:19,442 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=165.76 vs. limit=5.08
2023-10-29 23:42:29,662 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.92 vs. limit=7.62
2023-10-29 23:42:40,221 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=433.28 vs. limit=7.66
2023-10-29 23:42:45,693 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=768, metric=595.94 vs. limit=7.82
2023-10-29 23:43:02,777 INFO [train.py:1034] (2/8) Epoch 1, batch 50, loss[loss=3.141, simple_loss=2.879, pruned_loss=2.03, ctc_loss=4.85, over 19767.00 frames. ], tot_loss[loss=inf, simple_loss=4.825, pruned_loss=4.687, ctc_loss=inf, over 918170.33 frames. ], batch size: 274, lr: 2.48e-02, grad_scale: 4.76837158203125e-07
2023-10-29 23:43:04,688 INFO [scaling.py:979] (2/8) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=191.49 vs. limit=4.1066666666666665
2023-10-29 23:43:10,266 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=48.81 vs. limit=7.7
2023-10-29 23:43:13,223 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.29 vs. limit=5.133333333333334
2023-10-29 23:43:23,039 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.2.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.32 vs. limit=7.9
2023-10-29 23:43:41,183 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=4.256
2023-10-29 23:43:41,680 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=768, metric=90.25 vs. limit=7.98
2023-10-29 23:43:43,362 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=640.0, ans=0.2436
2023-10-29 23:43:46,978 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=768, metric=236.17 vs. limit=7.74
2023-10-29 23:43:53,663 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=22.56 vs. limit=5.32
2023-10-29 23:44:09,088 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.2.encoder.layers.3.whiten.whitening_limit, batch_count=746.6666666666666, ans=4.298666666666667
2023-10-29 23:44:13,685 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=355.68 vs. limit=8.06
2023-10-29 23:44:21,592 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=175.08 vs. limit=7.82
2023-10-29 23:44:30,091 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.64 vs. limit=7.82
2023-10-29 23:44:30,502 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=768, metric=124.32 vs. limit=8.14
2023-10-29 23:44:44,328 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=141.67 vs. limit=7.82
2023-10-29 23:44:49,658 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=576, metric=48.55 vs. limit=5.24
2023-10-29 23:44:51,541 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.75 vs. limit=5.24
2023-10-29 23:44:51,565 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=39.07 vs. limit=7.86
2023-10-29 23:44:54,528 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=80.86 vs. limit=7.86
2023-10-29 23:45:02,881 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=4.384
2023-10-29 23:45:06,051 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=768, metric=15.26 vs. limit=4.384
2023-10-29 23:45:17,726 INFO [checkpoint.py:75] (2/8) Saving checkpoint to zipformer/exp-large-ctc-transducer/bad-model-first-warning-2.pt