
Conversation

@danpovey (Contributor)

No description provided.

@danpovey (Contributor, Author)

THIS IS IMPORTANT!

Everyone (especially @GaofengCheng, as I'm hoping you can look into this experimentally):
I haven't really had a chance to look into this or tune it properly, but I am letting you know early because I think it will have a big impact on our ASR numbers across all our setups.

When I was messing around with the image classification setup I realized that l2 regularization was important. I was never a fan of l2, as it's meaningless as an actual regularization method (it doesn't make sense because the models, both CNNs and TDNNs, are invariant to the scale of the parameters)... but it seems to help for a different reason: it keeps the network learning for longer, by affecting the size of the parameter matrices (making them smaller so they learn faster) and by shrinking the magnitude of the output layer so we continue to get derivatives to propagate back. I implemented l2 as a form of shrinkage/parameter scaling, done every training iteration, since nnet3 doesn't support l2 regularization in the training code (it's tricky to integrate with natural gradient). This is the --proportional-shrink option: it sets the parameter scale to 1.0 - current-learning-rate * proportional-shrink.
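
To make this concrete, here is a minimal sketch of the shrinkage in Python (illustrative only, not the actual nnet3/chain training code; the function and variable names are made up):

import numpy as np

def shrink_scale(current_learning_rate, proportional_shrink):
    # E.g. a learning rate of 0.001 with proportional-shrink=120
    # gives a scale of 1.0 - 0.001 * 120 = 0.88.
    return 1.0 - current_learning_rate * proportional_shrink

def apply_shrinkage(param_matrices, current_learning_rate, proportional_shrink):
    # After each training iteration, every parameter matrix is
    # multiplied by the same scale, pulling the weights toward zero.
    scale = shrink_scale(current_learning_rate, proportional_shrink)
    for m in param_matrices:
        m *= scale

# Hypothetical usage on two weight matrices:
params = [np.random.randn(512, 512), np.random.randn(512, 40)]
apply_shrinkage(params, current_learning_rate=0.001, proportional_shrink=120)

Because the scale depends on the current learning rate, the shrinkage automatically weakens as the learning rate decays over training.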

Anyway, using this option we can run more epochs without excessive overtraining. As you can see, I did this on the mini-librispeech setup and got huge improvements: 2.5% and 2% absolute for the two language models we decode with (more than 10% relative). That setup was not well tuned, but it's still a huge improvement.

I suspect that this shrinkage/l2 regularization may work particularly well with TDNNs because, like CNNs, they are scale-invariant; but who knows, it might work equally well with LSTMs. A couple of things to bear in mind:

  • I have not added this option to train_dnn.py or train_rnn.py. Anyone who needs those scripts to support it can add it; the changes will be similar to the ones I am making to chain/train.py.
  • When tuning both the CNN setups and this setup, I found that the results can be a little sensitive to the precise value of --proportional-shrink. If you keep doubling it, it will keep getting better and then suddenly get a lot worse, so tuning it more finely than to within a factor of 2 may be a good idea.
  • In general we expect that setups with more data will want a smaller proportional-shrink value. So 120 may be on the high side: you may get the best results with 50, or maybe 25 for large data sets... but this is all guesswork; you will have to experiment.
  • When you use this option, more epochs may be usable, but we don't want to go overboard. I used 10 epochs on mini-librispeech because that setup is so tiny in the first place.

danpovey merged commit 32dc7fe into kaldi-asr:kaldi_52 on May 17, 2017

@naxingyu (Contributor) commented May 17, 2017 via email.

@GaofengCheng (Contributor)

@danpovey Amazing results! Will it be OK for me to test this on AMI IHM with TDNNs, CNNs, BLSTMs and CNN-TDNN-LSTMs first? Then, on SWBD, I will combine this method with my existing CNN-TDNN-LSTM experiments, as a new kind of regularization technique together with dropout, backstitch, etc.

@danpovey (Contributor, Author) commented May 17, 2017 via email.

@danpovey (Contributor, Author)

I manually pushed a change that modifies the proportional-shrink value from 120 to 150, as it improved the results further. Compared against the original baseline without proportional-shrink:

# local/chain/compare_wer.sh --online exp/chain/tdnn1a_sp exp/chain/tdnn1b_sp
# System                      tdnn1a_sp  tdnn1b_sp
# WER dev_clean_2 (tgsmall)       18.11      15.24
#              [online:]          18.12      15.15
# WER dev_clean_2 (tglarge)       13.20      11.20
#              [online:]          13.18      11.13
# Final train prob              -0.0602    -0.0642
# Final valid prob              -0.1038    -0.1014
# Final train prob (xent)       -1.4997    -1.3679
# Final valid prob (xent)       -1.7786    -1.6482

@mikenewman1

When I try 120 or even 60 (on my particular training set), I get an error message:
Exception: proportional-shrink=60.0 is too large, it gives shrink-value=0.04
I have to crank it down to 30 to get it to run.
Is this to be expected?
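
Working backwards from that error, a shrink-value of 0.04 with proportional-shrink=60 implies a learning rate of 0.016 at the point where the check runs (1.0 - 0.016 * 60 = 0.04). Here is a sketch of the kind of check that would produce this error, assuming the formula described above (the 0.2 cutoff is illustrative, not necessarily the script's actual threshold):

def check_proportional_shrink(learning_rate, proportional_shrink, min_shrink=0.2):
    # Reject shrink values so small that the parameters would be
    # scaled nearly to zero on a single iteration.
    shrink = 1.0 - learning_rate * proportional_shrink
    if shrink < min_shrink:
        raise Exception(
            "proportional-shrink={0} is too large, it gives "
            "shrink-value={1}".format(proportional_shrink, shrink))
    return shrink

check_proportional_shrink(0.016, 30)    # passes: shrink-value = 0.52
# check_proportional_shrink(0.016, 60)  # would raise: shrink-value = 0.04

So with a high learning rate, even moderate proportional-shrink values push the shrink-value close to zero.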

@danpovey (Contributor, Author) commented May 19, 2017 via email.

@jtrmal (Contributor) commented May 19, 2017

@GaofengCheng
I'd be interested in seeing how this interacts with dropout and backstitch, so please let me know whenever you have any results (if you don't mind).

@GaofengCheng (Contributor)

@jtrmal I'm testing it with dropout and backstitch separately on AMI SDM... it needs about one more day to finish.

@GaofengCheng (Contributor)

@danpovey
AMI SDM
(the baseline tdnn1e system was trained for 4 epochs, the shrink40 system for 6 epochs)

System                          dev WER  eval WER  train prob  valid prob  train prob (xent)  valid prob (xent)
tdnn1e_sp_bi_ihmali                39.2      42.8   -0.235518   -0.275605           -2.75633           -2.88854
tdnn1e_shrink40_sp_bi_ihmali       38.2      42.3   -0.227921   -0.271848           -2.69137           -2.84269

@danpovey (Contributor, Author) commented May 23, 2017 via email.

@GaofengCheng (Contributor)

@danpovey

System                                  dev WER  eval WER  train prob  valid prob  train prob (xent)  valid prob (xent)
tdnn1e_sp_bi_ihmali                        39.2      42.8   -0.235518   -0.275605           -2.75633           -2.88854
tdnn1e_shrink40_sp_bi_ihmali               38.2      42.3   -0.227921   -0.271848           -2.69137           -2.84269
tdnn1e_shrink40_4epochs_sp_bi_ihmali       39.0      42.8   -0.234807   -0.274932           -2.75303           -2.88553

@GaofengCheng (Contributor) commented May 24, 2017

@danpovey OK, I'll stay with 6 epochs and tune the proportion; I'm afraid that for TDNN-LSTMs we will need a much smaller value.

@GaofengCheng (Contributor)

Updated results:

System                                    dev WER  eval WER  train prob  valid prob  train prob (xent)  valid prob (xent)
tdnn1e_sp_bi_ihmali                          39.2      42.8   -0.235518   -0.275605           -2.75633           -2.88854
tdnn1e_shrink40_4epochs_sp_bi_ihmali         39.0      42.8   -0.234807   -0.274932           -2.75303           -2.88553
tdnn1e_shrink40_sp_bi_ihmali (6 epochs)      38.2      42.3   -0.227921   -0.271848           -2.69137           -2.84269
tdnn1e_shrink60_6epochs_sp_bi_ihmali         39.6      43.6   -0.245898   -0.283888           -2.88203           -3.00107
tdnn1e_shrink80_6epochs_sp_bi_ihmali         41.2      45.3   -0.263461   -0.294465           -3.07493           -3.17342

@danpovey (Contributor, Author) commented May 25, 2017 via email.

@GaofengCheng (Contributor)

OK, I'm testing 30 and reducing the number of epochs; total training time is another factor that we should take into consideration...

@GaofengCheng (Contributor)

System                                    dev WER  eval WER  train prob  valid prob  train prob (xent)  valid prob (xent)
tdnn1e_sp_bi_ihmali                          39.2      42.8   -0.235518   -0.275605           -2.75633           -2.88854
tdnn1e_shrink40_4epochs_sp_bi_ihmali         39.0      42.8   -0.234807   -0.274932           -2.75303           -2.88553
tdnn1e_shrink20_6epochs_sp_bi_ihmali         37.0      41.0   -0.2008     -0.258208           -2.45822           -2.65624
tdnn1e_shrink30_6epochs_sp_bi_ihmali         37.8      41.7   -0.215646   -0.264393           -2.57581           -2.74177
tdnn1e_shrink40_sp_bi_ihmali (6 epochs)      38.2      42.3   -0.227921   -0.271848           -2.69137           -2.84269
tdnn1e_shrink60_6epochs_sp_bi_ihmali         39.6      43.6   -0.245898   -0.283888           -2.88203           -3.00107
tdnn1e_shrink80_6epochs_sp_bi_ihmali         41.2      45.3   -0.263461   -0.294465           -3.07493           -3.17342

@danpovey

@danpovey (Contributor, Author) commented May 25, 2017 via email.

@GaofengCheng (Contributor)

@danpovey Ok

@danpovey (Contributor, Author) commented May 27, 2017 via email.

@jtrmal (Contributor) commented May 27, 2017 via email.

@GaofengCheng (Contributor)

@danpovey @jtrmal I'll re-edit it and share it through a web link (a Google spreadsheet).

@adrianbg (Contributor)

Is there any information available on how this approximates parameter-level l2 regularization?

@danpovey (Contributor, Author) commented Aug 25, 2017 via email.

@adrianbg (Contributor)

With gradient descent it'd be balanced against the rest of the loss function. Is there some aspect of this implementation performing that function?

@danpovey (Contributor, Author)

It's an additive term in the loss function. I think you're not thinking it through carefully. I don't have time to discuss further.
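
For reference, the textbook correspondence between per-step parameter scaling and l2 regularization (a standard derivation, not taken from this thread): one step of gradient descent with learning rate $\eta$ on the regularized objective $L(\theta) + \frac{\lambda}{2}\|\theta\|^2$ gives

$$\theta \leftarrow \theta - \eta\,(\nabla L(\theta) + \lambda\theta) = (1 - \eta\lambda)\,\theta - \eta\,\nabla L(\theta),$$

i.e. the l2 term acts as a multiplicative shrink of $(1 - \eta\lambda)$ applied on top of the unregularized update. Identifying $\lambda$ with proportional-shrink recovers the shrink-value $1.0 - \text{current-learning-rate} \times \text{proportional-shrink}$, and also shows why the effective regularization weakens as the learning rate decays.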
