[egs,scripts] Add, and use the --proportional-shrink option (approximates l2 regularization). #1627
Conversation
|
THIS IS IMPORTANT!

Everyone (especially @GaofengCheng, as I'm hoping you can look into this experimentally): I haven't really had a chance to look into this or tune it properly, but I am letting you know early because I think it will have a big impact on our ASR numbers on all our setups.

When I was messing around with the image classification setup I realized that l2 regularization was important. I was never a fan of l2, since it's meaningless as an actual regularization method (it doesn't make sense, as the models, both CNNs and TDNNs, are invariant to the scale of the parameters)... but it seems to help for a different reason: it keeps the network learning for longer, by affecting the size of the parameter matrices (making them smaller so they learn faster) and by shrinking the magnitude of the output layer so we continue to get derivatives to propagate back. I implemented l2 as a form of shrinkage/parameter scaling, done every training iteration, since nnet3 doesn't support l2 regularization in the training code (it's tricky to integrate with natural gradient). This is the --proportional-shrink option; it sets the parameter scale to 1.0 - current-learning-rate * proportional-shrink.

Anyway, using this option we can use more epochs without getting excessive overtraining. As you can see, I did this on the mini-librispeech setup and got huge improvements, of 2.5% and 2% absolute for the different language models we decode with (more than 10% relative). That setup was not well tuned, but it's still a huge improvement.

I suspect that this shrinkage/l2 regularization may work particularly well with TDNNs because, like CNNs, they are scale invariant, but who knows; it might work equally well with LSTMs. A couple of things to bear in mind:

- I have not added this option to train_dnn.py or train_rnn.py. Anyone who needs those scripts to support it can add it; the changes will be similar to the changes I am making to chain/train.py.
- When tuning both the CNN setups and this setup, I found that the results can be a little sensitive to the precise value of --proportional-shrink. If you keep doubling it, it will keep getting better and then suddenly get a lot worse, so tuning more precisely than to within a factor of 2 may be a good idea.
- In general we expect that setups with more data will want a smaller proportional-shrink value. So 120 may be on the high side; you may get the best results with 50, or maybe 25 for large data... but all this is guessing, and you will have to experiment with it.
- When you use this option, more epochs may be usable, but we don't want to go overboard. I used 10 on mini-librispeech because it's so tiny in the first place.
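As a rough illustration of the scaling rule described above (this is only a sketch, not the code from this PR; the function and variable names are made up for the example):

```python
import numpy as np

def apply_proportional_shrink(params, learning_rate, proportional_shrink):
    """Scale every parameter matrix by 1.0 - learning_rate * proportional_shrink.

    This mimics the periodic shrinkage described above, which approximates
    l2 regularization without touching the gradient computation.
    """
    scale = 1.0 - learning_rate * proportional_shrink
    # The scale must stay well above zero, or the parameters would be wiped out.
    assert scale > 0.0, "proportional-shrink is too large for this learning rate"
    return [scale * p for p in params]

# Illustrative numbers only: with a per-iteration learning rate of 0.001 and
# proportional-shrink=120, every parameter is scaled by 1.0 - 0.12 = 0.88.
params = [np.random.randn(512, 512), np.random.randn(512)]
params = apply_proportional_shrink(params, learning_rate=0.001, proportional_shrink=120)
```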
|
|
Great news!
|
|
@danpovey Amazing results! Would it be OK for me to test this first on AMI IHM with TDNN, CNN, BLSTM and CNN-TDNN-LSTM models? Then, on SWBD, I will combine this method with my existing CNN-TDNN-LSTM experiments, as a new kind of regularization technique alongside dropout, backstitch, etc.
|
sure...
|
|
I manually pushed a change to modify the proportional-shrink value from 120 to 150, as it improved the results further. Against the original baseline without proportional-shrink: |
|
When I try 120 or even 60 (on my particular training set), I get an error message:

Exception: proportional-shrink=60.0 is too large, it gives shrink-value=0.04

I have to crank it down to 30 to get it to run. Is this to be expected?
|
Yes, that's expected; it depends on the learning rate and the num-jobs you use.
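To make that dependence concrete, here is a small worked example (the values, the num-jobs factor, and the acceptance threshold are assumptions made for illustration, not taken from the training script):

```python
def shrink_value(effective_lr, num_jobs, proportional_shrink):
    # Assume the per-iteration learning rate is roughly effective_lr * num_jobs,
    # as the reply above suggests; the shrink factor then follows the formula
    # 1.0 - iteration_learning_rate * proportional_shrink.
    iteration_lr = effective_lr * num_jobs
    return 1.0 - iteration_lr * proportional_shrink

# With an iteration-level learning rate of 0.016 (e.g. effective_lr=0.004 and
# 4 jobs), proportional-shrink=60 gives the 0.04 from the error message above,
# while 30 gives a much more reasonable factor.
print(shrink_value(0.004, 4, 60))   # ~0.04  (rejected as too small)
print(shrink_value(0.004, 4, 30))   # ~0.52
```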
|
|
@GaofengCheng |
|
@jtrmal I'm testing it with dropout and backstitch separately on AMI SDM ... it needs about one more day to finish |
|
@danpovey AMI SDM results (the baseline was trained for 4 epochs, the shrink40 system for 6 epochs):

| System | WER on dev | WER on eval | Final train prob | Final valid prob | Final train prob (xent) | Final valid prob (xent) |
| --- | --- | --- | --- | --- | --- | --- |
| tdnn1e_sp_bi_ihmali | 39.2 | 42.8 | -0.235518 | -0.275605 | -2.75633 | -2.88854 |
| tdnn1e_shrink40_sp_bi_ihmali | 38.2 | 42.3 | -0.227921 | -0.271848 | -2.69137 | -2.84269 |
|
Cool. Tune the --proportional-shrink option; it's quite sensitive to the exact value. You can keep doubling it until it gets worse, then tune more precisely about the optimum (e.g. try 40, then 80, then if it gets worse, try 60 or 30).
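That tuning procedure is simple enough to write down as a sketch (the train_and_score function here is a hypothetical placeholder for training and decoding with a given --proportional-shrink value and returning dev WER; it is not an existing Kaldi helper):

```python
def tune_proportional_shrink(train_and_score, start=10.0):
    """Doubling search followed by a refinement step, as suggested above.

    train_and_score(value) should run the experiment with
    --proportional-shrink=value and return a dev-set WER (lower is better).
    """
    best_value = start
    best_wer = train_and_score(best_value)
    value = start
    # Keep doubling until the result gets worse.
    while True:
        value *= 2
        wer = train_and_score(value)
        if wer >= best_wer:
            break
        best_value, best_wer = value, wer
    # Refine around the optimum, e.g. best of 40 -> also try 60 and 30.
    for candidate in (best_value * 1.5, best_value * 0.75):
        wer = train_and_score(candidate)
        if wer < best_wer:
            best_value, best_wer = candidate, wer
    return best_value, best_wer
```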
|
|
|
@danpovey OK, I'll stay with 6 epochs and tune the proportional-shrink value; I'm afraid that for TDNN-LSTMs we will need a much smaller value.
|
Updated results:

| System | WER on dev | WER on eval | Final train prob | Final valid prob | Final train prob (xent) | Final valid prob (xent) |
| --- | --- | --- | --- | --- | --- | --- |
| tdnn1e_sp_bi_ihmali | 39.2 | 42.8 | -0.235518 | -0.275605 | -2.75633 | -2.88854 |
| tdnn1e_shrink40_4epochs_sp_bi_ihmali | 39.0 | 42.8 | -0.234807 | -0.274932 | -2.75303 | -2.88553 |
| tdnn1e_shrink40_sp_bi_ihmali (6 epochs) | 38.2 | 42.3 | -0.227921 | -0.271848 | -2.69137 | -2.84269 |
| tdnn1e_shrink60_6epochs_sp_bi_ihmali | 39.6 | 43.6 | -0.245898 | -0.283888 | -2.88203 | -3.00107 |
| tdnn1e_shrink80_6epochs_sp_bi_ihmali | 41.2 | 45.3 | -0.263461 | -0.294465 | -3.07493 | -3.17342 |
|
OK, it's getting rapidly worse when you increase shrink above 40, so try 30 (and 20 if you haven't already).
|
|
OK. I'm testing 30 and reducing the epochs; total training time is another factor that we should take into consideration...
|
| System | WER on dev | WER on eval | Final train prob | Final valid prob | Final train prob (xent) | Final valid prob (xent) |
| --- | --- | --- | --- | --- | --- | --- |
| tdnn1e_sp_bi_ihmali | 39.2 | 42.8 | -0.235518 | -0.275605 | -2.75633 | -2.88854 |
| tdnn1e_shrink40_4epochs_sp_bi_ihmali | 39.0 | 42.8 | -0.234807 | -0.274932 | -2.75303 | -2.88553 |
| tdnn1e_shrink20_6epochs_sp_bi_ihmali | 37.0 | 41.0 | -0.2008 | -0.258208 | -2.45822 | -2.65624 |
| tdnn1e_shrink30_6epochs_sp_bi_ihmali | 37.8 | 41.7 | -0.215646 | -0.264393 | -2.57581 | -2.74177 |
| tdnn1e_shrink40_sp_bi_ihmali (6 epochs) | 38.2 | 42.3 | -0.227921 | -0.271848 | -2.69137 | -2.84269 |
| tdnn1e_shrink60_6epochs_sp_bi_ihmali | 39.6 | 43.6 | -0.245898 | -0.283888 | -2.88203 | -3.00107 |
| tdnn1e_shrink80_6epochs_sp_bi_ihmali | 41.2 | 45.3 | -0.263461 | -0.294465 | -3.07493 | -3.17342 |

@danpovey
|
Wow, great improvement! Since 20 is the best, you could try reducing it still further, e.g. 15 or 10.
|
|
@danpovey Ok |
| System | WER on dev | WER on eval | Final train prob | Final valid prob | Final train prob (xent) | Final valid prob (xent) |
| --- | --- | --- | --- | --- | --- | --- |
| tdnn1e_sp_bi_ihmali | 39.2 | 42.8 | -0.235518 | -0.275605 | -2.75633 | -2.88854 |
| tdnn1e_shrink40_4epochs_sp_bi_ihmali | 39.0 | 42.8 | -0.234807 | -0.274932 | -2.75303 | -2.88553 |
| tdnn1e_shrink10_4epochs_sp_bi_ihmali | 37.5 | 41.3 | -0.195525 | -0.258708 | -2.42821 | -2.63458 |
| tdnn1e_shrink10_6epochs_sp_bi_ihmali | 36.9 | 41.0 | -0.182205 | -0.25545 | -2.32757 | -2.57967 |
| tdnn1e_shrink20_6epochs_sp_bi_ihmali | 37.0 | 41.0 | -0.2008 | -0.258208 | -2.45822 | -2.65624 |
| tdnn1e_shrink30_6epochs_sp_bi_ihmali | 37.8 | 41.7 | -0.215646 | -0.264393 | -2.57581 | -2.74177 |
| tdnn1e_shrink40_sp_bi_ihmali (6 epochs) | 38.2 | 42.3 | -0.227921 | -0.271848 | -2.69137 | -2.84269 |
| tdnn1e_shrink60_6epochs_sp_bi_ihmali | 39.6 | 43.6 | -0.245898 | -0.283888 | -2.88203 | -3.00107 |
| tdnn1e_shrink80_6epochs_sp_bi_ihmali | 41.2 | 45.3 | -0.263461 | -0.294465 | -3.07493 | -3.17342 |

| System | WER on dev | WER on eval | Final train prob | Final valid prob | Final train prob (xent) | Final valid prob (xent) |
| --- | --- | --- | --- | --- | --- | --- |
| tdnn_lstm1i_tdnn_lstm_shrink30_6epochs_sp_bi_ihmali_ld5 | 37.4 | 40.1 | -0.197789 | -0.257715 | -2.0103 | -2.27506 |

The baseline for the 4-epoch tdnn-lstm-1i is 37.4 / 40.8.
|
That table is getting very hard to read -- change how it's displayed to make sure there is space printed between the directory names. You might want to try smaller shrink values on the tdnn_lstm1i experiment, like 20 or 10.
|
|
I think even a Google Docs table might be suitable.
y.
|
|
Is there any information available on how this approximates parameter-level l2 regularization? |
|
Not really, but it's pretty obvious. If you were doing l2 with gradient descent you'd be scaling down the parameters slightly on each minibatch. In this method it's like doing that, but doing it periodically (and by a larger amount each time) instead of on every single minibatch.
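A quick numerical check of that claim (the numbers are purely illustrative; eta is the learning rate and lam stands in for whatever l2 coefficient the shrinkage corresponds to):

```python
# l2/weight decay applied with gradient descent multiplies the parameters by
# (1 - eta*lam) on each of N minibatches; periodic shrinkage applies a single
# factor of (1 - N*eta*lam) per iteration. To first order these are the same.
eta, lam, num_minibatches = 1e-4, 10.0, 100   # illustrative values only

per_minibatch_factor = (1.0 - eta * lam) ** num_minibatches
periodic_factor = 1.0 - num_minibatches * eta * lam

print(per_minibatch_factor)   # ~0.9048
print(periodic_factor)        # ~0.9000
```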
|
|
With gradient descent it'd be balanced against the rest of the loss function. Is there some aspect of this implementation performing that function? |
|
It's an additive term in the loss function. I think you're not thinking it through carefully. I don't have time to discuss further. |
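For reference, the standard derivation behind that answer (textbook SGD with an l2 penalty; nothing here is specific to this PR): with objective $E(\theta) + \tfrac{\lambda}{2}\lVert\theta\rVert^2$, a single gradient step is

$$
\theta \leftarrow \theta - \eta\left(\nabla E(\theta) + \lambda\,\theta\right) = (1 - \eta\lambda)\,\theta - \eta\,\nabla E(\theta),
$$

so the shrinkage factor $(1 - \eta\lambda)$ and the data-loss gradient appear in the same update, scaled by the same learning rate; the balancing against the rest of the loss function is implicit in the relative sizes of $\eta\lambda$ and $\nabla E(\theta)$.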