Max change per component #918
Conversation
OK will do.

Starting the review.
    %WER 16.0 | 4459 42989 | 85.6 9.9 4.5 1.6 16.0 52.7 | exp/nnet3/lstm_bidirectional_sp/decode_eval2000_sw1_tg/score_10_0.0/eval2000_hires.ctm.filt.sys
    %WER 19.6 | 2628 21594 | 82.5 12.1 5.5 2.1 19.6 54.8 | exp/nnet3/lstm_bidirectional_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.callhm.filt.sys
    %WER 20.7 | 2628 21594 | 81.4 12.9 5.7 2.2 20.7 56.8 | exp/nnet3/lstm_bidirectional_sp/decode_eval2000_sw1_tg/score_10_0.0/eval2000_hires.ctm.callhm.filt.sys
    # bidirectional LSTM with the same configuration as the above experiment, with self-repair of all nonlinearities and clipgradient, and max-change-per-component activated
Please preserve these results somewhere, maybe in the actual script next to the max-change option or at the bottom of the results file.
Force-pushed from c13e553 to 675d726, then from 675d726 to e417fce.
src/nnet3/nnet-training.cc (outdated):

              << " change too big: " << std::sqrt(dot_prod) << " > "
              << "--max-change=" << max_param_change_per_comp
              << ", scaling by " << scale_factors(i);
      } else
The Google style guide mandates braces on the else if the 'if' had braces.
@freewym, what is the status of this commit? Are we still waiting on experiments?

Yes. Once the babel jobs are complete, I will resume the experiments.
@freewym Is this ready for review?

Yes, but I am still running experiments on different datasets trying to …
I have tested max-change-per-component with the current thresholds (1.5 for the final layer, 0.75 for the others) on swbd, AMI IHM, tedlium, and Babel Georgian using a BLSTM xent model. The WERs of all of these experiments improve over the global-max-change baseline or remain the same, although the log-likelihoods get a little worse. Specifically:

    on swbd: %WER 14.9 | 4459 42989 | 86.7 9.1 4.3 1.6 14.9 50.6 | exp/nnet3/lstm_bidirectional_maxchange_test_nomax_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.filt.sys
    on AMI IHM: %WER 22.6 | 13098 94486 | 80.6 11.8 7.6 3.2 22.6 55.8 | -0.592 | exp/ihm/nnet3_cleaned/lstm_bidirectional_maxchange_nomax_sp/decode_dev/ascore_10/dev_hires.ctm.filt.sys
    on tedlium: %WER 11.5 | 507 17783 | 90.2 7.1 2.8 1.7 11.5 79.9 | -0.317 | exp/nnet3_cleaned/lstm_bidirectional_maxchange_nomax_sp/decode_dev/score_10_0.0/ctm.filt.filt.sys
    on Babel Georgian: %WER 46.2 | 19252 60586 | 57.9 32.4 9.7 4.2 46.2 31.6 | -1.241 | exp/nnet3/lstm_bidirectional_sp/decode_dev10h.pem/score_12/penalty_0.5/dev10h.pem.ctm.sys

Once the changes to nnet-training.cc are reviewed and confirmed, I will apply similar changes to nnet-chain-training.cc.
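For reference, the mechanism being tested can be sketched as follows: each updatable component's parameter delta is scaled down if its 2-norm exceeds that component's limit, and a global limit is then applied to the overall change. This is a simplified Python sketch under the thresholds quoted above, not the actual Kaldi C++ implementation (which operates on nnet3 components and GPU matrices); the function name is hypothetical:

```python
import math

def apply_max_changes(deltas, per_comp_limits, global_limit):
    """deltas: {component_name: list of proposed parameter changes}.
    First scales each component's delta so its 2-norm does not exceed
    that component's limit, then applies one further global scale so the
    overall change does not exceed global_limit. A limit of 0 means
    'disabled'. Returns the scaled deltas."""
    scales = {}
    for name, d in deltas.items():
        norm = math.sqrt(sum(x * x for x in d))
        limit = per_comp_limits.get(name, 0.0)
        scales[name] = limit / norm if 0.0 < limit < norm else 1.0
    # Overall parameter change after per-component scaling:
    total = math.sqrt(sum(
        sum((scales[n] * x) ** 2 for x in d) for n, d in deltas.items()))
    g = global_limit / total if 0.0 < global_limit < total else 1.0
    return {n: [g * scales[n] * x for x in d] for n, d in deltas.items()}
```

For example, a "final" component whose delta has norm 5.0 against a limit of 1.5 would be rescaled by 0.3 before the global limit is considered.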
Do you have an idea how this affects TDNN or any other feed-forward network?

I only did a TDNN comparison on swbd; max-change-per-component gets a 0.1-0.2 improvement.
Could you check what changes when moving from global-max-change … --Vijay

Usually BLstm{1|2|3}_{forward,backward}_W_c-xr hit the upper bound of max-change-per-component and get rescaled. Hard to tell from clipped-proportion.
There are conflicts in this branch.

Will resolve it.
Force-pushed from 118f18a to 59fb914.
src/nnet3/nnet-simple-component.cc (outdated):

    BaseFloat clipped_proportion =
        (count_ > 0 ? static_cast<BaseFloat>(num_clipped_) / count_ : 0);
    BaseFloat clipped_proportion = (count_ > 0 ?
do you mean to_update->count_ here?
And I don't see how this change is related to the rest of this PR.
It's irrelevant. I will restore this part in this PR and make it in a separate one.
src/nnet3/nnet-training.cc (outdated):

    bool is_gradient = false;  // setting this to true would disable the
                               // natural-gradient updates.
    SetZero(is_gradient, delta_nnet_);
    const int32 num_ucs = NumUpdatableComponents(*delta_nnet_);
call this num_updatable
src/nnet3/nnet-training.cc (outdated):

        }
      }
      KALDI_ASSERT(i == scale_factors.Dim());
      param_delta = std::sqrt(param_delta);
It's not good practice to change variables in-place like this, because it confuses the meaning of a variable.
Better to rename param_delta above this to param_delta_squared, and have BaseFloat param_delta = std::sqrt(param_delta_squared);
      param_delta *= scale;
      if (param_delta > config_.max_param_change) {
        if (param_delta - param_delta != 0.0) {
          KALDI_WARN << "Infinite parameter change, will not apply.";
might be good to mention the component name here.
It is the "global delta".
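The `param_delta - param_delta != 0.0` test in the diff above is a self-comparison idiom for detecting non-finite values: for any finite x, x - x is exactly 0, while for an infinity or a NaN it is NaN, which compares unequal to everything. A small Python illustration of the same IEEE-754 behavior the C++ code relies on (function name hypothetical):

```python
def is_nonfinite(x):
    # For any finite x, x - x == 0.0; for inf or nan, x - x is NaN,
    # and NaN compares unequal to everything, including 0.0.
    return x - x != 0.0
```

Note that this trick is only reliable when the compiler does not assume finite math (e.g. it can break under -ffast-math).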
src/nnet3/nnet-training.cc (outdated):

      } else {
        scale *= config_.max_param_change / param_delta;
        num_max_changes_global_applied_++;
        KALDI_LOG << "Parameters change too big: " << param_delta << " > "
I think you should rework how the log messages are printed. Here you print out a log message per minibatch if you apply the global max change, but not if you apply the per-component max change. If the quota of verboseness is one log message of reasonable length per minibatch, then you could do a better job of compressing more information in that. How about you print out one message if any max-change was applied, and find a way to neatly summarize how many components had a per-component max-change limit applied (and maybe print the component-name of the one that had the smallest scale applied, together with its max-change value), and also say what the global max-change limit was, if any, and what the global max-change was.
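The kind of compressed per-minibatch summary being suggested could look something like the following sketch (names and message format hypothetical, in Python rather than the actual C++ logging code):

```python
def summarize_max_change(per_comp, global_scale, global_limit):
    """Build a single per-minibatch log line.
    per_comp: list of (component_name, scale, limit) tuples for components
    whose per-component max-change was applied this minibatch.
    global_scale < 1.0 means the global max-change was applied too."""
    parts = []
    if per_comp:
        # Report the count, plus the component that was scaled down the most.
        name, scale, limit = min(per_comp, key=lambda t: t[1])
        parts.append("per-component max-change applied to %d component(s); "
                     "smallest scale %.2f on %s (max-change=%.2f)"
                     % (len(per_comp), scale, name, limit))
    if global_scale < 1.0:
        parts.append("global max-change=%.2f applied, scale=%.2f"
                     % (global_limit, global_scale))
    return "; ".join(parts)
```

This keeps the output to at most one message per minibatch while still naming the most-affected component.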
src/nnet3/nnet-training.cc (outdated):

      ans = ans || info.PrintTotalStats(name);
    }
    if (delta_nnet_ != NULL)
      PrintMaxChangesStats();
rename to PrintMaxChangeStats()
src/nnet3/nnet-training.h (outdated):

    /// Applies per-component max-changes and global max-change to all updatable
    /// components in *delta_nnet_, and use *delta_nnet_ to update parameters
    /// in *nnet_.
    void UpdateParamsWithMaxChanges();
Rename Changes() -> Change()
src/nnet3/nnet-training.h (outdated):

    int32 num_minibatches_processed_;

    // stats for max-changes.
    std::vector<int32> num_max_changes_per_component_applied_;
rename changes->change in these 2 variables
Force-pushed from b2671d0 to e3687dc.
      KALDI_LOG << "Parameters change too big: " << param_delta << " > "
                << "--max-param-change=" << config_.max_param_change
                << ", scaling by " << config_.max_param_change / param_delta;
    }
@danpovey This is now how the log message prints out. I decided to split it into two lines since a single line seems too long to include this info. What do you think of the log message?
appied -> applied. How about this: … -- Dan
Force-pushed from e3687dc to a63dd7a, then from 73ce4c9 to 7c9eeb7.
I tested max-change-per-component on chain training using TDNN and BLSTM on swbd, and they gave almost the same results as global max-change.
egs/swbd/s5c/RESULTS (outdated):

    # bidirectional LSTM with the same configuration as the above experiment, plus self-repair of all nonlinearities and clipgradient activated
    # bidirectional LSTM with the same configuration as the above experiment, with self-repair of all nonlinearities and clipgradient, and max-change-per-component activated
    %WER 10.2 | 1831 21395 | 90.8 6.1 3.2 1.0 10.2 44.4 | exp/nnet3/lstm_bidirectional_sp/decode_eval2000_sw1_fsh_fg/score_11_0.0/eval2000_hires.ctm.swbd.filt.sys
Please make sure the results are in the same order as those above, for easy comparison (i.e. whole-test-set first).
    --recurrent-projection-dim 256 \
    --non-recurrent-projection-dim 256 \
    --label-delay $label_delay \
    --self-repair-scale-clipgradient 1.0 \
Is this the default, and how does this relate to your other changes?
1.0 is the default value set for self-repair-scale-clipgradient in make_configs.py. Here I just explicitly tell the user there is such an option.
egs/wsj/s5/steps/nnet3/components.py (outdated):

        'dimension': input['dimension']}

    def AddAffineLayer(config_lines, name, input, output_dim, ng_affine_options = ""):
    def AddAffineLayer(config_lines, name, input, output_dim, ng_affine_options = "", max_change_per_component = 0.25):
I notice that in different parts of the code, for different layer types, you have different values for max_change_per_component. Can you give me some idea of how you tuned this? Was this based just on WER, or did you also look at the diagnostics?
All max-change values are 0.75 for non-final layers and 1.5 for the final layer. I need to make those defaults in components.py the same as those in make_configs.py.
All max-change-per-component values have been set to 0.75 (for non-final layers) / 1.5 (for final layers).
I'm close to merging this, but there is one small thing to fix. In the training setups there is code like this: … and this wouldn't do the right thing if the global max-change is zero but the individual max-changes are nonzero.
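The convention throughout this PR is that a max-change value of zero disables that limit, so any guard has to check the limit is positive before comparing against it. A minimal sketch of a guard that handles the disabled case correctly (hypothetical helper, not the actual Kaldi code):

```python
def max_change_scale(delta_norm, max_change):
    """Return the factor to scale a parameter delta by so that its norm
    does not exceed max_change. A max_change of 0 means 'no limit',
    following this PR's convention, so the check must be conditional
    on max_change being positive."""
    if max_change > 0.0 and delta_norm > max_change:
        return max_change / delta_norm
    return 1.0
```

A guard that only tests `delta_norm > max_change` would wrongly scale everything to zero when the limit is disabled.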
Will squash after the last commit is reviewed and confirmed.
Looks OK, but no need to squash, I am using GitHub's 'squash and merge'.
Yiming, actually before I merge this I want you to provide a mechanism whereby people can make it equivalent to the old models, in case they are in the middle of experiments and want it back-compatible. You can add --max-change-per-component and --max-change-per-component-final options to the top-level make_configs.py scripts. Then people can set these to zero if they want to disable the max-change.
According to the commit:
9569e57