-
Notifications
You must be signed in to change notification settings - Fork 5.4k
fixed "max_deriv_time unset" issue for BLSTM #1165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| deriv_time_opts += " --optimization.min-deriv-time={0}".format(left_deriv_truncate) | ||
| if right_deriv_truncate is not None: | ||
| deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk-width-right_deriv_truncate)) | ||
| deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk_width - 1 - right_deriv_truncate)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an issue with the code before your change, but we should still address it:
This code would only make sense if the "correct" settings of right-deriv-truncate were zero or negative.
E.g. suppose you wanted to process derivatives for up to 5 frames before and after the end of the supervision, you'd have to set --left-deriv-truncate=-5 and --right-deriv-truncate=-5.
This is kind of weird and unintuitive, IMO.
Also, it's not clear to me why these options are part of the 'chain' namespace (e.g. --chain.left-deriv-truncate) since they relate to the generic nnet3 framework and not to the chain models specifically.
What I propose is to add a new option --trainer.deriv-truncate-margin [default -1 meaning unset; but you can set it to any value >= 0].
Setting this to x >= 0 would lead it to set the command-line options --optimization.min-deriv-time=-x and --optimization.max-deriv-time=chunk_width - 1 + x
The --chain.min-deriv-time option would be retained only for back compatibility; if used it would print a warning and would set deriv-truncate-margin to the negative of that value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess left-deriv-truncate was originally intended to be non-negative to truncate the deriv within chunk_width. Anyway, I have made changes to add --trainer.deriv-truncate-margin
| args.deriv_truncate_margin = -args.left_deriv_truncate | ||
| logger.warning("--chain.left-deriv-truncate (deprecated) is set by user, so --trainer.deriv-truncate-margin is set to negative of that value={0}.".format(args.deriv_truncate_margin)) | ||
|
|
||
| if (not args.deriv_truncate_margin is None) and args.deriv_truncate_margin < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since you're using None for the default, there is no need to specify that it must be >= 0. You can remove that check.
| help="Number of sequences to be processed in parallel every minibatch" ) | ||
| parser.add_argument("--trainer.deriv-truncate-margin", type=int, dest='deriv_truncate_margin', | ||
| default = None, | ||
| help="If specified, it is the number of frames that the deriv will be backproped through out of the range [0, chunk_width-1];" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please watch line length.
backproped -> backpropagated
|
Thanks- please run a test on on of those setups with setting that value to 5. |
|
I set the value to 5 ,and the min-deriv-time and max-deriv-time are set as expected. |
|
@vijayaditya, if this is OK with you I can merge now. |
|
OK good, but I was more wondering about the effect on WER. On Tue, Nov 1, 2016 at 8:59 PM, Yiming Wang notifications@github.com
|
|
I will run it to completion. |
| help="Number of sequences to be processed in parallel every minibatch" ) | ||
| parser.add_argument("--trainer.deriv-truncate-margin", type=int, dest='deriv_truncate_margin', | ||
| default = None, | ||
| help="If specified, it is the number of frames that the deriv will be backpropagated through " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deriv --> derivative. Please provide an example of how this parameter is used.
help="If specified, it is the number of time steps the derivative will be backpropagated through. It takes the values between [0, chunk_width - 1].
e.g. During BLSTM model training if the chunk-width is 150, chunk-left-context is 40 and chunk-right-context is 40 specifying --trainer.deriv-truncate-margin as ......\| if args.chunk_right_context < 0: | ||
| raise Exception("--egs.chunk-right-context should be non-negative") | ||
|
|
||
| if not args.left_deriv_truncate is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recommend using the option --trainer.deriv-truncate-margin.
| @@ -463,10 +467,10 @@ def TrainOneIteration(dir, iter, srand, egs_dir, | |||
| TrainNewModels(dir, iter, srand, num_jobs, num_archives_processed, num_archives, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use named arguments to avoid user errors during function call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting use named arguments for all of them? if so we might also need to use named arguments for other function calls for the same reason. I think it might not be necessary since this function is called once within only this script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am suggesting that you use named arguments while calling the function and not change the function definition.
We have been constantly updating the argument list for these function, so better change it to use named arguments to avoid user errors. Vimal and I have been modifying all the function calls with more than a few arguments to use named arguments so I would recommend that here too.
|
@vijayaditya, merge this when you think it's ready. |
|
@freewym I am assuming you took care of blstm scripts in local for all egs. I will merge once you rebase. |
|
Vijay, lately I have been using the "squash and merge" button. You can On Thu, Nov 3, 2016 at 6:20 PM, Vijayaditya Peddinti <
|
|
Right now the fixed BLSTM on swbd is 0.3 worse in WER. I am testing if extending the backprop to some more frames would help. |
|
Remember to look at the WERs on all of eval2000 (subset numbers add no On Thu, Nov 3, 2016 at 6:24 PM, Yiming Wang notifications@github.com
|
|
I am comfortable using squash and merge if the branch is up-to-date. But in this case I am concerned about the staleness of the branch. I sometimes find auto-merges can mess up the logic, so I am usually recommending that the developers run their unit-tests once they rebase. |
|
OK I guess. On Thu, Nov 3, 2016 at 6:27 PM, Vijayaditya Peddinti <
|
|
They are both worse, by 0.3 and <0.18 respectively %WER 15.2 | 4459 42989 | 86.4 9.2 4.4 1.6 15.2 51.5 | exp/nnet3/lstm_bidirectional_max_deriv_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.filt.sys %WER 14.9 | 4459 42989 | 86.7 9.1 4.2 1.6 14.9 50.7 | exp/nnet3/lstm_bidirectional_adversary0.0_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.filt.sys |
|
Was the objective function worse than the baseline? |
|
Yes, also a little worse in objf: exp/nnet3/lstm_bidirectional_adversary0.0_sp: |
adb336a to
bb54853
Compare
|
@freewym, have the experiments with margin=5 finished? |
|
Its WER is worse by 0.2(for eval2000) and 0.13 (for train_dev) using blstm chain model on swbd. I am increasing the margin to 20. |
|
It's possible that it's just random noise-- you might want to rerun the On Sat, Nov 5, 2016 at 4:55 PM, Yiming Wang notifications@github.com
|
|
@freewym, have you got any further results on this? |
|
On ami ihm, the WER is margin=10 < margin=5 < "old" setup using blstm+xent model, which shows the fix can at least achieve the same performance on this data. I am now testing on sdm1. |
bb54853 to
0049568
Compare
0049568 to
0482e82
Compare
|
@freewym, let me know when you think this is ready to merge. |
| deriv_time_opts += " --optimization.min-deriv-time={0}".format(left_deriv_truncate) | ||
| if right_deriv_truncate is not None: | ||
| deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk-width-right_deriv_truncate)) | ||
| if not left_deriv_truncate is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danpovey do you think it is better to pass in {min|max}_deriv_time instead of {left|right}_deriv_truncate? In that way 1) we don't need to pass in the argument chunk_width all along the way 2) we can compute the deriv_time in a much more outer function like in Train(); 3) it is consistent with train_rnn.py
|
That would be fine with me. On Wed, Nov 16, 2016 at 10:01 PM, Yiming Wang notifications@github.com
|
2e03607 to
52fabe5
Compare
|
@danpovey BLSTM+xent on sdm1 using the margin of 10 improves WER by 0.6 on dev and 0.2 on eval, respectively. All of those tests are using the old ClipGradientComponent. I think it is ready to merge. Perhaps I need to further tune the zeroing threshold in BackpropTruncationComponent with this fix. |
|
I'll merge this now, you can make a separate simple commit to change the On Thu, Nov 17, 2016 at 1:22 PM, Yiming Wang notifications@github.com
|
|
@vimalmanohar You may have to make the changes in #1066 |
No description provided.