Conversation

@freewym (Contributor) commented Nov 1, 2016

No description provided.

         deriv_time_opts += " --optimization.min-deriv-time={0}".format(left_deriv_truncate)
     if right_deriv_truncate is not None:
-        deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk-width-right_deriv_truncate))
+        deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk_width - 1 - right_deriv_truncate))
Contributor:

This was an issue with the code before your change, but we should still address it:
this code would only make sense if the "correct" settings of right-deriv-truncate were zero or negative.
E.g. suppose you wanted to process derivatives for up to 5 frames before the start and after the end of the supervision, you'd have to set --left-deriv-truncate=-5 and --right-deriv-truncate=-5.
This is kind of weird and unintuitive, IMO.
Also, it's not clear to me why these options are part of the 'chain' namespace (e.g. --chain.left-deriv-truncate), since they relate to the generic nnet3 framework and not to the chain models specifically.
What I propose is to add a new option --trainer.deriv-truncate-margin [default -1 meaning unset; but you can set it to any value >= 0].
Setting this to x >= 0 would lead it to set the command-line options --optimization.min-deriv-time=-x and --optimization.max-deriv-time=chunk_width - 1 + x.
The --chain.left-deriv-truncate option would be retained only for backward compatibility; if used, it would print a warning and set deriv-truncate-margin to the negative of that value.
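
A minimal sketch of the proposed mapping (the helper name is illustrative, not actual Kaldi code):

    def deriv_time_opts_from_margin(deriv_truncate_margin, chunk_width):
        # A margin x >= 0 extends the derivative computation x frames beyond
        # the chunk range [0, chunk_width - 1]; None leaves both options unset.
        if deriv_truncate_margin is None:
            return ""
        return (" --optimization.min-deriv-time={0}"
                " --optimization.max-deriv-time={1}".format(
                    -deriv_truncate_margin,
                    chunk_width - 1 + deriv_truncate_margin))

    # e.g. deriv_time_opts_from_margin(5, 150) gives
    # " --optimization.min-deriv-time=-5 --optimization.max-deriv-time=154"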

Contributor Author:

I guess left-deriv-truncate was originally intended to be non-negative, to truncate the derivative within chunk_width. Anyway, I have made changes to add --trainer.deriv-truncate-margin.

    args.deriv_truncate_margin = -args.left_deriv_truncate
    logger.warning("--chain.left-deriv-truncate (deprecated) is set by user, so "
                   "--trainer.deriv-truncate-margin is set to negative of that "
                   "value={0}.".format(args.deriv_truncate_margin))

if args.deriv_truncate_margin is not None and args.deriv_truncate_margin < 0:
Contributor:

Actually, since you're using None for the default, there is no need to specify that it must be >= 0. You can remove that check.

help="Number of sequences to be processed in parallel every minibatch" )
parser.add_argument("--trainer.deriv-truncate-margin", type=int, dest='deriv_truncate_margin',
default = None,
help="If specified, it is the number of frames that the deriv will be backproped through out of the range [0, chunk_width-1];"
Contributor:

Please watch the line length.
backproped -> backpropagated

@danpovey (Contributor) commented Nov 1, 2016

Thanks; please run a test on one of those setups with that value set to 5.

@freewym (Contributor Author) commented Nov 2, 2016

I set the value to 5, and min-deriv-time and max-deriv-time are set as expected.

@danpovey (Contributor) commented Nov 2, 2016

@vijayaditya, if this is OK with you I can merge now.

@danpovey (Contributor) commented Nov 2, 2016

OK good, but I was more wondering about the effect on WER.


@freewym (Contributor Author) commented Nov 2, 2016

I will run it to completion.

help="Number of sequences to be processed in parallel every minibatch" )
parser.add_argument("--trainer.deriv-truncate-margin", type=int, dest='deriv_truncate_margin',
default = None,
help="If specified, it is the number of frames that the deriv will be backpropagated through "
Contributor:

deriv --> derivative. Please provide an example of how this parameter is used.

help="If specified, it is the number of time steps the derivative will be backpropagated through. It takes the values between [0, chunk_width - 1].
 e.g. During BLSTM model training if the chunk-width is 150, chunk-left-context is 40 and chunk-right-context is 40 specifying  --trainer.deriv-truncate-margin as ......\
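
For concreteness, here is a worked instance of that example under the mapping proposed earlier in the thread (the margin value of 8 is purely hypothetical):

    chunk_width, left_context, right_context = 150, 40, 40
    margin = 8  # hypothetical value, for illustration only
    min_deriv_time = -margin                    # -8
    max_deriv_time = chunk_width - 1 + margin   # 157
    # Derivatives are then backpropagated only for t in [-8, 157], even
    # though the BLSTM sees input frames in [-40, 189] due to the contexts.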

    if args.chunk_right_context < 0:
        raise Exception("--egs.chunk-right-context should be non-negative")

    if args.left_deriv_truncate is not None:
Contributor:

We recommend using the option --trainer.deriv-truncate-margin.

@@ -463,10 +467,10 @@ def TrainOneIteration(dir, iter, srand, egs_dir,
TrainNewModels(dir, iter, srand, num_jobs, num_archives_processed, num_archives,
Contributor:

Use named arguments to avoid user errors when calling the function.

Contributor Author:

Are you suggesting using named arguments for all of them? If so, we might also need to use named arguments in other function calls, for the same reason. I think it might not be necessary, since this function is called only once, within this script.

Contributor:

I am suggesting that you use named arguments when calling the function, not that you change the function definition.

We have been constantly updating the argument lists of these functions, so it is better to change the calls to use named arguments to avoid user errors. Vimal and I have been modifying all the function calls with more than a few arguments to use named arguments, so I would recommend that here too.
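
A toy sketch of the difference (hypothetical function and arguments, not the actual TrainNewModels signature):

    def train_one_iteration(dirname, iteration, srand, num_jobs):
        print(dirname, iteration, srand, num_jobs)

    # Positional call: silently misassigns values if the parameter list
    # is ever reordered or extended.
    train_one_iteration("exp/chain/tdnn", 10, 0, 4)

    # Named call: self-documenting and robust to such changes.
    train_one_iteration(dirname="exp/chain/tdnn", iteration=10, srand=0,
                        num_jobs=4)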

@danpovey (Contributor) commented Nov 3, 2016

@vijayaditya, merge this when you think it's ready.

@vijayaditya (Contributor):

@freewym I am assuming you took care of the BLSTM scripts in local/ for all the egs setups. I will merge once you rebase.

@danpovey (Contributor) commented Nov 3, 2016

Vijay, lately I have been using the "squash and merge" button. You can edit the list of commit names in a little pop-up box if there are objectionable things there.


@freewym (Contributor Author) commented Nov 3, 2016

Right now the fixed BLSTM on swbd is 0.3 worse in WER. I am testing whether extending the backpropagation over a few more frames would help.

@danpovey (Contributor) commented Nov 3, 2016

Remember to look at the WERs on all of eval2000 (subset numbers add no value) and also on train_dev. It's the sum of those two WER differences that is the most meaningful number (lowest variance, if you want to get technical). What were the WER differences on those two test sets?


@vijayaditya (Contributor):

I am comfortable using squash and merge if the branch is up to date, but in this case I am concerned about the staleness of the branch. I sometimes find that auto-merges can mess up the logic, so I usually recommend that developers run their unit tests once they rebase.
What would you suggest?

@danpovey (Contributor) commented Nov 3, 2016

OK I guess.


@freewym (Contributor Author) commented Nov 3, 2016

They are both worse, by 0.3 (eval2000) and <0.18 (train_dev) respectively:

%WER 15.2 | 4459 42989 | 86.4 9.2 4.4 1.6 15.2 51.5 | exp/nnet3/lstm_bidirectional_max_deriv_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.filt.sys
%WER 16.3 | 4459 42989 | 85.5 9.9 4.6 1.7 16.3 53.3 | exp/nnet3/lstm_bidirectional_max_deriv_sp/decode_eval2000_sw1_tg/score_10_0.0/eval2000_hires.ctm.filt.sys
%WER 14.09 [ 6828 / 48460, 745 ins, 1794 del, 4289 sub ] exp/nnet3/lstm_bidirectional_max_deriv_sp/decode_train_dev_sw1_tg//wer_10_0.0
%WER 13.23 [ 6409 / 48460, 676 ins, 1807 del, 3926 sub ] exp/nnet3/lstm_bidirectional_max_deriv_sp/decode_train_dev_sw1_fsh_fg//wer_11_0.0

%WER 14.9 | 4459 42989 | 86.7 9.1 4.2 1.6 14.9 50.7 | exp/nnet3/lstm_bidirectional_adversary0.0_sp/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.filt.sys
%WER 16.0 | 4459 42989 | 85.7 9.8 4.5 1.7 16.0 52.7 | exp/nnet3/lstm_bidirectional_adversary0.0_sp/decode_eval2000_sw1_tg/score_10_0.0/eval2000_hires.ctm.filt.sys
%WER 13.91 [ 6739 / 48460, 730 ins, 1790 del, 4219 sub ] exp/nnet3/lstm_bidirectional_adversary0.0_sp/decode_train_dev_sw1_tg//wer_10_0.0
%WER 13.19 [ 6394 / 48460, 718 ins, 1768 del, 3908 sub ] exp/nnet3/lstm_bidirectional_adversary0.0_sp/decode_train_dev_sw1_fsh_fg//wer_10_0.0

@danpovey (Contributor) commented Nov 3, 2016

Was the objective function worse than the baseline?
You can run local/info/nnet3_dir_info.pl, which shows the objfs very compactly.

@freewym (Contributor Author) commented Nov 3, 2016

Yes, also a little worse in objf:
exp/nnet3/lstm_bidirectional_max_deriv_sp:
loglike:train/valid[454,683,combined]=(-0.77,-0.63,-0.61/-0.94,-0.91,-0.90)

exp/nnet3/lstm_bidirectional_adversary0.0_sp:
loglike:train/valid[454,683,combined]=(-0.76,-0.61,-0.60/-0.95,-0.89,-0.88)

@freewym force-pushed the max_deriv_time branch 2 times, most recently from adb336a to bb54853 on November 4, 2016 03:09
@danpovey (Contributor) commented Nov 5, 2016

@freewym, have the experiments with margin=5 finished?

@freewym (Contributor Author) commented Nov 5, 2016

Its WER is worse by 0.2 (for eval2000) and 0.13 (for train_dev) using the BLSTM chain model on swbd. I am increasing the margin to 20.

@danpovey (Contributor) commented Nov 5, 2016

It's possible that it's just random noise; you might want to rerun the baseline with a different srand seed. And check that nothing changed regarding, for instance, per-component max change.


@danpovey (Contributor):

@freewym, have you got any further results on this?

@freewym (Contributor Author) commented Nov 13, 2016

On AMI IHM, the WER ordering is margin=10 < margin=5 < the "old" setup, using the BLSTM+xent model, which shows the fix can at least achieve the same performance on this data. I am now testing on sdm1.

@danpovey (Contributor):

@freewym, let me know when you think this is ready to merge.

-        deriv_time_opts += " --optimization.min-deriv-time={0}".format(left_deriv_truncate)
-    if right_deriv_truncate is not None:
-        deriv_time_opts += " --optimization.max-deriv-time={0}".format(int(chunk-width-right_deriv_truncate))
+    if not left_deriv_truncate is None:
Contributor Author:

@danpovey do you think it would be better to pass in {min|max}_deriv_time instead of {left|right}_deriv_truncate? That way: 1) we don't need to pass the argument chunk_width all the way down; 2) we can compute the deriv times in a much more outer function, like Train(); 3) it is consistent with train_rnn.py.
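
A self-contained sketch of that alternative (lowercase names are illustrative, not the actual script functions):

    def train_new_models(min_deriv_time, max_deriv_time):
        # chunk_width is no longer needed here; the caller resolved it.
        opts = ""
        if min_deriv_time is not None:
            opts += " --optimization.min-deriv-time={0}".format(min_deriv_time)
        if max_deriv_time is not None:
            opts += " --optimization.max-deriv-time={0}".format(max_deriv_time)
        return opts

    def train(deriv_truncate_margin, chunk_width):
        # Compute the deriv times once, in the outer function.
        if deriv_truncate_margin is None:
            return train_new_models(None, None)
        return train_new_models(-deriv_truncate_margin,
                                chunk_width - 1 + deriv_truncate_margin)

    # train(5, 150) yields:
    # " --optimization.min-deriv-time=-5 --optimization.max-deriv-time=154"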

@danpovey (Contributor):

That would be fine with me.


@freewym (Contributor Author) commented Nov 17, 2016

@danpovey BLSTM+xent on sdm1 with a margin of 10 improves WER by 0.6 on dev and 0.2 on eval, respectively. All of those tests used the old ClipGradientComponent. I think it is ready to merge. Perhaps I need to further tune the zeroing threshold in BackpropTruncationComponent with this fix.

@danpovey (Contributor):

I'll merge this now; you can make a separate simple commit to change the zeroing threshold.


@danpovey merged commit 5874bc4 into kaldi-asr:master on Nov 17, 2016
@freewym (Contributor Author) commented Nov 17, 2016

@vimalmanohar You may have to make the changes in #1066

@freewym deleted the max_deriv_time branch on November 18, 2016 18:27