Augmentation recipe for swbd #1112
Conversation
Does it bring any improvement to use speed and volume perturbation in addition to adding noises?

Tom, it would be good if you could give us a little background on this. (Replying to Rémi Francis, Oct 12, 2016.)
egs/swbd/s5c/RESULTS
Outdated
%WER 11.6 | 1831 21395 | 89.7 7.0 3.3 1.4 11.6 47.0 | exp/chain/tdnn_7d_sp/decode_eval2000_sw1_tg/score_10_1.0/eval2000_hires.ctm.swbd.filt.sys
# results with chain TDNNs (2 epoch training on data reverberated with room impulse responses) (see local/chain/multi_condition/run_tdnn_7b.sh)
can you clarify at this point how many repetitions of the data there were, and whether or not it included speed perturbation?
stage=1
train_stage=-10
get_egs_stage=-10
speed_perturb=true
OK, I see that by default this uses the speed-perturbed data as input. You might want to have a comment at the top that clarifies this.
@@ -0,0 +1,78 @@
#!/bin/bash

# Copyright 2014 Johns Hopkins University (author: Vijayaditya Peddinti)
I don't think it will be necessary to add this script once you make those other changes.
I'll talk to David separately to see if we can come up with a roadmap to make it unnecessary to have alignments, for purposes of training the speaker-id systems... this is very inconvenient. I don't want to have a situation where the need for these alignments is making the scripts super complicated [see my other, long, comment].
--num-replications $num_data_reps \
--max-noises-per-minute 1 \
--source-sampling-rate 8000 \
data/${clean_data_dir} data/${train_set}
Some more comments after looking at the script in more detail:
If we want to use this script as a starting point for how to do data reverberation it's going to be quite hard.
For a start, it has a very messy dependency on the regular run_ivector_common.sh, because it assumes that the train_nodup_sp directory already exists. There are scenarios where we'd want to run the reverberation stuff from scratch without running the regular run_ivector_common.sh. It would be better to have this script do both the speed perturbing and the RIR stuff; to avoid overwriting features that might have been already written by run_ivector_common.sh, you could put that step first so the --stage parameter can be used to avoid redoing stuff, and have the script die if the feats.scp exists in the _sp directory (and would be overwritten if the script were to run).
To keep things simple, you can just make the speed-perturbing non-optional. I don't think anyone will want to do this without speed perturbing; if you want to do so for your own experiments you can do it locally.
Also, this script seems to have a very detailed dependency on the Switchboard setup, because of the way you prepare the data subset for the lda_mllt stage, and the way you re-use the alignments.
The LDA+MLLT transform is extremely non-critical, because it only affects the diagonal GMM which is just used to initialize the full GMM and to pre-prune when getting GMM posteriors. And the number of parameters is tiny. You could use a very small data subset for that, and it would be best to eliminate all of the detailed dependency on the Switchboard setup. What I think would make more sense is, after you create the _mix data, to select a fairly small subset of it (with a num-utts that can be specified on the command line, like 20k), and to dump parallel features with regular MFCC and _hires, and then use a supplied model [which should be defined at the top of the script, and not specified in the body of the script], to get the alignments so you can train a system on the _hires features.
Also, with regard to the way you're mixing it with the un-perturbed data (_mix)...
is there a reason why it's better to mix it manually like that, instead of specifying (say) a number less than 1 for the following:
--speech-rvb-probability 1
--pointsource-noise-addition-probability 1
--isotropic-noise-addition-probability 1
and maybe using 2 replications instead of 1? It seems to me that that would be more elegant and simpler, if the results were the same.
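As a sanity check on that suggestion, here is a small sketch of the expected data composition. The arithmetic and the function are mine, not from the thread; I'm assuming each replication reverberates an utterance independently with probability `--speech-rvb-probability`:

```python
def expected_mix(num_replications, p):
    """Expected number of reverberated vs. untouched copies per utterance,
    assuming each of num_replications copies is reverberated independently
    with probability p (hypothetical model of --speech-rvb-probability)."""
    reverberated = num_replications * p
    clean = num_replications * (1.0 - p)
    return clean, reverberated

# Manual mixing gives exactly 1 clean + 1 reverberated copy per utterance;
# p=0.5 with 2 replications gives the same proportions in expectation:
print(expected_mix(2, 0.5))  # (1.0, 1.0)
```

The difference is only variance: manual mixing fixes the split per utterance, while the probabilistic version matches it in expectation.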
unzip rirs_noises.zip
fi

# corrupt the data to generate reverberated data
please clarify via a comment that this doesn't do any real computation, it just writes commands in the wav.scp file.
egs/swbd/s5c/RESULTS
Outdated
# results with chain TDNNs (2 epoch training on data reverberated with room impulse responses) (see local/chain/multi_condition/run_tdnn_7b.sh)
%WER 10.0 | 1831 21395 | 91.0 6.0 3.0 1.1 10.0 43.8 | exp/chain/tdnn_7b_sp_rvb1_mix/decode_eval2000_sw1_fsh_fg/score_10_0.5/eval2000_hires.ctm.swbd.filt.sys
%WER 20.0 | 2628 21594 | 82.1 11.7 6.2 2.1 20.0 55.6 | exp/chain/tdnn_7b_sp_rvb1_mix/decode_eval2000_sw1_fsh_fg/score_10_0.0/eval2000_hires.ctm.callhm.filt.sys
%WER 15.0 | 4459 42989 | 86.5 8.8 4.7 1.6 15.0 50.7 | exp/chain/tdnn_7b_sp_rvb1_mix/decode_eval2000_sw1_fsh_fg/score_10_0.5/eval2000_hires.ctm.filt.sys
It would be easier to parse these results in context if you make it compatible with the numbers just above and below, by putting the whole-test-set number (15.0) first, the 10.0 number second, and removing the 20.0 number [people rarely quote the callhome-only subset... I don't mind normally but this is just for consistency with the numbers above and below]
@danpovey Regarding your suggestion of using a smaller value for --speech-rvb-probability, say 0.5, …

I responded by email a day or two ago, but github lost my comment. Resending: How about adding an option --include-original-data in your script, so it will know to always include one copy of the original (checking first that the num-copies > 1)?

@danpovey I have thought about that too, but then the features will have to be extracted again for the original data. Is that fine?

I replied by email, but github reply-by-email is very flaky right now, so I'm posting it directly too. That's OK.

@danpovey Do you think it is critical to let the perturbed data be included in the training of the UBM and ivector-extractor? Actually I have been verifying this recently, and the result is a little bit worse if the perturbed data is not included.

[reposting since my email was lost.]
…ript; modify swbd-rvb script
I think you should generate all the perturbed data, then if the amount of … (Replying to Tom Ko, Oct 22, 2016.)
@danpovey Do you have time to see if further modification is needed?

[reposting directly since github is still delaying my email]
help="Sampling rate of the source data. If a positive integer is specified with this option, "
     "the RIRs/noises will be resampled to the rate of the source data.")
parser.add_argument("--include-original-data", type=str, help="If true, the output data includes one copy of the original data",
                    choices=['true', 'false'], default = False)
This doesn't look right. If you have string-valued choices then you have to have default = 'true', and check it with args.include_original_data == 'true'.
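Dan's fix, sketched against a standalone parser rather than the actual reverberate_data_dir.py:

```python
import argparse

parser = argparse.ArgumentParser()
# With string-valued choices, the default must be one of those strings;
# a boolean default like False could never equal 'true' or 'false'.
parser.add_argument("--include-original-data", type=str,
                    choices=['true', 'false'], default='true',
                    help="If true, the output data includes one copy of the original data")

args = parser.parse_args(["--include-original-data", "false"])
# ...and the value is checked by string comparison, not truthiness:
include_original = (args.include_original_data == 'true')
print(include_original)  # False
```

Note that argparse never coerces "true"/"false" strings to booleans on its own, which is why the explicit string comparison is needed.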
Can you please run a version of this based on the 7e script, which is the current best?
I'll try to in a day or two, but I was kind of hoping Vijay would chime in. (Replying to Tom Ko, Oct 26, 2016.)
@danpovey these are the results of running data reverberation on the 7e script: the callhm result is improved from 7b to 7e. I can't find the 7e baseline result in RESULTS.
It's at the top of the script itself. The comparable number to your 14.6% … Can you please move the script to local/chain/tuning/run_tdnn_7f.sh, make … (Replying to Tom Ko, Oct 28, 2016.)
... also, you can make a comment that the difference may not be 100% due to … (Follow-up by Daniel Povey, Oct 28, 2016.)
but per-component max-change is already added in 7e, so from 7e to 7f only reverberation is added
No, the 7e script and the results in it are from before the per-component … (Replying to Tom Ko, Oct 29, 2016.)
Then maybe I will rerun the normal 7e script with un-reverberated data, to check the improvement from per-component max-change and from reverberation separately.
stage=1
num_data_reps=1 # number of reverberated copies of data to generate
clean_data_dir=train_nodup
I don't like the fact that your script changes this variable by adding _sp to it, because it could mislead a reader into thinking that they understand what the variable is.
Better to use different variable names-- you could call this 'input_data_dir', and have 'clean_data_dir' be either …
speed_perturb=true
dir=exp/chain/tdnn_7b # Note: _sp will get added to this if $speed_perturb == true.
decode_iter=
iv_dir=exp/nnet3_rvb
please call this something like ivector_dir if that's what you mean by 'iv'.
# TDNN options
# this script uses the new tdnn config generator so it needs a final 0 to reflect that the final layer input has no splicing
# smoothing options
pool_window=
These pooling options were deprecated long ago-- please remove them. When you get the final 7f script, please also 'diff' with the 7e script and see if there are any other respects in which your script is outdated-- I want it as similar as possible with 7e.. You may need to change some things and rerun.
# if we are using the speed-perturbed data we need to generate
# alignments for it.
# Also the data reverberation will be done in this script.
echo local/nnet3/multi_condition/run_ivector_common.sh --stage $stage \
I assume this 'echo' should not be there.
sort -u $rvb_lat_dir/temp/combined_lats.scp > $rvb_lat_dir/temp/combined_lats_sorted.scp

lattice-copy scp:$rvb_lat_dir/temp/combined_lats_sorted.scp "ark:|gzip -c >$rvb_lat_dir/lat.1.gz" || exit 1;
echo "1" > $rvb_lat_dir/num_jobs
This looks like it would be extremely slow for large data sets, it's all in one job.
In any case, all the chain-training script (get_egs) does with the lattices is to immediately copy them to ark,scp format, which is what you've done here. So better to change get_egs.sh so that it requires either lat.*.gz or lat.scp to exist. If lat.scp exists, all get_egs.sh has to do is copy it to the right directory.
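A hypothetical stand-in for the proposed get_egs.sh behaviour (written in Python rather than shell; the function name and layout are mine, not Kaldi's):

```python
import glob
import os
import shutil

def prepare_lattices(lat_dir, egs_dir):
    """Accept either a pre-built lat.scp or gzipped archives lat.*.gz.
    If lat.scp exists, the only work needed is copying it into place,
    which avoids funnelling every lattice through one lattice-copy job.
    Sketch of the behaviour proposed in the review comment above."""
    scp = os.path.join(lat_dir, "lat.scp")
    if os.path.exists(scp):
        shutil.copy(scp, os.path.join(egs_dir, "lat.scp"))
        return "scp"
    if glob.glob(os.path.join(lat_dir, "lat.*.gz")):
        return "gz"
    raise RuntimeError("expected lat.scp or lat.*.gz in " + lat_dir)
```

The point of the design is that the scp path degrades to a cheap file copy, while the archive path keeps backward compatibility with existing alignment directories.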
If you do that, then please change the numbering and make the rerun 7f, and … (Replying to Tom Ko, Oct 29, 2016.)
In that case, the 7f (rerun) script will be exactly the same as the 7e script.
Yes.
@danpovey Here is the comparison from 7e->7f and 7f->7g. [results table: System, 7f, 7g] For the 7e result, I just manually copied it from the top of the 7e script. You can see there is no obvious improvement from 7e -> 7f (adding per-component max-change); I don't know if this was due to randomness between different runs. Do you still want to add the 7f script, where 7g represents the reverberation script?
Yes, please create both scripts, it will more accurately document the … Dan (Replying to Tom Ko, Oct 31, 2016.)
This should be very close to ready to merge, or maybe ready. |
# current best 'chain' models with TDNNs (see local/chain/run_tdnn_7d.sh)
%WER 10.4 | 1831 21395 | 90.7 6.1 3.2 1.2 10.4 44.6 | exp/chain/tdnn_7d_sp/decode_eval2000_sw1_fsh_fg/score_11_1.0/eval2000_hires.ctm.swbd.filt.sys
%WER 11.6 | 1831 21395 | 89.7 7.0 3.3 1.4 11.6 47.0 | exp/chain/tdnn_7d_sp/decode_eval2000_sw1_tg/score_10_1.0/eval2000_hires.ctm.swbd.filt.sys
# current best 'chain' models with TDNNs (see local/chain/run_tdnn_7g.sh)
Does the model converge after 2 epochs of training? Could you please post the log-likelihood plots here.
@@ -0,0 +1,210 @@
#!/bin/bash

# 7e is as 7f, but adding the max-change-per-component to the neural net training
7f is as 7e
# TDNN options
# this script uses the new tdnn config generator so it needs a final 0 to reflect that the final layer input has no splicing
# smoothing options
what smoothing are you referring to?
# which leads to better results
# This script assumes a mixing of the original training data with its reverberated copy
# and results in a 2-fold training set. Thus the number of epochs is halved to
# keep the same training time.
Add a comment describing what happens if you train for more epochs.
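The epoch-halving described in the quoted comment amounts to keeping epochs times fold constant; a one-line sketch (my arithmetic, not code from the recipe, and the base epoch count of 4 is only an illustration):

```python
def scaled_epochs(base_epochs, fold):
    """Scale the epoch count down for an n-fold augmented training set so the
    total number of frames processed stays roughly the same (sketch only)."""
    return max(1, base_epochs // fold)

# A 2-fold (original + reverberated) set trained for half the epochs
# sees about as many frames as the original set trained for the full count:
print(scaled_epochs(4, 2))  # 2
```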
# TDNN options
# this script uses the new tdnn config generator so it needs a final 0 to reflect that the final layer input has no splicing
move this comment to the splice indexes specification.
rm -r data/temp1 data/temp2

mfccdir=mfcc_perturbed
steps/make_mfcc.sh --cmd "$train_cmd" --nj 50 \
add a comment describing why you need these features.
clean_data_dir=${input_data_dir}_sp
else
clean_data_dir=${input_data_dir}
add a comment here saying we recommend speed perturbation as the gains are significant.
# if --include-original-data is true, the original data will be mixed with its reverberated copies
python steps/data/reverberate_data_dir.py \
  --prefix "rev" \
  --rir-set-parameters "0.3, simulated_rirs_8k/smallroom/rir_list" \
what happens to the other 0.1 probability mass? could you add a comment here describing how these weights are used?
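One plausible interpretation, written only to make the question concrete; this is an assumption, not the measured behaviour of reverberate_data_dir.py: per-set weights may simply be renormalized to sum to 1, in which case leftover mass is redistributed proportionally.

```python
def normalize(weights):
    """ASSUMPTION: renormalize per-RIR-set weights to sum to 1.
    Not taken from reverberate_data_dir.py; shown only to illustrate
    what could happen to probability mass that isn't explicitly assigned."""
    total = sum(weights)
    return [w / total for w in weights]

# If three RIR sets are each given weight 0.3, renormalization spreads the
# remaining 0.1 of mass evenly, so each set ends up with probability 1/3:
probs = normalize([0.3, 0.3, 0.3])
print(probs)
```

Whether the script actually renormalizes, or treats the remainder as "no reverberation", is exactly what the requested comment should pin down.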
if [ $stage -le 5 ]; then
steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
  --splice-opts "--left-context=3 --right-context=3" \
  5500 90000 data/train_100k_nodup_hires \
don't you want to train the lda_mllt transform on a mix of reverberated and clean data?
@vijayaditya just training the lda_mllt transform on clean data is good enough, and this avoids copying the alignments.
OK, in that case could you add a comment here saying the same; this would help avoid any confusion.
if reverberate_opts == "":

# prefix with index 0, e.g. rvb0_swb0035, stands for the original data
# prefix using index 0 is reserved for original data e.g. rvb0_swb0035 corresponds to the swb0035 recording in original data
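The prefixing convention described above can be sketched as follows (a hypothetical helper; the actual script builds these ids internally):

```python
def prefixed_id(utt_id, copy_index, prefix="rvb"):
    """Build the per-copy utterance id: copy 0 is reserved for the original
    recording, and copies 1..N are reverberated versions (sketch of the
    naming convention, not code from reverberate_data_dir.py)."""
    return "%s%d_%s" % (prefix, copy_index, utt_id)

print(prefixed_id("swb0035", 0))  # rvb0_swb0035 -> the original recording
print(prefixed_id("swb0035", 1))  # rvb1_swb0035 -> first reverberated copy
```

Reserving index 0 keeps the original and augmented copies sortable under one scheme, which matters for Kaldi's sorted scp files.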
That's OK with me. The LDA+MLLT is the least critical part of the whole … (Replying to Tom Ko, Nov 4, 2016.)
# smoothing options
self_repair_scale=0.00001
# training options
num_epochs=2
Did you get a chance to check the log-likelihood values at the end of training? Did the training converge? Is there no improvement from running the training for a few more epochs?
@vijayaditya I have checked that there is no improvement from training for more epochs. I guess we have already shown the convergence and the likelihood values in our paper.
vijayaditya left a comment:
Will merge after the two requested minor changes have been made.
. ./cmd.sh

stage=1
stage=3
Did you forget to change this back?
egs/swbd/s5c/local/chain/run_tdnn.sh
Outdated
@@ -1 +1 @@
-tuning/run_tdnn_7e.sh
\ No newline at end of file
+tuning/run_tdnn_7g.sh
\ No newline at end of file
I am asking this question as we will not be able to compare our results with other papers. We don't already do it anyway, as we use speed perturbation. So @tomkocse could you just add a commented line in this script:
# for the swbd recipe without reverberation of the training data, use the following script;
# it is similar to run_tdnn_7g.sh except for the run_ivector_common.sh being called.
# tuning/run_tdnn_7f.sh
Actually, I'm not sure, regarding making it the preferred Switchboard … Dan (Replying to Vijayaditya Peddinti, Nov 7, 2016.)
What about moving the reverberated recipe (7g) to local/chain/multi_condition, then making the non-reverberated one (7f) the preferred recipe?
OK. (Replying to Tom Ko, Nov 7, 2016.)
No description provided.