multi_en: Fixed acronym normalization, swbd lexicon preparation, OOV … #2137
342a3f3 to
4998968
Compare
egs/multi_en/s5/run.sh
Outdated
| --cmd "$train_cmd" --nj 40 \ | ||
| $data exp/make_mfcc/$c/train || exit 1; | ||
| steps/compute_cmvn_stats.sh \ | ||
| $data exp/make_mfcc/$c/train || exit 1; |
Exiting here is not very useful. Better to do || touch $dir/.error and check for that later.
egs/multi_en/s5/run.sh
Outdated
| # We prepare the dictionary in data/local/dict_combined. | ||
| local/prepare_dict.sh $swbd $tedlium2 | ||
| local/g2p/train_g2p.sh --stage 0 --silence-phones "data/local/dict_combined/silence_phones.txt" data/local/dict_combined exp/g2p | ||
| wait |
Comment that you are waiting for train_g2p.sh to finish.
Also you should check for errors if it has failed.
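A minimal sketch of the two requests above (comment the `wait`, then check the result), assuming the training step was started in the background. `train_step` is a hypothetical stand-in for `local/g2p/train_g2p.sh`, and the failing exit code is made up for illustration.

```shell
# Hypothetical stand-in for a long-running step such as G2P training.
train_step() {
  return 3   # simulate a failure
}

train_step &
train_pid=$!

# ... other independent preparation could run here ...

# Wait for the background training step to finish before using its output,
# and capture its exit status instead of discarding it.
status=0
wait "$train_pid" || status=$?

if [ "$status" -ne 0 ]; then
  echo "background step failed with status $status"
  # A real script would 'exit 1' here, in the parent shell.
fi
```

Waiting on a specific PID, rather than a bare `wait`, is what makes the exit status of that job available to the parent.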
4998968 to
8295e35
Compare
fixed
egs/fisher_swbd/s5/path.sh
Outdated
| @@ -1,4 +1,4 @@ | |||
| export KALDI_ROOT=`pwd`/../../.. | |||
| export KALDI_ROOT=../../kaldi | |||
Don't change the standard, which is the same in all recipes. Also fix this in multi-en and elsewhere you changed.
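One practical property of the standard form, offered as an illustration rather than as the reviewer's stated reason: because it goes through `pwd`, the value is absolute and keeps pointing at the same place after a later `cd`. The directory layout below is a throwaway temporary tree, not the Kaldi checkout, and `rel_root` is a hypothetical relative variant used only for comparison.

```shell
# Sketch: an absolute root survives a later 'cd'; a relative one does not.
start=$(pwd)
demo=$(mktemp -d)
mkdir -p "$demo/egs/demo/s5/data"
cd "$demo/egs/demo/s5"

abs_root=`pwd`/../../..   # form used in the recipes' path.sh
rel_root=../../..         # relative variant, for comparison only

cd data                   # a script changes directory later on

abs_resolved=$(cd "$abs_root" && pwd -P)
rel_resolved=$(cd "$rel_root" && pwd -P)
echo "absolute form resolves to: $abs_resolved"
echo "relative form resolves to: $rel_resolved"

cd "$start"
```

After the `cd data`, the relative variant resolves one level too low, while the `pwd`-based value still names the top of the temporary tree.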
egs/fisher_swbd/s5/run.sh
Outdated
|
|
||
| # prepare fisher data and put it under data/train_fisher | ||
| local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \ | ||
| local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \ |
Fix indentation.
| steps/align_si.sh --boost-silence 1.25 --nj 5 --cmd "$train_cmd" \ | ||
| data/train_clean_5 data/lang_nosp exp/mono exp/mono_ali_train_clean_5 | ||
| fi | ||
|
|
This is not meant to be changed.
egs/multi_en/s5/path.sh
Outdated
| @@ -1,7 +1,8 @@ | |||
| export KALDI_ROOT=`pwd`/../../.. | |||
| export KALDI_ROOT=../../kaldi | |||
Change this.
egs/swbd/s5c/path.sh
Outdated
| @@ -1,4 +1,4 @@ | |||
| export KALDI_ROOT=`pwd`/../../.. | |||
| export KALDI_ROOT=/export/b04/xzhang/kaldi | |||
Fix this
| --speed-perturb $speed_perturb \ | ||
| --generate-alignments $speed_perturb || exit 1; | ||
|
|
||
| exit |
Remove this.
| # configs for 'chain' | ||
| stage=12 | ||
| train_stage=-10 | ||
| stage=15 |
Undo this change.
egs/wsj/s5/utils/mkgraph.sh
Outdated
| tscale=1.0 | ||
| loopscale=0.1 | ||
|
|
||
| unkscale=1.0 |
This does not seem to be used.
sorry I unintentionally committed some files I didn't plan to commit..
finished cleaning up
41d52dd to
7b5d0e8
Compare
egs/multi_en/s5/run.sh
Outdated
| if [ $stage -le 8 ]; then | ||
| # get rid of spk2gender files because not all corpora have them | ||
| rm -f data/*/train/spk2gender | ||
| rm data/*/train/spk2gender 2>/dev/null |
There should be -f. Otherwise the script might exit because of set -e.
I don't like using -f to make 'rm' be quiet, because it overrides when people change modes to save something. Instead, do || true
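The `|| true` suggestion can be sketched as below. This is an illustration under stated assumptions, not the recipe's code: the scratch directory is temporary, and each case runs in its own `sh` process so that `set -e` behaves as it would at the top of a script.

```shell
# Sketch: keep a cleanup step from aborting a 'set -e' script
# without resorting to 'rm -f'.
scratch=$(mktemp -d)

# No file matches the pattern, so plain 'rm' fails; '|| true' absorbs
# the failure and the script carries on to the next command.
with_guard=$(cd "$scratch" && sh -c '
  set -e
  rm data/*/train/spk2gender 2>/dev/null || true
  echo continued
')

# Same command without the guard: 'set -e' stops the script at 'rm',
# so nothing is printed. The outer '|| true' only protects this demo.
without_guard=$(cd "$scratch" && sh -c '
  set -e
  rm data/*/train/spk2gender 2>/dev/null
  echo continued
' || true)

echo "with guard: '$with_guard'; without guard: '$without_guard'"
```

Unlike `-f`, this form leaves `rm`'s handling of write-protected files unchanged and only suppresses the effect of a failed removal on `set -e`.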
egs/multi_en/s5/run.sh
Outdated
| cp data/local/dict_combined/{extra_questions,nonsilence_phones,silence_phones,optional_silence}.txt $dict_dir | ||
| local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt | ||
| local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus \ | ||
| data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt || touch $dict_dir/.error |
This does not seem to be running in the background. So better to use || exit
| data=data/$c/train | ||
| steps/make_mfcc.sh --mfcc-config conf/mfcc.conf \ | ||
| --cmd "$train_cmd" --nj 40 \ | ||
| $data exp/make_mfcc/$c/train || touch $data/.error |
Are you checking whether there is an error in a later stage? Also, you should first remove the .error files with rm -f.
hmm I actually changed it back to || exit 1; since I think we should quit if the feature extraction fails
|| exit 1 will have no effect inside a subshell running in background.
fixed
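The pattern this thread converges on (marker files written by background subshells, stale markers cleared first, one check in the parent after `wait`) might look roughly like this. It is a sketch, not the merged code: `fake_extract`, the corpus names `good` and `bad`, and the temporary directory are hypothetical stand-ins for `steps/make_mfcc.sh` and the per-corpus data directories.

```shell
# Sketch: collect failures from background subshells via marker files.
work=$(mktemp -d)

fake_extract() {
  # Pretend feature extraction: fails for the corpus named "bad".
  [ "$1" != "bad" ]
}

# Clear stale markers first, so an old failure is not reported again.
rm -f "$work"/*/.error

for c in good bad; do
  mkdir -p "$work/$c"
  (
    # '|| exit 1' here would only end this subshell, not the script,
    # so record the failure in a file the parent can inspect.
    fake_extract "$c" || touch "$work/$c/.error"
  ) &
done
wait  # block until every background subshell has finished

failed=""
for c in good bad; do
  if [ -f "$work/$c/.error" ]; then
    failed="$failed $c"
  fi
done

if [ -n "$failed" ]; then
  echo "feature extraction failed for:$failed"
  # A real script would 'exit 1' here, in the parent shell.
fi
```

The decision to stop is made once, in the parent, which is the only place where `exit 1` actually ends the run.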
egs/multi_en/s5/run.sh
Outdated
| local/wsj_format_data.sh | ||
| utils/copy_data_dir.sh --spk_prefix wsj_ --utt_prefix wsj_ data/wsj/train_si284 data/wsj/train | ||
| rm -rf data/wsj/train_si284 | ||
| rm -r data/wsj/train_si284 2>/dev/null |
Without -f the script will exit on error, since you used set -e.
7b5d0e8 to
0d881e8
Compare
danpovey
left a comment
couple comments..
| multi_a tri5a tg_sp_eval2000.si || %WER 30.5 | 4459 42989 | 73.7 20.3 6.0 4.2 30.5 67.8 | exp/multi_a/tri5a/decode_tg_sp_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys | ||
| multi_a tri5b tg_eval2000 || %WER 24.3 | 4459 42989 | 79.3 15.7 5.0 3.6 24.3 63.5 | exp/multi_a/tri5b/decode_tg_eval2000/score_11_0.0/eval2000.ctm.filt.sys | ||
| multi_a tri5b tg_eval2000.si || %WER 30.7 | 4459 42989 | 73.6 20.4 6.0 4.3 30.7 68.1 | exp/multi_a/tri5b/decode_tg_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys | ||
| # On eval2000 the final GMM results is 24.5, which is better than the above result (24.9). |
I'm pretty sure this comment doesn't belong here.
Also the results seem slightly worse than before. Is there an explanation for this? Is it the case that for other test sets it was better?
What exactly is better about this new setup than how it was before?
I fixed some bugs in acronym normalization in the swbd training data prep (previously these acronyms were wrongly normalized as OOVs), and I fixed the way we apply G2P to OOVs in training data, i.e. we no longer apply G2P to words containing special characters. But these changes are too small to affect WERs. The major change that may affect WER is that we now have 72h more training data (hub4 97). Since there is a mismatch between hub4's bn style and eval2000's style, there could be some degradation at the GMM level (actually, looking at tri5a.si and tri5b.si, it's slightly better than before). I think this could be compensated for at the chain model level. BTW I intentionally put that comment there...
…pronunciation generation, acoustic data sub-sampling,.etc; Added hub4_97 data
f245faa to
1d4959a
Compare
ready to merge I think
…, OOV … (kaldi-asr#2137) * multi_en: Fixed acronym normalization, swbd lexicon preparation, OOV pronunciation generation, acoustic data sub-sampling,.etc; Added hub4_97 data
…pronunciation generation, acoustic data sub-sampling,.etc; Added hub4_97 data. @vimalmanohar can you please review it? Thanks!