[egs, script] Zeroth-Korean: Korean open-source corpus and its script #2296
Conversation
wonkyuml commented Mar 20, 2018
- It uses 51 hours of Korean audio data and a pre-trained LM (built using an external text corpus).
- It contains the recent TDNN training recipe.
danpovey left a comment
This feels like you haven't done much quality control.
egs/zeroth_korean/s5/conf/queue.conf (Outdated)
@@ -0,0 +1,10 @@
# Default configuration
Don't include this; this is something specific to your queue.
@@ -0,0 +1,299 @@
#!/bin/bash
names should start from 1a.
@@ -0,0 +1,302 @@
#!/bin/bash

set -e -o pipefail
Again: names should be consecutive starting from 1a. And you should include results in a comment at the top of the script, preferably the output of some script like local/chain/compare_wer.sh.
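For example, the header of a tuning script could look like this (a sketch; the field names are illustrative and the numbers are placeholders, not measured results):

```bash
# Results (to go at the top of local/chain/tuning/run_tdnn_1a.sh):
# local/chain/compare_wer.sh exp/chain/tdnn_1a_sp
# System                tdnn_1a_sp
# WER (test_clean)      xx.x
# Final train prob      -0.0xx
# Final valid prob      -0.0xx
```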
egs/zeroth_korean/s5/run.sh (Outdated)
# Check list before start
# 1. locale setup
# 2. pre-installed package: awscli, Morfessor-2.0.1, flac, sox, same cuda library, unzip
# 3. pre-install or symbolic link for easy going: rirs_noises.zip (takes pretty long time)
Again: very vague.
egs/zeroth_korean/s5/run.sh (Outdated)
# 1. locale setup
# 2. pre-installed package: awscli, Morfessor-2.0.1, flac, sox, same cuda library, unzip
# 3. pre-install or symbolic link for easy going: rirs_noises.zip (takes pretty long time)
# 4. parameters: nCPU, num_jobs_initial, num_jobs_final, --max-noises-per-minute
not sure what this means.
egs/zeroth_korean/s5/run.sh (Outdated)
# it takes long time and do this again after computing silence prob.
# you can do comment out here this time

#utils/build_const_arpa_lm.sh data/local/lm/zeroth.lm.tg.arpa.gz \
If this is part of the script, it shouldn't be commented out. You can add a --stage option in run.sh to enable resuming after a partial run.
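A minimal sketch of the usual stage-gating pattern (the stage number and the data/lang_test_tglarge output directory are illustrative, not taken from this PR):

```bash
stage=0
. utils/parse_options.sh   # lets you resume with e.g. ./run.sh --stage 8

if [ $stage -le 8 ]; then
  # build the const-arpa LM only when this stage is requested
  utils/build_const_arpa_lm.sh data/local/lm/zeroth.lm.tg.arpa.gz \
    data/lang data/lang_test_tglarge
fi
```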
egs/zeroth_korean/s5/run.sh (Outdated)
data/$test exp/tri4b/decode_{tgsmall,fglarge}_$test
done

# align train_clean_100 using the tri4b model
comment is outdated
# --splice-indexes "layer0/-1:0:1 layer1/-2:1 layer2/-4:2" \
# --num-hidden-layers 3 \
# --splice-indexes "layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3" \
steps/nnet2/train_multisplice_accel2.sh --stage $train_stage \
I don't know why you'd include old nnet2 recipes in this PR. The nnet3 recipes must be substantially better by now. In any case you should be including results.
data/lang exp/nnet2_online/extractor "$dir" ${dir}_online || exit 1;
fi

#if [ $stage -le 11 ]; then
There is a lot of commented stuff here. It feels like you have just added whatever was lying around in a directory to a PR, without really checking it.
fi
cp $trans $trans".old"
awk '{print $1}' $trans".old" > $trans"_tmp_index"
cut -d' ' -f2- $trans".old" |\
Before using external scripts like Morfessor, you should add a script somewhere like tools/extras/ that can automatically install it, and make it so your script can use it from that location.
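A minimal sketch of what such an installer could look like (the script name, pinned version, and install location are assumptions, not necessarily what was eventually committed):

```bash
#!/bin/bash
# tools/extras/install_morfessor.sh (hypothetical): install Morfessor into a
# virtualenv under tools/ so that recipes can use it from a known location.
set -e
cd "$(dirname "$0")/.."             # move to tools/
if [ ! -d venv_morfessor ]; then
  python3 -m venv venv_morfessor
  ./venv_morfessor/bin/pip install "Morfessor==2.0.1"
fi
echo "$0: Morfessor is installed under tools/venv_morfessor/bin"
```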
I added a new pull request for the Morfessor installation script, Dan. Please check PR #2299.
Is this ready to merge, as far as you know?
- simplified the script - delete unnecessary scripts and comments
Let me know when you think this is ready to review again.
Ok. It's now ready to review.
egs/zeroth_korean/s5/local/chain/tuning/run_tdnn_1a.sh
linear-component name=tdnn5l dim=256 $linear_opts
relu-batchnorm-layer name=tdnn5 $opts dim=1280 input=Append(tdnn5l, tdnn3l)
linear-component name=tdnn6l dim=256 $linear_opts input=Append(-3,0)
relu-batchnorm-layer name=tdnn6 $opts input=Append(0,3) dim=1280
This model is probably too large, given that you only have 51 hours of data.
Can you please try to base your recipe on the current TDNN recipe from WSJ?
I changed the recipe to be based on the current WSJ TDNN-F recipe. It uses far fewer parameters, as you suggested, and the results get better, which is good.
Sure, I will do it.
Thanks,
Wonkyum Lee
danpovey left a comment
Some more comments.
@@ -0,0 +1,277 @@
#!/bin/bash
These recipes should have results at the top in a comment. Please create a script like compare_wer.sh that prints out the results and diagnostics; you can copy and modify it from another setup.
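A minimal sketch of such a script, following the pattern used in other egs (the decode directory name below is an assumption about this setup; utils/best_wer.sh is the standard Kaldi helper):

```bash
#!/bin/bash
# local/chain/compare_wer.sh: print a WER summary for one or more experiment
# directories, e.g.:
#   local/chain/compare_wer.sh exp/chain/tdnn_1a_sp exp/chain/tdnn_opgru_1a_sp
echo -n "# System            "
for dir in "$@"; do printf "%18s" "$(basename "$dir")"; done; echo
echo -n "# WER (test_clean)  "
for dir in "$@"; do
  # pick the best WER over all LM-weight/penalty combinations
  wer=$(grep WER "$dir"/decode_test_clean_tgsmall/wer_* 2>/dev/null \
        | utils/best_wer.sh | awk '{print $2}')
  printf "%18s" "$wer"
done
echo
```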
Both the tdnn and tdnn_opgru recipes have results and diagnostics now. Thanks for the suggestion. Also, I created compare_wer.sh for this.
@@ -0,0 +1,63 @@
#!/bin/bash
Please add a file s5/README.txt that explains something about the data: what type of data, how much of it, how you can obtain it, what the license is, things like that.
I added README.txt
chunk_width=150,110,100

# training options
num_jobs_initial=3
Apart from remove_egs, I prefer that you put the rest of these (num-jobs, num-epochs, minibatch-size, learning rates) directly as args to the script; there's no need for variables. The same goes for chunk_width and max_param_change and xent_regularize (xent_regularize should probably be zero).
Yup, I took those parameters out of variables. xent_regularize was 0.1 in the script I referred to. Do you still think zero makes sense for this?
Oh sorry, I was mixing it up. xent_regularize should be 0.1. It's chain-regularize that we are now making zero (in setups with l2).
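A minimal sketch of what passing these values directly looks like (the numbers are illustrative, drawn from this thread's discussion rather than from the final committed script):

```bash
steps/nnet3/chain/train.py --stage $train_stage \
  --cmd "$decode_cmd" \
  --feat.online-ivector-dir $train_ivector_dir \
  --chain.xent-regularize 0.1 \
  --trainer.dropout-schedule "0,0@0.20,0.5@0.50,0" \
  --trainer.frames-per-iter 2000000 \
  --trainer.num-epochs 10 \
  --trainer.optimization.num-jobs-initial 3 \
  --trainer.optimization.num-jobs-final 8 \
  --trainer.optimization.initial-effective-lrate 0.001 \
  --trainer.optimization.final-effective-lrate 0.0001 \
  --egs.chunk-width 150,110,100 \
  --feat-dir $train_data_dir \
  --tree-dir $tree_dir \
  --lat-dir $lat_dir \
  --dir $dir
```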
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}')
learning_rate_factor=$(echo "print 0.5/$xent_regularize" | python)
opts="l2-regularize=0.002"
linear_opts="orthonormal-constraint=1.0"
when you adjust the recipe, be sure to take all of these variables from where you are copying the recipe from. And don't forget the dropout schedule and the corresponding option to train.py.
OK. I made sure to copy all of the variables from the baseline recipe. The TDNN now has a dropout schedule.
--chain.lm-opts="--num-extra-lm-states=2000" \
--trainer.max-param-change $max_param_change \
--trainer.num-epochs $num_epochs \
--trainer.frames-per-iter 1500000 \
I suggest 2 million for the frames-per-iter; the 5 million in the WSJ example is probably too large.
Yes. I based it on the TDNN-F recipe and changed frames-per-iter to 2 million.
mkdir exp -p exp/nnet3

steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
Modern scripts use PCA instead of LDA+MLLT. It's a substantial simplification and doesn't affect the results. Perhaps you copied this script from somewhere not fully up-to-date, like tedlium s5_r2.
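For reference, the up-to-date ivector-common scripts compute the transform roughly like this (a sketch based on other egs; the data and exp directory names are assumptions about this setup):

```bash
# Compute a PCA transform on the hires features (instead of LDA+MLLT)
# to train the diagonal UBM and ivector extractor on top of.
steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" \
  --max-utts 10000 --subsample 2 \
  data/train_sp_hires exp/nnet3/pca_transform
```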
changed it
awk '{for (i=2; i<=NF; ++i) { print $i; gsub(/[0-9]/, "", $i); print $i}}' $lexicon_raw_nosil |\
sort -u |\
perl -e 'while(<>){
chop; m:^([^\d]+)(\d*)$: || die "Bad phone $_";
Please indent these a bit; they read like bash commands if they start at the very left.
indentation done
@@ -0,0 +1,36 @@
#!/bin/bash
please call this update_segmentation.sh, for consistency in naming.
changed it
egs/zeroth_korean/s5/path.sh (Outdated)
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=ko_KR.UTF-8
This is a problem. LC_ALL should always be set to C. Most people may not even have the Korean locale installed.
Why is this even needed?
If you have scripts that depend on the locale, you should change them so they don't. With python3, it's possible to rework the I/O so that it treats input streams using a certain encoder/decoder (such as "utf-8").
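A minimal sketch of that pattern (the file paths and the normalization itself are illustrative):

```bash
export LC_ALL=C   # keep the shell locale fixed; do the decoding in python3

# Hypothetical filter: reads UTF-8 Korean text correctly regardless of the
# system locale by re-wrapping the standard streams.
python3 -c '
import io, sys
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
for line in sys.stdin:
    print(" ".join(line.split()))  # trivial normalization: squeeze whitespace
' < data/local/lm/corpus.txt > data/local/lm/corpus_norm.txt
```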
OK. @jty016 will think about this.
I removed this dependency by using Morfessor's -e (encoding) option.
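For example (hypothetical model and corpus paths), forcing UTF-8 regardless of the system locale:

```bash
morfessor-train -e utf-8 -s data/local/lm/morfessor.model data/local/lm/corpus.txt
```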
egs/zeroth_korean/s5/run.sh (Outdated)
if [ $stage -le 11 ]; then

  echo "#### SAT again on train_clean ###########"
all echo statements should start with "$0:" so it's clear which script produced the output.
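For example, the echo line quoted above would become:

```bash
echo "$0: ### SAT again on train_clean ###"
```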
now all echo statements start with "$0"
It still needs more work. I will ping you when it's fully ready to review again. Thanks for the review.
@wonkyuml Sorry for the late update. Please check if it is okay now.
Thanks! I will review this week.
@danpovey could you review this again?
OK, will try to do it tomorrow.
danpovey left a comment
Thanks. It looks very good generally; there are some very small issues. The better-tuned TDNN-F experiments are not necessary for me to merge it, though; you can do them later if you have time.
done
exit 0

# tdnn_1a is a kind of factorized TDNN, with skip connections.
These systems seem to be underfitting; I don't see any train/valid difference. Therefore you might want to choose a larger config for this TDNN-F system. (The one with OPGRU is already quite large, but you could try more epochs.)
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=1024
tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1
If I were you I would try a slightly larger system here... e.g. increase 1024 -> 1280, 192->256 and 128->160. It's not necessary for it to be merged though; you can do it later.
OK, I am training a larger system. If the larger system is better than this one, I will update the result and recipe.
As you suggested, increasing the parameter count helped! I am trying to increase it a bit more.
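A minimal sketch of the enlarged layers inside the xconfig heredoc (only the first few layers are shown; the opts strings are assumptions copied from typical TDNN-F recipes, not necessarily this PR's final settings):

```bash
tdnn_opts="l2-regularize=0.002 dropout-proportion=0.0 dropout-per-dim=true dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=0.002 dropout-proportion=0.0 bypass-scale=0.66"
cat <<EOF >> $dir/configs/network.xconfig
  relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=1280
  tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1280 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf3 $tdnnf_opts dim=1280 bottleneck-dim=160 time-stride=1
EOF
```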
exit 1
fi

spk_file=$src/../AUDIO_INFO
I don't much like how you are going one level up from "$src". If one level up is the real source, you should give that directory. But make sure that entire directory is downloaded from one place. If it's multiple downloads, then make them two separate arguments to this script.
The whole dataset is downloaded from one place. The script was trying to refer to metadata while preparing the data directory. Anyway, I fixed it.
test=${src_dir}_test_${lm_suffix}
mkdir -p $test
cp -r ${src_dir}/* $test
gunzip -c $lm_dir/zeroth.lm.${lm_suffix}.arpa.gz | \
I wonder if you pruned one of these a bit too aggressively; what is the size of the graph? The pre/post-rescore WERs here differ more than I would normally expect.
HCLG.fst is not that small: 676M for the chain graph and 1.8G for the triphone GMM system. The rescoring LM is even bigger, though. Actually, the vocabulary is pretty huge (500k words), since Korean is an agglutinative language.
steps/compute_cmvn_stats.sh data/${datadir}_hires || exit 1;
done

# We need to build a small system just because we need the LDA+MLLT transform
These comments are out of date; we are using PCA, not LDA+MLLT.
yes. I corrected it.
@hhadian... since Korean is an agglutinative language (and I believe it's phonetically spelled), I wonder whether a BPE-based recipe might do much better on this. I don't recall right now the steps required to do this. I wonder if you could advise what steps might be required to test this out? Have we done this on ASR recipes yet, or only on OCR?
also @DongjiGao or @hainan-xv might have some experience with this.
Yes, it's phonetically spelled, although the pronunciation changes slightly depending on the preceding character. Any advice would be appreciated!
No, we have not yet tried it on ASR.
Thanks @hhadian, I will try BPE later and will create a separate PR if it works.
TDNN-F looks good as of now. @danpovey If you're OK with it, it's ready to be merged.
thanks!