[egs, script] Zeroth-Korean: Korean open-source corpus and its script #2296
Conversation
wonkyuml commented Mar 20, 2018
- It uses 51 hours of Korean audio data and a pre-trained LM (built using an external text corpus).
- It contains the recent TDNN training recipe.
danpovey left a comment
This feels like you haven't done much quality control.
egs/zeroth_korean/s5/conf/queue.conf (Outdated)
@@ -0,0 +1,10 @@
# Default configuration
Don't include this; this is something specific to your queue.
@@ -0,0 +1,299 @@
#!/bin/bash
names should start from 1a.
@@ -0,0 +1,302 @@
#!/bin/bash

set -e -o pipefail
Again: names should be consecutive starting from 1a. And you should include results in a comment at the top of the script, preferably the output of some script like local/chain/compare_wer.sh.
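For example, the header of a tuning script could look like this (a sketch; the field names are illustrative and the numbers are placeholders, not measured results):

```bash
# Results (to go at the top of local/chain/tuning/run_tdnn_1a.sh):
# local/chain/compare_wer.sh exp/chain/tdnn_1a_sp
# System                tdnn_1a_sp
# WER (test_clean)      xx.x
# Final train prob      -0.0xx
# Final valid prob      -0.0xx
```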
egs/zeroth_korean/s5/run.sh (Outdated)
# Check list before start
# 1. locale setup
# 2. pre-installed package: awscli, Morfessor-2.0.1, flac, sox, same cuda library, unzip
# 3. pre-install or symbolic link for easy going: rirs_noises.zip (takes pretty long time)
Again: very vague.
egs/zeroth_korean/s5/run.sh (Outdated)
# 1. locale setup
# 2. pre-installed package: awscli, Morfessor-2.0.1, flac, sox, same cuda library, unzip
# 3. pre-install or symbolic link for easy going: rirs_noises.zip (takes pretty long time)
# 4. parameters: nCPU, num_jobs_initial, num_jobs_final, --max-noises-per-minute
not sure what this means.
egs/zeroth_korean/s5/run.sh (Outdated)
# it takes long time and do this again after computing silence prob.
# you can do comment out here this time

#utils/build_const_arpa_lm.sh data/local/lm/zeroth.lm.tg.arpa.gz \
If this is part of the script, it shouldn't be commented out. You can add a --stage option in run.sh to enable resuming after a partial run.
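A minimal sketch of the usual stage-gating pattern (the stage number and the data/lang_test_tglarge output directory are illustrative, not taken from this PR):

```bash
stage=0
. utils/parse_options.sh   # lets you resume with e.g. ./run.sh --stage 8

if [ $stage -le 8 ]; then
  # build the const-arpa LM only when this stage is requested
  utils/build_const_arpa_lm.sh data/local/lm/zeroth.lm.tg.arpa.gz \
    data/lang data/lang_test_tglarge
fi
```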
egs/zeroth_korean/s5/run.sh (Outdated)
data/$test exp/tri4b/decode_{tgsmall,fglarge}_$test
done

# align train_clean_100 using the tri4b model
comment is outdated
# --splice-indexes "layer0/-1:0:1 layer1/-2:1 layer2/-4:2" \
# --num-hidden-layers 3 \
# --splice-indexes "layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3" \
steps/nnet2/train_multisplice_accel2.sh --stage $train_stage \
I don't know why you'd include old nnet2 recipes in this PR. The nnet3 recipes must be substantially better by now. In any case you should be including results.
data/lang exp/nnet2_online/extractor "$dir" ${dir}_online || exit 1;
fi

#if [ $stage -le 11 ]; then
There is a lot of commented stuff here. It feels like you have just added whatever was lying around in a directory to a PR, without really checking it.
fi
cp $trans $trans".old"
awk '{print $1}' $trans".old" > $trans"_tmp_index"
cut -d' ' -f2- $trans".old" |\
Before using external scripts like Morfessor, you should add a script somewhere like tools/extras/ that can automatically install it, and make it so your script can use it from that location.
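A minimal sketch of what such an installer could look like (the script name, pinned version, and install location are assumptions, not necessarily what was eventually committed):

```bash
#!/bin/bash
# tools/extras/install_morfessor.sh (hypothetical): install Morfessor into a
# virtualenv under tools/ so that recipes can use it from a known location.
set -e
cd "$(dirname "$0")/.."             # move to tools/
if [ ! -d venv_morfessor ]; then
  python3 -m venv venv_morfessor
  ./venv_morfessor/bin/pip install "Morfessor==2.0.1"
fi
echo "$0: Morfessor is installed under tools/venv_morfessor/bin"
```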
I added a new pull request for the Morfessor installation script, Dan. Please check PR #2299.
Is this ready to merge, as far as you know?
- simplified the script - delete unnecessary scripts and comments
Let me know when you think this is ready to review again.
Ok. It's now ready to review.
egs/zeroth_korean/s5/local/chain/tuning/run_tdnn_1a.sh
linear-component name=tdnn5l dim=256 $linear_opts
relu-batchnorm-layer name=tdnn5 $opts dim=1280 input=Append(tdnn5l, tdnn3l)
linear-component name=tdnn6l dim=256 $linear_opts input=Append(-3,0)
relu-batchnorm-layer name=tdnn6 $opts input=Append(0,3) dim=1280
This model is probably too large, given that you only have 51 hours of data.
Can you please try to base your recipe on the current TDNN recipe from WSJ?
I changed the recipe to be based on the current WSJ TDNN-F recipe. It uses far fewer parameters, as you suggested, and the results get better, which is good.
Sure, I will do it.
Thanks,
Wonkyum Lee
danpovey left a comment
Some more comments.
@@ -0,0 +1,277 @@
#!/bin/bash
These recipes should have results at the top in a comment. Please create a script like compare_wer.sh that prints out the results and diagnostics; you can copy and modify it from another setup.
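A minimal sketch of such a script, following the pattern used in other egs (the decode directory name below is an assumption about this setup; utils/best_wer.sh is the standard Kaldi helper):

```bash
#!/bin/bash
# local/chain/compare_wer.sh: print a WER summary for one or more experiment
# directories, e.g.:
#   local/chain/compare_wer.sh exp/chain/tdnn_1a_sp exp/chain/tdnn_opgru_1a_sp
echo -n "# System            "
for dir in "$@"; do printf "%18s" "$(basename "$dir")"; done; echo
echo -n "# WER (test_clean)  "
for dir in "$@"; do
  # pick the best WER over all LM-weight/penalty combinations
  wer=$(grep WER "$dir"/decode_test_clean_tgsmall/wer_* 2>/dev/null \
        | utils/best_wer.sh | awk '{print $2}')
  printf "%18s" "$wer"
done
echo
```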
Both the tdnn and tdnn_opgru recipes have results and diagnostics now. Thanks for the suggestion. Also, I created compare_wer.sh for this.
@@ -0,0 +1,63 @@
#!/bin/bash
Please add a file s5/README.txt that explains something about the data: what type of data, how much of it, how you can obtain it, what the license is, things like that.
I added README.txt
chunk_width=150,110,100

# training options
num_jobs_initial=3
Apart from remove_egs, I prefer that you put the rest of these (num-jobs, num-epochs, minibatch-size, learning rates) directly as args to the script; there's no need for variables. The same goes for chunk_width and max_param_change and xent_regularize (xent_regularize should probably be zero).
Yup, I took those parameters out of variables. xent_regularize was 0.1 in the script I referred to. Do you still think zero makes sense for this?
Oh sorry, I was mixing it up. xent_regularize should be 0.1. It's chain-regularize that we are now making zero (in setups with l2).
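A minimal sketch of what passing these values directly looks like (the numbers are illustrative, drawn from this thread's discussion rather than from the final committed script):

```bash
steps/nnet3/chain/train.py --stage $train_stage \
  --cmd "$decode_cmd" \
  --feat.online-ivector-dir $train_ivector_dir \
  --chain.xent-regularize 0.1 \
  --trainer.dropout-schedule "0,0@0.20,0.5@0.50,0" \
  --trainer.frames-per-iter 2000000 \
  --trainer.num-epochs 10 \
  --trainer.optimization.num-jobs-initial 3 \
  --trainer.optimization.num-jobs-final 8 \
  --trainer.optimization.initial-effective-lrate 0.001 \
  --trainer.optimization.final-effective-lrate 0.0001 \
  --egs.chunk-width 150,110,100 \
  --feat-dir $train_data_dir \
  --tree-dir $tree_dir \
  --lat-dir $lat_dir \
  --dir $dir
```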
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}')
learning_rate_factor=$(echo "print 0.5/$xent_regularize" | python)
opts="l2-regularize=0.002"
linear_opts="orthonormal-constraint=1.0"
when you adjust the recipe, be sure to take all of these variables from where you are copying the recipe from. And don't forget the dropout schedule and the corresponding option to train.py.
OK. I made sure to copy all of the variables from the baseline recipe. The TDNN now has a dropout schedule.
--chain.lm-opts="--num-extra-lm-states=2000" \
--trainer.max-param-change $max_param_change \
--trainer.num-epochs $num_epochs \
--trainer.frames-per-iter 1500000 \
I suggest 2 million for the frames-per-iter; the 5 million in the WSJ example is probably too large.
Yes. I based it on the TDNN-F recipe and changed frames-per-iter to 2 million.
mkdir exp -p exp/nnet3

steps/train_lda_mllt.sh --cmd "$train_cmd" --num-iters 13 \
Modern scripts use PCA instead of LDA+MLLT. It's a substantial simplification and doesn't affect the results. Perhaps you copied this script from somewhere not fully up-to-date, like tedlium s5_r2.
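For reference, the up-to-date ivector-common scripts compute the transform roughly like this (a sketch based on other egs; the data and exp directory names are assumptions about this setup):

```bash
# Compute a PCA transform on the hires features (instead of LDA+MLLT)
# to train the diagonal UBM and ivector extractor on top of.
steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" \
  --max-utts 10000 --subsample 2 \
  data/train_sp_hires exp/nnet3/pca_transform
```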
changed it
awk '{for (i=2; i<=NF; ++i) { print $i; gsub(/[0-9]/, "", $i); print $i}}' $lexicon_raw_nosil |\
sort -u |\
perl -e 'while(<>){
chop; m:^([^\d]+)(\d*)$: || die "Bad phone $_";
Please indent these a bit; they read like bash commands if they start at the very left.
indentation done
@@ -0,0 +1,36 @@
#!/bin/bash
please call this update_segmentation.sh, for consistency in naming.
changed it
egs/zeroth_korean/s5/path.sh (Outdated)
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=ko_KR.UTF-8
This is a problem. LC_ALL should always be set to C. Most people may not even have the Korean locale installed.
Why is this even needed?
If you have scripts that depend on the locale, you should change them so they don't. With python3, it's possible to rework the I/O so that it treats input streams using a certain encoder/decoder (such as "utf-8").
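A minimal sketch of that pattern (the file paths and the normalization itself are illustrative):

```bash
export LC_ALL=C   # keep the shell locale fixed; do the decoding in python3

# Hypothetical filter: reads UTF-8 Korean text correctly regardless of the
# system locale by re-wrapping the standard streams.
python3 -c '
import io, sys
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
for line in sys.stdin:
    print(" ".join(line.split()))  # trivial normalization: squeeze whitespace
' < data/local/lm/corpus.txt > data/local/lm/corpus_norm.txt
```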
OK. @jty016 will think about this.
I removed this dependency by using Morfessor's -e (encoding) option.
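For example (hypothetical model and corpus paths), forcing UTF-8 regardless of the system locale:

```bash
morfessor-train -e utf-8 -s data/local/lm/morfessor.model data/local/lm/corpus.txt
```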
egs/zeroth_korean/s5/run.sh (Outdated)
if [ $stage -le 11 ]; then

  echo "#### SAT again on train_clean ###########"
all echo statements should start with "$0:" so it's clear which script produced the output.
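For example, the echo line quoted above would become:

```bash
echo "$0: ### SAT again on train_clean ###"
```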
now all echo statements start with "$0"
It still needs more work. I will ping you when it's fully ready to review again. Thanks for the review.
@wonkyuml Sorry for the late update. Please check if it is okay now.
Thanks! I will review this week.
@danpovey could you review this again?
OK, will try to do it tomorrow.
danpovey left a comment
Thanks. It looks very good generally; there are some very small issues. The better-tuned TDNN-F experiments are not necessary for me to merge it, though; you can do them later if you have time.
done
exit 0

# tdnn_1a is a kind of factorized TDNN, with skip connections.
These systems seem to be underfitting; I don't see any train/valid difference. Therefore you might want to choose a larger config for this TDNN-F system. (The one with OPGRU is already quite large, but you could try more epochs.)
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=1024
tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1
If I were you I would try a slightly larger system here... e.g. increase 1024 -> 1280, 192->256 and 128->160. It's not necessary for it to be merged though; you can do it later.
OK, I am training a larger system. If the larger system is better than this one, I will update the result and recipe.
As you suggested, increasing the parameter count helped! I am trying to increase it a bit more.
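A minimal sketch of the enlarged layers inside the xconfig heredoc (only the first few layers are shown; the opts strings are assumptions copied from typical TDNN-F recipes, not necessarily this PR's final settings):

```bash
tdnn_opts="l2-regularize=0.002 dropout-proportion=0.0 dropout-per-dim=true dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=0.002 dropout-proportion=0.0 bypass-scale=0.66"
cat <<EOF >> $dir/configs/network.xconfig
  relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=1280
  tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1280 bottleneck-dim=160 time-stride=1
  tdnnf-layer name=tdnnf3 $tdnnf_opts dim=1280 bottleneck-dim=160 time-stride=1
EOF
```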
exit 1
fi

spk_file=$src/../AUDIO_INFO
I don't much like how you are going one level up from "$src". If one level up is the real source, you should give that directory. But make sure that entire directory is downloaded from one place. If it's multiple downloads, then make them two separate arguments to this script.
The whole dataset is downloaded from one place. The script was trying to refer to metadata while preparing the data directory. Anyway, I fixed it.
test=${src_dir}_test_${lm_suffix}
mkdir -p $test
cp -r ${src_dir}/* $test
gunzip -c $lm_dir/zeroth.lm.${lm_suffix}.arpa.gz | \
I wonder if you pruned one of these a bit too aggressively; what is the size of the graph? The pre/post-rescore WERs here differ more than I would normally expect.
HCLG.fst is not that small: 676M for the chain graph and 1.8G for the triphone GMM system. The rescoring LM is even bigger, though. Actually, the vocabulary is pretty huge (500k words), since Korean is an agglutinative language.
steps/compute_cmvn_stats.sh data/${datadir}_hires || exit 1;
done

# We need to build a small system just because we need the LDA+MLLT transform
These comments are out of date; we are using PCA, not LDA+MLLT.
yes. I corrected it.
@hhadian... since Korean is an agglutinative language (and I believe it's phonetically spelled), I wonder whether a BPE-based recipe might do much better on this. I don't recall right now the steps required to do this. I wonder if you could advise what steps might be required to test this out? Have we done this on ASR recipes yet, or only on OCR?
also @DongjiGao or @hainan-xv might have some experience with this.
Yes, it's phonetically spelled, although the pronunciation changes slightly depending on the preceding character. Any advice would be appreciated!
No, we have not yet tried it on ASR.
Thanks @hhadian, I will try BPE later and will create a separate PR if it works.
TDNN-F looks good as of now. @danpovey If you're OK with it, it's ready to be merged.
thanks!