
Conversation

@gorinars (Contributor)

Some small modifications to reduce user confusion when running wsj/s5/steps/dict/learn_lexicon.sh

It seems the script will not work without 'oov_symbol', so it might make sense to make it a required parameter. The opposite holds for dir=$7, which could instead default to somewhere under $dest_dict/tmp while still allowing the user to change the location.

I do not know whether it is better to merge this, or to revise @xiaohui-zhang's last commit 03e6b92 a little more.

@danpovey (Contributor)

@xiaohui-zhang, what do you think?
Regarding the last arg, it could be made optional, you'd put it like
[ <tmp-dir> ]
in the usage message and change the condition to
if [ $# -lt 6 ] || [ $# -gt 7 ]; then

@xiaohui-zhang (Contributor)

I agree with @danpovey. If the user wants to keep the intermediate outputs for different parameter settings, they should set the tmp-dir explicitly. Otherwise it could default to something like ${src_mdl_dir}_lex_learn_work (following the data-cleanup convention). @gorinars, can you please add this change? Then I think we can merge.
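
(A minimal sketch of how the optional last argument could be handled, combining the arg-count check suggested above with the proposed default; the variable and argument names are illustrative, not necessarily those used in learn_lexicon.sh:)

    if [ $# -lt 6 ] || [ $# -gt 7 ]; then
      echo "Usage: $0 [options] <arg1> ... <dest-dict> [ <tmp-dir> ]"
      exit 1
    fi
    # If no 7th arg is given, default the work dir to sit next to the
    # source model dir, following the data-cleanup convention.
    dir=${7:-${src_mdl_dir}_lex_learn_work}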

@gorinars (Contributor, Author)

@xiaohui-zhang done

@danpovey (Contributor)

OK, I think there are still some issues (not with your changes; with the original script).

Firstly, I like scripts to have an "e.g.:" line in the usage message giving an example command line. You can loosely base this on the example script that Samuel checked in, and use it as an opportunity to show that the 'oov_symbol' option is probably required.

Also, when 'oov_symbol' is passed to prepare_lang.sh, it needs to be put in double quotes. Otherwise the angle brackets would crash the script.
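
(For illustration, the quoting fix might look like this; the positional-argument order follows the standard utils/prepare_lang.sh interface:)

    # Unquoted, an oov_symbol such as <unk> is parsed by the shell as
    # I/O redirection and the command breaks:
    #   utils/prepare_lang.sh $dict_dir $oov_symbol $tmp_dir $lang_dir
    # Double-quoting passes the angle brackets through literally:
    utils/prepare_lang.sh $dict_dir "$oov_symbol" $tmp_dir $lang_dir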

I think rather than having a nonempty default for oov_symbol, it would be better for the script to check that a value is passed in. I know this is against the spirit of options, but it's backward compatible with the existing example. E.g.:

if [ -z "$oov_symbol" ]; then
   echo "$0: the --oov-symbol option is required."
   exit 1
fi

@danpovey (Contributor)

@gorinars, do you have time to make the requested changes? Or should I ask @xiaohui-zhang?

@gorinars (Contributor, Author)

@danpovey, yes, I have. The problem is that there seem to be other issues with this script, which I found when testing it further.

I wanted to take a couple of days for further testing and then wrap up all the required modifications, but I can include your last suggestion now and open another PR once I finish debugging. Would you prefer it done that way?

@danpovey (Contributor) commented Jan 24, 2017 via email

@gorinars (Contributor, Author)

OK, thanks. I will try to complete it this week.

@gorinars (Contributor, Author)

@danpovey
I made the modifications we discussed. I also added two lines to infer the number of states and Gaussian densities from the initial model. Please let me know if more modifications are required (or if I need to rebase).
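
(For reference, inferring those values from an existing model is typically done with Kaldi's gmm-info; a sketch, assuming the initial model lives at $src_mdl_dir/final.mdl:)

    # gmm-info prints lines like "number of pdfs N" and
    # "number of gaussians N"; take the last field of each.
    num_leaves=$(gmm-info $src_mdl_dir/final.mdl | grep 'number of pdfs' | awk '{print $NF}')
    num_gauss=$(gmm-info $src_mdl_dir/final.mdl | grep 'number of gaussians' | awk '{print $NF}')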

I had one issue with the current version, but I think it is due to my experiment setup rather than this script.
The problem was with steps/cleanup/internal/get_non_scored_words.py.
It could not find the '<UNK>' entry that I have in data/lang/oov.txt, which is associated with the phone GBG listed in data/lang/phones/nonsilence.txt (and not in data/lang/phones/silence.txt). I fixed it locally by adding '<UNK> GBG' to non_scored_entries manually.

Because this seems like a rather unconventional way of using a garbage phone, I am not including that hack in this PR.

@danpovey (Contributor)

Thanks!! Merging.

danpovey merged commit 99b7d96 into kaldi-asr:master on Jan 24, 2017
gorinars deleted the pr_dict_learn branch on January 24, 2017 19:41
@xiaohui-zhang (Contributor)

thanks a lot @gorinars

@KarenAssaraf

@gorinars @xiaohui-zhang :

I am running run_learn_lex.sh from opt/kaldi-git/egs/tedlium/s5.
I use the following parameters:
1. oov_symbol=""
2. an existing g2p in exp/tri3: g2p_mdl_dir=exp/tri3
3. ref_dict=data/local/dict
4. data=data/train
5. min_prob=0.4
6. prior_mean="0.7,0.2,0.1"
7. prior_counts_tot=15
8. variants_prob_mass=0.6
9. variants_prob_mass_ref=0.95
10. dir=exp/tri3_lex_work
11. nj=35
12. decode_nj=30

It seems like the g2p outputs an empty pronunciation, because I get the following error:
Checking exp/tri3_lex_work/dict_expanded_train/lexicon.txt
--> reading exp/tri3_lex_work/dict_expanded_train/lexicon.txt
--> ERROR: lexicon.txt contains word ngu with empty pronunciation.

This is indeed the only word with an empty pronunciation.
How can it be that "g2p_prons_for_oov_train.txt" gives an empty pronunciation for "ngu" (among other pronunciations for the same word)? Apparently, "ngu" is in my data/train (and indeed neither in ref_vocab nor in target_vocab).

I managed to run the script by forcing removal of empty-pronunciation entries from the file "$data/lexicon_oov_g2p.txt": I added a sed command in run_learn_lex.sh, just before the call to learn_lex.sh.
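
(One way such filtering might look; this awk one-liner is a sketch equivalent to the sed command described above, dropping any lexicon line that has a word but no pronunciation:)

    # Keep only lines with at least two fields (word + pronunciation).
    awk 'NF > 1' $data/lexicon_oov_g2p.txt > $data/lexicon_oov_g2p.filtered.txt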

@xiaohui-zhang (Contributor)

Hi @KarenAssaraf, there might be problems with the g2p model you trained (e.g. insufficient training data), so that it wasn't able to generate pronunciations for all words. Can you run apply_g2p again and check whether it gives you warnings?

@danpovey (Contributor) commented Jan 25, 2017 via email

@KarenAssaraf

Yes. That's what I meant. I see the filtering is already merged. Thanks!
