
Conversation

@gorinars (Contributor)

Some small modifications to reduce user confusion when running wsj/s5/steps/dict/learn_lexicon.sh

It seems the script will not work without 'oov_symbol', so it might make sense to make it a required parameter. The opposite holds for dir=$7, which could instead default to somewhere under $dest_dict/tmp while still allowing the user to change the location.

I do not know whether it is better to merge this, or to revise @xiaohui-zhang's last commit 03e6b92 a little more.

@danpovey (Contributor)

@xiaohui-zhang, what do you think?
Regarding the last arg, it could be made optional, you'd put it like
[ <tmp-dir> ]
in the usage message and change the condition to
if [ $# -lt 6 ] || [ $# -gt 7 ]; then

@xiaohui-zhang (Contributor)

I agree with @danpovey. If the user wants to keep the intermediate outputs for different parameter settings, they should set the tmp-dir explicitly. Otherwise it could default to something like ${src_mdl_dir}_lex_learn_work (following the data-cleanup convention). @gorinars, can you please add this change? Then I think we can merge.
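
(A minimal sketch of how the optional last argument could be handled, combining the arg-count check suggested above with the proposed default; the variable and argument names are illustrative, not necessarily those used in learn_lexicon.sh:)

    if [ $# -lt 6 ] || [ $# -gt 7 ]; then
      echo "Usage: $0 [options] <arg1> ... <dest-dict> [ <tmp-dir> ]"
      exit 1
    fi
    # If no 7th arg is given, default the work dir to sit next to the
    # source model dir, following the data-cleanup convention.
    dir=${7:-${src_mdl_dir}_lex_learn_work}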

@gorinars (Contributor, Author)

@xiaohui-zhang done

@danpovey (Contributor)

OK, I think there are still some issues (not with your changes; with the original script).

Firstly, I like scripts to have an "e.g.:" line in the usage message giving an example command line. You can loosely base this on the example script that Samuel checked in, and use it as an opportunity to show that the 'oov_symbol' option is probably required.

Also, when 'oov_symbol' is passed to prepare_lang.sh, it needs to be put in double quotes. Otherwise the angle brackets would crash the script.
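
(For illustration, the quoting fix might look like this; the positional-argument order follows the standard utils/prepare_lang.sh interface:)

    # Unquoted, an oov_symbol such as <unk> is parsed by the shell as
    # I/O redirection and the command breaks:
    #   utils/prepare_lang.sh $dict_dir $oov_symbol $tmp_dir $lang_dir
    # Double-quoting passes the angle brackets through literally:
    utils/prepare_lang.sh $dict_dir "$oov_symbol" $tmp_dir $lang_dir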

I think rather than having a nonempty default for oov_symbol, it would be better for the script to check that a value is passed in. I know this is against the spirit of options, but it's backward compatible with the existing example. E.g.:

if [ -z "$oov_symbol" ]; then
   echo "$0: the --oov-symbol option is required."
   exit 1
fi

@danpovey (Contributor)

@gorinars, do you have time to make the requested changes? Or should I ask @xiaohui-zhang?

@gorinars (Contributor, Author)

@danpovey, yes, I have. The problem is that there seem to be other issues with this script, which I found when testing it further.

I wanted to take a couple of days for further testing and then wrap up all the required modifications, but I can include your last suggestion now and open another PR once I finish debugging. Would you prefer it done that way?

@danpovey (Contributor) commented Jan 24, 2017 via email

@gorinars (Contributor, Author)

OK, thanks. I will try to complete it this week.

@gorinars (Contributor, Author)

@danpovey
I made the modifications we discussed. I also added two lines to infer the number of states and Gaussian densities from the initial model. Please let me know if more modifications are required (or if I need to rebase).
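
(For reference, inferring those values from an existing model is typically done with Kaldi's gmm-info; a sketch, assuming the initial model lives at $src_mdl_dir/final.mdl:)

    # gmm-info prints lines like "number of pdfs N" and
    # "number of gaussians N"; take the last field of each.
    num_leaves=$(gmm-info $src_mdl_dir/final.mdl | grep 'number of pdfs' | awk '{print $NF}')
    num_gauss=$(gmm-info $src_mdl_dir/final.mdl | grep 'number of gaussians' | awk '{print $NF}')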

I had one issue with the current version, but I think it is due to my experiment setup rather than this script.
The problem was with steps/cleanup/internal/get_non_scored_words.py.
It could not find the '<UNK>' entry that I have in data/lang/oov.txt, which is associated with the phone GBG listed in data/lang/phones/nonsilence.txt (and not in data/lang/phones/silence.txt). I fixed it locally by adding '<UNK> GBG' to non_scored_entries manually.

Because this seems like a rather unconventional way of using a garbage phone, I am not including that hack in this PR.

@danpovey (Contributor)

Thanks!! Merging.

danpovey merged commit 99b7d96 into kaldi-asr:master on Jan 24, 2017
gorinars deleted the pr_dict_learn branch on January 24, 2017 19:41
@xiaohui-zhang (Contributor)

thanks a lot @gorinars

@KarenAssaraf

@gorinars @xiaohui-zhang :

I am running run_learn_lex.sh from opt/kaldi-git/egs/tedlium/s5.
I use the following parameters:
1. oov_symbol=""
2. an existing g2p in exp/tri3: g2p_mdl_dir=exp/tri3
3. ref_dict=data/local/dict
4. data=data/train
5. min_prob=0.4
6. prior_mean="0.7,0.2,0.1"
7. prior_counts_tot=15
8. variants_prob_mass=0.6
9. variants_prob_mass_ref=0.95
10. dir=exp/tri3_lex_work
11. nj=35
12. decode_nj=30

It seems like the g2p outputs an empty pronunciation, because I get the following error:
Checking exp/tri3_lex_work/dict_expanded_train/lexicon.txt
--> reading exp/tri3_lex_work/dict_expanded_train/lexicon.txt
--> ERROR: lexicon.txt contains word ngu with empty pronunciation.

This is indeed the only word with an empty pronunciation.
How can it be that "g2p_prons_for_oov_train.txt" gives an empty pronunciation for "ngu" (among other pronunciations for the same word)? Apparently, "ngu" is in my data/train (and indeed neither in ref_vocab nor in target_vocab).

I managed to run the script by forcing removal of empty-pronunciation entries from the file "$data/lexicon_oov_g2p.txt": I added a sed command in run_learn_lex.sh, just before the call to learn_lex.sh.
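
(One way such filtering might look; this awk one-liner is a sketch equivalent to the sed command described above, dropping any lexicon line that has a word but no pronunciation:)

    # Keep only lines with at least two fields (word + pronunciation).
    awk 'NF > 1' $data/lexicon_oov_g2p.txt > $data/lexicon_oov_g2p.filtered.txt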

@xiaohui-zhang (Contributor)

Hi @KarenAssaraf, there might be problems with the g2p model you trained (e.g. insufficient training data), so that it wasn't able to generate pronunciations for all words. Can you run apply_g2p again and check whether it gives you warnings?

@danpovey (Contributor) commented Jan 25, 2017 via email

@KarenAssaraf

Yes. That's what I meant. I see the filtering is already merged. Thanks!
