-
Notifications
You must be signed in to change notification settings - Fork 5.4k
lexicon learning: update missing defaults and help message #1360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@xiaohui-zhang, what do you think? |
|
I agree with @danpovey . If the user want to keep the intermediate outputs for different parameter settings, one should set the tmp-dir outside. Otherwise it could be set as the default like ${src_mdl_dir}_lex_learn_work (following the convention of data cleanup). @gorinars Can you please add this change? Then we can merge I think. |
|
@xiaohui-zhang done |
|
OK, I think there are still some issues (not with your changes; with the original script). Firstly, I like scripts to have an "e.g.:" line in the usage message where they give an example command line. You can loosely base this on the example script that Samuel checked in, and Also, when 'oov_symbol' is passed to prepare_lang.sh, it needs to be put in double quotes. Otherwise the angle brackets would crash the script. I think rather than have a nonempty default for oov_symbol, it would be better for the script to check that a value is passed in. I know this is against the spirit of options, but it's back compatible with the existing example. E.g. |
|
@gorinars, do you have time to make the requested changes? Or should I ask @xiaohui-zhang? |
|
@danpovey, yes, I have. The problem is that it seems there are other problems with this script that I found when testing it further. I wanted to take a couple of days for further testing and then wrap-up all the required modifications. But I can include now your last suggestion and create another PR when finishing debugging. Do you prefer it to be done this way? |
|
It's OK, I can wait for your other changes; I just wanted to know you were
still working on it.
…On Mon, Jan 23, 2017 at 11:32 PM, Arseniy Gorin ***@***.***> wrote:
@danpovey <https://github.com/danpovey>, yes, I have. The problem is that
it seems there are other problems with this script that I found when
testing it further.
I wanted to take a couple of days for further testing and then wrap-up all
the required modifications. But I can include now your last suggestion and
create another PR when finishing debugging. Do you prefer it to be done
this way?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1360 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADJVu4-8ThPgYhV2E0YkOR7Ex3zZgUNsks5rVX7zgaJpZM4LpL0w>
.
|
|
OK, thanks. I will try to complete this week. |
d8ec1c9 to
aced28a
Compare
|
@danpovey I had one issue with the current version, but I think it is rather my experiment setup and not this scripts. Because it seems quite not conventional way of using garbage phone, I do not include such hack in this PR |
|
Thanks!! Merging. |
|
thanks a lot @gorinars |
|
I am running run_learn_lex.sh from opt/kaldi-git/egs/tedlium/s5. It seems like the g2p outputs an empty pronunciation, because I get the following error: This is indeed the only word with an empty pronunciation. I manage to run the script forcing empty pronunciations entries removal from file "$data/lexicon_oov_g2p.txt". Added a sed command in run_learn_lex.sh, just before the call to learn_lex.sh. |
|
Hi @KarenAssaraf , there might be problems with the g2p model you trained (e.g. insufficient training data), so that it wasn't able to generate pronunciations for all words. Can you run apply_g2p again and check whether it gives you warnings? |
|
I think she's saying it generated an *empty* pronunciation (i.e. not zero
pronunciations) for a word. We need to filter these out.
…On Wed, Jan 25, 2017 at 11:53 AM, Xiaohui Zhang ***@***.***> wrote:
Hi @KarenAssaraf <https://github.com/KarenAssaraf> , there might be
problems with the g2p model you trained (e.g. insufficient training data),
so that it wasn't able to generate pronunciations for all words. Can you
run apply_g2p again and check whether it gives you warnings?
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#1360 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ADJVu9SENyqtm-O7QC_SNGyX6CjR-SdXks5rV34hgaJpZM4LpL0w>
.
|
|
Yes. That's what I meant. I see the filtering is already merged. Thanks! |
Some small modifications to reduce user confusion when running wsj/s5/steps/dict/learn_lexicon.sh
It seems the script will not work without 'oov_symbol', so it might make sense to make it the required parameter. Somewhat the opposite stands for
dir=$7, which could possibly live somewhere in $dest_dict/tmp allowing the user to change the default locationI do not know if it is better to merge, or just revise a little more the last commit 03e6b92 of @xiaohui-zhang