Added phonetisaurus-based g2p scripts #2730
Changes from 4 commits
61d9560
771a556
55227df
4e75be7
a5d60d6
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| #!/bin/bash | ||
| # Copyright 2014 Johns Hopkins University (Author: Yenda Trmal) | ||
| # Copyright 2016 Xiaohui Zhang | ||
| # 2018 Ruizhe Huang | ||
| # Apache 2.0 | ||
|
|
||
| # This script applies a trained Phonetisaurus G2P model to | ||
| # synthesize pronunciations for missing words (i.e., words in | ||
| # transcripts but not the lexicon), and outputs the expanded lexicon. | ||
| # The user can specify either the nbest or the pmass option | ||
| # to determine the number of output pronunciation variants, | ||
| # or use them together to get the intersection of the two options. | ||
|
|
||
| # Begin configuration section. | ||
| stage=0 | ||
| nbest= # Generate up to $nbest variants | ||
| pmass= # Generate as many variants as needed to cover $pmass of the probability mass (e.g. 0.9) | ||
| # End configuration section. | ||
|
Contributor
@huangruizhe Can you add "thresh" as an option here? Please refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh. (Sorry, I just realized today that I already wrote a script like the current one 2 years ago.) Also, please explain a bit more about the nbest and pmass options, again by referring to the above script. Thanks!
Contributor
Author
Fixed.
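To make the requested explanation of nbest and pmass concrete, here is a minimal sketch of one reading of how the two options combine. It is an illustration of the semantics described in the usage message, not Phonetisaurus's implementation. `select_variants` is a hypothetical helper, and it assumes its input is in the `<word>\t<prob>\t<pronunciation>` format, grouped by word and sorted by descending probability within each word.

```bash
#!/bin/bash
# Hypothetical helper (not part of this PR or of Phonetisaurus).
# Reads <word>\t<prob>\t<pronunciation> lines on stdin and, for each word,
# keeps variants until either $nbest variants have been kept or the
# probability mass already kept has reached $pmass.
# Assumption: lines are grouped by word and sorted by descending probability.
select_variants() {
  local nbest=$1 pmass=$2
  awk -F'\t' -v nbest="$nbest" -v pmass="$pmass" '
    $1 != word { word = $1; n = 0; mass = 0 }      # new word: reset counters
    n < nbest && mass < pmass { print; n++; mass += $2 }
  '
}

# Example:
#   printf 'hi\t0.6\th ay\nhi\t0.3\th iy\nhi\t0.1\th ih\n' | select_variants 20 0.8
# keeps the first two variants of "hi"; with nbest=1 it keeps only the first.
```

Under this reading, specifying both options yields the stricter of the two limits for each word.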
||
|
|
||
| echo "$0 $@" # Print the command line for logging | ||
|
|
||
| [ -f ./path.sh ] && . ./path.sh; # source the path. | ||
| . utils/parse_options.sh || exit 1; | ||
|
|
||
| set -u | ||
| set -e | ||
|
|
||
| if [ $# != 3 ]; then | ||
| echo "Usage: $0 [options] <g2p-model> <word-list> <lexicon-out>" | ||
| echo "... where <g2p-model> is the trained g2p model." | ||
| echo " <word-list> is a list of words whose pronunciation is to be generated." | ||
| echo " <lexicon-out> output lexicon, whose format is <word>\t<prob>\t<pronunciation> for each line." | ||
| echo "e.g.: $0 --nbest 1 exp/g2p/model.fst exp/g2p/oov_words.txt data/local/dict_nosp/lexicon.txt" | ||
| echo "" | ||
| echo "main options (for others, see top of script file)" | ||
| echo " --nbest <int> # Maximum number of hypotheses to produce. By default, nbest=20" | ||
| echo " --pmass <float> # Select the maximum number of hypotheses summing to a total mass of pmass amount, within [0, 1], for a word. By default, pmass=1.0" | ||
| exit 1; | ||
| fi | ||
|
|
||
| model=$1 | ||
| word_list=$2 | ||
| out_lexicon=$3 | ||
| out_lexicon_failed="${out_lexicon}.failed" | ||
|
||
|
|
||
|
Contributor
Check whether Phonetisaurus is installed here. Please also refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/apply_g2p.sh.
Contributor
Author
Fixed.
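A minimal sketch of what such an installation check could look like, assuming the tools only need to be found on PATH. `require_tools` is a hypothetical helper name, and the referenced apply_g2p.sh may do this differently.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if every named tool is found on PATH.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$0: $tool was not found on PATH." >&2
      missing=1
    fi
  done
  return $missing
}

# In the script this would guard the call to phonetisaurus-apply, e.g.:
#   require_tools phonetisaurus-apply || exit 1
```

Testing the command directly, rather than capturing the output of `which`, keeps the check safe under `set -e`.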
||
| [ -z "$pmass" ] && [ -z "$nbest" ] && echo "$0: nbest and/or pmass should be specified." && exit 1; | ||
|
|
||
| # three options: 1) nbest, 2) pmass, 3) nbest+pmass, | ||
| nbest=${nbest:-20} # if nbest is not specified, set it to 20, due to Phonetisaurus mechanism | ||
| pmass=${pmass:-1.0} # if pmass is not specified, set it to 1.0, due to Phonetisaurus mechanism | ||
|
|
||
| [[ ! $nbest =~ ^[1-9][0-9]*$ ]] && echo "$0: nbest should be a positive integer." && exit 1; | ||
|
|
||
| echo "$0: Synthesizing pronunciations for words in $word_list based on nbest=$nbest and pmass=$pmass" | ||
| phonetisaurus-apply --pmass $pmass --nbest $nbest --model $model --thresh 5 --accumulate --verbose --prob --word_list $word_list \ | ||
| 1>$out_lexicon | ||
|
|
||
| echo "$0: Completed. Synthesized lexicon for new words is in $out_lexicon" | ||
|
|
||
|
Contributor
@huangruizhe Can you address Yenda's earlier comment: generate a list of failed words in a file and point the user to it in the echo message? The warning messages from Phonetisaurus are not consolidated into a file, so the user may miss them and then want to find those words in a file. Also, I noticed your "out_lexicon_failed" is not used at all.
Contributor
Author
Fixed.
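One way the failed-word list could be produced is sketched below, assuming a word counts as failed when it appears in the word list but has no entry in the output lexicon. `list_failed_words` is a hypothetical helper, not the code that was eventually committed.

```bash
#!/bin/bash
# Hypothetical helper: write every word from <word-list> that has no entry
# in <lexicon> (format: <word>\t<prob>\t<pronunciation>) to <failed-out>.
list_failed_words() {
  local word_list=$1 lexicon=$2 failed=$3
  # FILENAME == ARGV[1] is used instead of NR == FNR so that an empty
  # lexicon is handled correctly (all words are then reported as failed).
  awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
       NF > 0 && !($1 in seen) { print $1 }' "$lexicon" "$word_list" > "$failed"
}

# In the script this could follow the phonetisaurus-apply call, e.g.:
#   list_failed_words "$word_list" "$out_lexicon" "$out_lexicon_failed"
#   echo "$0: words without a pronunciation are listed in $out_lexicon_failed"
```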
||
| exit 0 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| #!/bin/bash | ||
|
|
||
| # Copyright 2017 Intellisist, Inc. (Author: Navneeth K) | ||
| # 2017 Xiaohui Zhang | ||
| # 2018 Ruizhe Huang | ||
| # Apache License 2.0 | ||
|
|
||
| # This script trains a g2p model using Phonetisaurus. | ||
|
|
||
| stage=0 | ||
| encoding='utf-8' | ||
| only_words=true | ||
| silence_phones= | ||
|
|
||
| echo "$0 $@" # Print the command line for logging | ||
|
|
||
| [ -f ./path.sh ] && . ./path.sh; # source the path. | ||
| . utils/parse_options.sh || exit 1; | ||
|
|
||
| set -u | ||
| set -e | ||
|
|
||
| if [ $# != 2 ]; then | ||
| echo "Usage: $0 [options] <lexicon-in> <work-dir>" | ||
| echo " where <lexicon-in> is the training lexicon (one pronunciation per " | ||
| echo " word per line, with lines like 'hello h uh l ow') and" | ||
| echo " <work-dir> is directory where the models will be stored" | ||
| echo "e.g.: $0 --silence-phones data/local/dict/silence_phones.txt data/local/dict/lexicon.txt exp/g2p/" | ||
| echo "" | ||
| echo "main options (for others, see top of script file)" | ||
| echo " --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs." | ||
| echo " --silence-phones <silphones-list> # e.g. data/local/dict/silence_phones.txt." | ||
| echo " # A list of silence phones, one or more per line" | ||
| echo " # Relates to --only-words option" | ||
| echo " --only-words (true|false) (default: true) # If true, exclude silence words, i.e." | ||
| echo " # words with one or multiple phones which are all silence." | ||
| exit 1; | ||
| fi | ||
|
|
||
| lexicon=$1 | ||
| wdir=$2 | ||
|
|
||
| [ ! -f "$lexicon" ] && echo "$0: Cannot find $lexicon" && exit 1 | ||
|
|
||
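| # Test the command directly: under 'set -e', a failing `which` in an assignment would abort before the message is printed. | ||
| if ! command -v uconv >/dev/null 2>&1; then | ||
| echo "uconv was not found. You must install the icu4c package." | ||
| exit 1; | ||
| fi | ||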
|
|
||
| mkdir -p $wdir | ||
|
|
||
|
|
||
| # For input lexicon, remove pronunciations containing non-utf-8-encodable characters, | ||
| # and optionally remove words that are mapped to a single silence phone from the lexicon. | ||
| if [ $stage -le 0 ]; then | ||
| if $only_words && [ ! -z "$silence_phones" ]; then | ||
| awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i; if(!(s in a)) print $1" "s}' \ | ||
| $silence_phones $lexicon | \ | ||
| awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' | \ | ||
| uconv -f "$encoding" -t "$encoding" -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt | ||
| else | ||
| awk '{printf("%s\t",$1); for (i=2;i<NF;i++){printf("%s ",$i);} printf("%s\n",$NF);}' $lexicon | \ | ||
| uconv -f "$encoding" -t "$encoding" -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt | ||
| fi | ||
| fi | ||
|
|
||
|
Contributor
Check whether Phonetisaurus is installed here. Please also refer to /export/b19/xzhang/tedlium/s5_r2/steps/dict/train_g2p.sh.
Contributor
Author
Fixed.
||
| if [ $stage -le 1 ]; then | ||
| # Align lexicon stage. Lexicon is assumed to have first column tab separated | ||
| phonetisaurus-align --input=$wdir/lexicon_tab_separated.txt --ofile=${wdir}/aligned_lexicon.corpus || exit 1; | ||
| fi | ||
|
|
||
| if [ $stage -le 2 ]; then | ||
| # Convert aligned lexicon to arpa using make_kn_lm.py, a re-implementation of srilm's ngram-count functionality. | ||
| ./utils/lang/make_kn_lm.py -ngram-order 7 -text ${wdir}/aligned_lexicon.corpus -lm ${wdir}/aligned_lexicon.arpa | ||
| fi | ||
|
|
||
| if [ $stage -le 3 ]; then | ||
| # Convert the arpa file to FST. | ||
| phonetisaurus-arpa2wfst --lm=${wdir}/aligned_lexicon.arpa --ofile=${wdir}/model.fst | ||
| fi | ||
|
|
||
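The stage 0 silence-word filter in the training script can be tried on a toy lexicon. The sketch below wraps the same awk program in a hypothetical function, `filter_silence_words`; the file names and the sample entries in the example are made up for illustration.

```bash
#!/bin/bash
# Hypothetical wrapper around the stage 0 awk filter: drop lexicon entries
# whose whole pronunciation is a single phone listed in <silence-phones>.
# Assumption: <silence-phones> is non-empty (NR==FNR misbehaves otherwise).
filter_silence_words() {
  local silence_phones=$1 lexicon=$2
  awk 'NR==FNR{a[$1] = 1; next}
       {s=$2; for(i=3;i<=NF;i++) s=s" "$i; if(!(s in a)) print $1" "s}' \
    "$silence_phones" "$lexicon"
}

# Example with made-up files:
#   printf 'SIL\nNSN\n' > silence_phones.txt
#   printf '<sil> SIL\nhello h uh l ow\n<noise> NSN\n' > lexicon.txt
#   filter_silence_words silence_phones.txt lexicon.txt
```

As written, the filter compares the entire pronunciation string against single silence phones, so an entry made of several silence phones would not be removed.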