HUB4 English Broadcast News recipe #2027
Conversation
I'll change the name to hub4_english
}
next;
}
remove empty line
else {&pusho($back);}
}
next;
remove empty line
Guys, let's hold off on those csr_hub4_utils
<#2027 (comment)> files
-- I think Vimal included them from the corpus, so I'm not completely
convinced we can include them (licensing and IP issues).
If there were modifications, then I think we should just publish a patch.
@danpovey?
#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/dev95_text.list > $dir/dev95_transcripts.txt 2> $dir/parse_sgml_dev95.log || exit 1
#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/eval95_test.list > $dir/eval95_transcripts.txt 2> $dir/parse_sgml_eval95.log || exit 1
#
#exit 0
maybe remove those commented lines or explain what's the purpose of them?
is there a reason this is still WIP?

This is ready for commit.
egs/hub4_english/s5/run.sh
Outdated
# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# First run the data preparation stages in WSJ run.sh
The script from here seems like it would be better as part of local/run_segmentation_wsj.sh. E.g. wsj_base could be made an argument to that script. And all this could be commented out. Unless I misunderstand something. I assume this isn't actually needed by the rest of this script, or it would be at the top...?
- # This is similar to get_ctm.sh, but gets the
- # CTM at the utterance-level.
+ # This is similar to get_ctm.sh, but gets the CTM at the utterance-level.
This script seems like it's not too closely related to the cleanup scripts and could be moved to steps/get_ctm_fast.sh. Clarify that it differs from get_ctm.sh because it only uses a single lmwt, operates in parallel over multiple lattices, outputs the ctm to a separate directory, and doesn't have the option to convert the ctm from an utterance-level ctm to a per-recording ctm.
You should make sure there is a suitable "see also" in get_ctm.sh.
@@ -0,0 +1,70 @@
+#!/usr/bin/env python
For new python scripts we prefer python3. (and you can remove compatibility things). This will mean it's not necessary to verify separately that it's python3 compatible.
As discussed, we'll remove the text-manipulating aspects of this script and replace them with a separate call to apply_map.pl.
Unfortunately, the way the text formats for Kaldi are designed doesn't mesh well with python3's default 'str' type. The "text" file format is encoding-agnostic: any encoding is allowed, as long as (when interpreted as a bytestream) whitespace characters represent space and newline represents newline. And because the encoding is never needed by Kaldi itself, we never have to specify it.
It might actually be possible to deal with this in python3 using its 'byte-string' type. From a search it looks like it is possible to do things like splitting byte-strings. This might be a good test case for figuring out the right way to do this in python, because as python3 becomes more and more the default, we need to make sure it doesn't break things.
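The byte-string idea above can be sketched as follows (the helper name and the sample data are illustrative only, not part of Kaldi):

```python
# Sketch of encoding-agnostic handling of a Kaldi 'text' line in Python 3
# using byte-strings. Opening the file in binary mode ('rb') sidesteps the
# UnicodeDecodeError that text-mode reading raises under an ASCII locale.

def parse_kaldi_text_line(line):
    """Split one line of a Kaldi 'text' file without assuming an encoding.

    bytes.split() with no argument splits on ASCII whitespace, which is
    the only structure the format guarantees; tokens stay as raw bytes.
    """
    parts = line.split()
    return parts[0], parts[1:]

# A line containing a non-UTF-8 byte (0xe9, Latin-1 'e-acute') that would
# fail to decode if the file were read in text mode with LC_ALL=C.
raw_line = b"utt1 caf\xe9 bonjour\n"
utt_id, words = parse_kaldi_text_line(raw_line)
print(utt_id)   # b'utt1'
print(words)    # [b'caf\xe9', b'bonjour']
```

In a real script the line would come from `open(path, 'rb')`; the point is just that splitting and joining work on `bytes` without ever choosing an encoding.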
I get encoding issues in python3, which I don't get in python2. There is
some text with non-ascii characters, but since LC_ALL=C is used in general,
the file is said to be in ascii. Python3 tries to then read the file in
ascii and fails.
Vimal
OK, well we 100% need to be compatible with python3. The way it's done in
python3 may be different but you'll have to ask around; I don't know.
Dan
# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# local/run_segmentation_wsj.sh
You have set this up like a "tuning" directory, with _a.sh and _b.sh and a soft link. But after looking at the scripts, I don't think this setup makes sense because they are for two different scenarios. Perhaps you could call them e.g.
local/run_segmentation_wsj_unsup.sh and local/run_segmentation_wsj_sup.sh,
have commented-out invocations to both of them in the run.sh, and explain in the run.sh and in the respective scripts how they differ.
This is ready to be merged.
danpovey left a comment
We're making progress. Some more smallish comments.
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
if $only_words && [ ! -z "$silence_phones" ]; then
  awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
-   $silence_phones > $wdir/lexicon_onlywords.txt
+   $lexicon $silence_phones > $wdir/lexicon_onlywords.txt
I know this awk script was pre-existing, but it's rather ugly. Can you rewrite it to make it clearer? I prefer perl or awk to avoid a super-long python script. Also the script seems to remove words where the pronunciation was seen before, which I doubt was the intention. And it's not clear to me that the order of 'lexicon' vs. 'silence_phones' as you have it, is correct anyway. Bear in mind that it's possible that there could be multiple silence phones on a single line, in the $silence_phones file, which I think this script does not handle properly.
You can use the following as inspiration (reading from a file directly is less ugly than the FNR trick):
# Get training data with OOV words (w.r.t. our current vocab) replaced with <unk>,
# as well as adding </s> symbols at the end of each sentence
cat $train_text | awk -v w=$dir/wordlist.all \
'BEGIN{while((getline<w)>0) v[$1]=1;}
{for (i=2;i<=NF;i++) if ($i in v) printf $i" ";else printf "<unk> ";print ""}' | sed 's=$= </s>=g' \
| utils/shuffle_list.pl | gzip -c > $dir/all.gz
@xiaohui-zhang would have an idea about this part.
@danpovey do we allow multiple silence phones to occur in the same row in data/local/dict/silence_phones.txt ? I thought they have to be put in different columns.
#! /bin/bash

- # Copyright 2016 Vimal Manohar
+ # Copyright 2016-18 Vimal Manohar
prefer -2018
local/lm/merge_word_counts.py 1 | sort -k 1,1nr > $dir/data/work/final.wordlist_counts

if [ ! -z "$vocab_size" ]; then
  awk -v sz=$vocab_size 'BEGIN{count=-1;}
traditionally you wouldn't use up this much vertical space in an inline awk script.
) || exit 1;

num_dev_sentences=4500
RANDOM=0
remove unused variable
@@ -0,0 +1 @@
+ tuning/run_segmentation_wsj_a.sh
(No newline at end of file)
this points to the "a" script. Check that this is what you intended.
# For nnet3 and chain results after cleanup, see the scripts in
# local/nnet3/run_tdnn.sh and local/chain/run_tdnn.sh

# GMM Results for speaker-independent (SI) and speaker adaptive training (SAT) systems on dev and test sets
Did you intend to add results here? Add them or remove this comment.
export PATH=$PATH:`pwd`/local/dict

if [ $stage -le 3 ]; then
  cat $wordlist | python -c '
I think you could replace this inline python script with:
utils/filter_scp.pl --exclude $wordlist <$orig_wordlist > $dir/oovlist
# Remove ; and , from words, if they are present; these
# might crash our scripts, as they are used as separators there.
filter_dict.pl $dir/dict.cmu > $dir/f/dict
You seem to have inherited some old WSJ scripts for extending a dictionary. I believe this works worse than g2p.py, when we tested. Is this really needed?
for x in $dir/score_*/$name.ctm; do
  cp $x $dir/tmpf;
  cat $dir/tmpf | grep -i -v -E '<NOISE|SPOKEN_NOISE>' | \
    grep -i -v -E ' (UH|UM|EH|MM|HM|AH|HUH|HA|ER|OOF|HEE|ACH|EEE|EW)$' | \
Do these things really appear in the BN setup?
I think some of them can appear. There are many test sets. Some can include conversational speech.
mkdir -p tools
pip install -t tools/beautifulsoup4 beautifulsoup4
fi
export PYTHONPATH=$PWD/tools/beautifulsoup4:$PYTHONPATH
please add a comment explaining why this is needed.
Regarding multiple silence phones in the same row of dict/silence_phones.txt:
I think they are allowed but it may not be the normal case (it would mean
you'd share the root).
understand. thanks
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
if $only_words && [ ! -z "$silence_phones" ]; then
- awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
-   $silence_phones > $wdir/lexicon_onlywords.txt
+ awk -v w=$silence_phones \
I fixed this. Let me know if this is fine. @xiaohui-zhang @danpovey
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
awk -v w=$silence_phones \
  'BEGIN{while((getline<w)>0) {for(i=1;i<=NF;i++) sil[$i]=1;}}
   { p=$2; for(i=3;i<=NF;i++) p=p" "$i;
     if(!(p in sil)) print $1" "p }' $lexicon > $wdir/lexicon_onlywords.txt
couldn't this be simplified to: { if (!(NF == 2 && $2 in sil)) print; }
and I'd prefer to rename w to s.
agreed. otherwise looks good. Thanks Vimal!
I fixed the issues, and it is ready to be committed.

Thanks! Merging.
Basic GMM recipe for English broadcast news. Needs to be tested.
@xiaohui-zhang Please review it.