Conversation

@vimalmanohar
Contributor

Basic GMM recipe for English broadcast news. Needs to be tested.
@xiaohui-zhang Please review it.

@jtrmal
Contributor

jtrmal commented Nov 21, 2017 via email

@vimalmanohar
Contributor Author

I'll change the name to hub4_english

@vimalmanohar vimalmanohar changed the title WIP: Broadcast News recipe WIP: HUB4 English Broadcast News recipe Nov 21, 2017
}
next;
}

Contributor

remove empty line

else {&pusho($back);}
}
next;

Contributor

remove empty line

@jtrmal
Contributor

jtrmal commented Nov 21, 2017 via email

#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/dev95_text.list > $dir/dev95_transcripts.txt 2> $dir/parse_sgml_dev95.log || exit 1
#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/eval95_test.list > $dir/eval95_transcripts.txt 2> $dir/parse_sgml_eval95.log || exit 1
#
#exit 0
Contributor

maybe remove those commented lines or explain what's the purpose of them?

@danpovey
Contributor

is there a reason this is still WIP?

@vimalmanohar
Contributor Author

This is ready for commit.

@vimalmanohar vimalmanohar changed the title WIP: HUB4 English Broadcast News recipe HUB4 English Broadcast News recipe Feb 8, 2018
# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# First run the data preparation stages in WSJ run.sh
Contributor

The script from here seems like it would be better as part of local/run_segmentation_wsj.sh. E.g. wsj_base could be made an argument to that script. And all this could be commented out. Unless I misunderstand something, I assume this isn't actually needed by the rest of this script, or it would be at the top...?

# This is similar to get_ctm.sh, but gets the
# CTM at the utterance-level.

# This is similar to get_ctm.sh, but gets the CTM at the utterance-level.
Contributor

This script seems like it's not too closely related to the cleanup scripts and could be moved to steps/get_ctm_fast.sh. Clarify that it differs from get_ctm.sh because it only uses a single lmwt, operates in parallel over multiple lattices, outputs the ctm to a separate directory, and doesn't have the option to convert the ctm from an utterance-level ctm to a per-recording ctm.

You should make sure there is a suitable "see also" in get_ctm.sh.

@@ -0,0 +1,70 @@
#!/usr/bin/env python
Contributor

For new Python scripts we prefer python3 (and you can remove the compatibility shims). This means it's not necessary to verify separately that the script is python3-compatible.

Contributor

@danpovey danpovey Feb 8, 2018

As discussed, we'll remove the text-manipulating aspects of this script and replace them with a separate call to apply_map.pl.
Unfortunately, the way the text formats for Kaldi are designed doesn't mesh well with python3's default 'str' type. The "text" file format is encoding-agnostic: any encoding is allowed, as long as (when interpreted as a bytestream) whitespace characters represent space, and newline represents newline. And because it's never needed by Kaldi itself, we never have to specify the encoding.
It might actually be possible to deal with this in python3 using its 'byte-string' type. From a search it looks like it is possible to do things like splitting byte-strings. This might be a good test case for figuring out the right way to do this in python, because as python3 becomes more and more the default, we need to make sure it doesn't break things.
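A minimal python3 sketch of the byte-string approach described above (the sample line and contents are purely illustrative, not from the recipe):

```python
# Sketch: handling a Kaldi "text" line as raw bytes in python3 so the
# script stays encoding-agnostic. bytes.split() with no argument splits
# on runs of ASCII whitespace, which is exactly the separator convention
# the Kaldi text format guarantees, regardless of how the remaining
# bytes are encoded.

def parse_text_line(line):
    """Split a Kaldi text line (bytes) into (utterance-id, word tokens)."""
    parts = line.split()
    return parts[0], parts[1:]

if __name__ == '__main__':
    # Opening files in binary mode ('rb') would avoid any decode step;
    # here a literal byte string stands in for one such line (UTF-8
    # happens to be used, but any encoding would pass through unchanged).
    line = b"utt1 caf\xc3\xa9 menu\n"
    utt_id, words = parse_text_line(line)
    print(utt_id, words)
```

No decoding happens at any point, so a file mixing encodings (or using one the script author never anticipated) is passed through byte-for-byte.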

@vimalmanohar
Contributor Author

vimalmanohar commented Feb 8, 2018 via email

@danpovey
Contributor

danpovey commented Feb 8, 2018 via email

# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# local/run_segmentation_wsj.sh
Contributor

You have set this up like a "tuning" directory, with _a.sh and _b.sh and a soft link. But after looking at the scripts, I don't think this setup makes sense, because they are for two different scenarios. Perhaps you could call them e.g.
local/run_segmentation_wsj_unsup.sh and local/run_segmentation_wsj_sup.sh,
have commented-out invocations to both of them in the run.sh, and explain in the run.sh and in the respective scripts how they differ.

@vimalmanohar
Contributor Author

This is ready to be merged.

Contributor

@danpovey danpovey left a comment

We're making progress. Some more smallish comments.

if $only_words && [ ! -z "$silence_phones" ]; then
awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
$silence_phones > $wdir/lexicon_onlywords.txt
$lexicon $silence_phones > $wdir/lexicon_onlywords.txt
Contributor

I know this awk script was pre-existing, but it's rather ugly. Can you rewrite it to make it clearer? I prefer perl or awk to avoid a super-long python script. Also the script seems to remove words where the pronunciation was seen before, which I doubt was the intention. And it's not clear to me that the order of 'lexicon' vs. 'silence_phones' as you have it, is correct anyway. Bear in mind that it's possible that there could be multiple silence phones on a single line, in the $silence_phones file, which I think this script does not handle properly.
You can use the following as inspiration (reading from a file directly is less ugly than the FNR trick):

# Get training data with OOV words (w.r.t. our current vocab) replaced with <unk>,
# as well as adding </s> symbols at the end of each sentence
cat $train_text | awk -v w=$dir/wordlist.all \
  'BEGIN{while((getline<w)>0) v[$1]=1;}
  {for (i=2;i<=NF;i++) if ($i in v) printf $i" ";else printf "<unk> ";print ""}' | sed 's=$= </s>=g' \
  | utils/shuffle_list.pl | gzip -c > $dir/all.gz
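The mapping that the awk one-liner above performs (OOV words replaced with &lt;unk&gt;, a sentence-end marker appended) could be sketched in Python as follows; this is illustrative only, not part of the recipe, and it omits the shuffling and gzip steps:

```python
# Sketch: replace words outside the vocabulary with <unk> and append
# </s>, mirroring the awk + sed pipeline above. Names are illustrative.

def map_oov(text_lines, vocab):
    out = []
    for line in text_lines:
        fields = line.split()
        # fields[0] is the utterance id (the awk loop starts at i=2),
        # so it is dropped here as well.
        words = [w if w in vocab else "<unk>" for w in fields[1:]]
        out.append(" ".join(words + ["</s>"]))
    return out

if __name__ == '__main__':
    print(map_oov(["utt1 hello zorble world"], {"hello", "world"}))
```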

Contributor Author

@xiaohui-zhang would have an idea about this part.

Contributor

@danpovey do we allow multiple silence phones to occur in the same row in data/local/dict/silence_phones.txt ? I thought they have to be put in different columns.

#! /bin/bash

# Copyright 2016 Vimal Manohar
# Copyright 2016-18 Vimal Manohar
Contributor

prefer -2018

local/lm/merge_word_counts.py 1 | sort -k 1,1nr > $dir/data/work/final.wordlist_counts

if [ ! -z "$vocab_size" ]; then
awk -v sz=$vocab_size 'BEGIN{count=-1;}
Contributor

traditionally you wouldn't use up this much vertical space in an inline awk script.

) || exit 1;

num_dev_sentences=4500
RANDOM=0
Contributor

remove unused variable

@@ -0,0 +1 @@
tuning/run_segmentation_wsj_a.sh No newline at end of file
Contributor

this points to the "a" script. Check that this is what you intended.

# For nnet3 and chain results after cleanup, see the scripts in
# local/nnet3/run_tdnn.sh and local/chain/run_tdnn.sh

# GMM Results for speaker-independent (SI) and speaker adaptive training (SAT) systems on dev and test sets
Contributor

Did you intend to add results here? Add them or remove this comment.

export PATH=$PATH:`pwd`/local/dict

if [ $stage -le 3 ]; then
cat $wordlist | python -c '
Contributor

I think you could replace this inline python script with:
utils/filter_scp.pl --exclude $wordlist <$orig_wordlist > $dir/oovlist


# Remove ; and , from words, if they are present; these
# might crash our scripts, as they are used as separators there.
filter_dict.pl $dir/dict.cmu > $dir/f/dict
Contributor

You seem to have inherited some old WSJ scripts for extending a dictionary. I believe this works worse than g2p.py, when we tested. Is this really needed?

for x in $dir/score_*/$name.ctm; do
cp $x $dir/tmpf;
cat $dir/tmpf | grep -i -v -E '<NOISE|SPOKEN_NOISE>' | \
grep -i -v -E ' (UH|UM|EH|MM|HM|AH|HUH|HA|ER|OOF|HEE|ACH|EEE|EW)$' | \
Contributor

Do these things really appear in the BN setup?

Contributor Author

I think some of them can appear. There are many test sets. Some can include conversational speech.

mkdir -p tools
pip install -t tools/beautifulsoup4 beautifulsoup4
fi
export PYTHONPATH=$PWD/tools/beautifulsoup4:$PYTHONPATH
Contributor

please add a comment explaining why this is needed.

@danpovey
Contributor

danpovey commented Feb 12, 2018 via email

@xiaohui-zhang
Contributor

Understood, thanks.

if $only_words && [ ! -z "$silence_phones" ]; then
awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
$silence_phones > $wdir/lexicon_onlywords.txt
awk -v w=$silence_phones \
Contributor Author

I fixed this. Let me know if this is fine. @xiaohui-zhang @danpovey

awk -v w=$silence_phones \
'BEGIN{while((getline<w)>0) {for(i=1;i<=NF;i++) sil[$i]=1;}}
{ p=$2; for(i=3;i<=NF;i++) p=p" "$i;
if(!(p in sil)) print $1" "p }' $lexicon > $wdir/lexicon_onlywords.txt
Contributor

couldn't this be simplified to: { if (!(NF == 2 && $2 in sil)) print; }
and I'd prefer to rename w to s.
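The suggested condition (drop a lexicon entry only when its whole pronunciation is a single silence phone), together with handling multiple silence phones per line in silence_phones.txt, could be sketched in Python as follows; this is illustrative only, not the recipe code:

```python
# Sketch: filter a lexicon, removing entries whose entire pronunciation
# is one silence phone. silence_phones.txt may list several phones on a
# single line, so each line is split into individual phones.

def filter_lexicon(lexicon_lines, silence_phone_lines):
    sil = set()
    for line in silence_phone_lines:
        sil.update(line.split())
    kept = []
    for line in lexicon_lines:
        fields = line.split()
        # Mirrors the suggested awk condition: !(NF == 2 && $2 in sil)
        if not (len(fields) == 2 and fields[1] in sil):
            kept.append(line)
    return kept

if __name__ == '__main__':
    lex = ["<noise> NSN", "hello HH AH L OW", "<sil> SIL"]
    print(filter_lexicon(lex, ["SIL NSN", "SPN"]))
```

Note that a multi-phone pronunciation is always kept, even if every phone in it happens to be a silence phone, which matches the single-phone check in the suggested awk.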

Contributor

agreed. otherwise looks good. Thanks Vimal!

@vimalmanohar
Contributor Author

I fixed the issues, and it is ready to be committed.

@danpovey
Contributor

Thanks! Merging.

@danpovey danpovey merged commit 5ea9b0d into kaldi-asr:master Feb 13, 2018
LvHang pushed a commit to LvHang/kaldi that referenced this pull request Apr 14, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018