Merged
113 commits
3b4428b
basic directory structure
jtrmal Oct 30, 2017
467cdeb
basic data setup ready
jtrmal Oct 31, 2017
ae370ad
adding scoring script
jtrmal Oct 31, 2017
92bd9eb
resolve utf-8 encoding and some other details
jtrmal Nov 6, 2017
33d5f36
do fix_data_dir after parametrixation
jtrmal Nov 6, 2017
0a0436a
make 1A language default again
jtrmal Nov 6, 2017
e826675
material:add text filter for scoring
jtrmal Nov 21, 2017
34ff3c1
material: fix path.sh
jtrmal Nov 21, 2017
8e3481a
script changes up to triphone training
freewym Nov 6, 2017
04dc97e
tuning of triphone systems
freewym Nov 22, 2017
7a2114c
added recipe for tagalog
freewym Dec 13, 2017
51768ed
added more
freewym Dec 13, 2017
4755373
RNNLM for material
hainan-xv Jan 6, 2018
389dcde
change chain model path
hainan-xv Jan 6, 2018
258f7ba
minor change
hainan-xv Jan 13, 2018
0061f07
minor change
hainan-xv Jan 13, 2018
daf8e56
reoriganized the scripts structures to allow to specifying language n…
freewym Jan 16, 2018
8b3e34c
create one single rnnlm script for all material languages
freewym Jan 17, 2018
c05b977
fix stage numbers in the tdnn-lstm recipe
freewym Jan 19, 2018
78c821b
remove $language subdir in /exp and /data
freewym Jan 20, 2018
2abbfa2
fix issues related to num of params checks for some scripts
freewym Feb 7, 2018
b5af256
added decoding scripts for ANALYSIS1
Jan 25, 2018
1a8d80e
added scripts to compute WER for decoding ANALYSIS1
Jan 31, 2018
60c6eed
removed exit 0
Feb 4, 2018
b0b098f
remove in exp/ and data/
Feb 7, 2018
0145421
minor fixes
Feb 9, 2018
64165fe
added audio_path to conf
Feb 13, 2018
10360e6
added scripts that produce the results for the site visit and cleanup
freewym Feb 9, 2018
c0ae6b1
added support to decode test dev/eval1
freewym Mar 1, 2018
2016d8c
added sentence segmentation
Apr 24, 2018
2e5d018
bug fixes for the path to tagalog DEV data
freewym Apr 6, 2018
1b17c1f
adds eval2 decoding; adds tdnn1b recipes
freewym Apr 21, 2018
442e6bb
adds analysis2 decoding
Jun 14, 2018
eb225b1
material scripts
hainan-xv Jun 14, 2018
711198e
clean up src a bit
hainan-xv Jun 14, 2018
0ca089f
clean up src a bit2
hainan-xv Jun 14, 2018
3e1c880
material: Temporarily fixed scoring
vimalmanohar Jul 10, 2018
9cd33e6
material: Cleanup scoring scripts
vimalmanohar Jul 10, 2018
399496a
material scripts
hainan-xv Jul 17, 2018
d705fe3
merge with latest base
hainan-xv Jul 17, 2018
a42b09f
t push origin material_basicMerge branch 'hainan-xv-material_basic' i…
Jul 18, 2018
b98b248
remove some of the _2 affixes
hainan-xv Jul 20, 2018
1f9fa82
add decoding for eval3
hainan-xv Jul 21, 2018
9ffb000
Update convert_lexicon.pl
jtrmal Aug 27, 2018
4cd2ea5
added semisupervised training scripts (might need changes according t…
freewym Sep 5, 2018
187b18c
removed files with _2 suffix
Sep 11, 2018
212a8f7
Merge branch 'master' of https://github.com/kaldi-asr/kaldi into mate…
hainan-xv Sep 15, 2018
cc3cd1d
add mono data
hainan-xv Sep 16, 2018
250ba57
fix a bug for decoding; officially working
hainan-xv Sep 17, 2018
28dc919
basic directory structure
jtrmal Oct 30, 2017
bf66894
basic data setup ready
jtrmal Oct 31, 2017
e3db74e
adding scoring script
jtrmal Oct 31, 2017
b06655e
resolve utf-8 encoding and some other details
jtrmal Nov 6, 2017
0599652
do fix_data_dir after parametrixation
jtrmal Nov 6, 2017
fa75313
make 1A language default again
jtrmal Nov 6, 2017
de1b53d
material:add text filter for scoring
jtrmal Nov 21, 2017
93ebc80
material: fix path.sh
jtrmal Nov 21, 2017
bdc70ec
script changes up to triphone training
freewym Nov 6, 2017
2c46288
tuning of triphone systems
freewym Nov 22, 2017
d281ede
added recipe for tagalog
freewym Dec 13, 2017
dbc0723
added more
freewym Dec 13, 2017
eab09c7
RNNLM for material
hainan-xv Jan 6, 2018
a7f73a3
change chain model path
hainan-xv Jan 6, 2018
77ce4dc
minor change
hainan-xv Jan 13, 2018
0b0e47d
minor change
hainan-xv Jan 13, 2018
c9b3e75
reoriganized the scripts structures to allow to specifying language n…
freewym Jan 16, 2018
a8d77fc
create one single rnnlm script for all material languages
freewym Jan 17, 2018
ffba239
fix stage numbers in the tdnn-lstm recipe
freewym Jan 19, 2018
9ee662d
remove $language subdir in /exp and /data
freewym Jan 20, 2018
a80382b
fix issues related to num of params checks for some scripts
freewym Feb 7, 2018
6cd4049
added decoding scripts for ANALYSIS1
Jan 25, 2018
809e3f3
added scripts to compute WER for decoding ANALYSIS1
Jan 31, 2018
5e06fe4
removed exit 0
Feb 4, 2018
d7cf604
remove in exp/ and data/
Feb 7, 2018
63c2b93
minor fixes
Feb 9, 2018
bb97a3f
added audio_path to conf
Feb 13, 2018
38d3445
added scripts that produce the results for the site visit and cleanup
freewym Feb 9, 2018
1601312
added support to decode test dev/eval1
freewym Mar 1, 2018
cc32463
added sentence segmentation
Apr 24, 2018
c85ddf3
bug fixes for the path to tagalog DEV data
freewym Apr 6, 2018
22109d7
adds eval2 decoding; adds tdnn1b recipes
freewym Apr 21, 2018
6452a5f
adds analysis2 decoding
Jun 14, 2018
ec7ec63
material scripts
hainan-xv Jun 14, 2018
b37c754
clean up src a bit
hainan-xv Jun 14, 2018
f1e44da
clean up src a bit2
hainan-xv Jun 14, 2018
170da49
material: Temporarily fixed scoring
vimalmanohar Jul 10, 2018
9c61ab2
material: Cleanup scoring scripts
vimalmanohar Jul 10, 2018
63022e6
material scripts
hainan-xv Jul 17, 2018
46f34a4
remove some of the _2 affixes
hainan-xv Jul 20, 2018
e6e21bb
add decoding for eval3
hainan-xv Jul 21, 2018
bb06a69
Update convert_lexicon.pl
jtrmal Aug 27, 2018
d380c46
added semisupervised training scripts (might need changes according t…
freewym Sep 5, 2018
c5b58f8
removed files with _2 suffix
Sep 11, 2018
e80d7b1
adding monodata to material
hainan-xv Sep 25, 2018
d80f022
Merge branch 'material_basic' into material_fix2
mahsa7823 Sep 25, 2018
902fd89
Merge branch 'hainan-xv-material_fix4' into material_basic
Sep 25, 2018
865b507
updated run.sh with the instrictions
Oct 1, 2018
b1d9ef0
changing how LM preparation was done for material
hainan-xv Oct 8, 2018
8014b51
changing how LM preparation was done for material, merge with latest …
hainan-xv Oct 8, 2018
aef4bf3
added Somali config
Oct 22, 2018
3d07b4e
Merge branch 'hainan-xv-material_new_lm2' into material_basic
Oct 22, 2018
df46dc9
updated somali.cong with mono and number_mapping paths
Nov 19, 2018
22d03b5
added local/normalize_numbers.py
Nov 23, 2018
7a83d7e
support for mono2, create output_nbest directory
mahsa7823 Dec 12, 2018
e4b77fe
updated WER results in local/chain/tuning/run_tdnn_1b.sh and local/rn…
mahsa7823 Feb 20, 2019
54840c8
clean-up semisupervised training scripts for material
freewym May 26, 2019
042c974
clean up
mahsa7823 May 27, 2019
682d483
clean up semisup
mahsa7823 May 27, 2019
0f428ba
change configs
mahsa7823 May 27, 2019
7f07182
anonymize paths
mahsa7823 May 27, 2019
f92372b
Merge branch 'master' into material_basic
mahsa7823 May 27, 2019
2de2e42
added README and RESULTS. De-anonymization.
mahsa7823 May 29, 2019
aac8c9a
Merge branch 'material_basic' of https://github.com/mahsa7823/kaldi i…
mahsa7823 May 29, 2019
Empty file added egs/material/README
Empty file.
35 changes: 35 additions & 0 deletions egs/material/s5/README
@@ -0,0 +1,35 @@
About the MATERIAL corpus:

The MATERIAL project:
https://www.iarpa.gov/index.php/research-programs/material
https://www.nist.gov/itl/iad/mig/openclir-evaluation

The speech data in the MATERIAL corpus consist of four data sets for each
language: train (BUILD), development (BUILD-dev), test (ANALYSIS1 and ANALYSIS2),
and unlabeled evaluation audio (EVAL{1,2,3}). The train, development, test, and
evaluation data contain around 40, 10, 20, and 250 hours of audio respectively.
The train set is transcribed conversational audio that can be used for training
an ASR system. Some of its files are 8-bit a-law .sph (Sphere) files and some
are .wav files with 24-bit samples. The development set is transcribed
conversational audio that can be used as development data to tune model
parameters. The test data come in long unsegmented files. Reference
transcripts for the test set are provided, so one can measure WER on the test
set. The evaluation set is untranscribed audio that can be used for
semi-supervised training of the acoustic model.
Conversational speech data in the train and test sets are two-channel audio with
the two channels temporally aligned. Each audio channel is provided and
transcribed as a separate file, identified as the inLine or outLine channel. Both
audio channels are interleaved in a single file and there is a single
interleaved transcript that reflects the temporal alignments. In addition to
conversational speech, the test and evaluation sets also contain other
genres of speech, namely news broadcast and topical broadcast, which are
single-channel files.


Running the recipe:

In the s5 directory:
./run.sh --language <swahili|tagalog|somali>
./local/chain/run_tdnn.sh
./local/chain/decode_test.sh --language <swahili|tagalog|somali>
./local/rnnlm/run_tdnn_lstm.sh
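
The sequence above can be previewed per language before launching anything. The following is only a sketch: `recipe_commands` is a hypothetical helper that is not part of the recipe, and it prints the four README commands verbatim rather than running them.

```shell
# Hypothetical dry-run helper (not part of the recipe): prints the README's
# command sequence for one language without executing anything.
recipe_commands() {
  lang=$1
  case "$lang" in
    swahili|tagalog|somali) ;;
    *) echo "unsupported language: $lang" >&2; return 1 ;;
  esac
  echo "./run.sh --language $lang"
  echo "./local/chain/run_tdnn.sh"
  echo "./local/chain/decode_test.sh --language $lang"
  echo "./local/rnnlm/run_tdnn_lstm.sh"
}

recipe_commands swahili
```

Rejecting unknown language names up front avoids starting a long run with a typo in `--language`.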
51 changes: 51 additions & 0 deletions egs/material/s5/RESULTS
@@ -0,0 +1,51 @@
WER results for supervised and semi-supervised acoustic model training

Baseline: GMM training to create alignments and lattice-free MMI-trained neural
network with factorized TDNN. The labeled audio from the BUILD package is used
for supervised acoustic model training; the unlabeled EVAL audio is added for
semi-supervised acoustic model training.

Source-side bitext on the BUILD package and crawled monolingual data are used in
building the n-gram LM, RNNLM re-scoring, as well as extending the baseline lexicon.


Results for *supervised* acoustic model training:

Swahili
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   36.8      36.7    38.9
ANALYSIS1   42.5      41.3    41.4
ANALYSIS2   38.1      36.8    36.9

Tagalog
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   46.4      46.1    47.5
ANALYSIS1   52.1      51.0    50.9
ANALYSIS2   53.6      52.3    52.2

Somali
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   57.4      56.5    57.8
ANALYSIS1   61.6      57.8    57.7
ANALYSIS2   59.3      55.5    55.3


Results for *semi-supervised* acoustic model training:

Swahili
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   35.3      35.1    36.7
ANALYSIS1   35.2      34.5    34.7
ANALYSIS2   30.8      30.0    30.1

Tagalog
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   45.0      45.2    46.6
ANALYSIS1   40.8      40.1    40.1
ANALYSIS2   41.1      40.6    40.6

Somali
            Baseline  +RNNLM  +RNNLM-nbest
BUILD-dev   56.8      56.3    57.7
ANALYSIS1   50.6      48.8    48.6
ANALYSIS2   49.8      48.2    48.2
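
To compare the two training regimes, it can help to express the gain as a relative WER reduction, 100 * (before - after) / before. The sketch below is illustrative only: `rel_wer_reduction` is a hypothetical helper, and the inputs are the Baseline ANALYSIS1 figures copied from the supervised and semi-supervised tables above.

```shell
# Relative WER reduction in percent: 100 * (before - after) / before.
rel_wer_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
}

# Baseline column, ANALYSIS1 row: supervised vs. semi-supervised.
echo "Swahili: $(rel_wer_reduction 42.5 35.2)%"
echo "Tagalog: $(rel_wer_reduction 52.1 40.8)%"
echo "Somali:  $(rel_wer_reduction 61.6 50.6)%"
```

On these rows the semi-supervised models reduce baseline WER by roughly 17 to 22 percent relative, while the BUILD-dev rows move much less.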
14 changes: 14 additions & 0 deletions egs/material/s5/cmd.sh
@@ -0,0 +1,14 @@
# you can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances of 'queue.pl' to run.pl (but be careful and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub). slurm.pl works
# with slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="queue.pl --mem 2G"
export decode_cmd="retry.pl --num-tries 3 queue.pl --mem 8G"
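
One way to switch between a grid and a single machine without editing the file by hand is sketched below. This is an assumption-laden illustration, not part of the PR: `select_cmds` and the `KALDI_BACKEND` environment variable are hypothetical names, and the `local` branch simply applies the comment's advice to replace `queue.pl` with `run.pl`.

```shell
# Hypothetical helper: choose job launchers for a grid ("grid") or a
# single machine ("local"). KALDI_BACKEND is an assumed variable name.
select_cmds() {
  case "$1" in
    grid)
      train_cmd="queue.pl --mem 2G"
      decode_cmd="retry.pl --num-tries 3 queue.pl --mem 8G"
      ;;
    local)
      # run.pl runs jobs on the local machine; watch memory use.
      train_cmd="run.pl"
      decode_cmd="run.pl"
      ;;
    *)
      echo "usage: select_cmds grid|local" >&2
      return 1
      ;;
  esac
  export train_cmd decode_cmd
}

select_cmds "${KALDI_BACKEND:-grid}"
```

With this in place, `KALDI_BACKEND=local` would select `run.pl` for both variables, while the default keeps the settings shown in the diff.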
1 change: 1 addition & 0 deletions egs/material/s5/conf/decode.config
@@ -0,0 +1 @@
# empty config, just use the defaults.
26 changes: 26 additions & 0 deletions egs/material/s5/conf/lang/somali.conf
@@ -0,0 +1,26 @@
# speech corpora files location
# the user should replace the values with the ones that work for their location
corpus=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/BUILD/
# test audio files to decode
audio_path_analysis1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/ANALYSIS1/audio/
audio_path_analysis2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/ANALYSIS2/audio/
audio_path_dev=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/DEV/audio/
audio_path_eval1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/EVAL1/audio/
audio_path_eval2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/EVAL2/audio/
audio_path_eval3=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1S/EVAL3/audio/
# bitext file location
bitext=$corpus/bitext/MATERIAL_BASE-1S-BUILD_bitext.txt
mono=/home/pkoehn/statmt/data/site-crawl/corpus/paracrawl-release3.2018-11-05.en-so.zipporah-20-dedup.lang-filtered.so
mono2=/home/pkoehn/statmt/data/data.statmt.org/lm/so.filtered.tok.gz
# number_mapping is a 2-column file consisting of the numbers written as digits (1st column) and letters (2nd column)
number_mapping=/home/pkoehn/experiment/material-asr-so-en/scripts/somali_1_9999.txt
# Acoustic model parameters
numShorestUtts=40000
numLeavesTri1=2000
numGaussTri1=30000
numLeavesTri2=3000
numGaussTri2=60000
numLeavesTri3=6000
numGaussTri3=80000


26 changes: 26 additions & 0 deletions egs/material/s5/conf/lang/swahili.conf
@@ -0,0 +1,26 @@
# speech corpora files location
# the user should replace the values with the ones that work for their location
corpus=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A-BUILD_v1.0/
# test audio files to decode
audio_path_analysis1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/ANALYSIS1/audio/
audio_path_analysis2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/ANALYSIS2/audio/
audio_path_dev=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/DEV/audio/
audio_path_eval1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/EVAL1/audio/
audio_path_eval2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/EVAL2/audio/
audio_path_eval3=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1A/EVAL3/audio/
# bitext file location
bitext=$corpus/bitext/MATERIAL_BASE-1A-BUILD_bitext.txt
mono=/home/pkoehn/statmt/data/site-crawl/mono-corpus/mono.2018-04-24.sw
mono2=
# number_mapping is a 2-column file consisting of the numbers written as digits (1st column) and letters (2nd column)
number_mapping=/home/pkoehn/experiment/material-asr-so-en/scripts/swahili_1_9999.txt
# Acoustic model parameters
numShorestUtts=40000
numLeavesTri1=2000
numGaussTri1=30000
numLeavesTri2=3000
numGaussTri2=60000
numLeavesTri3=6000
numGaussTri3=80000


26 changes: 26 additions & 0 deletions egs/material/s5/conf/lang/tagalog.conf
@@ -0,0 +1,26 @@
# speech corpora files location
# the user should replace the values with the ones that work for their location
corpus=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B-BUILD_v1.0/
# test audio files to decode
audio_path_analysis1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/ANALYSIS1/audio/
audio_path_analysis2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/ANALYSIS2/audio/
audio_path_dev=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/DEV/audio/
audio_path_eval1=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/EVAL1/audio/
audio_path_eval2=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/EVAL2/audio/
audio_path_eval3=/export/corpora5/MATERIAL/IARPA_MATERIAL_BASE-1B/EVAL3/audio/
# bitext file location
bitext=$corpus/bitext/MATERIAL_BASE-1B-BUILD_bitext.txt
mono=/home/pkoehn/statmt/data/site-crawl/mono-corpus/mono.2018-04-24.tl
mono2=
# number_mapping is a 2-column file consisting of the numbers written as digits (1st column) and letters (2nd column)
number_mapping=
# Acoustic model parameters
numShorestUtts=45000
numLeavesTri1=4000
numGaussTri1=60000
numLeavesTri2=5000
numGaussTri2=80000
numLeavesTri3=7000
numGaussTri3=100000
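
Because the language configs are plain shell assignments, a quick check of which resources are configured can be done by sourcing them. The following is a minimal sketch under that assumption; `missing_conf_vars` is a hypothetical helper, not something the recipe provides, and whether an empty value is an error depends on how the recipe scripts use it.

```shell
# Hypothetical check (not part of the recipe): source a language config in
# a subshell and print the names of the listed variables that are empty or
# unset. The subshell keeps the config from leaking into the caller.
missing_conf_vars() {
  conf=$1
  shift
  (
    . "$conf" || exit 1
    for name in "$@"; do
      eval "value=\${$name:-}"
      [ -n "$value" ] || echo "$name"
    done
  )
}

# Example call (the path is relative to the s5 directory):
#   missing_conf_vars ./conf/lang/tagalog.conf corpus bitext mono mono2 number_mapping
```

For tagalog.conf as shown, `mono2` and `number_mapping` are assigned empty values, so the example call would list those two names.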


2 changes: 2 additions & 0 deletions egs/material/s5/conf/mfcc.conf
@@ -0,0 +1,2 @@
--use-energy=false
--sample-frequency=8000
10 changes: 10 additions & 0 deletions egs/material/s5/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--sample-frequency=8000 # most of the files are 8kHz
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=40 # low cutoff frequency for mel bins
--high-freq=-200 # high cutoff frequency, relative to Nyquist of 4000 (=3800)
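
The arithmetic behind the last option can be spelled out: the Nyquist frequency is half the sample rate, and a non-positive `--high-freq` is taken as an offset from it. The sketch below only illustrates that calculation; `effective_high_freq` is a hypothetical helper, not a Kaldi tool.

```shell
# Effective upper cutoff in Hz: non-positive values are offsets from Nyquist.
effective_high_freq() {
  sample_freq=$1
  high_freq=$2
  nyquist=$(( sample_freq / 2 ))
  if [ "$high_freq" -le 0 ]; then
    echo $(( nyquist + high_freq ))
  else
    echo "$high_freq"
  fi
}

effective_high_freq 8000 -200   # values from mfcc_hires.conf; prints 3800
```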
1 change: 1 addition & 0 deletions egs/material/s5/conf/online_cmvn.conf
@@ -0,0 +1 @@
# configuration file for apply-cmvn-online, used in the script ../local/run_online_decoding.sh
1 change: 1 addition & 0 deletions egs/material/s5/conf/plp.conf
@@ -0,0 +1 @@
--sample-frequency=8000
55 changes: 55 additions & 0 deletions egs/material/s5/local/audio2wav_scp.pl
@@ -0,0 +1,55 @@
#!/usr/bin/env perl
#===============================================================================
# Copyright 2017 (Author: Yenda Trmal <jtrmal@gmail.com>)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
#===============================================================================

use strict;
use warnings;
use utf8;


my $sox = `which sox` or die "The sox binary does not exist";
chomp $sox;
my $sph2pipe = `which sph2pipe` or die "The sph2pipe binary does not exist";
chomp $sph2pipe;

while(<STDIN>) {
  chomp;
  my $full_path = $_;
  (my $basename = $full_path) =~ s/.*\///g;

  die "The filename $basename does not match the expected naming pattern!" unless $basename =~ /.*\.(wav|sph)$/;
  (my $ext = $basename) =~ s/.*\.(wav|sph)$/$1/g;
  (my $name = $basename) =~ s/(.*)\.(wav|sph)$/$1/g;


  # name looks like this:
  # MATERIAL_BASE-1A-BUILD_10002_20131130_011225_inLine.sph
  # Please note that the naming pattern must match
  # the pattern in create_datafiles.pl
  $name =~ s/inLine.*/0/g;
  $name =~ s/outLine.*/1/g;
  $name =~ s/_BASE//g;
  $name =~ s/-BUILD//g;

  if ($ext eq "wav") {
    print "$name $sox $full_path -r 8000 -c 1 -b 16 -t wav - downsample|\n";
  } else {
    print "$name $sph2pipe -f wav -p -c 1 $full_path|\n";
  }
}
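
The name rewriting in the loop above can be checked in isolation. The sketch below mirrors the same substitutions, in the same order, with `sed`; `utt_name` is a hypothetical shell helper written for illustration and is not part of the PR.

```shell
# Hypothetical shell equivalent of the name rewriting in audio2wav_scp.pl.
utt_name() {
  base=${1##*/}   # strip the directory part, like s/.*\///
  printf '%s\n' "$base" | sed \
    -e 's/\.wav$//' -e 's/\.sph$//' \
    -e 's/inLine.*/0/' -e 's/outLine.*/1/' \
    -e 's/_BASE//g' -e 's/-BUILD//g'
}

# The example filename from the script's own comment (directory is made up):
utt_name /corpus/MATERIAL_BASE-1A-BUILD_10002_20131130_011225_inLine.sph
```

For that input the helper prints `MATERIAL-1A_10002_20131130_011225_0`: the channel label becomes a numeric suffix (0 for inLine, 1 for outLine) and the `_BASE` and `-BUILD` parts are dropped.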

