Merged
94 commits
22ea467
added basic file structure
Jul 21, 2017
c011a3f
initial commit for preparation of IAM dataset
aarora8 Jul 21, 2017
296d112
added scripts to extract features from image
Jul 21, 2017
f6d6727
Merge remote-tracking branch 'aarora8/IAM_eng' into iam_ocr
Jul 21, 2017
b5fbd16
removed egs/iam directory and moved necessary files to egs/iam_en to …
Jul 21, 2017
f4ca585
adding code for writing to text, wav.scp and utt2spk file
aarora8 Jul 21, 2017
82c8596
Merge remote-tracking branch 'aarora8/IAM_eng' into iam_ocr
Jul 24, 2017
cdbbefb
added code to perform splits
Jul 24, 2017
b787cfe
adding code for creating train, validation and test sets and also crea…
aarora8 Jul 25, 2017
910d710
changed so train, test, valid examples are separated
Jul 25, 2017
8ef99d6
merged aarora8/IAM_eng and fixed some minor things
Jul 25, 2017
696b21c
accidentally split spk2utt incorrectly
Jul 25, 2017
dd208c3
adding changes for setting desired variance floor value
aarora8 Jul 25, 2017
3911054
cosmetic fix
aarora8 Jul 26, 2017
4938413
reading lines.txt and creating a text dictionary
aarora8 Jul 26, 2017
170f627
using ascii/lines files instead of xml
aarora8 Jul 26, 2017
aaf9dce
WIP added scripts to prepare dict and lexicon
Jul 27, 2017
2e1b600
fixed bug where due to rounding errors the images won't be resized ex…
Jul 27, 2017
3bcc639
fixing bug in initializing variance floor vector
aarora8 Jul 27, 2017
2159a85
fixing bug in initializing variance floor vector creating vector in m…
aarora8 Jul 28, 2017
99173f8
added bugfixes that aarora8/IAM_eng made
Jul 28, 2017
1549c8b
adding code for preparing character based language model
aarora8 Jul 28, 2017
f455eff
adding files for character based language model
aarora8 Jul 28, 2017
2f7f7d9
added decoding to run.sh and the necessary files for decoding into local
Jul 31, 2017
00f299c
adding decoding changes from ChunChiehChang and adding changes for tr…
aarora8 Aug 1, 2017
f748891
minor changes so that testing scripts will be easier
Aug 1, 2017
19e2e79
added scripts to use characters instead of words for models. Also ren…
Aug 10, 2017
424bb2e
added in comments the ability to use individual words in training/tes…
Aug 10, 2017
32b7030
changes to creating lexicon
Aug 11, 2017
b591fd3
added scripts to ensure lm decodes one word without uniform distribut…
Aug 15, 2017
ca1aa1a
modified script to run on individual words instead of lines of text
Aug 15, 2017
5b54ad3
enabled silence models to have two states
Aug 15, 2017
c519d25
code for handwriting word recognition for database IAM
aarora8 Sep 9, 2017
54c5e1b
fixing reset
aarora8 Sep 11, 2017
e0aed5d
updating configuration, [wip] adding code for using brown and lob corpus
aarora8 Sep 13, 2017
de00d4e
minor fix in train_lm code
aarora8 Sep 13, 2017
05153cc
adding modification for getting vocab from corpus, removing comments …
aarora8 Sep 21, 2017
b45598c
adding modifications for open vocab task
aarora8 Oct 2, 2017
83ba89f
cosmetic fix
aarora8 Oct 2, 2017
cdd45a4
adding modifications for upper case to lower case and cosmetic fix
aarora8 Oct 3, 2017
4ba261b
removing exit from run.sh
aarora8 Oct 3, 2017
8593c98
adding modifications for including validation data in training
aarora8 Oct 7, 2017
55ef746
bug fix thanks Yiwen Shao
aarora8 Oct 7, 2017
a120cde
removed variance floor as it did not seem to help
Oct 7, 2017
cb4a5d3
bug fix
aarora8 Oct 7, 2017
4b0a999
removed unnecessary run file
Oct 7, 2017
acc7dc8
modified score to use steps/scoring/scoring_kaldi_wer.sh
Oct 7, 2017
7127676
updated scripts to most recent version. Finished the run.sh
Oct 7, 2017
67181ea
Merge remote-tracking branch 'aarora8/iam' into iam_ocr
Oct 7, 2017
5be76ec
bug fix
aarora8 Oct 7, 2017
5ab4289
removing .txt
aarora8 Oct 7, 2017
40a5bd8
Merge remote-tracking branch 'aarora8/iam' into iam_ocr
Oct 7, 2017
86f69f8
Merge remote-tracking branch 'upstream/master' into iam_ocr
Oct 30, 2017
2a59b3a
OCR: Add IAM corpus with unk decoding support (#3)
aarora8 Oct 30, 2017
13e0a5b
Add a new English OCR database 'UW3'
ChunChiehChang Nov 2, 2017
fdd0953
Some minor fixes re IAM corpus
hhadian Nov 15, 2017
aa7c19a
Fix an issue in IAM chain recipes + add a new recipe (#6)
aarora8 Nov 20, 2017
856f0eb
removed part to prepare data for validation sets and added CER to score
Nov 29, 2017
025ce1d
merged branch from hhadian/ocr and resolved some conflicts
Nov 29, 2017
4e085a4
Some fixes based on the pull request review
aarora8 Dec 22, 2017
e243bee
Various fixes + cleaning on IAM
hhadian Dec 22, 2017
0e4f613
Fix LM estimation and add extended dictionary + other minor fixes
hhadian Dec 24, 2017
6f790ed
Add README for IAM
hhadian Dec 24, 2017
96b51d4
Add output filter for scoring
hhadian Dec 24, 2017
b914da2
Fix a bug RE switch to python3
hhadian Dec 24, 2017
05fb12e
Add updated results + minor fixes
hhadian Dec 24, 2017
1e3a8c4
Remove unk decoding -- gives almost no gain
hhadian Dec 24, 2017
a08725e
Add UW3 OCR database
ChunChiehChang Dec 31, 2017
af85099
merged from hhadian ocr
Jan 1, 2018
e34fc8e
forgot to remove last of variance floor option
Jan 1, 2018
16a9104
different number of states for punctuations
Jan 10, 2018
44d3ce2
removed commented out code
Jan 12, 2018
d873784
adding updated results
Jan 12, 2018
e52df3c
removed unnecessary folder
Jan 12, 2018
bbc7b4c
removed some changes from variance floor option that is now removed
Jan 12, 2018
018600c
moved s5 to v1
Jan 19, 2018
ab5a51c
merge master branch to get hhadian commits
Jan 19, 2018
7e1a8a2
moved s5 to wrong location
Jan 19, 2018
f82aaa4
removed unused files
Jan 19, 2018
26bf5b4
changed run to use local prepare_lang.sh
Jan 19, 2018
bb6a073
Merge remote-tracking branch 'origin/master' into iam_ocr
Feb 26, 2018
340de0c
added unk and added wellington corpus. removed LOB because IAM and LO…
Mar 2, 2018
048b2e5
forgot to add some stuff for unk and also for different topo for punc…
Mar 2, 2018
52c6721
Add initial scripts for e2e ocr - not cleaned
hhadian Mar 19, 2018
92a5866
Add e2e chain script
hhadian Mar 19, 2018
ea839ad
Some fixes
hhadian Mar 19, 2018
f5cbb24
Some cleaning
hhadian Mar 19, 2018
aa6f698
removed the test words from LOB corpus. Previous commits just removed…
Mar 19, 2018
d7aa22b
Merge remote-tracking branch 'hhadian/e2e_ocr' into iam_ocr
Mar 20, 2018
c749a5b
merged upstream master
Apr 12, 2018
639d76b
adding wellington corpus and fixing some merge issues
Apr 12, 2018
e48c0ef
forgot to fix all merge conflicts
Apr 12, 2018
91ebb25
removing some unneeded files
Apr 12, 2018
750e11c
added decode_gmm option. Removed unnecessary file
Apr 13, 2018
34 changes: 34 additions & 0 deletions egs/iam/v1/local/prepare_data.sh
@@ -43,13 +43,15 @@ xml=data/local/xml
ascii=data/local/ascii
bcorpus=data/local/browncorpus
lobcorpus=data/local/lobcorpus
wcorpus=data/local/wellingtoncorpus
data_split_info=data/local/largeWriterIndependentTextLineRecognitionTask
lines_url=http://www.fki.inf.unibe.ch/DBs/iamDB/data/lines/lines.tgz
xml_url=http://www.fki.inf.unibe.ch/DBs/iamDB/data/xml/xml.tgz
data_split_info_url=http://www.fki.inf.unibe.ch/DBs/iamDB/tasks/largeWriterIndependentTextLineRecognitionTask.zip
ascii_url=http://www.fki.inf.unibe.ch/DBs/iamDB/data/ascii/ascii.tgz
brown_corpus_url=http://www.sls.hawaii.edu/bley-vroman/brown.txt
lob_corpus_url=http://ota.ox.ac.uk/text/0167.zip
wellington_corpus_loc=/export/corpora5/Wellington/WWC/
mkdir -p $download_dir data/local

# download and extract images and transcriptions
@@ -124,6 +126,38 @@ else
echo "$0: Done downloading the Brown text corpus"
fi

if [ -d $wcorpus ]; then
echo "$0: Not copying Wellington corpus as it is already there."
else
mkdir -p $wcorpus
cp -r $wellington_corpus_loc/. $wcorpus

# Combine Wellington corpora and replace some of their annotations
cat data/local/wellingtoncorpus/Section{A,B,C,D,E,F,G,H,J,K,L}.txt | \
cut -d' ' -f3- | sed "s/^[ \t]*//" > data/local/wellingtoncorpus/Wellington_annotated.txt

cat data/local/wellingtoncorpus/Wellington_annotated.txt | python3 <(
cat << EOF
import sys, io, re;
from collections import OrderedDict;
sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf8");
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8");
dict=OrderedDict([("^",""), ("|",""), ("_",""), ("*0",""), ("*1",""), ("*2",""), ("*3",""), ("*4",""),
("*5",""), ("*6",""), ("*7",""), ("*8",""), ("*9",""), ("*@","°"), ("**=",""), ("*=",""),
("*+$",""), ("$",""), ("*+","£"), ("*-","-"), ("*/","*"), ("*|",""), ("*{","{"), ("*}","}"),
("**#",""), ("*#",""), ("*?",""), ("**\"","\""), ("*\"","\""), ("**'","'"), ("*'","'"),
("*<",""), ("*>",""), ("**[",""), ("**]",""), ("**;",""), ("*;",""), ("**:",""), ("*:",""),
("\\\0",""), ("\\\15",""), ("\\\1",""), ("\\\2",""), ("\\\3",""), ("\\\6",""), ("\\\",""),
("{0",""), ("{15",""), ("{1",""), ("{2",""), ("{3",""), ("{6","")]);
pattern = re.compile("|".join(re.escape(key) for key in dict.keys()) + "|[^\\*]\\}");
dict["}"]="";
[sys.stdout.write(pattern.sub(lambda x: dict[x.group()[1:]] if re.match('[^\\*]\\}', x.group()) else dict[x.group()], line)) for line in sys.stdin];
EOF
) > data/local/wellingtoncorpus/Wellington_annotation_removed.txt

echo "$0: Done copying Wellington corpus"
fi

mkdir -p data/{train,test,val}
file_name=largeWriterIndependentTextLineRecognitionTask

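The block added above strips Wellington markup by building a single regular expression that alternates over every annotation token (longer tokens such as "**=" listed before their prefixes such as "*=") and replacing each match through a lookup table. Below is a minimal standalone sketch of the same idea; the token subset is a small illustrative selection, not the full Wellington mapping.

#!/usr/bin/env python3
# Sketch of the annotation-stripping approach used in prepare_data.sh:
# one alternation regex over all markup tokens, longest first, with each
# match replaced via a lookup table. The token subset is illustrative only.
import re
import sys
from collections import OrderedDict

# Order matters: longer tokens must precede their prefixes ("**=" before "*=").
subs = OrderedDict([
    ("**=", ""), ("*=", ""),
    ("*+", "£"), ("*-", "-"), ("*@", "°"),
    ("^", ""), ("|", ""), ("_", ""),
])
pattern = re.compile("|".join(re.escape(token) for token in subs))

def strip_annotations(line):
    # Replace every matched markup token with its mapped value.
    return pattern.sub(lambda m: subs[m.group()], line)

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(strip_annotations(line))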
3 changes: 2 additions & 1 deletion egs/iam/v1/local/train_lm.sh
@@ -62,6 +62,7 @@ if [ $stage -le 0 ]; then
local/remove_test_utterances_from_lob.py data/test/text data/val/text \
> ${dir}/data/text/lob.txt
cat data/local/browncorpus/brown.txt >> ${dir}/data/text/brown.txt
cat data/local/wellingtoncorpus/Wellington_annotation_removed.txt >> ${dir}/data/text/wellington.txt

# use the validation data as the dev set.
# Note: the name 'dev' is treated specially by pocolm, it automatically
@@ -81,7 +82,7 @@ if [ $stage -le 0 ]; then
cut -d " " -f 2- < data/test/text > ${dir}/data/real_dev_set.txt

# get the wordlist from IAM text
cat ${dir}/data/text/{iam,lob,brown}.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c | sort -bnr > ${dir}/data/word_count
cat ${dir}/data/text/{iam,lob,brown,wellington}.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c | sort -bnr > ${dir}/data/word_count
head -n $vocab_size ${dir}/data/word_count | awk '{print $2}' > ${dir}/data/wordlist
fi

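The word_count/wordlist step above counts whitespace-separated tokens across all the LM text sources and keeps the vocab_size most frequent words. A rough Python equivalent is sketched below; the file names and vocabulary size are illustrative assumptions, not values from the recipe.

#!/usr/bin/env python3
# Rough equivalent of the word_count/wordlist pipeline in train_lm.sh:
# count whitespace-separated tokens over the LM sources and keep the top
# vocab_size words. File names and vocab_size are illustrative assumptions.
from collections import Counter

sources = ["iam.txt", "lob.txt", "brown.txt", "wellington.txt"]
vocab_size = 50000

counts = Counter()
for path in sources:
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

with open("wordlist", "w", encoding="utf-8") as out:
    for word, _ in counts.most_common(vocab_size):
        out.write(word + "\n")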
11 changes: 6 additions & 5 deletions egs/iam/v1/run.sh
@@ -7,6 +7,7 @@
set -e
stage=0
nj=20
decode_gmm=false
username=
password=
# iam_database points to the database path on the JHU grid. If you have not
@@ -78,7 +79,7 @@ if [ $stage -le 4 ]; then
data/lang exp/mono
fi

if [ $stage -le 5 ]; then
if [ $stage -le 5 ] && $decode_gmm; then
utils/mkgraph.sh --mono data/lang_test exp/mono exp/mono/graph

steps/decode.sh --nj $nj --cmd $cmd exp/mono/graph data/test \
@@ -93,7 +94,7 @@ if [ $stage -le 6 ]; then
exp/mono_ali exp/tri
fi

if [ $stage -le 7 ]; then
if [ $stage -le 7 ] && $decode_gmm; then
utils/mkgraph.sh data/lang_test exp/tri exp/tri/graph

steps/decode.sh --nj $nj --cmd $cmd exp/tri/graph data/test \
@@ -109,7 +110,7 @@ if [ $stage -le 8 ]; then
data/train data/lang exp/tri_ali exp/tri2
fi

if [ $stage -le 9 ]; then
if [ $stage -le 9 ] && $decode_gmm; then
utils/mkgraph.sh data/lang_test exp/tri2 exp/tri2/graph

steps/decode.sh --nj $nj --cmd $cmd exp/tri2/graph \
@@ -124,7 +125,7 @@ if [ $stage -le 10 ]; then
data/train data/lang exp/tri2_ali exp/tri3
fi

if [ $stage -le 11 ]; then
if [ $stage -le 11 ] && $decode_gmm; then
utils/mkgraph.sh data/lang_test exp/tri3 exp/tri3/graph

steps/decode_fmllr.sh --nj $nj --cmd $cmd exp/tri3/graph \
@@ -137,7 +138,7 @@ if [ $stage -le 12 ]; then
fi

if [ $stage -le 13 ]; then
local/chain/run_cnn_1a.sh
local/chain/run_cnn_1a.sh --lang-test lang_unk
fi

if [ $stage -le 14 ]; then