[RNNLM] old iteration model cleanup #2885

slckl · 2018-11-28T11:32:44Z

As outlined in issue #2869, rnnlm training currently retains model files for ALL iterations, which eats storage space very quickly. This PR implements an optional automatic cleanup of rnnlm model files (off by default).

Two cleanup strategies have been implemented: "keep_latest", which retains only a certain number of the freshest iterations, and "keep_best", which looks at the objective function (objf) on the dev set and retains only a certain number of iterations with the best objf values.
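Here is a minimal Python sketch of the selection step, to show how the two strategies differ. It is not the PR's actual script. The function name `iterations_to_delete` is hypothetical, and the sketch assumes the caller has already collected a dev-set objf for every candidate iteration. Higher (less negative) objf is better, since objf is a log-probability.

```python
from typing import Dict, List


def iterations_to_delete(objf_by_iter: Dict[int, float],
                         strategy: str, keep: int) -> List[int]:
    """Hypothetical helper: return iterations whose model files may be removed.

    objf_by_iter maps iteration number -> dev-set objf (higher is better).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    iters = sorted(objf_by_iter)
    if strategy == "keep_latest":
        # Keep the `keep` most recent iterations.
        kept = set(iters[-keep:]) if keep > 0 else set()
    elif strategy == "keep_best":
        # Keep the `keep` iterations with the highest dev-set objf.
        ranked = sorted(iters, key=lambda i: objf_by_iter[i], reverse=True)
        kept = set(ranked[:keep])
    else:
        raise ValueError("unknown cleanup strategy: " + strategy)
    return [i for i in iters if i not in kept]
```

Take objf = {1: -5.3, 2: -5.0, 3: -5.2, 4: -5.1, 5: -5.15} and keep=2. keep_latest deletes iterations 1, 2 and 3. keep_best deletes 1, 3 and 5, keeping the two best iterations, 2 and 4.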

What is implemented:

Cleanup logic in a separate Python 3 script, which decides which iterations to clean up based on what the compute_prob log files say about them. The rnnlm-compute-prob and rnnlm-train processes run in parallel, and often one finishes before the other. Looking at the log files seemed the simplest way to figure out which iterations are safe to clean up.
Optional invocation of the cleanup script from train_rnnlm.sh, with extra cleanup arguments added.
After training is done, the get_best_model.py script picks out the best model based on dev-set perplexity. This script has been updated to ignore iterations that have a compute_prob log but no model files.
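Here is a sketch of the "safe to clean up" test. It assumes the job footer that run.pl-style wrappers append to each log, e.g. "# Finished at Wed Dec 5 16:59:07 CST 2018 with status 0". Under that assumption, an iteration counts as evaluated only once its compute_prob.N.log ends with a "with status 0" line. The function names are hypothetical, and the real script may key off a different part of the log, such as the objf line.

```python
import os
import re
from typing import List

# Assumed footer written when a logged job ends, e.g.
# "# Finished at Wed Dec 5 16:59:07 CST 2018 with status 0"
FINISHED_RE = re.compile(r"^# Finished at .* with status (\d+)\s*$", re.MULTILINE)


def compute_prob_finished(log_text: str) -> bool:
    """True if the log shows the compute-prob job exited with status 0."""
    m = FINISHED_RE.search(log_text)
    return m is not None and int(m.group(1)) == 0


def finished_iterations(log_dir: str) -> List[int]:
    """Iterations whose compute_prob.N.log reports a successful finish.

    Only these are cleanup candidates: while compute-prob is still running
    on an iteration, that iteration's model files must not be deleted.
    """
    done = []
    for name in os.listdir(log_dir):
        m = re.fullmatch(r"compute_prob\.(\d+)\.log", name)
        if m is None:
            continue  # not a compute_prob log (e.g. train.N.M.log)
        path = os.path.join(log_dir, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            if compute_prob_finished(f.read()):
                done.append(int(m.group(1)))
    return sorted(done)
```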

What is not done:

I have not touched any recipes or anything of the sort, so they should all work just as before. However, there is also no demonstration anywhere of how to use the cleanup functionality.

A sample invocation with cleanup enabled, run from a training directory, looks like this:

# --cleanup, --cleanup_strategy and --cleanup_keep_iters are the new cleanup options
rnnlm/train_rnnlm.sh --num-jobs-initial 1 --num-jobs-final 1 \
                    --embedding_l2 $embedding_l2 \
                    --use_gpu_for_diagnostics true \
                    --cleanup true --cleanup_strategy "keep_latest" \
                    --cleanup_keep_iters 10 \
                    --stage $train_stage --num-epochs $epochs --cmd "$cmd --mem 16G" $dir

For what it's worth, I have verified that this works on a private problem, but I haven't tried it with any of the out-of-the-box Kaldi recipes.

…rob has finished

…inor cleanup

…te_prob log instead of exiting on them

…te_prob log files

…e model files present

danpovey · 2018-11-28T18:16:27Z

Thanks! @keli78 and @GaofengCheng, I believe you may both be running RNNLM training. Would you mind testing this for me, i.e. running your experiments with these changes? Don't forget to let me know once you have done so.

keli78 · 2018-11-28T19:43:26Z

Sure. I'm running a test for this now.

Ke

GaofengCheng · 2018-11-29T01:48:25Z

Trying today.

slckl · 2018-11-29T14:05:02Z

Discovered a bug in this: apparently "0.raw" (and maybe something else) is special, but this treats it as just another iteration and cleans it up. Committing a fix in a moment.

Sorry about that.

slckl · 2018-11-29T14:32:56Z

The fix is committed: files that belong to iteration 0 are now never considered for cleanup.

GaofengCheng · 2018-11-30T02:08:55Z

@slckl
It printed these warnings:

Training neural net (pass 25)
rnnlm/get_best_model.py: warning: no model files found for iteration 1. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 2. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 3. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 4. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 5. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 6. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 7. Skipping.
rnnlm/get_best_model.py: warning: no model files found for iteration 8. Skipping.

slckl · 2018-11-30T06:54:56Z

Yeah, that's expected. get_best_model.py is the script that goes over ALL the iterations and finds the best model. The warning just says that the given iteration has a compute_prob log file but its model files are gone, presumably because they were cleaned up.

I suppose that really shouldn't be a warning as it's confusing. I'll remove it.
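The behaviour described here skips iterations whose model files were cleaned up instead of warning about them. A sketch of it follows. `best_iteration` is a hypothetical helper, not the code in get_best_model.py. It assumes the set of iterations that still have model files on disk has been computed separately.

```python
from typing import Dict, Set


def best_iteration(objf_by_iter: Dict[int, float],
                   iters_with_models: Set[int]) -> int:
    """Hypothetical helper: pick the iteration with the highest dev-set objf
    among those whose model files still exist; others are skipped silently."""
    candidates = {i: o for i, o in objf_by_iter.items() if i in iters_with_models}
    if not candidates:
        # No iteration has model files left, so there is nothing to choose.
        raise RuntimeError("could not get best iteration")
    return max(candidates, key=candidates.get)
```

The error branch matters. If the "has model files" check looks for the wrong file names, every iteration is skipped and the selection fails even though the models are on disk.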

…sing warnings as, given cleanup, it's normal for model files to be absent for most iteration

GaofengCheng · 2018-12-01T12:05:23Z

Thanks for your commit. I'm checking and re-running to test. Because of the warning, the previous code did not generate the final.mdl. It will take another day. Thanks for your work.

keli78 · 2018-12-02T15:47:08Z

Hi @slckl, I got many warnings like:
rnnlm/rnnlm_cleanup.py: warning: could not parse objective function from exp/rnnlm_lstm_960cleaned_1a/log/compute_prob.25.log
rnnlm/rnnlm_cleanup.py: warning: could not parse objective function from exp/rnnlm_lstm_960cleaned_1a/log/compute_prob.20.log
......
These log files look normal, though.
After training finished, I got an error:
rnnlm/get_best_model.py: error: could not get best iteration.
I set cleanup_keep_iters to 5.

slckl · 2018-12-02T19:24:47Z

Hi @keli78, could you please copy & paste the contents of one of the compute_prob.XX.log files that caused the warning? The objective-function parsing should be the same as in get_best_model.py, and it sounds like that also fails.

Fortunately for me, the get_best_model.py code is by @danpovey, so maybe it's not my code at fault here, hehe.

But I'd have to see the log file to tell for sure.

slckl · 2018-12-05T07:01:55Z

Just finished a week-long training of a fairly vanilla tdnn-lstm model using cleanup. After all the changes, the final model was picked out just fine.

@keli78 any updates on those logs?
I can fix the log parsing, if that's where the issue is.

GaofengCheng · 2018-12-05T09:05:41Z

@slckl
I used your latest commit, and it failed as follows:

rnnlm/train_rnnlm.sh: will train for 26 iterations
Training neural net (pass 0)
Training neural net (pass 1)
Training neural net (pass 2)
Training neural net (pass 3)
Training neural net (pass 4)
Training neural net (pass 5)
Training neural net (pass 6)
Training neural net (pass 7)
Training neural net (pass 8)
Training neural net (pass 9)
Training neural net (pass 10)
Training neural net (pass 11)
Training neural net (pass 12)
Training neural net (pass 13)
Training neural net (pass 14)
Training neural net (pass 15)
Training neural net (pass 16)
Training neural net (pass 17)
Training neural net (pass 18)
Training neural net (pass 19)
Training neural net (pass 20)
Training neural net (pass 21)
Training neural net (pass 22)
Training neural net (pass 23)
Training neural net (pass 24)
Training neural net (pass 25)
rnnlm/get_best_model.py: error: could not get best iteration.

I replaced my local scripts/rnnlm directory with yours.
Am I doing anything wrong?

slckl · 2018-12-05T09:17:52Z

That should have worked, hmmm. Can you please paste a sample compute_prob.XX.log from this experiment?
I'm particularly interested in lines like:

LOG (rnnlm-compute-prob[5.5.118~1-2489b]:PrintStatsOverall():rnnlm-core-training.cc:118) Overall objf is (-6.574 + -0.0367) = -6.611 over 9.471e+05 words (weighted) in 246 minibatches; exact = (-6.574 + -0.01359) = -6.587

GaofengCheng · 2018-12-05T09:19:51Z

@slckl FYI

# Running on cs04
# Started at Wed Dec 5 14:11:49 CST 2018
# rnnlm-get-egs --eos-symbol=200006 --bos-symbol=200005 --brk-symbol=200007 --vocab-size=200008 exp/rnnlm_lstm_1a_new_scripts/text/dev.txt ark:- | rnnlm-compute-prob exp/rnnlm_lstm_1a_new_scripts/25.raw "rnnlm-get-word-embedding exp/rnnlm_lstm_1a_new_scripts/word_feats.txt exp/rnnlm_lstm_1a_new_scripts/feat_embedding.25.mat -|" ark:- 
rnnlm-compute-prob exp/rnnlm_lstm_1a_new_scripts/25.raw 'rnnlm-get-word-embedding exp/rnnlm_lstm_1a_new_scripts/word_feats.txt exp/rnnlm_lstm_1a_new_scripts/feat_embedding.25.mat -|' ark:- 
rnnlm-get-egs --eos-symbol=200006 --bos-symbol=200005 --brk-symbol=200007 --vocab-size=200008 exp/rnnlm_lstm_1a_new_scripts/text/dev.txt ark:- 
LOG (rnnlm-compute-prob[5.5]:SelectGpuId():cu-device.cc:128) Manually selected to compute on CPU.
LOG (rnnlm-get-egs[5.5]:Process():rnnlm-example.cc:703) Processed 5624 lines of input.
rnnlm-get-word-embedding exp/rnnlm_lstm_1a_new_scripts/word_feats.txt exp/rnnlm_lstm_1a_new_scripts/feat_embedding.25.mat - 
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 0 to 9 is (-5.043 + -0.03641) = -5.079 over 3.784e+04 words (weighted); exact = (-5.043 + -0.01707) = -5.06
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 10 to 19 is (-5.025 + -0.03117) = -5.056 over 3.795e+04 words (weighted); exact = (-5.025 + -0.01265) = -5.037
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 20 to 29 is (-4.999 + -0.03596) = -5.035 over 3.778e+04 words (weighted); exact = (-4.999 + -0.01689) = -5.016
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 30 to 39 is (-5.009 + -0.03428) = -5.043 over 3.798e+04 words (weighted); exact = (-5.009 + -0.01588) = -5.025
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 40 to 49 is (-4.974 + -0.03614) = -5.01 over 3.781e+04 words (weighted); exact = (-4.974 + -0.01762) = -4.991
LOG (rnnlm-get-egs[5.5]:~RnnlmExampleCreator():rnnlm-example.cc:347) Combined 5624/9194 chunks/sequences into 52 minibatches (0 chunks left over)
LOG (rnnlm-get-egs[5.5]:~RnnlmExampleCreator():rnnlm-example.cc:352) Overall there were 22.2195 words per chunk; 176.808 chunks per minibatch.
LOG (rnnlm-compute-prob[5.5]:PrintStatsThisInterval():rnnlm-core-training.cc:100) Objf for minibatches 50 to 51 is (-4.957 + -0.03568) = -4.992 over 4211 words (weighted); exact = (-4.957 + -0.01686) = -4.974
LOG (rnnlm-compute-prob[5.5]:PrintStatsOverall():rnnlm-core-training.cc:118) Overall objf is (-5.009 + -0.03481) = -5.044 over 1.936e+05 words (weighted) in 52 minibatches; exact = (-5.009 + -0.01604) = -5.025
LOG (rnnlm-compute-prob[5.5]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.0413 seconds taken in nnet3 compilation total (breakdown: 0.0125 compilation, 0.0214 optimization, 0.000981 shortcut expansion, 0.0042 checking, 0 computing indexes, 0.00228 misc.) + 0 I/O.
-5.04362
# Accounting: time=10038 threads=1
# Finished at Wed Dec 5 16:59:07 CST 2018 with status 0
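The "Overall objf" line in this log is well-formed, which is why the earlier "could not parse objective function" warning was surprising. Below is a minimal sketch of parsing that summary line, assuming only the format shown in this thread. `parse_overall_objf` is hypothetical. The thread does not say which of the two numbers the real scripts compare: the weighted objf or the "exact" one.

```python
import re
from typing import Optional, Tuple

# Matches the summary line printed by rnnlm-compute-prob, e.g.
# "Overall objf is (-5.009 + -0.03481) = -5.044 over 1.936e+05 words
#  (weighted) in 52 minibatches; exact = (-5.009 + -0.01604) = -5.025"
OVERALL_RE = re.compile(
    r"Overall objf is \([^)]*\) = (\S+) over .*; exact = \([^)]*\) = (\S+)")


def parse_overall_objf(log_text: str) -> Optional[Tuple[float, float]]:
    """Return (objf, exact_objf) from the last 'Overall objf' line, or None.

    Per-interval lines ("Objf for minibatches 0 to 9 is ...") are ignored.
    """
    matches = OVERALL_RE.findall(log_text)
    if not matches:
        return None
    objf, exact = matches[-1]
    return float(objf), float(exact)
```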

slckl · 2018-12-05T09:28:33Z

That looks fine.
Are the model files actually present in the training dir? Perhaps cleanup has cleaned up too much, as now get_best_model.py silently skips iterations without model files.

GaofengCheng · 2018-12-05T09:38:55Z

Your scripts didn't generate report.txt.
Is this a bug?

slckl · 2018-12-05T09:43:53Z

report.txt is generated after get_best_model.py does its thing, so if get_best_model.py fails, train_rnnlm.sh never gets that far. It is a consequence of the same failure, not a separate bug.

Can you confirm that word_embedding.25.mat and 25.raw are actually still present in the training directory?

GaofengCheng · 2018-12-05T09:47:14Z

No.
Only
feat_embedding.25.mat
and
25.raw

Maybe you confused word_embedding and feat_embedding?

slckl · 2018-12-05T09:50:29Z

Ah, I changed get_best_model.py to look for word_embedding files to verify that an iteration still has its model files. Since your setup has no such files, it skips all iterations. Let me fix that real quick.
I assume that you also have feat_embedding.24.mat, feat_embedding.23.mat etc all the way down to feat_embedding.1.mat?

GaofengCheng · 2018-12-05T09:52:13Z

Yes, I have.

slckl · 2018-12-05T09:52:48Z

Yup, so cleanup didn't consider those either. Sorry about that.

Fixing right now.

…st_model.py

slckl · 2018-12-05T10:03:54Z

There, that should now consider feat_embedding files as well. Could you please retry this with latest changes?

You can just resume training at, say, iteration 24, then cleanup and get_best_model.py should both run, and everything should be ok.
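The two bugs fixed in this thread both come down to how file names are mapped to iterations. One was iteration 0 being cleaned up; the other was feat_embedding.N.mat files being ignored. Here is a minimal sketch of such a mapping, covering only the file names that appear in this thread. `cleanup_iteration` and its regular expression are hypothetical, not the PR's code.

```python
import re
from typing import Optional

# Per-iteration model files mentioned in this thread: N.raw plus either
# word_embedding.N.mat or feat_embedding.N.mat (feature-based RNNLMs).
MODEL_FILE_RE = re.compile(r"^(?:(\d+)\.raw|(?:word|feat)_embedding\.(\d+)\.mat)$")


def cleanup_iteration(filename: str) -> Optional[int]:
    """Iteration a model file belongs to, or None if it must never be deleted."""
    m = MODEL_FILE_RE.match(filename)
    if m is None:
        return None  # not a per-iteration model file (e.g. word_feats.txt)
    it = int(m.group(1) or m.group(2))
    # Iteration 0 (e.g. 0.raw) is special and is never cleaned up.
    return None if it == 0 else it
```

A file is then a deletion candidate only if `cleanup_iteration` returns an iteration that the chosen strategy does not keep.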

keli78 · 2018-12-06T21:19:26Z

Hi @slckl I tested your newest version and the previous error is gone.

slckl · 2018-12-07T07:23:21Z

Good to hear, thanks for testing.
I tried running a feature-based LM training yesterday, and cleanup now works for me there as well.


slckl added 9 commits November 26, 2018 14:45

* initial WIP version of rnnlm_cleanup.py (92e111c)
* working version of both keep_latest and keep_best (afe09ae)
* cleanup now only considers those iterations for which rnnlm_compute_prob has finished (085b400)
* rnnlm_cleanup.py: added copyright/license header, some comments and minor cleanup (430b91e)
* train_rnnlm.sh: initial cleanup script integration (c50364e)
* rnnlm_cleanup.py: get_compute_prob_info now skips files without compute_prob log instead of exiting on them (d124106)
* rnnlm_cleanup.py: iteration model files are now listed based on compute_prob log files (0536107)
* train_rnnlm.sh: fixed cleanup script invocation (1d692bd)
* get_best_model.py: now it only considers iterations that do still have model files present (25291d5)

* rnnlm_cleanup.py: never touch files belonging to iteration 0 (fdf1a12)
* get_best_model.py: model-less iterations no longer trigger confusing warnings, since with cleanup enabled it is normal for most iterations to have no model files (36a7481)
* fixed "feat_embedding" files not being considered during cleanup and in get_best_model.py (23ea8ad)
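The commit messages above describe the cleanup rules only in outline. As a rough illustration, here is a minimal Python 3 sketch of the selection logic they imply: only iterations whose compute_prob log has finished are considered, iteration 0 is never deleted, "keep_latest" keeps the newest N iterations and "keep_best" the N with the best dev-set objf, and best-model selection silently skips iterations whose model files were cleaned up. The function names (`iterations_to_delete`, `best_iteration`, `iteration_files`), the `{it}.raw` / `{type}_embedding.{it}.mat` file-name patterns, and the assumption that a higher objf is better are all hypothetical, not taken from rnnlm_cleanup.py or get_best_model.py.

```python
from typing import Dict, List, Optional, Set


def iteration_files(it: int, embedding_type: str = "word") -> List[str]:
    """Hypothetical per-iteration file names (an assumption, not from the PR).

    embedding_type="feat" covers the feat_embedding case mentioned in 23ea8ad.
    """
    return [f"{it}.raw", f"{embedding_type}_embedding.{it}.mat"]


def iterations_to_delete(objf: Dict[int, float], strategy: str,
                         keep: int) -> Set[int]:
    """Return the iterations whose model files could be removed.

    objf maps iteration -> dev-set objf and should only contain iterations
    whose compute_prob log has finished. Iteration 0 is never a candidate.
    Assumes a higher objf is better.
    """
    candidates = [it for it in objf if it != 0]
    if strategy == "keep_latest":
        # Newest iterations first; everything after the first `keep` goes.
        ranked = sorted(candidates, reverse=True)
    elif strategy == "keep_best":
        # Best objf first; everything after the first `keep` goes.
        ranked = sorted(candidates, key=lambda it: objf[it], reverse=True)
    else:
        raise ValueError(f"unknown cleanup strategy: {strategy}")
    return set(ranked[keep:])


def best_iteration(objf: Dict[int, float],
                   has_model: Set[int]) -> Optional[int]:
    """Pick the best iteration among those whose model files still exist.

    Iterations with a compute_prob log but no model files are skipped
    without a warning, since after cleanup that is the normal case.
    """
    present = [it for it in objf if it in has_model]
    if not present:
        return None
    return max(present, key=lambda it: objf[it])


if __name__ == "__main__":
    # Toy dev-set objf values for finished iterations (made-up numbers).
    objf = {0: -6.1, 1: -5.2, 2: -4.9, 3: -5.0, 4: -4.8}
    print(sorted(iterations_to_delete(objf, "keep_latest", 2)))  # [1, 2]
    print(sorted(iterations_to_delete(objf, "keep_best", 2)))    # [1, 3]
    print(best_iteration(objf, has_model={0, 2, 3}))             # 2
```

Because unfinished iterations never appear in `objf`, an iteration still being evaluated by compute_prob cannot be selected for deletion, which is the concern behind commit 085b400.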

danpovey merged commit b50a4cf into kaldi-asr:master Dec 7, 2018

slckl deleted the rnnlm_iteration_cleanup branch December 8, 2018 16:57

chenzhehuai mentioned this pull request Jun 4, 2019

update (#32) chenzhehuai/kaldi#33 (Closed)


Conversation

* slckl commented Nov 28, 2018
* danpovey commented Nov 28, 2018
* keli78 commented Nov 28, 2018 (edited)
* GaofengCheng commented Nov 29, 2018
* slckl commented Nov 29, 2018
* slckl commented Nov 29, 2018
* GaofengCheng commented Nov 30, 2018 (edited)
* slckl commented Nov 30, 2018 (edited)
* GaofengCheng commented Dec 1, 2018
* keli78 commented Dec 2, 2018
* slckl commented Dec 2, 2018
* slckl commented Dec 5, 2018
* GaofengCheng commented Dec 5, 2018
* slckl commented Dec 5, 2018
* GaofengCheng commented Dec 5, 2018
* slckl commented Dec 5, 2018
* GaofengCheng commented Dec 5, 2018
* slckl commented Dec 5, 2018
* GaofengCheng commented Dec 5, 2018
* slckl commented Dec 5, 2018
* GaofengCheng commented Dec 5, 2018
* slckl commented Dec 5, 2018
* slckl commented Dec 5, 2018
* keli78 commented Dec 6, 2018
* slckl commented Dec 7, 2018


4 participants
