Conversation

@danpovey
Contributor

This pull request is a different fix to the problem Karel was addressing in PR #1615.
It makes the accumulation of timing stats dependent on verbose level >= 1.
@LvHang, can you please test this PR instead of #1615, since this is the one that's a candidate to merge?
However, if it's easy and you're nearly done testing PR #1615, testing both is fine.

Please run the nnet training scripts in the mini-librispeech setup. However, you don't have to run that setup from the start, just the neural net training. When you're done, show me the location; I want to look at the logs.
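
To make the idea concrete, here is a minimal, self-contained C++ sketch (an illustration of the approach only, not the actual Kaldi patch; the struct and variable names below are invented for illustration): timing stats are accumulated only when the verbose level is at least 1, so the GPU synchronization needed for accurate per-operation timing can be skipped in normal runs.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

static int g_verbose_level = 0;  // stands in for the verbose level in a real setup

struct ProfileStats {
  std::unordered_map<std::string, double> seconds;

  // Accumulate elapsed time under 'name', but only when verbose >= 1.
  void Accumulate(const std::string &name,
                  std::chrono::steady_clock::time_point start) {
    if (g_verbose_level < 1) return;  // the gating this PR introduces
    // In the real code a device synchronization would happen before reading
    // the timer, so the elapsed time covers all queued kernels; omitted in
    // this CPU-only sketch.
    seconds[name] += std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
  }
};

int main() {
  ProfileStats stats;
  auto t0 = std::chrono::steady_clock::now();
  // ... a GPU wrapper would do its work here ...
  stats.Accumulate("CuMatrix::AddMat", t0);
  std::cout << "timing entries recorded: " << stats.seconds.size() << "\n";
  return 0;
}
```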

@LvHang
Contributor

LvHang commented May 11, 2017

Sure, I will do that.
About PR #1615:
Today I ran the kaldi master branch and karel's branch, testing each with verbose=0 (the default) and verbose=2.
In karel's verbose=2 case the "Total GPU time" is about 1s on a Tesla M40, while in karel's verbose=0 case it is about 80s on a Tesla K10.G2. That gap is hard to believe.
So I plan to reserve a GPU (qlogin -l gpu=1) and retest them on the same GPU.
Hang

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Oh, I see.
But I still think reserving the same GPU is necessary; different GPU types give very different elapsed times.
For example, when I test kaldi master with verbose=0:
on a K80 the accounting time is 63s
(in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.4.1.log),
and on a K10 it is 92s
(in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.2.2.log).

Assuming I reserve one GPU, which of the following two ways of testing do you think is better?
(1) run all of train.$iteration.$jobs and average the elapsed time, or
(2) just run train.0.1 five times and average the elapsed time.
Hang

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Hi Dan,
I tested karel's version (PR #1615), the kaldi master version, and your version (this PR) separately.
For karel's version, you can get all the files in /export/a11/hlyu/cumatrix_accelerate_by_async/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose{0,1,2}
For kaldi master's version, you can get all the files in
/export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose{0,1,2}
For your version, you can get the files in
/export/a11/hlyu/synchronization_fix/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp.

I checked your verbose setup: verbose_opt = ("--verbose=1" if iter % 20 == 0 and iter > 0 else "").
Since mini_librispeech has only 6 iterations, verbose always stays 0.

Meanwhile, I manually ran one job (train.0.1) for all three versions at the three verbose levels (0, 1, 2) on the same GPU (Tesla K20m). (You can get the scripts and log files in /export/a11/hlyu/synchronization_test/)
dan_verbose0 time=74; dan_verbose1 time=74; dan_verbose2 time=79
karel_verbose0 time=84; karel_verbose1 time=77; karel_verbose2 time=82
master_verbose0 time=79; master_verbose1 time=79; master_verbose2 time=84

Bests,
Hang

@KarelVesely84
Contributor

This is strange, I would expect a larger speedup. Is it 1-GPU training? What is the topology, is it BLSTM?
(For feedforward nets the speedup should be smaller, since >100x fewer kernels are called...)
K.

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Yeah, the result is from the TDNN script, whose topology is similar to a DNN's.
I will try TDNN+LSTM and BLSTM with mini_librispeech today.
Hang

@LvHang
Contributor

LvHang commented May 13, 2017

Hi Dan,
This is a report about nnet3/tdnn_lstm_sp with mini_librispeech.
I tested karel's version (PR #1615), the kaldi master version, and your version (this PR) separately.
For karel's version, you can get all the files in /export/a11/hlyu/cumatrix_accelerate_by_async/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}
For kaldi master's version, you can get all the files in
/export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}
For your version, you can get the files in
/export/a11/hlyu/synchronization_fix/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp.

The nnet3/tdnn_lstm setup with mini_librispeech has 33 iterations.

Meanwhile, I manually ran one job (train.0.1) for all three versions at the three verbose levels (0, 1, 2) on the same GPU (Tesla K10.G2.8GB).
(You can get the scripts and log files in /export/a11/hlyu/synchronization_test_tdnn_lstm/)
dan_verbose0 time=135; dan_verbose1 time=155; dan_verbose2 time=154
karel_verbose0 time=148; karel_verbose1 time=149; karel_verbose2 time=148
master_verbose0 time=158; master_verbose1 time=157; master_verbose2 time=157

Bests,
Hang

@danpovey
Contributor Author

danpovey commented May 13, 2017 via email

@danpovey danpovey merged commit b1e8601 into kaldi-asr:master May 13, 2017
@danpovey danpovey deleted the synchronization_fix branch May 13, 2017 04:30
danpovey added a commit that referenced this pull request May 13, 2017
* [scripts] nnet1: minor update  i-vector and mpe scripts (#1607)

- mpe: backward compatibility is provided
- ivec: the ivectors get stored in binary format (saves space)

* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)

* [src,scripts,egs] Segmenting long erroneous recordings (#1167)

This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.

* [egs] thchs30 cmd and stage bug fix (#1619)

* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)

* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1

   if (CuDevice::Instantiate().Enabled()) {
     if (this->num_rows_ == 0) return;
-    Timer tim;
+    CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...

   KALDI_ASSERT(num_rows == M.NumRows() && this->num_rows_ == M.NumCols());
   if (num_rows == 0) return;
-  Timer tim;
+  CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...

Contributor Author


@LvHang, can you please make a PR to fix this?

   if (CuDevice::Instantiate().Enabled()) {
     if (dim_ == 0) return;
-    Timer tim;
+    CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...
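
For reference, a hedged guess at what the fix would look like (an assumption, not the actual diff of the follow-up #1623): the CuTimer started at the top of each wrapper should be consumed by the device's profiling call before the function returns, presumably something like CuDevice::Instantiate().AccuProfile(__func__, tim);. Below is a minimal, self-contained illustration of the same start-timer / do-work / record pattern in plain C++; all names in it are stand-ins, not Kaldi APIs.

```cpp
#include <chrono>
#include <cstdio>

// Stand-in for a CuTimer-like class: starts timing on construction.
struct SketchTimer {
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  double Elapsed() const {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
  }
};

// Stand-in for the profiling call: record the elapsed time under a name
// instead of silently discarding the timer.
static void AccuProfileSketch(const char *name, const SketchTimer &tim) {
  std::printf("%s: %.6f s\n", name, tim.Elapsed());
}

static void SomeGpuWrapper() {
  SketchTimer tim;                   // started at the top of the wrapper...
  // ... kernel launches would go here ...
  AccuProfileSketch(__func__, tim);  // ...and actually used before returning
}

int main() {
  SomeGpuWrapper();
  return 0;
}
```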

@LvHang
Contributor

LvHang commented May 15, 2017

OK

@LvHang LvHang mentioned this pull request May 15, 2017
@KarelVesely84
Contributor

Hi, and here is my BLSTM speed test; the order is:

  1. Hainan's implementation,
  2. My PR #1615 (cudamatrix: accelerating the BLSTM training),
  3. Original code with 'device syncs':
1) [TRAINING, **2.16752** min, fps9571.99] exp/blstm4i_DAN_ASYNC/log/iter02.tr.log:LOG
2) [TRAINING, **2.81618** min, fps7367.23] exp/blstm4i_GEMM_SYNC_cudaGetLastError_FINAL/log/iter02.tr.log:LOG
3) [TRAINING, **5.05169** min, fps4107.03] exp/blstm4i_OLDBIN_SYNC/log/iter02.tr.log:LOG

danpovey added a commit that referenced this pull request May 24, 2017
* [scripts] nnet1: minor update  i-vector and mpe scripts (#1607)

- mpe: backward compatibility is provided
- ivec: the ivectors get stored in binary format (saves space)

* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)

* [src,scripts,egs] Segmenting long erroneous recordings (#1167)

This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.

* [egs] thchs30 cmd and stage bug fix (#1619)

* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)

* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1

* [egs,scripts] Updates to BUT-specific cmd.sh settings (affects only Brno team); changes RE verbose level in nnet1 scripts.

* [src] fix a small bug: logging cuda elapsed time (#1623)

* [src,scripts,egs]  Add capability for multilingual training with nnet3; babel_multilang example.

* [scripts] Fix some merge problems I noticed on github review.

* [src] fix problem in test code.

* fixed some issues to merge kaldi_52 into master.

* removed add_lda parameter and its dependency.
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018