Conversation

@danpovey
Contributor

This pull request is a different fix to the problem Karel was addressing in PR #1615.
It makes the accumulation of timing stats dependent on verbose level >= 1.
@LvHang, can you please test this PR instead of #1615, since this is the one that's a candidate to merge?
However, if it's easy and you're nearly done testing PR #1615, testing both is fine.

Please run the nnet training scripts in the mini-librispeech setup. However, you don't have to run that setup from the start, just the neural net training. When you're done, show me the location; I want to look at the logs.
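
To make the idea concrete, here is a minimal, self-contained C++ sketch (an illustration of the approach only, not the actual Kaldi patch; the struct and variable names below are invented for illustration): timing stats are accumulated only when the verbose level is at least 1, so the GPU synchronization needed for accurate per-operation timing can be skipped in normal runs.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

static int g_verbose_level = 0;  // stands in for the verbose level in a real setup

struct ProfileStats {
  std::unordered_map<std::string, double> seconds;

  // Accumulate elapsed time under 'name', but only when verbose >= 1.
  void Accumulate(const std::string &name,
                  std::chrono::steady_clock::time_point start) {
    if (g_verbose_level < 1) return;  // the gating this PR introduces
    // In the real code a device synchronization would happen before reading
    // the timer, so the elapsed time covers all queued kernels; omitted in
    // this CPU-only sketch.
    seconds[name] += std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
  }
};

int main() {
  ProfileStats stats;
  auto t0 = std::chrono::steady_clock::now();
  // ... a GPU wrapper would do its work here ...
  stats.Accumulate("CuMatrix::AddMat", t0);
  std::cout << "timing entries recorded: " << stats.seconds.size() << "\n";
  return 0;
}
```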

@LvHang
Contributor

LvHang commented May 11, 2017

Sure, I will do that.
About PR #1615:
Today I ran the kaldi master branch and karel's branch, testing each with verbose=0 (the default) and verbose=2.
In karel's verbose=2 case the "Total GPU time" is about 1s on a Tesla M40, while in karel's verbose=0 case it is about 80s on a Tesla K10.G2. That gap is hard to believe.
So I plan to reserve a GPU (qlogin -l gpu=1) and retest them on the same GPU.
Hang

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Oh, I see.
But I still think reserving the same GPU is necessary; different GPU types give very different elapsed times.
For example, when I test kaldi master with verbose=0:
on a K80 the accounting time is 63s
(in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.4.1.log),
and on a K10 it is 92s
(in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.2.2.log).

Assuming I reserve one GPU, which of the following two ways of testing do you think is better?
(1) run all of train.$iteration.$jobs and average the elapsed time, or
(2) just run train.0.1 five times and average the elapsed time.
Hang

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Hi Dan,
I tested karel's version (PR #1615), the kaldi master version, and your version (this PR) separately.
For karel's version, you can get all the files in /export/a11/hlyu/cumatrix_accelerate_by_async/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose{0,1,2}
For kaldi master's version, you can get all the files in
/export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose{0,1,2}
For your version, you can get the files in
/export/a11/hlyu/synchronization_fix/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp.

I checked your verbose setup: verbose_opt = ("--verbose=1" if iter % 20 == 0 and iter > 0 else "").
Since mini_librispeech has only 6 iterations, verbose always stays 0.

Meanwhile, I manually ran one job (train.0.1) for all three versions at the three verbose levels (0, 1, 2) on the same GPU (Tesla K20m). (You can get the scripts and log files in /export/a11/hlyu/synchronization_test/)
dan_verbose0 time=74; dan_verbose1 time=74; dan_verbose2 time=79
karel_verbose0 time=84; karel_verbose1 time=77; karel_verbose2 time=82
master_verbose0 time=79; master_verbose1 time=79; master_verbose2 time=84

Bests,
Hang

@KarelVesely84
Contributor

This is strange, I would expect a larger speedup. Is it 1-GPU training? What is the topology, is it BLSTM?
(For feedforward nets the speedup should be smaller, since >100x fewer kernels are called...)
K.

@danpovey
Contributor Author

danpovey commented May 11, 2017 via email

@LvHang
Contributor

LvHang commented May 11, 2017

Yeah, the result is from the TDNN script, whose topology is similar to a DNN's.
I will try TDNN+LSTM and BLSTM with mini_librispeech today.
Hang

@LvHang
Contributor

LvHang commented May 13, 2017

Hi Dan,
This is a report about nnet3/tdnn_lstm_sp with mini_librispeech.
I tested karel's version (PR #1615), the kaldi master version, and your version (this PR) separately.
For karel's version, you can get all the files in /export/a11/hlyu/cumatrix_accelerate_by_async/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}
For kaldi master's version, you can get all the files in
/export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}
For your version, you can get the files in
/export/a11/hlyu/synchronization_fix/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp.

The nnet3/tdnn_lstm setup with mini_librispeech has 33 iterations.

Meanwhile, I manually ran one job (train.0.1) for all three versions at the three verbose levels (0, 1, 2) on the same GPU (Tesla K10.G2.8GB).
(You can get the scripts and log files in /export/a11/hlyu/synchronization_test_tdnn_lstm/)
dan_verbose0 time=135; dan_verbose1 time=155; dan_verbose2 time=154
karel_verbose0 time=148; karel_verbose1 time=149; karel_verbose2 time=148
master_verbose0 time=158; master_verbose1 time=157; master_verbose2 time=157

Bests,
Hang

@danpovey
Contributor Author

danpovey commented May 13, 2017 via email

@danpovey danpovey merged commit b1e8601 into kaldi-asr:master May 13, 2017
@danpovey danpovey deleted the synchronization_fix branch May 13, 2017 04:30
danpovey added a commit that referenced this pull request May 13, 2017
* [scripts] nnet1: minor update  i-vector and mpe scripts (#1607)

- mpe: backward compatibility is provided
- ivec: the ivectors get stored in binary format (saves space)

* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)

* [src,scripts,egs] Segmenting long erroneous recordings (#1167)

This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.

* [egs] thchs30 cmd and stage bug fix (#1619)

* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)

* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1

   if (CuDevice::Instantiate().Enabled()) {
     if (this->num_rows_ == 0) return;
-    Timer tim;
+    CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...

   KALDI_ASSERT(num_rows == M.NumRows() && this->num_rows_ == M.NumCols());
   if (num_rows == 0) return;
-  Timer tim;
+  CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...

Contributor Author


@LvHang, can you please make a PR to fix this?

   if (CuDevice::Instantiate().Enabled()) {
     if (dim_ == 0) return;
-    Timer tim;
+    CuTimer tim;
Contributor


It seems that this timer is never used for logging its elapsed time...
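
For reference, a hedged guess at what the fix would look like (an assumption, not the actual diff of the follow-up #1623): the CuTimer started at the top of each wrapper should be consumed by the device's profiling call before the function returns, presumably something like CuDevice::Instantiate().AccuProfile(__func__, tim);. Below is a minimal, self-contained illustration of the same start-timer / do-work / record pattern in plain C++; all names in it are stand-ins, not Kaldi APIs.

```cpp
#include <chrono>
#include <cstdio>

// Stand-in for a CuTimer-like class: starts timing on construction.
struct SketchTimer {
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  double Elapsed() const {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
  }
};

// Stand-in for the profiling call: record the elapsed time under a name
// instead of silently discarding the timer.
static void AccuProfileSketch(const char *name, const SketchTimer &tim) {
  std::printf("%s: %.6f s\n", name, tim.Elapsed());
}

static void SomeGpuWrapper() {
  SketchTimer tim;                   // started at the top of the wrapper...
  // ... kernel launches would go here ...
  AccuProfileSketch(__func__, tim);  // ...and actually used before returning
}

int main() {
  SomeGpuWrapper();
  return 0;
}
```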

@LvHang
Contributor

LvHang commented May 15, 2017

OK

@LvHang LvHang mentioned this pull request May 15, 2017
@KarelVesely84
Contributor

Hi, and here is my BLSTM speed test; the order is:

  1. Hainan's implementation,
  2. My PR #1615 (cudamatrix: accelerating the BLSTM training),
  3. Original code with 'device syncs':
1) [TRAINING, **2.16752** min, fps9571.99] exp/blstm4i_DAN_ASYNC/log/iter02.tr.log:LOG
2) [TRAINING, **2.81618** min, fps7367.23] exp/blstm4i_GEMM_SYNC_cudaGetLastError_FINAL/log/iter02.tr.log:LOG
3) [TRAINING, **5.05169** min, fps4107.03] exp/blstm4i_OLDBIN_SYNC/log/iter02.tr.log:LOG

danpovey added a commit that referenced this pull request May 24, 2017
* [scripts] nnet1: minor update  i-vector and mpe scripts (#1607)

- mpe: backward compatibility is provided
- ivec: the ivectors get stored in binary format (saves space)

* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)

* [src,scripts,egs] Segmenting long erroneous recordings (#1167)

This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.

* [egs] thchs30 cmd and stage bug fix (#1619)

* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)

* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1

* [egs,scripts] Updates to BUT-specific cmd.sh settings (affects only Brno team); changes RE verbose level in nnet1 scripts.

* [src] fix a small bug: logging cuda elapsed time (#1623)

* [src,scripts,egs]  Add capability for multilingual training with nnet3; babel_multilang example.

* [scripts] Fix some merge problems I noticed on github review.

* [src] fix problem in test code.

* fixed some issues to merge kaldi_52 into master.

* removed add_lda parameter and its dependency.
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018