Synchronization fix #1617
Conversation
Sure. I will do that.
The printed GPU time is not actually meaningful because in that branch, the
timing stats are totally wrong. It's the total time elapsed that you
should look at.
On Wed, May 10, 2017 at 8:37 PM, LvHang wrote:
Sure. I will do that.
About PR #1615:
Today I ran the kaldi master branch and Karel's branch, and tested them with verbose=0 (the default) and verbose=2.
In Karel's verbose=2 case, the "Total GPU time" is about 1 s on a Tesla M40; in Karel's verbose=0 case it is about 80 s on a Tesla K10.G2. That gap is hard to believe.
So I plan to reserve a GPU (qlogin -l gpu=1) and retest them on the same GPU.
Hang
Oh, I see. Assuming I reserve one GPU, which of the following two ways of testing do you think is better?
It's fine to run just one job; you don't have to run an entire training run. And you don't have to run it more than twice or so.
On Wed, May 10, 2017 at 9:19 PM, LvHang wrote:
Oh, I see.
But I still think reserving the same GPU is necessary; different GPU types produce a huge gap in elapsed time.
For example, when I tested kaldi master with verbose=0:
If I use a K80, the accounting time is 63 s (in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.4.1.log).
If I use a K10, the accounting time is 92 s (in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/chain/tdnn1a_sp_verbose0/log/train.2.2.log).
Assuming I reserve one GPU, which of the following two ways of testing do you think is better?
(1) Run all the train.$iteration.$jobs jobs and average the elapsed time.
(2) Just run train.0.1 five times and average the elapsed time.
Hang
Hi Dan, I checked your verbose setup. Meanwhile, I ran one job (train.0.1) for the three code versions at verbose levels 0, 1, and 2 on the same GPU (Tesla K20m) manually. (You can get the scripts and log files in /export/a11/hlyu/synchronization_test/.) Bests,
This is strange; I would expect a larger speedup. Is it 1-GPU training? Is the topology BLSTM?
Try with the TDNN+LSTM script. For TDNN we expect less difference.
On Thu, May 11, 2017 at 5:58 AM, Karel Vesely wrote:
This is strange, I would expect a larger speedup. Is it 1-GPU training? Is the topology BLSTM?
(For feedforward nets the speedup should be smaller; there are >100x fewer kernels called...)
K.
Yeah, that result is for the TDNN script, which is similar to a DNN in topology.
Hi Dan, for nnet3/tdnn_lstm with mini_librispeech there are 33 iterations. Meanwhile, I ran one job (train.0.1) for the three code versions at verbose levels 0, 1, and 2 on the same GPU (Tesla K10.G2.8GB) manually. Bests,
Thanks.
That's a really nice speedup!
I think I'll merge this change now. I think it's pretty harmless.
On Sat, May 13, 2017 at 12:00 AM, LvHang wrote:
Hi Dan,
This is a report about nnet3/tdnn_lstm_sp with mini_librispeech.
I tested Karel's version (PR #1615), kaldi master's version, and your version (this PR) separately.
For Karel's version, you can get all the files in /export/a11/hlyu/cumatrix_accelerate_by_async/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}.
For kaldi master's version, you can get all the files in /export/a11/hlyu/kaldi/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp_verbose{0,1,2}.
For your version, you can get the files in /export/a11/hlyu/synchronization_fix/egs/mini_librispeech/s5/exp/nnet3/tdnn_lstm1b_sp.
For nnet3/tdnn_lstm with mini_librispeech there are 33 iterations. Meanwhile, I ran one job (train.0.1) for the three code versions at verbose levels 0, 1, and 2 on the same GPU (Tesla K10.G2.8GB) manually. (You can get the scripts and log files in /export/a11/hlyu/synchronization_test_tdnn_lstm/.)
dan_verbose0 time=135; dan_verbose1 time=155; dan_verbose2 time=154
karel_verbose0 time=148; karel_verbose1 time=149; karel_verbose2 time=148
master_verbose0 time=158; master_verbose1 time=157; master_verbose2 time=157
Bests,
Hang
* [scripts] nnet1: minor update i-vector and mpe scripts (#1607): mpe: backward compatibility is provided; ivec: the ivectors get stored in binary format (saves space)
* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)
* [src,scripts,egs] Segmenting long erroneous recordings (#1167). This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.
* [egs] thchs30 cmd and stage bug fix (#1619)
* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)
* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1
if (CuDevice::Instantiate().Enabled()) {
  if (this->num_rows_ == 0) return;
-   Timer tim;
+   CuTimer tim;
It seems that this timer is never used for logging its elapsed time...
KALDI_ASSERT(num_rows == M.NumRows() && this->num_rows_ == M.NumCols());
if (num_rows == 0) return;
-   Timer tim;
+   CuTimer tim;
It seems that this timer is never used for logging its elapsed time...
@LvHang, can you please make a PR to fix this?
if (CuDevice::Instantiate().Enabled()) {
  if (dim_ == 0) return;
-   Timer tim;
+   CuTimer tim;
It seems that this timer is never used for logging its elapsed time...
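For context, the surrounding CUDA branches normally end with an AccuProfile call so that the elapsed time recorded by the CuTimer is actually accumulated into the per-function GPU stats; the snippets flagged above construct the timer but never report it. Below is a minimal sketch of the missing call, not the committed code: the cudaMemset body and the data_/dim_/Real names are assumptions for illustration, while the AccuProfile(const char *, const CuTimer &) signature is the one this PR introduces (the follow-up listed later in this thread, #1623 "fix a small bug: logging cuda elapsed time", appears to be the actual fix).

```c++
// Sketch only: where the flagged CuTimer would be consumed.
#if HAVE_CUDA == 1
  if (CuDevice::Instantiate().Enabled()) {
    if (dim_ == 0) return;
    CuTimer tim;  // records a start time only when GPU stats are enabled
    CU_SAFE_CALL(cudaMemset(data_, 0, dim_ * sizeof(Real)));  // illustrative body
    // This is the call the review comments say is missing; without it the
    // timer is constructed and destroyed without its time ever being logged.
    CuDevice::Instantiate().AccuProfile(__func__, tim);
  }
#endif
```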
OK
Hi, and here is my BLSTM speed test, the order is:
* [scripts] nnet1: minor update i-vector and mpe scripts (#1607): mpe: backward compatibility is provided; ivec: the ivectors get stored in binary format (saves space)
* [src] cosmetic change to const-arpa-lm-building code; remove too-general template. (#1610)
* [src,scripts,egs] Segmenting long erroneous recordings (#1167). This is a solution for creating ASR training data from long recordings with transcription but without segmentation information.
* [egs] thchs30 cmd and stage bug fix (#1619)
* [src] Change to GPU synchronization, for speed (disables GPU stats by default) (#1617)
* [src] Fix template instantiation bug causing failure if DOUBLEPRECISION=1
* [egs,scripts] Updates to BUT-specific cmd.sh settings (affects only Brno team); changes RE verbose level in nnet1 scripts.
* [src] fix a small bug: logging cuda elapsed time (#1623)
* [src,scripts,egs] Add capability for multilingual training with nnet3; babel_multilang example.
* [scripts] Fix some merge problems I noticed on github review.
* [src] fix problem in test code.
* fixed some issues to merge kaldi_52 into master.
* removed add_lda parameter and its dependency.
This pull request is a different fix to the problem Karel was addressing in PR #1615.
It makes the accumulation of timing stats dependent on the verbose level being >= 1.
@LvHang, can you please test this PR instead of #1615, since this is the one that's a candidate to merge?
However, if it would be easy and you're nearly done testing PR #1615, testing them both is fine.
Please run the nnet training scripts in the mini_librispeech setup. You don't have to run that setup from the start, just the neural net training. When you're done, show me the location; I want to look at the logs.
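For anyone skimming the diff, the mechanism is small: the timer used in the CUDA wrappers only starts, and the stats accumulation (with its synchronization) only happens, when the verbose level is at least 1, so the default path skips that per-call overhead. Here is a rough sketch of the idea, not the committed implementation; the CuTimer/AccuProfile names follow this PR, but the Timer(bool) constructor, profile_map_, and the synchronize call are assumptions used only to illustrate the gating:

```c++
// Rough sketch of verbose-gated GPU timing (illustrative, not the real code).
class CuTimer : public Timer {
 public:
  // Only record a start time when GPU stats are wanted (verbose >= 1).
  CuTimer() : Timer(GetVerboseLevel() >= 1) { }
};

void CuDevice::AccuProfile(const char *function_name, const CuTimer &timer) {
  if (GetVerboseLevel() >= 1) {
    // Synchronizing here is what makes per-function GPU times meaningful,
    // and it is exactly the overhead the default (verbose < 1) path avoids.
    CU_SAFE_CALL(cudaDeviceSynchronize());
    profile_map_[std::string(function_name)] += timer.Elapsed();
  }
}
```

This is consistent with the numbers reported earlier in the thread: verbose=0 runs get faster, presumably because the wrappers no longer synchronize on every call, while verbose>=1 still produces the per-function "Total GPU time" breakdown.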