
[WIP] Added pre-caching memory on GPU #2236

Closed
jtrmal wants to merge 1 commit into kaldi-asr:master from jtrmal:cuda_memory_precache2

Conversation

@jtrmal (Contributor) commented Feb 22, 2018

A lot of these changes will have to be removed before committing, I guess, but for debugging and tuning I think the logs are very useful.

The changes lead from

Total GPU time: 45.4942s (may involve some double-counting)
-----
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 3533918816 bytes currently allocated (max: 3573514948); 62005496 currently in use by user (max: 3248682204); 10870/87098 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:136) Time taken in cudaMallocPitch=0.448115, in cudaMalloc=0.0951378, in cudaFree=0.296628, in this->MallocPitch()=0.929231
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-device.cc:418) Memory used (according to the device): 6798966784 bytes.
LOG (nnet3-chain-train[5.4]:main():nnet3-chain-train.cc:97) Wrote raw model to exp/chain_cleaned/tdnnx2_5.4_sp/2.3.raw

to

Total GPU time: 49.6038s (may involve some double-counting)
-----
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:154) Memory usage: 3533918816 bytes currently allocated (max: 3573540444); 62005496 currently in use by user (max: 3248682204); 12061/87538 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:163) Time taken in cudaMallocPitch=0.346041, in cudaMalloc=0.0912335, in cudaFree=0.225591, in this->MallocPitch()=7.39817
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-device.cc:419) Memory used (according to the device): 4519362560 bytes.
LOG (nnet3-chain-train[5.4]:main():nnet3-chain-train.cc:97) Wrote raw model to exp/chain_cleaned/tdnnx2_5.4_sp/2.3.raw
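
For readers who want a concrete picture, here is a minimal sketch of what "pre-caching memory on GPU" can look like in general: one large region is grabbed from the driver up front and later "allocations" are carved out of it. This is an illustration only, not the code in this PR; the names (PrecachedPool, the 512 MB size) are made up.

```c++
// Illustrative sketch only: pre-allocate one big block with cudaMalloc and
// hand out sub-regions, so later requests never hit the CUDA allocator.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

class PrecachedPool {
 public:
  explicit PrecachedPool(size_t bytes) : size_(bytes), used_(0), base_(nullptr) {
    // One real CUDA call at startup instead of many during training.
    if (cudaMalloc(&base_, size_) != cudaSuccess) { base_ = nullptr; size_ = 0; }
  }
  ~PrecachedPool() { if (base_) cudaFree(base_); }

  // Bump-pointer sub-allocation; a real allocator would keep free lists.
  void *Alloc(size_t bytes) {
    size_t aligned = (bytes + 255) & ~size_t(255);   // 256-byte alignment
    if (base_ == nullptr || used_ + aligned > size_) return nullptr;
    void *p = static_cast<char *>(base_) + used_;
    used_ += aligned;
    return p;
  }

 private:
  size_t size_, used_;
  void *base_;
};

int main() {
  PrecachedPool pool(size_t(512) << 20);   // pre-cache 512 MB on the device
  void *a = pool.Alloc(1 << 20);           // no cudaMalloc call here
  void *b = pool.Alloc(4 << 20);
  std::printf("a=%p b=%p\n", a, b);
  return 0;
}
```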

@jtrmal (Contributor, Author) commented Feb 22, 2018

BTW, the one regression is the GPU time, but that might just be me using a different memory-compression-level; I'm fairly confident the new result is with memory-compression-level=0, i.e. the most memory-demanding setting.

EDIT: the GPU time turned out to depend on some external factor, not on the GPU type or the code version.

jtrmal changed the title from "Added pre-caching memory" to "Added pre-caching memory on GPU" on Feb 22, 2018
@jtrmal (Contributor, Author) commented Feb 23, 2018

I'm checking vs a clean kaldi upstream/master:

  • upstream/master is the "pristine" kaldi
  • precaching-early pre-caches the matrices (as in this PR) and releases them before the forward pass (before Run())
  • precaching-forward releases only the forward variables before Run(), and the other variables after it
  • precaching-forward-memfactor uses precaching-forward plus the memfactor 1.1/1.03 and the allocator change in the for-loop (1-value instead of 2-value)

OK*(X) means the iteration finished successfully, but ReleaseSomeMemory() was called X times.

Anyway, overall I think we are seriously bitten by the fact that the CUDA allocator on the GTX 1080 uses a block size of 2MB instead of 1MB (it does sub-divide blocks for later allocations, but that apparently does not interact well with the Kaldi caching allocator).
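
A hedged illustration of one way to live with that granularity (this is not Kaldi's CuMemoryAllocator): round every cached request up to the driver's apparent block size, so that slightly different request sizes map onto the same cached block instead of forcing new cudaMalloc calls. kDriverBlockBytes and RoundUpForCache are made-up names, and the 2 MB value is just the observation above.

```c++
// Sketch: quantize request sizes to an assumed 2 MB driver block granularity
// before caching, so near-identical sizes reuse the same cached block.
#include <cstddef>

constexpr std::size_t kDriverBlockBytes = std::size_t(2) << 20;  // assumed 2 MB

inline std::size_t RoundUpForCache(std::size_t bytes) {
  // Round up to a whole number of driver blocks.
  std::size_t blocks = (bytes + kDriverBlockBytes - 1) / kDriverBlockBytes;
  return blocks * kDriverBlockBytes;
}
```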

| machine | change-set | mem-compression=0 | mem-compression=1 | mem-compression=2 |
|---------|------------|-------------------|-------------------|-------------------|
| c04 | upstream/master | FAIL | FAIL | FAIL |
| c04 | precaching-early | OK*(2) | OK*(4) | OK(1) |
| c04 | precaching-forward | OK*(6) | OK*(3) | OK |
| c04 | precaching-forward-memfactor | OK | OK | OK |
| b08 | upstream/master | OK*(173) | OK*(1) | OK*(1) |
| b08 | precaching-early | OK*(184) | OK*(1) | OK*(1) |
| b08 | precaching-forward | OK*(172) | OK*(1) | OK*(1) |
| b08 | precaching-forward-memfactor | OK*(1) | OK*(1) | OK*(1) |

Memory used (from cudaMemGetInfo()):

| machine | change-set | mem-compression=0 | mem-compression=1 | mem-compression=2 |
|---------|------------|-------------------|-------------------|-------------------|
| c04 | upstream/master | FAIL | FAIL | FAIL |
| c04 | precaching-early | 6G | 7G | 9G |
| c04 | precaching-forward | 10G | 10G | 4G |
| c04 | precaching-forward-memfactor | 4.5G | 5G | 3.2G |
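
For reference, a standalone sketch of where such per-device numbers can come from: the CUDA runtime reports free and total bytes, and "used" is their difference. This only assumes the standard cudaMemGetInfo() call, not any Kaldi code.

```c++
// Query device memory and print used bytes, in the spirit of the
// "Memory used (according to the device)" log lines above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    std::printf("Memory used (according to the device): %zu bytes.\n",
                total_bytes - free_bytes);
  }
  return 0;
}
```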

EDIT:

  • The memory consumption on the K2.8GB is roughly constant and equal to what Kaldi thinks it is using
  • Kaldi thinks it's using approx 3GB
  • Another test would be to run with every alloc and dealloc forwarded directly to the CUDA functions (it will hurt run time, but could give us some bounds); see the sketch after this list
  • running times are approximately the same
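
The bound test mentioned in the third bullet could look roughly like the following; PassthroughAllocator is a hypothetical name, and this is a sketch rather than Kaldi's allocator.

```c++
// Sketch of a "no caching at all" allocator: every request becomes a real
// cudaMalloc/cudaFree.  Slower, but the device footprint it produces is a
// useful bound on what the caching allocator really needs.
#include <cuda_runtime.h>
#include <cstddef>

struct PassthroughAllocator {
  void *Malloc(std::size_t bytes) {
    void *p = nullptr;
    if (cudaMalloc(&p, bytes) != cudaSuccess) return nullptr;
    return p;
  }
  void Free(void *p) { if (p) cudaFree(p); }
};
```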

danpovey changed the title from "Added pre-caching memory on GPU" to "[WIP] Added pre-caching memory on GPU" on Feb 28, 2018
@danpovey (Contributor) commented Mar 1, 2018

Note: I merged some of this code via PR #2244 (the code that pre-caches the stuff needed for the chain computation).

@jtrmal (Contributor, Author) commented Mar 1, 2018 via email

@jtrmal (Contributor, Author) commented Mar 7, 2018

I forgot to close this.

jtrmal closed this on Mar 7, 2018