
[WIP] Added pre-caching memory on GPU #2236

Closed
jtrmal wants to merge 1 commit into kaldi-asr:master from jtrmal:cuda_memory_precache2

Conversation

@jtrmal (Contributor) commented Feb 22, 2018

A lot of these changes will have to be removed before committing, I guess, but for debugging and tuning I think the logs are very useful.

The changes lead from

Total GPU time: 45.4942s (may involve some double-counting)
-----
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:127) Memory usage: 3533918816 bytes currently allocated (max: 3573514948); 62005496 currently in use by user (max: 3248682204); 10870/87098 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:136) Time taken in cudaMallocPitch=0.448115, in cudaMalloc=0.0951378, in cudaFree=0.296628, in this->MallocPitch()=0.929231
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-device.cc:418) Memory used (according to the device): 6798966784 bytes.
LOG (nnet3-chain-train[5.4]:main():nnet3-chain-train.cc:97) Wrote raw model to exp/chain_cleaned/tdnnx2_5.4_sp/2.3.raw

to

Total GPU time: 49.6038s (may involve some double-counting)
-----
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:154) Memory usage: 3533918816 bytes currently allocated (max: 3573540444); 62005496 currently in use by user (max: 3248682204); 12061/87538 calls to Malloc* resulted in CUDA calls.
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-allocator.cc:163) Time taken in cudaMallocPitch=0.346041, in cudaMalloc=0.0912335, in cudaFree=0.225591, in this->MallocPitch()=7.39817
LOG (nnet3-chain-train[5.4]:PrintMemoryUsage():cu-device.cc:419) Memory used (according to the device): 4519362560 bytes.
LOG (nnet3-chain-train[5.4]:main():nnet3-chain-train.cc:97) Wrote raw model to exp/chain_cleaned/tdnnx2_5.4_sp/2.3.raw
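
For readers who want a concrete picture, here is a minimal sketch of what "pre-caching memory on GPU" can look like in general: one large region is grabbed from the driver up front and later "allocations" are carved out of it. This is an illustration only, not the code in this PR; the names (PrecachedPool, the 512 MB size) are made up.

```c++
// Illustrative sketch only: pre-allocate one big block with cudaMalloc and
// hand out sub-regions, so later requests never hit the CUDA allocator.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

class PrecachedPool {
 public:
  explicit PrecachedPool(size_t bytes) : size_(bytes), used_(0), base_(nullptr) {
    // One real CUDA call at startup instead of many during training.
    if (cudaMalloc(&base_, size_) != cudaSuccess) { base_ = nullptr; size_ = 0; }
  }
  ~PrecachedPool() { if (base_) cudaFree(base_); }

  // Bump-pointer sub-allocation; a real allocator would keep free lists.
  void *Alloc(size_t bytes) {
    size_t aligned = (bytes + 255) & ~size_t(255);   // 256-byte alignment
    if (base_ == nullptr || used_ + aligned > size_) return nullptr;
    void *p = static_cast<char *>(base_) + used_;
    used_ += aligned;
    return p;
  }

 private:
  size_t size_, used_;
  void *base_;
};

int main() {
  PrecachedPool pool(size_t(512) << 20);   // pre-cache 512 MB on the device
  void *a = pool.Alloc(1 << 20);           // no cudaMalloc call here
  void *b = pool.Alloc(4 << 20);
  std::printf("a=%p b=%p\n", a, b);
  return 0;
}
```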

@jtrmal (Contributor, Author) commented Feb 22, 2018

BTW, the one regression is the GPU time, but that might just be me using a different memory-compression-level; I'm fairly confident the new result is with memory-compression-level=0, i.e. the most memory-demanding setting.

EDIT: the GPU time turned out to depend on some external factor, not on the GPU type or the code version.

jtrmal changed the title from "Added pre-caching memory" to "Added pre-caching memory on GPU" on Feb 22, 2018
@jtrmal (Contributor, Author) commented Feb 23, 2018

I'm checking vs a clean kaldi upstream/master:

  • upstream/master is the "pristine" kaldi
  • precaching-early pre-caches the matrices (as in this PR) and releases them before the forward pass (before Run())
  • precaching-forward releases only the forward variables before Run(), and the other variables after it
  • precaching-forward-memfactor uses precaching-forward plus the memfactor 1.1/1.03 and the allocator change in the for-loop (1-value instead of 2-value)

OK*(X) means the iteration finished successfully, but ReleaseSomeMemory() was called X times.

Anyway, overall I think we are seriously bitten by the fact that the CUDA allocator on the GTX 1080 uses a block size of 2MB instead of 1MB (it does sub-divide blocks for later allocations, but that apparently does not interact well with the Kaldi caching allocator).
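
A hedged illustration of one way to live with that granularity (this is not Kaldi's CuMemoryAllocator): round every cached request up to the driver's apparent block size, so that slightly different request sizes map onto the same cached block instead of forcing new cudaMalloc calls. kDriverBlockBytes and RoundUpForCache are made-up names, and the 2 MB value is just the observation above.

```c++
// Sketch: quantize request sizes to an assumed 2 MB driver block granularity
// before caching, so near-identical sizes reuse the same cached block.
#include <cstddef>

constexpr std::size_t kDriverBlockBytes = std::size_t(2) << 20;  // assumed 2 MB

inline std::size_t RoundUpForCache(std::size_t bytes) {
  // Round up to a whole number of driver blocks.
  std::size_t blocks = (bytes + kDriverBlockBytes - 1) / kDriverBlockBytes;
  return blocks * kDriverBlockBytes;
}
```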

| machine | change-set | mem-compression=0 | mem-compression=1 | mem-compression=2 |
|---------|------------|-------------------|-------------------|-------------------|
| c04 | upstream/master | FAIL | FAIL | FAIL |
| c04 | precaching-early | OK*(2) | OK*(4) | OK(1) |
| c04 | precaching-forward | OK*(6) | OK*(3) | OK |
| c04 | precaching-forward-memfactor | OK | OK | OK |
| b08 | upstream/master | OK*(173) | OK*(1) | OK*(1) |
| b08 | precaching-early | OK*(184) | OK*(1) | OK*(1) |
| b08 | precaching-forward | OK*(172) | OK*(1) | OK*(1) |
| b08 | precaching-forward-memfactor | OK*(1) | OK*(1) | OK*(1) |

Memory used (from cudaMemGetInfo()):

| machine | change-set | mem-compression=0 | mem-compression=1 | mem-compression=2 |
|---------|------------|-------------------|-------------------|-------------------|
| c04 | upstream/master | FAIL | FAIL | FAIL |
| c04 | precaching-early | 6G | 7G | 9G |
| c04 | precaching-forward | 10G | 10G | 4G |
| c04 | precaching-forward-memfactor | 4.5G | 5G | 3.2G |
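
For reference, a standalone sketch of where such per-device numbers can come from: the CUDA runtime reports free and total bytes, and "used" is their difference. This only assumes the standard cudaMemGetInfo() call, not any Kaldi code.

```c++
// Query device memory and print used bytes, in the spirit of the
// "Memory used (according to the device)" log lines above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    std::printf("Memory used (according to the device): %zu bytes.\n",
                total_bytes - free_bytes);
  }
  return 0;
}
```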

EDIT:

  • The memory consumption on the K2.8GB is roughly constant and equal to what Kaldi thinks it is using
  • Kaldi thinks it's using approx 3GB
  • Another test would be to run with every alloc and dealloc forwarded directly to the CUDA functions (it will hurt run time, but could give us some bounds); see the sketch after this list
  • running times are approximately the same
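
The bound test mentioned in the third bullet could look roughly like the following; PassthroughAllocator is a hypothetical name, and this is a sketch rather than Kaldi's allocator.

```c++
// Sketch of a "no caching at all" allocator: every request becomes a real
// cudaMalloc/cudaFree.  Slower, but the device footprint it produces is a
// useful bound on what the caching allocator really needs.
#include <cuda_runtime.h>
#include <cstddef>

struct PassthroughAllocator {
  void *Malloc(std::size_t bytes) {
    void *p = nullptr;
    if (cudaMalloc(&p, bytes) != cudaSuccess) return nullptr;
    return p;
  }
  void Free(void *p) { if (p) cudaFree(p); }
};
```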

danpovey changed the title from "Added pre-caching memory on GPU" to "[WIP] Added pre-caching memory on GPU" on Feb 28, 2018
@danpovey (Contributor) commented Mar 1, 2018

Note: I merged some of this code via PR #2244 (the code that pre-caches the stuff needed for the chain computation).

@jtrmal (Contributor, Author) commented Mar 1, 2018 via email

@jtrmal (Contributor, Author) commented Mar 7, 2018

I forgot to close this.

jtrmal closed this on Mar 7, 2018