[WIP] Added pre-caching memory on GPU #2236
Conversation
BTW, the negative difference is in the GPU time, but that might just be me using a different memory-compression-level; I'm confident the new result is with memory-compression-level=0, i.e. the most memory-demanding setting(?). EDIT: the GPU time depended on some external factor, not on the GPU type or the code version.
I'm checking vs a clean kaldi upstream/master:
OK*(X) means the iteration finished successfully, but ReleaseSomeMemory() has been called X times. Anyway, overall, I think we are seriously bitten by the fact that the CUDA allocator on the GTX 1080 uses a new block size of 2MB instead of 1MB (it does sub-divide blocks for later allocations, but that apparently does not interact well with the Kaldi caching allocator).
memory used (from cudaMemGetInfo())
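To make the block-size point concrete, here is a minimal sketch (not Kaldi's actual CuMemoryAllocator; all names are illustrative) of a caching allocator that rounds each request up to the driver's minimum block granularity and keeps freed blocks for reuse. With a 2MB minimum block instead of 1MB, every cached small request pins twice as much memory, which is one way the fragmentation pressure described above can grow:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical sketch of a caching allocator: requests are rounded up to a
// minimum block size (1 MB vs 2 MB in the discussion above), and freed
// blocks are cached for reuse instead of being returned to the driver.
struct CachingAllocatorSketch {
  size_t block_size;          // minimum block granularity
  size_t bytes_reserved = 0;  // total memory held (in use + cached)
  std::multimap<size_t, int> free_blocks;  // rounded size -> dummy handle

  explicit CachingAllocatorSketch(size_t bs) : block_size(bs) {}

  size_t RoundUp(size_t n) const {
    return ((n + block_size - 1) / block_size) * block_size;
  }

  // Returns the rounded size actually reserved for this request.
  size_t Allocate(size_t n) {
    size_t rounded = RoundUp(n);
    auto it = free_blocks.find(rounded);
    if (it != free_blocks.end()) {
      free_blocks.erase(it);      // cache hit: reuse, no new reservation
    } else {
      bytes_reserved += rounded;  // cache miss: simulate a fresh cudaMalloc
    }
    return rounded;
  }

  void Free(size_t n) {
    free_blocks.emplace(RoundUp(n), 0);  // keep the block cached
  }
};
```

For example, 100 live 100KB allocations reserve ~100MB with a 1MB granularity but ~200MB with a 2MB granularity, even though the payload is under 10MB in both cases.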
Note: I merged some of this code via PR #2244 (the code that pre-caches the stuff needed for the chain computation).
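The pre-caching idea can be sketched as a warm-up pass: before the main loop, allocate and free every buffer size the computation will need, so a caching allocator already holds blocks of the right sizes and the first real iteration triggers no fresh driver-level allocations. This is an illustrative sketch, not the merged PR #2244 code; all names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical caching allocator: freed block sizes are kept for reuse,
// and we count how often a request misses the cache and has to go to the
// driver (simulating a cudaMalloc call).
struct FakeCachingAllocator {
  std::vector<size_t> cache;  // freed block sizes kept for reuse
  int driver_allocs = 0;      // cache misses, i.e. simulated driver calls

  void Allocate(size_t n) {
    for (auto it = cache.begin(); it != cache.end(); ++it) {
      if (*it == n) { cache.erase(it); return; }  // cache hit
    }
    ++driver_allocs;  // cache miss: would call cudaMalloc here
  }
  void Free(size_t n) { cache.push_back(n); }  // keep the block cached
};

// Pre-cache: touch each needed size once, then free it, so later requests
// for those sizes are served from the cache.
void PreCacheSizes(FakeCachingAllocator* alloc,
                   const std::vector<size_t>& sizes) {
  for (size_t n : sizes) alloc->Allocate(n);
  for (size_t n : sizes) alloc->Free(n);
}
```

After the warm-up, the training loop's allocations of those sizes are all cache hits, which avoids driver allocations (and the fragmentation they can cause) in the middle of an iteration.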
Great. I guess I'll just close this. I'm still looking at other
possibilities for making the allocator less susceptible to fragmentation, but
I'm more in the stage of reviewing literature than coding.
Y.
…On Wed, Feb 28, 2018, 20:15 Daniel Povey ***@***.***> wrote:
Note: I merged some of this code via PR #2244
<#2244> (the code that pre-caches
the stuff needed for the chain computation).
I forgot to close this.
A lot of the changes will have to be removed before committing this, I guess, but for debugging and tuning I think the logs are very useful.
The changes lead to