[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling #36691
MatthewBonanni merged 2 commits into vllm-project:main
Conversation
do we need a bugfix in 0.17 for this?
@robertgshaw2-redhat no, it was introduced by #30515, which isn't in 0.17. Updated the PR description to clarify
Code Review
This pull request addresses an out-of-memory issue during CUDA graph memory profiling for DeepSeek V3.2. The fix correctly initializes the minimal KV cache by using the num_gpu_blocks_override mechanism, which is a more robust approach than the previous memory calculation that was incorrect for UniformTypeKVCacheSpecs. The change is sound, but I've suggested an improvement to ensure the configuration is always restored to its original state, even in the case of an exception, by using a try...finally block.
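The try...finally suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `CacheConfig` here is a hypothetical stand-in dataclass, and `temporary_block_override` is an invented helper name. Only the field name `num_gpu_blocks_override` comes from the review text.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class CacheConfig:
    """Hypothetical stand-in for the real cache config object."""

    num_gpu_blocks_override: Optional[int] = None


@contextmanager
def temporary_block_override(
    config: CacheConfig, num_blocks: int
) -> Iterator[CacheConfig]:
    """Set the override for the duration of the block, then restore it."""
    saved = config.num_gpu_blocks_override
    config.num_gpu_blocks_override = num_blocks
    try:
        yield config
    finally:
        # Runs on normal exit and when the body raises, so the caller
        # always gets the original value back.
        config.num_gpu_blocks_override = saved
```

With this shape, profiling code would run inside `with temporary_block_override(cfg, 1): ...`, and a failure during profiling would not leave the override set for the real KV cache allocation.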
Note: Security Review did not run due to the size of the PR.
njhill left a comment
Thanks @MatthewBonanni , maybe just add a short comment?
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
…ct#36691) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Purpose
The cudagraph memory profiler added in #30515 did not account for UniformTypeKVCacheSpecs in init_minimal_kv_cache_for_profiling, so the page_size was being improperly multiplied by the group_size, causing an allocation that was 61x too large. This PR fixes this and takes advantage of the existing num_blocks override mechanism instead of spoofing the available memory, so it should be more robust.
Test Plan
Test Result
main: OOM during startup
PR: starts up successfully
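For readers who want the arithmetic behind the 61x figure in the Purpose section, here is a rough illustration. It assumes the uniform spec's page size already covers every layer in the group and that the group size is 61 (inferred from the 61x in the description). The per-layer page size and the helper `bytes_needed` are made up for the example and do not come from vLLM.

```python
GROUP_SIZE = 61                    # assumed, inferred from the "61x" in the description
PER_LAYER_PAGE_BYTES = 64 * 1024   # hypothetical per-layer page size


def bytes_needed(page_size_bytes: int, num_blocks: int, multiplier: int = 1) -> int:
    """Memory requested for `num_blocks` pages, scaled by `multiplier`."""
    return page_size_bytes * num_blocks * multiplier


# Assumption: the uniform spec reports a page that already spans all layers.
uniform_page_bytes = PER_LAYER_PAGE_BYTES * GROUP_SIZE

# Intended minimal allocation: one block of the already-aggregated page.
correct = bytes_needed(uniform_page_bytes, num_blocks=1)

# Buggy path: the aggregated page is multiplied by the group size again.
buggy = bytes_needed(uniform_page_bytes, num_blocks=1, multiplier=GROUP_SIZE)

overallocation_factor = buggy // correct
```

Under these assumptions `overallocation_factor` equals the group size, which is why requesting a fixed block count sidesteps the problem: the size is never derived from a memory budget in the first place.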