fix CUDAGraph memory being counted twice#37426
MatthewBonanni merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Code Review
This pull request addresses a double-counting issue with CUDA graph memory estimation. When VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS is enabled, peak_activation_memory includes an estimated value for CUDA graph memory. The change correctly subtracts this estimate before adding the actual captured CUDA graph memory (cuda_graph_memory_bytes), preventing the overestimation of non_kv_cache_memory. The logic is sound and the implementation correctly fixes the issue. I have no high or critical severity issues to report.
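The corrected accounting described above can be sketched as follows (a minimal illustration with hypothetical function and parameter names following the PR description, not vLLM's actual API):

```python
def compute_non_kv_cache_memory(
    peak_activation_memory: int,
    cuda_graph_memory_bytes: int,
    cudagraph_memory_estimate: int,
    weights_memory: int,
) -> int:
    """Sum non-KV-cache memory without double-counting CUDA graphs.

    peak_activation_memory may already include an *estimated* CUDA graph
    size (when VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS is enabled), so
    the estimate is subtracted before adding the *actual* captured size.
    """
    activation_only = peak_activation_memory - cudagraph_memory_estimate
    return activation_only + cuda_graph_memory_bytes + weights_memory


# Example: peak activations of 6 units include a 3-unit CUDA graph
# estimate; the actual capture took 2 units, weights take 10.
# Without the subtraction the result would be 18; with it, 15.
print(compute_non_kv_cache_memory(6, 2, 3, 10))  # -> 15
```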
/cc @MatthewBonanni PTAL. |
MatthewBonanni
left a comment
Thanks for catching this! I think the correct fix would actually be to just not add cudagraph_memory_estimate_applied to self.peak_activation_memory in the first place
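A sketch of what that alternative could look like (hypothetical class and field names, not vLLM's actual code): track the estimate as a separate field that is never folded into `peak_activation_memory`, so the later sum needs no compensating subtraction.

```python
class MemoryProfilingResult:
    """Hypothetical container illustrating the reviewer's suggestion:
    keep the CUDA graph *estimate* out of peak_activation_memory."""

    def __init__(self, peak_activation_memory: int,
                 cudagraph_memory_estimate: int = 0):
        # Activations only -- the estimate is tracked separately and
        # never mixed into this field.
        self.peak_activation_memory = peak_activation_memory
        self.cudagraph_memory_estimate = cudagraph_memory_estimate

    def non_kv_cache_memory(self, cuda_graph_memory_bytes: int,
                            weights_memory: int) -> int:
        # No subtraction needed: the actual captured size is the only
        # CUDA graph term that ever enters the sum.
        return (self.peak_activation_memory
                + cuda_graph_memory_bytes
                + weights_memory)


GiB = 1 << 30
result = MemoryProfilingResult(4 * GiB, cudagraph_memory_estimate=2 * GiB)
# 4 GiB activations + 1 GiB actual CUDA graphs + 30 GiB weights = 35 GiB
print(result.non_kv_cache_memory(1 * GiB, 30 * GiB) // GiB)  # -> 35
```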
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Hi @panpan0000, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Thank you @MatthewBonanni, all fixed. Please review again. BTW, I used AI to create a process diagram below so that other people can get a better view of the whole calculation.
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io> Signed-off-by: Peter Pan <peter.pan@daocloud.io> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Purpose
Fix twice-counted CUDAGraph memory
When VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1, CUDAGraph memory appears to be counted twice in the KV cache recommendation path.

In determine_available_memory, self.peak_activation_memory already includes the estimated CUDAGraph memory. Later, in compile_or_warm_up_model, non_kv_cache_memory adds self.peak_activation_memory and also adds cuda_graph_memory_bytes (the actual captured CUDAGraph memory), which introduces double counting.

This does not cause a real memory leak or runtime OOM, since this value is only used for logging/recommendation. However, it overestimates non_kv_cache_memory, which leads to a smaller-than-necessary suggested KV cache upper bound.
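A worked example of the effect (illustrative numbers only, not from a real profiling run):

```python
GiB = 1 << 30
peak_activation = 6 * GiB        # already contains the 2 GiB estimate below
cudagraph_estimate = 2 * GiB     # estimate folded into peak_activation
cuda_graph_actual = int(1.5 * GiB)  # actual captured CUDAGraph memory
weights = 30 * GiB

# Buggy: the estimate stays inside peak_activation AND the actual
# captured size is added on top -- the estimate is counted twice.
buggy_non_kv = peak_activation + cuda_graph_actual + weights

# Fixed: remove the estimate before adding the actual capture size.
fixed_non_kv = peak_activation - cudagraph_estimate + cuda_graph_actual + weights

# The overestimate equals exactly the CUDAGraph estimate, which shrinks
# the suggested KV cache upper bound by the same 2 GiB.
assert buggy_non_kv - fixed_non_kv == cudagraph_estimate
print((buggy_non_kv - fixed_non_kv) // GiB)  # -> 2
```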
Haha, while fixing #37420 I accidentally found this problem.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.