Fix redundancy_buffer_memory not taken into account in determine_available_memory() #37420
panpan0000 wants to merge 3 commits into vllm-project:main

Conversation
…memory_bytes Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Code Review
This pull request correctly applies redundancy_buffer_memory, which was previously not accounted for in the memory calculations, to prevent out-of-memory errors. The change introduces a constant for the buffer size and uses it consistently. I've added a comment suggesting that this buffer size should be made configurable to better accommodate different hardware and use cases, as a hardcoded value may not be optimal for all users.
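As a rough illustration of the reviewer's configurability suggestion, here is a hedged sketch of reading the buffer size from an environment variable. The variable name `VLLM_REDUNDANCY_BUFFER_MIB` and the helper are invented for this sketch; they are not part of vLLM.

```python
import os

# Hypothetical sketch only: VLLM_REDUNDANCY_BUFFER_MIB is an invented
# env-var name, not a real vLLM setting. The 150 MiB default mirrors the
# hardcoded value discussed in this PR.
DEFAULT_BUFFER_MIB = 150

def get_redundancy_buffer_bytes() -> int:
    """Read the buffer size in MiB from the environment, defaulting to 150."""
    mib = int(os.environ.get("VLLM_REDUNDANCY_BUFFER_MIB", DEFAULT_BUFFER_MIB))
    return mib * 1024 * 1024

buffer_bytes = get_redundancy_buffer_bytes()
```

A user could then raise or lower the safety margin per deployment instead of patching the constant.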
Hi, @gemini-code-assist

Thanks for the clarification, @panpan0000! I understand that this PR makes an existing

/cc @MatthewBonanni PTAL.
MatthewBonanni
left a comment
I think it would be better to just set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 to get more accurate profiling and, if the OOM still occurs, tune --gpu-memory-utilization. This change would universally decrease the KV cache size by 150 MiB.
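The suggested workaround would look roughly like this on the command line; the model name and the 0.85 utilization value are placeholders, not recommendations from this thread:

```shell
# Enable CUDA-graph memory estimation during profiling (per the comment,
# planned to be on by default in a later release), and lower the GPU
# memory budget if OOM still occurs. Values are illustrative.
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
  vllm serve <model> --gpu-memory-utilization 0.85
```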
Hi, @MatthewBonanni, thank you for bringing the CUDA-graph memory into the profiling stage. (Yes, I saw that the env variable you mentioned will be enabled by default in v0.19; for now we have to set it explicitly.) But our final goal is to protect vLLM from OOM at runtime by allocating a safer partition, either automatically in code or via manual config (like --). OOM can come from multiple independent factors: not only missing CUDA-graph accounting, but also growing PyTorch caching-allocator fragmentation, etc. For example, the fragmentation part changes later at runtime than at the early profiling stage.
So I think my PR fixes the original dead code (the buffer value was never actually used). I know a hard-coded 150 MB is not a good choice, so dynamic estimation at profiling time is a better way, but much more complicated (still a work in progress, and maybe a longer discussion in #37428). Last: yes, subtracting 150 MB from the KV cache globally will reduce throughput, but
Again, fixing this dead code (the buffer) is another layer of protection, together with your fix of counting CUDA-graph memory. What do you think?
I'm not sure that I would agree with your characterization of this as "dead code" -- it seems to me that this was just intended for the log message suggesting an appropriate setting for --kv-cache-memory, which is mutually exclusive with --gpu-memory-utilization (cc @BoyuanFeng, who wrote this). Regardless, I don't think we can make a change like this or #37428 without changing the default --gpu-memory-utilization and notifying users of an impending change (like #30515 did), because people have already tuned --gpu-memory-utilization to their setups. I'm still not sure whether this is the right approach, because --gpu-memory-utilization is designed to cover this.
This pull request has merge conflicts that must be resolved before it can be merged.
Considering the global impact on users who have already tuned their memory allocation: yes, it makes sense that 150 MB would introduce a surprise for them. But I still think that even with your CUDA-graph counting fix and/or manual config, the risk remains; I will do more tests to prove that.
Purpose
Fix #37419
Test Plan
Reproduction
Any configuration running close to the memory limit can trigger the OOM that the buffer was supposed to prevent. Example:
Under load, if PyTorch caching-allocator fragmentation exceeds the implicit margin from 1.0 - gpu_memory_utilization, the process OOMs despite the code claiming to reserve 150 MiB. A previous author (Lee Jie?) created a buffer value redundancy_buffer_memory, but it is NOT used; it is only printed in a debug-message suggestion for --kv-cache-memory=<value>. available_kv_cache_memory_bytes must still leave room for non-KV memory (weights, activations, CUDA graphs, NCCL/driver) and allocator fragmentation (reserved - allocated) during runtime, so the suggestion should reserve the same safety margin; otherwise users can still hit OOM near the limit. I'm working on a more elegant solution instead of a hardcoded 150 MB; stay tuned.
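A minimal sketch of the accounting this PR targets, with invented names and values (this is not vLLM's actual implementation): the profiled budget should have the redundancy buffer subtracted before being handed to the KV-cache allocator, instead of the buffer appearing only in a log suggestion.

```python
# Illustrative sketch; the function signature and constants are
# hypothetical, not vLLM's real determine_available_memory().
REDUNDANCY_BUFFER_MEMORY = 150 * 1024 * 1024  # 150 MiB safety margin

def determine_available_memory(total_gpu_memory: int,
                               gpu_memory_utilization: float,
                               non_kv_memory: int) -> int:
    """Return bytes usable for the KV cache after reserving the buffer."""
    budget = int(total_gpu_memory * gpu_memory_utilization)
    available = budget - non_kv_memory
    # The point of the fix: actually subtract the buffer rather than
    # only printing it in a --kv-cache-memory suggestion.
    available -= REDUNDANCY_BUFFER_MEMORY
    return max(available, 0)

# Example: 80 GiB GPU, 90% utilization, 30 GiB of non-KV memory
# (weights, activations, CUDA graphs, NCCL/driver).
GiB = 1024 ** 3
kv = determine_available_memory(80 * GiB, 0.9, 30 * GiB)
```

With these illustrative numbers the KV cache gets 42 GiB minus the 150 MiB margin, which is exactly the "universally smaller KV cache" trade-off debated above.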
Test Result
You will see the log value Available KV cache memory: become a bit smaller, and the OOM will be relieved.

Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.