
Fix redundancy_buffer_memory not taken account in determine_available_memory()#37420

Open
panpan0000 wants to merge 3 commits into vllm-project:main from panpan0000:fix_redundancy_buffer_memory

Conversation

@panpan0000
Contributor

@panpan0000 panpan0000 commented Mar 18, 2026

Purpose

Fix #37419

Test Plan

Reproduction

Any configuration running close to the memory limit can trigger the OOM that the buffer was supposed to prevent. Example:

vllm serve <model> --gpu-memory-utilization=0.95 --max-num-batched-tokens=65536

Under load, if PyTorch's caching-allocator fragmentation exceeds the implicit margin from 1.0 - gpu_memory_utilization, the process OOMs despite the code claiming to reserve 150 MiB.

A previous contributor (Lee Jie?) created a buffer value, redundancy_buffer_memory, but it is never actually applied; it is only printed in a debug message suggesting a --kv-cache-memory= value.

available_kv_cache_memory_bytes must still leave room for non-KV memory (weights, activations, CUDA graphs, NCCL/driver) and allocator fragmentation (reserved - allocated) at runtime. So the suggestion should reserve the same safety margin; otherwise users can still hit OOM near the limit.
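For illustration, here is a minimal sketch of the arithmetic being described. The function and argument names are hypothetical simplifications, not vLLM's actual determine_available_memory() signature; the point is only that the 150 MiB buffer must be subtracted from the budget rather than merely logged:

```python
# Hedged sketch: subtracting a 150 MiB redundancy buffer when computing the
# KV-cache budget. All names and the flat non_kv_memory input are illustrative
# assumptions, not vLLM's real implementation.
REDUNDANCY_BUFFER_BYTES = 150 * 1024 * 1024  # 150 MiB safety margin

def available_kv_cache_memory(total_gpu_memory: int,
                              gpu_memory_utilization: float,
                              non_kv_memory: int) -> int:
    """KV-cache budget after non-KV usage AND the redundancy buffer."""
    budget = int(total_gpu_memory * gpu_memory_utilization)
    # Before the fix, REDUNDANCY_BUFFER_BYTES was only mentioned in a debug
    # log suggestion for --kv-cache-memory, never subtracted here.
    return max(budget - non_kv_memory - REDUNDANCY_BUFFER_BYTES, 0)
```

For example, on an 80 GiB GPU at --gpu-memory-utilization=0.95 with 10 GiB of non-KV usage, the KV cache shrinks by exactly the 150 MiB buffer compared with the old behavior.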

I'm working on a more elegant solution than a hardcoded 150 MiB; stay tuned.

Test Result

You will see the logged Available KV cache memory: value become a bit smaller, and the OOM is relieved.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…memory_bytes

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 requested a review from njhill as a code owner March 18, 2026 11:44
@mergify mergify bot added the v1 label Mar 18, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly applies the redundancy_buffer_memory to prevent out-of-memory errors, which was previously not accounted for in memory calculations. The change introduces a constant for the buffer size and uses it consistently. I've added a comment suggesting that this buffer size should be made configurable to better accommodate different hardware and use cases, as a hardcoded value may not be optimal for all users.

@panpan0000
Contributor Author

Code Review

This pull request correctly applies the redundancy_buffer_memory to prevent out-of-memory errors, which was previously not accounted for in memory calculations. The change introduces a constant for the buffer size and uses it consistently. I've added a comment suggesting that this buffer size should be made configurable to better accommodate different hardware and use cases, as a hardcoded value may not be optimal for all users.

Hi, @gemini-code-assist
This is a pre-existing hardcoded value; I'm just turning dead code into code that takes effect.

@gemini-code-assist
Contributor

Thanks for the clarification, @panpan0000! I understand that this PR makes an existing redundancy_buffer_memory value effective, rather than introducing a new hardcoded one. It's great to hear that you're already working on a more elegant and configurable solution. Making this buffer size configurable will indeed be very beneficial for users with diverse hardware configurations and use cases, allowing for optimal memory utilization. I look forward to seeing that improvement!

@chaunceyjiang
Collaborator

/cc @MatthewBonanni PTAL.

Collaborator

@MatthewBonanni MatthewBonanni left a comment


I think it would be better to just set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 to get more accurate profiling, and, if the OOM still occurs, tune --gpu-memory-utilization. This change would universally decrease KV cache size by 150 MiB

@panpan0000
Contributor Author

panpan0000 commented Mar 20, 2026

I think it would be better to just set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 to get more accurate profiling, and, if the OOM still occurs, tune --gpu-memory-utilization. This change would universally decrease KV cache size by 150 MiB

Hi, @MatthewBonanni, thank you for bringing CUDA-graph memory into the profiling stage. (Yes, I saw that the env variable you mentioned will be enabled by default in v0.19; for now we have to set it explicitly.)

But

Our final goal is to protect vLLM from OOM at runtime by allocating a safer partition, either through automatic setup in code or manual configuration (like --gpu-memory-utilization, etc.), and eventually leaning more "auto" than "manual". (As you know, in production deployments, a runtime OOM under stress degrades the service SLO.)

--

OOM can come from multiple independent factors: not only missing CUDA-graph accounting, but also growth in PyTorch caching-allocator fragmentation, etc.

For example, the fragmentation component changes later at runtime, well after the early profiling stage, as in my test (more details in #37428):

| Workload | reserved - allocated (fragmentation) |
| --- | --- |
| Profiling | 1219 MiB |
| Profiling repeated 3~4 times | no change |
| batch = 10 | 767 MiB |
| batch = 20 | 1825 MiB |
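The reserved-minus-allocated quantity above can be measured directly. A small sketch follows; the byte arithmetic is separated from the CUDA query so it can be checked on any machine, while cuda_fragmentation_mib assumes PyTorch's torch.cuda.memory_reserved / torch.cuda.memory_allocated APIs and a GPU:

```python
# Sketch of measuring caching-allocator fragmentation (reserved - allocated),
# the quantity tabulated above.

def fragmentation_mib(reserved_bytes: int, allocated_bytes: int) -> float:
    """Bytes held by the caching allocator but not backing live tensors, in MiB."""
    return (reserved_bytes - allocated_bytes) / (1024 ** 2)

def cuda_fragmentation_mib(device: int = 0) -> float:
    """Current fragmentation on a CUDA device; requires torch and a GPU."""
    import torch  # deferred import so fragmentation_mib works without torch
    return fragmentation_mib(torch.cuda.memory_reserved(device),
                             torch.cuda.memory_allocated(device))
```

Sampling this during profiling versus under batched load is how the table above was produced (per the description; the helper itself is an illustrative sketch, not code from the PR).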

So I think the old redundancy_buffer_memory (150 MiB) code is a last-resort guardrail / gatekeeper.

My PR fixes that original dead code (it was never used).

I know a hardcoded 150 MiB is not a good choice; dynamic estimation during profiling would be better but is much more complicated (still WIP, and perhaps a longer discussion in #37428).

Last, yes, globally subtracting 150 MiB from the KV cache will reduce throughput, but:

  • the code has been there for a reason, right? This just wakes it up.
  • most people prefer the stability/SLO default (safer, no OOM) over peak throughput (often only needed for benchmarking); those chasing best throughput can manually tune --kv-cache-memory or --gpu-memory-utilization (as I said before, "default" for production, "manual tune" for hackers :-) )

Again, fixing this dead code (the buffer) is another layer of protection, alongside your fix that accounts for CUDA-graph memory.

What do you think?

Collaborator

@MatthewBonanni MatthewBonanni left a comment


I'm not sure that I would agree with your characterization of this as "dead code" -- it seems to me that this was just intended for the log message suggesting an appropriate setting for --kv-cache-memory, which is mutually exclusive with --gpu-memory-utilization (cc @BoyuanFeng, who wrote this). Regardless, I don't think we can make a change like this or #37428 without changing the default --gpu-memory-utilization and notifying users of an impending change (like #30515 did), because people have already tuned --gpu-memory-utilization to their setups. I'm still not sure whether this is the right approach because --gpu-memory-utilization is designed to cover this

@mergify

mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panpan0000.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
@panpan0000
Contributor Author

I'm not sure that I would agree with your characterization of this as "dead code" -- it seems to me that this was just intended for the log message suggesting an appropriate setting for --kv-cache-memory, which is mutually exclusive with --gpu-memory-utilization (cc @BoyuanFeng, who wrote this). Regardless, I don't think we can make a change like this or #37428 without changing the default --gpu-memory-utilization and notifying users of an impending change (like #30515 did), because people have already tuned --gpu-memory-utilization to their setups. I'm still not sure whether this is the right approach because --gpu-memory-utilization is designed to cover this

Considering the global impact on users who have already tuned their memory allocation, yes, it makes sense that 150 MiB would be a surprise to them.

But I still think that even with your CUDA-graph accounting fix and/or a manual --gpu-memory-utilization, there will still be OOM risk at runtime (e.g., set --gpu-memory-utilization to 0.93: it boots fine and handles small loads, but OOMs under high load).

I will do more test to prove that.



Development

Successfully merging this pull request may close these issues.

[Bug]: redundancy_buffer_memory is Never really used

3 participants