
Fix redundancy_buffer_memory not taken account in determine_available_memory()#37420

Open
panpan0000 wants to merge 3 commits into vllm-project:main from panpan0000:fix_redundancy_buffer_memory

Conversation

@panpan0000
Contributor

@panpan0000 panpan0000 commented Mar 18, 2026

Purpose

Fix #37419

Test Plan

Reproduction

Any configuration running close to the memory limit can trigger the OOM that the buffer was supposed to prevent. Example:

vllm serve <model> --gpu-memory-utilization=0.95 --max-num-batched-tokens=65536

Under load, if PyTorch's caching-allocator fragmentation exceeds the implicit margin from 1.0 - gpu_memory_utilization, the process OOMs despite the code claiming to reserve 150 MiB.

A previous contributor (Lee Jie?) created a buffer value, redundancy_buffer_memory, but it is never actually applied; it is only printed in a debug message suggesting a --kv-cache-memory= value.

available_kv_cache_memory_bytes must still leave room for non-KV memory (weights, activations, CUDA graphs, NCCL/driver) and allocator fragmentation (reserved - allocated) at runtime. So the suggestion should reserve the same safety margin; otherwise users can still hit OOM near the limit.
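For illustration, here is a minimal sketch of the arithmetic being described. The function and argument names are hypothetical simplifications, not vLLM's actual determine_available_memory() signature; the point is only that the 150 MiB buffer must be subtracted from the budget rather than merely logged:

```python
# Hedged sketch: subtracting a 150 MiB redundancy buffer when computing the
# KV-cache budget. All names and the flat non_kv_memory input are illustrative
# assumptions, not vLLM's real implementation.
REDUNDANCY_BUFFER_BYTES = 150 * 1024 * 1024  # 150 MiB safety margin

def available_kv_cache_memory(total_gpu_memory: int,
                              gpu_memory_utilization: float,
                              non_kv_memory: int) -> int:
    """KV-cache budget after non-KV usage AND the redundancy buffer."""
    budget = int(total_gpu_memory * gpu_memory_utilization)
    # Before the fix, REDUNDANCY_BUFFER_BYTES was only mentioned in a debug
    # log suggestion for --kv-cache-memory, never subtracted here.
    return max(budget - non_kv_memory - REDUNDANCY_BUFFER_BYTES, 0)
```

For example, on an 80 GiB GPU at --gpu-memory-utilization=0.95 with 10 GiB of non-KV usage, the KV cache shrinks by exactly the 150 MiB buffer compared with the old behavior.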

I'm working on a more elegant solution than a hardcoded 150 MiB; stay tuned.

Test Result

You will see the logged Available KV cache memory: value become a bit smaller, and the OOM is relieved.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…memory_bytes

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@panpan0000 panpan0000 requested a review from njhill as a code owner March 18, 2026 11:44
@mergify mergify bot added the v1 label Mar 18, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly applies the redundancy_buffer_memory to prevent out-of-memory errors, which was previously not accounted for in memory calculations. The change introduces a constant for the buffer size and uses it consistently. I've added a comment suggesting that this buffer size should be made configurable to better accommodate different hardware and use cases, as a hardcoded value may not be optimal for all users.

@panpan0000
Contributor Author

Code Review

This pull request correctly applies the redundancy_buffer_memory to prevent out-of-memory errors, which was previously not accounted for in memory calculations. The change introduces a constant for the buffer size and uses it consistently. I've added a comment suggesting that this buffer size should be made configurable to better accommodate different hardware and use cases, as a hardcoded value may not be optimal for all users.

Hi, @gemini-code-assist
This is a pre-existing hardcoded value; I'm just turning dead code into code that takes effect.

@gemini-code-assist
Contributor

Thanks for the clarification, @panpan0000! I understand that this PR makes an existing redundancy_buffer_memory value effective, rather than introducing a new hardcoded one. It's great to hear that you're already working on a more elegant and configurable solution. Making this buffer size configurable will indeed be very beneficial for users with diverse hardware configurations and use cases, allowing for optimal memory utilization. I look forward to seeing that improvement!

@chaunceyjiang
Collaborator

/cc @MatthewBonanni PTAL.

Collaborator

@MatthewBonanni MatthewBonanni left a comment


I think it would be better to just set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 to get more accurate profiling, and, if the OOM still occurs, tune --gpu-memory-utilization. This change would universally decrease KV cache size by 150 MiB

@panpan0000
Contributor Author

panpan0000 commented Mar 20, 2026

I think it would be better to just set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 to get more accurate profiling, and, if the OOM still occurs, tune --gpu-memory-utilization. This change would universally decrease KV cache size by 150 MiB

Hi, @MatthewBonanni, thank you for bringing CUDA-graph memory into the profiling stage. (Yes, I saw that the env variable you mentioned will be enabled by default in v0.19; for now we have to set it explicitly.)

But

Our final goal is to protect vLLM from OOM at runtime by allocating a safer partition, either through automatic setup in code or manual configuration (like --gpu-memory-utilization, etc.), and eventually leaning more "auto" than "manual". (As you know, in production deployments, a runtime OOM under stress degrades the service SLO.)

--

OOM can come from multiple independent factors: not only missing CUDA-graph accounting, but also growth in PyTorch caching-allocator fragmentation, etc.

For example, the fragmentation component changes later at runtime, well after the early profiling stage, as in my test (more details in #37428):

| Workload | reserved - allocated (fragmentation) |
| --- | --- |
| Profiling | 1219 MiB |
| Profiling repeated 3~4 times | no change |
| batch = 10 | 767 MiB |
| batch = 20 | 1825 MiB |
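The reserved-minus-allocated quantity above can be measured directly. A small sketch follows; the byte arithmetic is separated from the CUDA query so it can be checked on any machine, while cuda_fragmentation_mib assumes PyTorch's torch.cuda.memory_reserved / torch.cuda.memory_allocated APIs and a GPU:

```python
# Sketch of measuring caching-allocator fragmentation (reserved - allocated),
# the quantity tabulated above.

def fragmentation_mib(reserved_bytes: int, allocated_bytes: int) -> float:
    """Bytes held by the caching allocator but not backing live tensors, in MiB."""
    return (reserved_bytes - allocated_bytes) / (1024 ** 2)

def cuda_fragmentation_mib(device: int = 0) -> float:
    """Current fragmentation on a CUDA device; requires torch and a GPU."""
    import torch  # deferred import so fragmentation_mib works without torch
    return fragmentation_mib(torch.cuda.memory_reserved(device),
                             torch.cuda.memory_allocated(device))
```

Sampling this during profiling versus under batched load is how the table above was produced (per the description; the helper itself is an illustrative sketch, not code from the PR).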

So I think the old redundancy_buffer_memory (150 MiB) code is a last-resort guardrail / gatekeeper.

My PR fixes that original dead code (it was never used).

I know a hardcoded 150 MiB is not a good choice; dynamic estimation during profiling would be better but is much more complicated (still WIP, and perhaps a longer discussion in #37428).

Last, yes, globally subtracting 150 MiB from the KV cache will reduce throughput, but:

  • the code has been there for a reason, right? This just wakes it up.
  • most people prefer the stability/SLO default (safer, no OOM) over peak throughput (often only needed for benchmarking); those chasing best throughput can manually tune --kv-cache-memory or --gpu-memory-utilization (as I said before, "default" for production, "manual tune" for hackers :-) )

Again, fixing this dead code (the buffer) is another layer of protection, alongside your fix that accounts for CUDA-graph memory.

What do you think?

Collaborator

@MatthewBonanni MatthewBonanni left a comment


I'm not sure that I would agree with your characterization of this as "dead code" -- it seems to me that this was just intended for the log message suggesting an appropriate setting for --kv-cache-memory, which is mutually exclusive with --gpu-memory-utilization (cc @BoyuanFeng, who wrote this). Regardless, I don't think we can make a change like this or #37428 without changing the default --gpu-memory-utilization and notifying users of an impending change (like #30515 did), because people have already tuned --gpu-memory-utilization to their setups. I'm still not sure whether this is the right approach because --gpu-memory-utilization is designed to cover this

@mergify

mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @panpan0000.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
@panpan0000
Contributor Author

I'm not sure that I would agree with your characterization of this as "dead code" -- it seems to me that this was just intended for the log message suggesting an appropriate setting for --kv-cache-memory, which is mutually exclusive with --gpu-memory-utilization (cc @BoyuanFeng, who wrote this). Regardless, I don't think we can make a change like this or #37428 without changing the default --gpu-memory-utilization and notifying users of an impending change (like #30515 did), because people have already tuned --gpu-memory-utilization to their setups. I'm still not sure whether this is the right approach because --gpu-memory-utilization is designed to cover this

Considering the global impact on users who have already tuned their memory allocation, yes, it makes sense that 150 MiB would be a surprise to them.

But I still think that even with your CUDA-graph accounting fix and/or a manual --gpu-memory-utilization, there will still be OOM risk at runtime (e.g., set --gpu-memory-utilization to 0.93: it boots fine and handles small loads, but OOMs under high load).

I will do more test to prove that.



Development

Successfully merging this pull request may close these issues.

[Bug]: redundancy_buffer_memory is Never really used

3 participants