[W.I.P] fragmentation_buffer in profiling #37428
panpan0000 wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a dynamic fragmentation buffer to prevent out-of-memory errors during model execution. The buffer is calculated based on memory fragmentation observed during profiling, providing a more robust safety margin than the previous hardcoded value. The changes correctly apply this buffer to both automatic KV cache allocation and the suggested value for manual configuration. My main feedback is to refactor the duplicated magic number for the minimum buffer size into a constant to improve maintainability and ensure consistency between the two code paths.
```python
self.fragmentation_buffer = max(
    150 * (1 << 20),
    int(measured_fragmentation * 2),
)
```
The value `150 * (1 << 20)` is a magic number. To improve readability and maintainability, it should be defined as a constant. This value is also duplicated in `compile_or_warm_up_model`, making it prone to inconsistencies if updated in only one place. Please define it as a shared constant (e.g., `_DEFAULT_FRAGMENTATION_BUFFER_BYTES`) and use it in both locations.
```python
redundancy_buffer_memory = getattr(
    self, "fragmentation_buffer", 150 * (1 << 20)
)
```
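A minimal sketch of the refactor the review asks for. The constant name comes from the review itself; `compute_fragmentation_buffer`, `fallback_buffer`, and the `worker` parameter are illustrative names, not the actual vLLM code:

```python
# Shared constant suggested by the review, so both code paths stay in sync.
_DEFAULT_FRAGMENTATION_BUFFER_BYTES = 150 * (1 << 20)  # 150 MiB floor


def compute_fragmentation_buffer(measured_fragmentation: int) -> int:
    """Safety buffer: at least 150 MiB, otherwise 2x the measured fragmentation."""
    return max(_DEFAULT_FRAGMENTATION_BUFFER_BYTES, int(measured_fragmentation * 2))


def fallback_buffer(worker) -> int:
    """Mirror of the getattr() fallback path, reusing the same constant."""
    return getattr(worker, "fragmentation_buffer", _DEFAULT_FRAGMENTATION_BUFFER_BYTES)
```

With this, changing the default in one place updates both the buffer computation and the warm-up fallback.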
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Force-pushed from 5b3bfa1 to 2b3b885
This pull request has merge conflicts that must be resolved before it can be merged.
STILL UNDER TESTING.......
Purpose
vLLM may still hit OOM under high load: the warmup and profiling pass at startup does its best to size memory, but a gap remains.
Sometimes users want the highest possible throughput (e.g. for benchmarking) and therefore as large a KV-cache space as possible.
But most of the time we operate under SLO/KPI constraints, where we need vLLM to stay alive rather than OOM: alive first, throughput second.
This PR adds a fragmentation-aware safety buffer to KV-cache budgeting, instead of just a hardcoded 150 MiB,
as another fix for #37420.
Background
(1) Why OOM?
Profiling runs in a clean allocator state, so memory looks better than it does during real serving later. Fragmentation grows after some mixed batches and worsens at higher batch sizes. A forward step may then need one large contiguous block (for example, ~384 MiB) while only smaller pieces are available (e.g. 329.87 MiB in total), so CUDA OOM happens.
(2) What's fragmentation?
Fragmentation is `reserved - allocated`: memory reserved by PyTorch's caching allocator but not used by live tensors. This free space is often split into many small blocks of different sizes in the allocator pool. The bytes are there, but not as one contiguous block, so a large activation allocation can still fail. In short: fragmentation = reserved - allocated = the free blocks in the pool.
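To make the definition concrete, here is a small sketch. On a real GPU the two inputs would come from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`; the helper function itself is illustrative:

```python
def fragmentation_bytes(reserved: int, allocated: int) -> int:
    """Fragmentation = reserved - allocated: bytes held by the caching
    allocator that no live tensor is using. On a CUDA device, pass
    torch.cuda.memory_reserved() and torch.cuda.memory_allocated() here."""
    return max(0, reserved - allocated)


MiB = 1 << 20
# Illustrative numbers matching the ~1219 MiB of fragmentation observed
# right after profiling (see the Fragment Observation table below).
frag = fragmentation_bytes(20_000 * MiB, 18_781 * MiB)
print(frag // MiB)  # 1219
```

Note that these bytes are reusable by PyTorch, just not guaranteed to be contiguous, which is exactly why a single large activation allocation can still fail.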
Proposed Change
Co-authored with AI.
1. `determine_available_memory()` records the fragmentation.
2. Multiply it by a 1~2x factor and subtract the result from the KV-cache space.
Cost: the KV cache shrinks a little. Negligible throughput impact.
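A hedged sketch of the two steps above. The function and parameter names are illustrative; in the actual patch the measurement happens in `determine_available_memory()` and the floor/factor match the diff shown in the review:

```python
MiB = 1 << 20


def kv_cache_budget(free_after_profiling: int,
                    measured_fragmentation: int,
                    factor: float = 2.0) -> int:
    """Subtract a fragmentation-aware buffer from the KV-cache budget.

    factor is the 1~2x multiplier from the proposal; 150 MiB is the
    floor kept from the old hardcoded value."""
    buffer = max(150 * MiB, int(measured_fragmentation * factor))
    return max(0, free_after_profiling - buffer)


# With 40 GiB free after profiling and 1219 MiB of measured fragmentation,
# the KV cache gives up 2438 MiB instead of the old fixed 150 MiB.
budget = kv_cache_budget(40 * 1024 * MiB, 1219 * MiB)
```

The trade-off is explicit here: a few hundred extra MiB of headroom in exchange for not OOMing once the allocator pool fragments under real traffic.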
Abandoned Solutions
1) Try-catch `torch.cuda.OutOfMemoryError` with graceful recovery
Proposal: Wrap `execute_model()` in a try-catch. On OOM, call `empty_cache()`, reduce the token budget, and retry.
Why rejected:
2) Runtime pre-execution memory check
Proposal: Before each `execute_model()`, call `torch.cuda.mem_get_info()` to check free memory. If insufficient, signal the scheduler to reduce the batch.
Why rejected:
- `mem_get_info()` returns free memory from the CUDA driver's perspective, but PyTorch's caching allocator holds reserved-but-reusable blocks that appear "used" to CUDA. This makes the check overly conservative, rejecting batches that would fit fine.
- `mem_get_info()` is a CUDA driver API call that may trigger CPU-GPU synchronization, breaking the async scheduling pipeline in vLLM v1 and adding latency to every forward step.
- Per-step activation need (`activation_per_token * num_tokens`) is not linear: torch.compile/inductor generates different buffer patterns for different symbolic shape combinations.
3) Control token budget according to real-time memory usage
Proposal: Periodically check GPU memory usage and dynamically scale `max_num_scheduled_tokens`.
Why rejected:
- Same `mem_get_info()` inaccuracy problem as above (CUDA free ≠ PyTorch available).
4) Percentage buffer as `total_gpu_memory * X%`
Example: `redundancy_buffer_memory = max(150 MiB, int(total_memory * 0.02))`
Why rejected:
5) Make `max_concurrency` a hard admission cap instead of just a log message
Proposal: The logged "Maximum concurrency for 10,000 tokens per request: 14.79x" should be enforced as a hard limit on concurrent requests.
Why rejected:
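The "CUDA free ≠ PyTorch available" mismatch behind rejections 2 and 3 can be made concrete. This is a sketch; on a device, `driver_free` would come from `torch.cuda.mem_get_info()` and the other two from the allocator stats:

```python
def pytorch_available_bytes(driver_free: int, reserved: int, allocated: int) -> int:
    """Memory PyTorch can actually serve: what the driver reports as free,
    plus reserved-but-unallocated blocks the caching allocator can reuse
    (though not necessarily as one contiguous region)."""
    return driver_free + (reserved - allocated)


MiB = 1 << 20
# Illustrative: the driver reports only 500 MiB free, but 1219 MiB of
# reserved blocks are reusable, so a naive mem_get_info() check would
# reject batches that ~1.7 GiB of allocator capacity could serve.
avail = pytorch_available_bytes(500 * MiB, 20_000 * MiB, 18_781 * MiB)
```

This is why a pre-execution `mem_get_info()` gate is overly conservative rather than merely imprecise.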
Fragment Observation
| Workload | Fragmentation size |
| --- | --- |
| Profiling | 1219 MiB |
| Profiling repeated 3~4 times | no change |
| `--num-prompts 10` (low concurrency) | 767 MiB |
| `--num-prompts 20` (high concurrency) | 1825 MiB |
Test Plan
Test Result