[Attention][TurboQuant] Share decode buffers across layers to fix OOM #40655

bhoomit wants to merge 3 commits
Conversation
Code Review
This pull request optimizes memory usage by sharing TurboQuant decode buffers across attention layers, reducing allocation from O(num_layers) to O(1). Feedback indicates a need for shape validation when reusing buffers to prevent crashes and notes that removing register_buffer may cause the memory profiler to underestimate GPU footprint, potentially leading to OOM. Additionally, using class-level attributes for sharing may cause conflicts between multiple model instances in the same process.
```python
# Check if another layer already migrated the shared buffers
shared = getattr(Attention, "_tq_shared_mid_o_buf", None)
if shared is not None and shared.device == query.device:
```
The reuse logic for shared buffers is missing a shape validation check. If a model contains layers with different configurations (e.g., varying head counts), Attention._tq_shared_mid_o_buf might hold a tensor with dimensions incompatible with the current layer's requirements. Reusing it without verifying that shared.shape == mid_o_buf.shape will lead to memory corruption or runtime crashes during Triton kernel execution. Additionally, the migration logic should ensure that mid_o_buf, output_buf, and lse_buf are always migrated and updated as a consistent set to avoid mixed-device tensor errors.
```diff
-if shared is not None and shared.device == query.device:
+if shared is not None and shared.device == query.device and shared.shape == mid_o_buf.shape:
```
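A minimal sketch of validating and (re)building the three buffers as one unit, as the comment above suggests. The `_tq_shared_output_buf`/`_tq_shared_lse_buf` names, dtypes, and shapes are assumptions; `query`, `B`, `Hq`, `S`, and `D` are taken from the surrounding hunk:

```python
# Validate the full shared set against this layer's requirements; rebuild
# all three together so no layer ever sees a mixed-device or stale set.
shared_mid = getattr(Attention, "_tq_shared_mid_o_buf", None)
compatible = (
    shared_mid is not None
    and shared_mid.device == query.device
    and shared_mid.shape == (B, Hq, S, D + 1)
)
if not compatible:
    Attention._tq_shared_mid_o_buf = torch.empty(
        B, Hq, S, D + 1, dtype=torch.float32, device=query.device
    )
    Attention._tq_shared_output_buf = torch.empty(
        B, Hq, D, dtype=query.dtype, device=query.device
    )
    Attention._tq_shared_lse_buf = torch.empty(
        B, Hq, S, dtype=torch.float32, device=query.device
    )
mid_o_buf = Attention._tq_shared_mid_o_buf
output_buf = Attention._tq_shared_output_buf
lse_buf = Attention._tq_shared_lse_buf
```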
```python
# Share decode intermediate buffers across all TQ attention layers.
# Layers execute sequentially so one set of buffers is sufficient.
# This reduces memory from O(num_layers) to O(1) — saving ~11 GiB
# for a 64-layer model like Qwen2.5-32B.
```
The removal of register_buffer for these intermediate buffers bypasses vLLM's memory profiler. Since these tensors are now lazily moved to the GPU during the first decode, the profiler (which runs earlier) will not account for this memory usage in its GPU footprint estimation. This can lead to Out-of-Memory (OOM) errors if the KV cache manager allocates all remaining GPU memory based on an underestimated model footprint. While sharing tensors across layers is necessary for memory efficiency, consider a mechanism to ensure these shared buffers are either moved to the device before profiling or explicitly accounted for in the memory budget.
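One possible mitigation, as a sketch only (the function and its call site are hypothetical, not vLLM's actual profiling hook): allocate the shared scratch eagerly on the target device during model initialization, before the profiler runs, so the memory shows up in the measured footprint:

```python
import torch

def warm_up_tq_shared_buffers(
    device: torch.device, B: int, Hq: int, S: int, D: int
) -> None:
    # Materialize the shared scratch before vLLM profiles GPU usage, so the
    # KV-cache budget is computed against the true model footprint.
    if getattr(Attention, "_tq_shared_mid_o_buf", None) is None:
        Attention._tq_shared_mid_o_buf = torch.empty(
            B, Hq, S, D + 1, dtype=torch.float32, device=device
        )
```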
```python
if not hasattr(
    Attention, "_tq_shared_mid_o_buf"
) or Attention._tq_shared_mid_o_buf.shape != (B, Hq, S, D + 1):
```
Using a single class-level attribute Attention._tq_shared_mid_o_buf to share buffers is risky when multiple model instances exist in the same process (e.g., during unit testing or multi-model serving). A second model instance with a different configuration will overwrite the class attribute, potentially causing the first model instance to use incorrectly shaped or stale buffers. It is recommended to store shared buffers in a more instance-aware context, such as a dictionary keyed by the model configuration or within the VllmConfig object, to ensure isolation between different model instances.
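A sketch of one instance-aware alternative: a module-level registry keyed by device and shape (the names here are hypothetical), so two models with different configs in the same process get separate entries instead of clobbering one slot:

```python
import torch

# Registry keyed by (device, config shape); still O(1) per config rather
# than O(num_layers), but isolated between differently shaped models.
_TQ_SHARED_MID_O: dict[tuple, torch.Tensor] = {}

def get_shared_mid_o_buf(
    device: torch.device, B: int, Hq: int, S: int, D: int
) -> torch.Tensor:
    key = (device, B, Hq, S, D)
    buf = _TQ_SHARED_MID_O.get(key)
    if buf is None:
        buf = torch.empty(B, Hq, S, D + 1, dtype=torch.float32, device=device)
        _TQ_SHARED_MID_O[key] = buf
    return buf
```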
(force-pushed c08f3fb to 43fd4da)
It seems reusing the buffer would break pipeline parallelism, so could you guard against that case or fail?
@mgoin thanks for raising the concern. Are you sure PP would break? Doesn't PP have its own separate processes (each rank is a separate worker process)? Each process only has the layers assigned to its PP stage, so the class-level `Attention._tq_shared_mid_o_buf` is per-process, and layers within a single PP rank still execute sequentially.
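A toy illustration of the process-isolation argument (a standalone script, not vLLM code): each spawned worker, like a PP rank, sees its own copy of the class attribute:

```python
import multiprocessing as mp

class Attention:
    _tq_shared_mid_o_buf = None  # class-level, but still per-process

def worker(rank: int) -> None:
    Attention._tq_shared_mid_o_buf = f"buffer-for-rank-{rank}"
    print(f"rank {rank}: {Attention._tq_shared_mid_o_buf}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # mirrors one-interpreter-per-rank workers
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent never sees the workers' writes.
    print(f"parent: {Attention._tq_shared_mid_o_buf}")  # -> parent: None
```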
@mgoin I opened #40706 with a different fix for the same memory issue.
(force-pushed 43fd4da to 6f8646a)
I think the main concern is not only PP or the process boundary. The bigger issue is that these buffers are temporary decode scratch, not real model state. So keeping them as class-level shared state on `Attention` mixes scratch resources into the generic layer, and vLLM already has `WorkspaceManager` for exactly this kind of transient allocation. So my concern is more about resource ownership and long-term maintenance, not only whether PP happens to work today.
@mgoin Updated with PP>1 testing and a unit test.

PP is safe here because each rank runs in its own worker process with its own Python interpreter — the class-level attribute is process-local by definition. @lesj0610 thanks for the comment, I see your approach in #40706, which routes TQ buffers through `WorkspaceManager`.
Self-contained is understandable, but my concern is still resource ownership. These buffers are only temporary decode scratch, not real model state, so keeping them as class-level shared state on `Attention` still seems like the wrong owner for this resource.
Can we replace this with `current_workspace_manager().get_simultaneous(...)` (you can search the codebase for example usages) in `TurboQuantAttentionImpl._decode_attention`? I don't think there should be any TurboQuant-specific stuff in `vllm/model_executor/layers/attention/attention.py` (ideally we would remove `_init_turboquant_buffers` completely).
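A directional sketch of that suggestion. The `get_simultaneous()` signature and the argument convention below are assumptions, not vLLM's verified API; search the codebase for real call sites as noted above:

```python
# Inside TurboQuantAttentionImpl._decode_attention (sketch):
mgr = current_workspace_manager()
mid_o_buf, output_buf, lse_buf = mgr.get_simultaneous(
    ((B, Hq, S, D + 1), torch.float32),
    ((B, Hq, D), query.dtype),
    ((B, Hq, S), torch.float32),
)
# Workspace-backed scratch is sized once, shared across layers, and owned by
# vLLM's workspace infrastructure rather than the generic Attention class.
```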
(force-pushed c0a8fdb to 8a066fe)
@LucasWilkinson addressed your comment. The change also incorporates the suggestions from @lesj0610. Thank you :)
(force-pushed f1801e7 to 29a1afd)
Replace per-layer `register_buffer()` allocation of TQ decode scratch buffers (`_tq_mid_o_buf`, `_tq_output_buf`, `_tq_lse_buf`) with a single class-level shared set on `Attention`. Transformer layers execute sequentially, so one set of buffers is sufficient.

For Qwen3-32B (64 layers, 64 heads, head_dim=128):
- Before: 123.39 GiB model load (OOM on a single H200)
- After: 62.07 GiB model load, 588K KV tokens, 52.7 tok/s

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
…ove _init_turboquant_buffers

Move TQ decode scratch buffers from per-layer `register_buffer` in `attention.py` to `WorkspaceManager.get_simultaneous()` in the TQ backend. Move centroids init into `_ensure_on_device` for lazy initialization. This removes all TQ-specific code from the generic `Attention` class.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
Existing lm_eval TQ tests and the `test_turboquant.py` store/decode roundtrip already cover TurboQuant correctness end-to-end.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
(force-pushed 1d1185b to 831a841)
Builds on top of the shared-decode-buffer PR (vllm-project#40655). Reduces continuation-prefill memory by 60x (shared dequant buffers via WorkspaceManager), eliminates the redundant float16_copy kernel in decode (stage2 OUTPUT_FP16), and removes unnecessary memory operations (`.contiguous()`, `.zero_()`, fp32 rotation).

Changes:
1. Share dequant buffers across layers via WorkspaceManager (saves 57 GB at 1M context, required for CUDA Graph capture)
2. Remove `.zero_()` on buffers whose trailing positions are never read
3. fp16 Hadamard rotation (2x less bandwidth, uses fp16 tensor cores)
4. Pre-sized K/V buffer (eliminates `.contiguous()` + `torch.cat`); see the sketch after this list
5. Cache `arange` and `cu_seqlens` (speculative decode hot path)
6. `OUTPUT_FP16` constexpr in stage2 kernel (eliminates `float16_copy`)
7. Fix: `output_buf` dtype mismatch in shared-buffer decode path

Memory savings (32B model, TP=4, 60 TQ layers):
- 8K context: 472 MB saved
- 128K context: 7.4 GB saved
- 1M context: 57.6 GB saved

No accuracy impact. 117 existing TQ unit tests pass.

Signed-off-by: Vasani Bhoomit <bhoomit.2010@gmail.com>
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
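A toy illustration of change 4 (pre-sized K/V buffer), independent of the actual TQ kernels; the names and shapes are hypothetical. Writing chunks into a pre-allocated buffer avoids the fresh allocation and copy that `torch.cat(...)` plus `.contiguous()` incur on every call:

```python
import torch

def pack_kv_cat(chunks: list[torch.Tensor]) -> torch.Tensor:
    # Before: allocates a new tensor and copies every chunk, every call.
    return torch.cat(chunks, dim=0).contiguous()

def pack_kv_presized(chunks: list[torch.Tensor], buf: torch.Tensor) -> torch.Tensor:
    # After: copy each chunk into a persistent pre-sized buffer; no cat,
    # no extra allocation, and the result is contiguous by construction.
    off = 0
    for c in chunks:
        n = c.shape[0]
        buf[off:off + n].copy_(c)
        off += n
    return buf[:off]
```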
Code merged with follow-up PR #40941. Going to close this and use it as documentation of the improvements.
[Attention][TurboQuant] Share decode buffers across layers to fix OOM

Summary

TurboQuant's per-layer decode buffer pre-allocation via `register_buffer()` consumes O(num_layers) GPU memory — ~16 GiB direct for Qwen3-32B, but ~61 GiB total with allocator fragmentation from 180 registered buffers. This patch removes the per-layer buffers entirely and uses vLLM's `WorkspaceManager` to share a single set of scratch buffers across all layers at decode time.

Inspired by #40706 (@lesj0610), which identified the same problem and proposed using `WorkspaceManager`. Per @LucasWilkinson's review suggestion, this PR additionally removes `_init_turboquant_buffers` from `attention.py` completely, moving centroids initialization into the TQ backend's lazy `_ensure_on_device` path — keeping all TQ-specific code self-contained.

Impact
Problem
TurboQuant PR #38479 introduced per-layer pre-allocation of intermediate decode buffers in `_init_turboquant_buffers()`:
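A minimal sketch of the per-layer allocation pattern, reconstructed from the buffer names in the commit messages and the `(B, Hq, S, D + 1)` shape visible in the review hunks; the dtypes and the exact `_tq_output_buf`/`_tq_lse_buf` shapes are assumptions:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def _init_turboquant_buffers(self, B: int, Hq: int, S: int, D: int):
        # One scratch set registered PER LAYER: B=max_num_seqs, Hq=query
        # heads, S=tq_max_kv_splits_for_cuda_graph, D=head_dim.
        self.register_buffer(
            "_tq_mid_o_buf",
            torch.empty(B, Hq, S, D + 1, dtype=torch.float32),
            persistent=False,
        )
        self.register_buffer(
            "_tq_output_buf",
            torch.empty(B, Hq, D, dtype=torch.float16),
            persistent=False,
        )
        self.register_buffer(
            "_tq_lse_buf",
            torch.empty(B, Hq, S, dtype=torch.float32),
            persistent=False,
        )

# With B=256, Hq=64, S=32, D=128, the fp32 mid_o buffer alone is
# 256 * 64 * 32 * 129 * 4 bytes ≈ 258 MiB, i.e. the bulk of the quoted
# ~278 MiB/layer; times 60 TQ layers that is ~16 GiB before fragmentation.
```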
For Qwen3-32B (64 layers, 64 query heads, head_dim=128) with default `max_num_seqs=256` and `tq_max_kv_splits_for_cuda_graph=32`, this allocates ~278 MiB per layer × 60 TQ layers = ~16 GiB direct. With PyTorch allocator fragmentation, the observed overhead is ~61 GiB.

Root Cause
Transformer layers execute sequentially — layer N completes before layer N+1 begins. The decode buffers are scratch space that is fully overwritten on each call. One shared set is sufficient.
Fix
- Removed `_init_turboquant_buffers` from `attention.py` — no more TQ-specific code in the generic attention layer.
- Moved centroids initialization into `_ensure_on_device` in the TQ backend — lazy init on first use, alongside the existing Hadamard/midpoints init.
- Switched `_decode_attention` to `WorkspaceManager.get_simultaneous()` — acquires shared scratch buffers from vLLM's existing workspace infrastructure, falling back to kernel-internal allocation if the workspace is unavailable.

Verification
Tested on NVIDIA H200 (143 GB):
Qwen3-32B BF16 + turboquant_k8v4 (single GPU):
Qwen3-32B BF16 + turboquant_k8v4 (PP=2, 2× H200):
Test Plan