[Attention][TurboQuant] Share decode buffers across layers to fix OOM#40655

Closed
bhoomit wants to merge 3 commits into vllm-project:main from bhoomit:turboquant-share-decode-buffers

Conversation

@bhoomit
Contributor

@bhoomit bhoomit commented Apr 22, 2026

[Attention][TurboQuant] Share decode buffers across layers to fix OOM

Summary

TurboQuant's per-layer decode buffer pre-allocation via register_buffer() consumes O(num_layers) GPU memory — ~16 GiB direct for Qwen3-32B, but ~61 GiB total with allocator fragmentation from 180 registered buffers. This patch removes the per-layer buffers entirely and uses vLLM's WorkspaceManager to share a single set of scratch buffers across all layers at decode time.

Inspired by #40706 (@lesj0610), which identified the same problem and proposed using WorkspaceManager. Per @LucasWilkinson's review suggestion, this PR additionally removes _init_turboquant_buffers from attention.py completely, moving centroids initialization into the TQ backend's lazy _ensure_on_device path and keeping all TQ-specific code self-contained.

Impact

Baseline OOMs → works: Qwen3-32B BF16 + TQ k8v4 on a single H200 (143 GB) at 0.90 utilization cannot start with per-layer buffers (123 GiB model load). With this patch: 61 GiB model load, 598K KV tokens, 52.7 tok/s.
No accuracy regression: Coherent generation verified on Qwen3-0.6B + TQ k8v4.
Zero TQ-specific code in attention.py: _init_turboquant_buffers removed entirely.

Problem

TurboQuant PR #38479 introduced per-layer pre-allocation of intermediate decode buffers in _init_turboquant_buffers():

# attention.py (before this patch)
# B = max_num_seqs, Hq = query heads, S = KV splits (tq_max_kv_splits_for_cuda_graph), D = head_dim
self.register_buffer("_tq_mid_o_buf",
    torch.empty(B, Hq, S, D + 1, dtype=torch.float32), persistent=False)
self.register_buffer("_tq_output_buf",
    torch.empty(B, Hq, D, dtype=torch.float32), persistent=False)
self.register_buffer("_tq_lse_buf",
    torch.empty(B, Hq, dtype=torch.float32), persistent=False)

For Qwen3-32B (64 layers, 64 query heads, head_dim=128) with default max_num_seqs=256 and tq_max_kv_splits_for_cuda_graph=32, this allocates ~278 MiB per layer × 60 TQ layers = ~16 GiB direct. With PyTorch allocator fragmentation, the observed overhead is ~61 GiB.
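
For reference, a back-of-the-envelope check of that per-layer figure (a sketch only: the shapes come from the numbers quoted above, and the kernel's exact split dimension is assumed to equal tq_max_kv_splits_for_cuda_graph):

# Rough per-layer size of the three decode buffers under the assumptions above.
B, Hq, S, D = 256, 64, 32, 128        # max_num_seqs, query heads, KV splits, head_dim
FP32 = 4                              # bytes per float32 element
mid_o  = B * Hq * S * (D + 1) * FP32  # _tq_mid_o_buf
output = B * Hq * D * FP32            # _tq_output_buf
lse    = B * Hq * FP32                # _tq_lse_buf
per_layer_mib = (mid_o + output + lse) / 2**20
print(f"~{per_layer_mib:.0f} MiB/layer, ~{per_layer_mib * 60 / 2**10:.1f} GiB over 60 TQ layers")
# Prints roughly 266 MiB/layer and ~15.6 GiB total, in the same ballpark as the
# ~278 MiB x 60 = ~16 GiB cited above (the kernel's real split dimension may be
# slightly larger, which accounts for the small difference).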

Root Cause

Transformer layers execute sequentially — layer N completes before layer N+1 begins. The decode buffers are scratch space fully overwritten each call. One shared set is sufficient.
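
In sketch form (illustrative only, not code from this PR), the reuse pattern that makes one shared set safe is simply:

import torch

scratch = torch.empty(4, 8)     # one shared decode scratch buffer
for layer_idx in range(60):     # TQ layers run strictly one after another
    scratch.normal_()           # stand-in for a layer overwriting its scratch
    _ = scratch.sum()           # stand-in for that layer consuming it before the next begins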

Fix

  1. Remove _init_turboquant_buffers from attention.py — no more TQ-specific code in the generic attention layer.
  2. Move centroids init to _ensure_on_device in the TQ backend — lazy init on first use, alongside the existing Hadamard/midpoints init.
  3. Use WorkspaceManager.get_simultaneous() in _decode_attention — acquires shared scratch buffers from vLLM's existing workspace infrastructure. Falls back to kernel-internal allocation if workspace is unavailable.
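
A minimal sketch of what the shared-scratch acquisition in _decode_attention might look like (the get_simultaneous() name comes from the review discussion below; its argument format, the surrounding helper, and the fallback structure here are assumptions, not the merged code):

import torch

def _get_decode_scratch(workspace, B, Hq, S, D, device):
    # Sketch only: `workspace` stands in for vLLM's current workspace manager;
    # the get_simultaneous() signature shown here is assumed.
    if workspace is not None:
        # One shared set of scratch tensors, reused by every TQ layer at decode time.
        mid_o_buf, output_buf, lse_buf = workspace.get_simultaneous(
            ((B, Hq, S, D + 1), torch.float32),  # partial outputs per KV split
            ((B, Hq, D), torch.float32),         # reduced attention output
            ((B, Hq), torch.float32),            # per-head log-sum-exp
        )
    else:
        # Workspace unavailable (e.g. outside the model runner): fall back to
        # kernel-internal allocation, as described in step 3 above.
        mid_o_buf = torch.empty(B, Hq, S, D + 1, dtype=torch.float32, device=device)
        output_buf = torch.empty(B, Hq, D, dtype=torch.float32, device=device)
        lse_buf = torch.empty(B, Hq, dtype=torch.float32, device=device)
    return mid_o_buf, output_buf, lse_buf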

Verification

Tested on NVIDIA H200 (143 GB):

Qwen3-32B BF16 + turboquant_k8v4 (single GPU):

Metric                Before (per-layer bufs)   After (WorkspaceManager)   Change
Model load memory     123.39 GiB                61.03 GiB                  -62.36 GiB
Available KV memory   OOM                       61.6 GiB
KV token capacity     OOM                       598,784
Throughput            OOM                       52.7 tok/s
CUDA graphs           OOM                       51 PIECEWISE + 51 FULL

Qwen3-32B BF16 + turboquant_k8v4 (PP=2, 2× H200):

Metric                 Result
Model load per rank    31.56 GiB
KV token capacity      1,766,688
Throughput             52.3 tok/s
Generation             Correct ("4", "Paris", coherent reasoning)

Test Plan

  • Unit tests: 6/6 passed (no TQ code in attention.py, workspace shapes, workspace reuse, memory savings math, fallback when workspace unavailable, lazy centroids init)
  • Qwen3-32B BF16 + TQ k8v4 on H200: baseline OOMs, patched runs at 52.7 tok/s with 598K KV tokens
  • Qwen3-32B BF16 + TQ k8v4 PP=2: 52.3 tok/s, 1.77M KV tokens, correct generation
  • Qwen3-0.6B BF16 + TQ k8v4 accuracy: coherent generation verified
  • CUDA graph capture succeeds (51 PIECEWISE + 51 FULL)


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Apr 22, 2026
@bhoomit bhoomit changed the title [Attention][TurboQuant] Share decode buffers across layers [Attention][TurboQuant] Share decode buffers across layers to fix OOM Apr 22, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes memory usage by sharing TurboQuant decode buffers across attention layers, reducing allocation from O(num_layers) to O(1). Feedback indicates a need for shape validation when reusing buffers to prevent crashes and notes that removing register_buffer may cause the memory profiler to underestimate GPU footprint, potentially leading to OOM. Additionally, using class-level attributes for sharing may cause conflicts between multiple model instances in the same process.


# Check if another layer already migrated the shared buffers
shared = getattr(Attention, "_tq_shared_mid_o_buf", None)
if shared is not None and shared.device == query.device:
Contributor


critical

The reuse logic for shared buffers is missing a shape validation check. If a model contains layers with different configurations (e.g., varying head counts), Attention._tq_shared_mid_o_buf might hold a tensor with dimensions incompatible with the current layer's requirements. Reusing it without verifying that shared.shape == mid_o_buf.shape will lead to memory corruption or runtime crashes during Triton kernel execution. Additionally, the migration logic should ensure that mid_o_buf, output_buf, and lse_buf are always migrated and updated as a consistent set to avoid mixed-device tensor errors.

Suggested change
- if shared is not None and shared.device == query.device:
+ if shared is not None and shared.device == query.device and shared.shape == mid_o_buf.shape:

Contributor Author


Addressed.

Comment on lines +433 to +436
# Share decode intermediate buffers across all TQ attention layers.
# Layers execute sequentially so one set of buffers is sufficient.
# This reduces memory from O(num_layers) to O(1) — saving ~11 GiB
# for a 64-layer model like Qwen2.5-32B.
Contributor


high

The removal of register_buffer for these intermediate buffers bypasses vLLM's memory profiler. Since these tensors are now lazily moved to the GPU during the first decode, the profiler (which runs earlier) will not account for this memory usage in its GPU footprint estimation. This can lead to Out-of-Memory (OOM) errors if the KV cache manager allocates all remaining GPU memory based on an underestimated model footprint. While sharing tensors across layers is necessary for memory efficiency, consider a mechanism to ensure these shared buffers are either moved to the device before profiling or explicitly accounted for in the memory budget.

Comment on lines +443 to +445
if not hasattr(
Attention, "_tq_shared_mid_o_buf"
) or Attention._tq_shared_mid_o_buf.shape != (B, Hq, S, D + 1):
Contributor


high

Using a single class-level attribute Attention._tq_shared_mid_o_buf to share buffers is risky when multiple model instances exist in the same process (e.g., during unit testing or multi-model serving). A second model instance with a different configuration will overwrite the class attribute, potentially causing the first model instance to use incorrectly shaped or stale buffers. It is recommended to store shared buffers in a more instance-aware context, such as a dictionary keyed by the model configuration or within the VllmConfig object, to ensure isolation between different model instances.
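
For illustration, the instance-aware registry being suggested might look roughly like this (the helper name and key layout are hypothetical and not part of this PR):

import torch

_TQ_SHARED_SCRATCH: dict[tuple, dict[str, torch.Tensor]] = {}

def get_tq_shared_scratch(B: int, Hq: int, S: int, D: int, device: torch.device):
    # Key on the buffer-defining configuration so two differently shaped model
    # instances in the same process never reuse each other's scratch tensors.
    key = (B, Hq, S, D, str(device))
    bufs = _TQ_SHARED_SCRATCH.get(key)
    if bufs is None:
        bufs = {
            "mid_o": torch.empty(B, Hq, S, D + 1, dtype=torch.float32, device=device),
            "output": torch.empty(B, Hq, D, dtype=torch.float32, device=device),
            "lse": torch.empty(B, Hq, dtype=torch.float32, device=device),
        }
        _TQ_SHARED_SCRATCH[key] = bufs
    return bufs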

@mgoin
Member

mgoin commented Apr 23, 2026

It seems reusing the buffer would break pipeline parallelism, so could you guard against that case or fail?

@bhoomit
Contributor Author

bhoomit commented Apr 23, 2026

It seems reusing the buffer would break pipeline parallelism, so could you guard against that case or fail?

@mgoin thanks for raising the concern.

Are you sure PP would break? Doesn't PP have its own separate processes (each rank is a separate worker process)?

Each process only has the layers assigned to its PP stage. So the class-level Attention._tq_shared_mid_o_buf is per-process, and layers within a single PP rank still execute sequentially.

@lesj0610
Contributor

@mgoin I opened #40706 with a different fix for the same memory issue.
It uses WorkspaceManager instead of a class-level shared buffer on Attention, so decode scratch stays in the existing runtime workspace lifecycle.
If workspace is not initialized, it falls back to the previous lazy per-layer buffer reuse path.

@bhoomit bhoomit force-pushed the turboquant-share-decode-buffers branch from 43fd4da to 6f8646a on April 23, 2026 13:51
@lesj0610
Contributor

I think the main concern is not only PP or process boundary.

The bigger issue is that these buffers are temporary decode scratch, not real model state. So keeping them as class-level shared state on Attention still feels wrong to me.

vLLM already has WorkspaceManager for this kind of runtime temporary memory. If we put TurboQuant scratch there, ownership and lifecycle become much clearer: reserve, lock, then acquire at runtime.

So my concern is more about resource ownership and long-term maintenance, not only whether PP happens to work today.

@bhoomit
Contributor Author

bhoomit commented Apr 23, 2026

@mgoin Updated with PP>1 testing and a unit test:

  • Verified Qwen3-32B + TQ k8v4 with PP=2 on H200s — correct generation, 52.3 tok/s, 1.77M KV tokens, CUDA graphs captured on both ranks.
  • Added test_pp_ranks_get_independent_buffers which spawns two processes with different head counts (simulating PP ranks with different layer configs) and confirms each gets correctly-shaped buffers without cross-process interference.

PP is safe here because each rank runs in its own worker process with its own Python interpreter — the class-level attribute is process-local by definition.


@lesj0610 thanks for the comment; I see your approach in #40706, which routes TQ buffers through WorkspaceManager. My approach keeps the fix self-contained within TQ's own files, which aligns with the original sig-quantization decision to keep TurboQuant isolated from existing vLLM infrastructure to avoid coupling and simplify review. It adds no new dependencies on internal APIs (WorkspaceManager, lock_workspace, etc.) and works in any execution context.

@lesj0610
Contributor

lesj0610 commented Apr 23, 2026

Self-contained is understandable, but my concern is still resource ownership.

These buffers are only temporary decode scratch, not real model state. So keeping them as class-level shared state on Attention still looks like a separate runtime memory path only for TQ.

WorkspaceManager already exists for this kind of temporary memory, so using that looks more natural to me than adding another lifecycle here.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Can we replace this with current_workspace_manager().get_simultaneous(...) (you can search the codebase for example usages) in TurboQuantAttentionImpl._decode_attention? I don't think there should be any TurboQuant-specific stuff in vllm/model_executor/layers/attention/attention.py (ideally we would remove _init_turboquant_buffers completely).

@bhoomit
Contributor Author

bhoomit commented Apr 24, 2026

@LucasWilkinson addressed your comment. The change also incorporates the suggestions from @lesj0610. Thank you :)

@bhoomit bhoomit requested a review from LucasWilkinson April 24, 2026 23:28
@bhoomit bhoomit force-pushed the turboquant-share-decode-buffers branch from f1801e7 to 29a1afd on April 25, 2026 16:37
@bhoomit bhoomit requested a review from LucasWilkinson April 25, 2026 16:40
bhoomit added 3 commits April 26, 2026 14:38
Replace per-layer register_buffer() allocation of TQ decode scratch
buffers (_tq_mid_o_buf, _tq_output_buf, _tq_lse_buf) with a single
class-level shared set on Attention. Transformer layers execute
sequentially so one set of buffers is sufficient.

For Qwen3-32B (64 layers, 64 heads, head_dim=128):
  Before: 123.39 GiB model load (OOM on single H200)
  After:  62.07 GiB model load, 588K KV tokens, 52.7 tok/s

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
…ove _init_turboquant_buffers

Move TQ decode scratch buffers from per-layer register_buffer in
attention.py to WorkspaceManager.get_simultaneous() in the TQ backend.
Move centroids init into _ensure_on_device for lazy initialization.

This removes all TQ-specific code from the generic Attention class.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
Existing lm_eval TQ tests and test_turboquant.py store/decode
roundtrip already cover TurboQuant correctness end-to-end.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
@bhoomit bhoomit force-pushed the turboquant-share-decode-buffers branch from 1d1185b to 831a841 on April 26, 2026 22:48
bhoomit added a commit to bhoomit/vllm that referenced this pull request Apr 26, 2026
Builds on top of the shared-decode-buffer PR (vllm-project#40655). Reduces
continuation-prefill memory by 60x (shared dequant buffers via
WorkspaceManager), eliminates redundant float16_copy kernel in
decode (stage2 OUTPUT_FP16), and removes unnecessary memory
operations (.contiguous(), .zero_(), fp32 rotation).

Changes:

1. Share dequant buffers across layers via WorkspaceManager
   (saves 57 GB at 1M context, required for CUDA Graph capture)
2. Remove .zero_() on buffers whose trailing positions are never read
3. fp16 Hadamard rotation (2x less bandwidth, uses fp16 tensor cores)
4. Pre-sized K/V buffer (eliminates .contiguous() + torch.cat)
5. Cache arange and cu_seqlens (speculative decode hot path)
6. OUTPUT_FP16 constexpr in stage2 kernel (eliminates float16_copy)
7. Fix: output_buf dtype mismatch in shared-buffer decode path

Memory savings (32B model, TP=4, 60 TQ layers):
  8K context:   472 MB saved
  128K context: 7.4 GB saved
  1M context:   57.6 GB saved

No accuracy impact. 117 existing TQ unit tests pass.

Signed-off-by: Vasani Bhoomit <bhoomit.2010@gmail.com>
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
@bhoomit
Contributor Author

bhoomit commented Apr 27, 2026

Code merged via follow-up PR #40941.

Going to close this and keep it as documentation of the improvements.

@bhoomit bhoomit closed this Apr 27, 2026

Labels

quantization, ready, v1
