[Attention][TurboQuant] Share decode buffers across layers to fix OOM #40655

bhoomit wants to merge 3 commits
Conversation
Code Review
This pull request optimizes memory usage by sharing TurboQuant decode buffers across attention layers, reducing allocation from O(num_layers) to O(1). Feedback indicates a need for shape validation when reusing buffers to prevent crashes and notes that removing register_buffer may cause the memory profiler to underestimate GPU footprint, potentially leading to OOM. Additionally, using class-level attributes for sharing may cause conflicts between multiple model instances in the same process.
```python
# Check if another layer already migrated the shared buffers
shared = getattr(Attention, "_tq_shared_mid_o_buf", None)
if shared is not None and shared.device == query.device:
```
The reuse logic for shared buffers is missing a shape validation check. If a model contains layers with different configurations (e.g., varying head counts), Attention._tq_shared_mid_o_buf might hold a tensor with dimensions incompatible with the current layer's requirements. Reusing it without verifying that shared.shape == mid_o_buf.shape will lead to memory corruption or runtime crashes during Triton kernel execution. Additionally, the migration logic should ensure that mid_o_buf, output_buf, and lse_buf are always migrated and updated as a consistent set to avoid mixed-device tensor errors.
```diff
-if shared is not None and shared.device == query.device:
+if shared is not None and shared.device == query.device and shared.shape == mid_o_buf.shape:
```
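A minimal sketch of validating and (re)building the three buffers as one unit, as the comment above suggests. The `_tq_shared_output_buf`/`_tq_shared_lse_buf` names, dtypes, and shapes are assumptions; `query`, `B`, `Hq`, `S`, and `D` are taken from the surrounding hunk:

```python
# Validate the full shared set against this layer's requirements; rebuild
# all three together so no layer ever sees a mixed-device or stale set.
shared_mid = getattr(Attention, "_tq_shared_mid_o_buf", None)
compatible = (
    shared_mid is not None
    and shared_mid.device == query.device
    and shared_mid.shape == (B, Hq, S, D + 1)
)
if not compatible:
    Attention._tq_shared_mid_o_buf = torch.empty(
        B, Hq, S, D + 1, dtype=torch.float32, device=query.device
    )
    Attention._tq_shared_output_buf = torch.empty(
        B, Hq, D, dtype=query.dtype, device=query.device
    )
    Attention._tq_shared_lse_buf = torch.empty(
        B, Hq, S, dtype=torch.float32, device=query.device
    )
mid_o_buf = Attention._tq_shared_mid_o_buf
output_buf = Attention._tq_shared_output_buf
lse_buf = Attention._tq_shared_lse_buf
```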
```python
# Share decode intermediate buffers across all TQ attention layers.
# Layers execute sequentially so one set of buffers is sufficient.
# This reduces memory from O(num_layers) to O(1) — saving ~11 GiB
# for a 64-layer model like Qwen2.5-32B.
```
The removal of register_buffer for these intermediate buffers bypasses vLLM's memory profiler. Since these tensors are now lazily moved to the GPU during the first decode, the profiler (which runs earlier) will not account for this memory usage in its GPU footprint estimation. This can lead to Out-of-Memory (OOM) errors if the KV cache manager allocates all remaining GPU memory based on an underestimated model footprint. While sharing tensors across layers is necessary for memory efficiency, consider a mechanism to ensure these shared buffers are either moved to the device before profiling or explicitly accounted for in the memory budget.
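One possible mitigation, as a sketch only (the function and its call site are hypothetical, not vLLM's actual profiling hook): allocate the shared scratch eagerly on the target device during model initialization, before the profiler runs, so the memory shows up in the measured footprint:

```python
import torch

def warm_up_tq_shared_buffers(
    device: torch.device, B: int, Hq: int, S: int, D: int
) -> None:
    # Materialize the shared scratch before vLLM profiles GPU usage, so the
    # KV-cache budget is computed against the true model footprint.
    if getattr(Attention, "_tq_shared_mid_o_buf", None) is None:
        Attention._tq_shared_mid_o_buf = torch.empty(
            B, Hq, S, D + 1, dtype=torch.float32, device=device
        )
```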
```python
if not hasattr(
    Attention, "_tq_shared_mid_o_buf"
) or Attention._tq_shared_mid_o_buf.shape != (B, Hq, S, D + 1):
```
Using a single class-level attribute Attention._tq_shared_mid_o_buf to share buffers is risky when multiple model instances exist in the same process (e.g., during unit testing or multi-model serving). A second model instance with a different configuration will overwrite the class attribute, potentially causing the first model instance to use incorrectly shaped or stale buffers. It is recommended to store shared buffers in a more instance-aware context, such as a dictionary keyed by the model configuration or within the VllmConfig object, to ensure isolation between different model instances.
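A sketch of one instance-aware alternative: a module-level registry keyed by device and shape (the names here are hypothetical), so two models with different configs in the same process get separate entries instead of clobbering one slot:

```python
import torch

# Registry keyed by (device, config shape); still O(1) per config rather
# than O(num_layers), but isolated between differently shaped models.
_TQ_SHARED_MID_O: dict[tuple, torch.Tensor] = {}

def get_shared_mid_o_buf(
    device: torch.device, B: int, Hq: int, S: int, D: int
) -> torch.Tensor:
    key = (device, B, Hq, S, D)
    buf = _TQ_SHARED_MID_O.get(key)
    if buf is None:
        buf = torch.empty(B, Hq, S, D + 1, dtype=torch.float32, device=device)
        _TQ_SHARED_MID_O[key] = buf
    return buf
```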
(force-pushed c08f3fb to 43fd4da)
It seems reusing the buffer would break pipeline parallelism, so could you guard against that case or fail?
@mgoin thanks for raising the concern. Are you sure PP would break? Doesn't PP have its own separate processes (each rank is a separate worker process)? Each process only has the layers assigned to its PP stage, so the class-level `Attention._tq_shared_mid_o_buf` is per-process, and layers within a single PP rank still execute sequentially.
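A toy illustration of the process-isolation argument (a standalone script, not vLLM code): each spawned worker, like a PP rank, sees its own copy of the class attribute:

```python
import multiprocessing as mp

class Attention:
    _tq_shared_mid_o_buf = None  # class-level, but still per-process

def worker(rank: int) -> None:
    Attention._tq_shared_mid_o_buf = f"buffer-for-rank-{rank}"
    print(f"rank {rank}: {Attention._tq_shared_mid_o_buf}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # mirrors one-interpreter-per-rank workers
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent never sees the workers' writes.
    print(f"parent: {Attention._tq_shared_mid_o_buf}")  # -> parent: None
```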
@mgoin I opened #40706 with a different fix for the same memory issue.
(force-pushed 43fd4da to 6f8646a)
I think the main concern is not only PP or the process boundary. The bigger issue is that these buffers are temporary decode scratch, not real model state. So keeping them as class-level shared state on `Attention` mixes scratch resources into the generic layer, and vLLM already has `WorkspaceManager` for exactly this kind of transient allocation. So my concern is more about resource ownership and long-term maintenance, not only whether PP happens to work today.
@mgoin Updated with PP>1 testing and a unit test.

PP is safe here because each rank runs in its own worker process with its own Python interpreter — the class-level attribute is process-local by definition. @lesj0610 thanks for the comment, I see your approach in #40706, which routes TQ buffers through `WorkspaceManager`.
Self-contained is understandable, but my concern is still resource ownership. These buffers are only temporary decode scratch, not real model state, so keeping them as class-level shared state on `Attention` still seems like the wrong owner for this resource.
Can we replace this with `current_workspace_manager().get_simultaneous(...)` (you can search the codebase for example usages) in `TurboQuantAttentionImpl._decode_attention`? I don't think there should be any TurboQuant-specific stuff in `vllm/model_executor/layers/attention/attention.py` (ideally we would remove `_init_turboquant_buffers` completely).
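A directional sketch of that suggestion. The `get_simultaneous()` signature and the argument convention below are assumptions, not vLLM's verified API; search the codebase for real call sites as noted above:

```python
# Inside TurboQuantAttentionImpl._decode_attention (sketch):
mgr = current_workspace_manager()
mid_o_buf, output_buf, lse_buf = mgr.get_simultaneous(
    ((B, Hq, S, D + 1), torch.float32),
    ((B, Hq, D), query.dtype),
    ((B, Hq, S), torch.float32),
)
# Workspace-backed scratch is sized once, shared across layers, and owned by
# vLLM's workspace infrastructure rather than the generic Attention class.
```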
(force-pushed c0a8fdb to 8a066fe)
@LucasWilkinson addressed your comment. The change also incorporates the suggestions from @lesj0610. Thank you :)
(force-pushed f1801e7 to 29a1afd)
Replace per-layer `register_buffer()` allocation of TQ decode scratch buffers (`_tq_mid_o_buf`, `_tq_output_buf`, `_tq_lse_buf`) with a single class-level shared set on `Attention`. Transformer layers execute sequentially, so one set of buffers is sufficient.

For Qwen3-32B (64 layers, 64 heads, head_dim=128):
- Before: 123.39 GiB model load (OOM on a single H200)
- After: 62.07 GiB model load, 588K KV tokens, 52.7 tok/s

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
…ove _init_turboquant_buffers

Move TQ decode scratch buffers from per-layer `register_buffer` in `attention.py` to `WorkspaceManager.get_simultaneous()` in the TQ backend. Move centroids init into `_ensure_on_device` for lazy initialization. This removes all TQ-specific code from the generic `Attention` class.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
Existing lm_eval TQ tests and the `test_turboquant.py` store/decode roundtrip already cover TurboQuant correctness end-to-end.

Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
(force-pushed 1d1185b to 831a841)
Builds on top of the shared-decode-buffer PR (vllm-project#40655). Reduces continuation-prefill memory by 60x (shared dequant buffers via WorkspaceManager), eliminates the redundant float16_copy kernel in decode (stage2 OUTPUT_FP16), and removes unnecessary memory operations (`.contiguous()`, `.zero_()`, fp32 rotation).

Changes:
1. Share dequant buffers across layers via WorkspaceManager (saves 57 GB at 1M context, required for CUDA Graph capture)
2. Remove `.zero_()` on buffers whose trailing positions are never read
3. fp16 Hadamard rotation (2x less bandwidth, uses fp16 tensor cores)
4. Pre-sized K/V buffer (eliminates `.contiguous()` + `torch.cat`); see the sketch after this list
5. Cache `arange` and `cu_seqlens` (speculative decode hot path)
6. `OUTPUT_FP16` constexpr in stage2 kernel (eliminates `float16_copy`)
7. Fix: `output_buf` dtype mismatch in shared-buffer decode path

Memory savings (32B model, TP=4, 60 TQ layers):
- 8K context: 472 MB saved
- 128K context: 7.4 GB saved
- 1M context: 57.6 GB saved

No accuracy impact. 117 existing TQ unit tests pass.

Signed-off-by: Vasani Bhoomit <bhoomit.2010@gmail.com>
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
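A toy illustration of change 4 (pre-sized K/V buffer), independent of the actual TQ kernels; the names and shapes are hypothetical. Writing chunks into a pre-allocated buffer avoids the fresh allocation and copy that `torch.cat(...)` plus `.contiguous()` incur on every call:

```python
import torch

def pack_kv_cat(chunks: list[torch.Tensor]) -> torch.Tensor:
    # Before: allocates a new tensor and copies every chunk, every call.
    return torch.cat(chunks, dim=0).contiguous()

def pack_kv_presized(chunks: list[torch.Tensor], buf: torch.Tensor) -> torch.Tensor:
    # After: copy each chunk into a persistent pre-sized buffer; no cat,
    # no extra allocation, and the result is contiguous by construction.
    off = 0
    for c in chunks:
        n = c.shape[0]
        buf[off:off + n].copy_(c)
        off += n
    return buf[:off]
```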
Code merged with follow-up PR #40941. Going to close this and use it as documentation of the improvements.
[Attention][TurboQuant] Share decode buffers across layers to fix OOM

Summary

TurboQuant's per-layer decode buffer pre-allocation via `register_buffer()` consumes O(num_layers) GPU memory — ~16 GiB direct for Qwen3-32B, but ~61 GiB total with allocator fragmentation from 180 registered buffers. This patch removes the per-layer buffers entirely and uses vLLM's `WorkspaceManager` to share a single set of scratch buffers across all layers at decode time.

Inspired by #40706 (@lesj0610), which identified the same problem and proposed using `WorkspaceManager`. Per @LucasWilkinson's review suggestion, this PR additionally removes `_init_turboquant_buffers` from `attention.py` completely, moving centroids initialization into the TQ backend's lazy `_ensure_on_device` path — keeping all TQ-specific code self-contained.

Impact
Problem
TurboQuant PR #38479 introduced per-layer pre-allocation of intermediate decode buffers in `_init_turboquant_buffers()`:
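A minimal sketch of the per-layer allocation pattern, reconstructed from the buffer names in the commit messages and the `(B, Hq, S, D + 1)` shape visible in the review hunks; the dtypes and the exact `_tq_output_buf`/`_tq_lse_buf` shapes are assumptions:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def _init_turboquant_buffers(self, B: int, Hq: int, S: int, D: int):
        # One scratch set registered PER LAYER: B=max_num_seqs, Hq=query
        # heads, S=tq_max_kv_splits_for_cuda_graph, D=head_dim.
        self.register_buffer(
            "_tq_mid_o_buf",
            torch.empty(B, Hq, S, D + 1, dtype=torch.float32),
            persistent=False,
        )
        self.register_buffer(
            "_tq_output_buf",
            torch.empty(B, Hq, D, dtype=torch.float16),
            persistent=False,
        )
        self.register_buffer(
            "_tq_lse_buf",
            torch.empty(B, Hq, S, dtype=torch.float32),
            persistent=False,
        )

# With B=256, Hq=64, S=32, D=128, the fp32 mid_o buffer alone is
# 256 * 64 * 32 * 129 * 4 bytes ≈ 258 MiB, i.e. the bulk of the quoted
# ~278 MiB/layer; times 60 TQ layers that is ~16 GiB before fragmentation.
```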
For Qwen3-32B (64 layers, 64 query heads, head_dim=128) with default `max_num_seqs=256` and `tq_max_kv_splits_for_cuda_graph=32`, this allocates ~278 MiB per layer × 60 TQ layers = ~16 GiB direct. With PyTorch allocator fragmentation, the observed overhead is ~61 GiB.

Root Cause
Transformer layers execute sequentially — layer N completes before layer N+1 begins. The decode buffers are scratch space that is fully overwritten on each call. One shared set is sufficient.
Fix
- Removed `_init_turboquant_buffers` from `attention.py` — no more TQ-specific code in the generic attention layer.
- Moved centroids initialization into `_ensure_on_device` in the TQ backend — lazy init on first use, alongside the existing Hadamard/midpoints init.
- Switched `_decode_attention` to `WorkspaceManager.get_simultaneous()` — acquires shared scratch buffers from vLLM's existing workspace infrastructure, falling back to kernel-internal allocation if the workspace is unavailable.

Verification
Tested on NVIDIA H200 (143 GB):
Qwen3-32B BF16 + turboquant_k8v4 (single GPU):
Qwen3-32B BF16 + turboquant_k8v4 (PP=2, 2× H200):
Test Plan