[TurboQuant] Share decode scratch workspace across layers #40798

Open
Bot1822 wants to merge 3 commits into vllm-project:main from Bot1822:tq-workspace-manager-main

Conversation

@Bot1822 Bot1822 commented Apr 24, 2026

Summary

TurboQuant currently registers three decode scratch buffers on every attention layer. These buffers are temporary decode workspace, but because they are registered per layer they scale with the number of TurboQuant layers and with max_num_seqs.

This PR moves TurboQuant decode scratch allocation to the v1 workspace manager so the scratch tensors are shared across layers. It also reserves the maximum TurboQuant decode workspace before CUDA graph capture locks the workspace, preventing locked-workspace growth at runtime.

This PR is scoped to TurboQuant decode scratch memory usage. It does not claim to fix TurboQuant speculative-decoding correctness issues.

Motivation

On large models with H100/H200 server defaults, the per-layer scratch buffers can consume tens of GiB of non-KV memory and significantly reduce KV cache capacity.

Measured on H200 with Llama-3.1-70B, TP=2, kv_cache_dtype=turboquant_3bit_nc, max_model_len=65536, gpu_memory_utilization=0.90:

| version | model loading memory | available KV memory | GPU KV cache size | max concurrency @ 65,536 |
|---------|----------------------|---------------------|-------------------|--------------------------|
| before  | 105.23 GiB           | 14.61 GiB           | 400,128 tokens    | 6.11x                    |
| after   | 65.74 GiB            | 53.97 GiB           | 1,478,384 tokens  | 22.56x                   |

The old per-layer buffer cost for this setup is about 39.5 GiB, dominated by mid_o:

B = 1024
Hq = 32
S = 32
D = 128
TQ layers = 76
mid_o ~= B * Hq * S * (D + 1) * 4 bytes * 76 ~= 38.3 GiB (~39.5 GiB with the output and lse buffers included)
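
As a sanity check, the total can be reproduced by summing all three scratch buffers. The shapes follow the reservation snippet quoted in the review below; the script itself is only illustrative:

```python
# Per-layer TurboQuant decode scratch, all float32, replicated over 76 layers:
#   mid_o:  (B, Hq, S, D + 1)
#   output: (B, Hq, D)
#   lse:    (B, Hq)
B, Hq, S, D, layers = 1024, 32, 32, 128, 76

mid_o = B * Hq * S * (D + 1) * 4   # ~0.50 GiB per layer
output = B * Hq * D * 4            # 16 MiB per layer
lse = B * Hq * 4                   # 128 KiB per layer

total = (mid_o + output + lse) * layers
print(f"{total / 2**30:.2f} GiB")  # -> 39.49 GiB
```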

Changes

  • Keep TurboQuant centroids as per-layer state.
  • Stop registering _tq_mid_o_buf, _tq_output_buf, and _tq_lse_buf on every attention layer.
  • Allocate TurboQuant decode scratch through WorkspaceManager.get_simultaneous() (see the sketch after this list).
  • Reserve a max-size TurboQuant decode workspace before lock_workspace() in CUDA graph capture.
  • Reserve across all attention groups so hybrid model layouts do not miss TurboQuant groups outside the first group.
  • Add behavior tests for the workspace allocation and reservation paths.
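
A minimal sketch of the shared decode-path allocation, assuming the (shape, dtype)-tuple form of get_simultaneous() shown in the review snippet below; the import path and helper name are illustrative, not the exact code in this PR:

```python
import torch

# Illustrative import; the actual module path for the v1 workspace manager
# may differ.
from vllm.v1.worker.workspace import current_workspace_manager


def get_decode_scratch(num_reqs: int, num_heads: int,
                       max_splits: int, head_size: int):
    # Every TurboQuant layer asks for the same three shapes, so the manager
    # can back them all with one shared region instead of each layer holding
    # persistent per-layer copies.
    mid_o, output, lse = current_workspace_manager().get_simultaneous(
        ((num_reqs, num_heads, max_splits, head_size + 1), torch.float32),
        ((num_reqs, num_heads, head_size), torch.float32),
        ((num_reqs, num_heads), torch.float32),
    )
    return mid_o, output, lse
```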

Duplicate-work note

I checked the open TurboQuant scratch/workspace work before publishing this PR. This PR is intentionally scoped to the v1 workspace-manager reservation path and the measured H200 Llama-3.1-70B TP=2 memory-capacity regression. It is not a general cleanup and does not claim to solve unrelated TurboQuant speculative-decoding correctness issues.

Testing

  • python -m py_compile tests/quantization/test_turboquant.py vllm/v1/worker/gpu_model_runner.py vllm/model_executor/layers/attention/attention.py vllm/v1/attention/ops/triton_turboquant_decode.py vllm/v1/attention/backends/turboquant_attn.py
  • git diff --check
  • H200 container: python3 -m pytest tests/quantization/test_turboquant.py -k TurboQuantDecodeWorkspace -q
    • Result: 4 passed, 117 deselected, 17 warnings in 2.03s
  • H200 startup/capacity validation:
    • model: /mnt/afs/models/Llama/Llama-3.1-70B
    • GPUs: H200 2x, TP=2
    • --kv-cache-dtype turboquant_3bit_nc
    • --max-model-len 65536
    • --gpu-memory-utilization 0.90
    • result: model loading memory reduced from 105.23 GiB to 65.74 GiB; GPU KV cache size increased from 400,128 to 1,478,384 tokens
  • H200 serving benchmark, /v1/completions, random dataset, 32 prompts, input length 4096, output length 128, request_rate=inf, max_concurrency=8, ignore_eos=true:
    • before: 0.3186 req/s, 40.78 output tok/s, mean TTFT 16809 ms, mean TPOT 65.16 ms
    • after: 0.3343 req/s, 42.80 output tok/s, mean TTFT 15629 ms, mean TPOT 65.14 ms
  • Sampled MMLU-Pro regression, leaderboard_mmlu_pro, limit=20, 5-shot, local-completions, num_concurrent=4:
    • before: acc=0.60 ± 0.1124
    • after: acc=0.60 ± 0.1124
    • same predicted option on 20/20 samples and same correctness on 20/20 samples

AI Assistance

AI assistance was used to prepare this patch and PR text. Guipeng Zhang is the human submitter responsible for reviewing and defending the change.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@Bot1822 Bot1822 force-pushed the tq-workspace-manager-main branch from 14e925a to ead5c65 on April 24, 2026 at 10:31
@mergify mergify Bot added the v1 label Apr 24, 2026
@Bot1822 Bot1822 (Author) commented Apr 24, 2026

Related: #40706 is also addressing TurboQuant decode scratch deduplication. This PR focuses on the H200/Llama-3.1-70B TP=2 case and explicitly reserves the max-size decode workspace before the v1 workspace is locked after CUDA graph capture.

@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment
Code Review

This pull request refactors TurboQuant decode scratch space management by transitioning from per-layer registered buffers to a shared workspace manager. This change prevents excessive memory usage in models with many layers by sharing scratch buffers across layers. Feedback was provided regarding the workspace reservation logic in gpu_model_runner.py, which currently only checks the first attention group and may fail to reserve memory for hybrid models where TurboQuant layers are present in subsequent groups.

Comment thread on vllm/v1/worker/gpu_model_runner.py, lines +6126 to +6144 (outdated):
```python
for group in self.attn_groups[0]:
    if group.backend.get_name() != "TURBOQUANT":
        continue

    max_num_reqs = self.scheduler_config.max_num_seqs
    num_heads = self.model_config.get_num_attention_heads(self.parallel_config)
    head_size = self.model_config.get_head_size()
    max_num_splits = (
        self.vllm_config.attention_config.tq_max_kv_splits_for_cuda_graph
    )
    current_workspace_manager().get_simultaneous(
        (
            (max_num_reqs, num_heads, max_num_splits, head_size + 1),
            torch.float32,
        ),
        ((max_num_reqs, num_heads, head_size), torch.float32),
        ((max_num_reqs, num_heads), torch.float32),
    )
    return
```
Severity: high

The current implementation only iterates over the first KV cache group (self.attn_groups[0]). In hybrid models (e.g., Mamba + Attention), TurboQuant layers might be assigned to a subsequent KV cache group. If the TurboQuant group is not in the first list, the workspace will not be reserved at its maximum size before the CUDA graph lock, potentially leading to runtime AssertionError crashes when the batch size increases. Iterating over all groups in self.attn_groups ensures the workspace is correctly reserved for any model configuration.

Suggested change (the removed half of the diff is identical to the block above):

```python
for groups in self.attn_groups:
    for group in groups:
        if group.backend.get_name() != "TURBOQUANT":
            continue

        max_num_reqs = self.max_num_reqs
        num_heads = self.model_config.get_num_attention_heads(self.parallel_config)
        head_size = self.model_config.get_head_size()
        max_num_splits = (
            self.vllm_config.attention_config.tq_max_kv_splits_for_cuda_graph
        )
        current_workspace_manager().get_simultaneous(
            (
                (max_num_reqs, num_heads, max_num_splits, head_size + 1),
                torch.float32,
            ),
            ((max_num_reqs, num_heads, head_size), torch.float32),
            ((max_num_reqs, num_heads), torch.float32),
        )
        return
```

@Bot1822 Bot1822 (Author) commented Apr 24, 2026

Addressed the AI review comment about hybrid models: _reserve_turboquant_decode_workspace() now iterates over all attention groups instead of only self.attn_groups[0], so TurboQuant groups outside the first group also reserve max decode workspace before lock_workspace().

I kept self.scheduler_config.max_num_seqs rather than switching to self.max_num_reqs because this reservation is intentionally tied to the configured scheduler maximum that caused the original per-layer allocation issue. I also added a regression check to prevent indexing only the first attention group again.

@Bot1822 Bot1822 marked this pull request as ready for review April 24, 2026 10:41
@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Bot1822 Bot1822 marked this pull request as draft April 24, 2026 10:51
@Bot1822 Bot1822 marked this pull request as ready for review April 24, 2026 11:11
@Bot1822 Bot1822 (Author) commented Apr 24, 2026

Additional H200 validation on Llama-3.1-70B, TP=2, kv_cache_dtype=turboquant_3bit_nc, max_model_len=65536, gpu_memory_utilization=0.90.

Startup / capacity:

| version | model loading memory | available KV memory | GPU KV cache size | max concurrency @ 65,536 |
|---------|----------------------|---------------------|-------------------|--------------------------|
| before  | 105.23 GiB           | 14.61 GiB           | 400,128 tokens    | 6.11x                    |
| after   | 65.74 GiB            | 53.97 GiB           | 1,478,384 tokens  | 22.56x                   |

Serving benchmark, /v1/completions, random dataset, 32 prompts, input length 4096, output length 128, request_rate=inf, max_concurrency=8, ignore_eos=true:

| version | req/s  | output tok/s | mean TTFT | median TTFT | p99 TTFT | mean TPOT | median TPOT | p99 TPOT |
|---------|--------|--------------|-----------|-------------|----------|-----------|-------------|----------|
| before  | 0.3186 | 40.78        | 16809 ms  | 2412 ms     | 64133 ms | 65.16 ms  | 62.87 ms    | 90.35 ms |
| after   | 0.3343 | 42.80        | 15629 ms  | 2415 ms     | 59396 ms | 65.14 ms  | 62.83 ms    | 90.20 ms |

Sampled MMLU-Pro regression, leaderboard_mmlu_pro, limit=20, 5-shot, local-completions, num_concurrent=4:

| version | acc  | stderr |
|---------|------|--------|
| before  | 0.60 | 0.1124 |
| after   | 0.60 | 0.1124 |

Sample-level comparison: same predicted option on 20/20 samples, same correctness on 20/20 samples.

Bot1822 added 2 commits April 24, 2026 20:00
Signed-off-by: Guipeng Zhang <zhangguipeng23z@ict.ac.cn>
Signed-off-by: Guipeng Zhang <zhangguipeng23z@ict.ac.cn>
@noonghunna

Heads-up — this PR may be the upstream fix for #40831 (TurboQuant × spec-decode degenerate token loops), incidentally rather than by design.

We isolated #40831 through a six-probe ladder (summary in #40831) to CUDA graph capture/replay specifically:

  • --enforce-eager (cudagraph + torch.compile both off): bug fixed (23 TPS)
  • --compilation-config '{"cudagraph_mode":"NONE"}' (cudagraph off, torch.compile on): bug fixed (33 TPS)
  • Default (cudagraph + torch.compile both on): degenerate token loops on tools, structured outputs, long-recall

Triton kernels and torch.compile inductor output are correct when invoked dynamically. Only the captured graph is wrong, which strongly points at buffer-pointer instability between capture and replay.

@Sandermage flagged this PR as the likely fix because it moves _tq_mid_o_buf / _tq_output_buf / _tq_lse_buf from per-layer register_buffer(..., persistent=False) into WorkspaceManager.get_simultaneous(). The captured cudagraph would then reference a stable data_ptr — a persistent base buffer that runtime spec-decode shapes can slice into without the underlying address changing.

The current pre-allocation in attention.py::_init_turboquant_buffers sizes these buffers at B = max_num_seqs (typically 1) but spec-decode verify routes through _prefill_attention with B = q_len (4 for MTP n=3). Under the existing reuse logic, that reallocates the buffer and replaces buf_holder._tq_mid_o_buf, but the captured cudagraph from warmup still references the original data_ptr. Stale pointer → corrupted reads at replay.
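
A standalone illustration of that hazard (toy code, not vLLM's): a captured CUDA graph bakes in the data_ptr of the tensor that existed at capture time, so rebinding the name to a bigger buffer later leaves replays writing to the stale allocation:

```python
import torch

buf = torch.zeros(1, 128, device="cuda")   # sized for B = max_num_seqs = 1
src = torch.ones(1, 128, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    buf.copy_(src * 2)                     # graph records buf.data_ptr()

old = buf                                  # keep the old storage alive
buf = torch.zeros(4, 128, device="cuda")   # "reallocation" to B = 4

g.replay()
print(old.sum().item(), buf.sum().item())  # 256.0 0.0 -- replay hit the stale buffer
```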

If this PR is the right fix, we expect:

  • The cudagraph + torch.compile + spec-decode + TurboQuant configuration to produce coherent output (no <parameter=parameter=> corruption on tool calls, no for for / age age token duplication on code/XML, no first-token loops on long-context recall).
  • TPS to recover from our workaround's 33 TPS back to the original ~85 TPS.

Worth a focused validation pass on this PR before merging:

  1. Run with spec-decode enabled (--speculative-config '{"method":"mtp","num_speculative_tokens":3}' or ngram).
  2. Run a structured-output probe (tool calls, JSON, XML, code) in addition to the throughput benchmarks.
  3. Check needle-in-haystack at multiple depths (10K, 30K, 60K, 90K) — token duplication on retrieval is the most sensitive signal.

Independent reproductions of #40831 came in from at least four rigs (1× 3090, 1× 4090, 2× A5000, RTX 5090) across different models and TurboQuant presets, so the bug is robust enough to confirm against this PR.

@noonghunna

Update — backported this PR onto our pinned vLLM nightly digest and tested against the originally-failing #40831 config (Qwen3.6 + turboquant_3bit_nc + MTP n=3 + cudagraph ON). The bug persists with all of #40798's structural changes applied.

Full details + the disk-edit patch script we used + per-test results: #40831 follow-up.

Symptoms unchanged:

  • Tool calls: still produce <tool_call> inline-cascade with empty tool_calls[]
  • Long-context recall: still produces first-token loops (amber amber amber...)
  • Streaming: still shows occasional token duplication (above above above above)
  • TPS at ~96 confirms cudagraph + torch.compile genuinely active — not silently falling back to eager

This means either (a) PR #40798 is necessary but not sufficient on its own, (b) there's a companion change in main we'd need to also backport, or (c) our backport has a subtle anchor mismatch that maintainers with branch-local CI could rule in/out.

Sorry for the earlier signal-boosting — wanted to update the thread now that we have data either way. The PR's core claim (memory savings from sharing workspace across layers) is still valid; we're not commenting on that. Just the side-effect "this might fix #40831" hypothesis didn't pan out on our test.

@Bot1822 Bot1822 (Author) commented Apr 25, 2026

Confirmed based on the backport result in #40831. This PR is scoped to reducing TurboQuant decode scratch memory by sharing temporary decode workspace across layers and reserving it before the workspace is locked. I will not claim it fixes #40831. The TurboQuant + speculative decoding + CUDA graph correctness issue appears separate and should be investigated independently.

@MidasMining

This PR's stated goal — "Reserve a max-size TurboQuant decode workspace before lock_workspace()... preventing locked-workspace growth at runtime" — directly addresses Issue #41565 that I filed today. The assertion message in #41565 is Workspace is locked but allocation from 'turboquant_attn.py:757:_continuation_prefill' requires 2.00 MB, current size is 0.26 MB, which is exactly the failure mode this PR prevents.
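
For intuition, a toy model of the lock-then-assert mechanism (the real WorkspaceManager internals are not shown in this thread, so treat this as an assumed minimal shape of the behavior):

```python
class ToyWorkspace:
    """Assumed minimal model of the v1 workspace lock behavior."""

    def __init__(self) -> None:
        self.size = 0          # bytes currently backing the workspace
        self.locked = False    # set after CUDA graph capture

    def lock_workspace(self) -> None:
        self.locked = True

    def get(self, nbytes: int, caller: str) -> bytearray:
        if nbytes > self.size:
            # Growth is only legal before the lock; afterwards a larger
            # request trips an assertion like the one quoted in #41565.
            assert not self.locked, (
                f"Workspace is locked but allocation from '{caller}' requires "
                f"{nbytes / 2**20:.2f} MB, current size is {self.size / 2**20:.2f} MB"
            )
            self.size = nbytes
        return bytearray(nbytes)  # stand-in for a tensor view

# This PR's approach, in toy form: reserve the maximum before locking.
ws = ToyWorkspace()
ws.get(2 * 2**20, "reserve_turboquant_decode_workspace")           # pre-reserve 2 MB
ws.lock_workspace()
ws.get(2 * 2**20, "turboquant_attn.py:757:_continuation_prefill")  # fits: no assert
```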

Repro is small: any TQ-enabled prompt past ~6-8K cached tokens crashes the engine on stock v0.20.0. The same workloads run cleanly on the pre-#40941 fork. Details + threshold sweep in #41565.

Happy to validate this PR on 8× RTX A4000 SM86 / Nemotron-3-Super-120B-AWQ-4bit / vLLM 0.20.0 — your H200 / Llama-3.1-70B numbers are great but the failure also reproduces on Ampere with hybrid Mamba+MoE+Attention models, and that's where I have a deterministic crash to test against. If you'd like Ampere validation before merge, give me a thumbs-up here and I'll run a sweep at 8K / 16K / 32K / 64K cached KV and post the results.

Also potentially related: Issue #40420 (jhsmith409, Apr 21) — different symptom (CUDA OOM at 185K tokens) but same _continuation_prefill function and Sandermage's lazy-buffer-allocation analysis. May share root cause; worth checking whether your reservation also covers that path.

— MidasMining, 8× RTX A4000 SM86 / vLLM 0.20.0

@gaby gaby commented May 6, 2026

@Bot1822 Fix the merge conflict

Signed-off-by: Guipeng Zhang <zhangguipeng23z@ict.ac.cn>
@Bot1822 Bot1822 (Author) commented May 6, 2026

@MidasMining @gaby The merge conflict has been resolved and the PR branch has been updated.

The updated version keeps the WorkspaceManager-based path from main and preserves this PR's fix: TurboQuant workspace is reserved before lock_workspace(), including the _continuation_prefill path mentioned in #41565.

jsboige added a commit to jsboige/vllm that referenced this pull request May 6, 2026
Same-day cutover sequence 2026-05-06, both candidates rejected:
- 27B Dense+TQ K8V4: tripped all 3 perf rollback thresholds (decode -50%,
  5-concurrent -49%, tool +40%); GSM8K/IFEval gains within sampling noise.
- MoE 35B-A3B+TQ K8V4: booted with 1.49M-token KV (+4.6x vs FP8 322K) but
  EngineCore crashes on first chunked-prefill continuation (workspace
  16.31->29.73 MB, turboquant_attn.py:720). Upstream issue vllm#41726
  already filed by jhsmith409, candidate fix PR vllm-project#40798 open. Our repro
  posted as issue comment confirms persistence post-vllm-project#39931 on hybrid MoE.

Restored production: vllm-qwen36-shmpatched:nightly-f6983f01d-patched1 +
--kv-cache-dtype fp8 (Apr 06 baseline, stable since 2026-04-19). All
smoke tests pass (chat, thinking, tool calling).

Files:
- CLAUDE.md: TQ migration section rewritten as REJECTED, current state
  reverted to MoE+FP8, deployment table updated, 2 entries added to
  rejected models list.
- profiles/medium-qwen36-moe.yml: image + kv-cache-dtype reverted with
  inline rationale.
- Dockerfile.qwen36-27b-tq -> Dockerfile.qwen36-tq (renamed generic, used
  for both 27B and MoE TQ attempts; image vllm-qwen36-tq:nightly-e47c98ef-
  patched1 retained for re-test once vllm-project#40798 merges).
- profiles/medium-qwen36-27b.yml -> archives/2026/medium-qwen36-27b.yml.
  rejected-2026-05-06.
- qwen3_benchmark/lmms_results/qwen3.6-27b/: GSM8K + IFEval results
  preserved as evidence for the rejection rationale.

Upstream tracking:
- Issue: vllm-project#41726
- PR (fix): vllm-project#40798
- Our comment: vllm-project#41726 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gaby gaby commented May 7, 2026

@mgoin @JartX @njhill @sfeng33 Ping, TurboQuant does not work at all without this.
