
[TurboQuant] Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers#40706

Open
lesj0610 wants to merge 2 commits into vllm-project:main from lesj0610:lesj/tq-decode-workspace-dedup

Conversation

@lesj0610 (Contributor) commented on Apr 23, 2026

Before this PR, each TurboQuant attention layer kept three decode scratch buffers (_tq_mid_o_buf, _tq_output_buf, _tq_lse_buf) registered as persistent buffers via register_buffer. These tensors hold temporary decode scratch, not real model state, yet they stayed allocated for every layer, so memory that could otherwise go to the KV cache was wasted in proportion to the number of attention layers.
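
For context, a rough sketch of the pattern this removes (the buffer names come from the actual layer; the module skeleton, shapes, and dtypes are illustrative, not the real vLLM code):

```python
import torch

class TurboQuantAttnLayerBefore(torch.nn.Module):
    """Sketch of the pre-PR layout: every layer pins its own decode scratch."""

    def __init__(self, max_decode_tokens: int, num_heads: int, head_size: int):
        super().__init__()
        # register_buffer ties these tensors to the module for its whole
        # lifetime, even though they only ever hold throwaway decode scratch.
        self.register_buffer(
            "_tq_mid_o_buf",
            torch.empty(max_decode_tokens, num_heads, head_size),
        )
        self.register_buffer(
            "_tq_output_buf",
            torch.empty(max_decode_tokens, num_heads, head_size),
        )
        self.register_buffer(
            "_tq_lse_buf",
            torch.empty(max_decode_tokens, num_heads, dtype=torch.float32),
        )
```

With N attention layers this means 3×N persistent tensors; the PR replaces them with 3 shared ones.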

This PR removes those per-layer buffers. Each layer now calls reserve_turboquant_decode_workspace() at init, and all layers share three workspace tensors from WorkspaceManager at decode time.
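
To make the mechanism concrete, here is a minimal stand-in assuming a reserve/lock/acquire shape; everything except the reserve_turboquant_decode_workspace() name is hypothetical, and the real vLLM v1 WorkspaceManager API may differ:

```python
import math
import torch

class WorkspaceManagerSketch:
    """Hypothetical stand-in for the v1 WorkspaceManager."""

    def __init__(self) -> None:
        self._requests: dict[str, tuple[tuple[int, ...], torch.dtype]] = {}
        self._buffers: dict[str, torch.Tensor] = {}
        self._locked = False

    def reserve(self, key: str, shape: tuple[int, ...], dtype: torch.dtype) -> None:
        assert not self._locked, "all reservations must happen before lock()"
        prev = self._requests.get(key)
        # Keep the largest request per key: sizing is a max across layers,
        # not a sum, so heterogeneous layers do not multiply the cost.
        if prev is None or math.prod(shape) > math.prod(prev[0]):
            self._requests[key] = (shape, dtype)

    def lock(self) -> None:
        # Allocate once; after this point no reservation can grow memory.
        for key, (shape, dtype) in self._requests.items():
            self._buffers[key] = torch.empty(shape, dtype=dtype)
        self._locked = True

    def acquire(self, key: str) -> torch.Tensor:
        return self._buffers[key]

def reserve_turboquant_decode_workspace(ws, max_decode_tokens, num_heads, head_size):
    # Called once per layer at init; all three scratch tensors are shared.
    ws.reserve("tq_mid_o", (max_decode_tokens, num_heads, head_size), torch.float32)
    ws.reserve("tq_output", (max_decode_tokens, num_heads, head_size), torch.float32)
    ws.reserve("tq_lse", (max_decode_tokens, num_heads), torch.float32)
```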

I ran the duplicate check before opening:

gh pr list --repo vllm-project/vllm --state open --search "turboquant decode"

The closest result is #40655. That PR puts one shared buffer on the Attention class. This PR uses the existing v1 workspace lifecycle instead (reserve before warmup, lock, then acquire at runtime). Shared state does not go on the Attention class, so the pipeline parallelism concern raised in #40655 is addressed differently here.
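
Spelled out against the hypothetical sketch above, the lifecycle order reads:

```python
ws = WorkspaceManagerSketch()

# Init: each layer reserves; heterogeneous head counts only raise the max.
for num_heads, head_size in [(32, 128), (8, 128)]:   # illustrative layers
    reserve_turboquant_decode_workspace(ws, 4096, num_heads, head_size)

# ... warmup runs here, before sizes are frozen ...

ws.lock()                        # freeze sizes; no growth after this point
mid_o = ws.acquire("tq_mid_o")   # decode: every layer gets the same tensors
out = ws.acquire("tq_output")
lse = ws.acquire("tq_lse")
```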

If WorkspaceManager is not initialized, decode falls back to the previous lazy per-layer buffer reuse path.
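
Roughly, the selection logic looks like this (hypothetical helper continuing the sketch above; the real code path may be shaped differently):

```python
import torch

def get_decode_scratch(layer, ws, key: str, shape: tuple[int, ...]) -> torch.Tensor:
    if ws is not None:
        # Shared path: one tensor per key for all layers.
        return ws.acquire(key)
    # Fallback path: lazily allocate a per-layer buffer on first use,
    # then reuse it across decode steps (the pre-PR behavior).
    buf = getattr(layer, f"_{key}_buf", None)
    if buf is None or tuple(buf.shape) != shape:
        buf = torch.empty(shape, dtype=torch.float32)
        setattr(layer, f"_{key}_buf", buf)
    return buf
```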

KV cache memory (Qwen3-8B, TP=2, RTX 3090):

| preset | branch | KV cache memory | KV cache tokens |
|---|---|---|---|
| turboquant_k8v4 | origin/main | 12.0 GiB | 387,248 |
| turboquant_k8v4 | this PR | 14.02 GiB | 452,224 |
| turboquant_4bit_nc | origin/main | 12.0 GiB | 508,512 |
| turboquant_4bit_nc | this PR | 14.02 GiB | 593,824 |

For turboquant_4bit_nc, a short chat sanity check also returned the same answer ("서울", "Seoul") on both branches.

Tests:

.venv/bin/python -m pytest tests/quantization/test_turboquant.py -q \
  -k "init_turboquant_does_not_create_per_layer_decode_buffers or \
      workspace_reservation_uses_max_not_sum_for_heterogeneous_heads or \
      workspace_acquire_after_lock_no_growth or \
      decode_uses_layer_fallback_when_workspace_unavailable"

pre-commit run ruff-check --files \
  tests/quantization/test_turboquant.py \
  vllm/model_executor/layers/attention/attention.py \
  vllm/v1/attention/backends/turboquant_attn.py

Both passed.

@mergify bot added the v1 label on Apr 23, 2026
@lesj0610 changed the title from "Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers" to "[TurboQuant] Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers" on Apr 23, 2026
@gemini-code-assist bot left a comment

Code Review

This pull request migrates TurboQuant decode buffers from per-layer static allocations to a centralized management system using the WorkspaceManager. The changes include new utility functions for workspace reservation and retrieval, as well as comprehensive unit tests covering heterogeneous head sizes and fallback scenarios. Review feedback identifies a high-priority issue: removing the original per-layer buffer registration without a fallback mechanism for v0 environments (where the WorkspaceManager is not used) risks Out-Of-Memory (OOM) errors during memory profiling. It is recommended to return a status from the reservation function and maintain the original registration logic as a fallback.
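
A sketch of the suggested shape (hypothetical signatures; the reservation reports success so the caller can keep the original per-layer registration as the v0 fallback):

```python
def try_reserve_turboquant_decode_workspace(ws, max_decode_tokens,
                                            num_heads, head_size) -> bool:
    if ws is None:  # v0 path: no WorkspaceManager during memory profiling
        return False
    reserve_turboquant_decode_workspace(ws, max_decode_tokens,
                                        num_heads, head_size)
    return True

# At layer init, keep the old registration as the fallback:
# if not try_reserve_turboquant_decode_workspace(ws, 4096, num_heads, head_size):
#     register_per_layer_decode_buffers(layer)  # pre-PR register_buffer path
```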

Comment thread vllm/v1/attention/backends/turboquant_attn.py
Comment thread vllm/model_executor/layers/attention/attention.py Outdated
@claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c3c19bf58


Comment thread vllm/v1/attention/backends/turboquant_attn.py Outdated
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 force-pushed the lesj/tq-decode-workspace-dedup branch from d673da7 to 2de4f33 on May 4, 2026 00:27
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
