
[TurboQuant] Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers#40706

Open
lesj0610 wants to merge 2 commits into vllm-project:main from lesj0610:lesj/tq-decode-workspace-dedup

Conversation

@lesj0610 (Contributor) commented on Apr 23, 2026

Before this PR, each TurboQuant attention layer kept three decode scratch buffers (_tq_mid_o_buf, _tq_output_buf, _tq_lse_buf) registered as persistent buffers via register_buffer. These tensors hold temporary decode scratch, not real model state, yet they stayed allocated for every layer, so memory that could otherwise go to the KV cache was wasted in proportion to the number of attention layers.
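
For context, a rough sketch of the pattern this removes (the buffer names come from the actual layer; the module skeleton, shapes, and dtypes are illustrative, not the real vLLM code):

```python
import torch

class TurboQuantAttnLayerBefore(torch.nn.Module):
    """Sketch of the pre-PR layout: every layer pins its own decode scratch."""

    def __init__(self, max_decode_tokens: int, num_heads: int, head_size: int):
        super().__init__()
        # register_buffer ties these tensors to the module for its whole
        # lifetime, even though they only ever hold throwaway decode scratch.
        self.register_buffer(
            "_tq_mid_o_buf",
            torch.empty(max_decode_tokens, num_heads, head_size),
        )
        self.register_buffer(
            "_tq_output_buf",
            torch.empty(max_decode_tokens, num_heads, head_size),
        )
        self.register_buffer(
            "_tq_lse_buf",
            torch.empty(max_decode_tokens, num_heads, dtype=torch.float32),
        )
```

With N attention layers this means 3×N persistent tensors; the PR replaces them with 3 shared ones.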

This PR removes those per-layer buffers. Each layer now calls reserve_turboquant_decode_workspace() at init, and all layers share three workspace tensors from WorkspaceManager at decode time.
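
To make the mechanism concrete, here is a minimal stand-in assuming a reserve/lock/acquire shape; everything except the reserve_turboquant_decode_workspace() name is hypothetical, and the real vLLM v1 WorkspaceManager API may differ:

```python
import math
import torch

class WorkspaceManagerSketch:
    """Hypothetical stand-in for the v1 WorkspaceManager."""

    def __init__(self) -> None:
        self._requests: dict[str, tuple[tuple[int, ...], torch.dtype]] = {}
        self._buffers: dict[str, torch.Tensor] = {}
        self._locked = False

    def reserve(self, key: str, shape: tuple[int, ...], dtype: torch.dtype) -> None:
        assert not self._locked, "all reservations must happen before lock()"
        prev = self._requests.get(key)
        # Keep the largest request per key: sizing is a max across layers,
        # not a sum, so heterogeneous layers do not multiply the cost.
        if prev is None or math.prod(shape) > math.prod(prev[0]):
            self._requests[key] = (shape, dtype)

    def lock(self) -> None:
        # Allocate once; after this point no reservation can grow memory.
        for key, (shape, dtype) in self._requests.items():
            self._buffers[key] = torch.empty(shape, dtype=dtype)
        self._locked = True

    def acquire(self, key: str) -> torch.Tensor:
        return self._buffers[key]

def reserve_turboquant_decode_workspace(ws, max_decode_tokens, num_heads, head_size):
    # Called once per layer at init; all three scratch tensors are shared.
    ws.reserve("tq_mid_o", (max_decode_tokens, num_heads, head_size), torch.float32)
    ws.reserve("tq_output", (max_decode_tokens, num_heads, head_size), torch.float32)
    ws.reserve("tq_lse", (max_decode_tokens, num_heads), torch.float32)
```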

I ran the duplicate check before opening:

gh pr list --repo vllm-project/vllm --state open --search "turboquant decode"

The closest result is #40655. That PR puts one shared buffer on the Attention class. This PR uses the existing v1 workspace lifecycle instead (reserve before warmup, lock, then acquire at runtime). Shared state does not go on the Attention class, so the pipeline parallelism concern raised in #40655 is addressed differently here.
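
Spelled out against the hypothetical sketch above, the lifecycle order reads:

```python
ws = WorkspaceManagerSketch()

# Init: each layer reserves; heterogeneous head counts only raise the max.
for num_heads, head_size in [(32, 128), (8, 128)]:   # illustrative layers
    reserve_turboquant_decode_workspace(ws, 4096, num_heads, head_size)

# ... warmup runs here, before sizes are frozen ...

ws.lock()                        # freeze sizes; no growth after this point
mid_o = ws.acquire("tq_mid_o")   # decode: every layer gets the same tensors
out = ws.acquire("tq_output")
lse = ws.acquire("tq_lse")
```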

If WorkspaceManager is not initialized, decode falls back to the previous lazy per-layer buffer reuse path.
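
Roughly, the selection logic looks like this (hypothetical helper continuing the sketch above; the real code path may be shaped differently):

```python
import torch

def get_decode_scratch(layer, ws, key: str, shape: tuple[int, ...]) -> torch.Tensor:
    if ws is not None:
        # Shared path: one tensor per key for all layers.
        return ws.acquire(key)
    # Fallback path: lazily allocate a per-layer buffer on first use,
    # then reuse it across decode steps (the pre-PR behavior).
    buf = getattr(layer, f"_{key}_buf", None)
    if buf is None or tuple(buf.shape) != shape:
        buf = torch.empty(shape, dtype=torch.float32)
        setattr(layer, f"_{key}_buf", buf)
    return buf
```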

KV cache memory (Qwen3-8B, TP=2, RTX 3090):

| preset | branch | KV cache memory | KV cache tokens |
|---|---|---|---|
| turboquant_k8v4 | origin/main | 12.0 GiB | 387,248 |
| turboquant_k8v4 | this PR | 14.02 GiB | 452,224 |
| turboquant_4bit_nc | origin/main | 12.0 GiB | 508,512 |
| turboquant_4bit_nc | this PR | 14.02 GiB | 593,824 |

For turboquant_4bit_nc, a short chat sanity check also returned the same answer ("서울", "Seoul") on both branches.

Tests:

.venv/bin/python -m pytest tests/quantization/test_turboquant.py -q \
  -k "init_turboquant_does_not_create_per_layer_decode_buffers or \
      workspace_reservation_uses_max_not_sum_for_heterogeneous_heads or \
      workspace_acquire_after_lock_no_growth or \
      decode_uses_layer_fallback_when_workspace_unavailable"

pre-commit run ruff-check --files \
  tests/quantization/test_turboquant.py \
  vllm/model_executor/layers/attention/attention.py \
  vllm/v1/attention/backends/turboquant_attn.py

Both passed.

@mergify bot added the v1 label on Apr 23, 2026
@lesj0610 changed the title from "Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers" to "[TurboQuant] Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers" on Apr 23, 2026
@gemini-code-assist bot left a comment

Code Review

This pull request migrates TurboQuant decode buffers from per-layer static allocations to a centralized management system using the WorkspaceManager. The changes include new utility functions for workspace reservation and retrieval, as well as comprehensive unit tests covering heterogeneous head sizes and fallback scenarios. Review feedback identifies a high-priority issue: removing the original per-layer buffer registration without a fallback mechanism for v0 environments (where the WorkspaceManager is not used) risks Out-Of-Memory (OOM) errors during memory profiling. It is recommended to return a status from the reservation function and maintain the original registration logic as a fallback.
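
A sketch of the suggested shape (hypothetical signatures; the reservation reports success so the caller can keep the original per-layer registration as the v0 fallback):

```python
def try_reserve_turboquant_decode_workspace(ws, max_decode_tokens,
                                            num_heads, head_size) -> bool:
    if ws is None:  # v0 path: no WorkspaceManager during memory profiling
        return False
    reserve_turboquant_decode_workspace(ws, max_decode_tokens,
                                        num_heads, head_size)
    return True

# At layer init, keep the old registration as the fallback:
# if not try_reserve_turboquant_decode_workspace(ws, 4096, num_heads, head_size):
#     register_per_layer_decode_buffers(layer)  # pre-PR register_buffer path
```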

Comment thread vllm/v1/attention/backends/turboquant_attn.py
Comment thread vllm/model_executor/layers/attention/attention.py Outdated
@claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c3c19bf58


Comment thread vllm/v1/attention/backends/turboquant_attn.py Outdated
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 force-pushed the lesj/tq-decode-workspace-dedup branch from d673da7 to 2de4f33 on May 4, 2026 00:27
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
