[TurboQuant] Reduce TurboQuant KV memory loss by deduplicating decode scratch buffers #40706
lesj0610 wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request migrates TurboQuant decode buffers from per-layer static allocations to a centralized management system using the WorkspaceManager. The changes include new utility functions for workspace reservation and retrieval, as well as comprehensive unit tests covering heterogeneous head sizes and fallback scenarios. Review feedback identifies a high-priority issue: removing the original per-layer buffer registration without a fallback mechanism for v0 environments (where the WorkspaceManager is not used) risks Out-Of-Memory (OOM) errors during memory profiling. It is recommended to return a status from the reservation function and maintain the original registration logic as a fallback.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c3c19bf58
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Force-pushed d673da7 to 2de4f33
Before this PR, each TurboQuant attention layer kept three decode scratch buffers (`_tq_mid_o_buf`, `_tq_output_buf`, `_tq_lse_buf`) as persistent `register_buffer` entries. These are temporary scratch only, not real state, but they stayed allocated per layer, so KV cache memory was wasted in proportion to the number of attention layers.

This PR removes those per-layer buffers. Each layer now calls `reserve_turboquant_decode_workspace()` at init, and all layers share three workspace tensors from `WorkspaceManager` at decode time.

I ran the duplicate check before opening:

`gh pr list --repo vllm-project/vllm --state open --search "turboquant decode"`

The closest result is #40655. That PR puts one shared buffer on the `Attention` class. This PR uses the existing v1 workspace lifecycle instead (`reserve` before warmup, `lock`, then acquire at runtime). Shared state does not go on the `Attention` class, so the pipeline parallelism concern raised in #40655 is addressed differently here.

If `WorkspaceManager` is not initialized, decode falls back to the previous lazy per-layer buffer reuse path.

KV cache memory — Qwen3-8B, TP=2, RTX 3090 (origin/main vs. this branch, for `turboquant_k8v4` and `turboquant_4bit_nc`):
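The reserve-before-warmup / lock / acquire-at-runtime lifecycle described above can be sketched in miniature. This is an illustration only: the class body, the buffer keys, and the sizing arithmetic are assumptions for the sketch, not vLLM's actual `WorkspaceManager` implementation; only the names `WorkspaceManager`, `reserve_turboquant_decode_workspace`, `reserve`, and `lock` come from this PR.

```python
# Minimal sketch (assumed implementation) of the shared-workspace lifecycle:
# every layer reserves at init, lock() allocates once before warmup, and all
# layers acquire views of the same scratch at decode time.
class WorkspaceManager:
    def __init__(self):
        self._sizes = {}    # key -> max bytes requested across layers
        self._buffers = {}  # key -> single shared allocation
        self._locked = False

    def reserve(self, key, nbytes):
        # Only the maximum request per key is kept, so N layers sharing a
        # key cost one allocation instead of N.
        if self._locked:
            raise RuntimeError("workspace already locked")
        self._sizes[key] = max(self._sizes.get(key, 0), nbytes)

    def lock(self):
        # Called once before warmup: allocate every reserved buffer.
        self._buffers = {k: bytearray(n) for k, n in self._sizes.items()}
        self._locked = True

    def acquire(self, key, nbytes):
        # Called at decode time; each layer gets a view of the shared buffer.
        buf = self._buffers[key]
        assert nbytes <= len(buf)
        return memoryview(buf)[:nbytes]


def reserve_turboquant_decode_workspace(mgr, head_size, num_heads, max_tokens):
    # Hypothetical sizing for the three scratch buffers named above.
    elem = 2  # assumed fp16 element size in bytes
    mgr.reserve("tq_mid_o", max_tokens * num_heads * head_size * elem)
    mgr.reserve("tq_output", max_tokens * num_heads * head_size * elem)
    mgr.reserve("tq_lse", max_tokens * num_heads * 4)  # fp32 log-sum-exp


mgr = WorkspaceManager()
for head_size in (64, 128):  # heterogeneous head sizes across layers
    reserve_turboquant_decode_workspace(mgr, head_size, num_heads=8, max_tokens=256)
mgr.lock()
mid_o = mgr.acquire("tq_mid_o", 256 * 8 * 64 * 2)  # smaller layer's view
```

Because only the maximum request per key survives `reserve`, layers with heterogeneous head sizes end up sharing one allocation sized for the largest, which is what the heterogeneous-head-size unit tests exercise.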
turboquant_4bit_nc, short chat also returned서울on both branches.Tests:
Both passed.
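The fallback scenario covered by the tests (decode without an initialized `WorkspaceManager`) can be sketched as follows. The class and attribute names here are hypothetical, chosen only to illustrate the lazy per-layer buffer reuse path; they are not the PR's literal code.

```python
# Sketch (assumed structure) of the decode-time scratch lookup with fallback:
# use the shared workspace when a manager exists, otherwise lazily allocate a
# per-layer buffer once and reuse it across decode steps.
class TurboQuantLayer:
    def __init__(self, nbytes, manager=None):
        self.nbytes = nbytes
        self.manager = manager  # None models the uninitialized (v0) case
        self._lazy_buf = None   # per-layer fallback scratch

    def _decode_scratch(self):
        if self.manager is not None:
            # Shared path: all layers view the same workspace tensor.
            return self.manager.acquire("tq_output", self.nbytes)
        # Fallback path: allocate once, then reuse on every decode step.
        if self._lazy_buf is None or len(self._lazy_buf) < self.nbytes:
            self._lazy_buf = bytearray(self.nbytes)
        return self._lazy_buf


layer = TurboQuantLayer(1024)          # no manager -> fallback path
scratch = layer._decode_scratch()      # allocated lazily on first use
scratch_again = layer._decode_scratch()  # same object, no reallocation
```

The point of the fallback is that removing the `register_buffer` registrations does not regress environments where the workspace manager never runs: memory use there matches the old lazy-reuse behavior.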