[Attention][TurboQuant] Sparse V tile-skip (opt-in) #41422
TheTom wants to merge 1 commit into vllm-project:main from
Conversation
Add a per-tile skip in the TQ decode kernel: when a KV tile's softmax
probability is entirely below threshold, skip the V load + dequant +
weighted-sum and just decay the running accumulator. The online-softmax
denominator (l_prev, m_prev) is still updated either way, so totals
stay exact.
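The interaction between the tile skip and online softmax can be sketched in plain NumPy (illustrative only; the real kernel is Triton, and the function and argument names here are invented). The key point is that the denominator update runs unconditionally, so a skipped tile only rescales the accumulator by the usual online-softmax decay factor:

```python
import numpy as np

def decode_attention_sparse_v(q, k_tiles, v_tiles, threshold=0.0):
    """Illustrative online-softmax decode over KV tiles with a per-tile
    V skip. Names and structure are invented for exposition; the real
    kernel is Triton (SPARSE_V / SPARSE_V_THRESHOLD constexprs)."""
    d = q.shape[-1]
    m_prev = -np.inf           # running max of logits
    l_prev = 0.0               # running softmax denominator
    acc = np.zeros(d)          # running weighted-V accumulator
    for k, v in zip(k_tiles, v_tiles):
        s = k @ q / np.sqrt(d)             # tile logits
        m_new = max(m_prev, s.max())
        alpha = np.exp(m_prev - m_new)     # decay factor for old state
        p = np.exp(s - m_new)              # unnormalized tile probabilities
        # The denominator is ALWAYS updated, so totals stay exact.
        l_prev = l_prev * alpha + p.sum()
        if p.max() < threshold:
            # Sparse-V skip: no V load / dequant / weighted-sum,
            # just decay the accumulator.
            acc = acc * alpha
        else:
            acc = acc * alpha + p @ v
        m_prev = m_new
    return acc / l_prev
```

At `threshold=0` the skip branch can never fire (probabilities are non-negative), which is exactly the no-op property the test below relies on.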
* triton_turboquant_decode_attention: new SPARSE_V / SPARSE_V_THRESHOLD
constexprs on the stage-1 kernel; non-sparse code path is unchanged.
* Backend: VLLM_TQ_SPARSE_V (default "0"), VLLM_TQ_SPARSE_V_THRESHOLD,
VLLM_TQ_SPARSE_V_CTX_THRESHOLD env vars; _tq_sparse_v_enabled()
helper resolves "1" / "0" / "auto" with a context-length gate for
the auto case.
* Off by default: validated on AMD MI300X only; flipping the auto-on
default needs cross-platform numbers in a follow-up.
* Test: sparse_v=True at threshold=0 must produce byte-equivalent
output to sparse_v=False, since no tile satisfies max(p) < 0.
Signed-off-by: TheTom <tturney1@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines: IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban. 🚀
Code Review
This pull request introduces a 'Sparse V' optimization for TurboQuant attention, which skips Value tile loading and dequantization when softmax probabilities fall below a specified threshold. The implementation includes environment variable configurations for control, updates to the Triton decode kernel to support the optimization, and a new test case verifying that a zero threshold maintains baseline equivalence. I have no feedback to provide.
Replaces the per-element zero from the original PR #98 (which never saved compute, since the V matmul still ran with `val=0`) with a tile-uniform skip of the entire V matmul block when no position contributes meaningfully. A per-warp shfl reduction of the per-thread max softmax probability is followed by a per-block fan-out via shared memory. All threads branch on the same result, so they take the same path — no warp divergence (the cause of the April 24 revert in commit f2dc968 on the VEC kernel).

Off by default. Opt in at build time with `cmake -DGGML_CUDA_FLAGS=-DGGML_CUDA_TURBO_SPARSE_V_TILE`. The threshold defaults to `0.001f`; override with `-DGGML_CUDA_FLAGS='-DGGML_CUDA_TURBO_SPARSE_V_TILE -DGGML_CUDA_TURBO_SPARSE_V_THRESHOLD=0.0001f'`.

Mirror of vllm-project/vllm#41422, which validated this pattern on AMD MI300X: +7.13% decode @ 32K, with PPL bit-identical and NIAH all-pass on Qwen3-8B. The default-off path is byte-identical to upstream (verified on M5 Max Metal: Qwen2.5-7B Q8_0 sym turbo3 PPL 6.6594, exact match).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the vllm-project/vllm#41422 design pattern to llama.cpp's CUDA fattn-vec kernel: replace the per-lane sparse V skip (which had warp divergence on turbo paths and was compile-time gated off for turbo via PR #115's `if constexpr (!V_is_turbo)`) with a warp-uniform skip via `warp_reduce_max`. All lanes branch on the same value, so there is no warp divergence regardless of V type.

Off by default. Opt in at build time with `cmake -DCMAKE_CXX_FLAGS=-DGGML_CUDA_TURBO_SPARSE_V_VEC`. The threshold defaults to `0.001f` (matching vLLM PR #41422); override with `-DGGML_CUDA_TURBO_SPARSE_V_VEC_THRESHOLD=<val>`. The default-off path is byte-identical (verified on M5 Max Metal: Qwen2.5-7B Q8_0 sym turbo3 PPL 6.6594, exact match with the PR #115 baseline).

Pairs with the prior commit's tile-kernel sparse V skip (off-by-default opt-in via `GGML_CUDA_TURBO_SPARSE_V_TILE`) — that one targets fattn-tile prefill, this one targets fattn-vec decode (the actual hot path in single-token generation, where the vLLM win was measured).

NO MERGE — testing branch only. AMD MI300X HIP validation pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
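The contrast between the original per-element zeroing and the tile-uniform skip in the two commits above can be illustrated in NumPy terms (function names invented; the real code is a CUDA warp reduction and a single shared branch):

```python
import numpy as np

def weighted_v_per_element_zero(p, v, threshold):
    # Original PR #98 approach: zero small probabilities per element.
    # The V matmul still runs over the whole tile, so no compute is
    # saved, and a per-lane branch would diverge within the warp.
    p_masked = np.where(p < threshold, 0.0, p)
    return p_masked @ v

def weighted_v_tile_skip(p, v, threshold):
    # Tile-uniform skip: one reduction (max over the tile), then one
    # branch taken identically by every lane, so no warp divergence.
    # When the tile is negligible, the V load + matmul is skipped
    # entirely; otherwise the full, unmasked weights are used.
    if p.max() < threshold:
        return np.zeros(v.shape[-1])
    return p @ v
```

At `threshold=0` both variants reduce to the plain `p @ v`, which is the byte-equivalence property both PRs test.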
Cross-platform validation on Ampere SM86 (8× RTX A4000)

Tested this PR on Nemotron-3-Super-120B-AWQ-4bit (an 88-layer Mamba+MoE+Attention hybrid) on 8× RTX A4000 (SM86 Ampere), CUDA 13.0, driver 580.76.05, vLLM 0.20.0 with this PR cherry-picked. Two updates upstream were needed for the patch to apply cleanly:
Both are unrelated to this PR; just flagging them for anyone validating against the v0.20.0 GA tag.

Bit-equivalence at threshold=0

Reproduced the no-op claim by setting

Throughput on Ampere SM86

Single-stream decode,
*Both baseline and sparse-V crash at the same long-context threshold with

Findings
Suggestions
— MidasMining, 8× RTX A4000 SM86 / Nemotron-3-Super-120B / vLLM 0.20.0 + this PR
Purpose
Add a per-tile skip in the TurboQuant decode kernel: when a KV tile's softmax probability is entirely below `SPARSE_V_THRESHOLD`, skip the V load + dequant + weighted-sum and just decay the running accumulator. The online-softmax denominator (`l_prev`, `m_prev`) is still updated either way, so totals stay exact.

Off by default; opt-in via env var. The kernel-side branch costs a `tl.max` plus a comparison, and the win comes from avoiding ~10 `tl.load` calls and the dequant arithmetic on tiles that contribute negligibly. Validated on AMD MI300X only — flipping the default-on switch needs cross-platform numbers in a follow-up.

Summary of changes
* `triton_turboquant_decode_attention`: new `SPARSE_V` / `SPARSE_V_THRESHOLD` constexprs on the stage-1 kernel; the non-sparse path is byte-equivalent to upstream.
* Backend: `VLLM_TQ_SPARSE_V` (default `"0"`), `VLLM_TQ_SPARSE_V_THRESHOLD` (default `0.001`), `VLLM_TQ_SPARSE_V_CTX_THRESHOLD` (default `8192`); `_tq_sparse_v_enabled()` resolves `"1"` / `"0"` / `"auto"` with a context-length gate for the auto case.
* Test: `sparse_v=True` at `threshold=0` produces byte-equivalent output to `sparse_v=False`, since no tile satisfies `max(p) < 0`.

Duplicate-work check
Searched open issues / PRs for `turboquant` × {sparse, tile-skip, attention sparsity, softmax skip} — no matches. Tracking issue #40069 (TurboQuant follow-ups) does not list a sparse-V item. Adjacent perf work: `triton_turboquant_decode.py`.

Test Plan / Results
Tested on AMD MI300X (gfx942), ROCm 7.2, vLLM ROCm 7.2.1 wheels. NVIDIA validation pending — sparse V is hardware-agnostic in concept (a per-tile Triton branch), but the throughput claim is AMD-only.
Bit-equality (no-op equivalence)
→ 2/2 passed.
`sparse_v=True` with `threshold=0` produces output that is `torch.allclose(atol=1e-6, rtol=1e-6)` to `sparse_v=False` for both `turboquant_k8v4` (FP8 keys) and `turboquant_4bit_nc` (MSE keys).

32K — Qwen/Qwen3-8B, `turboquant_4bit_nc`

3-chunk PPL on `wikitext-2-raw/wiki.test.raw` at 32K, 1 trial. Tokens per branch: 24,570. PPL at 16K and PPL with
`sparse_v=0` on the treatment branch were also collected to confirm no-op equivalence; both are bit-identical to upstream.

128K — Qwen/Qwen3-4B-Instruct-2507-FP8, `turboquant_4bit_nc`

`_continuation_prefill` busts the locked workspace at 128K (`AssertionError: Workspace is locked but allocation requires 264 MB, current size is 262 MB`) — a separate bug, tracked at #40420. For the long-context bench below, both control and treatment were rebased onto Bot1822's branch (`tq-workspace-manager-main`), giving a clean apples-to-apples comparison on a tree where the workspace-lock bug is fixed. Without #40798 merged, this PR's 128K perf claim is currently inaccessible to upstream main users.

1-chunk PPL on `wikitext-2-raw/wiki.test.raw` at 128K, 1 trial. Tokens per branch: 131,070. The smaller throughput delta at 128K vs. 32K is likely a single-trial signal-to-noise issue at long context; a multi-trial bench would tighten the bound.
Full TQ unit-test suite
→ passes (2 new tests for sparse-V no-op equivalence; remaining tests untouched).
AI assistance
This PR was prepared with AI assistance (Anthropic Claude). Each line of the diff was reviewed by the human submitter; the bit-equality assertion and the benchmarks (PPL, NIAH, throughput) were run on the human's hardware (AMD MI300X dev cloud), and the result tables come from runs the human supervised. The first iteration crashed engine init with `NameError: name 'os' is not defined` — caught at A/B time, fixed, and re-validated. Commits carry `Co-authored-by: Claude` per AGENTS.md.