[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks by njhill · Pull Request #44151 · vllm-project/vllm

njhill · 2026-06-01T02:40:14Z

Purpose

Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with speculative decoding, plus a tokenization bug in the PD acceptance test that was masking it.

Bug 1 (correctness): prefill transfers lookahead blocks In request_finished, the prefill node reports remote_num_tokens = num_computed_tokens but sends its full block allocation, which with speculative decoding includes trailing lookahead-reservation blocks. When num_prompt_tokens % block_size == 0, the lookahead slot spills into an extra block, so len(remote_block_ids) == len(local_block_ids) + 1 on the decode side. _apply_prefix_caching then does remote[-num_local:] -- which assumes the surplus is an already-cached prefix and keeps the remote suffix. That drops the real first block and keeps the never-written lookahead block, shifting the entire block-to-block mapping. The decode node ends up attending to never-written KV (on both the target and the EAGLE3 drafter layers), producing wrong outputs for the affected requests.

Fix: clip the transferred block_ids per KV cache group, using each group's own block_size, to the blocks covering num_computed_tokens. Self-attention groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state groups (Mamba/SSM) and any other spec whose length is not indexed by token count are passed through unchanged, so hybrid models are handled correctly.

Bug 2 (test): double-BOS in the PD acceptance test test_spec_decode_acceptance.py sends already-chat-templated prompts (which contain the BOS token) through the completions API, which prepends another BOS by default. This double-BOS lowered acceptance ~5% versus the standalone baselines (which tokenize with add_special_tokens=False), affecting single-engine and PD identically. Set add_special_tokens=False so the test compares like-for-like.

Test Plan

Test plan:

New unit tests in test_remote_decode_lifecycle.py (CPU): test_remote_decode_drops_lookahead_blocks (parametrized over 0/1/2 trailing lookahead blocks) and test_remote_decode_lookahead_clip_is_per_group (hybrid Mamba + attention with a per-group block_size). Both directly exercise request_finished; they fail without the fix and pass with it. Full file: 8 passed.
NIXL PD + EAGLE3 acceptance (Llama-3.1-8B, FLASH_ATTN, 2xGPU): per-position acceptance now 0.728 / 0.524 / 0.363 (vs baseline 0.730 / 0.521 / 0.354) and PASSES; V2 matches V1. Before the fix the misalignment dropped pos-2 acceptance and produced divergent outputs for prompt_len % block_size == 0 requests.

Not a duplicate: checked open NIXL/PD/spec-decode PRs (#43151, #42554, #41169, #35264); none addresses the lookahead-block transfer / block-mapping misalignment.

AI assistance (Claude) was used; all changes reviewed and tested by the submitter.

…kahead blocks Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with speculative decoding, plus a tokenization bug in the PD acceptance test that was masking it. Bug 1 (correctness): prefill transfers lookahead blocks In request_finished, the prefill node reports remote_num_tokens = num_computed_tokens but sends its full block allocation, which with speculative decoding includes trailing lookahead-reservation blocks. When num_prompt_tokens % block_size == 0, the lookahead slot spills into an extra block, so len(remote_block_ids) == len(local_block_ids) + 1 on the decode side. _apply_prefix_caching then does remote[-num_local:] -- which assumes the surplus is an already-cached prefix and keeps the remote suffix. That drops the real first block and keeps the never-written lookahead block, shifting the entire block-to-block mapping. The decode node ends up attending to never-written KV (on both the target and the EAGLE3 drafter layers), producing wrong outputs for the affected requests. Fix: clip the transferred block_ids per KV cache group, using each group's own block_size, to the blocks covering num_computed_tokens. Self-attention groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state groups (Mamba/SSM) and any other spec whose length is not indexed by token count are passed through unchanged, so hybrid models are handled correctly. Bug 2 (test): double-BOS in the PD acceptance test test_spec_decode_acceptance.py sends already-chat-templated prompts (which contain the BOS token) through the completions API, which prepends another BOS by default. This double-BOS lowered acceptance ~5% versus the standalone baselines (which tokenize with add_special_tokens=False), affecting single-engine and PD identically. Set add_special_tokens=False so the test compares like-for-like. Test plan: - New unit tests in test_remote_decode_lifecycle.py (CPU): test_remote_decode_drops_lookahead_blocks (parametrized over 0/1/2 trailing lookahead blocks) and test_remote_decode_lookahead_clip_is_per_group (hybrid Mamba + attention with a per-group block_size). Both directly exercise request_finished; they fail without the fix and pass with it. Full file: 8 passed. - NIXL PD + EAGLE3 acceptance (Llama-3.1-8B, FLASH_ATTN, 2xGPU): per-position acceptance now 0.728 / 0.524 / 0.363 (vs baseline 0.730 / 0.521 / 0.354) and PASSES; V2 matches V1. Before the fix the misalignment dropped pos-2 acceptance and produced divergent outputs for prompt_len % block_size == 0 requests. Not a duplicate: checked open NIXL/PD/spec-decode PRs (vllm-project#43151, vllm-project#42554, vllm-project#41169, vllm-project#35264); none addresses the lookahead-block transfer / block-mapping misalignment. AI assistance (Claude) was used; all changes reviewed and tested by the submitter. Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NickLucche

hey @njhill can you check out #43996?
I think I described the issue, along with a path to simply skip drafting on P with a config-level check here #43807 , which I believe should make the workflow easier.

Let me know what you think

njhill · 2026-06-01T15:01:41Z

Ah thanks @NickLucche! Actually I came across this independently from debugging a MRV2 test failure, so just opened this fix as a placeholder... I hadn't seen your issue but will check it now

mergify · 2026-06-12T11:17:32Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

njhill requested review from ApostaC, NickLucche, orozery and xuechendi as code owners June 1, 2026 02:40

njhill marked this pull request as draft June 1, 2026 02:40

mergify Bot added v1 bug Something isn't working kv-connector labels Jun 1, 2026

NickLucche reviewed Jun 1, 2026

View reviewed changes

mergify Bot added the needs-rebase label Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151

[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151
njhill wants to merge 1 commit into
vllm-project:mainfrom
njhill:fix-pd-specdec

njhill commented Jun 1, 2026

Uh oh!

NickLucche left a comment

Uh oh!

njhill commented Jun 1, 2026

Uh oh!

mergify Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhill commented Jun 1, 2026

Purpose

Test Plan

Uh oh!

NickLucche left a comment

Choose a reason for hiding this comment

Uh oh!

njhill commented Jun 1, 2026

Uh oh!

mergify Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants