Skip to content

[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151

Draft
njhill wants to merge 1 commit into
vllm-project:mainfrom
njhill:fix-pd-specdec
Draft

[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151
njhill wants to merge 1 commit into
vllm-project:mainfrom
njhill:fix-pd-specdec

Conversation

@njhill

@njhill njhill commented Jun 1, 2026

Copy link
Copy Markdown
Member

Purpose

Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with speculative decoding, plus a tokenization bug in the PD acceptance test that was masking it.

Bug 1 (correctness): prefill transfers lookahead blocks In request_finished, the prefill node reports remote_num_tokens = num_computed_tokens but sends its full block allocation, which with speculative decoding includes trailing lookahead-reservation blocks. When num_prompt_tokens % block_size == 0, the lookahead slot spills into an extra block, so len(remote_block_ids) == len(local_block_ids) + 1 on the decode side. _apply_prefix_caching then does remote[-num_local:] -- which assumes the surplus is an already-cached prefix and keeps the remote suffix. That drops the real first block and keeps the never-written lookahead block, shifting the entire block-to-block mapping. The decode node ends up attending to never-written KV (on both the target and the EAGLE3 drafter layers), producing wrong outputs for the affected requests.

Fix: clip the transferred block_ids per KV cache group, using each group's own block_size, to the blocks covering num_computed_tokens. Self-attention groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state groups (Mamba/SSM) and any other spec whose length is not indexed by token count are passed through unchanged, so hybrid models are handled correctly.

Bug 2 (test): double-BOS in the PD acceptance test test_spec_decode_acceptance.py sends already-chat-templated prompts (which contain the BOS token) through the completions API, which prepends another BOS by default. This double-BOS lowered acceptance ~5% versus the standalone baselines (which tokenize with add_special_tokens=False), affecting single-engine and PD identically. Set add_special_tokens=False so the test compares like-for-like.

Test Plan

Test plan:

  • New unit tests in test_remote_decode_lifecycle.py (CPU): test_remote_decode_drops_lookahead_blocks (parametrized over 0/1/2 trailing lookahead blocks) and test_remote_decode_lookahead_clip_is_per_group (hybrid Mamba + attention with a per-group block_size). Both directly exercise request_finished; they fail without the fix and pass with it. Full file: 8 passed.
  • NIXL PD + EAGLE3 acceptance (Llama-3.1-8B, FLASH_ATTN, 2xGPU): per-position acceptance now 0.728 / 0.524 / 0.363 (vs baseline 0.730 / 0.521 / 0.354) and PASSES; V2 matches V1. Before the fix the misalignment dropped pos-2 acceptance and produced divergent outputs for prompt_len % block_size == 0 requests.

Not a duplicate: checked open NIXL/PD/spec-decode PRs (#43151, #42554, #41169, #35264); none addresses the lookahead-block transfer / block-mapping misalignment.

AI assistance (Claude) was used; all changes reviewed and tested by the submitter.

…kahead blocks

Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with
speculative decoding, plus a tokenization bug in the PD acceptance test
that was masking it.

Bug 1 (correctness): prefill transfers lookahead blocks
In request_finished, the prefill node reports remote_num_tokens =
num_computed_tokens but sends its full block allocation, which with
speculative decoding includes trailing lookahead-reservation blocks. When
num_prompt_tokens % block_size == 0, the lookahead slot spills into an extra
block, so len(remote_block_ids) == len(local_block_ids) + 1 on the decode
side. _apply_prefix_caching then does remote[-num_local:] -- which assumes
the surplus is an already-cached prefix and keeps the remote suffix. That
drops the real first block and keeps the never-written lookahead block,
shifting the entire block-to-block mapping. The decode node ends up attending
to never-written KV (on both the target and the EAGLE3 drafter layers),
producing wrong outputs for the affected requests.

Fix: clip the transferred block_ids per KV cache group, using each group's
own block_size, to the blocks covering num_computed_tokens. Self-attention
groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state
groups (Mamba/SSM) and any other spec whose length is not indexed by token
count are passed through unchanged, so hybrid models are handled correctly.

Bug 2 (test): double-BOS in the PD acceptance test
test_spec_decode_acceptance.py sends already-chat-templated prompts (which
contain the BOS token) through the completions API, which prepends another
BOS by default. This double-BOS lowered acceptance ~5% versus the standalone
baselines (which tokenize with add_special_tokens=False), affecting
single-engine and PD identically. Set add_special_tokens=False so the test
compares like-for-like.

Test plan:
- New unit tests in test_remote_decode_lifecycle.py (CPU):
  test_remote_decode_drops_lookahead_blocks (parametrized over 0/1/2 trailing
  lookahead blocks) and test_remote_decode_lookahead_clip_is_per_group (hybrid
  Mamba + attention with a per-group block_size). Both directly exercise
  request_finished; they fail without the fix and pass with it. Full file:
  8 passed.
- NIXL PD + EAGLE3 acceptance (Llama-3.1-8B, FLASH_ATTN, 2xGPU): per-position
  acceptance now 0.728 / 0.524 / 0.363 (vs baseline 0.730 / 0.521 / 0.354)
  and PASSES; V2 matches V1. Before the fix the misalignment dropped pos-2
  acceptance and produced divergent outputs for prompt_len % block_size == 0
  requests.

Not a duplicate: checked open NIXL/PD/spec-decode PRs (vllm-project#43151, vllm-project#42554, vllm-project#41169,
vllm-project#35264); none addresses the lookahead-block transfer / block-mapping
misalignment.

AI assistance (Claude) was used; all changes reviewed and tested by the
submitter.

Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@njhill njhill marked this pull request as draft June 1, 2026 02:40
@mergify mergify Bot added v1 bug Something isn't working kv-connector labels Jun 1, 2026

@NickLucche NickLucche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @njhill can you check out #43996?
I think I described the issue, along with a path to simply skip drafting on P with a config-level check here #43807 , which I believe should make the workflow easier.

Let me know what you think

@njhill

njhill commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Ah thanks @NickLucche! Actually I came across this independently from debugging a MRV2 test failure, so just opened this fix as a placeholder... I hadn't seen your issue but will check it now

@mergify

mergify Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants