[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151
Draft
njhill wants to merge 1 commit into
Draft
[Bugfix] NIXL PD: Don't transfer spec-decode lookahead blocks#44151njhill wants to merge 1 commit into
njhill wants to merge 1 commit into
Conversation
…kahead blocks Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with speculative decoding, plus a tokenization bug in the PD acceptance test that was masking it. Bug 1 (correctness): prefill transfers lookahead blocks In request_finished, the prefill node reports remote_num_tokens = num_computed_tokens but sends its full block allocation, which with speculative decoding includes trailing lookahead-reservation blocks. When num_prompt_tokens % block_size == 0, the lookahead slot spills into an extra block, so len(remote_block_ids) == len(local_block_ids) + 1 on the decode side. _apply_prefix_caching then does remote[-num_local:] -- which assumes the surplus is an already-cached prefix and keeps the remote suffix. That drops the real first block and keeps the never-written lookahead block, shifting the entire block-to-block mapping. The decode node ends up attending to never-written KV (on both the target and the EAGLE3 drafter layers), producing wrong outputs for the affected requests. Fix: clip the transferred block_ids per KV cache group, using each group's own block_size, to the blocks covering num_computed_tokens. Self-attention groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state groups (Mamba/SSM) and any other spec whose length is not indexed by token count are passed through unchanged, so hybrid models are handled correctly. Bug 2 (test): double-BOS in the PD acceptance test test_spec_decode_acceptance.py sends already-chat-templated prompts (which contain the BOS token) through the completions API, which prepends another BOS by default. This double-BOS lowered acceptance ~5% versus the standalone baselines (which tokenize with add_special_tokens=False), affecting single-engine and PD identically. Set add_special_tokens=False so the test compares like-for-like. Test plan: - New unit tests in test_remote_decode_lifecycle.py (CPU): test_remote_decode_drops_lookahead_blocks (parametrized over 0/1/2 trailing lookahead blocks) and test_remote_decode_lookahead_clip_is_per_group (hybrid Mamba + attention with a per-group block_size). Both directly exercise request_finished; they fail without the fix and pass with it. Full file: 8 passed. - NIXL PD + EAGLE3 acceptance (Llama-3.1-8B, FLASH_ATTN, 2xGPU): per-position acceptance now 0.728 / 0.524 / 0.363 (vs baseline 0.730 / 0.521 / 0.354) and PASSES; V2 matches V1. Before the fix the misalignment dropped pos-2 acceptance and produced divergent outputs for prompt_len % block_size == 0 requests. Not a duplicate: checked open NIXL/PD/spec-decode PRs (vllm-project#43151, vllm-project#42554, vllm-project#41169, vllm-project#35264); none addresses the lookahead-block transfer / block-mapping misalignment. AI assistance (Claude) was used; all changes reviewed and tested by the submitter. Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NickLucche
reviewed
Jun 1, 2026
Member
Author
|
Ah thanks @NickLucche! Actually I came across this independently from debugging a MRV2 test failure, so just opened this fix as a placeholder... I hadn't seen your issue but will check it now |
Contributor
|
This pull request has merge conflicts that must be resolved before it can be |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Fixes a KV-corruption bug in NIXL prefill/decode disaggregation with speculative decoding, plus a tokenization bug in the PD acceptance test that was masking it.
Bug 1 (correctness): prefill transfers lookahead blocks In request_finished, the prefill node reports
remote_num_tokens = num_computed_tokensbut sends its full block allocation, which with speculative decoding includes trailing lookahead-reservation blocks. Whennum_prompt_tokens % block_size == 0, the lookahead slot spills into an extra block, solen(remote_block_ids) == len(local_block_ids) + 1on the decode side._apply_prefix_cachingthen doesremote[-num_local:]-- which assumes the surplus is an already-cached prefix and keeps the remote suffix. That drops the real first block and keeps the never-written lookahead block, shifting the entire block-to-block mapping. The decode node ends up attending to never-written KV (on both the target and the EAGLE3 drafter layers), producing wrong outputs for the affected requests.Fix: clip the transferred block_ids per KV cache group, using each group's own block_size, to the blocks covering num_computed_tokens. Self-attention groups (full / sliding-window, incl. MLA/sink subclasses) are clipped; state groups (Mamba/SSM) and any other spec whose length is not indexed by token count are passed through unchanged, so hybrid models are handled correctly.
Bug 2 (test): double-BOS in the PD acceptance test test_spec_decode_acceptance.py sends already-chat-templated prompts (which contain the BOS token) through the completions API, which prepends another BOS by default. This double-BOS lowered acceptance ~5% versus the standalone baselines (which tokenize with
add_special_tokens=False), affecting single-engine and PD identically. Set add_special_tokens=False so the test compares like-for-like.Test Plan
Test plan:
test_remote_decode_lifecycle.py(CPU):test_remote_decode_drops_lookahead_blocks(parametrized over 0/1/2 trailing lookahead blocks) andtest_remote_decode_lookahead_clip_is_per_group(hybrid Mamba + attention with a per-group block_size). Both directly exerciserequest_finished; they fail without the fix and pass with it. Full file: 8 passed.prompt_len % block_size == 0requests.Not a duplicate: checked open NIXL/PD/spec-decode PRs (#43151, #42554, #41169, #35264); none addresses the lookahead-block transfer / block-mapping misalignment.
AI assistance (Claude) was used; all changes reviewed and tested by the submitter.