
[Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy)#38419

Closed
SandishKumarHN wants to merge 3 commits into vllm-project:main from SandishKumarHN:fix/eagle-async-spec-decode-backup-token

Conversation

@SandishKumarHN
Contributor

@SandishKumarHN SandishKumarHN commented Mar 28, 2026

Summary

Fixes the backup token selection in prepare_next_token_ids_padded for both EAGLE and ExtractHiddenStates proposers.

When async scheduling is enabled, seq_lens_cpu is the optimistic sequence length — it includes unaccepted draft token placeholders. Using this value as an index into get_token_id() returns -1 (out of range), which gets fed back to the drafter as the next input token and degrades draft acceptance rate.

The fix uses num_tokens_no_spec - 1 instead, which is the index of the last committed output token and is always valid.
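The indexing problem can be sketched in a few lines. `FakeRequest` and the literal token values below are illustrative stand-ins, not vLLM's actual `Request` class:

```python
# Minimal sketch of the off-by-one; FakeRequest is a hypothetical stand-in
# for vLLM's Request, and the token ids are made up.
class FakeRequest:
    def __init__(self, committed_tokens):
        self.output_token_ids = committed_tokens

    def get_token_id(self, idx):
        # Past-the-end lookups return the -1 placeholder, as described above.
        if idx >= len(self.output_token_ids):
            return -1
        return self.output_token_ids[idx]

req = FakeRequest([11, 22, 33])   # 3 committed output tokens
num_tokens_no_spec = 3            # request length before the optimistic extend
optimistic_seq_len = 5            # inflated by 2 unaccepted draft slots

buggy = req.get_token_id(optimistic_seq_len)       # -1: placeholder
fixed = req.get_token_id(num_tokens_no_spec - 1)   # 33: last committed token
print(buggy, fixed)
```

With async scheduling, the optimistic length always lands on a placeholder, so the drafter sees `-1`; the fixed index always resolves to the last committed token.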

Root cause

This was identified as the root cause of the Nemotron BF16 accuracy regression in the LM Eval Large Models H200 CI job (accuracy ~0.74 instead of the expected ~0.93).

Changes

  • vllm/v1/spec_decode/eagle.py: use num_tokens_no_spec - 1 as backup token index
  • vllm/v1/spec_decode/extract_hidden_states.py: same fix
  • tests/v1/spec_decode/test_backup_token_async_spec.py: unit tests covering both the buggy and fixed behavior

Test

pytest tests/v1/spec_decode/test_backup_token_async_spec.py -v
# 11 passed

Closes #38098

…re_next_token_ids_padded)

When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951),
optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is
passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is
inflated relative to the actual committed output_token_ids because
_prepare_inputs appends -1 placeholder slots optimistically.

The backup token lookup calls:
  request.get_token_id(seq_lens_cpu[i])

where seq_lens_cpu[i] points one past the end of the committed tokens,
causing get_token_id() to return -1 (placeholder). The drafter then
receives -1 as its next input token, which corrupts its hidden state and
degrades the draft acceptance rate — causing the Nemotron-3-Super-120B
BF16 GSM8K score to drop from ~0.93 to ~0.74.

Fix: use (num_tokens_no_spec[i] - 1) — the index of the last committed
output token — for the backup token lookup in both EagleProposer
(eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py).
num_tokens_no_spec is set to request.num_tokens before the optimistic
extend, so it always points to a valid token slot.

Fixes: vllm-project#38098
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
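With illustrative numbers (assumed for the sketch, not taken from the PR), the inflation described in the commit message plays out like this:

```python
# Illustrative numbers only; the helper mimics Request.get_token_id(),
# returning -1 for any index past the stored tokens.
def get_token_id(tokens, idx):
    return tokens[idx] if idx < len(tokens) else -1

committed = list(range(100, 111))        # 11 committed token ids
padded = committed + [-1, -1]            # _prepare_inputs-style -1 placeholders

num_computed_tokens = 10
num_scheduled_tokens = 3                 # 1 real token + 2 speculative slots
num_tokens_no_spec = len(committed)      # 11, set before the optimistic extend

seq_len = num_computed_tokens + num_scheduled_tokens   # 13, optimistic

buggy = get_token_id(padded, seq_len)                  # -1: one past the end
fixed = get_token_id(padded, num_tokens_no_spec - 1)   # 110: last committed
print(buggy, fixed)
```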

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in speculative decoding where backup tokens were incorrectly retrieved due to an off-by-one error and sequence length inflation during async scheduling. The implementation replaces the use of optimistic sequence lengths with the index of the last committed token (num_tokens_no_spec - 1) in both eagle.py and extract_hidden_states.py. A new regression test file was added to verify the fix. Review feedback recommends renaming one of the new tests to better reflect that the original logic was consistently off-by-one, even in non-async scenarios.

Comment on lines +232 to +234
def test_buggy_vs_fixed_diverge_only_with_inflation(self):
"""Buggy and fixed code agree when there is no inflation
(seq_lens == num_tokens_no_spec), and diverge when inflated."""

Severity: high

The test name and docstring are a bit misleading. They suggest that the buggy and fixed code only diverge when there's inflation from async scheduling. However, the test implementation and comments within it correctly point out that the original code was always off-by-one, even without inflation.

To avoid future confusion, I'd recommend renaming the test and updating its docstring to reflect this important finding.

Suggested change
def test_buggy_vs_fixed_diverge_only_with_inflation(self):
"""Buggy and fixed code agree when there is no inflation
(seq_lens == num_tokens_no_spec), and diverge when inflated."""
def test_buggy_code_was_always_off_by_one(self):
"""Verify that the original buggy code was always off-by-one.
This test confirms that the buggy and fixed code paths diverge even
without async inflation. The original code used `seq_len` as an index,
which is an off-by-one error, while the fixed code correctly uses
`seq_len - 1`.
"""

Per Gemini review, the original name was misleading — the buggy code was
always off-by-one, not just when async inflation was present. Rename to
test_buggy_code_was_always_off_by_one and update the docstring to clearly
explain that seq_len (= num_tokens) is always out of range for get_token_id().

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
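The "always off-by-one" point from the review can be demonstrated directly: even when `seq_len` equals `num_tokens` (no async inflation), `seq_len` itself is one past the last valid index. A minimal sketch with made-up token ids:

```python
# Made-up token ids; shows seq_len is off-by-one even without inflation.
tokens = [7, 8, 9]        # num_tokens == 3, valid indices 0..2
seq_len = len(tokens)     # equals num_tokens when nothing is inflated

out_of_range = seq_len >= len(tokens)   # True: tokens[seq_len] would miss
last_valid = tokens[seq_len - 1]        # 9, the last committed token
print(out_of_range, last_valid)
```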
@SandishKumarHN SandishKumarHN changed the title from "[Spec Decode] Fix backup token off-by-one in async spec decode causing Nemotron BF16 GSM8K accuracy regression" to "[Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy)" Mar 28, 2026
@mergify mergify bot added the bug (Something isn't working) label Mar 28, 2026
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
@MatthewBonanni
Collaborator

Hi @SandishKumarHN, thank you very much for the fix! I found that this fix was necessary but not sufficient on its own to restore the Nemotron-3-Super accuracy to expected levels. I made an additional fix and posted a PR with you as co-author that incorporates both of our changes: #38556. Let's work to get that one merged 👍

@MatthewBonanni
Collaborator

Closing in favor of #38556 which has merged


Labels

bug (Something isn't working), speculative-decoding, v1


Development

Successfully merging this pull request may close these issues.

[CI Failure]: LM Eval Large Models (H200)
