[Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy)#38419
Conversation
…re_next_token_ids_padded) When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951), `optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens` is passed to `prepare_next_token_ids_padded` as `seq_lens_cpu`. This value is inflated relative to the actual committed `output_token_ids` because `_prepare_inputs` appends `-1` placeholder slots optimistically.

The backup token lookup calls `request.get_token_id(seq_lens_cpu[i])`, where `seq_lens_cpu[i]` points one past the end of the committed tokens, causing `get_token_id()` to return `-1` (a placeholder). The drafter then receives `-1` as its next input token, which corrupts its hidden state and degrades the draft acceptance rate, causing the Nemotron-3-Super-120B BF16 GSM8K score to drop from ~0.93 to ~0.74.

Fix: use `num_tokens_no_spec[i] - 1` (the index of the last committed output token) for the backup token lookup in both `EagleProposer` (`eagle.py`) and `ExtractHiddenStatesProposer` (`extract_hidden_states.py`). `num_tokens_no_spec` is set to `request.num_tokens` before the optimistic extend, so it always points to a valid token slot.

Fixes: vllm-project#38098

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
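The indexing bug described above can be sketched in isolation. This is a hypothetical minimal model, not the actual vLLM code: the stub `Request` class and its `get_token_id` only mimic the behavior described in this PR, where out-of-range slots read back as the `-1` placeholder.

```python
class Request:
    """Hypothetical stand-in for vLLM's request object (illustration only)."""

    def __init__(self, output_token_ids):
        self.output_token_ids = output_token_ids
        self.num_tokens = len(output_token_ids)

    def get_token_id(self, idx):
        # Mimics the placeholder behavior: out-of-range slots read as -1.
        if 0 <= idx < len(self.output_token_ids):
            return self.output_token_ids[idx]
        return -1


req = Request([101, 102, 103])               # three committed tokens
num_tokens_no_spec = req.num_tokens          # 3, recorded before the optimistic extend
optimistic_seq_len = num_tokens_no_spec + 2  # inflated by draft placeholder slots

buggy = req.get_token_id(optimistic_seq_len)      # indexes past committed tokens -> -1
fixed = req.get_token_id(num_tokens_no_spec - 1)  # last committed token -> 103
print(buggy, fixed)  # -1 103
```

Feeding `buggy` (`-1`) to the drafter is exactly the corruption the PR fixes; `fixed` always lands on a real token.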
Code Review
This pull request addresses a bug in speculative decoding where backup tokens were incorrectly retrieved due to an off-by-one error and sequence length inflation during async scheduling. The implementation replaces the use of optimistic sequence lengths with the index of the last committed token (num_tokens_no_spec - 1) in both eagle.py and extract_hidden_states.py. A new regression test file was added to verify the fix. Review feedback recommends renaming one of the new tests to better reflect that the original logic was consistently off-by-one, even in non-async scenarios.
```python
def test_buggy_vs_fixed_diverge_only_with_inflation(self):
    """Buggy and fixed code agree when there is no inflation
    (seq_lens == num_tokens_no_spec), and diverge when inflated."""
```
The test name and docstring are a bit misleading. They suggest that the buggy and fixed code only diverge when there's inflation from async scheduling. However, the test implementation and comments within it correctly point out that the original code was always off-by-one, even without inflation.
To avoid future confusion, I'd recommend renaming the test and updating its docstring to reflect this important finding.
```diff
-def test_buggy_vs_fixed_diverge_only_with_inflation(self):
-    """Buggy and fixed code agree when there is no inflation
-    (seq_lens == num_tokens_no_spec), and diverge when inflated."""
+def test_buggy_code_was_always_off_by_one(self):
+    """Verify that the original buggy code was always off-by-one.
+
+    This test confirms that the buggy and fixed code paths diverge even
+    without async inflation. The original code used `seq_len` as an index,
+    which is an off-by-one error, while the fixed code correctly uses
+    `seq_len - 1`.
+    """
```
Per Gemini review, the original name was misleading — the buggy code was always off-by-one, not just when async inflation was present. Rename to test_buggy_code_was_always_off_by_one and update the docstring to clearly explain that seq_len (= num_tokens) is always out of range for get_token_id(). Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
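The key point of the rename, that `seq_len` (= `num_tokens`) is out of range for `get_token_id()` even with zero inflation, can be shown with a condensed, self-contained sketch. This is hypothetical illustration code, not the actual regression test in `tests/v1/spec_decode/test_backup_token_async_spec.py`; `get_token_id` here is a stand-in that returns `-1` out of range, as described in this PR.

```python
def test_buggy_code_was_always_off_by_one():
    tokens = [7, 8, 9]  # committed tokens; last valid index is 2

    def get_token_id(idx: int) -> int:
        # Stand-in for the request API: placeholder -1 when out of range.
        return tokens[idx] if 0 <= idx < len(tokens) else -1

    num_tokens_no_spec = len(tokens)
    # inflation = 0 models sync scheduling; > 0 models async placeholder slots.
    for inflation in (0, 1, 2):
        seq_len = num_tokens_no_spec + inflation
        # Buggy index: out of range for every inflation value, including 0.
        assert get_token_id(seq_len) == -1
        # Fixed index: always the last committed token.
        assert get_token_id(num_tokens_no_spec - 1) == 9


test_buggy_code_was_always_off_by_one()
```

Note the `inflation = 0` case: it demonstrates that the divergence does not depend on async scheduling at all, which is what motivated the rename.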
Hi @SandishKumarHN, thank you very much for the fix! I found that this fix was necessary but not sufficient on its own to restore the Nemotron-3-Super accuracy to expected levels. I made an additional fix and posted a PR with you as co-author that incorporates both of our changes: #38556. Let's work to get that one merged 👍
Closing in favor of #38556 which has merged |
Summary
Fixes the backup token selection in `prepare_next_token_ids_padded` for both the EAGLE and `ExtractHiddenStates` proposers.

When async scheduling is enabled, `seq_lens_cpu` is the optimistic sequence length: it includes unaccepted draft token placeholders. Using this value as an index into `get_token_id()` returns `-1` (out of range), which gets fed back to the drafter as the next input token and degrades the draft acceptance rate.

The fix uses `num_tokens_no_spec - 1` instead, which is the index of the last committed output token and is always valid.

Root cause
This was identified as the root cause of the Nemotron BF16 accuracy regression in the LM Eval Large Models H200 CI job (accuracy ~0.74 instead of the expected ~0.93).
Changes
- `vllm/v1/spec_decode/eagle.py`: use `num_tokens_no_spec - 1` as the backup token index
- `vllm/v1/spec_decode/extract_hidden_states.py`: same fix
- `tests/v1/spec_decode/test_backup_token_async_spec.py`: unit tests covering both the buggy and fixed behavior

Test
Closes #38098