[Bugfix] Fix backup token index in async spec decode (fixes Nemotron BF16 accuracy)#38419
Conversation
…re_next_token_ids_padded) When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951), `optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens` is passed to `prepare_next_token_ids_padded` as `seq_lens_cpu`. This value is inflated relative to the actual committed `output_token_ids` because `_prepare_inputs` appends `-1` placeholder slots optimistically.

The backup token lookup calls `request.get_token_id(seq_lens_cpu[i])`, where `seq_lens_cpu[i]` points one past the end of the committed tokens, causing `get_token_id()` to return `-1` (a placeholder). The drafter then receives `-1` as its next input token, which corrupts its hidden state and degrades the draft acceptance rate, causing the Nemotron-3-Super-120B BF16 GSM8K score to drop from ~0.93 to ~0.74.

Fix: use `num_tokens_no_spec[i] - 1` (the index of the last committed output token) for the backup token lookup in both `EagleProposer` (`eagle.py`) and `ExtractHiddenStatesProposer` (`extract_hidden_states.py`). `num_tokens_no_spec` is set to `request.num_tokens` before the optimistic extend, so it always points to a valid token slot.

Fixes: vllm-project#38098

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
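The indexing bug described above can be sketched in isolation. This is a hypothetical minimal model, not the actual vLLM code: the stub `Request` class and its `get_token_id` only mimic the behavior described in this PR, where out-of-range slots read back as the `-1` placeholder.

```python
class Request:
    """Hypothetical stand-in for vLLM's request object (illustration only)."""

    def __init__(self, output_token_ids):
        self.output_token_ids = output_token_ids
        self.num_tokens = len(output_token_ids)

    def get_token_id(self, idx):
        # Mimics the placeholder behavior: out-of-range slots read as -1.
        if 0 <= idx < len(self.output_token_ids):
            return self.output_token_ids[idx]
        return -1


req = Request([101, 102, 103])               # three committed tokens
num_tokens_no_spec = req.num_tokens          # 3, recorded before the optimistic extend
optimistic_seq_len = num_tokens_no_spec + 2  # inflated by draft placeholder slots

buggy = req.get_token_id(optimistic_seq_len)      # indexes past committed tokens -> -1
fixed = req.get_token_id(num_tokens_no_spec - 1)  # last committed token -> 103
print(buggy, fixed)  # -1 103
```

Feeding `buggy` (`-1`) to the drafter is exactly the corruption the PR fixes; `fixed` always lands on a real token.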
Code Review
This pull request addresses a bug in speculative decoding where backup tokens were incorrectly retrieved due to an off-by-one error and sequence length inflation during async scheduling. The implementation replaces the use of optimistic sequence lengths with the index of the last committed token (num_tokens_no_spec - 1) in both eagle.py and extract_hidden_states.py. A new regression test file was added to verify the fix. Review feedback recommends renaming one of the new tests to better reflect that the original logic was consistently off-by-one, even in non-async scenarios.
```python
def test_buggy_vs_fixed_diverge_only_with_inflation(self):
    """Buggy and fixed code agree when there is no inflation
    (seq_lens == num_tokens_no_spec), and diverge when inflated."""
```
The test name and docstring are a bit misleading. They suggest that the buggy and fixed code only diverge when there's inflation from async scheduling. However, the test implementation and comments within it correctly point out that the original code was always off-by-one, even without inflation.
To avoid future confusion, I'd recommend renaming the test and updating its docstring to reflect this important finding.
```diff
-def test_buggy_vs_fixed_diverge_only_with_inflation(self):
-    """Buggy and fixed code agree when there is no inflation
-    (seq_lens == num_tokens_no_spec), and diverge when inflated."""
+def test_buggy_code_was_always_off_by_one(self):
+    """Verify that the original buggy code was always off-by-one.
+
+    This test confirms that the buggy and fixed code paths diverge even
+    without async inflation. The original code used `seq_len` as an index,
+    which is an off-by-one error, while the fixed code correctly uses
+    `seq_len - 1`.
+    """
```
Per Gemini review, the original name was misleading — the buggy code was always off-by-one, not just when async inflation was present. Rename to test_buggy_code_was_always_off_by_one and update the docstring to clearly explain that seq_len (= num_tokens) is always out of range for get_token_id(). Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
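The key point of the rename, that `seq_len` (= `num_tokens`) is out of range for `get_token_id()` even with zero inflation, can be shown with a condensed, self-contained sketch. This is hypothetical illustration code, not the actual regression test in `tests/v1/spec_decode/test_backup_token_async_spec.py`; `get_token_id` here is a stand-in that returns `-1` out of range, as described in this PR.

```python
def test_buggy_code_was_always_off_by_one():
    tokens = [7, 8, 9]  # committed tokens; last valid index is 2

    def get_token_id(idx: int) -> int:
        # Stand-in for the request API: placeholder -1 when out of range.
        return tokens[idx] if 0 <= idx < len(tokens) else -1

    num_tokens_no_spec = len(tokens)
    # inflation = 0 models sync scheduling; > 0 models async placeholder slots.
    for inflation in (0, 1, 2):
        seq_len = num_tokens_no_spec + inflation
        # Buggy index: out of range for every inflation value, including 0.
        assert get_token_id(seq_len) == -1
        # Fixed index: always the last committed token.
        assert get_token_id(num_tokens_no_spec - 1) == 9


test_buggy_code_was_always_off_by_one()
```

Note the `inflation = 0` case: it demonstrates that the divergence does not depend on async scheduling at all, which is what motivated the rename.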
Hi @SandishKumarHN, thank you very much for the fix! I found that this fix was necessary but not sufficient on its own to restore the Nemotron-3-Super accuracy to expected levels. I made an additional fix and posted a PR with you as co-author that incorporates both of our changes: #38556. Let's work to get that one merged 👍
Closing in favor of #38556 which has merged |
Summary
Fixes the backup token selection in `prepare_next_token_ids_padded` for both the EAGLE and `ExtractHiddenStates` proposers.

When async scheduling is enabled, `seq_lens_cpu` is the optimistic sequence length: it includes unaccepted draft token placeholders. Using this value as an index into `get_token_id()` returns `-1` (out of range), which gets fed back to the drafter as the next input token and degrades the draft acceptance rate.

The fix uses `num_tokens_no_spec - 1` instead, which is the index of the last committed output token and is always valid.

Root cause
This was identified as the root cause of the Nemotron BF16 accuracy regression in the LM Eval Large Models H200 CI job (accuracy ~0.74 instead of the expected ~0.93).
Changes
- `vllm/v1/spec_decode/eagle.py`: use `num_tokens_no_spec - 1` as the backup token index
- `vllm/v1/spec_decode/extract_hidden_states.py`: same fix
- `tests/v1/spec_decode/test_backup_token_async_spec.py`: unit tests covering both the buggy and fixed behavior

Test
Closes #38098