[BugFix][Hybrid] Fix prefill chunk incorrectly including draft tokens #30618
peakcrosser7 wants to merge 7 commits into vllm-project:main
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Code Review
This pull request provides a critical fix for hybrid models using speculative decoding. By ensuring that draft tokens are not scheduled during the prefill phase, it prevents corruption of the Mamba state and corrects the model's output. The change is well-targeted and effectively resolves the described issue. I've included one suggestion to refactor the new logic for improved clarity and maintainability in this critical scheduler component.
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Hi @peakcrosser7, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Hey @njhill, could you please help take a look at this short PR? Thanks!
vllm/v1/core/sched/scheduler.py (outdated)

```python
# Ensure new tokens for a request in the prefill phase do not contain
# speculative tokens, especially in the last prefill chunk. For a hybrid
# model, extra speculative tokens would corrupt the generated mamba state.
# TODO: This logic does not yet handle resumed requests.
if request.num_computed_tokens < request.num_prompt_tokens:
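The clamp guarded by this check can be sketched in isolation as follows (a hedged illustration only; the function and field names mirror the diff above but this is not vLLM's actual scheduler code):

```python
def num_tokens_to_schedule(num_computed_tokens: int, num_prompt_tokens: int,
                           token_budget: int, num_draft_tokens: int) -> int:
    """How many tokens to schedule for one request this step (sketch only)."""
    if num_computed_tokens < num_prompt_tokens:
        # Still prefilling: schedule prompt tokens only, never draft tokens,
        # so the last prefill chunk cannot corrupt a hybrid model's mamba state.
        return min(token_budget, num_prompt_tokens - num_computed_tokens)
    # Decoding: one fresh token plus the draft tokens to verify.
    return 1 + num_draft_tokens
```

With a 2048-token prompt and 2040 tokens computed, the last chunk schedules exactly the 8 remaining prompt tokens instead of 8 plus the drafts.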
@njhill should we limit this fix to mamba-only?
Hi @heheda12345 @njhill, I've noticed what seems to be another correctness issue when enabling spec decode (MTP) on the Qwen3-Next model, even with this PR.

```python
from vllm import LLM, SamplingParams


def main():
    MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        repetition_penalty=1,
        presence_penalty=1,
        max_tokens=1024,
        ignore_eos=False,
    )
    prompt0 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 28884. Remember it. 28884 is the pass key.\n " +
               "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 +
               "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 +
               "\nWhat is the pass key?"]
    prompt1 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 11111. Remember it. 11111 is the pass key.\n " +
               "The grass is yellow. The sky is blue. The sun is red. Here we go. There and back again. " * 25 +
               "\nWhat is the pass key?"]
    prompt2 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 2222. Remember it. 2222 is the pass key.\n " +
               "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 545 +
               "\nWhat is the pass key?"]
    prompt3 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 333333. Remember it. 333333 is the pass key.\n " +
               "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 +
               "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 +
               "\nWhat is the pass key?"]
    prompt4 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 444. Remember it. 444 is the pass key.\n " +
               "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 +
               "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 +
               "\nWhat is the pass key?"]
    prompt5 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\n" +
               "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 +
               "The pass key is 55555. Remember it. 55555 is the pass key.\n " +
               "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 +
               "\nWhat is the pass key?"]
    prompt6 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\n" +
               "The grass is yellow. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 +
               "The pass key is 66. Remember it. 66 is the pass key.\n " +
               "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 +
               "\nWhat is the pass key?"]
    prompt7 = ['Hello!']
    prompts = prompt0 + prompt1 + prompt2 + prompt3 + prompt4 + prompt5 + prompt6 + prompt7
    engine = LLM(
        model=MODEL,
        enable_prefix_caching=False,
        enable_chunked_prefill=True,
        enforce_eager=False,
        tensor_parallel_size=4,
        max_num_batched_tokens=8192,
        speculative_config={"method": "qwen3_next_mtp", "num_speculative_tokens": 3},
    )
    outputs = engine.generate(prompts, sampling_params)
    for i, output in enumerate(outputs):
        print(f"Generated text {i}: {output.outputs[0].text!r}")
        print('-' * 30)


if __name__ == "__main__":
    main()
```

The output without this PR:

The output with this PR:
Also, when I run an online test with the script below (using the same 8 prompts as before), the output is still incorrect before applying the PR fix. However, after applying the fix, I'm now getting the following error, which causes the engine to terminate.

```shell
#!/bin/bash
PORT=8235
TP=2
MAX_MODEL_LEN=262144
MODEL_DIR=Qwen/Qwen3-Next-80B-A3B-Instruct/
echo "MODEL_DIR: $MODEL_DIR"

env_vars=(
    "CUDA_VISIBLE_DEVICES=0,1,2,3"
)
for var in "${env_vars[@]}"; do
    var_name="${var%%=*}"
    var_value="${var#*=}"
    echo -e "\t$var_name=$var_value"
done

CMD=( env )
for var in "${env_vars[@]}"; do
    CMD+=( "$var" )
done
CMD+=(
    vllm serve  # NOTE: assumed entrypoint; this token was missing from the pasted script
    $MODEL_DIR
    --port "$PORT"
    --gpu-memory-utilization 0.9
    -tp $TP
    --no-enable-prefix-caching
    --enable-chunked-prefill
    --max-num-batched-tokens 8192
    --distributed-executor-backend mp
    --block-size 64
    --max-num-seqs 128
    --speculative-config "{\"method\": \"qwen3_next_mtp\", \"num_speculative_tokens\": 3}"
)

echo -e "\nExecuting command:"
printf " %s" "${CMD[@]}"
echo -e "\n"
"${CMD[@]}"
```

The error log:

I suspect there may be other undiscovered bugs. I will continue to investigate this from my side, and I would be very grateful if you could also look into this, attempt to reproduce the issue, and hopefully find a solution. Thanks!
You can add some asserts to Scheduler.update_from_output like this:

With this assert, you can get:

Can you also add some unit tests to this PR?
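A sketch of what such an assert might look like (hypothetical request fields as plain dicts; the real state inside Scheduler.update_from_output differs):

```python
def assert_no_drafts_during_prefill(requests: list) -> None:
    """Invariant sketch: a request still in prefill must carry no draft tokens."""
    for req in requests:
        if req["num_computed_tokens"] < req["num_prompt_tokens"]:
            assert not req["spec_token_ids"], (
                f"request {req['req_id']} is prefilling but has draft tokens"
            )
```

Running this on every scheduler step would surface the bug immediately instead of letting a corrupted Mamba state silently produce wrong text.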
BTW, you mentioned "TODO: This logic does not yet handle resumed requests." in this piece of logic in #30877. Can you double-check it in this PR?
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
@heheda12345 After checking the logs, I found that the issue mentioned above isn't caused by the changes in this PR. It just happens that this PR triggers a scenario with exactly 3 decode requests. Since this is a standalone bug, I've opened issue #31649 to track it. Please take a look.
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
```python
# Ensure new tokens for a request in the prefill phase do not contain
# speculative tokens, especially in the last prefill chunk. For a hybrid
# model, extra speculative tokens would corrupt the generated mamba state.
# TODO: This logic does not yet handle resumed requests.
```
I guess resumed requests don't have speculative tokens, so they are not affected by this bug. WDYT?
I’m not sure about that. For a resumed request, couldn't extra spec tokens (e.g., gamma=3) still be appended during prefill? For example, if 1024 prompt tokens + 1024 generated tokens are resumed, computing 2048+3 tokens in the prefill phase instead of just 2048 would likely lead to an incorrect Mamba state.
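The arithmetic in this scenario can be spelled out as a quick sketch (illustrative numbers taken from the comment above):

```python
# A preempted request resumes with 1024 prompt tokens plus 1024
# already-generated tokens, all of which must be recomputed.
tokens_to_recompute = 1024 + 1024
gamma = 3  # draft tokens per speculative step

# Correct: the Mamba state is built from exactly the resumed tokens.
correct_state_len = tokens_to_recompute
# Wrong: draft tokens leak into the (re)prefill and inflate the state.
wrong_state_len = tokens_to_recompute + gamma
```

The three-token discrepancy (2051 vs. 2048) is exactly the kind of off-by-gamma state corruption this PR targets.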
Sorry I had missed this PR earlier (too many notifications 😅). Is it still a problem now that #31944 is merged?
Hi @njhill, thanks for the reply. I’ve tested this against the latest main branch (at commit 4f02cb2), and the issue persists. To clarify, the changes in this PR are primarily focused on fixing the correctness of the model output, and are unrelated to the engine crash. You can find more details regarding the crash in issue #31649.
This pull request has merge conflicts that must be resolved before it can be merged.
I think #31944 doesn't fix this. This PR is not about resumed requests. The problem is that we always have some draft token ids even if the request is still doing prefill. For linear attention, we don't want a request to be mixed with chunked-prefill tokens and draft tokens, because we only have two linear attention kernels:
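The two-kernel constraint described above can be pictured as a dispatch that has no valid path for a mixed batch (a sketch only; the kernel names are illustrative, not vLLM's actual kernels):

```python
def pick_linear_attention_kernel(num_prefill_tokens: int,
                                 num_draft_tokens: int) -> str:
    """Choose one of the two linear-attention paths for a request (sketch)."""
    if num_prefill_tokens > 0 and num_draft_tokens > 0:
        # Neither kernel can process prompt chunks and draft tokens together,
        # which is why the scheduler must never produce such a mix.
        raise ValueError("no kernel handles chunked prefill mixed with drafts")
    return "chunked_prefill_kernel" if num_prefill_tokens > 0 else "decode_kernel"
```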
Thanks @peakcrosser7 for investigating this. I'm not sure that this is the correct place for the fix. Based on my fairly quick assessment, I think these are the considerations:

There are various places we can address this:

For the (1) and (2) options, we need to make sure that we filter the list of the req ids returned in

For async scheduling we need to adjust the
@njhill Thanks for your analysis and the solutions you provided. To be honest, I am not very familiar with the details of speculative decoding and the async scheduler, so I am not sure which of the three methods is better. In my opinion, this problem is mainly caused by the different calculation methods of GDN's prefill and decode stages. Would modifying the calculation logic of GDN itself be a better approach?
By the way, I'd like to ask: the modifications you suggested are all on the ModelRunner side, which means handling the case where prefill chunks contain draft tokens. My modification idea is to avoid scheduling draft tokens during the prefill phase. Are there any issues with modifying it this way?
I can't reproduce this issue using the latest main: running the code from the OP on 4x H100 produces the correct answer (I ran it 3 times):
Hi @tdoublep, thanks for your reply. I’ve tested the latest main branch (bb91720), and the output is indeed correct now. However, regarding the issue described in this PR, not all inputs consistently trigger incorrect outputs. The core problem is that draft tokens are being incorrectly processed as prompt tokens. If the draft tokens happen to produce "correct-looking" results, the final output may still appear valid. A more direct way to verify the bug is to check whether the state in the speculative blocks is 0 after processing the final prefill chunk (when the total number of prefill chunks > 1). If it is 0, it confirms that draft tokens were wrongly treated as prompt tokens. I will also try to find more test cases to further demonstrate this issue.
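The verification described above could be sketched like this (illustrative only: `spec_state_slots` stands in for the slice of the Mamba state held in the speculative blocks):

```python
def spec_blocks_are_zero(spec_state_slots: list) -> bool:
    """Per the comment above: if these slots are all zero after the final
    prefill chunk, draft tokens were wrongly treated as prompt tokens."""
    return all(v == 0.0 for v in spec_state_slots)
```

Such a check is a sharper signal than eyeballing generated text, since an incorrect state can still happen to decode into a plausible answer.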
It appears the current main branch no longer has this issue. During prefill,
Hi @njhill, I’d like to correct my previous conclusion. I’ve just tested with a more recent main branch (commit: 203d0bc) and confirmed that both regular and resumed requests no longer include draft tokens during the prefill phase. It appears PR #31944 has successfully resolved this.
Thanks @peakcrosser7, so it sounds like we can close this one?
Yes. |
Closed as the issue no longer exists. |
Purpose
For the Hybrid model, the tokens scheduled during the prefill phase must not include draft tokens. If draft tokens are included, Mamba will incorrectly save a state with a length of prompt_len + draft_tokens instead of the correct length of prompt_len, leading to the wrong output.

Test Plan
test script:
Test Result
Incorrect output before fix:
Correct output after fix:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.