
[BugFix][Hybrid] Fix prefill chunk incorrectly including draft tokens#30618

Closed
peakcrosser7 wants to merge 7 commits into vllm-project:main from peakcrosser7:fix/prefill_exclude_draft

Conversation

@peakcrosser7
Contributor

@peakcrosser7 peakcrosser7 commented Dec 13, 2025

Purpose

For the Hybrid model, the tokens scheduled during the prefill phase must not include draft tokens. If draft tokens are included, Mamba will incorrectly save a state of length prompt_len + draft_tokens instead of the correct length prompt_len, leading to wrong output.
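To make the constraint concrete, here is a minimal, self-contained sketch of the intended scheduling behavior (hypothetical helper names, not vLLM's actual scheduler code): while a request is still prefilling, the scheduled chunk is capped at the remaining prompt tokens, so draft tokens can only be appended once decoding starts.

```python
def schedule_new_tokens(num_computed: int, num_prompt: int,
                        chunk_budget: int, num_draft: int) -> int:
    """Return how many tokens to schedule for a request this step.

    Illustrative only: while the request is still prefilling
    (num_computed < num_prompt), never schedule past the end of the
    prompt, so draft tokens cannot leak into a prefill chunk.
    """
    if num_computed < num_prompt:
        # Prefill phase: cap at the remaining prompt tokens only.
        return min(chunk_budget, num_prompt - num_computed)
    # Decode phase: one new token plus the speculative draft tokens.
    return min(chunk_budget, 1 + num_draft)

# Example: 8192-token budget, 10000-token prompt, 3 draft tokens.
assert schedule_new_tokens(0, 10000, 8192, 3) == 8192      # first chunk
assert schedule_new_tokens(8192, 10000, 8192, 3) == 1808   # last chunk: no drafts
assert schedule_new_tokens(10000, 10000, 8192, 3) == 4     # decode: 1 + 3 drafts
```

Without the clamp in the prefill branch, the last chunk could carry 1808 + 3 tokens, which is exactly the prompt_len + draft_tokens state-length corruption described above.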

Test Plan

test script:

from vllm import LLM, SamplingParams

def main():
    MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"
    sampling_params = SamplingParams(temperature=0.0, max_tokens=1024, ignore_eos=False)
    prompt = (
        "adfllekkThere is an important info hidden inside a lot of irrelevant text. "
        "Find it and memorize them. I will quiz you about the important information there.\n\n"
        "The pass key is 2222. Remember it. 2222 is the pass key.\n "
        + "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 545
        + "\nWhat is the pass key?"
    )

    engine = LLM(
        model=MODEL,
        enable_prefix_caching=False,
        enable_chunked_prefill=True,
        enforce_eager=True,
        tensor_parallel_size=4,
        max_num_batched_tokens=8192,
        speculative_config={"method": "qwen3_next_mtp", "num_speculative_tokens": 3},
    )
    outputs = engine.generate(prompt, sampling_params)
    print(f"Generated text: {outputs[0].outputs[0].text!r}")


if __name__ == "__main__":
    main()

Test Result

Incorrect output before fix:

Generated text: ' The passkey is 2222. The passkey is 2222. The passkey is 2222. [..."The passkey is 2222." repeated until max_tokens...] The passkey is'

Correct output after fix:

Generated text: ' The pass key is **2222**.'


Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>

@mergify mergify bot added the v1 label Dec 13, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a critical fix for hybrid models using speculative decoding. By ensuring that draft tokens are not scheduled during the prefill phase, it prevents corruption of the Mamba state and corrects the model's output. The change is well-targeted and effectively resolves the described issue. I've included one suggestion to refactor the new logic for improved clarity and maintainability in this critical scheduler component.

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>

mergify bot commented Dec 13, 2025

Hi @peakcrosser7, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
@ApostaC
Collaborator

ApostaC commented Dec 16, 2025

Hey @njhill , could you please help take a look at this short PR? Thanks!

# speculative tokens, especially in the last prefill chunk. For a hybrid
# model, extra speculative tokens would corrupt the generated mamba state.
# TODO: This logic does not yet handle resumed requests.
if request.num_computed_tokens < request.num_prompt_tokens:
Collaborator
@njhill should we limit this fix to mamba-only?

@peakcrosser7
Contributor Author

peakcrosser7 commented Dec 18, 2025

Hi @heheda12345 @njhill , I've noticed what seems to be another correctness issue when enabling spec decode (MTP) on the Qwen3-Next model, even with this PR.
I ran the offline tests using the script below, before and after applying this PR. As you can see, after applying the fix, the output is now correct, but there's still some redundant output.

from vllm import LLM, SamplingParams

def main():
    MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"
    sampling_params = SamplingParams(
        temperature=0.7, 
        top_p=0.8, 
        top_k=20, 
        repetition_penalty=1,
        presence_penalty=1,
        max_tokens=1024, 
        ignore_eos=False,
    )
    prompt0 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 28884. Remember it. 28884 is the pass key.\n " + \
                "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 + \
                    "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 + \
                "\nWhat is the pass key?"]
    prompt1 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 11111. Remember it. 11111 is the pass key.\n " + \
        "The grass is yellow. The sky is blue. The sun is red. Here we go. There and back again. " * 25 + \
        "\nWhat is the pass key?"]
    prompt2 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 2222. Remember it. 2222 is the pass key.\n " + \
                    "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 545 + \
                    "\nWhat is the pass key?"]
    prompt3 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 333333. Remember it. 333333 is the pass key.\n " + \
        "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 + \
            "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 + \
        "\nWhat is the pass key?"] 
    prompt4 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\nThe pass key is 444. Remember it. 444 is the pass key.\n " + \
        "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 + \
            "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 + \
        "\nWhat is the pass key?"] 
    prompt5 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\n" + \
            "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 + \
            "The pass key is 55555. Remember it. 55555 is the pass key.\n " \
            "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 + \
            "\nWhat is the pass key?"] 
    prompt6 = ["There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.\n\n" + \
            "The grass is yellow. The sky is blue. The sun is yellow. Here we go. There and back again. " * 190 + \
            "The pass key is 66. Remember it. 66 is the pass key.\n " \
            "The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. " * 185 + \
            "\nWhat is the pass key?"] 
    prompt7 = ['Hello!']

    prompts = prompt0 + prompt1 + prompt2 + prompt3 + prompt4 + prompt5 + prompt6 + prompt7

    engine = LLM(
        model=MODEL, 
        enable_prefix_caching=False,
        enable_chunked_prefill=True,
        enforce_eager=False, 
        tensor_parallel_size=4,
        max_num_batched_tokens=8192,
        speculative_config={"method": "qwen3_next_mtp", "num_speculative_tokens": 3},
    )
    outputs = engine.generate(prompts, sampling_params)
    for i, output in enumerate(outputs):
        print(f"Generated text {i}: {output.outputs[0].text!r}")
        print('-' * 30)

if __name__ == "__main__":
    main()

The output without this PR:

Generated text 0: ' The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go.  [...omitted...]
------------------------------
Generated text 1: " The pass key is 11111.\n\nThe pass key is **11111**. I've memorized it."
------------------------------
Generated text 2: ' The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. [...omitted...]
------------------------------
Generated text 3: ' The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddddd. Here we go. [...omitted...]
------------------------------
Generated text 4: ' The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddd. Here we go. There and back try a. The block is red. The sky is yello. The sun is ddd. Here go. There back try a The block is red. The sky yello. sun is ddd. Here go. [...omitted...]
------------------------------
Generated text 5: ' The sun is ddddd. Here we go. There and back try a. The pass key is 55555. Remember it. 55555 is the pass key.\n The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The pass key is 55555. Remember it. 55555 is the pass key.\n The block is red. The sky is yello. The sun is ddddd. Here we go. There and back try a. The pass key is 55555. Remember it. 55555 is the pass key.\n The block is red. The sky is yello. The sun is ddddd. Here we go. [...omitted...]
------------------------------
Generated text 6: ' The block is red. The grass is yellow. The sun is yellow. Here we go. There and back again. The grass is yellow. The sun is yellow. Here we go. There and back again. The grass is yellow. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is yellow. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is yellow. The sky is blue. The sun is yellow. Here we go. [...omitted...]
------------------------------
Generated text 7: ' I need help with a math help. I\'m doing this assignment for my class, and I have been stuck on this for hours. Please help me! I need help on the question is "Find the value of x=2.5" and the expression is 3x^x+1/3x^2x-1 equals what? I don\'t understand it.\nHi, I amit!\nI think you mean to evaluate the expression: 3^x / 3^\nSo, let\'s first write it as:\n3^2.5\nNow, we know that 3^ = 3^(a+b) so 3^2. So now we have:\n3^2.5/32\nAnd then we can use rule again: ^/^(a-b) so we get: 3^(2.5 - ) = 3^\nNow we know 2.5-2.5 = -1/2\nSo our answer is: 3^-1 = 1/3^-1\nWe know that 3^- = 1/3/3 = 1/3\nSo final answer is 3\nHope this helps!\nLet me know if need more help, please ask.\nGood luck!\n\nAmit\n\nP.S. Sorry, I didn\'t see your message before posting. You\'re right. 3^2. The correct answer is 1/3\n\nThank you! I really appreciate it 1/3\n\nYou welcome! :)'

The output with this PR:

Generated text 0: ' The pass key is **28884**.'
------------------------------
Generated text 1: ' The pass key is 11111. Remember it. 11111 is the pass key.\n\nThe pass key is **111111.** — I see this. \nI key is 111111111111111111111111111111111111111111.\n111111\n11\n111\n1\n1\n1\n11\n1\n1\n1\n1\n1\n1\n\n1\n1\n1\n\n1\n\n1\n\n1\n\n1\n\n1\n\n1\n\n1\n\n1\n\n11\n\n11\n\n11\n\n11\n\n1\n\n111\n\n11\n\n11\n\n11\n\n1\n\n111\n\n11111111\n\n111\n\n11\n\n1\n\n111\n\n11\n\n11\n\n1\n\n11111111\n\n111111\n\n11\n\n111\n\n11\n\n11\n\n11\n\n11\n\n111\n\n111\n\n [...omitted...]
------------------------------
Generated text 2: ' The pass key is **2222**.'
------------------------------
Generated text 3: ' The pass key is **333333**.'
------------------------------
Generated text 4: ' The pass key is **444**.'
------------------------------
Generated text 5: ' The pass key is **55555**.'
------------------------------
Generated text 6: ' The pass key is **66**. \n\nActually, let me double-check. The pass key is 6**, or 66?\n\nYou said: pass key is 66. Remember it. 6666666666." No. Let me re-read.\n\nI think I made mistake.\n\nLet me read the text again.\n\nThe pass key is 66666.\n\nNo, "The pass key is 66.66 is the66."\n\nWait. "The pass key is 66" is a mistake. The pass key is 66. It\'s 66.\n\nBut I see that "The pass key is 66. Remember 66. is the key 66." 66 is correct.\n\nSo the pass key is 66.\n\nYes, the pass key is 66. So I think key is 6666.\n\n66.\n\n66.\n\n66.\n\n6.\n\n66.\n\n6.\n\n6.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n6.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n66.\n\n6.\n\n666.\n\n66.\n\n66.\n\n66.\n\n6.\n\n666.\n\n66.\n\n66.\n\n666.\n\n66. [...omitted...]
------------------------------
Generated text 7: ' I\'m interested in a product idea is to develop an AI-powered application that can automatically create, detailed, and professional-grade reports for the user needs. The app will be able to help users generate professional-grade reports based on user inputs or data uploads. The app will use advanced models to analyze data, extract insights, and generate well-structured, coherent, and professionally written reports with appropriate formatting, citations, references, and visualizations, and data visualization. It will also allow users to customize the ability to customize the tone, style, length, structure, and depth, analysis level of detail, etc. The app will have a dashboard to upload files, view generated reports, and manage and history. and app will also provide a chat interface for users to interact with to AI questions explain details, or make adjustments to the report. the app will also support multiple languages and allow users to export in various formats (PDF, such as, DOCX, HTML, , CSV, etc. \n\nI\'d like you to help me help me design a a comprehensive, detailed product specification document for this project. Please include: \n1. Executive Summary  \n2 Product Vision & Mission  \n3 Target Audience  \n4 Key Features & Functional Requirements  \n5 Technical Architecture  \n6.  \n7. User Experience (UX) Design  \n8. Development Roadmap  \n9. Map  \n10. 11 etization Strategy  \n12. Go-to-Market 13 Market Analysis  \n14. Risks & Mitigation Strategies  \n15. 16 Analysis  \n17. Legal Compliance Considerations  \n18. Sustainability & Ethical Considerations  \n19. Appendix\n\nWe are creating a comprehensive product called "Insight AI"ReportInsight AI" —\nAI-Powered Professional Report Generation Platform"\n\nLet\'s begin.\n\nCertainly! 
Below is a comprehensive, detailed product specification document** for “ReportInsight AI” platform — formally branded as you requested:**\n\n---\n\n# **Product Specification Document: **Document Insight AI**  \n* — AI-Powered Professional ReportGeneration Platform*\n\n## **1. Executive Summary\n\n### 1. Executive Summary**\n\nReportInsight AI is an next-generation, AI-driven professional-grade generation designed to empower professionals across industries — from corporate, consultants, researchers, government agencies, academics, and more to effortlessly, edit-quality, professional-grade reports in minutes — reports hours. Leveraging state-of-the-art large language models (LLMs, multimodal data analysis, natural language processing (NLP), and NLP) engines, ReportInsight AI ingests structured and unstructured data (CSV, Excel, JSON, databases, text, images, PDFs) and auto-generates fully formatted, citation-backed reports with dynamic visualizations, customizable tone, depth, length, and structure. Users can refine outputs via intuitive dashboard, and export results, real-time feedback. The platform supports multi-languages, and exports to PDF, DOCX, DOCX, HTML, CSV, and more. This platform not just the tediousness of writing but elevates report creation into an intelligent, collaborative experience — where AI becomes your expert co-author.\n\nThe product targets a $2025 market of over 10000 knowledge workers who produce formal settings, with an estimated TAM estimated $18B+ globally. With a SaaS subscription model (tiered plans: Free, Pro, Pro) with enterprise licensing and API access. We aim to launch MVP within 6 months, target initial traction among academic and consulting users, and scale to enterprise clients within 2 years.\n\n---\n\n##2. 
Product Vision & Mission\n\n### **Vision**  \nTo become the global standard for intelligent, automated, professional report generation — where every insight is transformed into authoritative, publication-ready narrative — seamlessly, ethically, and inclusively.\n\n### **Mission**  \nTo democratize access to high-quality, professional-grade reporting by leveraging cutting-edge AI to eliminate manual labor, reduce human error, enhance analytical depth, and empower users — regardless of writing skill — to produce credible, polished, and impactful reports faster than ever before.\n\n### **Target Audience**\n\n| Segment | Description | Use Cases |\n|--------|-------------|-----------|\n| **Corporate Professionals** | Managers, analysts, strategists in finance, marketing, HR, operations | Quarterly performance reports, market analysis, KPI dashboards, investor decks |\n| **Academics & Researchers** | University faculty, PhD candidates, lab teams | Literature reviews, thesis chapters, grant proposals, journal submissions |\n| **Consultants & Agencies** | McKinsey-style firms, boutique consultancies, market research firms | Client deliverables, industry whitepapers, competitive intelligence reports |\n| **Government & NGOs** | Public sector analysts, policy advisors, aid organizations | Policy briefs, impact assessments, compliance documentation |\n| **Students & Educators** | Graduate students, professors, thesis supervisors | Research papers, annotated bibliographies, teaching materials |\n\n> *Primary Persona*: “Alex, a 32-year-old management consultant at a mid-sized'�

@peakcrosser7
Contributor Author

Also, when I run an online test with the script below (using the same 8 prompts as before), the output is still incorrect before applying the PR fix. However, after applying the fix, I'm now getting the following error, which causes the engine to terminate.

#!/bin/bash

PORT=8235
TP=2
MAX_MODEL_LEN=262144


MODEL_DIR=Qwen/Qwen3-Next-80B-A3B-Instruct/
echo "MODEL_DIR: $MODEL_DIR"

env_vars=(
    "CUDA_VISIBLE_DEVICES=0,1,2,3"
)

for var in "${env_vars[@]}"; do
    var_name="${var%%=*}"
    var_value="${var#*=}"
    echo -e "\t$var_name=$var_value"
done

CMD=( env )
for var in "${env_vars[@]}"; do
    CMD+=( "$var" )
done
CMD+=(
    vllm serve
    $MODEL_DIR
    --port "$PORT"
    --gpu-memory-utilization 0.9
    -tp $TP
    --no-enable-prefix-caching
    --enable-chunked-prefill
    --max-num-batched-tokens 8192
    --distributed-executor-backend mp
    --block-size 64
    --max-num-seqs 128
    --speculative-config "{\"method\": \"qwen3_next_mtp\", \"num_speculative_tokens\": 3}"
)

echo -e "\nExecuting command:"
printf " %s" "${CMD[@]}"
echo -e "\n"

"${CMD[@]}"

The error log:

(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538] AsyncLLM output_handler failed.
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538] Traceback (most recent call last):
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]   File "/root/huanghy/vllm_opsrc/vllm/v1/engine/async_llm.py", line 531, in output_handler
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]     logger_manager.record(
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]   File "/root/huanghy/vllm_opsrc/vllm/v1/metrics/loggers.py", line 1303, in record
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]     logger.record(
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]   File "/root/huanghy/vllm_opsrc/vllm/v1/metrics/loggers.py", line 1059, in record
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]     self.spec_decoding_prom.observe(
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]   File "/root/huanghy/vllm_opsrc/vllm/v1/spec_decode/metrics.py", line 208, in observe
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]     self.counter_spec_decode_num_accepted_tokens[engine_idx].inc(
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]   File "/opt/conda/lib/python3.11/site-packages/prometheus_client/metrics.py", line 290, in inc
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538]     raise ValueError('Counters can only be incremented by non-negative amounts.')
(APIServer pid=1870013) ERROR 12-18 15:38:23 [async_llm.py:538] ValueError: Counters can only be incremented by non-negative amounts.

I suspect there may be other undiscovered bugs. I will continue to investigate this from my side, and I would be very grateful if you could also look into this, attempt to reproduce the issue, and hopefully find a solution.

Thanks!

@heheda12345
Collaborator

You can add some assert to Scheduler.update_from_output like this:

            if scheduled_spec_token_ids:
                num_draft_tokens = len(scheduled_spec_token_ids)
                num_accepted = len(generated_token_ids) - 1
                num_rejected = num_draft_tokens - num_accepted
                assert num_accepted > 0, (
                    f"num_accepted: {num_accepted}, num_draft_tokens: {num_draft_tokens}, generated_token_ids: {generated_token_ids}"
                )

With this assert, you can get AssertionError: num_accepted: 0, num_draft_tokens: 3, generated_token_ids: [3003] when running request 7. I think it is because request 7 only has 2 tokens and is shorter than num_draft_tokens.
And can you also double check the case that the request is long enough but the scheduled number of tokens is less than num_draft_tokens.

Can you also add some unit test to this PR?

@heheda12345
Collaborator

BTW, you mentioned "TODO: This logic does not yet handle resumed requests." in this piece of logic in #30877. Can you double check it in this PR?

@peakcrosser7
Contributor Author

You can add some assert to Scheduler.update_from_output like this:

            if scheduled_spec_token_ids:
                num_draft_tokens = len(scheduled_spec_token_ids)
                num_accepted = len(generated_token_ids) - 1
                num_rejected = num_draft_tokens - num_accepted
                assert num_accepted > 0, (
                    f"num_accepted: {num_accepted}, num_draft_tokens: {num_draft_tokens}, generated_token_ids: {generated_token_ids}"
                )

With this assert, you can get AssertionError: num_accepted: 0, num_draft_tokens: 3, generated_token_ids: [3003] when running request 7. I think it is because request 7 only has 2 tokens and is shorter than num_draft_tokens. And can you also double check the case where the request is long enough but the scheduled number of tokens is less than num_draft_tokens.

Can you also add some unit test to this PR?

@heheda12345 After checking the logs, I found that the issue mentioned above isn't caused by the changes in this PR. It just happens that this PR triggers a scenario with exactly 3 decode requests. Since this is a standalone bug, I've opened issue #31649 to track it. Please take a look.

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 4, 2026
# Ensure new tokens for a request in the prefill phase do not contain
# speculative tokens, especially in the last prefill chunk. For a hybrid
# model, extra speculative tokens would corrupt the generated mamba state.
# TODO: This logic does not yet handle resumed requests.
Collaborator
I guess resumed requests don't have speculative tokens, so they are not affected by this bug. WDYT?

Contributor Author

I’m not sure about that. For a resumed request, couldn't extra spec tokens (e.g., gamma=3) still be appended during prefill? For example, if 1024 prompt tokens + 1024 generated tokens are resumed, calculating 2048+3 tokens in prefill phase instead of just 2048 tokens would likely lead to an incorrect Mamba state.

@njhill
Member

njhill commented Jan 12, 2026

Sorry I had missed this PR earlier (too many notifications 😅).

Is it still a problem now that #31944 is merged?

@peakcrosser7
Contributor Author

Hi @njhill, thanks for the reply. I’ve tested this against the latest main branch (at commit 4f02cb2), and the issue persists. To clarify, the changes in this PR are primarily focused on fixing the correctness of the model output, and are unrelated to the engine crash. You can find more details regarding the crash in issue #31649.

@mergify mergify bot added the bug Something isn't working label Jan 14, 2026

mergify bot commented Jan 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @peakcrosser7.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 14, 2026
@heheda12345
Collaborator

I think #31944 doesn't fix it. This PR is not about resumed requests. The problem is that we always have some draft token ids even if the request is still doing prefill.

For linear attention, we don't want a request to mix chunked-prefill tokens and draft tokens, because we only have two linear attention kernels:

  1. a kernel for prefill that can handle multiple prefilled tokens but can't handle draft tokens
  2. a kernel for decode that can't handle multiple prefill tokens but can handle draft tokens
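The two-kernel constraint above can be sketched as a tiny dispatch function (illustrative only; the real kernels live in the linear-attention backends and this is not vLLM's actual code):

```python
def pick_linear_attn_kernel(num_prefill_tokens: int, num_draft_tokens: int) -> str:
    """Illustrative dispatch for the two-kernel situation described above.

    A batch entry must go to exactly one kernel, so a prefill chunk that
    also carries draft tokens is unserviceable.
    """
    if num_prefill_tokens > 0 and num_draft_tokens > 0:
        # Neither kernel can handle this mix, which is why the scheduler
        # (or model runner) must keep drafts out of prefill chunks.
        raise ValueError("prefill tokens and draft tokens cannot be mixed")
    if num_prefill_tokens > 0:
        return "prefill_kernel"  # handles many prompt tokens, no drafts
    return "decode_kernel"       # handles 1 new token plus draft tokens

assert pick_linear_attn_kernel(8192, 0) == "prefill_kernel"
assert pick_linear_attn_kernel(0, 3) == "decode_kernel"
```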

@njhill
Member

njhill commented Jan 16, 2026

Thanks @peakcrosser7 for investigating this. I'm not sure that this is the correct place for the fix. Based on my fairly quick assessment I think these are the considerations:

There are various places we can address this:

  1. Exclude prefill chunks when drafting (in GpuModelRunner)
  2. Filter when processing/propagating the draft tokens on model runner side
  3. Filter on scheduler side when updating request.spec_token_ids (set to empty list instead of the draft tokens in this case)

For options (1) and (2) we need to make sure that we filter the list of req ids returned in DraftTokenIds. I guess (1) is preferable out of the three since it avoids redundant work, but may be a slightly bigger change.

For async scheduling we need to adjust the async_scheduler.py logic to ensure that request.spec_token_ids is set to an empty list in this case instead of [-1] * self.num_spec_tokens here.
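Option (3), filtering on the scheduler side, can be sketched roughly as follows (hypothetical Request stand-in; the field names mirror the ones discussed in this thread but this is not vLLM's actual code):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """Minimal stand-in for the scheduler's request object."""
    num_computed_tokens: int
    num_prompt_tokens: int
    spec_token_ids: list = field(default_factory=list)

def update_spec_token_ids(request: Request, draft_token_ids: list) -> None:
    """Scheduler-side filter: drop drafts while the request is still
    prefilling, so the next scheduling step cannot append them to a
    prefill chunk."""
    if request.num_computed_tokens < request.num_prompt_tokens:
        request.spec_token_ids = []
    else:
        request.spec_token_ids = list(draft_token_ids)

# Mid-prefill request: drafts are discarded.
req = Request(num_computed_tokens=8192, num_prompt_tokens=10000)
update_spec_token_ids(req, [101, 102, 103])
assert req.spec_token_ids == []

# Decoding request: drafts are kept.
req = Request(num_computed_tokens=10000, num_prompt_tokens=10000)
update_spec_token_ids(req, [101, 102, 103])
assert req.spec_token_ids == [101, 102, 103]
```

An async-scheduling variant would apply the same condition where spec_token_ids is pre-filled with placeholders, as noted above.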

@peakcrosser7
Contributor Author

@njhill Thanks for your analysis and the solutions you provided. To be honest, I am not very familiar with the details of speculative decoding and the async scheduler, so I am not sure which of the three methods is better. In my opinion, this problem is mainly caused by the different calculation methods of GDN's prefill and decode stages. Would modifying the calculation logic of GDN itself be a better approach?

@peakcrosser7
Contributor Author

By the way, I'd like to ask: the modifications you suggested are all on the ModelRunner side, which means handling the case where prefill chunks contain draft tokens. My modification idea is to avoid scheduling draft tokens during the prefill phase. Are there any issues with modifying it this way?

@tdoublep
Member

I can't reproduce this issue using latest main:

commit 148117ea2e689cd43df4be6892671a17cdae5833 (HEAD)
Author: vllmellm <vllm.ellm@embeddedllm.com>
Date:   Tue Jan 20 14:48:20 2026 +0800

    [Refactor] Make FP8 Linear Ops use kernel abstraction (#27814)
    
    Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

Running the code from the OP on 4x H100 produces the correct answer (I ran it 3 times):

Generated text: ' The pass key is **2222**.'

@peakcrosser7
Contributor Author

Hi, @tdoublep, thanks for your reply. I’ve tested the latest main branch (bb91720), and the output is indeed correct now.

However, regarding the issue described in this PR, not all inputs consistently trigger incorrect outputs. The core problem is that draft tokens are being incorrectly processed as prompt tokens. If the draft tokens happen to produce "correct-looking" results, the final output may still appear valid.

A more direct way to verify the bug is to check if the state in the speculative blocks is 0 after processing the final prefill chunk (when total number of prefill chunks > 1). If it is 0, it confirms that draft tokens were wrongly treated as prompt tokens. I will also try to find more test cases to further demonstrate this issue.

@peakcrosser7
Contributor Author

It appears the current main branch no longer has this issue. During prefill, request.spec_token_ids is now an empty list, preventing draft tokens from being incorrectly processed. While this might have been fixed implicitly by recent updates, I still think it's necessary to further verify the behavior for resumed requests.

@peakcrosser7
Contributor Author

Hi @njhill , I’d like to correct my previous conclusion. I’ve just tested with a more recent main branch (commit: 203d0bc) and confirmed that both regular and resumed requests no longer include draft tokens during the prefill phase. It appears PR #31944 has successfully resolved this.
I realized I had only checked whether PR #31944 fixed issue #31649 at the time, and missed testing the draft tokens issue itself. Sorry for any confusion!
By the way, issue #31649 is still unresolved.
cc @heheda12345 @tdoublep

@njhill
Member

njhill commented Jan 28, 2026

Thanks @peakcrosser7, so it sounds like we can close this one?

@peakcrosser7
Contributor Author

Thanks @peakcrosser7, so it sounds like we can close this one?

Yes.

@peakcrosser7
Contributor Author

Closed as the issue no longer exists.

@peakcrosser7 peakcrosser7 deleted the fix/prefill_exclude_draft branch February 18, 2026 03:03
