
Eagle3 mm support, enablement on qwen3vl #4848

Merged
wangxiyuan merged 28 commits into vllm-project:main from jesse996:eagle-qwen3vl
Jan 19, 2026
Conversation


@jesse996 jesse996 commented Dec 9, 2025

What this PR does / why we need it?

Following vllm-project/vllm#20788, this PR adds Eagle3 multi-modal support and enables it on Qwen3-VL.
Target model: Qwen/Qwen3-VL-8B-Instruct
Eagle3 draft model: MNN/Qwen3-VL-8B-Instruct-Eagle3

Does this PR introduce any user-facing change?

No

How was this patch tested?

```bash
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```

vLLM with Eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{
    "method": "eagle3",
    "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
    "num_speculative_tokens": 3
  }'
```
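The value passed to `--speculative-config` above is plain JSON. A minimal Python sketch that builds and serializes the same settings (purely illustrative; this constructs the CLI argument string, it is not a vLLM API call):

```python
import json

# The speculative-decoding settings from the serve command above.
spec_config = {
    "method": "eagle3",
    "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
    "num_speculative_tokens": 3,
}

# Serialize to the JSON string that the CLI flag expects.
spec_config_json = json.dumps(spec_config)
print(spec_config_json)
```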

vLLM without Eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```

Bench:

```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```

Eagle3 result: (screenshot IMG_0555)

Without Eagle3 result: (screenshot IMG_0556)


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Eagle3 speculative decoding for multi-modal models, specifically targeting Qwen3-VL. The changes primarily focus on updating the EagleProposer to handle multi-modal inputs, including M-RoPE and image embeddings, and adding a new end-to-end test to verify correctness. While the changes are in the right direction, I've identified a couple of critical issues. The new test is using an incorrect base model for generating reference outputs, which invalidates the test's purpose. Additionally, there's a critical bug in the EagleProposer where the logic for sharing vocabulary embeddings between the target and draft models is flawed, which would lead to incorrect behavior. I've also pointed out a leftover debug print statement that should be removed.

```python
'''
Compare the outputs of an original LLM and a speculative LLM;
they should be the same when using Eagle speculative decoding.
'''
ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
```


Severity: critical

The reference LLM is being initialized with model_name, which is hardcoded to LLM-Research/Meta-Llama-3.1-8B-Instruct. However, this test is intended for vision-language models and should use vl_model_name (Qwen/Qwen3-VL-8B-Instruct), which is correctly passed as a fixture. This means the test is not comparing against the correct reference model, making the test results invalid.

Suggested change:

```diff
-ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
+ref_llm = LLM(model=vl_model_name, max_model_len=2048, enforce_eager=False)
```
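The intent behind the fix is an exact-match check: with greedy sampling, the speculative run must reproduce the reference run token for token. A stub-based sketch of that comparison (the `FakeLLM` class and the prompts are illustrative stand-ins, not the vLLM API):

```python
# Illustrative stand-in for an LLM with deterministic greedy decoding.
class FakeLLM:
    def __init__(self, suffix: str):
        self.suffix = suffix

    def generate(self, prompts):
        # Greedy decoding is deterministic: same model, same outputs.
        return [p + self.suffix for p in prompts]

prompts = ["Describe the image.", "What color is the cat?"]
ref_llm = FakeLLM(" ...completion")   # reference run (no speculation)
spec_llm = FakeLLM(" ...completion")  # speculative run (Eagle3)

ref_outputs = ref_llm.generate(prompts)
spec_outputs = spec_llm.generate(prompts)

# Eagle3 must not change greedy outputs, only speed up decoding.
matches = sum(r == s for r, s in zip(ref_outputs, spec_outputs))
assert matches == len(prompts)
```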

Comment on lines +134 to +138:

```python
if hasattr(self.model.model, "embed_tokens"):
    self.model.model.embed_tokens = self.model.model.embed_tokens
elif hasattr(self.model.model, "embedding"):
    self.model.model.embed_tokens = self.model.model.embedding
# self.model.model.embed_tokens = model.model.embed_tokens
```


Severity: critical

The logic for sharing vocabulary embeddings between the target model (model) and the draft model (self.model) is incorrect. The line self.model.model.embed_tokens = self.model.model.embed_tokens is a no-op, and the hasattr checks are performed on the draft model instead of the target model. This will result in the draft model not using the target model's embeddings. The logic should check for embed_tokens or embedding on the target model and assign it to the draft model.

Suggested change:

```diff
-if hasattr(self.model.model, "embed_tokens"):
-    self.model.model.embed_tokens = self.model.model.embed_tokens
-elif hasattr(self.model.model, "embedding"):
-    self.model.model.embed_tokens = self.model.model.embedding
-# self.model.model.embed_tokens = model.model.embed_tokens
+if hasattr(model.model, "embed_tokens"):
+    self.model.model.embed_tokens = model.model.embed_tokens
+elif hasattr(model.model, "embedding"):
+    self.model.model.embed_tokens = model.model.embedding
```

```python
)
self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens
self.uses_mrope = self.vllm_config.model_config.uses_mrope
print("self.uses_mrope = ", self.uses_mrope)
```


Severity: high

This print statement appears to be for debugging purposes and should be removed from the final code.
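A common alternative to deleting such diagnostics outright is a module-level debug log, which stays silent unless DEBUG logging is enabled. A minimal sketch using Python's standard `logging` module (the logger name and the `uses_mrope` placeholder are illustrative):

```python
import logging

logger = logging.getLogger("eagle_proposer")  # illustrative name

# Placeholder for self.vllm_config.model_config.uses_mrope.
uses_mrope = True

# Emitted only when DEBUG logging is configured, so it never
# pollutes normal server output the way a bare print() would.
logger.debug("uses_mrope = %s", uses_mrope)
```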

@github-actions

github-actions bot commented Dec 9, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts; please resolve them before we can evaluate it.


@jesse996 jesse996 force-pushed the eagle-qwen3vl branch 6 times, most recently from f7383ec to 614f1c9 on December 16, 2025 04:38
@jesse996 jesse996 force-pushed the eagle-qwen3vl branch 2 times, most recently from 5b47bcc to 4e1776a on December 16, 2025 06:34
@wangxiyuan wangxiyuan added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels on Jan 13, 2026
@jesse996 jesse996 force-pushed the eagle-qwen3vl branch 2 times, most recently from 8cdc644 to a760e14 on January 14, 2026 09:01
Signed-off-by: jesse <szxfml@gmail.com>

update test

Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: jesse <szxfml@gmail.com>

gh add new model

Signed-off-by: jesse <szxfml@gmail.com>
@wangxiyuan wangxiyuan merged commit 2b6dc10 into vllm-project:main Jan 19, 2026
21 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 19, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split  (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 21, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: jesse <szxfml@gmail.com>

Labels

model-download, module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants