Eagle3 mm support, enablement on qwen3vl #4848
Conversation
Code Review
This pull request adds support for Eagle3 speculative decoding for multi-modal models, specifically targeting Qwen3-VL. The changes primarily focus on updating the EagleProposer to handle multi-modal inputs, including M-RoPE and image embeddings, and adding a new end-to-end test to verify correctness. While the changes are in the right direction, I've identified a couple of critical issues. The new test is using an incorrect base model for generating reference outputs, which invalidates the test's purpose. Additionally, there's a critical bug in the EagleProposer where the logic for sharing vocabulary embeddings between the target and draft models is flawed, which would lead to incorrect behavior. I've also pointed out a leftover debug print statement that should be removed.
```python
    '''
    Compare the outputs of a original LLM and a speculative LLM
    should be the same when using eagle speculative decoding.
    '''
    ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
```
The reference LLM is being initialized with `model_name`, which is hardcoded to `LLM-Research/Meta-Llama-3.1-8B-Instruct`. However, this test is intended for vision-language models and should use `vl_model_name` (`Qwen/Qwen3-VL-8B-Instruct`), which is correctly passed as a fixture. This means the test is not comparing against the correct reference model, making the test results invalid.
Suggested change:

```diff
-    ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
+    ref_llm = LLM(model=vl_model_name, max_model_len=2048, enforce_eager=False)
```
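Whichever model name is passed, the comparison itself relies on greedy decoding being unchanged by speculation: the target model verifies every drafted token, so reference and speculative outputs should be token-identical. A minimal sketch of such a check (the helper name and output shape are assumptions for illustration, not the actual test code):

```python
def assert_outputs_match(ref_outputs, spec_outputs):
    """With greedy sampling, Eagle3 only *proposes* tokens; the target
    model still verifies each one, so generations must match exactly."""
    assert len(ref_outputs) == len(spec_outputs), "prompt count differs"
    for i, (ref, spec) in enumerate(zip(ref_outputs, spec_outputs)):
        assert ref == spec, f"prompt {i}: {ref!r} != {spec!r}"

# Identical generations pass; any divergence raises AssertionError.
assert_outputs_match(["a cat on a mat"], ["a cat on a mat"])
```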
```python
        if hasattr(self.model.model, "embed_tokens"):
            self.model.model.embed_tokens = self.model.model.embed_tokens
        elif hasattr(self.model.model, "embedding"):
            self.model.model.embed_tokens = self.model.model.embedding
        # self.model.model.embed_tokens = model.model.embed_tokens
```
The logic for sharing vocabulary embeddings between the target model (`model`) and the draft model (`self.model`) is incorrect. The line `self.model.model.embed_tokens = self.model.model.embed_tokens` is a no-op, and the `hasattr` checks are performed on the draft model instead of the target model. This will result in the draft model not using the target model's embeddings. The logic should check for `embed_tokens` or `embedding` on the target model and assign it to the draft model.
Suggested change:

```diff
-        if hasattr(self.model.model, "embed_tokens"):
-            self.model.model.embed_tokens = self.model.model.embed_tokens
-        elif hasattr(self.model.model, "embedding"):
-            self.model.model.embed_tokens = self.model.model.embedding
-        # self.model.model.embed_tokens = model.model.embed_tokens
+        if hasattr(model.model, "embed_tokens"):
+            self.model.model.embed_tokens = model.model.embed_tokens
+        elif hasattr(model.model, "embedding"):
+            self.model.model.embed_tokens = model.model.embedding
```
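To see why the `hasattr` checks must inspect the target model, here is a minimal self-contained sketch of the intended sharing. The `Wrapper`/`Inner` classes are dummies standing in for vLLM's real model objects, not the actual API:

```python
class Inner:
    """Stand-in for a model's inner module, which may expose its
    vocabulary embedding as either `embed_tokens` or `embedding`."""
    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

class Wrapper:
    """Stand-in for the outer model object (`model` / `self.model`)."""
    def __init__(self, inner):
        self.model = inner

def share_embeddings(draft, target):
    # Correct direction: probe the *target* model's attributes and
    # assign the found embedding table onto the *draft* model.
    if hasattr(target.model, "embed_tokens"):
        draft.model.embed_tokens = target.model.embed_tokens
    elif hasattr(target.model, "embedding"):
        draft.model.embed_tokens = target.model.embedding

target = Wrapper(Inner(embed_tokens=object()))  # target exposes embed_tokens
draft = Wrapper(Inner(embed_tokens=object()))   # draft starts with its own copy

share_embeddings(draft, target)
# The draft now holds a reference to the target's embedding table,
# which the original (draft-on-draft) assignment never achieved.
assert draft.model.embed_tokens is target.model.embed_tokens
```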
```python
        )
        self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens
        self.uses_mrope = self.vllm_config.model_config.uses_mrope
        print("self.uses_mrope = ", self.uses_mrope)
```

This `print` is a leftover debug statement and should be removed.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?

Follow-up to [vllm-project/vllm#20788](https://github.com/vllm-project/vllm/pull/20788): Eagle3 mm support, enablement on qwen3vl.

- target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
- eagle3 draft: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```

vLLM with eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }'
```

vLLM without eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```

bench:

```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: jesse <szxfml@gmail.com>
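Once either server is running, the multi-modal path can be exercised through the OpenAI-compatible chat endpoint. The payload below is a hedged sketch: the model path, port, and media path are taken from the serve commands, but the image filename is invented for illustration:

```python
import json

# Hypothetical chat-completions request body; the image filename under
# the --allowed-local-media-path directory is an assumption.
payload = {
    "model": "/model/Qwen3-VL-8B-Instruct",
    "temperature": 0,  # greedy, so eagle3 and baseline outputs are comparable
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "file:///model/gx/images/example.jpg"}},
        ],
    }],
}
body = json.dumps(payload)
```

Send `body` with any HTTP client as a POST to `http://127.0.0.1:9100/v1/chat/completions` with a `Content-Type: application/json` header.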
eagle3 result:

without eagle3 result:
