Eagle3 mm support, enablement on qwen3vl #4848
Conversation
Code Review
This pull request adds support for Eagle3 speculative decoding for multi-modal models, specifically targeting Qwen3-VL. The changes primarily focus on updating the EagleProposer to handle multi-modal inputs, including M-RoPE and image embeddings, and adding a new end-to-end test to verify correctness. While the changes are in the right direction, I've identified a couple of critical issues. The new test is using an incorrect base model for generating reference outputs, which invalidates the test's purpose. Additionally, there's a critical bug in the EagleProposer where the logic for sharing vocabulary embeddings between the target and draft models is flawed, which would lead to incorrect behavior. I've also pointed out a leftover debug print statement that should be removed.
```python
    '''
    Compare the outputs of a original LLM and a speculative LLM
    should be the same when using eagle speculative decoding.
    '''
    ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
```
The reference LLM is being initialized with `model_name`, which is hardcoded to `LLM-Research/Meta-Llama-3.1-8B-Instruct`. However, this test is intended for vision-language models and should use `vl_model_name` (`Qwen/Qwen3-VL-8B-Instruct`), which is correctly passed as a fixture. This means the test is not comparing against the correct reference model, making the test results invalid.
Suggested change:

```diff
-    ref_llm = LLM(model=model_name, max_model_len=2048, enforce_eager=False)
+    ref_llm = LLM(model=vl_model_name, max_model_len=2048, enforce_eager=False)
```
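Whichever model name is passed, the comparison itself relies on greedy decoding being unchanged by speculation: the target model verifies every drafted token, so reference and speculative outputs should be token-identical. A minimal sketch of such a check (the helper name and output shape are assumptions for illustration, not the actual test code):

```python
def assert_outputs_match(ref_outputs, spec_outputs):
    """With greedy sampling, Eagle3 only *proposes* tokens; the target
    model still verifies each one, so generations must match exactly."""
    assert len(ref_outputs) == len(spec_outputs), "prompt count differs"
    for i, (ref, spec) in enumerate(zip(ref_outputs, spec_outputs)):
        assert ref == spec, f"prompt {i}: {ref!r} != {spec!r}"

# Identical generations pass; any divergence raises AssertionError.
assert_outputs_match(["a cat on a mat"], ["a cat on a mat"])
```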
```python
        if hasattr(self.model.model, "embed_tokens"):
            self.model.model.embed_tokens = self.model.model.embed_tokens
        elif hasattr(self.model.model, "embedding"):
            self.model.model.embed_tokens = self.model.model.embedding
        # self.model.model.embed_tokens = model.model.embed_tokens
```
The logic for sharing vocabulary embeddings between the target model (`model`) and the draft model (`self.model`) is incorrect. The line `self.model.model.embed_tokens = self.model.model.embed_tokens` is a no-op, and the `hasattr` checks are performed on the draft model instead of the target model. This will result in the draft model not using the target model's embeddings. The logic should check for `embed_tokens` or `embedding` on the target model and assign it to the draft model.
Suggested change:

```diff
-        if hasattr(self.model.model, "embed_tokens"):
-            self.model.model.embed_tokens = self.model.model.embed_tokens
-        elif hasattr(self.model.model, "embedding"):
-            self.model.model.embed_tokens = self.model.model.embedding
-        # self.model.model.embed_tokens = model.model.embed_tokens
+        if hasattr(model.model, "embed_tokens"):
+            self.model.model.embed_tokens = model.model.embed_tokens
+        elif hasattr(model.model, "embedding"):
+            self.model.model.embed_tokens = model.model.embedding
```
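To see why the `hasattr` checks must inspect the target model, here is a minimal self-contained sketch of the intended sharing. The `Wrapper`/`Inner` classes are dummies standing in for vLLM's real model objects, not the actual API:

```python
class Inner:
    """Stand-in for a model's inner module, which may expose its
    vocabulary embedding as either `embed_tokens` or `embedding`."""
    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

class Wrapper:
    """Stand-in for the outer model object (`model` / `self.model`)."""
    def __init__(self, inner):
        self.model = inner

def share_embeddings(draft, target):
    # Correct direction: probe the *target* model's attributes and
    # assign the found embedding table onto the *draft* model.
    if hasattr(target.model, "embed_tokens"):
        draft.model.embed_tokens = target.model.embed_tokens
    elif hasattr(target.model, "embedding"):
        draft.model.embed_tokens = target.model.embedding

target = Wrapper(Inner(embed_tokens=object()))  # target exposes embed_tokens
draft = Wrapper(Inner(embed_tokens=object()))   # draft starts with its own copy

share_embeddings(draft, target)
# The draft now holds a reference to the target's embedding table,
# which the original (draft-on-draft) assignment never achieved.
assert draft.model.embed_tokens is target.model.embed_tokens
```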
```python
        )
        self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens
        self.uses_mrope = self.vllm_config.model_config.uses_mrope
        print("self.uses_mrope = ", self.uses_mrope)
```

This `print` is a leftover debug statement and should be removed.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?

Follow-up to [vllm-project/vllm#20788](https://github.com/vllm-project/vllm/pull/20788): Eagle3 mm support, enablement on qwen3vl.

- target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
- eagle3 draft: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv
```

vLLM with eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }'
```

vLLM without eagle3:

```bash
vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images
```

bench:

```bash
vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123
```

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: jesse <szxfml@gmail.com>
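Once either server is running, the multi-modal path can be exercised through the OpenAI-compatible chat endpoint. The payload below is a hedged sketch: the model path, port, and media path are taken from the serve commands, but the image filename is invented for illustration:

```python
import json

# Hypothetical chat-completions request body; the image filename under
# the --allowed-local-media-path directory is an assumption.
payload = {
    "model": "/model/Qwen3-VL-8B-Instruct",
    "temperature": 0,  # greedy, so eagle3 and baseline outputs are comparable
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "file:///model/gx/images/example.jpg"}},
        ],
    }],
}
body = json.dumps(payload)
```

Send `body` with any HTTP client as a POST to `http://127.0.0.1:9100/v1/chat/completions` with a `Content-Type: application/json` header.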
eagle3 result:

without eagle3 result:
