[Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE3#6139
Conversation
Code Review
This pull request aims to fix a bug causing a decline in the acceptance rate for the Qwen3-30B-A3B-EAGLE3 model by synchronizing code with an upstream version. The core changes are within vllm_ascend/spec_decode/eagle_proposer.py, where the logic for sharing token embeddings between the target and draft models has been significantly improved. The new implementation is more robust, handling multimodal models, various embedding layer names, and differentiating between EAGLE and MTP models. However, I've identified a critical issue where a refactoring accidentally removed the definition of an attribute that is still used elsewhere in the class, which would lead to a runtime error. My review includes a suggestion to fix this. Overall, the changes are a good step forward, but this critical issue must be addressed.
```diff
 assert len(draft_attn_layer_names) == 1
-self.attn_layer_name = list(draft_attn_layer_names)
-self.attn_layer_names = self.attn_layer_name
+self.attn_layer_names = list(draft_attn_layer_names)
```
It seems that while refactoring, the assignment to `self.attn_layer_name` was removed. However, this attribute is still used later in the `dummy_run` (line 345) and `_propose` (line 458) methods. This change will cause an `AttributeError` at runtime. To fix this, please restore the assignment.
```diff
 self.attn_layer_names = list(draft_attn_layer_names)
+self.attn_layer_name = self.attn_layer_names
```
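The suggested fix can be sketched as a minimal class (the class name here is hypothetical, not the real proposer): the new plural attribute is kept, and the legacy singular name is restored as an alias so that methods which still read it keep working.

```python
class EagleProposerSketch:
    """Hypothetical minimal sketch of the attribute handling in
    eagle_proposer.py; only the two attribute assignments are shown."""

    def __init__(self, draft_attn_layer_names):
        # The draft model is expected to contribute exactly one
        # attention layer name.
        assert len(draft_attn_layer_names) == 1
        self.attn_layer_names = list(draft_attn_layer_names)
        # Legacy alias: without this line, methods that still read
        # attn_layer_name (dummy_run, _propose) raise AttributeError.
        self.attn_layer_name = self.attn_layer_names
```

With the alias in place, both spellings resolve to the same list, so older call sites need no changes.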
```python
        "Keeping separate embedding weights from the target model."
    )
else:
    # MTP model
```
Please follow the original logic.
```python
elif (isinstance(target_embed_tokens.weight, torch.Tensor)
      and isinstance(self.model.model.embed_tokens.weight,
                     torch.Tensor)
      # TODO: Offload to CPU for comparison to avoid extra GPU memory
```
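The TODO above suggests doing the weight comparison on CPU. A minimal sketch of that idea, assuming both weights are plain `torch.Tensor`s (the helper name is hypothetical, not part of the PR):

```python
import torch


def embed_weights_match(target_weight: torch.Tensor,
                        draft_weight: torch.Tensor) -> bool:
    """Hypothetical helper: decide whether the draft model can share the
    target model's embedding weights instead of keeping its own copy."""
    if target_weight.shape != draft_weight.shape:
        return False
    # Move both tensors to CPU before the elementwise comparison so the
    # check does not allocate extra device (NPU/GPU) memory.
    return torch.equal(target_weight.cpu(), draft_weight.cpu())
```

`torch.equal` returns `True` only when shapes and all elements match exactly, which is the sharing criterion implied by the surrounding diff.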
This pull request has conflicts, please resolve those before we can evaluate the pull request.
…3B-EAGLE3
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
…3B-EAGLE3 (vllm-project#6139)

### What this PR does / why we need it?
Due to a long-term lack of synchronization with the upstream code, a bug that decreased the acceptance rate of the Qwen3-30B-A3B-EAGLE3 draft model was introduced while fixing vllm-project#5967. This change synchronizes with upstream and fixes the bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "AngelSlim/Qwen3-a3B_eagle3",
            "num_speculative_tokens": 3,
        },
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>