[Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE3 #6139

Merged
wangxiyuan merged 2 commits into vllm-project:releases/v0.13.0 from zhaomingyu13:releases/v0.13.0
Jan 23, 2026
Conversation

@zhaomingyu13
Contributor

What this PR does / why we need it?

Due to a long-standing lack of synchronization with the upstream code, a bug that reduced the acceptance rate of the Qwen3-30B-A3B-EAGLE3 draft model was introduced while fixing #5967. This PR synchronizes with upstream and fixes the issue.

Does this PR introduce any user-facing change?

No

How was this patch tested?

```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM with an EAGLE3 draft model.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "AngelSlim/Qwen3-a3B_eagle3",
            "num_speculative_tokens": 3,
        },
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request aims to fix a bug causing a decline in the acceptance rate for the Qwen3-30B-A3B-EAGLE3 model by synchronizing code with an upstream version. The core changes are within vllm_ascend/spec_decode/eagle_proposer.py, where the logic for sharing token embeddings between the target and draft models has been significantly improved. The new implementation is more robust, handling multimodal models, various embedding layer names, and differentiating between EAGLE and MTP models. However, I've identified a critical issue where a refactoring accidentally removed the definition of an attribute that is still used elsewhere in the class, which would lead to a runtime error. My review includes a suggestion to fix this. Overall, the changes are a good step forward, but this critical issue must be addressed.
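For illustration only (not the actual vllm-ascend code), the embedding-sharing pattern the review describes usually boils down to aliasing the target model's token-embedding weights into the draft model when their shapes match; the names `maybe_share_embeddings`, `target`, and `draft` below are placeholders:

```python
import torch.nn as nn

def maybe_share_embeddings(target_embed: nn.Embedding,
                           draft_embed: nn.Embedding) -> bool:
    """Share the target model's token-embedding weights with the draft
    model when their shapes match; otherwise keep separate weights."""
    if target_embed.weight.shape != draft_embed.weight.shape:
        return False  # e.g. the draft model uses a reduced vocabulary
    draft_embed.weight = target_embed.weight  # alias the same tensor
    return True

# Tiny demonstration with toy embedding tables.
target = nn.Embedding(100, 16)
draft = nn.Embedding(100, 16)
shared = maybe_share_embeddings(target, draft)
```

After sharing, both modules point at the same underlying storage, so the draft model follows any update to the target embeddings for free.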

```diff
 assert len(draft_attn_layer_names) == 1
-self.attn_layer_name = list(draft_attn_layer_names)
-self.attn_layer_names = self.attn_layer_name
+self.attn_layer_names = list(draft_attn_layer_names)
```


critical

It seems that while refactoring, the assignment to self.attn_layer_name was removed. However, this attribute is still used later in the dummy_run (line 345) and _propose (line 458) methods. This change will cause an AttributeError at runtime. To fix this, please restore the assignment.

Suggested change
```diff
 self.attn_layer_names = list(draft_attn_layer_names)
+self.attn_layer_name = self.attn_layer_names
```
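A minimal sketch of the failure mode the reviewer describes (the class and layer name here are illustrative, not the real `eagle_proposer.py`): keeping only the plural attribute while other methods still read the singular one raises `AttributeError` at runtime, and the suggested alias restores both names.

```python
class Proposer:
    def __init__(self, draft_attn_layer_names):
        # The refactoring kept only the plural attribute...
        self.attn_layer_names = list(draft_attn_layer_names)
        # ...so without this alias, later reads of the singular name
        # (e.g. in dummy_run/_propose) would raise AttributeError.
        self.attn_layer_name = self.attn_layer_names

    def dummy_run(self):
        # Reads the singular attribute, as the original methods did.
        return self.attn_layer_name[0]

p = Proposer({"model.layers.0.self_attn.attn"})
result = p.dummy_run()
```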

@zhaomingyu13 zhaomingyu13 changed the title [Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE [Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE3 Jan 23, 2026
"Keeping separate embedding weights from the target model."
)
else:
# MTP model


Please follow the original logic.



Oops, it's my mistake.



LGTM.

```python
elif (isinstance(target_embed_tokens.weight, torch.Tensor)
      and isinstance(self.model.model.embed_tokens.weight,
                     torch.Tensor)
      # TODO: Offload to CPU for comparison to avoid extra GPU memory
```


NPU (this backend runs on Ascend NPUs, so the TODO should say "extra NPU memory" rather than GPU).
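For context on the TODO in the hunk above, comparing two embedding weight tensors on the host avoids allocating a temporary on the accelerator. A hedged sketch in plain PyTorch (not the actual patch; `weights_equal_via_cpu` is a made-up helper name):

```python
import torch

def weights_equal_via_cpu(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Compare two weight tensors on the CPU so the comparison does
    not allocate extra memory on the accelerator (GPU/NPU)."""
    if a.shape != b.shape:
        return False
    # .detach().cpu() copies each tensor to host memory; torch.equal
    # then does an exact element-wise comparison there.
    return torch.equal(a.detach().cpu(), b.detach().cpu())

a = torch.randn(4, 8)
same = weights_equal_via_cpu(a, a.clone())
diff = weights_equal_via_cpu(a, a + 1.0)
```

The trade-off is an extra host copy and transfer, which is acceptable for a one-time check at model load.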

@wangxiyuan wangxiyuan added ready read for review ready-for-test start test by label for PR labels Jan 23, 2026
@zhaomingyu13 zhaomingyu13 force-pushed the releases/v0.13.0 branch 2 times, most recently from d972281 to 831a3bb Compare January 23, 2026 09:42
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

…3B-EAGLE3

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
@wangxiyuan wangxiyuan merged commit 72a8f51 into vllm-project:releases/v0.13.0 Jan 23, 2026
11 checks passed
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…3B-EAGLE3 (vllm-project#6139)

(Commit message body omitted; it duplicates the PR description and test script above.)

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarkblood@qq.com>
@zhaomingyu13 zhaomingyu13 deleted the releases/v0.13.0 branch March 2, 2026 06:46