
[BugFix] Fix Qwen3-Next because of TP Attn + EP MoE modified #3221

Merged
wangxiyuan merged 2 commits into vllm-project:main from wxsIcey:qwen3_next_fix on Sep 29, 2025

Conversation

@wxsIcey
Collaborator

@wxsIcey wxsIcey commented Sep 28, 2025

Upstream changes to the TP Attn + EP MoE module (vllm-project/vllm#24982) caused Qwen3-Next inference to fail; this PR fixes that.
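
For context, here is a minimal, hypothetical sketch (not the actual fix; the head and expert counts are made up rather than taken from the Qwen3-Next config) of why attention TP and MoE EP slice the same set of devices along different dimensions, so that an upstream change in how either group is derived can break a backend plugin's model code:

```python
# Hedged sketch only: attention TP shards KV heads across ranks, while MoE EP
# shards routed experts across ranks of the same world size. A change in how
# either group is derived can invalidate layout assumptions made downstream.
world_size = 4            # matches tensor_parallel_size in the test script below
num_kv_heads = 8          # hypothetical KV-head count for the attention layers
num_routed_experts = 64   # hypothetical routed-expert count for the MoE layers

kv_heads_per_attn_rank = num_kv_heads // world_size       # attention TP sharding
experts_per_moe_rank = num_routed_experts // world_size   # MoE EP sharding

print(f"KV heads per attention TP rank: {kv_heads_per_attn_rank}")
print(f"Routed experts per MoE EP rank: {experts_per_moe_rank}")
```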

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

from vllm import LLM, SamplingParams


def main():
    # Chinese and English prompts to exercise multilingual generation.
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)

    # Create an LLM with tensor parallelism across 4 devices.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts and print each completion.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@wangxiyuan
Collaborator

@wxsIcey got it

@github-actions
Contributor

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@wxsIcey wxsIcey changed the title [BugFix] Fix Qwen3-Next by vllm #25400 #24982 [BugFix] Fix Qwen3-Next by vllm #24982 Sep 29, 2025
@wxsIcey wxsIcey changed the title [BugFix] Fix Qwen3-Next by vllm #24982 [BugFix] Fix Qwen3-Next because of vllm #24982 Sep 29, 2025
@wxsIcey wxsIcey added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 29, 2025
@wxsIcey wxsIcey marked this pull request as ready for review September 29, 2025 01:25
@wxsIcey wxsIcey requested a review from wangxiyuan September 29, 2025 01:25
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
@wangxiyuan wangxiyuan merged commit 83092d9 into vllm-project:main Sep 29, 2025
19 checks passed
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
@wxsIcey wxsIcey changed the title [BugFix] Fix Qwen3-Next because of vllm #24982 [BugFix] Fix Qwen3-Next because of TP Attn + EP MoE modified Dec 10, 2025

Labels

ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants