
[Misc] Upgrade vllm commit to 0414 #8172

Merged
wangxiyuan merged 11 commits into vllm-project:main from Potabk:main2main_0412
Apr 17, 2026

Conversation

@Potabk
Collaborator

@Potabk Potabk commented Apr 13, 2026

What this PR does / why we need it?

Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`

  1. fix [Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend vllm#32936

This PR leads to an issue in the capture phase (`_dummy_run`); see vllm-project/vllm#28207 (comment).
We should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.
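
For illustration, a minimal sketch of that sync pattern; the class layout and the way the sizes are stored below are simplified stand-ins invented for this example, not the actual vllm-ascend or vLLM code:

```python
# Simplified, self-contained sketch of "re-read the capture sizes after the
# super() call". _BaseRunner stands in for vLLM's model runner; the real
# classes keep the sizes on a compilation_config object instead.
class _BaseRunner:
    def __init__(self, capture_sizes):
        self.cudagraph_capture_sizes = list(capture_sizes)

    def _check_and_update_cudagraph_mode(self):
        # Stand-in for vLLM rewriting the sizes (e.g. spec-decode rounding).
        self.cudagraph_capture_sizes = [6, 12]


class _NPURunnerSketch(_BaseRunner):
    def __init__(self, capture_sizes):
        super().__init__(capture_sizes)
        # Cached at init time; this is the copy that goes stale.
        self.cudagraph_batch_sizes = list(self.cudagraph_capture_sizes)

    def _check_and_update_cudagraph_mode(self):
        super()._check_and_update_cudagraph_mode()
        # Re-read after the super() call so the cache follows any rewrite.
        self.cudagraph_batch_sizes = list(self.cudagraph_capture_sizes)


runner = _NPURunnerSketch([5, 12])
runner._check_and_update_cudagraph_mode()
print(runner.cudagraph_batch_sizes)  # [6, 12], in sync with the rewritten sizes
```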

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds each entry of `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12].
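
The rounding itself is just "round each size up to the next multiple of num_speculative_tokens + 1". A standalone illustration of that arithmetic (not vLLM's actual `adjust_cudagraph_sizes_for_spec_decode` implementation):

```python
import math


def round_up_for_spec_decode(sizes, num_speculative_tokens):
    # Round each capture size up to a multiple of (num_speculative_tokens + 1).
    step = num_speculative_tokens + 1
    return sorted({math.ceil(size / step) * step for size in sizes})


print(round_up_for_spec_decode([5, 12], num_speculative_tokens=2))  # [6, 12]
```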

However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6, causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.
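
Reduced to its essence, the failure is a key mismatch between two views of the capture sizes; a stripped-down illustration using plain Python objects instead of the real graph_params and dispatcher:

```python
# The cached copy taken at __init__ keys the events dict...
cached_sizes = [5, 12]
events = {size: f"event-{size}" for size in cached_sizes}

# ...while the dispatcher works from the rewritten sizes.
adjusted_sizes = [6, 12]
for num_tokens in adjusted_sizes:
    print(events[num_tokens])  # raises KeyError: 6 on the first iteration
```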

You can also reproduce the issue with the following script:

```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()
```

  2. fix [Quantization] [Refactor] Create special "GptOssMxfp4MoeMethod" vllm#39604

Does this PR introduce any user-facing change?

How was this patch tested?

  1. For 310P, we are

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the reference to the vLLM main branch commit hash across the project's documentation and configuration files. This ensures that the project remains aligned with the latest upstream changes in the vLLM repository.

Highlights

  • Dependency Update: Updated the vLLM main branch commit hash to 620e8924d9c6b2a0b1d49ac0dcf2588fffcbe390.
  • Documentation Sync: Synchronized the documentation and README files to reflect the updated vLLM commit reference.


Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml

@github-actions github-actions Bot added the documentation and ci/build labels Apr 13, 2026
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the vLLM commit hash to 620e8924d9c6b2a0b1d49ac0dcf2588fffcbe390 in both the documentation configuration and the Model Runner V2 README. The reviewer pointed out that the current PR title and summary do not follow the repository's style guide and provided a formatted suggestion to fix this.

Comment thread docs/source/conf.py Outdated
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

1 similar comment
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Potabk Potabk added the ready and ready-for-test labels Apr 15, 2026
@Potabk Potabk changed the title [Misc] Upgrade vllm commit to 0412 [Misc] Upgrade vllm commit to 0414 Apr 15, 2026
@zhangxinyuehfad zhangxinyuehfad removed the ready-for-test label Apr 15, 2026
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zhangxinyuehfad zhangxinyuehfad added and removed the ready label Apr 15, 2026
@zhangxinyuehfad zhangxinyuehfad added the ready label Apr 15, 2026
@zhangxinyuehfad zhangxinyuehfad added the ready-for-test label Apr 16, 2026
Comment thread docs/source/conf.py Outdated
Collaborator

@yiz-liu yiz-liu left a comment


We should use `self.cudagraph_dispatcher.get_capture_descs()` instead of `self.cudagraph_batch_sizes` now.
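
For illustration, a rough sketch of what that could look like; the shape of the objects returned by `get_capture_descs()` (assumed here to carry a `num_tokens` field) is a guess, not the actual API:

```python
def _capture_sizes_from_dispatcher(dispatcher):
    # Hypothetical helper: derive the up-to-date capture sizes from the
    # dispatcher instead of the stale cached list. `desc.num_tokens` is an
    # assumed attribute name, used only for illustration.
    return sorted(desc.num_tokens for desc in dispatcher.get_capture_descs())


# Usage inside the runner would then be roughly:
#     set_graph_params(_capture_sizes_from_dispatcher(self.cudagraph_dispatcher))
```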

Potabk added 8 commits April 16, 2026 17:09
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Potabk added 3 commits April 16, 2026 17:12
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
@wangxiyuan wangxiyuan merged commit 3ada044 into vllm-project:main Apr 17, 2026
71 of 72 checks passed
wangxiyuan pushed a commit that referenced this pull request Apr 17, 2026
### What this PR does / why we need it?
This PR is partially cherry-picked from #8172.

This PR aims to fix mismatched capture sizes after the rounding operations
applied when using SP or speculative decoding. The reason is that the
original `self.cudagraph_capture_sizes` is no longer updated and remains at
the initial sizes. Now we use `self.cudagraph_dispatcher.get_capture_descs`
to get the up-to-date sizes.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

Signed-off-by: Zetong Li <slippersss@126.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 20, 2026
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026

Labels

ci/build, documentation (Improvements or additions to documentation), ready (read for review), ready-for-test (start test by label for PR)
