
[Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend #32936

Merged
njhill merged 3 commits into vllm-project:main from izhuhaoran:MRV2-cg-mode-verify
Apr 10, 2026

Conversation

@izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Jan 23, 2026

Purpose

A follow-up PR of #32771 and #32820.

After #32820 we can select any attention backend, but some backends have limitations for CUDA graphs. Like Model Runner V1, this PR adds a CUDA-graph check that adjusts the cudagraph mode and capture sizes according to the attention backend. For example, with FLASHINFER + spec decode, a user-specified FULL_AND_PIECEWISE is automatically resolved to PIECEWISE:

WARNING 01-23 20:32:46 [compilation.py:1148] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
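
For illustration, here is a minimal, self-contained sketch of the kind of downgrade logic described above; the enum members and function name are simplified stand-ins, not the exact vLLM API:

```python
from enum import Enum, auto

# Simplified stand-ins for vLLM's enums, just to illustrate the downgrade logic.
class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()
    FULL_AND_PIECEWISE = auto()

class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    ALWAYS = auto()

def resolve_cudagraph_mode(
    requested: CUDAGraphMode,
    backend_support: AttentionCGSupport,
    uses_spec_decode: bool,
) -> CUDAGraphMode:
    """Downgrade the requested mode to what the attention backend can capture."""
    wants_full = requested in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_AND_PIECEWISE)
    if wants_full and backend_support is AttentionCGSupport.NEVER:
        return CUDAGraphMode.PIECEWISE
    if (
        wants_full
        and uses_spec_decode
        and backend_support is AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
    ):
        # e.g. FlashInfer + spec decode: full-graph capture isn't valid here,
        # so fall back to piecewise capture (and log a warning, as above).
        return CUDAGraphMode.PIECEWISE
    return requested

# FLASHINFER + spec decode: FULL_AND_PIECEWISE resolves to PIECEWISE.
assert resolve_cudagraph_mode(
    CUDAGraphMode.FULL_AND_PIECEWISE,
    AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE,
    uses_spec_decode=True,
) is CUDAGraphMode.PIECEWISE
```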

@izhuhaoran
Contributor Author

@WoosukKwon Could you please review this PR when you have time?

@izhuhaoran
Contributor Author

@WoosukKwon Now that #32771 has been merged, this follow-up PR (which adds the CUDA-graph safety checks / auto-adjustment logic) is unblocked. Could you or someone please take a look when you get a chance? Thanks!

@izhuhaoran izhuhaoran requested a review from njhill as a code owner March 1, 2026 08:59
@njhill
Member

njhill commented Mar 10, 2026

@izhuhaoran do you think you could rework/rebase this now that we've done a bunch of cudagraph rework/fixes?

I actually did it myself for testing in case you want to use that; I've pushed it to this branch: https://github.com/njhill/vllm/tree/mrv2-cg-mode-verify

However, that might also not be the final state; we may want a bit more review and code simplification.

Thanks!!

@izhuhaoran
Contributor Author

izhuhaoran commented Mar 11, 2026

> rework/rebase this now that we've done a bunch of cudagraph rework/fixes

@njhill Sure, thanks for your time on this PR. I'll rebase/rework this PR today.

@izhuhaoran izhuhaoran closed this Mar 11, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Mar 11, 2026
@izhuhaoran izhuhaoran reopened this Mar 11, 2026
@mergify
Contributor

mergify Bot commented Mar 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @izhuhaoran.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 11, 2026
@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from e13582e to ccd16b3 Compare March 11, 2026 05:53
@mergify mergify Bot removed the needs-rebase label Mar 11, 2026
@izhuhaoran
Contributor Author

@njhill I've updated the code, PTAL when you have time.

@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from ccd16b3 to 9755b46 Compare March 12, 2026 11:16
Member

@njhill njhill Mar 12, 2026


@WoosukKwon I know this is not part of this PR, but I wonder if we should rename this to DefaultCudaGraphManager?

@njhill
Member

njhill commented Mar 19, 2026

Thanks for this @izhuhaoran! I got Claude to review the PR; I haven't looked at the suggestions in detail, but at a glance at least some of them look reasonable w.r.t. cleaner code structure. I'm not suggesting that all of these suggestions are correct/wanted, but perhaps you could consider them:

Key Issues

1. Deferred initialization is unnecessarily complex

The PR changes `CudaGraphManager.__init__` to accept no `cudagraph_mode`, initializing with `NONE`, then requires a separate `set_cg_mode_and_candidates()` call later. This two-phase init is fragile: there is a window where the manager exists but is not properly configured. The `_init_candidates()` call now happens twice (once in `__init__` with `NONE`, doing nothing, and once in `set_cg_mode_and_candidates`).

A simpler approach: keep the constructor as-is, but move cudagraph manager construction to after `init_attn_backend()` returns the resolved mode. In `initialize_kv_cache()` (line 317 of model_runner.py), `init_attn_backend` is called at lines 345-346, and the cudagraph manager is created earlier in `__init__` at line 237. Instead of making the manager deferrable, just move its creation to `initialize_kv_cache()` after the attention backends are known.

2. Putting resolution logic in CompilationConfig is a layering concern

`CompilationConfig` is a data/configuration class. `resolve_cudagraph_mode()` involves attention-backend-specific logic (the `AttentionCGSupport` enum, backend names, spec-decode awareness). This couples the config layer to the attention backend layer. The V1 approach of having this logic in the model runner (closer to where it's used) is arguably better layering, even if duplicated.

A better home might be a standalone function in cudagraph_utils.py or attn_utils.py that takes the config plus backend support info and returns the resolved mode.

3. init_attn_backend side effects

The current `init_attn_backend()` is a pure function that returns backends and attention groups. The PR adds a side effect: it mutates `vllm_config.compilation_config.cudagraph_mode` as part of resolution. Functions that initialize attention backends shouldn't silently mutate the compilation config. If resolution must happen here, it should at minimum return the resolved mode rather than mutating global config.
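
For illustration, a self-contained sketch of the "return the resolved mode instead of mutating the config" shape suggested above; every name and type here is a simplified stand-in rather than the actual vLLM signature:

```python
from dataclasses import dataclass

# Illustrative stand-ins; not the actual vLLM types or signatures.
@dataclass
class CompilationConfig:
    cudagraph_mode: str = "FULL_AND_PIECEWISE"

def resolve_cudagraph_mode(config: CompilationConfig, backend_name: str) -> str:
    # Placeholder for the real backend-aware resolution logic.
    return "PIECEWISE" if backend_name == "FLASHINFER" else config.cudagraph_mode

def init_attn_backend(config: CompilationConfig, backend_name: str):
    attn_groups = {"decode": backend_name}  # placeholder attention groups
    resolved = resolve_cudagraph_mode(config, backend_name)
    # Note: config is NOT mutated; the caller applies `resolved` explicitly.
    return attn_groups, resolved

groups, mode = init_attn_backend(CompilationConfig(), "FLASHINFER")
assert mode == "PIECEWISE"
```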

4. Missing logic from V1

The V1 _check_and_update_cudagraph_mode also handles:

  • Mamba cache block size capping (adjust_cudagraph_sizes_for_mamba_cache)
  • Spec-decode cudagraph size adjustment (adjust_cudagraph_sizes_for_spec_decode)
  • The splitting_ops_contain_attention() / use_inductor_graph_partition checks for fallback decisions

It's unclear from the PR description whether these are intentionally omitted (not applicable to V2 yet?) or overlooked.

5. Eagle cudagraph manager changes

The PR removes cudagraph_mode from EagleCudaGraphManager.init and moves the PIECEWISE assertion to set_cg_mode_and_candidates(). The current code has the assertion in the constructor
(fail-fast), which is preferable. With the deferred approach, you could construct an EagleCudaGraphManager and use it in an invalid state before set_cg_mode_and_candidates is called.

Minor Issue

  • No tests are visible in the diff. Given the complexity of cudagraph mode resolution (multiple code paths, downgrades, error cases), at minimum unit tests for resolve_cudagraph_mode()
    with various AttentionCGSupport levels would be valuable.
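
As an example of what such tests could look like, here is a small parametrized sketch; it assumes the illustrative `resolve_cudagraph_mode` and enums from the sketch in the PR description above, not the real vLLM implementations:

```python
import pytest

# Assumes the illustrative resolve_cudagraph_mode, CUDAGraphMode and
# AttentionCGSupport defined in the sketch earlier in this thread are in scope;
# they are stand-ins, not vLLM's actual API.
@pytest.mark.parametrize(
    "requested, support, spec_decode, expected",
    [
        ("FULL_AND_PIECEWISE", "UNIFORM_SINGLE_TOKEN_DECODE", True, "PIECEWISE"),
        ("FULL_AND_PIECEWISE", "ALWAYS", True, "FULL_AND_PIECEWISE"),
        ("FULL", "NEVER", False, "PIECEWISE"),
        ("PIECEWISE", "NEVER", False, "PIECEWISE"),
    ],
)
def test_resolve_cudagraph_mode(requested, support, spec_decode, expected):
    resolved = resolve_cudagraph_mode(
        CUDAGraphMode[requested],
        AttentionCGSupport[support],
        uses_spec_decode=spec_decode,
    )
    assert resolved is CUDAGraphMode[expected]
```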

Suggestions

  1. Keep constructor signatures unchanged. Move cudagraph manager creation to after attention backend initialization rather than adding two-phase init.
  2. Make the resolution a standalone function (not a method on CompilationConfig) that takes the needed inputs and returns the resolved mode without mutation.
  3. Add unit tests for the resolution logic — it has many branches and edge cases.
  4. Clarify which V1 behaviors are intentionally excluded from the V2 path (Mamba capping, spec-decode size adjustment, etc.).

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from a24f430 to 49d60d2 Compare April 8, 2026 13:00
@izhuhaoran izhuhaoran changed the title from "[Model Runner V2] support cudagraph check based on attn backend" to "[Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend" Apr 8, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 8, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

@njhill Sorry for the delay! I have refactored the code based on Claude's draft suggestions above. Could you please take another look when you have time?

Member

@njhill njhill left a comment


Thanks @izhuhaoran for the great work!

@github-project-automation github-project-automation Bot moved this from Done to Ready in NVIDIA Apr 9, 2026
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 9, 2026
@njhill njhill merged commit 8f121f7 into vllm-project:main Apr 10, 2026
69 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 10, 2026
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
…attn backend (vllm-project#32936)

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 17, 2026
### What this PR does / why we need it?
Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`
1. fix vllm-project/vllm#32936

That PR surfaced an issue in the capture phase (`_dummy_run`), see vllm-project/vllm#28207 (comment): we should re-read `compilation_config.cudagraph_capture_sizes` after the super() call in `_check_and_update_cudagraph_mode`, to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12].
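
A tiny sketch of the rounding described above (illustrative only, not the actual `adjust_cudagraph_sizes_for_spec_decode` implementation):

```python
def round_sizes_for_spec_decode(sizes: list[int], num_speculative_tokens: int) -> list[int]:
    # Round each capture size up to a multiple of (num_speculative_tokens + 1).
    step = num_speculative_tokens + 1
    return sorted({-(-s // step) * step for s in sizes})

assert round_sizes_for_spec_decode([5, 12], num_speculative_tokens=2) == [6, 12]
```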

However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during
__init__ with the original [5, 12]. When
`set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates
`graph_params.events` keyed by {5, 12}. Meanwhile, the
`CudagraphDispatcher` uses the updated [6, 12] from
`compilation_config`, so it tries to capture at num_tokens=6 — causing
KeyError: 6 in `graph_params.events[num_tokens]` inside
`full_graph_fia`.
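
A minimal sketch of the sync fix described above; the base class is stubbed out and names are taken from the description rather than the actual vllm-ascend patch:

```python
class _VllmModelRunnerBase:
    """Stub standing in for the upstream vLLM model runner (illustrative only)."""

    def _check_and_update_cudagraph_mode(self) -> None:
        # Upstream may rewrite cudagraph_capture_sizes here, e.g. [5, 12] -> [6, 12]
        # via adjust_cudagraph_sizes_for_spec_decode when spec decode is enabled.
        self.compilation_config.cudagraph_capture_sizes = [6, 12]


class NPUModelRunner(_VllmModelRunnerBase):
    def _check_and_update_cudagraph_mode(self) -> None:
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) sizes so set_graph_params() and the
        # CudagraphDispatcher are keyed by the same values, avoiding KeyError: 6.
        self.cudagraph_batch_sizes = list(
            self.compilation_config.cudagraph_capture_sizes
        )
```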

You can also reproduce the issue with this script:
```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()

```


3. fix vllm-project/vllm#39604
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
1. For 310P, we are
- vLLM version: 
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 20, 2026
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done
