
[Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend #32936

Merged
njhill merged 3 commits into vllm-project:main from izhuhaoran:MRV2-cg-mode-verify
Apr 10, 2026

Conversation

@izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Jan 23, 2026

Purpose

A follow-up PR of #32771 and #32820.

After #32820 we can select any attention backend, but some backends have limitations for CUDA graphs. Like Model Runner V1, this PR adds a CUDA-graph check that adjusts the cudagraph mode and capture sizes according to the attention backend. For example, with FLASHINFER + spec decode, a user-specified FULL_AND_PIECEWISE is automatically resolved to PIECEWISE:

WARNING 01-23 20:32:46 [compilation.py:1148] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
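
For illustration, here is a minimal, self-contained sketch of the kind of downgrade logic described above; the enum members and function name are simplified stand-ins, not the exact vLLM API:

```python
from enum import Enum, auto

# Simplified stand-ins for vLLM's enums, just to illustrate the downgrade logic.
class CUDAGraphMode(Enum):
    NONE = auto()
    PIECEWISE = auto()
    FULL = auto()
    FULL_AND_PIECEWISE = auto()

class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    ALWAYS = auto()

def resolve_cudagraph_mode(
    requested: CUDAGraphMode,
    backend_support: AttentionCGSupport,
    uses_spec_decode: bool,
) -> CUDAGraphMode:
    """Downgrade the requested mode to what the attention backend can capture."""
    wants_full = requested in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_AND_PIECEWISE)
    if wants_full and backend_support is AttentionCGSupport.NEVER:
        return CUDAGraphMode.PIECEWISE
    if (
        wants_full
        and uses_spec_decode
        and backend_support is AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
    ):
        # e.g. FlashInfer + spec decode: full-graph capture isn't valid here,
        # so fall back to piecewise capture (and log a warning, as above).
        return CUDAGraphMode.PIECEWISE
    return requested

# FLASHINFER + spec decode: FULL_AND_PIECEWISE resolves to PIECEWISE.
assert resolve_cudagraph_mode(
    CUDAGraphMode.FULL_AND_PIECEWISE,
    AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE,
    uses_spec_decode=True,
) is CUDAGraphMode.PIECEWISE
```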

@izhuhaoran
Contributor Author

@WoosukKwon Could you please review this PR when you have time?

@izhuhaoran
Contributor Author

@WoosukKwon Now that #32771 has been merged, this follow-up PR (which adds the CUDA-graph safety checks / auto-adjustment logic) is unblocked. Could you or someone please take a look when you get a chance? Thanks!

@izhuhaoran izhuhaoran requested a review from njhill as a code owner March 1, 2026 08:59
@njhill
Member

njhill commented Mar 10, 2026

@izhuhaoran do you think you could rework/rebase this now that we've done a bunch of cudagraph rework/fixes?

I actually did it myself for testing in case you want to use that; I've pushed it to this branch: https://github.com/njhill/vllm/tree/mrv2-cg-mode-verify

However, that might also not be the final state; we may want a bit more review and code simplification.

Thanks!!

@izhuhaoran
Contributor Author

izhuhaoran commented Mar 11, 2026

> rework/rebase this now that we've done a bunch of cudagraph rework/fixes

@njhill Sure, thanks for your time on this PR. I'll rebase/rework this PR today.

@izhuhaoran izhuhaoran closed this Mar 11, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Mar 11, 2026
@izhuhaoran izhuhaoran reopened this Mar 11, 2026
@mergify
Contributor

mergify Bot commented Mar 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @izhuhaoran.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 11, 2026
@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from e13582e to ccd16b3 Compare March 11, 2026 05:53
@mergify mergify Bot removed the needs-rebase label Mar 11, 2026
@izhuhaoran
Contributor Author

@njhill I've updated the code, PTAL when you have time.

@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from ccd16b3 to 9755b46 Compare March 12, 2026 11:16
Member

@njhill njhill Mar 12, 2026


@WoosukKwon I know this is not part of this PR, but I wonder if we should rename this to DefaultCudaGraphManager?

@njhill
Member

njhill commented Mar 19, 2026

Thanks for this @izhuhaoran! I got Claude to review the PR; I haven't looked at the suggestions in detail, but at a glance at least some of them look reasonable w.r.t. cleaner code structure. I'm not suggesting that all of these suggestions are correct/wanted, but perhaps you could consider them:

Key Issues

1. Deferred initialization is unnecessarily complex

The PR changes `CudaGraphManager.__init__` to accept no `cudagraph_mode`, initializing with `NONE`, then requires a separate `set_cg_mode_and_candidates()` call later. This two-phase init is fragile: there is a window where the manager exists but is not properly configured. The `_init_candidates()` call now happens twice (once in `__init__` with `NONE`, doing nothing, and once in `set_cg_mode_and_candidates`).

A simpler approach: keep the constructor as-is, but move cudagraph manager construction to after `init_attn_backend()` returns the resolved mode. In `initialize_kv_cache()` (line 317 of model_runner.py), `init_attn_backend` is called at lines 345-346, and the cudagraph manager is created earlier in `__init__` at line 237. Instead of making the manager deferrable, just move its creation to `initialize_kv_cache()` after the attention backends are known.

2. Putting resolution logic in CompilationConfig is a layering concern

`CompilationConfig` is a data/configuration class. `resolve_cudagraph_mode()` involves attention-backend-specific logic (the `AttentionCGSupport` enum, backend names, spec-decode awareness). This couples the config layer to the attention backend layer. The V1 approach of having this logic in the model runner (closer to where it's used) is arguably better layering, even if duplicated.

A better home might be a standalone function in cudagraph_utils.py or attn_utils.py that takes the config plus backend support info and returns the resolved mode.

3. init_attn_backend side effects

The current `init_attn_backend()` is a pure function that returns backends and attention groups. The PR adds a side effect: it mutates `vllm_config.compilation_config.cudagraph_mode` as part of resolution. Functions that initialize attention backends shouldn't silently mutate the compilation config. If resolution must happen here, it should at minimum return the resolved mode rather than mutating global config.
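
For illustration, a self-contained sketch of the "return the resolved mode instead of mutating the config" shape suggested above; every name and type here is a simplified stand-in rather than the actual vLLM signature:

```python
from dataclasses import dataclass

# Illustrative stand-ins; not the actual vLLM types or signatures.
@dataclass
class CompilationConfig:
    cudagraph_mode: str = "FULL_AND_PIECEWISE"

def resolve_cudagraph_mode(config: CompilationConfig, backend_name: str) -> str:
    # Placeholder for the real backend-aware resolution logic.
    return "PIECEWISE" if backend_name == "FLASHINFER" else config.cudagraph_mode

def init_attn_backend(config: CompilationConfig, backend_name: str):
    attn_groups = {"decode": backend_name}  # placeholder attention groups
    resolved = resolve_cudagraph_mode(config, backend_name)
    # Note: config is NOT mutated; the caller applies `resolved` explicitly.
    return attn_groups, resolved

groups, mode = init_attn_backend(CompilationConfig(), "FLASHINFER")
assert mode == "PIECEWISE"
```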

4. Missing logic from V1

The V1 _check_and_update_cudagraph_mode also handles:

  • Mamba cache block size capping (adjust_cudagraph_sizes_for_mamba_cache)
  • Spec-decode cudagraph size adjustment (adjust_cudagraph_sizes_for_spec_decode)
  • The splitting_ops_contain_attention() / use_inductor_graph_partition checks for fallback decisions

It's unclear from the PR description whether these are intentionally omitted (not applicable to V2 yet?) or overlooked.

5. Eagle cudagraph manager changes

The PR removes cudagraph_mode from EagleCudaGraphManager.init and moves the PIECEWISE assertion to set_cg_mode_and_candidates(). The current code has the assertion in the constructor
(fail-fast), which is preferable. With the deferred approach, you could construct an EagleCudaGraphManager and use it in an invalid state before set_cg_mode_and_candidates is called.

Minor Issue

  • No tests are visible in the diff. Given the complexity of cudagraph mode resolution (multiple code paths, downgrades, error cases), at minimum unit tests for resolve_cudagraph_mode()
    with various AttentionCGSupport levels would be valuable.
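
As an example of what such tests could look like, here is a small parametrized sketch; it assumes the illustrative `resolve_cudagraph_mode` and enums from the sketch in the PR description above, not the real vLLM implementations:

```python
import pytest

# Assumes the illustrative resolve_cudagraph_mode, CUDAGraphMode and
# AttentionCGSupport defined in the sketch earlier in this thread are in scope;
# they are stand-ins, not vLLM's actual API.
@pytest.mark.parametrize(
    "requested, support, spec_decode, expected",
    [
        ("FULL_AND_PIECEWISE", "UNIFORM_SINGLE_TOKEN_DECODE", True, "PIECEWISE"),
        ("FULL_AND_PIECEWISE", "ALWAYS", True, "FULL_AND_PIECEWISE"),
        ("FULL", "NEVER", False, "PIECEWISE"),
        ("PIECEWISE", "NEVER", False, "PIECEWISE"),
    ],
)
def test_resolve_cudagraph_mode(requested, support, spec_decode, expected):
    resolved = resolve_cudagraph_mode(
        CUDAGraphMode[requested],
        AttentionCGSupport[support],
        uses_spec_decode=spec_decode,
    )
    assert resolved is CUDAGraphMode[expected]
```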

Suggestions

  1. Keep constructor signatures unchanged. Move cudagraph manager creation to after attention backend initialization rather than adding two-phase init.
  2. Make the resolution a standalone function (not a method on CompilationConfig) that takes the needed inputs and returns the resolved mode without mutation.
  3. Add unit tests for the resolution logic — it has many branches and edge cases.
  4. Clarify which V1 behaviors are intentionally excluded from the V2 path (Mamba capping, spec-decode size adjustment, etc.).

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran izhuhaoran force-pushed the MRV2-cg-mode-verify branch from a24f430 to 49d60d2 Compare April 8, 2026 13:00
@izhuhaoran izhuhaoran changed the title from "[Model Runner V2] support cudagraph check based on attn backend" to "[Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend" Apr 8, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 8, 2026

Hi @izhuhaoran, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
@izhuhaoran
Contributor Author

@njhill Sorry for the delay! I have refactored the code based on Claude's draft suggestions above. Could you please take another look when you have time?

Member

@njhill njhill left a comment


Thanks @izhuhaoran for the great work!

@github-project-automation github-project-automation Bot moved this from Done to Ready in NVIDIA Apr 9, 2026
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 9, 2026
@njhill njhill merged commit 8f121f7 into vllm-project:main Apr 10, 2026
69 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 10, 2026
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
…attn backend (vllm-project#32936)

Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 17, 2026
### What this PR does / why we need it?
Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`
1. fix vllm-project/vllm#32936

That PR surfaced an issue in the capture phase (`_dummy_run`), see vllm-project/vllm#28207 (comment): we should re-read `compilation_config.cudagraph_capture_sizes` after the super() call in `_check_and_update_cudagraph_mode`, to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12].
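
A tiny sketch of the rounding described above (illustrative only, not the actual `adjust_cudagraph_sizes_for_spec_decode` implementation):

```python
def round_sizes_for_spec_decode(sizes: list[int], num_speculative_tokens: int) -> list[int]:
    # Round each capture size up to a multiple of (num_speculative_tokens + 1).
    step = num_speculative_tokens + 1
    return sorted({-(-s // step) * step for s in sizes})

assert round_sizes_for_spec_decode([5, 12], num_speculative_tokens=2) == [6, 12]
```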

However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during
__init__ with the original [5, 12]. When
`set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates
`graph_params.events` keyed by {5, 12}. Meanwhile, the
`CudagraphDispatcher` uses the updated [6, 12] from
`compilation_config`, so it tries to capture at num_tokens=6 — causing
KeyError: 6 in `graph_params.events[num_tokens]` inside
`full_graph_fia`.
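
A minimal sketch of the sync fix described above; the base class is stubbed out and names are taken from the description rather than the actual vllm-ascend patch:

```python
class _VllmModelRunnerBase:
    """Stub standing in for the upstream vLLM model runner (illustrative only)."""

    def _check_and_update_cudagraph_mode(self) -> None:
        # Upstream may rewrite cudagraph_capture_sizes here, e.g. [5, 12] -> [6, 12]
        # via adjust_cudagraph_sizes_for_spec_decode when spec decode is enabled.
        self.compilation_config.cudagraph_capture_sizes = [6, 12]


class NPUModelRunner(_VllmModelRunnerBase):
    def _check_and_update_cudagraph_mode(self) -> None:
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) sizes so set_graph_params() and the
        # CudagraphDispatcher are keyed by the same values, avoiding KeyError: 6.
        self.cudagraph_batch_sizes = list(
            self.compilation_config.cudagraph_capture_sizes
        )
```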

You can also reproduce the issue with this script:
```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()

```


3. fix vllm-project/vllm#39604
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
1. For 310P, we are
- vLLM version: 
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 20, 2026
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
tfhddd pushed a commit to ascend-gha-runners/vllm-ascend that referenced this pull request Apr 21, 2026
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done
