[Model Runner V2] support auto resolve cudagraph mode/sizes based on attn backend #32936
Conversation
|
@WoosukKwon Could you please review this PR when you have time? |
Force-pushed ceb32e7 to 6536f23 (compare)
|
@WoosukKwon Now that #32771 has been merged, this follow-up PR (which adds the CUDA-graph safety checks / auto-adjustment logic) is unblocked. Could you or someone please take a look when you get a chance? Thanks! |
|
@izhuhaoran do you think you could rework/rebase this now that we've done a bunch of cudagraph rework/fixes? I actually did it myself for testing, in case you want to use that; I've pushed it to this branch: https://github.com/njhill/vllm/tree/mrv2-cg-mode-verify However, that might also not be the final state; we may want a bit more review and code simplification. Thanks!! |
@njhill Sure, thanks for your time on this PR, I'll rebase/rework this PR today
|
This pull request has merge conflicts that must be resolved before it can be merged. |
Force-pushed e13582e to ccd16b3 (compare)
|
@njhill I've updated the code, PTAL when you have time.
Force-pushed ccd16b3 to 9755b46 (compare)
@WoosukKwon I know it's not part of this PR, but I wonder if we should rename this to DefaultCudaGraphManager?
|
Thanks for this @izhuhaoran! I got Claude to review the PR; I have not looked at the suggestions in detail, but at a glance some of them at least look reasonable w.r.t. cleaner code structure. Not suggesting that all of these suggestions are correct/wanted, but perhaps you could consider them:
|
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Force-pushed a24f430 to 49d60d2 (compare)
|
Hi @izhuhaoran, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
|
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
|
@njhill Sorry for the delay! I have refactored the code based on Claude's draft suggestions above. Could you please take another look when you have time? |
njhill
left a comment
Thanks @izhuhaoran for the great work!
### What this PR does / why we need it?
Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`.

1. Fix vllm-project/vllm#32936. That PR leads to an issue in the capture phase (`_dummy_run`), see vllm-project/vllm#28207 (comment): we should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode` (a rough sketch of this pattern follows after this description).

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1); with `num_speculative_tokens=2`, [5, 12] becomes [6, 12]. However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6, causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.

You can also reproduce the issue with this script:

```python
import os
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f"  PROMPT: {o.prompt!r}")
        print(f"  OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()
```

3. Fix vllm-project/vllm#39604.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
1. For 310P, we are
- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------
Signed-off-by: wangli <wangli858794774@gmail.com>
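Below is a minimal, self-contained sketch of the "re-read after `super()`" pattern described above. All class and attribute names here are illustrative stand-ins, not the actual vLLM or vllm-ascend code; only the ordering of the re-read reflects the fix being discussed.

```python
# Hedged sketch: stand-in classes only, not real vLLM / vllm-ascend code.
class _BaseRunnerStub:
    """Stand-in for the vLLM runner that may rewrite the capture sizes."""

    def __init__(self):
        class _CompilationConfigStub:
            cudagraph_capture_sizes = [5, 12]

        self.compilation_config = _CompilationConfigStub()
        # Cached at init time, like self.cudagraph_batch_sizes in vllm-ascend.
        self.cudagraph_batch_sizes = list(self.compilation_config.cudagraph_capture_sizes)

    def _check_and_update_cudagraph_mode(self):
        # Pretend vLLM rounds sizes up for spec decode: [5, 12] -> [6, 12].
        self.compilation_config.cudagraph_capture_sizes = [6, 12]


class _NPURunnerSketch(_BaseRunnerStub):
    def _check_and_update_cudagraph_mode(self):
        # Let the parent adjust compilation_config.cudagraph_capture_sizes first.
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) sizes so the cached batch sizes and
        # the dispatcher key on the same values, avoiding the KeyError: 6 above.
        self.cudagraph_batch_sizes = list(self.compilation_config.cudagraph_capture_sizes)


runner = _NPURunnerSketch()
runner._check_and_update_cudagraph_mode()
print(runner.cudagraph_batch_sizes)  # [6, 12], now in sync with the dispatcher
```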
Purpose
A follow-up PR of #32771 and #32820.
After #32820 we can select any attention backend, but some of them have limitations for CUDA graphs. This PR, like Model Runner V1, adds a CUDA-graph check that adjusts the cudagraph mode & capture_sizes according to the attention backend. For example, with FLASHINFER + spec decode, a user-specified FULL_AND_PIECEWISE is automatically resolved to PIECEWISE. A rough sketch of the idea follows below.
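For illustration only, here is a minimal, self-contained sketch of the resolution idea. The enum and helper below are stand-ins, not vLLM's actual `CUDAGraphMode` plumbing; the FLASHINFER + spec-decode downgrade is the example mentioned above.

```python
# Hedged sketch: a stand-in enum and helper, not vLLM's real API. It only
# illustrates "downgrade the requested cudagraph mode when the attention
# backend can't support full-graph capture in this configuration".
from enum import Enum


class CudagraphModeSketch(Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


def resolve_cudagraph_mode(
    requested: CudagraphModeSketch,
    backend_supports_full_graphs: bool,
    spec_decode_enabled: bool,
) -> CudagraphModeSketch:
    """Downgrade the user-requested mode if the attention backend can't honor it."""
    wants_full = requested in (
        CudagraphModeSketch.FULL,
        CudagraphModeSketch.FULL_AND_PIECEWISE,
    )
    if wants_full and spec_decode_enabled and not backend_supports_full_graphs:
        # e.g. FLASHINFER + spec decode: fall back to piecewise-only graphs.
        return CudagraphModeSketch.PIECEWISE
    return requested


# FULL_AND_PIECEWISE with FLASHINFER + spec decode resolves to PIECEWISE.
print(resolve_cudagraph_mode(CudagraphModeSketch.FULL_AND_PIECEWISE, False, True))
```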