[Misc] Upgrade vllm commit to 0414 #8172
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request updates the reference to the vLLM main branch commit hash across the project's documentation and configuration files, ensuring that the project remains aligned with the latest upstream changes in the vLLM repository.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request updates the vLLM commit hash to 620e8924d9c6b2a0b1d49ac0dcf2588fffcbe390 in both the documentation configuration and the Model Runner V2 README. The reviewer pointed out that the current PR title and summary do not follow the repository's style guide and provided a formatted suggestion to fix this.
This pull request has conflicts, please resolve those before we can evaluate the pull request.

1 similar comment
Force-pushed 4a9fbd8 to 6c5e2c3
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed 8c1df8a to d840a97

Force-pushed dd9683f to e6ae840

Force-pushed e6ae840 to 39c1042
yiz-liu left a comment
We should use self.cudagraph_dispatcher.get_capture_descs() instead of self.cudagraph_batch_sizes now.
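As a rough illustration of that suggestion, here is a minimal sketch of deriving the current capture sizes from the dispatcher instead of the list cached at `__init__`. The return shape of `get_capture_descs()` and the `num_tokens` attribute on each descriptor are assumptions for illustration, not the actual vllm-ascend implementation:

```python
# Hedged sketch: read capture sizes from the dispatcher's capture
# descriptors rather than the stale self.cudagraph_batch_sizes list.
# The descriptor attribute name `num_tokens` is an assumption.
def current_capture_sizes(runner):
    descs = runner.cudagraph_dispatcher.get_capture_descs()
    return sorted({desc.num_tokens for desc in descs})
```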
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?

This PR is partially cherry-picked from #8172. It fixes mismatched capture sizes after rounding operations when using sp or speculative decoding. The root cause is that the original `self.cudagraph_capture_sizes` is no longer updated and remains at its initial values; we now use `self.cudagraph_dispatcher.get_capture_descs` to get the up-to-date sizes.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

By CI.

Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?

Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`

1. Fix vllm-project/vllm#32936. That PR leads to an issue in the capture phase (`_dummy_run`); see vllm-project/vllm#28207 (comment). We should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12]. However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6, causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.

You can reproduce the issue with this script:

```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f" PROMPT: {o.prompt!r}")
        print(f" OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()
```

2. Fix vllm-project/vllm#39604.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

1. For 310P, we are

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
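To make the failure mode and the fix concrete, here is a small self-contained sketch (plain Python, no vLLM imports) of the size rounding and of re-reading the rewritten sizes after the `super()` call. All class and helper names below are illustrative assumptions, not vLLM's or vllm-ascend's actual code:

```python
import math


def round_up_for_spec_decode(sizes, num_speculative_tokens):
    """Round each capture size up to a multiple of (num_speculative_tokens + 1),
    mirroring the adjustment described above (illustrative only)."""
    step = num_speculative_tokens + 1
    return sorted({math.ceil(s / step) * step for s in sizes})


class BaseRunnerSketch:
    """Stand-in for the upstream runner, which may rewrite the capture sizes."""

    def __init__(self, capture_sizes, num_speculative_tokens):
        self.cudagraph_capture_sizes = list(capture_sizes)
        self.num_speculative_tokens = num_speculative_tokens
        # Mimics the stale copy taken in __init__ before any adjustment.
        self.cudagraph_batch_sizes = list(capture_sizes)

    def _check_and_update_cudagraph_mode(self):
        self.cudagraph_capture_sizes = round_up_for_spec_decode(
            self.cudagraph_capture_sizes, self.num_speculative_tokens)


class NPUModelRunnerSketch(BaseRunnerSketch):
    def _check_and_update_cudagraph_mode(self):
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) sizes so that set_graph_params()
        # later keys graph_params.events on the same values the
        # CudagraphDispatcher will use at capture time.
        self.cudagraph_batch_sizes = list(self.cudagraph_capture_sizes)


runner = NPUModelRunnerSketch([5, 12], num_speculative_tokens=2)
runner._check_and_update_cudagraph_mode()
print(runner.cudagraph_batch_sizes)  # [6, 12] -> no more KeyError: 6
```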