[Misc] Upgrade vllm commit to 0414 #8172
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request updates the reference to the vLLM main branch commit hash across the project's documentation and configuration files, ensuring that the project remains aligned with the latest upstream changes in the vLLM repository.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request updates the vLLM commit hash to 620e8924d9c6b2a0b1d49ac0dcf2588fffcbe390 in both the documentation configuration and the Model Runner V2 README. The reviewer pointed out that the current PR title and summary do not follow the repository's style guide and provided a formatted suggestion to fix this.
This pull request has conflicts, please resolve those before we can evaluate the pull request.

1 similar comment
Force-pushed 4a9fbd8 to 6c5e2c3
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed 8c1df8a to d840a97

Force-pushed dd9683f to e6ae840

Force-pushed e6ae840 to 39c1042
yiz-liu left a comment
We should use self.cudagraph_dispatcher.get_capture_descs() instead of self.cudagraph_batch_sizes now.
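As a rough illustration of that suggestion, here is a minimal sketch of deriving the current capture sizes from the dispatcher instead of the list cached at `__init__`. The return shape of `get_capture_descs()` and the `num_tokens` attribute on each descriptor are assumptions for illustration, not the actual vllm-ascend implementation:

```python
# Hedged sketch: read capture sizes from the dispatcher's capture
# descriptors rather than the stale self.cudagraph_batch_sizes list.
# The descriptor attribute name `num_tokens` is an assumption.
def current_capture_sizes(runner):
    descs = runner.cudagraph_dispatcher.get_capture_descs()
    return sorted({desc.num_tokens for desc in descs})
```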
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?

This PR is partially cherry-picked from #8172. It fixes mismatched capture sizes after rounding operations when using sp or speculative decoding. The root cause is that the original `self.cudagraph_capture_sizes` is no longer updated and remains at its initial values; we now use `self.cudagraph_dispatcher.get_capture_descs` to get the up-to-date sizes.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

By CI.

Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?

Upgrade vllm commit to `6f786f2c506cb07f4566771fdc62e640e2c4a176`

1. Fix vllm-project/vllm#32936. That PR leads to an issue in the capture phase (`_dummy_run`); see vllm-project/vllm#28207 (comment). We should re-read `compilation_config.cudagraph_capture_sizes` after the `super()` call in `_check_and_update_cudagraph_mode` to keep `self.cudagraph_batch_sizes` in sync with the (possibly rewritten) sizes in `model_runner_v1.NPUModelRunner._check_and_update_cudagraph_mode`.

For example, when speculative decoding (e.g. eagle3) is enabled and `cudagraph_capture_sizes` is explicitly specified as [5, 12], vLLM's `_check_and_update_cudagraph_mode` calls `adjust_cudagraph_sizes_for_spec_decode`, which rounds `cudagraph_capture_sizes` up to a multiple of (num_speculative_tokens + 1). With `num_speculative_tokens=2`, [5, 12] becomes [6, 12]. However, in vllm-ascend, `self.cudagraph_batch_sizes` was cached during `__init__` with the original [5, 12]. When `set_graph_params(self.cudagraph_batch_sizes)` runs later, it creates `graph_params.events` keyed by {5, 12}. Meanwhile, the `CudagraphDispatcher` uses the updated [6, 12] from `compilation_config`, so it tries to capture at num_tokens=6, causing `KeyError: 6` in `graph_params.events[num_tokens]` inside `full_graph_fia`.

You can reproduce the issue with this script:

```python
import os

os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

EXAMPLE_PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

SAMPLING_PARAMS = SamplingParams(
    max_tokens=300,
    temperature=0.0,
    ignore_eos=False,
)


def run_spec():
    """Run with eagle3 speculative decoding."""
    llm = LLM(
        model="Qwen/Qwen3-8B",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        disable_log_stats=False,
        max_model_len=4096,
        seed=1024,
        async_scheduling=False,
        speculative_config={
            "disable_padded_drafter_batch": False,
            "method": "eagle3",
            "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
            "num_speculative_tokens": 2,
            "draft_tensor_parallel_size": 1,
            "max_model_len": 128,
        },
        compilation_config=CompilationConfig(
            cudagraph_mode="FULL",
            cudagraph_capture_sizes=[5, 12],
        ),
    )
    spec_outputs = llm.generate(EXAMPLE_PROMPTS, SAMPLING_PARAMS)
    del llm
    return spec_outputs


def main():
    spec_outputs = run_spec()
    for o in spec_outputs:
        print(f" PROMPT: {o.prompt!r}")
        print(f" OUTPUT: {o.outputs[0].text[:80]}...")


if __name__ == "__main__":
    main()
```

2. Fix vllm-project/vllm#39604.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

1. For 310P, we are

- vLLM version:
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
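To make the failure mode and the fix concrete, here is a small self-contained sketch (plain Python, no vLLM imports) of the size rounding and of re-reading the rewritten sizes after the `super()` call. All class and helper names below are illustrative assumptions, not vLLM's or vllm-ascend's actual code:

```python
import math


def round_up_for_spec_decode(sizes, num_speculative_tokens):
    """Round each capture size up to a multiple of (num_speculative_tokens + 1),
    mirroring the adjustment described above (illustrative only)."""
    step = num_speculative_tokens + 1
    return sorted({math.ceil(s / step) * step for s in sizes})


class BaseRunnerSketch:
    """Stand-in for the upstream runner, which may rewrite the capture sizes."""

    def __init__(self, capture_sizes, num_speculative_tokens):
        self.cudagraph_capture_sizes = list(capture_sizes)
        self.num_speculative_tokens = num_speculative_tokens
        # Mimics the stale copy taken in __init__ before any adjustment.
        self.cudagraph_batch_sizes = list(capture_sizes)

    def _check_and_update_cudagraph_mode(self):
        self.cudagraph_capture_sizes = round_up_for_spec_decode(
            self.cudagraph_capture_sizes, self.num_speculative_tokens)


class NPUModelRunnerSketch(BaseRunnerSketch):
    def _check_and_update_cudagraph_mode(self):
        super()._check_and_update_cudagraph_mode()
        # Re-read the (possibly rewritten) sizes so that set_graph_params()
        # later keys graph_params.events on the same values the
        # CudagraphDispatcher will use at capture time.
        self.cudagraph_batch_sizes = list(self.cudagraph_capture_sizes)


runner = NPUModelRunnerSketch([5, 12], num_speculative_tokens=2)
runner._check_and_update_cudagraph_mode()
print(runner.cudagraph_batch_sizes)  # [6, 12] -> no more KeyError: 6
```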