[Bugfix] Cap SWA/chunked-local runtime admission to startup pool-sizing bound #40946
njhill merged 4 commits into vllm-project:main
`SlidingWindowSpec.max_memory_usage_bytes` and `ChunkedLocalAttentionSpec.max_memory_usage_bytes` size the pool at startup with a recycling-aware bound (one window/chunk window + `max_num_batched_tokens`, plus a 1-block alignment slack for SWA). At runtime, however, `SingleTypeKVCacheManager.get_num_blocks_to_allocate` returns `cdiv(num_tokens, block_size)` for a fresh request, since `get_num_skipped_tokens(0) == 0`. That over-counts: chunked prefill invokes `remove_skipped_blocks` between chunks, which swaps out-of-window blocks for the null block and returns their slots to the pool, so the per-request real-held block count plateaus at the recycling-aware bound.

The mismatch deadlocks long prompts on hybrid full+SWA models when the pool is sized at the startup minimum -- the admission gate rejects what startup was sized to admit (issue vllm-project#39734).

Fix:
- Hoist the recycling-aware bound onto the spec as `max_admission_blocks_per_request`, and have `max_memory_usage_bytes` call it so the startup pool sizer and the runtime admission gate share one source of truth (drift would re-introduce vllm-project#39734 or, worse, mid-prefill OOM).
- Plumb `max_num_batched_tokens` through `KVCacheManager` -> `KVCacheCoordinator` -> `get_manager_for_kv_cache_spec`. `KVCacheManager` defaults the parameter to `max_model_len` (a no-op cap) so non-scheduler call sites keep their prior behavior; the scheduler and the simple CPU offload scheduler pass the real value.
- `SlidingWindowManager` and `ChunkedLocalAttentionManager` cap demand at the same per-request bound in `get_num_blocks_to_allocate`.

The invariant remains `sum(reservations) <= pool` and per-request peak <= reservation (held by `remove_skipped_blocks`), so total real-held <= pool.

Tests:
- `test_can_fit_full_sequence_swa_cap_admits_long_prompt`: hybrid full+SWA with the pool at the startup minimum admits a prompt longer than the SWA window + chunk.
- `test_can_fit_full_sequence_full_attention_still_gates_oversized`: the cap doesn't loosen the full-attention gate.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
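To make the mismatch concrete, here is a minimal sketch, not the vLLM implementation. The function names, the example sizes, and the exact form of the cap (window + `max_num_batched_tokens`, rounded up to blocks, plus one block of slack) are assumptions taken from the prose above; the real spec methods may differ in off-by-one terms.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def uncapped_blocks(num_tokens: int, block_size: int) -> int:
    # Pre-fix runtime demand for a fresh request: every prompt token is
    # counted, because nothing has been skipped yet.
    return cdiv(num_tokens, block_size)


def swa_admission_cap(window: int, max_num_batched_tokens: int,
                      block_size: int) -> int:
    # Hypothetical stand-in for the shared per-request bound: one window
    # plus one scheduler step of tokens, plus one block of alignment slack.
    return cdiv(window + max_num_batched_tokens, block_size) + 1


def capped_blocks(num_tokens: int, window: int,
                  max_num_batched_tokens: int, block_size: int) -> int:
    # Post-fix runtime demand: never ask for more than startup sized for.
    cap = swa_admission_cap(window, max_num_batched_tokens, block_size)
    return min(uncapped_blocks(num_tokens, block_size), cap)


if __name__ == "__main__":
    block_size, window, step = 16, 1024, 2048   # illustrative sizes only
    pool = swa_admission_cap(window, step, block_size)  # startup minimum
    prompt = 20_000
    print("pool blocks:    ", pool)
    print("uncapped demand:", uncapped_blocks(prompt, block_size))
    print("capped demand:  ", capped_blocks(prompt, window, step, block_size))
```

With these sizes the uncapped demand exceeds the pool, so the request can never be admitted, while the capped demand equals the pool size and fits. Short prompts are unaffected because `min` leaves their demand unchanged.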
Code Review
This pull request implements a recycling-aware admission cap for Sliding Window Attention (SWA) and Chunked Local Attention to ensure consistency between startup pool sizing and runtime admission gating. By introducing `max_admission_blocks_per_request` in the attention specifications and overriding `get_num_blocks_to_allocate` in the corresponding managers, the system now correctly accounts for block recycling during chunked prefill, preventing unnecessary request rejections for long prompts. The changes also include propagating `max_num_batched_tokens` through the KV cache coordination layers and adding comprehensive tests to verify the new admission logic. I have no feedback to provide.
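The recycling behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not the manager's code: `peak_held_blocks` is a hypothetical helper, and the skipped-token rule `num_computed - window + 1` is an assumed form of `get_num_skipped_tokens` (the PR text only states that it returns 0 for a fresh request).

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def peak_held_blocks(prompt_len: int, window: int,
                     max_num_batched_tokens: int, block_size: int) -> int:
    """Simulate chunked prefill and return the most blocks held at once."""
    num_computed = 0
    peak = 0
    while num_computed < prompt_len:
        chunk = min(max_num_batched_tokens, prompt_len - num_computed)
        # Assumed rule: tokens that have fallen out of the window are
        # skipped, and only fully skipped blocks go back to the pool.
        skipped_tokens = max(0, num_computed - window + 1)
        freed_blocks = skipped_tokens // block_size
        # Blocks needed to cover everything up to the end of this chunk.
        total_blocks = cdiv(num_computed + chunk, block_size)
        peak = max(peak, total_blocks - freed_blocks)
        num_computed += chunk
    return peak


if __name__ == "__main__":
    block_size, window, step = 16, 1024, 2048   # illustrative sizes only
    cap = cdiv(window + step, block_size) + 1   # assumed per-request bound
    for prompt_len in (500, 5_000, 20_000, 200_000):
        peak = peak_held_blocks(prompt_len, window, step, block_size)
        print(prompt_len, peak, peak <= cap)
```

Under these assumptions the held-block count stops growing once the prompt exceeds one window plus one step, which is why capping admission demand at the same bound keeps per-request peak <= reservation.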
njhill left a comment
Thanks @Dao007forever! I just have some minor simplification suggestions, see: njhill@f0c7ac6
…ng bound (vllm-project#40946) Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Hey @Dao007forever do we need #39866 with your PR? Seems that we don't.

Hey @jhaotingc, thanks for reaching out! I don't think we need #39866 since they are doing the same thing, just at different places (coordinator vs manager). I don't have perf data for Gemma-4, I only have GB200 and don't think it'd be interesting :(.
Thanks @Dao007forever!
commit 8586369f617a964235d0d9d32d6ebb1076a4581d
Author: Matthew Santiago <carag.matthew@gmail.com>
Date: Sat May 2 01:22:14 2026 -0500
Refactor Step3Text loading to use AutoWeightsLoader (#41492)
Signed-off-by: Matthew Santiago <carag.matthew@gmail.com>
commit ae3b4deb8a5987759d4732e67767146a46ee72ed
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Sat May 2 13:27:43 2026 +0800
[Doc] Add Codex usage example (#41358)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
commit c293ccc58ef6e1a0976a62f79f57bc045108073d
Author: Rita Brugarolas <Rita.BrugarolasBrufau@amd.com>
Date: Fri May 1 21:13:15 2026 -0700
[ROCm][Bugfix] Fix init-time bias dtype cast when gate.out_dtype is None (#41405)
Signed-off-by: Rita Brugarolas Brufau <rita.brugarolasbrufau@amd.com>
commit d58c42e19cb792e24eb335b75164356a4f71bff0
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date: Fri May 1 23:41:15 2026 -0400
[vLLM IR] 2/N fused_add_rms_norm and maybe_inplace overload (#36823)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
commit 3e49479c4b766a601804f0c6f5f1c9a3def5ad0c
Author: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Date: Fri May 1 23:19:07 2026 -0400
Limit concurrency on `test_transcription_api_correctness.py` (#41478)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
commit 964a4bc2a57aca2a42d04538b27cab4d333d0f5d
Author: John Calderon <81483067+johncalesp@users.noreply.github.com>
Date: Fri May 1 23:10:14 2026 -0400
[MM][CG] Support ViT CG for Qwen2.5-VL (#40830)
Signed-off-by: John Calderon <jcalderon@nvidia.com>
commit c408fdd663afb34ab82a10b26f553bec9e8052d9
Author: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Date: Sat May 2 05:06:54 2026 +0200
[Fix] Sync gemma4 chat template from hf (#39570)
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
commit 5737770c6c346d918fdfb13e9378f9514f616186
Author: Andy Lo <andy@mistral.ai>
Date: Sat May 2 00:01:37 2026 +0100
Re-enable allreduce rms fusion for DP / PP (#41458)
Signed-off-by: Andy Lo <andy@mistral.ai>
commit 0c99629ede51524f00b88cb758c895fd76a5f6f9
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 1 17:45:03 2026 -0400
[Build] Make bundled DeepGEMM wheel portable across Python versions (#41476)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit edd60ac93a3247c7ef1bf1e2a3e9c0e95bc83bf6
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Fri May 1 17:42:52 2026 -0400
[Bugfix] Fix persistent_topk inter-CTA init race on RadixRowState (#41444)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
commit bcf5cac9fb956788f649d1f5297b74c886a9d6d3
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Fri May 1 15:23:17 2026 -0400
[DSV4] Add knob to enable pre-attn gemm (#41443)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
commit a9484dac7b734096ed26db4902454da7e497d2c3
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Sat May 2 03:01:17 2026 +0800
[Perf] Intergrate Tile Kernels `head_compute_mix_kernel` for Deepseek-V4 (#41255)
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit f3fef123504db07b3ac83ad4ef677915b53e8386
Author: Matthew Bonanni <mbonanni@redhat.com>
Date: Fri May 1 13:36:20 2026 -0400
[Attention] Abstract the MLA prefill backends and eliminate cuDNN (#32623)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 51295793a2eed0eefc7505cb9a7d5f96effd7773
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Fri May 1 13:02:03 2026 -0400
[Model Runner V2] Add `logprob_token_ids` support (#40559)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 3ccc1ff4958dd07dbffeaa1c48463325c892b518
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 1 12:00:38 2026 -0400
[Eval][CI] Add basic mrcr eval to tests/evals/ (#40164)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit 529c671e8075d265a48b72e0eaaeb5e30d2f1630
Author: vllmellm <vllm.ellm@embeddedllm.com>
Date: Fri May 1 23:07:18 2026 +0800
[ROCm][FEAT] AITER Fused Allreduce + RMSNorm (#37646)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Rita Brugarolas Brufau <rita.brugarolasbrufau@amd.com>
Signed-off-by: junkang1991 <junkangchow@gmail.com>
Co-authored-by: Rita Brugarolas <Rita.BrugarolasBrufau@amd.com>
Co-authored-by: junkang1991 <junkangchow@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
commit bc635fad2389e228a31d6bc6e698caf53d395e13
Author: Pleaplusone <ygan@amd.com>
Date: Fri May 1 22:06:00 2026 +0800
[ROCm][Deepseek] dsv3.2 further optimization (#41217)
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
commit c3e64696cdea5df92eb15e20b18ba979a536c1e3
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Fri May 1 17:04:11 2026 +0300
[Perf] Warmup forward_native sampler kernel (#41375)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 4f7bde572ad05a6c013dde5f8874898fff3c1253
Author: sungsoo ha <sungsooh@nvidia.com>
Date: Fri May 1 06:01:17 2026 -0700
[Kernel] Pack output and LSE in DCP A2A (#41160)
commit 2fa1f8ec00cf85b15422cb4c0e8eb3632ee13ea8
Author: Or Ozeri <oro@il.ibm.com>
Date: Fri May 1 14:30:03 2026 +0300
[kv_offload+HMA][13/N]: Enable HMA support (#41445)
This is the final PR in a series to enable HMA support for the
offloading connector. The connector advertises `SupportsHMA`
and is validated with unit tests and e2e tests.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
commit 7075df79b3094bb6f6d28021c4df8631af10b2b8
Author: raviguptaamd <ravi.gupta@amd.com>
Date: Fri May 1 02:18:30 2026 -0700
[ROCm] Enable DBO (Dynamic Batch Optimization) on ROCm (#34726)
Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
commit 0dbaf9daad2031235344428d2a574496bb4d9a3b
Author: Yuyi Ao <yuyiao772@gmail.com>
Date: Fri May 1 05:07:23 2026 -0400
Refractor longcat loading to use AutoWeightsLoader (#41448)
Signed-off-by: George-ao <yuyiao772@gmail.com>
commit a3ec4a35f5943c250974d504706d22297d423468
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Fri May 1 00:43:39 2026 -0700
[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared labeled child (#40840)
When vLLM runs with Ray Prometheus `vllm:request_success{finished_reason=...}`
only ever increments the repetition bucket regardless of the request's actual finish
reason; stop, length, abort, and error stay at zero. Root cause was `labels()` mutated
the wrapped Ray metric's default tags in place and returned self, so every `.labels(...)`
call on a given wrapper returned the same object.
Co-authored-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 32964e770041fa4124f98c66efb3ff721ca608b6
Author: Andreas Karatzas <akaratza@amd.com>
Date: Fri May 1 02:40:47 2026 -0500
[ROCm][CI] Upgraded UCX and RIXL (#41210)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit a07642667db1284ad2128d7f9ef089e6b0d24a4c
Author: Bugen Zhao <i@bugenzhao.com>
Date: Fri May 1 14:38:02 2026 +0800
[Bugfix] Pass reasoning parser kwargs to structured output (#41199)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit c3868bbbe4b160d89adf339bcc069f6956314345
Author: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Date: Fri May 1 01:08:34 2026 -0400
[compile] Add FlashInfer FP8 async TP fusion and preserve allreduce fusion ordering #27893 (#39505)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
commit 947138b6c22f9a4751b63b9aa75a2bc4b42835e9
Author: sychen52 <41452870+sychen52@users.noreply.github.com>
Date: Thu Apr 30 21:55:16 2026 -0700
Add nvfp4 kv cache support (#40177)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
commit 941fb5083552516eee947fc5f6c4d2031af76ea4
Author: Or Ozeri <oro@il.ibm.com>
Date: Fri May 1 06:59:17 2026 +0300
[kv_offload+HMA][12/N]: Scheduler-side support for sliding window groups (#41228)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
commit 6b6ac6c3c737b69e99264731f010588613dada58
Author: Juhi Mittal <39641197+juhi10071998@users.noreply.github.com>
Date: Thu Apr 30 20:37:43 2026 -0700
[Kernel][MoE] Support GELU on TRT-LLM NvFP4 fused MoE for Gemma4 (#41050)
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b542bdf7fb3611879b30917e03bef62963496e83
Author: Stefano Castagnetta <scastagnetta@nvidia.com>
Date: Fri May 1 05:08:49 2026 +0200
[Bugfix] Disable FlashInfer CUTLASS MoE on SM110 (Jetson Thor AGX) (#40808)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
commit 415a8798996c518bbc33f377c9652c2776e074a7
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Fri May 1 05:18:38 2026 +0300
[KV Offload] Use `Collection` instead of `Sequence/Iterable` for OffloadingManager key parameters (#41361)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit 7198940b395e74337101e7b62437c55a7e1d4284
Author: Dong W <89223086+sniper35@users.noreply.github.com>
Date: Thu Apr 30 19:06:48 2026 -0700
[Model] Add Moondream3 model support(only query and caption skills) (#32325)
Signed-off-by: Dong Wang <dongw2019@gmail.com>
commit 14043dfecd35dd2f12b4d51eb9fa166184a0ca0f
Author: Luis 🚀 <luisfabian1545@gmail.com>
Date: Thu Apr 30 22:05:55 2026 -0400
feat: Enable `prompt_embeds` Content Part Support in vLLM Chat Completions API (#40720)
Signed-off-by: Luis Robaina <luis@protopia.ai>
Signed-off-by: Luis Robaina 🚀 <luisfabian1545@gmail.com>
Signed-off-by: LuisRobaina <luis@protopia.ai>
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com>
commit 1adaa5056b0e1b7ce9918d30baa5c7b8b7d86e0d
Author: Andreas Karatzas <akaratza@amd.com>
Date: Thu Apr 30 20:59:35 2026 -0500
[ROCm][CI] Add ROCm score absolute tolerance floor (#41341)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 4d5c89295b763e642129fcf598580ec63dc1d45f
Author: Soyaazz <523420504@qq.com>
Date: Fri May 1 09:59:26 2026 +0800
(bugfix): block_size check for flex attn (#41363)
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
commit dd5506a15759abfc28ed1ca2746221c9ab5e1180
Author: Nick Hill <nickhill123@gmail.com>
Date: Thu Apr 30 18:10:00 2026 -0700
[Core] Simplify handling of `scheduler_reserve_full_isl` option (#41064)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit a3c83ff2fd050bc6392260d6e84ee1150f238f26
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Thu Apr 30 21:09:55 2026 -0400
Faster per-token fp8 group quant packed kernel for blackwell (#41326)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit 9c61864bf8a911a8369f35d79d538c7f11cf3dc2
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Thu Apr 30 16:28:57 2026 -0700
[DeepSeek] Use torch.mm for bf16xbf16->fp32 gemm (#41300)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 71725f6730ec7076ee914ba06fae482a10f2f159
Author: Tran Le <43319264+lequytra@users.noreply.github.com>
Date: Thu Apr 30 16:19:59 2026 -0700
[Bugfix] Fix RoutedExpertsCapturer for Gemma 4 MoE (top_k_experts) (#41401)
Signed-off-by: Tran Le <tranle@fireworks.ai>
commit b4806c8ee12d5c5bbfebd6070b389e2f4daad1fd
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Thu Apr 30 18:33:12 2026 -0400
[DSV4] Add BF16 and MXFP8 A2A support for flashinfer a2a one sided (#40960)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 526927be94c63ee677840c9074754fb6fdbdf5a1
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu Apr 30 18:20:11 2026 -0400
[Model Runner v2] Fix v2 compile counter `num_gpu_runner_capture_triggers` and `num_cudagraph_captured` (#41285)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 75a4c166f25f8407591328d7ed92aa2231b0b841
Author: Michael Goin <mgoin64@gmail.com>
Date: Thu Apr 30 18:02:14 2026 -0400
Fix typo in log message for indexer cache (#41419)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
commit 2917d6363ad722bff647168d3a36261254d7ad42
Author: fxmarty-amd <felmarty@amd.com>
Date: Thu Apr 30 23:35:48 2026 +0200
[NVFP4][Hopper/AMD Instinct] Add Triton kernels for NVFP4 dequantization and QDQ emulation (#40033)
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit efb4cdf2b8000c850d04706eb6f788903e3ee544
Author: Stefano Castagnetta <scastagnetta@nvidia.com>
Date: Thu Apr 30 21:47:55 2026 +0200
[CI/Build] Skip Prithvi/Terratorch model-registry tests when terratorch is missing (#41389)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
commit 92a7c121b62a1484b68c0a27d1ecefd1a84f78fc
Author: Stefano Castagnetta <scastagnetta@nvidia.com>
Date: Thu Apr 30 21:24:09 2026 +0200
[CI] Add MTP coverage: Qwen3.5 correctness + no-sync spec decode (#40472)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 307b17ce33165b41c490a099f54f0d0cd12e7f76
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Fri May 1 00:57:27 2026 +0800
[DSV4] Avoid redundant dtype conversion. (#41374)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
commit 3ca6ca210fc2cbf786713f515a6af2fba875884b
Author: wenjun liu <wenjun.liu@intel.com>
Date: Fri May 1 00:02:23 2026 +0800
xpu docker: pin oneAPI to 2025.3 and avoid unintended 2026 upgrade (#41380)
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 10558f5f4608f825d06018def1317e7d3a96d6fe
Author: Stefano Castagnetta <scastagnetta@nvidia.com>
Date: Thu Apr 30 16:59:07 2026 +0200
[CI/Build] Skip terratorch + torchgeo while PyPI has lightning quarantined (#41377)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
commit 121dbe7a221d2bf6415caea0acb812c666a7665e
Author: tej <37236721+itej89@users.noreply.github.com>
Date: Thu Apr 30 09:46:59 2026 -0500
[ROCm] ROCm DeepEP API updated to latest (#39721)
Signed-off-by: Tej Kiran <vpolamre@amd.com>
Signed-off-by: tej <37236721+itej89@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: HAIAI <39548240+HAIAI@users.noreply.github.com>
commit f03d82efdd88fbd85ddf7a5475e237ae3abaf01e
Author: Matthew Bonanni <mbonanni@redhat.com>
Date: Thu Apr 30 10:46:54 2026 -0400
[UX][Bugfix] Fix OOM by setting PyTorch `max_split_size_mb` during model loading (#41268)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
commit a7fb00851030b6991987c559b885dfd8eb039d15
Author: Ilya Markov <markovilya197@gmail.com>
Date: Thu Apr 30 16:46:49 2026 +0200
[EPLB] Optimize memory overhead in Nixl communicator (#40013)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
commit ff449b6426812d1e5e107715af899fcff5e81419
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Thu Apr 30 13:48:38 2026 +0100
Stop mergify labelling from skipping pre-commit (#41362)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 3527229517f01a5f2406fa6fbf35ff9223c65ed5
Author: Stefano Castagnetta <scastagnetta@nvidia.com>
Date: Thu Apr 30 14:06:44 2026 +0200
[Doc] Fix RTD build: pytorch.org/docs/stable/objects.inv returns 404 (#41353)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit b55b26520c088eeed791cbe4648c73ea015f3613
Author: Xiaoshuang Wang <1790571317@qq.com>
Date: Thu Apr 30 18:31:08 2026 +0800
[MoE] Make MoERunnerInterface a PluggableLayer for OOT support (#35178)
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 3179e53135dbc0bbb9d845fb56e5380a6b88157e
Author: snadampal <87143774+snadampal@users.noreply.github.com>
Date: Thu Apr 30 03:14:20 2026 -0700
[P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes (#32553)
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
commit efdc95674db5c7b441d52ae02fa57e57c6bb3855
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Thu Apr 30 11:10:50 2026 +0200
[KVConnector] MultiConnector SupportsHMA (#39571)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 54146a9bf951b8c70ad85fb1a1bee241964209e0
Author: Chenxi Qian <chenxi.qian.cq@outlook.com>
Date: Thu Apr 30 16:22:41 2026 +0800
[Bugfix] correct h matrix layout in chunk_kda output kernel (#40956)
Signed-off-by: ChenxiQian <chenxi.qian.cq@outlook.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit ca97f7b9bbf2e904065dafc6918af3f9f386fdf0
Author: Baekpica <35071468+Baekpica@users.noreply.github.com>
Date: Thu Apr 30 16:12:42 2026 +0900
Fix Gemma4 MoE expert weight remapping (#41206)
Signed-off-by: sunghoon.baek <sunghoon.baek@connectfy.cloud>
Co-authored-by: sunghoon.baek <sunghoon.baek@connectfy.cloud>
Co-authored-by: OpenAI Codex <codex@openai.com>
commit a04e0cf3b8cd7bf6b643eacab15033025f462166
Author: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Date: Thu Apr 30 02:39:04 2026 -0400
Fix Cohere ASR after HF upgrade (#40582)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
commit cb1b02d0e8159f25678b96667651ec539546022b
Author: Dhruv Singal <dhruvsingalabc@gmail.com>
Date: Wed Apr 29 23:19:09 2026 -0700
[Frontend] Add VLLM_SKIP_MODEL_NAME_VALIDATION environment variable (#34676)
Signed-off-by: Dhruv Singal <dhruvsingalabc@gmail.com>
Signed-off-by: Dhruv Singal <dsingal@Dhruvs-MacBook-Pro.local>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: vLLM Assistant <assistant@vllm.ai>
Signed-off-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Dhruv Singal <dsingal@Dhruvs-MacBook-Pro.local>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: OpenCode <noreply@openai.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
commit a749a33d8d05acdd3ab346bd3f0c6b5c9c80474f
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Thu Apr 30 00:03:45 2026 -0400
[Bugfix] Fix persistent_topk cooperative deadlock at TopK=1024 (#41189)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit c42981d034713da814913ebd2e53269346f3ecea
Author: Martin Hickey <martin.hickey@ie.ibm.com>
Date: Thu Apr 30 03:55:31 2026 +0100
[Refactor][kv_offload] KV Offloading maintainability improvements (#40538)
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
commit 0ff1bf9bb1ee31ba1f416a4688e705be92643711
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Wed Apr 29 21:44:07 2026 -0400
[Bugfix] Fix failure to allocate KV blocks error (#41282)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
commit 0ab67c02225c18ed664864f75c3a38c0659f4be1
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Wed Apr 29 16:59:16 2026 -0700
[CI] Add key field to all test_areas pipeline steps (#41201)
Signed-off-by: khluu <khluu000@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit 3795d7acf431980e62e738493f437ae2a51549da
Author: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Date: Wed Apr 29 18:39:01 2026 -0500
[ROCm][Bugfix][GPTOSS]: fix input_ids and expert_map args for quark w4a8 gptoss (#41165)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
commit 18599bfdf2f9dc117b004491f1eba766934310fc
Author: Nick Hill <nickhill123@gmail.com>
Date: Wed Apr 29 16:31:00 2026 -0700
[Ci][BugFix] Fix slow DP tests due to bad teardown logic (#41166)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 296741d0257107a9d0301409005c85d38bb247bc
Author: Thien Tran <gau.nernst@yahoo.com.sg>
Date: Thu Apr 30 06:16:40 2026 +0700
[DSv4] Use `cvt` PTX for FP32->FP4 conversion (#41015)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
commit a966aaed30b9132191e8ba88d4f61e76657b690d
Author: Uranus <109661872+UranusSeven@users.noreply.github.com>
Date: Thu Apr 30 07:14:50 2026 +0800
[Bugfix][MLA] Size arange_buffer to max_num_batched_tokens to prevent CUDA IMA (#39277)
Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
commit 6841f5dc77e9200a2fa45a4bf935b23bd843bf30
Author: Hemanth Acharya <heachary@amd.com>
Date: Thu Apr 30 04:37:46 2026 +0530
[ROCm] Add env flags to disable dynamic MXFP4 quant and enable AITER tuned GEMMs for Attention Projection Layers (#39987)
Signed-off-by: Hemanth Acharya <heachary@amd.com>
commit c2fb013312e107c6809b1bf5cc4f22e499e1b81d
Author: roikoren755 <26850796+roikoren755@users.noreply.github.com>
Date: Thu Apr 30 00:59:18 2026 +0300
[Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraphWrapper (#41235)
Signed-off-by: Roi Koren <roik@nvidia.com>
commit ccfb620c62533c0dbfa8d5a0307fab9682b7c29f
Author: Rishi Puri <riship@nvidia.com>
Date: Wed Apr 29 18:56:56 2026 -0300
Create tests/distributed/test_mnnvl_alltoall.py (#35241)
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Claude <claude@anthropic.com>
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com>
commit 0335316a9ba245e5e82a20ef1b53ba3da108afd5
Author: Aaron Hao <ahao@anyscale.com>
Date: Wed Apr 29 14:51:03 2026 -0700
[BUG] Two phase pause to prevent deadlock (#39366)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Junjie Zhang <junj.jay.zhang@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 944e138bcf39e9236bbfd49d98f00fb45e6cea54
Author: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Date: Wed Apr 29 16:39:03 2026 -0500
[ROCm][Bugfix]: W4A4 MOE using emulation instead of AITER on MXFP4-supported hardware (#41175)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
commit b58669cb427effa49928b7be5b6e0f4fd707bce5
Author: Luochao Wang <wangluochao902@gmail.com>
Date: Wed Apr 29 14:20:13 2026 -0700
[Perf][Spec Decode] Avoid per-step numpy allocation in prepare_next_t… (#41043)
Signed-off-by: wangluochao902 <wangluochao902@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 1628239eb234739646e21c4053e3fa652058e245
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Thu Apr 30 05:16:19 2026 +0800
[Multimodal][Render] Skip mm processor initialization and warmup for text-only mode (#41246)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 93da1fe97abf71ac81e7daea21547292f9b39aa4
Author: yzong-rh <yzong@redhat.com>
Date: Wed Apr 29 17:01:57 2026 -0400
[CI] Add temperature to bfcl eval, default greedy (#41059)
Signed-off-by: Yifan Zong <yzong@redhat.com>
commit 169988a3c0e0912fc20be2d104a4b76a51ad9fa4
Author: Andrew Barnes <bortstheboat@gmail.com>
Date: Wed Apr 29 16:46:01 2026 -0400
[ROCm] Use quant_dtype in per_token_quant instead of hardcoded FP8 (#39121)
Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
commit faab18955407f256c7ced2d227ce097f472db16d
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Thu Apr 30 03:15:35 2026 +0800
[Feature]: IndexCache support for DSA models (#37735)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 6f20f81cbf1d12dc9d499d25ea0a64ef4c816c00
Author: Laith Sakka <lsakka@meta.com>
Date: Wed Apr 29 11:32:15 2026 -0700
Replace shape_invariants with simpler apprach in dynamic_arg_dims utilizing shape_id property. (#36194)
Signed-off-by: Laith Sakka <lsakka@meta.com>
commit d1a75e303d81eaaa3d0bb5622e0a6d380ccc22fa
Author: danisereb <daserebrenik@nvidia.com>
Date: Wed Apr 29 20:39:49 2026 +0300
Fix timeout when using LoRA adapters with Nemotron Super (#40916)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
commit 4a42aba380bf9cac47009a7307a9d91dd2222d84
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date: Thu Apr 30 00:48:52 2026 +0800
[CI/Build] Enable FP8 on NVIDIA Thor (#39712)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
commit a80d6f150c39e9e7121a54d293aa1d09619118c2
Author: Avshalom Manevich <avshalom.manevich@hcompany.ai>
Date: Wed Apr 29 18:48:47 2026 +0200
better logging for large uncachable items (#41145)
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai>
commit 91a2d3901416fcff11e192f32683ca963726989b
Author: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com>
Date: Wed Apr 29 11:54:54 2026 -0400
[Models] Cohere MoE (#40817)
Signed-off-by: Terrencezzj <terrence@cohere.ai>
commit a05848e255614401e3813c656b8cfa94969952d4
Author: Frederik Gossen <frgossen@meta.com>
Date: Wed Apr 29 11:32:03 2026 -0400
[Bugfix] Report compile time for in-memory cache hit path (#41023)
Signed-off-by: Frederik Gossen <frgossen@meta.com>
commit 51fda1ba44ff3fd08e9202ce4f404cf3a1feaec1
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed Apr 29 11:30:33 2026 -0400
[Model Runner v2] Fix block table IMA issue (#40648)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 39a7f4f4e2635012ead0ad127970d7b6778890af
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed Apr 29 11:11:04 2026 -0400
[Perf] Optimize `AllPool.forward` by slicing first, 51% faster in the method level benchmark (#41163)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit b92ef9ec5a041b538f44d9584bef0e34bfbeecd1
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Wed Apr 29 18:10:34 2026 +0300
[Perf] Enable FlashInfer top-k/top-p sampler by default (#40376)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit 5560cac7e25b1a2c3c15506c885af4911c5611d9
Author: Lalithnarayan C <Lalithnarayan.C@amd.com>
Date: Wed Apr 29 19:51:55 2026 +0530
[Bugfix][CPU] Backport PT cpp codegen indirect_assert scalar-mask fix (#40973)
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 5b39b268f506150dbab38f6f6c04b7c843e37c07
Author: pmaybank <113125070+pmaybank@users.noreply.github.com>
Date: Wed Apr 29 13:57:58 2026 +0100
hf_name argument for vllm bench throughput CLI (#41012)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 22524f7a92b71c8e65eade20ef274fa3b4006d3e
Author: Tianmu Li <tianmu.li@intel.com>
Date: Wed Apr 29 05:43:21 2026 -0700
[Feat] CPU fp8 attn for AMX/AVX-512 (#39445)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
commit 9d8ad5b408bf447e41a3629fc21a453720aaf52b
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Wed Apr 29 20:29:55 2026 +0800
[Bugfix] Fix repeated DSv4 RoPE cache initialization (#41148)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 11b69129e2221b64302fb672552c0bc04dddece5
Author: Jared Wen <w13431838023@gmail.com>
Date: Wed Apr 29 19:35:50 2026 +0800
[Frontend] Add `defer_loading` and `tool_reference` support for Anthropic and OpenAI APIs (#40190)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
commit 33f36d42605a476a09ed75936e7c931cb8b432c5
Author: Bugen Zhao <i@bugenzhao.com>
Date: Wed Apr 29 19:03:47 2026 +0800
[DSV4] Support `max` reasoning effort (#40982)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
commit 37e288214bc3fa89d974b4d323373f2b2878d604
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Wed Apr 29 13:50:42 2026 +0300
[KV Offload] Tighten `keys` type from `Iterable` to `Sequence` in `OffloadingManager` (#41200)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit 5371d6fb4023a1a08021135e46e9354ba0923e50
Author: Rohit Kumar Singh <9626333+SKRohit@users.noreply.github.com>
Date: Wed Apr 29 15:47:51 2026 +0530
Fix PP in Gemma4 (#40786)
Signed-off-by: Rohit kumar Singh <rksingh@habana.ai>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
commit 6d7d4da99e41c4ccc0d52d74e2bf36d1ff31034d
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Wed Apr 29 18:08:55 2026 +0800
[Bugfix] BailingMoeV2.5: rotate full qk_rope_head_dim in MLA RoPE (#41185)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit 3f1a4bb639a9b65e2634a6529c90da36944d6472
Author: Alec <35311602+alec-flowers@users.noreply.github.com>
Date: Wed Apr 29 03:07:41 2026 -0700
build: embed image provenance metadata in vLLM containers (#40653)
Signed-off-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
commit 762022cafb1afc4c51ce706c043e2f1f5826003a
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Wed Apr 29 17:55:07 2026 +0800
[Bugfix] DSV32/V4 add missing type conversion for non-streaming tool calls (#41198)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
commit 3885d340a4779c54662b10892555ae6928b3a090
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Wed Apr 29 17:11:27 2026 +0800
[Frontend]Responses API supports Tool/Function calling with streaming with named tool/function (#41110)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
commit ef70057ca76688fc786c7fdee926ce2bd129b2c0
Author: haosdent <haosdent@gmail.com>
Date: Wed Apr 29 16:28:45 2026 +0800
[CI][CPU] Split CPU-Distributed Tests into per-scenario labels (#41203)
Signed-off-by: haosdent <haosdent@gmail.com>
commit e48cb85185d792f5b4a595c2af3cbc37ac742aac
Author: Shengqi Chen <harry-chen@outlook.com>
Date: Wed Apr 29 15:37:14 2026 +0800
[CI/Build] Auto-detect manylinux ABI tag for nightly wheels (#41149)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit 92879e12ba130e12bcc2728939eba86b2644122f
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Wed Apr 29 15:32:37 2026 +0800
[CI] fix test_rotary_embedding_opcheck format error (#41202)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
commit 68dd7db81001267c846907769adc14bb32566190
Author: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com>
Date: Wed Apr 29 02:14:52 2026 -0400
[Reasoning] Support for speculative decoding with thinking budget (#34668)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
Signed-off-by: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
commit 8a8c9b564ef015c76cf398200b8f0891e6e51bb8
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date: Wed Apr 29 08:52:55 2026 +0300
[KV Offload] Per-job store completion for CPU offloading connector (#39186)
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
commit a269744e9f733ec9bac4bb6a33f70cc5af38afc3
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Wed Apr 29 13:42:35 2026 +0800
[Bugfix] Fix rope (#41113)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
commit 8b49cf3a37eb1a267a06b0df23328909330af1e6
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Wed Apr 29 00:33:06 2026 -0400
[Bugfix] Fix max_num_batched_token not captured in cuda graph (#40734)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: Wei Zhao (Engrg-Hardware 1) <weizha@login-bia02.bia.clusters.nvidia.com>
commit 2ae73c758ceed55ad2f70a69b47c8a994fce5662
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Wed Apr 29 12:18:46 2026 +0800
[Bugfix] fix inductor error for dpsk v4 (#41135)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit d95d03c719cd69e634e567d6cad3228557151393
Author: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Date: Wed Apr 29 05:08:35 2026 +0100
[BugFix][CPU] fix error on CPU runner shutdown (#41034)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
commit 803b9d7881cd3a8482aaa1e6bf990193b55c6331
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Wed Apr 29 00:08:16 2026 -0400
[Bugfix] Fix Deepseek V4 import error due to AOT compile cache loading (#41090)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
commit 1312f0753115cb36410334e3667961d1237a287b
Author: Walter Beller-Morales <walterbm@users.noreply.github.com>
Date: Wed Apr 29 00:07:53 2026 -0400
[Feature] add cohere reasoning and tool parsers (#40422)
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
commit fa1b9840f6d87ef6e3b247a78514ccc1d6e5f1ce
Author: Lucas Kabela <lucasakabela@gmail.com>
Date: Tue Apr 28 21:07:24 2026 -0700
[BE][Torch 2.12] Remove workaround code for fixed cublas issue (#40845)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
commit 916e56c05c997155b865dd4f46172f26e755da3d
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Wed Apr 29 00:06:54 2026 -0400
[QeRL] Add warnings for extra memory buffering (#40309)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
commit a085b5257dd8cc8d6c255e9b92e4642ee12fc3aa
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Wed Apr 29 00:06:38 2026 -0400
[Docs] [QeRL] Layerwise Reloading Documentation (#40317)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
commit 7fd05e05aeb3664ca19346771dc559d93423acd4
Author: liangel-02 <liangel@meta.com>
Date: Wed Apr 29 00:05:14 2026 -0400
uncomment flex backend for batch invariant mode (#40842)
Signed-off-by: Angel Li <liangel@meta.com>
commit 99255f3cb5cec7466bf9fa5310fd310baf87d711
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Wed Apr 29 12:04:49 2026 +0800
[UX] Allow enable/disable model weights loading tracking by config (#41086)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Copilot <copilot@github.com>
commit 75a7cf2c10f2dcc484c3e0444af33e0eaf3f4311
Author: haosdent <haosdent@gmail.com>
Date: Wed Apr 29 11:23:59 2026 +0800
[CI] De-flake test_chat_completion_n_parameter_non_streaming (#41147)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 4b95e9cec4e9a1a90d3f4b2afa62e88459b2b90e
Author: haosdent <haosdent@gmail.com>
Date: Wed Apr 29 10:23:26 2026 +0800
[CI] Return HTTP 400 for unsupported chat content part type (#41121)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 856b15c62c8a574a1a0a289444d5b9a8120433e3
Author: rasmith <Randall.Smith@amd.com>
Date: Tue Apr 28 21:12:17 2026 -0500
[CI][AMD][BugFix] Patch has_flashinfer decorator for test_select_rocm_aiter_backend (#41072)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
commit 6fb3f7b46b12ea63265afbe6d53d6f15a5de7b3a
Author: qizixi <22851944+zixi-qi@users.noreply.github.com>
Date: Tue Apr 28 17:22:03 2026 -0700
[DSV4] Align aux stream API with DeepseekV4DecoderLayer (#41171)
Signed-off-by: zixi-qi <zixi@inferact.ai>
commit d109eacd05f774008c7e1d17afc76fc48c4fcbc5
Author: chelnnexy <86009079+chelnnexy@users.noreply.github.com>
Date: Tue Apr 28 19:04:53 2026 -0500
[Bugfix][ROCm] Fix gemm_a4w4 call to use updated AITER API signature (#40754)
Signed-off-by: cheiluno <cheiluno@amd.com>
commit e68fa1b90a7bc52510c11fe2edeae11db15f98fc
Author: Nick Hill <nickhill123@gmail.com>
Date: Tue Apr 28 15:44:09 2026 -0700
[Core] Account for `num_gpu_blocks_override` in `max_model_len` checks (#41069)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit f05f3664c35804bf2b5b64eecd17ddfdbb8ed5e3
Author: Russell Bryant <rbryant@redhat.com>
Date: Tue Apr 28 17:53:19 2026 -0400
[Doc] Add missing API endpoints to security documentation (#40532)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
commit e9f8f31e9a4c31d6842ca1adffe2619ed204fafb
Author: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Date: Tue Apr 28 21:22:20 2026 +0200
[FEATURE] Add EagleMistralForCausalLM (#41024)
Signed-off-by: juliendenize <julien.denize@mistral.ai>
commit de3fe8dc62f3d77eb8dab8125ca90436f606bccb
Author: yangrz <37785043+yangrz7@users.noreply.github.com>
Date: Wed Apr 29 02:38:43 2026 +0800
[Bugfix] release KV blocks for skipped P-ranks to prevent invalid KV errors and timeouts when P_tp > D_tp and MLA (#40449)
Signed-off-by: yangruize <yangruize7@163.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit 0899f436aab42f798fb8e728872334c83aaebb79
Author: Joe Rowell <joerowell4@gmail.com>
Date: Tue Apr 28 20:23:00 2026 +0200
[New Model] Laguna XS.2 implementation (#41129)
Signed-off-by: Joe Rowell <joerowell4@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com>
commit 358a755e43b07b9454904df9d3c3fae3340058f1
Author: rasmith <Randall.Smith@amd.com>
Date: Tue Apr 28 13:14:59 2026 -0500
[CI][AMD][BugFix] Update request URL in test_moriio_connector to match vllm-router compatibility changes (#41076)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
commit a60883644be0bcf5219b792b5abbc448e4ea0dcf
Author: Benoit Tigeot <benoittgt@users.noreply.github.com>
Date: Tue Apr 28 19:27:18 2026 +0200
[Build] Defer flashinfer cubin download to avoid ~2.5 GB (decompressed) layer duplication (#41134)
Signed-off-by: Benoit Tigeot <benoit.tigeot@lifen.fr>
commit 5aa371dc8e38e053754d89b444abca0a1d63f676
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Tue Apr 28 12:08:55 2026 -0400
[DSV4] Enable Multi-stream for Pre-Attn GEMM (#41061)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
commit de3da0b97cd9db8b1d429312992a5759c89ef881
Author: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Date: Tue Apr 28 18:38:48 2026 +0800
Add tuned triton fused_moe configs on H100 for gpt-oss (#39904)
Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
commit 9e92de51c61a47e5abb32d99b1930862473741d5
Author: Roy Wang <jasonailu87@gmail.com>
Date: Tue Apr 28 15:52:54 2026 +0800
[Bugfix] Exclude numa_bind fields from ParallelConfig DP hash (#41098)
Signed-off-by: yasong <yasong.wang@inferact.ai>
commit bde0efdbb78a57dc10375e8d0686cf862332192c
Author: artem-spector <artem_spector@yahoo.com>
Date: Tue Apr 28 10:43:30 2026 +0300
[Bugfix][Granite4Vision] Fix deepstack buffer causing decode slowdown in compiled mode (#40917)
Signed-off-by: artemspector <artems@il.ibm.com>
Co-authored-by: artemspector <artems@il.ibm.com>
commit ea74f701db6c0dd4b2d954f5e79841101d0d8a5d
Author: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Date: Tue Apr 28 15:33:49 2026 +0800
Bugfix: fix SpecBench sample argument error (#40927)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
commit a8208e6a81befd781b2a9a8b6b29fd61f5333c66
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Tue Apr 28 15:33:41 2026 +0800
[Examples] Resettle features examples. (#40995)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 76c9cccc368fde7f1b8d8a546e3638f4f434c8fd
Author: anthonsu <50185138+anthonsu@users.noreply.github.com>
Date: Mon Apr 27 23:42:47 2026 -0700
[Core] Fix redundant None append in StepPool.forward for chunked prefill (#41049)
Signed-off-by: Anthony Su <xsuanthony@gmail.com>
commit ed57f771923703998a17ad656536ffb460447a2c
Author: JiangWeixiang <854746559@qq.com>
Date: Tue Apr 28 13:39:23 2026 +0800
[Bugfix] fix bailing_moe_linear (#40859)
Signed-off-by: ghphotoframe <854746559@qq.com>
commit 7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Tue Apr 28 12:52:54 2026 +0800
[Model] update for mimo v25 (#41029)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Copilot <copilot@github.com>
commit c2e88a281c53059d023a2aee43217a7379509a4a
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Tue Apr 28 12:43:04 2026 +0800
[Bugfix] Fix broken example openai client (#41088)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit fd74c90d9c3b5c35308f1f0ab308469235fa5277
Author: Matthew Bonanni <mbonanni@redhat.com>
Date: Mon Apr 27 22:38:09 2026 -0400
[Attention][Spec Decode] Allow independent drafter attention backend selection (#39930)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
commit 146f44b77d079f5a16fe7094fa0dde6b1be95f38
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Tue Apr 28 10:36:58 2026 +0800
[Frontend]Responses API supports Tool/Function calling with streaming with required (#40700)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
commit 0d4f71420822a6f9eb386fe1f3f690ecbf31153b
Author: yzong-rh <yzong@redhat.com>
Date: Mon Apr 27 22:36:54 2026 -0400
[Bugfix] Remove tokenizer encode/decode calls from Olmo3 reasoning parser (#40855)
Signed-off-by: Yifan <yzong@redhat.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
commit 03aeed802f374c0319ad9eca34fae8e7e784769a
Author: Angela Yi <angelayi@meta.com>
Date: Mon Apr 27 17:51:15 2026 -0700
[Test] Fix test_dynamic_shapes_compilation for torch 2.12 (#40743)
Signed-off-by: Angela Yi <angelayi@meta.com>
commit 2c8b76c5cb2683f05650f20d90a63f3d9799e909
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Tue Apr 28 08:16:55 2026 +0800
[Model][DSV4] Support base model (#41006)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
commit 407b34be263320f843c0251af64a8521760871ea
Author: Kunshang Ji <kunshang.ji@intel.com>
Date: Tue Apr 28 08:04:54 2026 +0800
[xpu] bump up vllm-xpu-kernel v0.1.7 (#41019)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
commit 4c7c69b4e0aa7062b8a48268abb06c041bcec53d
Author: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com>
Date: Mon Apr 27 15:38:05 2026 -0700
[Model Runner V2] Skip attention metadata rebuild before draft prefill (#40410)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
commit 5e2c37facde9f5edd68a7de8293107089e9887d8
Author: Andreas Karatzas <akaratza@amd.com>
Date: Mon Apr 27 15:08:57 2026 -0500
[ROCm][CI] Add missing quantization methods and fix online quant test failures (#39801)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit c8bbe05189babd69312876c1dcdc80912207e154
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Mon Apr 27 14:16:22 2026 -0400
[Perf] Update TRTLLM supported MoE routing methods (#39141)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: root <root@bia0030.bia.clusters.nvidia.com>
Co-authored-by: root <root@bia0036.bia.clusters.nvidia.com>
commit 6232fb4b66b42c8e5f4ef1cc4c5163442cc99208
Author: Zhewen Li <zhewenli@meta.com>
Date: Mon Apr 27 10:58:06 2026 -0700
[Docker] Install numactl CLI in CUDA runtime image (#41032)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 2c06cf3486a67efcbdf265b8a183f9ed836cebb7
Author: Moritz Sanft <58110325+msanft@users.noreply.github.com>
Date: Mon Apr 27 17:22:35 2026 +0200
[Bugfix] use `served_model_name` for multimodal error message (#41003)
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
commit e6f710a87f3ce8b137d15ffa4b3a12568e1c8aa3
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Mon Apr 27 16:19:57 2026 +0100
Deprecate support for Transformers v4 (#40389)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit c245d35ff467bb3e9a73fcb3c4b02e6c7a3d2964
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Mon Apr 27 21:26:51 2026 +0800
[Model] Add MiMo-V2.5 support (#40967)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: zjy0516 <riverclouds.zhu@qq.com>
Co-authored-by: zjy0516 <zhujiangyun@inferact.ai>
Co-authored-by: yasong <yasong.wang@inferact.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <copilot@github.com>
commit f8ac0c7cf0e3d4ac8894346005bdffe3bd7bd378
Author: Xiaoshuang Wang <1790571317@qq.com>
Date: Mon Apr 27 20:57:13 2026 +0800
[Bugfix] Fix k_norm weight sharding in MiniMaxM2Attention when total_num_kv_heads < tp_size (#38191)
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit ebf862c351dc4bcaf65de34c3caebe6df6e9e214
Author: Simon Mo <simon.mo@hey.com>
Date: Mon Apr 27 01:17:52 2026 -0700
Add system_fingerprint field to OpenAI-compatible API responses (#40537)
Co-authored-by: Claude <noreply@anthropic.com>
commit 8d8062d0a7014b4cde064024ae5d5a8715a833b3
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Mon Apr 27 15:48:37 2026 +0800
[Examples] Resettle generate examples. (#36464)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 985961345a13f3e3bb15d29c94b011ba9a6b858b
Author: Roy Wang <jasonailu87@gmail.com>
Date: Mon Apr 27 15:47:39 2026 +0800
[Bugfix] Install libcublas-dev in Dockerfile for FlashInfer CuTe DSL JIT (#39855)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit 706a04d34ba64ea23d430d5e50038791aacfae96
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Mon Apr 27 03:37:43 2026 -0400
[DSV4] Add silu clamp limit to shared expert (#40950)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
commit 22631f80a01a04b06398952e77d7890ab660ab10
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Mon Apr 27 15:27:06 2026 +0800
[Bugfix] Remove invalid deepstack boundary check for Qwen3-VL (#40932)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 2cc008e7b491bb77f0caf4d27ad55a83f196114c
Author: Bhoomit <bhoomit.2010@gmail.com>
Date: Sun Apr 26 22:48:36 2026 -0700
[Attention][TurboQuant] Share dequant buffers, eliminate float16_copy (#40941)
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
Signed-off-by: Vasani Bhoomit <bhoomit.2010@gmail.com>
commit 5d5c7764446da0d4888add9a060604e376e4e856
Author: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com>
Date: Mon Apr 27 06:44:15 2026 +0100
[Perf] FP8 FlashInfer Attn for ViT (#38065)
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com>
Co-authored-by: Yubo Gao <ybgao-nvidia@users.noreply.github.com>
commit 592ae6805cb87c9a44c29bd9c6eef9d04d91e39b
Author: ojhaanshika <anshikao@nvidia.com>
Date: Sun Apr 26 22:15:29 2026 -0700
Cutlass W4A16 (Machete) Tests (#35450)
Signed-off-by: Anshika Ojha <anshikao@nvidia.com>
commit 7b1bc0a3eb01a6bc2650eda9970049f7825240d7
Author: Dao007forever <dao007forever@gmail.com>
Date: Sun Apr 26 21:33:13 2026 -0700
[Bugfix] Cap SWA/chunked-local runtime admission to startup pool-sizing bound (#40946)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit c0879d94839a4cc0febba20cc1cb5642fc5c9cc4
Author: Silu Panda <31051721+SiluPanda@users.noreply.github.com>
Date: Sun Apr 26 19:26:51 2026 -0700
[Tests] Gate Isaac under Transformers v5 (#40907)
Signed-off-by: Silu Panda <31051721+SiluPanda@users.noreply.github.com>
commit f5f987851493e6f09ab2ddeb3f33ae878eda0353
Author: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com>
Date: Sun Apr 26 19:12:08 2026 -0700
[Model Runner V2] Fix rejection sampling acceptance rate gap vs MRV1 (#40651)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
commit 2ce95a761b9acd925d1bc69cfd1d4fc13de9e2b7
Author: youkaichao <youkaichao@gmail.com>
Date: Mon Apr 27 09:37:22 2026 +0800
Auto-disable expandable_segments around cumem memory pool (#40812)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 4d51588e2381018348f1022dfa3a7698899805b7
Author: Yifan Qiao <yifanqiao@inferact.ai>
Date: Sun Apr 26 18:31:08 2026 -0700
[Feat] DeepSeek V4 Rebased (#40860)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: qizixi <zixi@inferact.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Yongye Zhu <yongye@inferact.ai>
Co-authored-by: Simon Mo <simon@inferact.ai>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roy Wang <yasong.wang@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Zhewen Li <jerven.vllm@gmail.com>
Co-authored-by: Zijing Liu <liuzijing2014@gmail.com>
Co-authored-by: khluu <khluu000@gmail.com>
Co-authored-by: qizixi <zixi@inferact.ai>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
commit 32e45636e3d7e02615facc8c63645ce4ac1d7e11
Author: Xinan Miao <1403572259@qq.com>
Date: Mon Apr 27 01:44:42 2026 +0800
[torch.compile]: Disable Sequence Parallelism (SP) for piecewise compilation (#38373)
Signed-off-by: SouthWest7 <am1ao@qq.com>
Signed-off-by: Xinan Miao <1403572259@qq.com>
Co-authored-by: SouthWest7 <am1ao@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Wang Xingran <72983099+wangxingran222@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b39c266dae8cd7aee31f667c973e9698ed0b2361
Author: omerpaz95 <73347585+omerpaz95@users.noreply.github.com>
Date: Sun Apr 26 15:06:01 2026 +0300
[KV Offload] Offload all KV blocks when doing prefill in P/D (#40346)
Signed-off-by: omerpaz95 <omerpaz95@gmail.com>
Signed-off-by: omerpaz95 <73347585+omerpaz95@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
commit 9558f43903faa1b6db08ac98802bf88111196345
Author: Dao007forever <dao007forever@gmail.com>
Date: Sun Apr 26 01:26:34 2026 -0700
[Bugfix] Size FlashInfer NVLink MNNVL workspace to EP group (#40893)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
commit 8cd174fa358326d5cc4195446be2ebcd65c481ce
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Sun Apr 26 09:55:19 2026 +0800
[LoRA] MoE LoRA Refactor (#40338)
commit c798593f0d88cec583c599ea7ea40a2cc26c312b
Author: Chauncey <chaunceyjiang@gmail.com>
Date: Sun Apr 26 08:58:50 2026 +0800
[Bugfix] Fix the DSML token leakage in DSV4/3.2 (#40806)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: Windswithyou <1694599440@qq.com>
commit 12a3f6454b973d7cd8806d398ba287a7e1d22c63
Author: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Date: Sat Apr 25 23:50:12 2026 +0300
[Bugfix][MoE] Only unpad routed output before shared expert add or routed output transform (#40865)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
commit 60cd878a3beca91e63d9a34a9c60fd335e780182
Author: Or Ozeri <oro@il.ibm.com>
Date: Sat Apr 25 20:00:46 2026 +0300
[kv_offload+HMA][11/N]: Support store with multiple KV groups (#39403)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
commit 1e9f19ca3fd29f83442ab83b08d4642e691c95bd
Author: rasmith <Randall.Smith@amd.com>
Date: Sat Apr 25 08:34:14 2026 -0500
[CI][AMD][BugFix] Fix deadlock occurring in test_moe_layer (#40767)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
commit 6646c0c7e0c921709c9b194e3988dfaabda5ee15
Author: labAxiaoming <34019940+labAxiaoming@users.noreply.github.com>
Date: Sat Apr 25 21:04:26 2026 +0800
[Opt] Optimize deepstack buffer handling for multimodal Qwen3 models (#40145)
Signed-off-by: xiaoming <1259730330@qq.com>
commit 95995bbef81292e3ee1ef0df5ca3989bb481bdd5
Author: Andreas Karatzas <akaratza@amd.com>
Date: Sat Apr 25 00:25:20 2026 -0500
[ROCm][Engine] Fix GPU memory leaks in engine shutdown and test workaround for async KV prefix cache reset (#38503)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 07351e0883470724dd5a7e9730ed10e01fc99d08
Author: Chenguang Zheng <645327136@qq.com>
Date: Sat Apr 25 11:57:41 2026 +0800
[Feature] Warm up readonly multimodal processor during renderer startup (#40797)
Signed-off-by: Chenguang ZHENG <645327136@qq.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
commit 428b988c98a0dee06c47d4a70858317b60169461
Author: Andreas Karatzas <akaratza@amd.com>
Date: Fri Apr 24 21:59:31 2026 -0500
[ROCm][CI] Fix `trust_remote_code` AttributeError in EAGLE3 acceptance length test (#40306)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit e54894fc85a9861fb38a49701b5844462c3d77e4
Author: Andreas Karatzas <akaratza@amd.com>
Date: Fri Apr 24 21:20:59 2026 -0500
[ROCm][CI] Fix TestSiluMulGroupFp8QuantModel after W8A8 block linear refactor (#39799)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit bc2ae5a3d6b59690b6a3312f0ed63842e8bc600b
Author: Angela Yi <angelayi@meta.com>
Date: Fri Apr 24 17:59:20 2026 -0700
[Test] Increase qwen2_vl num_logprobs to fix torch 2.12 update (#40818)
Signed-off-by: Angela Yi <angelayi@meta.com>
commit a474da28131f61684849b31e29af0eebaaedc383
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Fri Apr 24 19:28:18 2026 -0400
[Refactor] Remove unused dead code (#40640)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit ce6a199ecc0996254efcf6fe532c40d9b9432922
Author: Lucas Kabela <lucaskabela@meta.com>
Date: Fri Apr 24 16:25:03 2026 -0700
[BE][Bugfix] Respect TORCH_COMPILE_DISABLE env var at the vLLM config level for torch 2.12 (#40715)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
commit f88763efc35f8da4d3cfe611a0497d3d3251b9e9
Author: Ignacio Sica <mignacio.sica@gmail.com>
Date: Fri Apr 24 20:13:52 2026 -0300
[Bugfix] add seq_lens_cpu_upper_bound to CommonAttentionMetadata in mla_runner.py (#40844)
Signed-off-by: ignaciosica <mignacio.sica@gmail.com>
commit 333529deae59cd4100df540f225470c9bc539bee
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Sat Apr 25 01:06:41 2026 +0300
[EPLB] Fix replica selection bias in fused_moe router (#40810)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit 88256082058fdbd41281c4f1f9a19663a4d7a668
Author: Zhang Jian <jianmusings@gmail.com>
Date: Sat Apr 25 04:40:07 2026 +0800
[Bugfix][CI] Fix wrong residual shape in TestFusedAddRMSNorm.example_inputs that causes flaky test (#40629)
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
commit 095d2f87e8519de27f1fc39d9d22b299efdf0010
Author: qli88 <qiang.li2@amd.com>
Date: Fri Apr 24 14:54:40 2026 -0500
[Bug] Fix GLM-5.1 running error on ROCm platform (#40763)
Signed-off-by: Qiang Li <qiang.li2@amd.com>
commit 21792520e727676e4d4e8bd24a8fe29da4dab152
Author: Neil Schemenauer <nas-github@arctrix.com>
Date: Fri Apr 24 10:24:05 2026 -0700
[Build] Add Python 3.14 to supported version list. (#34770)
Signed-off-by: Neil Schemenauer <nas@arctrix.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
commit 5e11b403657ebd5507e07200c2ba2b8f186d07da
Author: Alex Brooks <albrooks@redhat.com>
Date: Fri Apr 24 10:30:00 2026 -0600
[Frontend] Delegate to vLLM Omni When `--omni` Passed (#40744)
Signed-off-by: Alex Brooks <albrooks@redhat.com>
commit f768b4473e1bd55023dcaff63984cfdd08902fc8
Author: labAxiaoming <34019940+labAxiaoming@users.noreply.github.com>
Date: Fri Apr 24 23:26:09 2026 +0800
[Docs] Add docs for context extension using the yarn method (#37430)
Purpose
Fixes #39734 (and the same root cause that motivates #39866 / #40027): on hybrid full + sliding-window models with the pool sized at the startup minimum, runtime admission rejects long-but-fittable prompts because the SWA admission gate budgets the full sequence even though `SlidingWindowManager.remove_skipped_blocks` recycles out-of-window blocks each chunk. The startup pool sizer already accounts for this; the runtime gate did not. The two formulas drifted, producing scheduler stalls and head-of-line blocking on Gemma-4 31B-class workloads.
This PR makes startup pool sizing and runtime admission use the same recycling-aware bound by introducing a `max_admission_blocks_per_request` method on `SlidingWindowSpec` and `ChunkedLocalAttentionSpec`. Both `max_memory_usage_bytes` (startup sizer) and `get_num_blocks_to_allocate` (runtime gate) call it. They cannot drift.
Relationship to existing PRs (per AGENTS.md duplicate-work check)
`get_num_blocks_needed_for_admission`) alongside the existing `get_num_blocks_to_allocate`. The token-cap formula `min(num_tokens, sliding_window + max_num_batched_tokens)` is duplicated separately from the startup pool-sizing formula in `max_memory_usage_bytes`, so the two can drift. SWA only.
`ValueError: Cannot get 1 free blocks from the pool.`
This PR's design avoids that gap by construction:
`SlidingWindowManager.get_num_blocks_to_allocate` | `max_admission_blocks_per_request` on the spec, called from both `max_memory_usage_bytes` and the manager
`sliding_window + max_num_batched_tokens`) | `cdiv(sliding_window-1+max_num_batched_tokens, block_size) + 1`, capped at `max_model_len`)
I'd be happy to fold useful pieces (e.g., the head-of-line repro in #39866's bench numbers) into the discussion, or close in favor of either if the maintainers prefer the other shape.
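As an illustration of the recycling-aware bound compared above, here is a minimal Python sketch. It assumes the `max_model_len` cap is applied to the token count before rounding up to blocks. The free functions `swa_admission_blocks` and `capped_demand_blocks` and their parameter lists are hypothetical stand-ins, not the PR's actual method signatures.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return -(-a // b)


def swa_admission_blocks(
    sliding_window: int,
    max_num_batched_tokens: int,
    max_model_len: int,
    block_size: int,
) -> int:
    """Hypothetical stand-in for the per-request SWA bound, in blocks."""
    # Tokens that can be live at once: (window - 1) tokens of history plus one
    # prefill chunk. Assumption: the max_model_len cap applies to this count.
    live_tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    # +1 block of slack because the live span need not start on a block boundary.
    return cdiv(live_tokens, block_size) + 1


def capped_demand_blocks(num_tokens: int, bound_blocks: int, block_size: int) -> int:
    """Runtime demand for a fresh request, capped at the shared bound."""
    capped_tokens = min(num_tokens, bound_blocks * block_size)
    return cdiv(capped_tokens, block_size)


# Illustrative values only; they are not taken from the PR.
bound = swa_admission_blocks(
    sliding_window=64, max_num_batched_tokens=128, max_model_len=4096, block_size=16
)
long_prompt = capped_demand_blocks(num_tokens=512, bound_blocks=bound, block_size=16)
short_prompt = capped_demand_blocks(num_tokens=100, bound_blocks=bound, block_size=16)
```

With these illustrative values, a long prompt is charged the bound rather than its full length, while a short prompt is charged only what it needs. Because both the sizer and the gate would derive their numbers from the same function, a change to the formula reaches both at once.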
Changes
- `vllm/v1/kv_cache_interface.py`: add `SlidingWindowSpec.max_admission_blocks_per_request` and `ChunkedLocalAttentionSpec.max_admission_blocks_per_request`. Refactor each spec's `max_memory_usage_bytes` to call its own admission method, locking startup and runtime to one formula.
- `vllm/v1/core/single_type_kv_cache_manager.py`: `SlidingWindowManager` and `ChunkedLocalAttentionManager` accept `max_admission_blocks_per_request` and cap `num_tokens` inside `get_num_blocks_to_allocate` to `max_admission_blocks_per_request * block_size` (recycling-aware reservation, not lifetime touches).
- `vllm/v1/core/kv_cache_coordinator.py`, `vllm/v1/core/kv_cache_manager.py`, `vllm/v1/core/sched/scheduler.py`, `vllm/v1/simple_kv_offload/manager.py`: plumb `max_num_batched_tokens` through so the spec method can be called at manager-construction time. Defaults to `max_model_len` when not supplied (preserves the legacy uncapped behavior for direct callers).
Safety argument
Per-request peak real-held blocks ≤ reservation, because `SlidingWindowManager.remove_skipped_blocks` (and the chunked-local equivalent) is invoked from `allocate_slots` before each chunk's `get_num_blocks_to_allocate`, swapping out-of-window blocks for `null_block`. The admission gate enforces `sum(reservations) ≤ pool`. Therefore `sum(peak_real_held) ≤ pool`. Drift between the two formulas would re-introduce #39734 or, worse, mid-prefill OOM: they MUST stay in sync, which is exactly what the single-source-of-truth design guarantees.
Test plan
New tests:
- `test_can_fit_full_sequence_swa_cap_admits_long_prompt`: hybrid full+SWA, pool sized at startup minimum, prompt longer than SWA window. Without the cap: 32 (full) + 32 (SWA) = 64 blocks demanded → reject. With the cap: 32 (full) + 13 (SWA) = 45 ≤ pool. Asserts admission succeeds.
- `test_can_fit_full_sequence_full_attention_still_gates_oversized`: confirms the cap does not weaken the genuine-too-big rejection path on full-attention groups.
Reproduce head-of-line blocking on Gemma3:
Send 4 requests of size: `[100_000, 5_000, 13_000, 14_000]` (`"hello" * n`).
Before the fix:
After the fix, all 4 requests went through.
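The plateau claimed in the safety argument and the block counts quoted for the new SWA test can be checked with a small simulation. This is only a sketch: the parameters below (block size 16, window 64, chunk 128, prompt 512) are hypothetical values chosen so that the arithmetic reproduces the quoted 32 and 13 blocks. They are not taken from the test itself, and `peak_held_blocks` is an illustrative helper that models recycling as freeing every block that lies entirely before the window.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return -(-a // b)


def peak_held_blocks(
    prompt_len: int, sliding_window: int, chunk: int, block_size: int
) -> int:
    """Simulate chunked prefill with recycling; return the peak real-held blocks."""
    peak = 0
    num_computed = 0
    while num_computed < prompt_len:
        step = min(chunk, prompt_len - num_computed)
        # Recycling runs first: tokens that fell out of the window are skipped,
        # and every block fully covered by skipped tokens goes back to the pool.
        skipped_tokens = max(0, num_computed - sliding_window + 1)
        freed_blocks = skipped_tokens // block_size
        # Then blocks are allocated to cover this chunk.
        total_blocks = cdiv(num_computed + step, block_size)
        peak = max(peak, total_blocks - freed_blocks)
        num_computed += step
    return peak


# Hypothetical parameters, chosen to match the counts quoted in the test plan.
BLOCK, WINDOW, CHUNK, PROMPT = 16, 64, 128, 512

bound = cdiv(WINDOW - 1 + CHUNK, BLOCK) + 1   # recycling-aware SWA bound
full_blocks = cdiv(PROMPT, BLOCK)             # full attention holds every block
swa_uncapped = cdiv(PROMPT, BLOCK)            # old gate: charge the whole prompt
swa_capped = min(swa_uncapped, bound)         # new gate: charge at most the bound
pool = full_blocks + bound                    # assumed startup-minimum pool

peak = peak_held_blocks(PROMPT, WINDOW, CHUNK, BLOCK)
admitted_before = full_blocks + swa_uncapped <= pool
admitted_after = full_blocks + swa_capped <= pool
```

Under these assumptions the simulated peak stays at or below the bound, so reserving the bound per request is enough to keep the sum of real-held blocks within the pool.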
Test results
AI-assisted contribution disclosure
This change was developed with AI assistance (Claude Code). The submitter has reviewed every changed line, has run the test suite locally, and stands behind the design choice and the safety argument above.
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude