[CI] Fix "test_awq_load[gemma4-moe-*]" failure #43296
Code Review
This pull request refactors the Gemma4 vision encoder's batching logic to prevent out-of-memory (OOM) errors by dynamically sizing chunks based on currently free GPU memory. It replaces the static budget calculation with a more accurate model of transient memory costs, specifically targeting the F.one_hot allocation. Feedback highlights that the new _encoder_chunk method lacks a safety check for zero-cost scenarios, which could lead to a division by zero. Additionally, the direct use of torch.cuda.mem_get_info() in both image and video processing paths breaks portability for non-CUDA backends and should be abstracted through vLLM's platform layer.
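The portability concern raised in the review can be illustrated with a small sketch. It is not vLLM's platform layer and not the helper this PR ends up adding: `query_device_memory` is a hypothetical name, and the only real API it relies on is `torch.cuda.mem_get_info()`. The idea is to return `None` whenever free memory cannot be queried, so the caller can fall back to a conservative chunk size instead of crashing on a non-CUDA backend.

```python
from typing import Optional, Tuple


def query_device_memory() -> Optional[Tuple[int, int]]:
    """Return (free_bytes, total_bytes) for the current CUDA device.

    Hypothetical helper for illustration only. Returns None when torch is
    not installed or no CUDA device is available, so the caller can pick a
    safe default rather than raising.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # torch.cuda.mem_get_info() returns (free, total) in bytes.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return int(free_bytes), int(total_bytes)


mem = query_device_memory()
if mem is None:
    print("memory query unavailable; caller should process one item at a time")
else:
    print(f"free={mem[0]} bytes, total={mem[1]} bytes")
```

Keeping the query in one place also means it can be called once and reused, rather than repeated in both the image and video paths.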
@lucianommartins @Isotr0py, could you help review this? This is the follow-up to #43169. Thank you in advance!
Force-pushed: 47c6974 to 2d95a89
Hi @haosdent, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Force-pushed: f21d2af to 5109970
PR vllm-project#43169 sized the batched vision-encoder chunk by ``budget = 5% * total_memory`` with a cost model that counted only the encoder residual stream. On a 22 GiB L4 with a 26B AWQ model loaded (~3 GiB free), the heuristic admitted ``chunk ~= 53`` and OOMed inside HF's ``Gemma4VisionPatchEmbedder._position_embeddings`` allocating a 2.88 GiB ``F.one_hot(num_classes=position_embedding_size)`` int64 buffer, which is the actual dominant transient and ~4x larger than the old cost term.

Replace the heuristic with a free-memory-based budget (``min(free // 2, total // 10)``) and a cost model that counts the ``F.one_hot`` transient. Hoist the memory query out of the per-resolution-bucket loop, route it through a portable ``_query_device_memory`` helper that uses ``current_platform.get_device_total_memory()`` and gracefully falls back to chunk=1 on non-CUDA backends, and extract the math into a pure static ``_encoder_chunk`` so it can be unit-tested without a GPU.

Fixes the nightly ``test_awq_load[gemma4-moe-*]`` failures while preserving PR vllm-project#43169's batching speedup on roomy GPUs.

Signed-off-by: haosdent <haosdent@gmail.com>
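The budget rule in the commit message can be sketched as a pure function. This is only an illustration under assumptions: `encoder_chunk` and its parameters are hypothetical names rather than the signature of the PR's ``_encoder_chunk``, the per-item cost is passed in instead of being derived from the model config, and returning 1 for a non-positive cost is one possible way to handle the zero-cost case raised in review.

```python
def encoder_chunk(num_items: int, cost_per_item_bytes: int,
                  free_bytes: int, total_bytes: int) -> int:
    """Pick how many items to send through the encoder in one batch.

    Hypothetical sketch of the rule budget = min(free // 2, total // 10).
    """
    if num_items <= 0:
        return 0
    # Guard against division by zero and unknown memory: be conservative.
    if cost_per_item_bytes <= 0 or free_bytes <= 0 or total_bytes <= 0:
        return 1
    budget = min(free_bytes // 2, total_bytes // 10)
    # Always make progress (at least 1), never exceed the number of items.
    return max(1, min(num_items, budget // cost_per_item_bytes))


GIB = 2**30
# Assumption: the one_hot transient scales linearly with chunk size, so the
# 2.88 GiB buffer seen at chunk ~= 53 gives a rough per-item cost.
per_item = (288 * GIB // 100) // 53
print(encoder_chunk(100, per_item, 3 * GIB, 22 * GIB))
```

With the figures quoted in the commit message (about 3 GiB free on a 22 GiB card), the budget is limited by ``free // 2``, and this sketch admits 27 items per chunk rather than the 53 that triggered the OOM. Because the function takes plain integers, it can be tested without a GPU.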
The pre-commit failures are fixed in #43378.
This likely also fixes Multi-Modal Models (Extended Generation 3). Let me monitor whether https://buildkite.com/vllm/ci/builds/67561/list?sid=019e4d84-f66a-472b-9bdb-73d6a1b9193b&tab=output passes.
commit 843715739b7b555c61dd6190cafb5ab7a44c41f1
Author: Yongye Zhu <zyy1102000@gmail.com>
Date: Fri May 22 13:06:31 2026 -0400
[Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149)
commit b21f3d56d4a2ab5504b56504e87e0475c6d84eb2
Author: Dao007forever <dao007forever@gmail.com>
Date: Fri May 22 09:14:11 2026 -0700
[KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit c7624bea5ebba1c688eb4c216bd4ede7a94f2a82
Author: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com>
Date: Fri May 22 12:10:03 2026 -0400
[Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650)
Signed-off-by: zhanda <zhandazhu@gmail.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
commit 91f5b92438a568c89e8b9d6c2c55de5a552291f6
Author: Bugen Zhao <i@bugenzhao.com>
Date: Fri May 22 23:22:11 2026 +0800
[Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
commit f0feb15e7fc521544d23c2d23de0e327a509876b
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Fri May 22 22:31:00 2026 +0800
[Multimodal] Simplify ViT CUDA graph interfaces (#41234)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit fb21d8b4f9027f4642637c7bb0acc08c29dce387
Author: sychen52 <41452870+sychen52@users.noreply.github.com>
Date: Fri May 22 07:21:51 2026 -0700
Add NVFP4 MOE support for Deepseek V4. (#42209)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
commit a377631d21cc97db678727455d33c4257435f417
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 22:06:24 2026 +0800
[CI] Fix AMD docker build tests (#43329)
Signed-off-by: haosdent <haosdent@gmail.com>
commit d3a563501bcc6134a348f8458b1a797c94336f1f
Author: Ilya Markov <markovilya197@gmail.com>
Date: Fri May 22 15:43:27 2026 +0200
[EPLB] Change default EPLB communicator (#43110)
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
commit 15f7cd33dc8bd4d2270b70ba49d511827d2413ff
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Fri May 22 21:41:56 2026 +0800
[LoRA] Reduce memory of 2D weights when EP is set (#42737)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 79ff0ffa98dc8dd14a8651bce36ce6265ff4d35d
Author: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com>
Date: Fri May 22 05:26:41 2026 -0700
[BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118)
Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
commit 4658bf882b881287fc85797a23037aa91740b7a7
Author: Tobias Wasner <wasnertobias@users.noreply.github.com>
Date: Fri May 22 12:54:29 2026 +0200
[Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001)
Signed-off-by: Tobias Wasner <wasnertobias@gmail.com>
commit b3c7ffcab82c2439726f8cb213800f6f38c023d3
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Fri May 22 05:43:33 2026 -0500
[Misc] Replace assert with proper exceptions for security and validation in pooling (#43286)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit d3d1cf6972607c53327b5ce1748e56a95fc41c37
Author: Ma Jian <jian1.ma@intel.com>
Date: Fri May 22 18:22:45 2026 +0800
[XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951)
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 7e1b45a09252a5b513cd83116aa7a2f310220c34
Author: wangxiyuan <wangxiyuan1007@gmail.com>
Date: Fri May 22 17:13:12 2026 +0800
[Attention] Mamba attention module refactor (#41126)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
commit 65b7a812a2dabd212d78c7b5b8a320b4efb9750d
Author: Li, Jiang <jiang1.li@intel.com>
Date: Fri May 22 16:48:17 2026 +0800
[CPU] Experimentally enable Triton and MRV2 (#43225)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
commit 2380bfc2104267914eea36015e2a347b9318c6c0
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Fri May 22 16:43:14 2026 +0800
[Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit a7616977176e12ddb14c0daab00cd2a2161ba37c
Author: mrjunwan-lang <mrjunwan@google.com>
Date: Fri May 22 01:36:17 2026 -0700
Fix the docker build failure in tpu-inference (#43360)
Signed-off-by: mrjunwan-lang <mrjunwan@google.com>
commit 694d9a81bbb07977e7a72a597acb44f6a848f774
Author: Nick Hill <nickhill123@gmail.com>
Date: Fri May 22 00:25:10 2026 -0700
[BugFix] Fix setuptools-rust dep in requirements files (#43377)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 6bb8753db1076f498c240fffdd88b1ab983b7f40
Author: Weida Hong <wdhongtw@google.com>
Date: Fri May 22 15:21:35 2026 +0800
Correcting the mock classes for MM GC tests (#43321)
Signed-off-by: Weida Hong <wdhongtw@google.com>
commit 025d4f5cd2617bb767663f9e7d62354039887757
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 15:13:59 2026 +0800
[CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 5ea76fa89aa2e307f0d9a2e7fc19d13aed65a82f
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 14:24:18 2026 +0800
[CI] Fix test_lora_with_spec_decode on V2 model runner (#43314)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
commit fa1ff88b3145d1897558408a9001c030c39383b9
Author: tc-mb <157115220+tc-mb@users.noreply.github.com>
Date: Fri May 22 13:44:06 2026 +0800
[Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213)
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
commit e746a2eebf09b1f99beb6b3c60a5ba9d2f8c4875
Author: Furkan F <id+git@yufufi.com>
Date: Fri May 22 07:28:23 2026 +0200
[Model] Use `AutoWeightsLoader` for Voyage (#42972)
Signed-off-by: Furkan Fidan <dev@yufufi.com>
commit 1fe3303983e1829fae25edfb0b93e8cbcfad96e6
Author: haosdent <haosdent@gmail.com>
Date: Fri May 22 12:15:22 2026 +0800
[CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 8c8b1825eb26c1ffae776baaab16f2eebf92b7d3
Author: Xiaochang Wu <xiaochang.wu@intel.com>
Date: Fri May 22 12:02:51 2026 +0800
[XPU] Enable multiple key kernels for sparse attention (#37888)
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 18a27cc9a3641cc1dd3eae5113b75c7ccc029b5f
Author: qizixi <22851944+zixi-qi@users.noreply.github.com>
Date: Thu May 21 20:36:22 2026 -0700
[Bugfix] Make CuMemAllocator free callback stream-aware (#43020)
Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
commit 0ddd7dd6564f5e403a15bd7c973c7d358ec82454
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 23:33:16 2026 -0400
[Frontend] DP Supervisor (#40841)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 60af5c16ee64ea3c1c573d67d0773a713c87a22e
Author: ruizhang <rza21.bc@gmail.com>
Date: Thu May 21 20:32:31 2026 -0700
[Frontend] Add truncation side to OpenAI endpoints (#43260)
Signed-off-by: Rui Zhang <rza21.bc@gmail.com>
Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net>
Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net>
commit 35d0141a0b68a188777e277e372f211098419f58
Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Date: Thu May 21 23:17:54 2026 -0400
[ROCm][CI] add warmup to mem_util test before measurement (#43236)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
commit 86ccef7d4400a54441057773d8ffb1f61a20af94
Author: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Date: Fri May 22 05:06:40 2026 +0200
[ROCm] Add XGMI backend for MoRI Connector (#41753)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
commit 2998a047aad7d48bf0399f19b36f1a4d749c59c2
Author: Chengze Fan <fancz2002@gmail.com>
Date: Thu May 21 19:43:01 2026 -0700
[Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855)
Signed-off-by: Chengze Fan <chengze@meta.com>
Signed-off-by: Chengze Fan <fancz2002@gmail.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
commit ba369b7eb5a3c6593b55f2005655d6586997fa07
Author: Isotr0py <mozf@mail2.sysu.edu.cn>
Date: Fri May 22 10:26:05 2026 +0800
[CI] Fix dockerfile dependency graph failure for pre-commit (#43378)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 39910f2b25aacc09f5e7f166cdf0030b19f8b9e8
Author: Bugen Zhao <i@bugenzhao.com>
Date: Fri May 22 08:21:48 2026 +0800
[Rust Frontend] Move code from `vllm-frontend-rs` (#43283)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Signed-off-by: Will.hou <1205157517@qq.com>
Signed-off-by: Will.hou <willamhou@ceresman.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Eric Curtin <eric.curtin@docker.com>
Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Co-authored-by: Will.hou <1205157517@qq.com>
Co-authored-by: Will.hou <willamhou@ceresman.com>
Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history.
commit 39d5fa96a7c687f9ed7e14a5a52064965356cede
Author: Lanze Liu <86434077+liulanze@users.noreply.github.com>
Date: Thu May 21 15:42:42 2026 -0700
[Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
commit 565b745ec5d28dafd14585f1b695b159ba336a04
Author: Nick Hill <nickhill123@gmail.com>
Date: Thu May 21 15:42:20 2026 -0700
[BugFix] Use correct logprobs for `logprob_token_ids` (#43125)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit e26e1f09280b6c54e1bc1d1fbc0118f7e309cb10
Author: fangyuchu <fangyuchu@qq.com>
Date: Fri May 22 06:42:07 2026 +0800
[Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968)
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: zWaNg3 <389750525@qq.com>
Co-authored-by: zWaNg3 <389750525@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 0f66623b0d739dc94afddb67863c37d6f5816579
Author: Nick Hill <nickhill123@gmail.com>
Date: Thu May 21 15:36:58 2026 -0700
[Frontend] Rework fastokens integration (#43168)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 0b59fc45dd475f96f6f46f2c3e699d7bc13b3b04
Author: ylangtsou <149562838+ylangtsou@users.noreply.github.com>
Date: Fri May 22 06:00:52 2026 +0800
Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038)
Signed-off-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit 17b69828a013acb7af0cd1d16d24ecc8d7582094
Author: Zheng Luo <zheluo@nvidia.com>
Date: Thu May 21 13:05:01 2026 -0700
[Core] Add native ModelExpress load format (#43105)
Signed-off-by: Zheng Luo <zheluo@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
commit b29cbf06525254693f29d98686e038eaf225be8c
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 16:00:29 2026 -0400
[Perf] `zeros` -> `empty` to remove additional fill (#42988)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 9b54e50e2c1c61ea3b7def032fbafc56dd3179c1
Author: Michael Goin <mgoin64@gmail.com>
Date: Thu May 21 15:51:12 2026 -0400
[Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
commit 1c78f76c29a642379ad0ec953a77af9bc44376b6
Author: anish <145943060+anishesg@users.noreply.github.com>
Date: Thu May 21 11:07:46 2026 -0400
[Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079)
Signed-off-by: anish <anishesg@users.noreply.github.com>
Signed-off-by: Your Name <ak8686@princeton.edu>
Signed-off-by: anish <145943060+anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
commit 9b9d5dbaab852a1c615fe83a7f92881d353503db
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 22:28:34 2026 +0800
[CI] Fix CPU tests failing on `tl.exp2` import (#43311)
Signed-off-by: haosdent <haosdent@gmail.com>
commit b730c4635288d75da4788bc28d8d26b5e5c3726c
Author: Francesco Fusco <ffu@zurich.ibm.com>
Date: Thu May 21 13:50:54 2026 +0200
[Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit c68c55d43e504745dbfc2d46b552e80acb74d4b9
Author: velonica0 <47554626+velonica0@users.noreply.github.com>
Date: Thu May 21 19:50:49 2026 +0800
[CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
commit 5ecd8e9c708821916323d25d5f7beddb7f41d22b
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Thu May 21 18:41:38 2026 +0800
[XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit caf69823d61119ac3f4b066f20a910b62078e41c
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 18:38:07 2026 +0800
[CI] Pin protoc binary in rust-build stages (#43292)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 68e07d59161a8d268b773c181fab17994a7c5d0a
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Thu May 21 04:58:09 2026 -0400
[Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
commit ebbfb34e3e058bd539db9e5015d0c18b7ce5a5e0
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Thu May 21 01:57:47 2026 -0700
[Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit edafea35550fab0b185b885711ec048dfd2e1a4d
Author: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Date: Thu May 21 16:17:12 2026 +0800
Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223)
Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
commit b719b1635b4899e2372905def0badf96d4dd242a
Author: zexplorerhj <zhjoneson@163.com>
Date: Thu May 21 16:16:27 2026 +0800
Update KDA chunk prefill decay to use exp2 semantics (#43195)
Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
commit 0a54df28471be07b3d668ea21c5e411569d3baea
Author: Kunshang Ji <kunshang.ji@intel.com>
Date: Thu May 21 07:14:13 2026 +0000
[XPU] add setuptools-rust for xpu dependency (#43287)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
commit a950e9447e38727fc956afdc242bc6e3796ccb77
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 14:30:14 2026 +0800
[CI] De-flake test_models for bigscience/bloom-560m (#43197)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 050611a3dd19271a3c729788ff69b3470ccfb238
Author: Yiyang "Ian" Liu <yiyangliu@microsoft.com>
Date: Wed May 20 22:58:59 2026 -0700
[Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601)
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
commit 905b97adfaf7b08f3cc95b328579e5336ed6d3b6
Author: yzong-rh <yzong@redhat.com>
Date: Thu May 21 01:13:15 2026 -0400
[Benchmark] Add num-warmup to vllm bench throughput (#43245)
Signed-off-by: Yifan Zong <yzong@redhat.com>
commit a6682d1d259cca69a9ae737ea5608fbbe7520031
Author: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
Date: Wed May 20 21:35:08 2026 -0700
[Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905)
Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
commit f2ace1d57d28df8d4c5e973dd62d87f47d628cb3
Author: Nick Hill <nickhill123@gmail.com>
Date: Wed May 20 21:24:48 2026 -0700
[Frontend][RFC] Rust front-end integration (#40848)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
commit d97ba29fdcf2538359fac5c644c0f07e59bc1988
Author: 손세정 <maze0717@g.skku.edu>
Date: Thu May 21 13:24:08 2026 +0900
[ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973)
Signed-off-by: AAISSJ <maze0717@g.skku.edu>
Signed-off-by: <>
Signed-off-by: sejung-son <sejung.son@nhn.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local>
Co-authored-by: sejung-son <sejung.son@nhn.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
commit 6441cf4a44856f4eb4dce7d19a51fd69e1b423cf
Author: Flora Feng <4florafeng@gmail.com>
Date: Thu May 21 00:24:06 2026 -0400
[Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 346cf163a11b55e069aa3143ae2878967393ddc2
Author: Ben Browning <bbrownin@redhat.com>
Date: Thu May 21 00:23:47 2026 -0400
[Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 7e5070934ee5f28103c5b95cb776904a12fc36f5
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 12:22:10 2026 +0800
[CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082)
Signed-off-by: haosdent <haosdent@gmail.com>
commit 2b75a73b8e23f5df6de92d01a191e059424487e3
Author: Luciano Martins <22145370+lucianommartins@users.noreply.github.com>
Date: Thu May 21 01:22:06 2026 -0300
[Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
commit e45df8c3f77572d03f638feded5b5efbccdbcc05
Author: sonusflow <git@sonusflow.pl>
Date: Thu May 21 06:22:01 2026 +0200
[Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329)
Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit ee05e8137ec48b8e7375228a1142b4c5f2e3360c
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Thu May 21 12:20:57 2026 +0800
[Minor] Bigger overlap for FI AR (#43103)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 5d041cc1fe5181daabf39943efc7b678380d57bd
Author: Louie Tsai <louie.tsai@intel.com>
Date: Wed May 20 20:57:48 2026 -0700
update GPU json file based on h200 recipes (#43262)
Signed-off-by: louie-tsai <louie.tsai@intel.com>
commit 9640970de20b15ade9eb3859825637f64e81ed8c
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed May 20 21:00:30 2026 -0400
[Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit 63ea11709bd9e9b14669e3973dff92d2dcea3cb1
Author: Ace Eldeib <alexeldeib@gmail.com>
Date: Thu May 21 02:36:16 2026 +0200
[CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255)
Signed-off-by: Ace Eldeib <aeldeib@coreweave.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
commit bde560ed6e1dc889debf68410ccbcb00b749513b
Author: akii96 <aakif.nawaz@amd.com>
Date: Thu May 21 01:46:51 2026 +0300
[ROCm] Add QuickReduce min-size override and codec threshold (#41675)
Signed-off-by: <>
commit 6dc0a71843878ef45e29d4732147290b797b70fd
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Thu May 21 05:19:50 2026 +0800
[Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit 5774aad9c5b67c5bb67bb7d306a9652a035ed0aa
Author: Michael Goin <mgoin64@gmail.com>
Date: Wed May 20 17:13:12 2026 -0400
[Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit 452baa860b1169787cc8540a1772c4d96f682c40
Author: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Date: Wed May 20 16:10:44 2026 -0500
Add dllehr-amd to CODEOWNERS and committers list (#42772)
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com>
commit 2a43b407c5093b1255a172139da6a5151f410b7a
Author: Flora Feng <4florafeng@gmail.com>
Date: Wed May 20 14:59:12 2026 -0400
[Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 53ff50fcd3d2012a406e5053026ea6a46c88b2b6
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Wed May 20 14:57:42 2026 -0400
[Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
commit 363fc84407f8c966c1cee6786e45e9e6ab289684
Author: meena-at-work <80416898+meena-at-work@users.noreply.github.com>
Date: Wed May 20 10:21:11 2026 -0700
Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082)
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
commit f2d5e3d3aeac4cb1f6d285e4a567a502ae507777
Author: haosdent <haosdent@gmail.com>
Date: Thu May 21 01:00:24 2026 +0800
[CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186)
Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
commit 2d6b3489b9a325988ad52507236409747d2098a7
Author: Aaron Hao <ahao@anyscale.com>
Date: Wed May 20 09:07:59 2026 -0700
[R3] Add routed experts to openai entrypoint (#38939)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
commit 9c78c99995b70726f9ea929ff2e535d6303383d6
Author: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Date: Wed May 20 19:50:24 2026 +0400
[MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
commit a10d69116cb25c8137eeb3f320add71d4e04fda9
Author: Flora Feng <4florafeng@gmail.com>
Date: Wed May 20 10:21:00 2026 -0400
[Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 644b2a28e7eb3b11191f157416cfedebd2da995b
Author: Joel Smith <j.smith9103@outlook.com>
Date: Wed May 20 15:10:01 2026 +0100
[Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215)
Signed-off-by: j9smith <j.smith9103@outlook.com>
Signed-off-by: Joel Smith <j.smith9103@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
commit ded871201a424dd0d28a00aaf74c5786457a18ee
Author: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com>
Date: Wed May 20 10:08:58 2026 -0400
[Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
commit df84fb07a6e57969941841c6363d1efbac1ba1e8
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed May 20 10:01:45 2026 -0400
Remove additional dead code as a follow-up to #42889 (#43144)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
commit 0a508743d42a26786c1432bb7f2e93f8111b6383
Author: Benjamin Chislett <bchislett@nvidia.com>
Date: Wed May 20 09:15:52 2026 -0400
[Spec Decode] Support non-MTP speculation for NemotronH (#43130)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
commit 19cf334207ed81d3ed75a473acd1a95c785d9ed3
Author: Kebe <mail@kebe7jun.com>
Date: Wed May 20 21:58:30 2026 +0900
[Feature] Support manually enabling the cumem allocator (#33648)
Signed-off-by: Kebe <mail@kebe7jun.com>
commit 87e31455b056c6ce59bf5dcb3c622155431851db
Author: Ray Wang <roguerui6@gmail.com>
Date: Wed May 20 02:32:03 2026 -0700
[Doc] Sync CLI guide with actual help modes and launch subcommand (#40326)
Signed-off-by: Rui Wang <raygorous@gmail.com>
Co-authored-by: Rui Wang <raygorous@gmail.com>
commit cb600d1cdbb079ab9432348f128e71c4e2e0a373
Author: hallerite <git@hallerite.com>
Date: Wed May 20 10:58:46 2026 +0200
[Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330)
Signed-off-by: hallerite <git@hallerite.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 6f21558da1ec7362d2b4f3d012bce2b612a74459
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Wed May 20 16:54:58 2026 +0800
[XPU][CI] Add 2 server model test files in Intel GPU CI (#42499)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
commit 1cb224430bea0d037b57e24cf91001f47b69ddf3
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Wed May 20 11:46:55 2026 +0300
[GDN] Enable FI Blackwell GDN prefill kernel (#40717)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit 9b343dd4f54a9870f3ba1e41f5a5b3f4a1e25340
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Wed May 20 17:10:00 2026 +0900
Enable mermaid diagrams in the docs (#43192)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 07aeaf9d4df870a76d5a0dc19d6a7e74b4be5d3b
Author: Chris Leonard <chleonar@redhat.com>
Date: Wed May 20 03:18:12 2026 -0400
[6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
commit 40651c020772b80f9ca80272aebe749fe01cd38a
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Wed May 20 09:02:36 2026 +0200
[Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 7e4bc2cecb3a8aede2d10c86a3a1a4bd98e26100
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Wed May 20 08:58:25 2026 +0200
[Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 85959567c3e71a9965616ebebe1853ca48d8d20f
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Tue May 19 23:01:41 2026 -0700
[ci] Revert model executor test back to L4 (#43188)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
commit 4f940896a32c9e2a0eba7f50d521bf5f6b4de458
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Wed May 20 06:32:08 2026 +0300
[KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit cd0ff26e7acf2c691a33d4c44276db6980bab24b
Author: Michael Goin <mgoin64@gmail.com>
Date: Tue May 19 23:21:01 2026 -0400
[CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit 2ae910ed88121d7c3acdcb9bab14cd968257b6e6
Author: Izik Golan <47969623+izikgo@users.noreply.github.com>
Date: Wed May 20 06:16:07 2026 +0300
[Perf] Avoid forward scan for async output placeholders (#42938)
commit fadf5d332c6e9bb6e552c1ca529511bce0f79802
Author: pmaybank <113125070+pmaybank@users.noreply.github.com>
Date: Tue May 19 23:16:02 2026 -0400
add enqueue all option to throughput benchmark (#42975)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit c628a93a64fb4929c3c11d8e2c7244c4826b4f76
Author: Benjamin Chislett <bchislett@nvidia.com>
Date: Tue May 19 23:15:57 2026 -0400
[Perf][Bugfix] Update dflash aux layer indexing (#40727)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
commit 5774aaed0cbeaa74ca7a75d372c1e8bd4aa11cdb
Author: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com>
Date: Tue May 19 22:32:06 2026 -0400
[Cohere] Enable Cohere MoE (#43143)
Signed-off-by: Terrencezzj <terrence@cohere.ai>
commit 39bba710bed5b6018718af3e0fd7984f6082118e
Author: Nick Hill <nickhill123@gmail.com>
Date: Tue May 19 19:19:05 2026 -0700
[MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit 73dd2f33b7a5a8a237fe7296039cec246e4c68bd
Author: Aaron Hao <ahao@anyscale.com>
Date: Tue May 19 18:01:29 2026 -0700
[bug] fix WeightTransferConfig.backend to allow for all strings (#43121)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
commit be16785998087f80ffac08b980603241e5da16ab
Author: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Date: Wed May 20 00:31:15 2026 +0100
[CPU][DOC] Fix installation commands for Arm CPUs (#43115)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
commit 117afeea4665367a3066c1df58d4082d07fcc946
Author: Max de Bayser <mbayser@br.ibm.com>
Date: Tue May 19 17:27:54 2026 -0400
Fix error in Dynamic NTK scaling (#41277)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 12421962955ac28b6f80a0307f554fad939174dd
Author: Doğaç Eldenk <dogacel@gmail.com>
Date: Tue May 19 15:39:00 2026 -0500
[Model] Support post-norm architecture for EAGLE-3 speculators (#42764)
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
commit a65093c1a39a8ddd8455365128ecbe259350e22c
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Tue May 19 11:51:34 2026 -0700
[ci] Move language models tests (hybrid) back to L4 (#43129)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
commit 9aaf83ef502fc37bc647f6e474314d48ba36cd1c
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Tue May 19 14:44:32 2026 -0400
[CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f54721bcc3e072d71b0e09c0b0bd6d692eb06161
Author: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Date: Tue May 19 21:43:04 2026 +0300
[Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
commit aed2eb355a9d9136c8e17690b932983b55fb343f
Author: Dao007forever <dao007forever@gmail.com>
Date: Tue May 19 11:14:43 2026 -0700
[Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit d247a931cc25e7253feccbd6260d48216ff5c081
Author: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Date: Tue May 19 17:02:05 2026 +0100
[feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
commit 8200fbe1ac73f00a46b1cdd6c4c93bdaf2c33022
Author: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Date: Tue May 19 23:36:47 2026 +0800
[Misc] add humming to dependencies (#42540)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
commit 42b4f1fdf7269de8aa83755a805555fe78add28b
Author: Flora Feng <4florafeng@gmail.com>
Date: Tue May 19 11:21:12 2026 -0400
[Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 1c6158083a6fc3aff408660d2defd7602f78f556
Author: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
Date: Tue May 19 23:17:42 2026 +0800
[Model] Openvla support (#42654)
Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
commit d740e2c02919cfba5a86a40d1c12439d03f5ac07
Author: Xinyu Chen <xinyu1.chen@intel.com>
Date: Tue May 19 23:09:07 2026 +0800
[XPU] update xpu graph usage (#43043)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
commit b82e908b4c65a1f162e2d35a8106f09d95d8aa02
Author: Nick Hill <nickhill123@gmail.com>
Date: Tue May 19 07:35:54 2026 -0700
[Perf][4/n] Eliminate various GPU<->CPU syncs (#42347)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
commit a78b842d0e85d287176031334f4721cd96b6e47d
Author: Sage <80211083+sagearc@users.noreply.github.com>
Date: Tue May 19 13:21:49 2026 +0300
[Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
commit 129019f3342f1b7346ed8f4c1ac9fdefd8fe6ef8
Author: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Date: Tue May 19 05:44:33 2026 -0400
[CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
commit ef54a4d604ef3725bd52aa2893f71d671bf5329a
Author: Shanshan Shen <467638484@qq.com>
Date: Tue May 19 16:43:16 2026 +0800
[Misc][MM] Remove redundant code in CLIPAttention (#43046)
Signed-off-by: shen-shanshan <467638484@qq.com>
commit 07beaed8422d2df34a20e8ebd22b7924d563a566
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Tue May 19 01:12:46 2026 -0700
[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 056bc2e16646599a96ac94e761c953e680e6fba9
Author: Yifan Qiao <yifanqiao@inferact.ai>
Date: Tue May 19 01:07:46 2026 -0700
[KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
commit f34623bf3cac5b33451a761e802c9531e83d1c68
Author: Aaron Hao <ahao@anyscale.com>
Date: Tue May 19 01:06:21 2026 -0700
[bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b14be81c1f63b70668d26d65a377b6383fbca936
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Tue May 19 00:52:54 2026 -0700
[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 301d986473a0ffc1df563422e01eac4a1efd59e0
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Tue May 19 15:37:40 2026 +0800
[Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 257af77bc2b612d5ebd0aecea777139036543af3
Author: wang.yuqi <yuqi.wang@daocloud.io>
Date: Tue May 19 14:43:18 2026 +0800
[Docs] Reorganize online serving docs. (#41907)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 4a4fdabe28f3e2c8f9d05bcc80c4bf6d656b1ead
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Tue May 19 01:16:42 2026 -0500
[Misc] Aligning tokwise pooler heads for consistency (#43041)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit f1e3f0e6d685082bdb313c20914099ac5ede5f14
Author: Chaojun Zhang <chaojun.zhang@intel.com>
Date: Tue May 19 14:14:59 2026 +0800
[XPU] Use custom op collective behavior (#41354)
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit 9fd8487d2f56468aeec8154123641eb7c2eeacdf
Author: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com>
Date: Tue May 19 13:50:38 2026 +0800
[Docs] Add SVG images for pooling models. (#42626)
Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 27f4ba94811ef14bd45bcdc0c0b8e288a7cc6bc6
Author: Junyan Xu <junyanxu5513@gmail.com>
Date: Mon May 18 22:29:04 2026 -0700
fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671)
Signed-off-by: junyanxu <junyanxu5513@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 6e889b582b6a0b11f22b3764be174266faa9ff5e
Author: Kevin H. Luu <khluu000@gmail.com>
Date: Mon May 18 21:58:36 2026 -0700
[ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit fab07e4d0f7f266643c6ac0dc944f9f433ef2140
Author: Qiuyang Yue <yueqiuyang1389@gmail.com>
Date: Mon May 18 21:22:33 2026 -0700
[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289)
Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist <noreply@google.com>
commit 3ca8db2ef88ec5a6686e62ee3ac899afae85c7af
Author: gnovack <gnovack@amazon.com>
Date: Mon May 18 21:17:56 2026 -0700
add cutedsl dsv4 indexer fp8 kernel (#42899)
Signed-off-by: george <george@inferact.ai>
Co-authored-by: george <george@inferact.ai>
commit 87b08c5f6460cf487e47872c5fbc2595c97e74ef
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Mon May 18 21:00:58 2026 -0700
[Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit fba010dd74e2f94e4f7223b164ec9097d1b8a6af
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Tue May 19 05:25:41 2026 +0200
[Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
commit da03e549b34685c4e63a091e973d907aee48a68c
Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Date: Tue May 19 11:25:37 2026 +0800
[UX] Add a persistent cache for FlashInfer autotuning (#42537)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
commit 36dcaf25d8e091ea0f47b9ce7dcfca05de56f16d
Author: Kunshang Ji <kunshang.ji@intel.com>
Date: Tue May 19 03:17:09 2026 +0000
[XPU] add gptq(int4) support (#37844)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
commit 8f16c4a5c0feb01f106e5981f22ae8808a94a28b
Author: Ofir Zafrir <ofir.zafrir@intel.com>
Date: Tue May 19 06:16:07 2026 +0300
[BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#42468)
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com>
commit afd7b1dce94fed484351fafd5bf5ea6601ac621e
Author: Revital Sur <eres@il.ibm.com>
Date: Tue May 19 06:12:04 2026 +0300
[Bugfix] Use platform-agnostic device in example_connector load (#42926)
Signed-off-by: Revital Sur <eres@il.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 287471b99442b44c5a16c4d70b0f3e178dd52732
Author: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Date: Mon May 18 19:50:02 2026 -0700
[Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
commit 239b5ff30cf46f9196149c888a20be2096fdff03
Author: Michael Goin <mgoin64@gmail.com>
Date: Mon May 18 20:22:27 2026 -0400
[Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (#42476)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit f85c76d701fc049a722c17b3affd9401380be1bf
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Tue May 19 02:58:15 2026 +0300
[CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
commit a171e6b52dff47dc567657e7d51f641bdcb22774
Author: shanjiaz <zsjwpianpian@gmail.com>
Date: Mon May 18 19:39:09 2026 -0400
Add parallel drafting to v2 model runner unsupported features (#43010)
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
commit 37ece593c105b5bb818aa94885617b863d390d7f
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 19:38:12 2026 -0400
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7% e2e performance improvement (#42774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 57fef4e0bf0bfaddf117dfdc9367e1fb957b423f
Author: Flora Feng <4florafeng@gmail.com>
Date: Mon May 18 17:55:39 2026 -0400
[Refactor] Extract shared coerce_to_schema_type utility from Minimax M2 tool parser (#43006)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
commit 0191354827560fe38f68b4e7207f8824d6152ca3
Author: haosdent <haosdent@gmail.com>
Date: Tue May 19 05:29:10 2026 +0800
[Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885)
Signed-off-by: haosdent <haosdent@gmail.com>
commit cd49a05d5aa3cc296912297b3c2b577efe4183c8
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 16:41:22 2026 -0400
[Refactor] Remove dead code (#42889)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 84747489ded65265ee7d43815bfa3373b0d42279
Author: Ronen Schaffer <ronen.schaffer@ibm.com>
Date: Mon May 18 22:41:58 2026 +0300
Tier offload followup (#42529)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
commit 8fc1c284b94668b60c30737e178cb7e6cd651e89
Author: Tuukka Sarvi <tuukka.sarvi@amd.com>
Date: Mon May 18 21:56:22 2026 +0300
[ROCm] Guard AITER GDN decode fast path by layout (#42880)
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
commit ce88f01c9ac4fcde9dd43a983074d4e893cde65d
Author: Amit Portnoy <1131991+amitport@users.noreply.github.com>
Date: Mon May 18 21:22:56 2026 +0300
[Docs] update attribution to reflect EDEN foundation (#41666)
Signed-off-by: amitport <1131991+amitport@users.noreply.github.com>
commit 00e20e76f775b88f47469ae9fcb0f1ecd7580bb9
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 14:14:21 2026 -0400
[Refactor] Remove dead cuda kernels (#42767)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 9758a6e5c5a556275c030db456d5d434ee999d58
Author: czhu-cohere <conway.zhu@cohere.com>
Date: Mon May 18 11:12:06 2026 -0700
[BugFix] support PP for Cohere vision model (#42819)
Signed-off-by: <conway.zhu@cohere.com>
Signed-off-by: root <conway.zhu@cohere.com>
commit a2c8fc66573664395f491a94da1882fdf92e034b
Author: Bowen Bao <bowenbao@amd.com>
Date: Mon May 18 10:46:13 2026 -0700
[ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436)
Signed-off-by: Bowen Bao <bowenbao@amd.com>
commit 6859ca76159fdd403b687c0c296e5a12850ba24e
Author: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Date: Tue May 19 01:32:26 2026 +0800
[Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#42541)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit 67f58ce23f469e118688a50687ef0fbb14a1c028
Author: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Date: Tue May 19 01:02:01 2026 +0800
[Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
commit 8c296de63b47664fc5979831e1ae2d2a14a05b1a
Author: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Date: Mon May 18 12:12:27 2026 -0400
[Perf] Re-enable flashinfer autotune by default and cleanup (#42857)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
commit b12745e4f31ffacf401cc20a97c592d6a49f3269
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Tue May 19 00:56:09 2026 +0900
Fix `--convert` passed without `--runner` on causal models (#42935)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit e26736973a1981dbb4054dc1ac430e78d8006ef2
Author: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon May 18 11:27:21 2026 -0400
[Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors must match` error (#42778)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
commit 47829b1159335a010521ea3e5361d51744a36b0a
Author: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Date: Mon May 18 18:26:00 2026 +0300
[Bugfix] mamba: run single-token extends as decodes (#42430)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
commit 4a39b4f55374d48ebaa2ca02312e24639db8e0b8
Author: Blanc Swan <85233612+blancsw@users.noreply.github.com>
Date: Mon May 18 17:20:04 2026 +0200
[Model] Add Apertus Tool Parser (#41154)
Signed-off-by: Blanc <swan.blanc@infomaniak.com>
commit 78e7a7b9b0b9c285bf6978c3fc09eeecea3ff230
Author: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com>
Date: Mon May 18 08:02:43 2026 -0700
Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483)
Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com>
Signed-off-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com>
Co-authored-by: Robert Shaw <robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit f5d3dc7115cf77472ba5e274f6becbbeddbf4bd5
Author: Michael Goin <mgoin64@gmail.com>
Date: Mon May 18 10:26:07 2026 -0400
[Model Runner v2] Support update_config (#42783)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 1ac10f159a09897baada01b14b6a0dd6442aefd6
Author: vllm-agent <claw@inferact.ai>
Date: Mon May 18 06:02:51 2026 -0700
Revert "[torch.compile] Add patch for fullgraph compilation" (#42686) (#42913)
Co-authored-by: Luka Govedič <luka.govedic@gmail.com>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
commit e5417657e55ec2f42809816e4aa5c9753f390cdd
Author: liranschour <liranschour@users.noreply.github.com>
Date: Mon May 18 15:59:42 2026 +0300
[KV Connector][Offloading] Flush all pending jobs on last step (#42611)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 2e40faf08b2cae4ff6e27a255fe10833365de0e8
Author: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Date: Mon May 18 20:34:48 2026 +0800
[XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_and_lora[1] in Intel GPU CI (#42954)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
commit 69c91d010a596bb74b553fe157497a1fd6edb47c
Author: Nicolò Lucchesi <nlucches@redhat.com>
Date: Mon May 18 14:34:16 2026 +0200
[MRv2] Default to MRv1 when a connector is present (#42955)
Signed-off-by: NickLucche <nlucches@redhat.com>
commit 737bfa3a43ce386bd1894792f3302d9f3f9d73fa
Author: roikoren755 <26850796+roikoren755@users.noreply.github.com>
Date: Mon May 18 14:54:00 2026 +0300
[Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative decoding crash (#41233)
Signed-off-by: Roi Koren <roik@nvidia.com>
commit e414e1f1c020108593526b706efaf89e427c05a2
Author: Kfir Toledo <kfir.toledo@ibm.com>
Date: Mon May 18 14:36:02 2026 +0300
[Bugfix][KV Offload] count appended GPU blocks in store group_sizes (#42945)
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
commit df852ed503ac1a79e568271cd6f136a7b2698f5e
Author: inisis <desmond.yao@buaa.edu.cn>
Date: Mon May 18 18:33:29 2026 +0800
fix: remove unused norm for dpskv4 (#41710)
Signed-off-by: inisis <desmond.yao@buaa.edu.cn>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
commit 88a860d7545aad69661daad7a1c2b04f59c76144
Author: Yuwen Zhou <yuwen.zhou@intel.com>
Date: Mon May 18 18:04:45 2026 +0800
[CPU] Add MXFP4 W4A16 MoE support (#41922)
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yuwen Zhou <yuwen.zhou@intel.com>
commit cac81b6eda418fb5ca86b81197914dd02666353e
Author: Tianmu Li <tianmu.li@intel.com>
Date: Mon May 18 03:04:41 2026 -0700
[CPU Backend] Improve cpu thread utilization (#42666)
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit b4601ad43ff7ff2b9e2f52379144481e45bcf6c5
Author: Li, Jiang <jiang1.li@intel.com>
Date: Mon May 18 18:04:36 2026 +0800
[CPU] Add fused GDN support for AMX CPU platform (#42707)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
commit 2267f70070bdee8057b4afae69cba9b847add587
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Mon May 18 18:04:31 2026 +0800
[Kernel] Pack topk id/weights triton kernel (#42527)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
commit 965d076148326f4511b6b832cbe7d974db74dbe9
Author: Tony Lin <tony.lin@intel.com>
Date: Mon May 18 17:38:54 2026 +0800
[CPU] Specify required KV cache layout for CPU attention backend (#42740)
Signed-off-by: Tony Lin <tony.lin@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
commit c38bed4248e97e5ed981569777d035d31ace5368
Author: wenjun liu <wenjun.liu@intel.com>
Date: Mon May 18 16:36:45 2026 +0800
delete xpu ci (#42582)
Signed-off-by: wenjun.liu <wenjun.liu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
commit 998714b21b413c78db8eb7af7f384dc90c0b10dc
Author: Xin Yang <105740670+xyang16@users.noreply.github.com>
Date: Mon May 18 01:32:46 2026 -0700
[Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849)
Signed-off-by: Xin Yang <xyangx@amazon.com>
commit 9537542537728af9fac418ecf1604ad8e8d9ff93
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Mon May 18 17:31:06 2026 +0900
Revert checkpoint specific workaround in Transformers modelling backend (#42923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 5ab6d1b3fd407404cd78488bf6f4cbcde6d912b7
Author: Rishapveer Singh <singhrishapveer@gmail.com>
Date: Mon May 18 10:14:36 2026 +0200
[Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311)
Signed-off-by: Rishapveer Singh <singhrishapveer@gmail.com>
commit 7d5b033782681acee274f4f379c9fadc557fd7e8
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Mon May 18 15:22:26 2026 +0800
[LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit e3aeee5ff8bf7e89fea231d2a965701248eb43c0
Author: Nguyễn Thế Duy <nduy250299@gmail.com>
Date: Mon May 18 14:17:53 2026 +0700
[Bugfix] moe lora align kernel grid (#40131)
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
commit c1f7854342d1e80f7f2406524d242b8ee5476d6d
Author: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Mon May 18 15:33:32 2026 +0900
Improve logging when docs build is skipped (#42929)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 23c15acd770cf16ed36c6d3fed8e7d78db7d5282
Author: gaozihao-shy <gaozihao3@huawei.com>
Date: Mon May 18 13:07:16 2026 +0800
[BugFix] Kimi-K2.5: skip vision tower dtype conversion when using quantization (#42869)
Signed-off-by: gaozihao-shy <gaozihao-shy@users.noreply.github.com>
Signed-off-by: gaozihao <gaozihao3@huawei.com>
commit b50646e5effd7cb5884cd96fdff4c53c18521198
Author: Andreas Karatzas <akaratza@amd.com>
Date: Sun May 17 22:57:59 2026 -0500
[ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 990f49bdcb8ff51c0ceb1d784c3ca16e6c276927
Author: Soyaazz <523420504@qq.com>
Date: Mon May 18 11:19:13 2026 +0800
[MM][CG] Enable encoder Cudagraph for Step3VL (#42224)
Signed-off-by: JisoLya <523420504@qq.com>
Signed-off-by: Soyaazz <523420504@qq.com>
commit 107210442da1bc6985bfa615b55e1e5c2dd98958
Author: Alec <35311602+alec-flowers@users.noreply.github.com>
Date: Sun May 17 19:11:46 2026 -0700
[CI] Add NIXL EP import canary (#42567)
Signed-off-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
commit 03ddc1c9bc5e448e0da6236268a611d7d001dbae
Author: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
Date: Mon May 18 09:57:04 2026 +0800
[Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M2) (#42497)
Signed-off-by: qianlihuang <yiliu.dong@qq.com>
Signed-off-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com>
Co-authored-by: qianlihuang <yiliu.dong@qq.com>
commit 966903eb93a053a908fbf8b931fcebfb28c4741a
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date: Sun May 17 15:49:16 2026 -0400
[torch.compile] Add patch for fullgraph compilation (#42686)
Signed-off-by: Luka Govedič <luka.govedic@gmail.com>
commit 599e75f432e5fd7c77e65dc95587f3441201bdbc
Author: TJian <tunjian.tan@embeddedllm.com>
Date: Mon May 18 00:18:50 2026 +0800
[ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
commit 1c8e9c0399f6a6a98f406dce5947a2ad318e195a
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Sun May 17 09:40:21 2026 -0500
Refactor: Pass num_labels explicitly to PoolerClassify instead of reading from global config (#42851)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit 0fa888465e5a30b797bdf2cdcd0f57fc77541cef
Author: zofia <110436990+zufangzhu@users.noreply.github.com>
Date: Sun May 17 16:55:10 2026 +0800
[XPU] fix weight scale shape (#42725)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
commit ff712f6447093d07747c88680b9d006b119f5890
Author: liuzhenwei <zhenweiliu@habana.ai>
Date: Sun May 17 12:15:50 2026 +0800
[MRV2][XPU] add Model Runner V2 log (#42710)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
commit 504a26ce2be2415118b73966480b4fc04d9b7bf8
Author: Qi Zhou <qizzzh@google.com>
Date: Sat May 16 17:54:58 2026 -0700
Support bf16 for mamba ssm cache (#41680)
Signed-off-by: Qi Zhou <qizzzh@google.com>
commit a94189295b8b9c1d952be438b49ed5793db59159
Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com>
Date: Sun May 17 08:54:27 2026 +0800
Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer (#42716)
Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
commit 0867497368f390212a3f9684e2e05f698f8d1149
Author: Artem Perevedentsev <aperevedents@nvidia.com>
Date: Sun May 17 00:55:12 2026 +0300
[CI/Build] Bump flashinfer to v0.6.11.post2 (#41711)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
commit 36e74c9ea4feb5ade38ffa1ea96f24dd73316e02
Author: Zhewen Li <zhewenli@meta.com>
Date: Sat May 16 13:34:15 2026 -0700
[KV Connector] Support disk offloading in MooncakeStoreConnector (#42689)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 787bc0d0313840c16e403dfa2d135781d41d3614
Author: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Date: Sat May 16 14:58:16 2026 -0400
Add unit tests for pooler activation functions (#42824)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
commit d1586e1a1242754d2f6ac51f4f16680f7d4b129b
Author: weizhoublue <45163302+weizhoublue@users.noreply.github.com>
Date: Sun May 17 01:02:54 2026 +0800
Fix: Propagate pinned model revisions into Ultravox secondary weight loading (#42830)
commit 8a56da3845270837424ef4b7ee83ca97a7883025
Author: Jiangyun Zhu <riverclouds.zhu@qq.com>
Date: Sat May 16 22:04:12 2026 +0800
[Experimental] Breakable CUDA graph (#42304)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
commit 4db300e95fd29f5b1a4a7c34f4fbe91b7e9abb24
Author: Andreas Karatzas <akaratza@amd.com>
Date: Sat May 16 04:35:05 2026 -0500
[ROCm][CI] Removed problematic command override mechanism (#42807)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
commit 657b42b5922d21fef00529144ef5bb5633ad04b1
Author: Zhewen Li <zhewenli@meta.com>
Date: Sat May 16 00:26:25 2026 -0700
[Docker][KVConnector] Build mooncake-transfer-engine from source (#42114)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: khluu <khluu000@gmail.com>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: khluu <khluu000@gmail.com>
commit 32b7177909d1c9928bcedd81de7de5a1fa21d2b3
Author: Jee Jee Li <pandaleefree@gmail.com>
Date: Sat May 16 11:22:35 2026 +0800
[LoRA][Bugfix] Dedup LoRA wrapping for modules referenced from multiple attribute paths (MoE gate) (#42757)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
commit 39c67d714ef091df1533181bdc3df82dc9ac3e07
Author: DustHunter <dusthunter@126.com>
Date: Sat May 16 09:29:27 2026 +0800
fix: add API key authorization to /v2 endpoints (#42594)
Signed-off-by: DustHunter <dusthunter@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
commit 87a2adcb43513ead1434aff03a535d86f56f768b
Author: Viktor Pus <viktorpus@tenstorrent.com>
Date: Sat May 16 02:44:48 2026 +0200
[Misc] Add common random prefix option to structured-output serving benchmark (#41632)
Signed-off-by: Viktor Pus <viktorpus@tenstorrent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 852f567444cf8c206219edb7b2c42aec55fc41cf
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 20:15:52 2026 -0400
[Bugfix] Respect explicit --kv-cache-dtype over checkpoint kv_cache_scheme (#42782)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit b2a27b82d970efa0203c06be6dc0d94526edaab0
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 20:07:39 2026 -0400
[Kernel][UX] Add `--linear-backend` arg for linear kernel selection (#39538)
Signed-off-by: mgoin <mgoin64@gmail.com>
commit d0921bafeff9bbe7a7b4efef6371700e69224702
Author: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com>
Date: Fri May 15 16:20:33 2026 -0700
[Bugfix] Unwrap VLM wrappers for EPLB on Model Runner V2 (#42706)
commit 1ccdf87507407cb02460ec2e7a3e1a4cac9b0a4a
Author: rasdani <73563550+rasdani@users.noreply.github.com>
Date: Fri May 15 15:20:53 2026 -0700
[Bugfix] Fix layerwise reload alias-buffer corruption (#42481)
Signed-off-by: rasdani <73563550+rasdani@users.noreply.github.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
commit bd9dbe60601c986b50260f299fe279d057d7d89f
Author: Rita Brugarolas <Rita.BrugarolasBrufau@amd.com>
Date: Fri May 15 13:50:03 2026 -0700
[ROCm][Bugfix] Fix fused_mla_dual_rms_norm for AITER API rename _fused_qk_rmsnorm (#42606)
Signed-off-by: Rita Brugarolas Brufau <rita.brugarolasbrufau@amd.com>
commit de2d76f35239c58202e49469dc5524b6f6fc4ffb
Author: Michael Goin <mgoin64@gmail.com>
Date: Fri May 15 16:46:16 2026 -0400
[Build] Switch CUDA 12.9 wheel builds to PyTorch manylinux_2_28 base (#41668)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
commit 9a7a273dfe6a89bbe00639fe99b0d61095fbc40a
Author: Sergei Skvortsov <yvorott@gmail.com>
Date: Fri May 15 21:01:21 2026 +0100
Add HumanEval and GSM8K benchmarks to datasets (#42648)
Signed-off-by: southfreebird <yvorott@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
commit b2c58ee9427f15563210e184c57a6e530f37e464
Author: Lanze Liu <86434077+liulanze@users.noreply.github.com>
Date: Fri May 15 12:34:59 2026 -0700
[FlashAttn] Fix supports_kv_cache_dtype() accepting unhandled fp8 kv-cache dtype variants (#42685)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
commit 4d67d3bde25f94b6199ce16c7ef239ae4412bb8f
Author: frida-andersson <fanders…
Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
* [XPU] add gptq(int4) support (#37844) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [UX] Add a persistent cache for FlashInfer autotuning (#42537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * add cutedsl dsv4 indexer fp8 kernel (#42899) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> * [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Docs] Add SVG images for pooling models. 
(#42626) Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [XPU] Use custom op collective behavior (#41354) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Aligning tokwise pooler heads for consistency (#43041) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Docs] Reorganize online serving docs. (#41907) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Consolidate beam search by BeamSearchMixin. 
(#42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KVConnector][DSV4] HMA support for Mooncake store connector (#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Misc][MM] Remove redundant code in CLIPAttention (#43046) Signed-off-by: shen-shanshan <467638484@qq.com> * [CI] Add MTP + PD disagg test for Qwen3.5 (#42677) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [XPU] update xpu graph usage (#43043) Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> * [Model] Openvla support (#42654) Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> * [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Misc] add humming to dependencies (#42540) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> * [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976) 
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> * [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Move language models tests (hybrid) back to L4 (#43129) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Model] Support post-norm architecture for EAGLE-3 speculators (#42764) Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> * Fix error in Dynamic NTK scaling (#41277) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [CPU][DOC] Fix installation commands for Arm CPUs (#43115) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> * [bug] fix WeightTransferConfig.backend to allow for all strings (#43121) Signed-off-by: ahao-anyscale <ahao@anyscale.com> * [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Cohere] Enable Cohere MoE (#43143) Signed-off-by: Terrencezzj <terrence@cohere.ai> * [Perf][Bugfix] Update dflash aux layer indexing (#40727) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * add enqueue all option to throughput benchmark (#42975) Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Perf] Avoid forward scan for async output placeholders (#42938) * [CI] Add DSV4-Flash to gsm8k 
moe-refactor/config-b200.txt (#42111) Signed-off-by: mgoin <mgoin64@gmail.com> * [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ci] Revert model executor test back to L4 (#43188) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099) Signed-off-by: NickLucche <nlucches@redhat.com> * [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097) Signed-off-by: NickLucche <nlucches@redhat.com> * [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * Enable mermaid diagrams in the docs (#43192) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [GDN] Enable FI Blackwell GDN prefill kernel (#40717) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499) Signed-off-by: zengxian <xiangdong.zeng@intel.com> * [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330) Signed-off-by: hallerite <git@hallerite.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326) Signed-off-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> * [Feature] Support manually enabling the cumem allocator (#33648) Signed-off-by: Kebe <mail@kebe7jun.com> * [Spec Decode] Support non-MTP speculation for NemotronH (#43130) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * Remove additional dead code as a follow-up to #42889 (#43144) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> * [Bug][Structured Outputs] Fix bug 
that leads to unconstrained generations with structural tags (#42452) Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215) Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [R3] Add routed experts to openai entrypoint (#38939) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com> * Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082) Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> * [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add dllehr-amd to CODEOWNERS and committers list (#42772) Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> * [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230) 
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> * [ROCm] Add QuickReduce min-size override and codec threshold (#41675) Signed-off-by: <> * [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255) Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * update GPU json file based on h200 recipes (#43262) Signed-off-by: louie-tsai <louie.tsai@intel.com> * [Minor] Bigger overlap for FI AR (#43103) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329) Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082) Signed-off-by: haosdent <haosdent@gmail.com> * [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973) Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: <> Signed-off-by: sejung-son <sejung.son@nhn.com> 
Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [Frontend][RFC] Rust front-end integration (#40848) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905) Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> * [Benchmark] Add num-warmup to vllm bench throughput (#43245) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [CI] De-flake test_models for bigscience/bloom-560m (#43197) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU] add setuptools-rust for xpu dependency (#43287) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * Update KDA chunk prefill decay to use exp2 semantics (#43195) Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> * Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223) Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> * [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: 
Isotr0py <Isotr0py@outlook.com> * [CI] Pin protoc binary in rust-build stages (#43292) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266) Signed-off-by: zengxian <xiangdong.zeng@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943) Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CI] Fix CPU tests failing on `tl.exp2` import (#43311) Signed-off-by: haosdent <haosdent@gmail.com> * [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079) Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> * [Perf] `zeros` -> `empty` to remove additional fill (#42988) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Core] Add native ModelExpress load format (#43105) Signed-off-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * Disable build isolation to bypass CUDA related deps for 
vllm-tpu (#43038) Signed-off-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [Frontend] Rework fastokens integration (#43168) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968) Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BugFix] Use correct logprobs for `logprob_token_ids` (#43125) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873) Signed-off-by: Lanze Liu <lanzetech@gmail.com> * [Rust Frontend] Move code from `vllm-frontend-rs` (#43283) Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eric Curtin <eric.curtin@docker.com> Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Will.hou <willamhou@ceresman.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Will.hou <willamhou@ceresman.com> Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history. 
* [CI] Fix dockerfile dependency graph failure for pre-commit (#43378) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855) Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> * [ROCm] Add XGMI backend for MoRI Connector (#41753) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [ROCm][CI] add warmup to mem_util test before measurement (#43236) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Frontend] Add truncation side to OpenAI endpoints (#43260) Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> * [Frontend] DP Supervisor (#40841) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Make CuMemAllocator free callback stream-aware (#43020) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [XPU] Enable multiple key kernels for sparse attention (#37888) Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064) Signed-off-by: haosdent <haosdent@gmail.com> * [Model] Use `AutoWeightsLoader` for Voyage (#42972) Signed-off-by: Furkan Fidan <dev@yufufi.com> * [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213) Signed-off-by: tc-mb <tianchi_cai@icloud.com> * [CI] Fix 
test_lora_with_spec_decode on V2 model runner (#43314) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296) Signed-off-by: haosdent <haosdent@gmail.com> * Correcting the mock classes for MM GC tests (#43321) Signed-off-by: Weida Hong <wdhongtw@google.com> * [BugFix] Fix setuptools-rust dep in requirements files (#43377) Signed-off-by: Nick Hill <nickhill123@gmail.com> * Fix the docker build failure in tpu-inference (#43360) Signed-off-by: mrjunwan-lang <mrjunwan@google.com> * [Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [CPU] Experimentally enable Triton and MRV2 (#43225) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Attention] Mamba attention module refactor (#41126) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Replace assert with proper exceptions for security and validation in pooling (#43286) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001) Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> * [BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118) Signed-off-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Keyi Li <likey6688@gmail.com> * [LoRA] Reduce memory of 2D weights when EP is set (#42737) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [EPLB] Change default EPLB communicator 
(#43110) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> * [CI] Fix AMD docker build tests (#43329) Signed-off-by: haosdent <haosdent@gmail.com> * Add NVFP4 MOE support for Deepseek V4. (#42209) Signed-off-by: Shiyang Chen <shiychen@nvidia.com> * [Multimodal] Simplify ViT CUDA graph interfaces (#41234) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650) Signed-off-by: zhanda <zhandazhu@gmail.com> Co-authored-by: Shang Wang <shangw@nvidia.com> * [KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149) * [Frontend] Simplify AuthenticationMiddleware path extraction (#43426) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from sync path (#40733) Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> * [Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GDN (#43427) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs (#36854) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * mhc_post - remove sts & add vectorized copies (#43437) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Quantization][ModelOpt] W4A16 NVFP4 
fused MoE + mixed-precision dispatch (#42566) Signed-off-by: Juhi Mittal <juhim@nvidia.com> * [Model Runner V2] Support sharing kv cache layers (#35045) Signed-off-by: Nick Hill <nickhill123@gmail.com> * DSv4 fused Q-norm kernel grid refactor (#42353) * [Perf] Optimize hidden state extraction logic (#37374) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [XPU]fix: add XPU platform guards to DeepSeek-V4 ops (#42950) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * elastic_ep: stage/commit MoE quant method on reconfigure (#40881) Signed-off-by: Itay Alroy <ialroy@nvidia.com> * [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend (#38822) * Add `model` to `WeightTransferEngine.__init__` (#42922) Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSV4] More multi-stream enablement for c4a (#42925) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Stabilize runner teardown between sampler tests (#43023) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal models (#43051) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> * [ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs (#41577) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU]feat: enable FP8 block-scaled quantization on XPU (#42952) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [XPU] reduce host overhead of XPU MOE (#42915) Signed-off-by: mayuyuace 
<qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Misc] Added missing return type annotations to improve mypy and IDE tooling (#43383) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi… (#42739) Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> * [ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * Keep scheduler alive for delayed KV connector frees (#43433) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpoint (#42143) Signed-off-by: FERRARIZHENG <popkart06@gmail.com> * [Kernel] Batch invariant NVFP4 linear using cutlass (#39912) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Remove benchmarks test group and shard long test groups (#41669) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43414) Signed-off-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix 
stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. 
(#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> 
Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values 
(#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) 
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> 
* [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) 
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM Signed-off-by: <callumm@amd.com>
Purpose
Fix nightly Quantized Models Test #67339 failures of `test_awq_load[gemma4-moe-*]` introduced by #43169.

That PR sized the batched vision-encoder chunk by `0.05 * total_memory` with a cost model that counted only the encoder residual stream. On a 22 GiB L4 with a 26B AWQ model loaded (~3 GiB free), the heuristic admitted `chunk ≈ 53` and OOMed inside `Gemma4VisionPatchEmbedder._position_embeddings` allocating a 2.88 GiB `F.one_hot(num_classes=position_embedding_size)` int64 buffer, which is the actual dominant transient and ~4x larger than the old cost term.

This PR sizes the chunk by currently-free GPU memory (`min(free // 2, total // 10)`) with `F.one_hot` as the dominant per-patch cost, hoists `torch.cuda.mem_get_info()` out of the per-bucket loop, and extracts the math into a pure static `_encoder_chunk` for GPU-free unit testing. Batching speedup is preserved on roomy GPUs (chunk = 92 on 80 GiB A100 with 60 GiB free vs. 17 on the failing L4).

Test Plan
Added `tests/models/multimodal/test_gemma4_mm.py` with 4 unit tests against `_encoder_chunk` (no GPU required): tight-budget allocation fits in free memory, roomy GPU keeps batching, zero-patches safe, zero-free falls back to 1.

Test Result
Unit tests (no GPU):
Integration tests (the original failing parametrizations):
`test_awq_load[gemma4-moe-standard-awq-dot-suffix]`
`test_awq_load[gemma4-moe-compressed-tensors-underscore-suffix]`
CUDA_VISIBLE_DEVICES=0)
`gemma4-moe-*` parametrizations

`profile_run` reaches the `Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 video items of the maximum feature size.` log line, the exact point that OOMed in CI, and proceeds without `OutOfMemoryError` on both SKUs.
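
For reference, the free-memory budget rule described under Purpose can be sketched roughly as follows. This is a minimal illustration, not the PR's `_encoder_chunk`: the function name `chunk_size_sketch`, its parameters, the zero-item and zero-cost handling, and the 89 MiB per-item cost are all assumptions made for the example. The cost value is picked only so that the sketch lands on the chunk sizes quoted in the description; it is not taken from the PR's cost model.

```python
GIB = 1024**3
MIB = 1024**2


def chunk_size_sketch(free_bytes: int, total_bytes: int,
                      per_item_cost_bytes: int, num_items: int) -> int:
    """Hypothetical stand-in for the chunk-sizing math; not vLLM's API."""
    if num_items <= 0:
        return 1  # nothing to encode; still return a valid chunk size
    if per_item_cost_bytes <= 0:
        return num_items  # avoid dividing by zero when the cost is zero
    # Budget rule from the PR description: min(free // 2, total // 10).
    budget = min(free_bytes // 2, total_bytes // 10)
    # Admit at least one item, and never more than are pending.
    return max(1, min(num_items, budget // per_item_cost_bytes))


cost = 89 * MIB  # illustrative assumption, not the PR's cost model
l4_chunk = chunk_size_sketch(3 * GIB, 22 * GIB, cost, num_items=128)
a100_chunk = chunk_size_sketch(60 * GIB, 80 * GIB, cost, num_items=128)
zero_free_chunk = chunk_size_sketch(0, 22 * GIB, cost, num_items=8)
print(l4_chunk, a100_chunk, zero_free_chunk)
```

Under these assumed inputs the sketch gives 17 for the tight L4 case, 92 for the roomy A100 case, and 1 when no memory is free. Keeping the math in a pure function of integers is what allows it to be checked without a GPU.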