[Refactor] Move unstreamed tool-arg flush from serving layer to parser#44017
Conversation
|
moving the flush into the parser where the streaming state lives is the right call, and the merge-into-last_tc branch is clean. but the standalone branch worries me: when there's no concurrent tool-call delta to append onto, |
Thanks for the review, your concern about the standalone branch dropping id/type/name makes sense. I think the part is actually unnecessary though. If there's no existing tool-call delta to merge into, we shouldn't be emitting a bare one without id/type/name — better to just skip the flush entirely, which matches the old serving-layer behavior where the flush only happened when there was already a delta to attach to. I've simplified _append_unstreamed_tool_args to remove the extra logic: it now only appends when delta_message already carries a tool-call delta with a .function, so id/type/name are naturally preserved. |
yzong-rh
left a comment
There was a problem hiding this comment.
One small question, LGTM otherwise
|
Tests look good, but the rest of the changes probably someone else should review. |
|
yeah merge-only is cleaner — last_tc already carries id/type/name, so appending args to it just works and there's no bare-delta path that could drop them. lgtm. |
* [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. (#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> * [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * Remove Transformers forward/backward compatibility tests (#43785) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Validate against some config fields being set to 0 (#43794) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][DFlash]allocate the proper number of lookahead slots (#43733) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> * Fix Qwen3-VL and Qwen3-omni-thinker accuracy degradation from deepstack inputs under torch.compile (#43617) Signed-off-by: Dakai An <dakaian108@gmail.com> * Add @AndreasKaratzas to CODEOWNERS (#43740) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Kernel] TRTLLM NVFP4 MoE chunking (#43599) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> * [ModelRunnerV2][Hybrid model] Support kernel block size in hybrid model (#38831) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Rust Frontend] Introduce mock engine for benchmark baseline (#43469) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Fix RunAI streamer tensor buffer reuse during weight loading (#43464) Signed-off-by: bbartels <benjamin@bartels.dev> * [MoE] Remove inplace fused experts mechanism (#43727) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [Misc][Rocm] Remove redundant `AiterUnifiedAttentionBackend` block size log (#43664) Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][CI] Stabilize Cargo cache and pre-test image checks (#43815) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * fix: parse Qwen3 XML JSON arguments first (#43243) Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Bugfix] Pass `routed_scaling_factor` to FlashInfer TRTLLM BF16 MoE (#43769) * [BugFix] Fix blocked reasoning parsing with MRV2 (#43808) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix][Frontend] streaming tool-call serializer drops first args chunk when name and args share a DeltaMessage (#42683) Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * minor docs: fix incorrect example path (#43830) Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> * [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (#43679) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * change name of fs_python secondary tier to fs. (#43600) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> * [BugFix] Fix hard-coded timeout for multi-API-server startup (#43768) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Kernel] Marlin MoE: include SM 12.x in default arch list (#40923) Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [DSV4] Remove AMD/XPU path in deepseek_v4/nvidia (#43829) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Restore `Literal` for `WeightTransferConfig.backend` (#43183) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Stream DeepSeek DSML tool-call argument deltas incrementally (#42879) Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][CI] Move workload from MI300 to MI325 (#43824) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feature] Add support for timed trace replay in `vllm bench serve` to replay Moonshot and Alibaba workload traces (#39795) Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> * [UX] Increase DP Coordinator startup timeout from 30s to 120s (#42343) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> * [Model][Bugfix] Rename weight_mapper to hf_to_vllm_mapper in LlamaNemotronVL pooling models (#43581) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> * [Bugfix][ROCm] Fix Accuracy Drop in Sparse Indexer on gfx950 (#43781) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> * [Bugfix] Fix HyperCLOVAX CI failure after upstream removed remote code (#43860) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Auto-apply `rust` label to relevant PRs (#43866) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Feature] Add structured output and effort support to Anthropic Messages API (#42396) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * Log dummy DP step in iteration details (#41406) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> * [EC Connector] Add shutdown API to EC Connector. (#42423) Signed-off-by: omerpaz95 <omerpaz95@gmail.com> * Fix `OlmoHybridForCausalLM` not initialising (#43846) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BUGFIX] Multimodal benchmark with MistralTokenizer (#42965) Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> * [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf][KDA] Fuse gate softplus, chunk-local cumsum, and RCP_LN2 scaling (#43667) Signed-off-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> * Add token-offset based selective offload in OffloadConnector (#39983) Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> * [Model Refactoring] Remove torch compile dependency in DSv4 (#43746) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency (#40344) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [CPU] Migrate cpu_awq into awq_marlin (#43841) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Rust Frontend] Add `hy_v3` tool parser (#43872) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Rust Frontend] Reduce Gemma4 tool parser args scan complexity (#43850) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [rust] fix: aggregate `is_sleeping` and `reset_prefix_cache` across DP engines (#43429) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bug] Fix `tests/distributed/test_elastic_ep.py - assert False` (#43813) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf] Add do_not_specialize to Mamba SSD chunk kernels (#43803) Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> * [Bugfix] Exclude Ray DP from #42585's deferred port allocation (#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [KV Offload] Rename `SecondaryTierManager.get_finished()` to `get_finished_jobs()` (#43870) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> * [XPU][MoE] Add WNA16 oracle backend for GPTQ sym-int4 (xpu_fused_moe) (#41426) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [ROCm] Bump ROCm to 7.2.3 (#43136) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * Add Cosmos3 Reasoner model (#43356) Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Roger Wang <hey@rogerw.io> * [Rust Frontend] Optimize multimodal prompt expansion (#43670) Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> * Allow native KV cache dtype in Triton cache update (#43330) Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> * [Attention][AMD] Standardize kv layout to blocks first for AMD (#43660) Signed-off-by: NickLucche <nlucches@redhat.com> * [ROCm] Enable the aiter top-k/top-p sampler by default (#43331) Signed-off-by: John Qin <yanyuan.qin@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [MM][CG] Avoid over-padding Qwen2.5-VL encoder cudagraph window metadata (#42796) Signed-off-by: Hua Huang <huah@nvidia.com> * Deprecate `JAISLMHeadModel` (#43784) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Feat] Add support for per GPU worker RDMA NIC selection (#42083) Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Core] Cleanup KVConnector handling with PP + fix MRV2 (#43732) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Offload] Add per-request offloading policy via `on_new_request` lifecycle hook (#43205) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model Refactoring] Remove unncessary torch op registration for DSv4 (#43891) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Spec Decode] Allow causal DFlash (#43445) * Refactor output filename handling in ci-fetch-log.sh (#43901) Signed-off-by: Michael Goin <mgoin64@gmail.com> * [AMD][CI][BugFix] Fix Distributed Compile Unit Tests (2xH100-2xMI300) group (#43120) Signed-off-by: Randall Smith <Randall.Smith@amd.com> * fix(frontend): Add multimodal placeholders to Gemma4 tool message template (#41459) Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> * [CI] Enable prefix caching in BFCL benchmark (#43925) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Model]Support Step-3.7-Flash (#43859) Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> * [Rust Frontend] Add `/version` endpoint using engine-reported value (#43854) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Misc][NUMA] Auto-bind to PCT priority cores on DGX B300 + widen EngineCore across shard NUMA nodes (#43270) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Cursor <noreply@cursor.com> * [DSv4] Move mHC tilelang kernels & Don't use CustomOP in dsv4/nvidia (#43905) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [feat] add GlmgaProcessor specific logits in `glm4_1v.py` (#43575) Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <Isotr0py@outlook.com> * Adjust design around encoder_cudagraph_forward (#42288) Signed-off-by: Weida Hong <wdhongtw@google.com> * [XPU] add scale transpose to prepare_fp8_moe_layer_for_xpu and bump up kernels (#43277) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [kv_offload] Skip decode-phase blocks in CPU offload (#43797) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> * [Refactor] Remove dead code (#43234) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717) Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [CI] Separate non-root smoke tests from image build step (#43712) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [XPU] add gelu_tanh to xpu moe backend supported activations (#42822) Signed-off-by: yintong-lu <yintong.lu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU Backend] CPU top-k and top-p sampling kernels using Triton (#43633) Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][DSv4] Remove device pipeline stall in sparse attention (#43898) Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * [Frontend]Responses API supports chat_template_kwargs (#43761) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [ROCm][CI] Fix AITER unified attention for encoder-decoder cross-attention (#43945) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU] fix xpu install document triton-xpu version (#43947) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [CI][ROCm] Don't skip MoRI-IO Connector tests (#43703) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [XPU] support MTP of gdn attention (#43565) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] Nixl+SimpleCPUOffloadingConnector unit tests (#43871) Signed-off-by: NickLucche <nlucches@redhat.com> * [Bugfix] Fix Step3 pipeline parallel KeyError for residual tensor (#37622) Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> * [Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 (#41394) Signed-off-by: JartX <sagformas@epdcenter.es> * [Bugfix] [ROCm] [DSV4] Fix AITER MXFP4 MoE weight loading and shuffle… (#42595) Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> * [ROCm][Perf] DSv3.2 MI355X TP4 decode-step orchestration cleanup (3 micro-opts) (#42982) Signed-off-by: Frida Andersson <fanderss@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Corrupted MLA + linear attention (#43961) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * Skip docs build if PR doesn't affect docs (#43972) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][CPU] Remove invalid extra deps (#43977) Signed-off-by: jiang1.li <jiang1.li@intel.com> * Add vLLM library info to Hugging Face Hub requests (#43857) Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: clarify ITL acronym in optimization docs (#43922) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * [Misc] added unit tests for the core pooling methods (#43818) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 (#43616) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] WNA16 MoE backend selection into oracle module (#42553) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [EPLB] Make async EPLB default (#43219) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bugfix] Use storage_block_size in KV cache reshape for compressed specs (DeepSeek V4) (#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Bugfix] Fix Ray placement group allocation with grouped nodes (#43998) Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> * [Bug] Fix torch device issue for MOE permute (#44005) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [CI] Make Model Executor test hangs fail fast with a traceback (#43971) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [CI] Remove redundant test_chat_with_tool_reasoning.py (#44011) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add @khluu to CODEOWNERS (#44019) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Feature] SSL support for dp supervisor (#43688) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Metrics] Exclude KV transfer tokens from iteration_tokens_total (#43346) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Fronten] Clean up stop_token_ids override for Harmony (#44009) Signed-off-by: Yifan Zong <yzong@redhat.com> * [MoE Refactor] Migrate MoeWNA16Method quantization to MK oracle (#42647) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] Remove supports_expert_map (#43108) Signed-off-by: Bill Nell <bnell@redhat.com> * [CI] Remove duplicate Harmony test coverage (#44023) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Fix smoke test step key to bypass block gate (#43974) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Revert "[MoE Refactor] Migrate MoeWNA16Method quantization to MK orac… (#44033) Signed-off-by: Bill Nell <bnell@redhat.com> * [PERF]MiniMax-M2 gate kernel (#38445) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> * offload prompt_embeds decode in render_prompts_async to avoid blocking (#43792) Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> * [Refactor] Remove dead current_tool_name_sent assignments from tool parsers (#43997) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ROCm][CI] Fix failure in the Phi3V pooling test (#44028) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] cmake: support PYTORCH_FOUND_HIP for torch 2.13 native HIP language support (#43881) Signed-off-by: nemanjaudovic <nudovic@amd.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [BugFix][Platform] Fix import vllm.platforms.rocm error on non-CUDA test_gpt_oss.py (#43571) Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype (#42379) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm] Add attention sink support to AITer flash attention backend (#43817) Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> * [Governance] Add @BugenZhao as Rust frontend code owner (#44047) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bug] Fix gemma4 MTP IMA issue when TP>1, `CUDA error: an illegal memory access was encountered` (#43909) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [MRV2] Support breakable CUDA graph (#44050) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CPU][Zen] Route W8A8 and W4A16 linear inference through zentorch on AMD Zen CPUs (#41813) Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [CI/Build] Enable Step3p7ForConditionalGeneration testing (#43956) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * docs: fix MLA attention docstring examples (#44118) Co-authored-by: nightcityblade <nightcityblade@gmail.com> * [Misc] Use VLLMValidationError consistently in chat completion and completion protocol validators (#36254) Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> * [MRV2] Remove Eagle's dedicated CUDA graph pool (#44078) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [BugFix] Fix `_has_module` to verify native deps via trial import (#44035) Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: esmeetu <jasonailu87@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Docs] Replace broken video url in examples (#44159) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CPU][RISC-V] Add missing RVV cpu_types helpers for WNA16 (#42730) Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * fix: glm5.1 pp model loading (#42944) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> * [Frontend] Resettle generative scoring entrypoint. (#44153) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Rust Frontend] Add InternLM2 tool parser (#43481) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] fix wrong partial_rotary_factor calculation for bailing_moe model. (#43770) Signed-off-by: zzt <zengzetang.zzt@antgroup.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> * [XPU][CI] Fix test_audio_in_video flake by using module-scoped server fixture (#44146) Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> * [Perf] Optimize cutlass fp8 scaled mm bypassing padding, 20% kernel performance improvement (#43706) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Feature] Add support for JetBrains' Mellum v2 code generation model (#43992) Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [Kernel][DSv4] Optimize sparse FP8 compressor kernels (#44161) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Fix and stabilize EAGLE3 acceptance tests (#41294) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> * [Rust Frontend] Support streaming `generate` endpoint (#43779) Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates (#40096) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [BugFix][CI] Fix added `_has_module` tests (#44248) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Test][BugFix] Fix double-BOS in PD+specdec acceptance test (#44234) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [DSV4] Remove unncessary classes & functions (#44246) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [ROCm][CI] Skip unbacked dynamic shapes tests on PyTorch < 2.11 (#44256) Signed-off-by: JartX <sagformas@epdcenter.es> * [DSV4] Refactor RoPE initialization (#44262) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][Mooncake] Release GPU pin on failed store in MooncakeStoreConnector (#43742) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [ROCm] Upgrade AITER to v0.1.13.post1 (#44265) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * [Bugfix][CI] Normalize NIXL connector CUDA wheel installs (#44266) Signed-off-by: Alec Flowers <aflowers@nvidia.com> * [Refactor] Move unstreamed tool-arg flush from serving layer to parser (#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Stabilize OpenAI schema fuzzing for malformed structural tags (#44131) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [BugFix] Fix TypeError in MiniCPM-O audio feature unpadding (#38053) Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> * [BugFix][kv_offload]: Prevent offloading stale sliding window blocks (#42959) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [XPU][Bugfix] Fix per_token_group_fp8_quant missing dummy args on XPU (#43930) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [MM][CG] Profile encoder CUDA graph pool memory (#41714) Signed-off-by: JooHo Lee <jooho414@gmail.com> * [Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl (#43798) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> * [Model Runner V2] Support zeroing freshly allocated KV blocks for hybrid + fp8 KVCache (#43990) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> * [Model Runner V2] Use actual batch max_seq_len for attn metadata (#43991) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> * [Refactor] Unify reasoning + tool-call parsing behind Parser.parse() (#44267) Signed-off-by: sfeng33 <4florafeng@gmail.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: jatseng-ai <jatseng@amd.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Dakai An <dakaian108@gmail.com> Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: bbartels <benjamin@bartels.dev> Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: omerpaz95 <omerpaz95@gmail.com> Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Signed-off-by: haojiangzheng <justineric096@gmail.com> Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Signed-off-by: John Qin <yanyuan.qin@amd.com> Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Signed-off-by: yintong-lu <yintong.lu@intel.com> Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: JartX <sagformas@epdcenter.es> Signed-off-by: Frida Andersson <fanderss@amd.com> Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: zixi-qi <zixi@inferact.ai> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> Signed-off-by: nemanjaudovic <nudovic@amd.com> Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: wcy <233313160abc@gmail.com> Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Signed-off-by: zzt <zengzetang.zzt@antgroup.com> Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Signed-off-by: Alec Flowers <aflowers@nvidia.com> Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Signed-off-by: JooHo Lee <jooho414@gmail.com> Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Hynek Kydlicek <kydlicek.hynek@gmail.com> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Dakai An <77474977+andakai@users.noreply.github.com> Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Benjamin Bartels <benjamin@bartels.dev> Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com> Co-authored-by: JINO ROHIT <find.jinorohit@gmail.com> Co-authored-by: tonyliu312 <56969792@qq.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: jack <QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Animesh Trivedi <animesh.trivedi@gmail.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: omerpaz95 <73347585+omerpaz95@users.noreply.github.com> Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: Angelo Ruocco <angeloruocco90@gmail.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Majid <mjtaheri68@gmail.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Matthias Gehre <matthias.gehre@amd.com> Co-authored-by: Jason Elie Bou Kheir <5115126+jasonboukheir@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Chao-Ju Chen <ricky.chen@infinirc.com> Co-authored-by: Mike G <180722391+mikekg@users.noreply.github.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: JohnQinAMD <yanyuan.qin@amd.com> Co-authored-by: Hua Huang <huah@nvidia.com> Co-authored-by: Raj Joshi <rajjoshi@g.harvard.edu> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: Cursor <noreply@cursor.com> Co-authored-by: Jared Wen <w13431838023@gmail.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Itay Etelis <92247226+Etelis@users.noreply.github.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Yintong Lu <yintong.lu@intel.com> Co-authored-by: Tianmu Li <tianmu.li@intel.com> Co-authored-by: Joaquín Mondéjar <111321569+JMonde@users.noreply.github.com> Co-authored-by: JartX <sagformas@epdcenter.es> Co-authored-by: MHYangAMD <meng-hsuan.yang@amd.com> Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> Co-authored-by: frida-andersson <fanderss@amd.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: czhu-cohere <conway.zhu@cohere.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Gagan Dhakrey <59848316+gagandhakrey@users.noreply.github.com> Co-authored-by: nemanjaudovic <152565955+nemanjaudovic@users.noreply.github.com> Co-authored-by: Liangliang Ma <liangliang.ma@intel.com> Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Xiaoran <claire.rrchen@hotmail.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Aakar Dwivedi <82587125+aadwived@users.noreply.github.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: nightcityblade <jackchen@haloailabs.com> Co-authored-by: nightcityblade <nightcityblade@gmail.com> Co-authored-by: Umut Polat <52835619+umut-polat@users.noreply.github.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: wcy <86111164+wcynb1023@users.noreply.github.com> Co-authored-by: Uranus <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: zzt <mf1732009@smail.nju.edu.cn> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com> Co-authored-by: Krishna Chaitanya <krishnabkc15@gmail.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
vllm-project#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
vllm-project#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com>
vllm-project#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com>
vllm-project#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: JisoLya <523420504@qq.com>
Purpose
OpenAIServingChatintoDelegatingParser._append_unstreamed_tool_argsandToolParser.get_remaining_unstreamed_args, where the streaming state naturally livesfinished=Truefrom the serving layer whenoutput.finish_reasonis set, letting the parser emit any remaining arguments in its finalparse_deltacallfinish_reasoncondition by removing theauto_tools_calledvariableTest Plan
E2E test
Unit test