[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE#33715
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE#33715robertgshaw2-redhat merged 8 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive set of tests for the FlashInfer TRTLLM backend with BF16 Mixture-of-Experts (MoE). The additions include a new integration test for the Qwen3-Next model, unit tests for backend selection logic, a test for a weight conversion utility, and a kernel correctness test against a baseline implementation. The tests are well-structured and significantly improve the test coverage for this new feature. My primary concern is the high numerical tolerance used in one of the correctness tests, which might not be strict enough to catch potential precision regressions.
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
76acb41 to
673f517
Compare
pavanimajety
left a comment
There was a problem hiding this comment.
Thanks @Linda-Stadter, the tests look good to me
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Head branch was pushed to by a user without write access
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
062f700 to
cd4f3fc
Compare
| w2_original = w2.clone() | ||
| baseline_output = torch_moe(a, w1_original, w2_original, router_logits, topk) | ||
|
|
||
| close = torch.isclose(trtllm_output, baseline_output, atol=1e-1, rtol=0.85) |
There was a problem hiding this comment.
Aren't these tolerances a bit large for bf16?
There was a problem hiding this comment.
I was also wondering about this. I got these numbers and the way of checking accuracy from the tests in flashinfer
https://github.com/flashinfer-ai/flashinfer/blob/30bf78e1a05d18aaebd484aa6f3c1db8e4c0b1b4/tests/moe/test_trtllm_gen_fused_moe.py#L1398
and
https://github.com/flashinfer-ai/flashinfer/blob/30bf78e1a05d18aaebd484aa6f3c1db8e4c0b1b4/tests/moe/test_trtllm_gen_fused_moe.py#L1675
* Implement zero-copy GQA for multimodal and CPU (#33732)
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai>
* [Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss (#33800)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* [Model] Add transcription support for Qwen3-Omni (#29828)
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
* Revert "[torch.compile] Significantly speed up cold start times" (#33820)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] (#33573)
Signed-off-by: Sage Moore <sagmoore@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Core] Don't schedule spec tokens with prefill chunks (#33652)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* feat: Add ColBERT late interaction model support (#33686)
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CI][torch.compile] Reduce e2e fusion test time (#33293)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192)
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
* [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [release] Minor fixes to release annotation (#33849)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841)
* [Minor] Include `StreamingInput` in inputs package (#33856)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [docs] fix unintentional misspellings (#33863)
Signed-off-by: rinbaro <ilgomishra@gmail.com>
* [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [2/N] move responses/serving _make_response_output_items logic to parser (#33281)
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [CI/Build] Parallelize CPU CI tests (#33778)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [CI/Build] Fix CPU CI test case title (#33870)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Perf] Optimize the performance of structured output + reasoning (#33557)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Refactor] Move `task` outside of `PoolingParams.verify` (#33796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
* Enable Cross layers KV cache layout at NIXL Connector V2 (#33339)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [perf] Integrate flashinfer concat_mla_k (#31171)
* [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Refactor] Clean up input preprocessing (#33687)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix corner case of sparse embedding (#33886)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [Docs] Add bart-plugin to docs (#33905)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix step3p5 parser when using mtp (#33690)
Signed-off-by: mariohong <mariohong128@gmail.com>
* [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
* [BugFix] Fix LoRA Fp8 (#33879)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Spec Decode] Unified Parallel Drafting (#32887)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* [Misc] Add debug logs (#33931)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path (#33795)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
* [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor (#33375)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Models] Consolidate Deepseek-OCR2 processor (#33909)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] Suppress non-TTY color output on the process name part of the log (#29714)
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
* Fix tokenizer test for renamed attr on Transformers v5 (#33902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Misc] Rename `translations` to `speech_to_text` for OAI serving component (#33904)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix DSV3.2 NVFP4 (#33932)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Make MM batching more robust (#33817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Minor] Sort safetensors files to ensure deterministic loading order (#33491)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* Adds padding and perf improvements to wvSplitK_fp8 (#33527)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue (#33832)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
* [Feature] OTEL tracing during loading (#31162)
* [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel (#33568)
* [Docs] Add reo analytics (#33957)
Signed-off-by: simon-mo <simon.mo@hey.com>
* fix(ROCm): Make flash_attn import optional in MLA attention (#33511)
Signed-off-by: rabi <ramishra@redhat.com>
* feat(frontend): early-fail tokenization guard for user requests (#31366)
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Misc] Update code for encoder-decoder models (#33900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CPU] Add BF16 Kernel type for s390x (#33788)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [XPU][4/N] add mxfp4 moe model support (#33679)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [XPU]Replace pip in docker.xpu with uv pip (#31112)
Signed-off-by: sihao.li <sihao.li@intel.com>
* Onboard voyage-4-nano (#33720)
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263)
Signed-off-by: Gassan <gassan.salama@arm.com>
* Fix `main` pre-commit (#33975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* support view_from_cpu_tensor on XPU (#33868)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* Consolidate and fix forbidden import `pre-commit` checks (#33982)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [PaddleOCR-VL] Add BC for transformers 5.0 config (#33976)
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
* Bump HF Hub client to get bug fix (#33984)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CPU][BugFix] Fix loading of w8a8int models with bias (#33582)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) (#33731)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix][Model] Support LoRA on Qwen3 Output Embedding (#29816)
Signed-off-by: kurt <kurt@thinkingmachines.ai>
* [Docs] Improve documentation (#33799)
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Update `WeightTransferConfig` to be more standard like the others (#33989)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix] Fix models and tests for transformers v5 (#33977)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab (#33509)
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
* [ROCm][AITER] Fix AITER import regression for explicit backend selection (#33749)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Docs] Add sections on process architecture and minimum CPU resources (#33940)
It seems users can be confused about vLLM's performance when running
with very small amounts of CPU cores available. We are missing a clear
overview of what vLLM's process architecture is, so I added this along with
some diagrams in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Model] Support MiniCPM-o 4.5 (#33431)
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Co-authored-by: mslv <mslv@baai.ac.cn>
* [Refactor] Consolidate sequence normalization and enc-dec parsing (#33928)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [XPU][5/N] add wna16 xpu kernel (#33973)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Docs] Update link to Benchmark CLI documentation (#33254)
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
* [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 (#33964)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Log] Optimize duplicate startup log (#33944)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [KV Connector] Add missing method overrides to MultiConnector (#33292)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [DOC] [ROCm] Update docker deployment doc (#33971)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] support apply penalty for spec decode (#33251)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
* [Refactor] Remove align block size logic in `moe_permute` (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op (#33734)
Signed-off-by: charlifu <charlifu@amd.com>
* [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) (#33993)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
* [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint (#34010)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
* Fix RoutingMethodType logic (#33919)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* [bugfix] [ROCm] Fix premature CUDA initialization in platform detection (#33941)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Feat][RL] Pause and Resume with keep requests for single engine (#32351)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 (#33967)
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Fix Whisper tokenization (#34011)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [CI][AMD]Bugfix] Check that model_config is not None in enable_norm_pad_fusion (#34007)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [Bugfix] Fix _fused_moe_lora_expand signature mismatch (#33821)
Signed-off-by: Xin Yang <xyangx@amazon.com>
* [Misc] Add backward-compatible import aliases for renamed translations module (#34015)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [ModelRunner V2] Revert token rank comparison difference for now (#34017)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* fix description in plugin_system.md (#33999)
* [Revert] Add util `handle_deprecated` back (#33998)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
* [Misc] Make `PlaceholderRange.get_num_embeds` a method (#34035)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug (#34038)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Fix spelling errors (#33978)
* [Misc] Simplify `get_max_tokens` (#34036)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI][Build] Pin grpcio-tools==1.78.0 (#34048)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Renderer] Define `render_cmpl` and `render_chat` (#34039)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune (#34006)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [torch.compile] Stop compiling identical artifacts (#34003)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 (#33939)
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Frontend]Add support for transcriptions and translations to run_batch (#33934)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Model] Enable Step3p5ForCausalLM testing (#33755)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. (#33660)
Signed-off-by: whx-sjtu <2952154980@qq.com>
* move checks out of `unified_kv_cache_update` custom op (#33943)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* Update DeepGEMM version pin in Dockerfile to match #32479 (#33935)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Make directory exist ok for ray spinning up multiple replicas on a single instance (#33604)
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
* Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Doc] Fix run_batch docs (#34056)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Skip GCS test (#34057)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][Bugfix] fix act_quant_fusion module import error (#34069)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used (#33855)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [ROCm] [CI] Reduce Resource of two test groups (#34059)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Add embedding input functionality for disabled modalities [remake] (#32493)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate (#33771)
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
* [BugFix] Change support no act and mul for marlin (#34088)
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
* [torch.compile] Add an option to force-enable the MOE cold start optimization (#33735)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* glm 4.6 fused tuned inference config for B200 (#32958)
* Add support for ModelOpt MXFP8 dense models (#33786)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Release 2.10] Update to Torch 2.10 - final release (#30525)
* [bug-fix] supported_tasks is breaking backward compatibility at init_app_state (#34027)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Tiny] Rename encoder budget file to more specific name (#34103)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Fix `fastsafetensors` TP all procs using all GPUs (#34070)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
* [Model] GLM adaptation (#34124)
* [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr (#34107)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ASR] Fix audio benchmark and add RTFx metric (#32300)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
* [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
* [XPU][6/N] add xpu scaled_mm kernel (#34117)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [MODEL] Adding Support for Qwen3.5 Models (#34110)
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
* [Misc] Fix up attention benchmarks (#33810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
* [UX] Add `--language-model-only` for hybrid models (#34120)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 (#34031)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* Add NUMA Core binding in nixl_connector for CPU xPyD (#32365)
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Kernel] FlashInfer: switch allreduce fusion to unified API (#33985)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) (#34087)
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [Kernel] use flashinfer for gdn prefill (#32846)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
* [Bugfix] Avoid duplicate k-proj weight emission in helper (#34142)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [Bugfix] Voxtral prompt/audio placeholder alignment (#34140)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [ROCm] update triton branch to support gpt-oss models for gfx11xx devices (#34032)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
* [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. (#33945)
Signed-off-by: charlifu <charlifu@amd.com>
* [ModelRunner V2][BugFix] Fix `max_query_len` calculation (#34167)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Add DCP support to attention backend doc (#33936)
* [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [structured output] validate unsupported json features first (#33233)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
* [LMCache] Token Base IPC API (#34175)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
* [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 (#34198)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection (#34108)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [Doc] Update usage of `--limit-mm-per-prompt` (#34148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Relax `test_mcp_tool_call` (#34204)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix DP Attention Padding in Dummy Run (#34187)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
* [Bugfix] Add `--trust-remote-code` to dataset bench args (#34208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [responsesAPI] fix simpleContext streaming output_messages (#34188)
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 (#34190)
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
* [Frontend][CI] Consolidate instrumentator entrypoints (#34123)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
* [Perf] Optimize detokenizer python logic (#32975)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
* Revert #34208 (#34216)
* [Bugfix] Fix memory inconsistency in cross-process shared memory (#32022)
Signed-off-by: Zetong Li <slippersss@126.com>
* [Bugfix] Fix `--trust-remote-code` conflict (#34218)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Docs] Fix format error in KV load failure recovery doc (#34137)
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
* [Bugfix] Fix FI kernel`chunk_gated_delta_rule` output shape for Qwen3.5 (#34219)
Signed-off-by: Roger Wang <hey@rogerw.io>
* Add flagos in MiniCPM-o (#34126)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
* [Misc] allow specify is_mm_prefix_lm in hf_config (#34215)
* Stop testing for slow tokenizers as they will not exist soon (#34235)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220)
Signed-off-by: KrxGu <krishom70@gmail.com>
* Bump `mamba-ssm` version in CI for Transformers v5 compatibility (#34233)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* add --insecure arg to the vllm bench to skip TLS (#34026)
Signed-off-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
* Support benchmarking of Geospatial models (#33922)
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
* [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations (#29008)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [compile] Enable AOT compile with 2.10 in trunk. (#34155)
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
* [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* [Core][BugFix] Fix PP KV cache sharding memory validation (#33698)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
* [Docs] Speed up build environment set-up (#34240)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] Use pinned memory for write_contents (#34222)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Minor cleanup for Voxtral (#34247)
Signed-off-by: Andy Lo <andy@mistral.ai>
* [UX nit] Fix non-default api_server_count message (#34152)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] Introduce ec_both role EC (encoder cache) connector (#34182)
Signed-off-by: Qi Wang <qiwa@nvidia.com>
* Convert online APIs to use Renderer (#34084)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Bugfix] Fix weights offloading for sleep mode (#32947)
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Benchmarks] Fix attention benchmark smoke test (#34269)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Fix mamba cache dtype for Qwen3.5 (#34200)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [SM100] Resubmit FMHA FP8 prefill for MLA (#31195)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Feature] Warn about unrecognized environment variables (#33581)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [Perf] Move eplb rebalance algo to async thread (#30888)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
* [Misc][Spec Decode] support different load config for draft model (#34022)
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
* [torch.compile] Disable recursive pre_grad_passes (#34092)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [Misc] Add pre-commit hook to catch boolean ops in with-statements (#34271)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [CI] Add pip caching to cleanup_pr_body workflow (#32979)
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
* [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
Signed-off-by: Bill Nell <bnell@redhat.com>
* [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline (#34280)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
* [Misc] Add run one batch script that supports profiling (#32968)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [Bugfix] Fix Worker.load_model context-manager composition for sleep mode (#34021)
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
* [Redo] Add `--trust-remote-code` to dataset bench args (#34251)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter (#34093)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend (#33738)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Misc] Clean up validation logic in input processor (#34144)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
* [Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [XPU][7/N] enable xpu fp8 moe (#34202)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Plugin] Simplify IO Processor Plugin interface (#34236)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 (#34149)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
* Threshold fix wvSplitk for occasional CI fails (#34013)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast (#34298)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
* [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides (#34279)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Bugfix] Fix weight naming in Qwen3.5 (#34313)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [model] support FunASR model (#33247)
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
* [XPU][9/N] clean up existing ipex code/doc (#34111)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [Chore] Move `BaseRenderer` to `base.py` (#34308)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Enable AR+rms fusion by default available for `-O2` (#34299)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* [Misc] Bump `fastsafetensors` version for latest fixes (#34273)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Update Marlin support matrix for Turing (#34319)
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
* [Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer (#34217)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* Patch protobuf for CVE-2026-0994 (#34253)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
* [Docs] Reduce time spent generating API docs (#34255)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* Make Qwen3VL compatible with Transformers v5 (#34262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
* Make JAIS compatible with Transformers v5 (#34264)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
* Responses harmony system message structured (#34268)
Signed-off-by: Adam Binford <adamq43@gmail.com>
* Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* Don't try and run GLM-ASR with remote code (#34352)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache (#33948)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* [ROCm] [aiter] Split KV cache update for AiterFlashAttention (#33681)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
* [Docs] Fix typo ("defult") and double spacing (#34348)
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… (#32458)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [Model Runner V2] Init cuda graph pool when necessary (#33217)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` (#34330)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] fix default is_neox_style is True for deepseek (#34353)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
* [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) (#34243)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
* [ROCm] [CI] fix test_unrecognized_env (#34350)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* [GPT-OSS] Remove unnecessary contiguous (#34337)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* Add cartridge (prefix) benchmark configs to CI workflows
The prefix_latency and prefix_throughput configs existed but weren't
being run by any workflow. Each benchmark workflow now runs both the
base and cartridge configs using the shared server support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* update flashinfer
* update wheel
* update cuda and flashinfer
* downgrade
* update tests
---------
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Sage Moore <sagmoore@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: rinbaro <ilgomishra@gmail.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: mariohong <mariohong128@gmail.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: rabi <ramishra@redhat.com>
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Signed-off-by: Gassan <gassan.salama@arm.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Signed-off-by: kurt <kurt@thinkingmachines.ai>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Signed-off-by: ihb2032 <hebome@foxmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Artus KG <artuskg@gmail.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Zetong Li <slippersss@126.com>
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Signed-off-by: KrxGu <krishom70@gmail.com>
Signed-off-by: Fan Yang <yan9fan@meta.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Qi Wang <qiwa@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Taeksang Kim <voidbag@gmail.com>
Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: Muhammad Hashmi <105992724+mu-hashmi@users.noreply.github.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Sage Moore <sagmoore@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Ilya Boytsov <boytsovpanamera@mail.ru>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: rinbaro <ilgomishra@gmail.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Mario Hong <86880754+mariohong128@users.noreply.github.com>
Co-authored-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: zackyoray <yorayz@nvidia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Rabi Mishra <ramishra@redhat.com>
Co-authored-by: Mingliang Li <limingliang0527@gmail.com>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: R3hankhan <Rehan.Khan7@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gassan Salama <gassan.salama@arm.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Kurt Shuster <shuster.kurt@gmail.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com>
Co-authored-by: mslv <mslv@baai.ac.cn>
Co-authored-by: zofia <110436990+zufangzhu@users.noreply.github.com>
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Charlie Fu <charlifu@amd.com>
Co-authored-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: 果冻虾仁 <guodong@apache.org>
Co-authored-by: Vel <110626982+Code4me2@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: Jiang Wu <jwu@cclgroup.com>
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: aabbccddwasd <140953076+aabbccddwasd@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: navmarri14 <nmarri@roblox.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: ihb2032 <40718643+ihb2032@users.noreply.github.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuwei An <ayw.sirius19@gmail.com>
Co-authored-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Zetong Li <48438720+slippersss@users.noreply.github.com>
Co-authored-by: zzaebok <44357534+zzaebok@users.noreply.github.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Phúc H. Lê Khắc <lkhphuc@pm.me>
Co-authored-by: Krish Gupta <krishom70@gmail.com>
Co-authored-by: Fan Yang <fanyang.real@gmail.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: mgazz <michele.gazzetti1@ibm.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Zhengxu Chen <zhxchen17@meta.com>
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Qi Wang <wqstu1@gmail.com>
Co-authored-by: J Seppänen <83203+jseppanen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Zhengkai Zhang <33679250+ZhengkaiZ@users.noreply.github.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: 7. Sun <jhao.sun@gmail.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: AllenDou <allen.dou@hotmail.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Tianqi Ren <tianqi.r@outlook.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com>
Co-authored-by: Xinyu Dong <dongxinyu03@baidu.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [release] Minor fixes to release annotation (#33849)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841)
* [Minor] Include `StreamingInput` in inputs package (#33856)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [docs] fix unintentional misspellings (#33863)
Signed-off-by: rinbaro <ilgomishra@gmail.com>
* [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [2/N] move responses/serving _make_response_output_items logic to parser (#33281)
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [CI/Build] Parallelize CPU CI tests (#33778)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [CI/Build] Fix CPU CI test case title (#33870)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Perf] Optimize the performance of structured output + reasoning (#33557)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Refactor] Move `task` outside of `PoolingParams.verify` (#33796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
* Enable Cross layers KV cache layout at NIXL Connector V2 (#33339)
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [perf] Integrate flashinfer concat_mla_k (#31171)
* [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Refactor] Clean up input preprocessing (#33687)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix corner case of sparse embedding (#33886)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [Docs] Add bart-plugin to docs (#33905)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix step3p5 parser when using mtp (#33690)
Signed-off-by: mariohong <mariohong128@gmail.com>
* [Feat][RL][1/2] Native Weight Syncing API: NCCL (#31943)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
* [BugFix] Fix LoRA Fp8 (#33879)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Spec Decode] Unified Parallel Drafting (#32887)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* [Misc] Add debug logs (#33931)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path (#33795)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
* [Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor (#33375)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Models] Consolidate Deepseek-OCR2 processor (#33909)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] Suppress non-TTY color output on the process name part of the log (#29714)
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
* Fix tokenizer test for renamed attr on Transformers v5 (#33902)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Misc] Rename `translations` to `speech_to_text` for OAI serving component (#33904)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Bugfix] Fix DSV3.2 NVFP4 (#33932)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Make MM batching more robust (#33817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Minor] Sort safetensors files to ensure deterministic loading order (#33491)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* Adds padding and perf improvements to wvSplitK_fp8 (#33527)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue (#33832)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
* [Feature] OTEL tracing during loading (#31162)
* [Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel (#33568)
* [Docs] Add reo analytics (#33957)
Signed-off-by: simon-mo <simon.mo@hey.com>
* fix(ROCm): Make flash_attn import optional in MLA attention (#33511)
Signed-off-by: rabi <ramishra@redhat.com>
* feat(frontend): early-fail tokenization guard for user requests (#31366)
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Misc] Update code for encoder-decoder models (#33900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CPU] Add BF16 Kernel type for s390x (#33788)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [XPU][4/N] add mxfp4 moe model support (#33679)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [XPU]Replace pip in docker.xpu with uv pip (#31112)
Signed-off-by: sihao.li <sihao.li@intel.com>
* Onboard voyage-4-nano (#33720)
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation (#32263)
Signed-off-by: Gassan <gassan.salama@arm.com>
* Fix `main` pre-commit (#33975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* support view_from_cpu_tensor on XPU (#33868)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* Consolidate and fix forbidden import `pre-commit` checks (#33982)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [PaddleOCR-VL] Add BC for transformers 5.0 config (#33976)
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
* Bump HF Hub client to get bug fix (#33984)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CPU][BugFix] Fix loading of w8a8int models with bias (#33582)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) (#33731)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix][Model] Support LoRA on Qwen3 Output Embedding (#29816)
Signed-off-by: kurt <kurt@thinkingmachines.ai>
* [Docs] Improve documentation (#33799)
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Update `WeightTransferConfig` to be more standard like the others (#33989)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix] Fix models and tests for transformers v5 (#33977)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab (#33509)
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
* [ROCm][AITER] Fix AITER import regression for explicit backend selection (#33749)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Docs] Add sections on process architecture and minimum CPU resources (#33940)
It seems users can be confused about vLLM's performance when running
with very small amounts of CPU cores available. We are missing a clear
overview of what vLLM's process architecture is, so I added this along with
some diagrams in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Model] Support MiniCPM-o 4.5 (#33431)
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Co-authored-by: mslv <mslv@baai.ac.cn>
* [Refactor] Consolidate sequence normalization and enc-dec parsing (#33928)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [XPU][5/N] add wna16 xpu kernel (#33973)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Docs] Update link to Benchmark CLI documentation (#33254)
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
* [Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 (#33964)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Log] Optimize duplicate startup log (#33944)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [KV Connector] Add missing method overrides to MultiConnector (#33292)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
* [DOC] [ROCm] Update docker deployment doc (#33971)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] support apply penalty for spec decode (#33251)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
* [Refactor] Remove align block size logic in `moe_permute` (#33449)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op (#33734)
Signed-off-by: charlifu <charlifu@amd.com>
* [Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) (#33993)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
* [Fix] Fix `logprobs=0` handling for `/inference/v1/generate` endpoint (#34010)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
* Fix RoutingMethodType logic (#33919)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
* [bugfix] [ROCm] Fix premature CUDA initialization in platform detection (#33941)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
* [Feat][RL] Pause and Resume with keep requests for single engine (#32351)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 (#33967)
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [Bugfix] Fix Whisper tokenization (#34011)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [CI][AMD]Bugfix] Check that model_config is not None in enable_norm_pad_fusion (#34007)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
* [Bugfix] Fix _fused_moe_lora_expand signature mismatch (#33821)
Signed-off-by: Xin Yang <xyangx@amazon.com>
* [Misc] Add backward-compatible import aliases for renamed translations module (#34015)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [ModelRunner V2] Revert token rank comparison difference for now (#34017)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* fix description in plugin_system.md (#33999)
* [Revert] Add util `handle_deprecated` back (#33998)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support (#33517)
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
* [Misc] Make `PlaceholderRange.get_num_embeds` a method (#34035)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug (#34038)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* Fix spelling errors (#33978)
* [Misc] Simplify `get_max_tokens` (#34036)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI][Build] Pin grpcio-tools==1.78.0 (#34048)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Renderer] Define `render_cmpl` and `render_chat` (#34039)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune (#34006)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
* [torch.compile] Stop compiling identical artifacts (#34003)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 (#33939)
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Frontend]Add support for transcriptions and translations to run_batch (#33934)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Model] Enable Step3p5ForCausalLM testing (#33755)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [PluggableLayer][3/N] Apply PluggableLayer to mamba layers. (#33660)
Signed-off-by: whx-sjtu <2952154980@qq.com>
* move checks out of `unified_kv_cache_update` custom op (#33943)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* Update DeepGEMM version pin in Dockerfile to match #32479 (#33935)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* Make directory exist ok for ray spinning up multiple replicas on a single instance (#33604)
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
* Perf tuning and expansion of cases covered for wvSplitKrc (#33493)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [Doc] Fix run_batch docs (#34056)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Skip GCS test (#34057)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [ROCm][Bugfix] fix act_quant_fusion module import error (#34069)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used (#33855)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [ROCm] [CI] Reduce Resource of two test groups (#34059)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Add embedding input functionality for disabled modalities [remake] (#32493)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate (#33771)
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
* [BugFix] Change support no act and mul for marlin (#34088)
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
* [torch.compile] Add an option to force-enable the MOE cold start optimization (#33735)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* glm 4.6 fused tuned inference config for B200 (#32958)
* Add support for ModelOpt MXFP8 dense models (#33786)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
* [Release 2.10] Update to Torch 2.10 - final release (#30525)
* [bug-fix] supported_tasks is breaking backward compatibility at init_app_state (#34027)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [Tiny] Rename encoder budget file to more specific name (#34103)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Fix `fastsafetensors` TP all procs using all GPUs (#34070)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* fix(cpu): fix mla_decode compilation on x86 without AVX512 (#34052)
Signed-off-by: ihb2032 <hebome@foxmail.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
* [Model] GLM adaptation (#34124)
* [CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr (#34107)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ASR] Fix audio benchmark and add RTFx metric (#32300)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
* [Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul (#33901)
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
* [XPU][6/N] add xpu scaled_mm kernel (#34117)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [MODEL] Adding Support for Qwen3.5 Models (#34110)
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
* [Misc] Fix up attention benchmarks (#33810)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
* [UX] Add `--language-model-only` for hybrid models (#34120)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 (#34031)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* Add NUMA Core binding in nixl_connector for CPU xPyD (#32365)
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Kernel] FlashInfer: switch allreduce fusion to unified API (#33985)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) (#34087)
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [Kernel] use flashinfer for gdn prefill (#32846)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
* [Bugfix] Avoid duplicate k-proj weight emission in helper (#34142)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [Bugfix] Voxtral prompt/audio placeholder alignment (#34140)
Signed-off-by: Artus KG <artuskg@gmail.com>
* [ROCm] update triton branch to support gpt-oss models for gfx11xx devices (#34032)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
* [torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. (#33945)
Signed-off-by: charlifu <charlifu@amd.com>
* [ModelRunner V2][BugFix] Fix `max_query_len` calculation (#34167)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Add DCP support to attention backend doc (#33936)
* [Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available (#34153)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [structured output] validate unsupported json features first (#33233)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
* [LMCache] Token Base IPC API (#34175)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
* [Bugfix] Adopt `ChunkGatedDeltaRule` for Qwen3.5 (#34198)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection (#34108)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [Doc] Update usage of `--limit-mm-per-prompt` (#34148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [CI/Build] Relax `test_mcp_tool_call` (#34204)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix DP Attention Padding in Dummy Run (#34187)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
* [Bugfix] Add `--trust-remote-code` to dataset bench args (#34208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [responsesAPI] fix simpleContext streaming output_messages (#34188)
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
* [Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 (#34190)
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
* [Frontend][CI] Consolidate instrumentator entrypoints (#34123)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
* [Perf] Optimize detokenizer python logic (#32975)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
* Revert #34208 (#34216)
* [Bugfix] Fix memory inconsistency in cross-process shared memory (#32022)
Signed-off-by: Zetong Li <slippersss@126.com>
* [Bugfix] Fix `--trust-remote-code` conflict (#34218)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Docs] Fix format error in KV load failure recovery doc (#34137)
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
* [Bugfix] Fix FI kernel`chunk_gated_delta_rule` output shape for Qwen3.5 (#34219)
Signed-off-by: Roger Wang <hey@rogerw.io>
* Add flagos in MiniCPM-o (#34126)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
* [Misc] allow specify is_mm_prefix_lm in hf_config (#34215)
* Stop testing for slow tokenizers as they will not exist soon (#34235)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220)
Signed-off-by: KrxGu <krishom70@gmail.com>
* Bump `mamba-ssm` version in CI for Transformers v5 compatibility (#34233)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* add --insecure arg to the vllm bench to skip TLS (#34026)
Signed-off-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
* Support benchmarking of Geospatial models (#33922)
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
* [ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations (#29008)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [compile] Enable AOT compile with 2.10 in trunk. (#34155)
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
* [Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* [Core][BugFix] Fix PP KV cache sharding memory validation (#33698)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [BUGFIX] Fix accuracy bugs in Qwen3-Next MTP (#34077)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
* [Docs] Speed up build environment set-up (#34240)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Model Runner V2] Use pinned memory for write_contents (#34222)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Minor cleanup for Voxtral (#34247)
Signed-off-by: Andy Lo <andy@mistral.ai>
* [UX nit] Fix non-default api_server_count message (#34152)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] Introduce ec_both role EC (encoder cache) connector (#34182)
Signed-off-by: Qi Wang <qiwa@nvidia.com>
* Convert online APIs to use Renderer (#34084)
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
* [Bugfix] Fix weights offloading for sleep mode (#32947)
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Benchmarks] Fix attention benchmark smoke test (#34269)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix] Fix mamba cache dtype for Qwen3.5 (#34200)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [SM100] Resubmit FMHA FP8 prefill for MLA (#31195)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
* [Feature] Warn about unrecognized environment variables (#33581)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
* [Perf] Move eplb rebalance algo to async thread (#30888)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* [BugFix] Fix async EPLB hang with DeepEP LL all2all backend (#32860)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
* [Misc][Spec Decode] support different load config for draft model (#34022)
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
* [torch.compile] Disable recursive pre_grad_passes (#34092)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [Misc] Add pre-commit hook to catch boolean ops in with-statements (#34271)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [CI] Add pip caching to cleanup_pr_body workflow (#32979)
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
* [MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner (#32344)
Signed-off-by: Bill Nell <bnell@redhat.com>
* [ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline (#34280)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
* [Misc] Add run one batch script that supports profiling (#32968)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [Bugfix] Fix Worker.load_model context-manager composition for sleep mode (#34021)
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
* [Redo] Add `--trust-remote-code` to dataset bench args (#34251)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter (#34093)
Signed-off-by: Richard Zou <zou3519@gmail.com>
* [WideEP] Fix nvfp4 DeepEP High Throughput All2All backend (#33738)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [Misc] Clean up validation logic in input processor (#34144)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast (#33884)
Signed-off-by: Kebe <mail@kebe7jun.com>
* [Kernel] Apply 256bit LDG/STG To Activation Kernels (#33022)
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [XPU][7/N] enable xpu fp8 moe (#34202)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
* [Plugin] Simplify IO Processor Plugin interface (#34236)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 (#34149)
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
* Threshold fix wvSplitk for occasional CI fails (#34013)
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
* [ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast (#34298)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
* [Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides (#34279)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Bugfix] Fix weight naming in Qwen3.5 (#34313)
Signed-off-by: Roger Wang <hey@rogerw.io>
* [CPU] Enable FP16 (Half dtype) support for s390x (#34116)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
* [model] support FunASR model (#33247)
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
* [XPU][9/N] clean up existing ipex code/doc (#34111)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [Chore] Move `BaseRenderer` to `base.py` (#34308)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [torch.compile] Enable AR+rms fusion by default available for `-O2` (#34299)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
* [Misc] Bump `fastsafetensors` version for latest fixes (#34273)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Doc] Update Marlin support matrix for Turing (#34319)
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
* [Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer (#34217)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* Patch protobuf for CVE-2026-0994 (#34253)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
* [Docs] Reduce time spent generating API docs (#34255)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix][CPU] Fix llama4 inference on CPU (#34321)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* Make Qwen3VL compatible with Transformers v5 (#34262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
* Make JAIS compatible with Transformers v5 (#34264)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE (#33715)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
* Responses harmony system message structured (#34268)
Signed-off-by: Adam Binford <adamq43@gmail.com>
* Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* Don't try and run GLM-ASR with remote code (#34352)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache (#33948)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
* [ROCm] [aiter] Split KV cache update for AiterFlashAttention (#33681)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
* [Docs] Fix typo ("defult") and double spacing (#34348)
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… (#32458)
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
* [Model Runner V2] Init cuda graph pool when necessary (#33217)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* [Multimodal] Expose `mm_processor_kwargs` for `DummyInputsBuilder` (#34330)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] fix default is_neox_style is True for deepseek (#34353)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
* [Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) (#34243)
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
* [ROCm] [CI] fix test_unrecognized_env (#34350)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* [GPT-OSS] Remove unnecessary contiguous (#34337)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
* Add cartridge (prefix) benchmark configs to CI workflows
The prefix_latency and prefix_throughput configs existed but weren't
being run by any workflow. Each benchmark workflow now runs both the
base and cartridge configs using the shared server support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* update flashinfer
* update wheel
* update cuda and flashinfer
* downgrade
* update tests
* experimental: implement pipelining
* add pipeline test
* configure PR to actually run
* bugfix
* loosen TPOT threshold for catridge latency
* improve pipelining
* simplify pipelining impl
---------
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: rinbaro <ilgomishra@gmail.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: mariohong <mariohong128@gmail.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: rabi <ramishra@redhat.com>
Signed-off-by: limingliang <limingliang@stepfun.com>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com>
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Signed-off-by: Gassan <gassan.salama@arm.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: ProExpertProg <luka.govedic@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: kurt <kurt@thinkingmachines.ai>
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
Signed-off-by: caitianchi <caitianchi@modelbest.cn>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Ikenna <ikennachifo@gmail.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com>
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Jiang Wu <jwu@cclgroup.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com>
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Signed-off-by: ihb2032 <hebome@foxmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com>
Signed-off-by: JJJYmmm <1650675829@qq.com>
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com>
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Artus KG <artuskg@gmail.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Zetong Li <slippersss@126.com>
Signed-off-by: Jaebok Lee <jaebok9541@naver.com>
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Signed-off-by: KrxGu <krishom70@gmail.com>
Signed-off-by: Fan Yang <yan9fan@meta.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Qi Wang <qiwa@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Signed-off-by: 7. Sun <jhao.sun@gmail.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: rinbaro <ilgomishra@gmail.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Andrew Xia <axia@meta.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Mario Hong <86880754+mariohong128@users.noreply.github.com>
Co-authored-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: zackyoray <yorayz@nvidia.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: emricksini-h <emrick.birivoutin@hcompany.ai>
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Rabi Mishra <ramishra@redhat.com>
Co-authored-by: Mingliang Li <limingliang0527@gmail.com>
Co-authored-by: limingliang <limingliang@stepfun.com>
Co-authored-by: R3hankhan <Rehan.Khan7@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Gassan Salama <gassan.salama@arm.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Kurt Shuster <shuster.kurt@gmail.com>
Co-authored-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: FredericOdermatt <50372080+FredericOdermatt@users.noreply.github.com>
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com>
Co-authored-by: mslv <mslv@baai.ac.cn>
Co-authored-by: zofia <110436990+zufangzhu@users.noreply.github.com>
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Charlie Fu <charlifu@amd.com>
Co-authored-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: Ikenna <ikennachifo@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: 果冻虾仁 <guodong@apache.org>
Co-authored-by: Vel <110626982+Code4me2@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: TundeAtSN <akintunde.oladipo@servicenow.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: zifeitong <zifeitong@gmail.com>
Co-authored-by: Jiang Wu <jwu@cclgroup.com>
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com”>
Co-authored-by: aabbccddwasd <140953076+aabbccddwasd@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: navmarri14 <nmarri@roblox.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: ihb2032 <40718643+ihb2032@users.noreply.github.com>
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com>
Co-authored-by: wulipc <wulipc@users.noreply.github.com>
Co-authored-by: ywang96 <ywang96@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <hongming.zheng@intel.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Artus Krohn-Grimberghe <artuskg@users.noreply.github.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuwei An <ayw.sirius19@gmail.com>
Co-authored-by: Balaxxe <136368465+jaim12005@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Zetong Li <48438720+slippersss@users.noreply.github.com>
Co-authored-by: zzaebok <44357534+zzaebok@users.noreply.github.com>
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com>
Co-authored-by: Phúc H. Lê Khắc <lkhphuc@pm.me>
Co-authored-by: Krish Gupta <krishom70@gmail.com>
Co-authored-by: Fan Yang <fanyang.real@gmail.com>
Co-authored-by: Fan Yang <yan9fan@meta.com>
Co-authored-by: mgazz <michele.gazzetti1@ibm.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Zhengxu Chen <zhxchen17@meta.com>
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: junuxyz <216036880+junuxyz@users.noreply.github.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Qi Wang <wqstu1@gmail.com>
Co-authored-by: J Seppänen <83203+jseppanen@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Zhengkai Zhang <33679250+ZhengkaiZ@users.noreply.github.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: 7. Sun <jhao.sun@gmail.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com>
Co-authored-by: Matthias Gehre <matthias.gehre@amd.com>
Co-authored-by: AllenDou <allen.dou@hotmail.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Tianqi Ren <tianqi.r@outlook.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com>
Co-authored-by: Xinyu Dong <dongxinyu03@baidu.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Purpose
Adding tests for trtllm bf16 moe backend added in PR [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #32954
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.