Closed
Changes from all commits
Commits
293 commits
e4d4f14
Add @benchislett to codeowner for spec decode and structured outputs …
benchislett Sep 6, 2025
1aa47a2
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. (#24…
mohankku Sep 6, 2025
cf41ffb
[Bugfix] Fix broken deepseek fp8 TP weights loading (#24367)
Isotr0py Sep 6, 2025
20b9ebc
[Bugfix] Fix test_mixtral_moe (#24371)
jeejeelee Sep 6, 2025
8f8a223
Lora bias(enable_lora_bias) deprecate warning (#24339)
ashwin-phadke Sep 6, 2025
8e8661c
[Fix] [gpt-oss] fix non-tool calling path for chat completion (#24324)
aarnphm Sep 6, 2025
5c672b7
[Frontend][Responses API] Support reporting tool output tokens and fi…
yeqcharlotte Sep 6, 2025
eaa48c1
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test (#24370)
elvischenv Sep 6, 2025
4ef2101
break execute_model in gpu_model_runner into sub-functions for custom…
bangshengtang Sep 6, 2025
15ba556
[V0 deprecation] Deprecate V0 Neuron backend (#21159)
WoosukKwon Sep 6, 2025
a80d287
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode (#24…
youkaichao Sep 7, 2025
7ff6370
Migrate Qwen2 inputs to TensorSchema (#23475)
bbeckca Sep 7, 2025
8cac128
[CI][Fix] deterministic seed for flaky CI runs on structured outputs …
aarnphm Sep 7, 2025
5123beb
[Benchmark] add benchmark for custom activation op (#23908)
ZJY0516 Sep 7, 2025
d93f13b
QWEN3 Thinking Fused MoE kernels Optimization configs (#24330)
samanamp Sep 7, 2025
9e8af02
[Misc] collect flashinfer version in collect_env.py (#24378)
yeqcharlotte Sep 7, 2025
ebcecac
[Bugfix] Fix Qwen3-coder moe tuned config (#24072)
jeejeelee Sep 7, 2025
b2a60ed
[TPU] Remove TopKTopPSampler dependency for TPU sampler (#24391)
WoosukKwon Sep 7, 2025
9f1ae28
Add renderer-based prompt processing for embedding and classification…
sfeng33 Sep 7, 2025
7329603
Skip MM Encoder for non-first PP ranks (#24387)
WoosukKwon Sep 7, 2025
be805b8
Add @luccafong to codeowner for spec decode (#24397)
luccafong Sep 8, 2025
57f13f5
[Kernel] Support decode context parallelism on Blackwell with CUTLASS…
minosfuture Sep 8, 2025
bb8c492
[xpu] upgrade ipex/python3.12 for xpu (#23830)
yma11 Sep 8, 2025
598f01c
[Sampler] Support returning all prompt logprobs (#23868)
charlotte12l Sep 8, 2025
47be785
[CI/Build] Disable flaky test_structured_output tests (#24404)
22quinn Sep 8, 2025
e88dc47
[CI/Build] Fix local image inputs in test_pixtral.py (#24401)
huachenheli Sep 8, 2025
5d47040
[Doc] Fix UTF-8 encoding issues in documentation generation on Window…
alhridoy Sep 8, 2025
7fd342d
[P/D] Add a shutdown method to the Connector API (#22699)
chaunceyjiang Sep 8, 2025
e8b362d
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video prep…
what-in-the-nim Sep 8, 2025
a97ee6d
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video prepr…
what-in-the-nim Sep 8, 2025
dfd2f48
[gpt-oss][Responses API] Fix the function call id format (#24409)
chaunceyjiang Sep 8, 2025
d6d568e
[Docs] Fix a tip indentation and typo (#24419)
windsonsea Sep 8, 2025
a47a61b
[Doc]: fix typos in Python comments (#24417)
didier-durand Sep 8, 2025
e1ca633
[Doc] Fix issues in integrations/llamastack.md (#24428)
windsonsea Sep 8, 2025
a22d860
[Bugfix] Fix get_quant_config when using modelscope (#24421)
Potabk Sep 8, 2025
f4acf25
[Bugfix] Fix mamba2 prefill chunking (#23279)
tomeras91 Sep 8, 2025
8acb5c4
[Misc] Terratorch related fixes (#24337)
christian-pinto Sep 8, 2025
2312f9b
Move `KVEventsConfig` from `config/__init__.py` to `config/kv_events.…
hmellor Sep 8, 2025
221c598
[Frontend] User-provided uuids for medias in chat. (RFC #22044) (#23449)
huachenheli Sep 8, 2025
6db7ffc
[Docs] Move feature compatibility tables to README (#24431)
hmellor Sep 8, 2025
23973de
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's …
didier-durand Sep 8, 2025
f4121db
[Docs]add eplb_config param use docs (#24213)
lengrongfu Sep 8, 2025
244f937
[Model] Enable BNB support for qwen2_5_omni_thinker (#24420)
jeejeelee Sep 8, 2025
815951b
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking (#23…
ekagra-ranjan Sep 8, 2025
e96319c
[Spec Decode][Benchmark] Add Blitzedit dataset (#23605)
ekagra-ranjan Sep 8, 2025
a9ec5ed
[Model] Remove quantized mixtral (#24437)
jeejeelee Sep 8, 2025
4bca93d
[CI] Enable encoder model compilation test (#24442)
ZJY0516 Sep 8, 2025
3d9eb95
[Model loader]: support multi-thread model weight loading (#23928)
BraveY Sep 8, 2025
ff4b305
[Spec Decode] Fix offline spec_decode.py (#24257)
ekagra-ranjan Sep 8, 2025
1b2ea7a
[Attention] FlashAttention MLA cudagraph support (#23958)
MatthewBonanni Sep 8, 2025
5e67769
[Bugfix] Disable the statslogger if the api_server_count is greater t…
chaunceyjiang Sep 8, 2025
33b312c
[Hardware][IBM Z] Fix Outlines Core issue for s390x (#24034)
R3hankhan123 Sep 8, 2025
c91217c
[CI] Add nightly multiarch manifests to dockerhub (#24102)
csahithi Sep 9, 2025
b2cda8b
Update reviewers for modelopt related files (#24468)
Edwardf0t1 Sep 9, 2025
3269ff8
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and …
tlrmchlsmth Sep 9, 2025
57b1c87
[gpt-oss] Harmony changes with container tool support (#23386)
morgendave Sep 9, 2025
e829601
Bump actions/setup-python from 5.4.0 to 6.0.0 (#24414)
dependabot[bot] Sep 9, 2025
871576f
[doc] update `vllm serve` cli args documentation (#24329)
cjackal Sep 9, 2025
4c7f729
Bump actions/stale from 9.1.0 to 10.0.0 (#24412)
dependabot[bot] Sep 9, 2025
0b123d1
Bump actions/github-script from 7.0.1 to 8.0.0 (#24413)
dependabot[bot] Sep 9, 2025
e46c334
Move `KVTransferConfig` from `config/__init__.py` to `config/kv_trans…
hmellor Sep 9, 2025
8ef6048
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs (#24074)
CSWYF3634076 Sep 9, 2025
11747e3
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attentio…
elvischenv Sep 9, 2025
1db44bb
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead (#…
linzebing Sep 9, 2025
bb6c4c0
Add data_parallel_size to VllmConfig string representation (#24298)
Prowindy Sep 9, 2025
b79c925
[Bugfix] Fix Apertus HF repo name (#24447)
DarkLight1337 Sep 9, 2025
9f5c730
[Misc] Improve Worker process title and logging prefix (#22205)
22quinn Sep 9, 2025
01829e1
[Doc] mention fpdb for multiprocess breakpoints (#24452)
mickaelseznec Sep 9, 2025
c1522cd
[Misc] Support bench serve long context (#24373)
minosfuture Sep 9, 2025
ba81b6d
[Doc]: fixing typos to improve docs (#24480)
didier-durand Sep 9, 2025
72a1c89
[Performance][MM] Building the inverse permutation in O(n) time in Qw…
david6666666 Sep 9, 2025
c3479ee
[Misc] Add claude settings to gitignore (#24492)
yeqcharlotte Sep 9, 2025
1154475
[Misc] Add Codex settings to gitignore (#24493)
ywang96 Sep 9, 2025
b0abe91
[gpt-oss] Validate gpt-oss python tool during initialization (#23856)
heheda12345 Sep 9, 2025
06bbe3a
[RL] fast weight update with zmq + ipc handles (#24295)
weixiao-huang Sep 9, 2025
a35ecf7
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throu…
yeqcharlotte Sep 9, 2025
cea2046
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT (#24123)
yewentao256 Sep 9, 2025
639b0c7
[Model] Systematic support for fp32 head, pooling models part (#23810)
noooop Sep 9, 2025
2d2c71e
[Bugfix] Handle the edge case in detokenizer where processed tokens c…
dtransposed Sep 9, 2025
06d1fe7
[Core] Run garbage collector after CUDA graph capture to fix throughp…
micah-wil Sep 9, 2025
73d8ebc
[Kernels] Add Flash Linear Attention Kernels (#24518)
youkaichao Sep 9, 2025
a3abf62
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork (#24279)
gshtras Sep 9, 2025
9f05773
[Bugfix] Fix hidden_size for multimodal classification model (#24501)
jeejeelee Sep 9, 2025
52fcf72
Extend renderer with embedding support and integrate completion endpo…
sfeng33 Sep 9, 2025
a832de5
[Misc] bump outlines_core to fix the version conflicts with outlines …
serihiro Sep 9, 2025
8d5b330
[Docs] Gemma3n `transcriptions` endpoint support (#24512)
NickLucche Sep 9, 2025
a48b56a
[TPU] Fix tpu structured decoding in mixed batches (#24458)
Chenyaaang Sep 9, 2025
d4932d0
[CI] execute all piecewise compilation tests together (#24502)
ZJY0516 Sep 9, 2025
9510fcf
[Feature] Disallow FlashMLA on Blackwell (#24521)
yewentao256 Sep 9, 2025
9562d3a
[Log] Use a relative path in debug-level logs to distinguish files wi…
ZJY0516 Sep 9, 2025
441d8bd
[Benchmark] Update bench doc with mtbench, blazedit, spec bench (#24450)
ekagra-ranjan Sep 9, 2025
9841120
[Benchmark] Add option to skip oversampling in benchmark (#24457)
ekagra-ranjan Sep 9, 2025
30a053e
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph o…
charlifu Sep 9, 2025
a1a68da
[Bugfix] Improve EPLB config validation error message (#24524)
tlrmchlsmth Sep 10, 2025
01fd594
[Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. (#24…
bnellnm Sep 10, 2025
0fe3457
[Perf] Convert np array to torch tensor to index into block table for…
sarckk Sep 10, 2025
6b99406
Add @heheda12345 to CODEOWNERS of KVCacheManager related code (#24546)
heheda12345 Sep 10, 2025
2d93385
[CI] Retry flaky fp8 cutlass mla tests (#24536)
njhill Sep 10, 2025
b4a6e54
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and…
ignaciosica Sep 10, 2025
3f51615
[BugFix] Fix async core engine client finalizer (#24540)
njhill Sep 10, 2025
94cd4d9
[CI] Adjust threshold for flaky ngram spec decoding test (#24528)
njhill Sep 10, 2025
983f717
[KV Connector] More async support for `get_num_new_matched_tokens` (#…
ApostaC Sep 10, 2025
c2ef645
[P/D] MultiConnector supports shutdown (#24425)
chaunceyjiang Sep 10, 2025
a266dff
[BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-…
wwl2755 Sep 10, 2025
39c79f0
[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading (#…
frank-wei Sep 10, 2025
de8b109
[Core] Simplify and unify mm uuid handling & auto-generated mm hash o…
huachenheli Sep 10, 2025
6f9a004
[Bugfix] Update Run:AI Model Streamer Loading Integration (#23845)
pwschuurman Sep 10, 2025
574bea1
[Docs] Enable relative links in examples to function when rendered in…
hmellor Sep 10, 2025
e85f50a
[docs] promo pytorch conf and ray summit (#24562)
simon-mo Sep 10, 2025
7f63bea
[Bugfix] Guard `_may_reorder_batch` for encoder-only models on CPU (#…
comsky Sep 10, 2025
dac51aa
Consolidate rendering parameters into RenderConfig dataclass (#24543)
sfeng33 Sep 10, 2025
457a288
[Model] Limit CPU threads for image transformations in InternVL to re…
li-jinpeng Sep 10, 2025
cc87b8e
[Attention] add DCP support for FLASH_ATTN_MLA backend (#24453)
LucasWilkinson Sep 10, 2025
e1f48ce
[ROCm][Bugfix] Fix Aiter RMSNorm (#23412)
vllmellm Sep 10, 2025
5715439
[Docs] Improve organisation of API Reference nav (#24569)
hmellor Sep 10, 2025
6fe7b8c
[Docs] Document the extra memory footprint overhead when using EPLB (…
tlrmchlsmth Sep 10, 2025
b0c16df
Support for NemotronH Nano VLM (#23644)
danielafrimi Sep 10, 2025
4020198
Feature/vit attention unification# 23880 (#23978)
baonudesifeizhai Sep 10, 2025
3a189f0
[LoRA]: Add LoRA support to Mistral's Voxtral models (#24517)
pratapyash Sep 10, 2025
75d4b60
Move `LoadConfig` from `config/__init__.py` to `config/load.py` (#24566)
hmellor Sep 10, 2025
7ca0cc2
[BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo (#24559)
wwl2755 Sep 10, 2025
69ffdb6
[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat (#24549)
lacora Sep 10, 2025
75108fc
[BugFix] Ensure integrity of reused CPU tensors during async scheduli…
njhill Sep 10, 2025
6fad99d
[CI/Build] split true unit tests to Entrypoints Unit Tests (#24418)
yeqcharlotte Sep 10, 2025
33fa45a
[rocm] enable torchao quantization for rocm (#24400)
draftbk Sep 10, 2025
e826f1e
[CI] Add PPL test for generation models (#24485)
noooop Sep 10, 2025
2e0563e
[CI/Build] bump timm dependency (#24189)
dtrifiro Sep 10, 2025
2a1a0c2
fix some typos (#24167)
co63oc Sep 10, 2025
ec574e4
Fix Auto_Round Quatization Loading on SM75 and Lower GPUs (#24217)
RoadToNowhereX Sep 10, 2025
fbc6c7e
[Docs] Fix warnings in `mkdocs build` (continued) (#24092)
Zerohertz Sep 10, 2025
02bbdba
[BugFix] `python collect_env.py` and `vllm collect-env` compatibility…
yankay Sep 10, 2025
c60c72f
[Platform] Custom ops support for LMhead and LogitsProcessor (#23564)
zzhx1 Sep 10, 2025
854b2c4
[CI] Fix tensorizer test assertion (#24545)
pwschuurman Sep 10, 2025
74a072d
[Core] Split LoRA layers (#24574)
jeejeelee Sep 10, 2025
68d5b4f
[Doc] Add documentation for GLM-4.5 series models: tool-calling and r…
WangErXiao Sep 10, 2025
8889ec1
[Logging] allow config logging stream (#24336)
842974287 Sep 10, 2025
33299a7
[Bugfix] fix modelopt exclude_modules name mapping (#24178)
tomeras91 Sep 10, 2025
9803fd7
[Bugfix] Fix DeepEP config for DP4TP4 (#23619)
minosfuture Sep 10, 2025
f87ee3a
[Core] Support configuration parsing plugin (#24277)
charlotte12l Sep 10, 2025
5e4d508
[Misc] update log level debug to warning when process port is used by…
lengrongfu Sep 10, 2025
29ead89
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non…
gau-nernst Sep 10, 2025
c1a565a
[CI] Fail subprocess tests with root-cause error (#23795)
njhill Sep 10, 2025
93899a9
[v1] Add Whisper model support (encoder-decoder) (#21088)
russellb Sep 10, 2025
6d0ff0f
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 a…
gshtras Sep 10, 2025
1c0898f
[gpt-oss] raise error for flashinfer backend without trtllm (#24482)
heheda12345 Sep 10, 2025
ccc8d19
[Perf] Warmup FlashInfer attention during startup (#23439)
mgoin Sep 10, 2025
b007211
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration (#21078)
hjjq Sep 10, 2025
3866b8e
[Misc] Make timeout passable in init_distributed_environment (#24522)
jberkhahn Sep 10, 2025
a295d9e
[Models][Quantization] Add quantization configuration update in Voxtr…
anmarques Sep 11, 2025
6729ae7
[distributed] update known issues (#24624)
youkaichao Sep 11, 2025
51276e7
Add @chaunceyjiang to codeowner for reasoning Reasoning and Tool pars…
chaunceyjiang Sep 11, 2025
b353521
[Bug] [Spec Decode] Fix model_initialization test and mismatch in aux…
wwl2755 Sep 11, 2025
b97a919
[Ultravox] Fix Gemma instantiation, support quantization via --hf-ove…
petersalas Sep 11, 2025
aa53ea1
[Bugfix] Add missing VIT backend dispatch on CPU (#24623)
bigPYJ1151 Sep 11, 2025
32dddb3
[BugFix] Fix pipeline parallel (#24621)
njhill Sep 11, 2025
3f97756
[Engine][Chore] use local variable and remove output var assignment (…
GuyStone Sep 11, 2025
9bed487
Kimi K2 Fused MoE kernels Optimization configs (#24597)
samanamp Sep 11, 2025
9254401
Enable --profile in 'vllm bench throughput' (#24575)
tomasruizt Sep 11, 2025
993e98d
[Core] feat: Add --safetensors-load-strategy flag for faster safetens…
shengshiqi-google Sep 11, 2025
c19917d
[Doc]: fixing doc typos (#24635)
didier-durand Sep 11, 2025
392f6b2
[Model] New model support for Motif-1-Tiny (#23414)
ca1207 Sep 11, 2025
afdc8bb
Remove redundant all gather + split (#23441)
chenxi-yang Sep 11, 2025
fa68ce0
[torchao] Support quantization configs using module swap (#21982)
jerryzh168 Sep 11, 2025
54c2286
Add the support for the qwen3 next model (a hybrid attention model). …
sighingnow Sep 11, 2025
7c8a30f
[Bugfix] Fix incorrect import of CacheConfig (#24631)
DarkLight1337 Sep 11, 2025
2cb5e96
[Docs] Revise frameworks/anything-llm.md (#24489)
windsonsea Sep 11, 2025
f8d52b3
[Docs] Update V1 doc to reflect whisper support (#24606)
russellb Sep 11, 2025
67a62ef
[Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ (#24…
windsonsea Sep 11, 2025
5672c2b
[CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Tes…
charlotte12l Sep 11, 2025
10a342e
[Bugfix] Fix _synced_weight_loader (#24565)
kyuyeunk Sep 11, 2025
c6eb9d2
[CI] Split pooling from entrypoints Test (#24632)
noooop Sep 11, 2025
fe45b35
[Misc] Add @NickLucche to codeowners (#24647)
NickLucche Sep 11, 2025
c86420f
[CI Failure] fix models/language/pooling/test_auto_prefix_cache_suppo…
noooop Sep 11, 2025
2719772
Fix typing for `safetensors_load_strategy` (#24641)
hmellor Sep 11, 2025
f327073
Move `LoRAConfig` from `config/__init__.py` to `config/lora.py` (#24644)
hmellor Sep 11, 2025
9e69a5b
[XPU] add missing dependency tblib for XPU CI (#24639)
faaany Sep 11, 2025
8207731
[Docs] Fixes a typo in the qwen3next model name. (#24654)
sighingnow Sep 11, 2025
a118f18
[build] add torch to tool.uv no-build-isolation-package (#24303)
youkaichao Sep 11, 2025
50efb4a
[Bench] Add qwen-next in benchmark_moe.py (#24661)
jeejeelee Sep 11, 2025
f4fd61c
[CI] Split mteb test from Language Models Test (#24634)
noooop Sep 11, 2025
6873e04
Allow users to specify kv cache memory size (#21489)
BoyuanFeng Sep 11, 2025
6f32825
[HybridKVCache][Platform] Add support_hybrid_kv_cache for platform (#…
MengqingCao Sep 11, 2025
bfe4dc0
[Bugifx] Fix qwen-next packed_modules_mapping (#24656)
jeejeelee Sep 11, 2025
4880db5
[Docs] Add transcription support to model (#24664)
NickLucche Sep 11, 2025
723f70d
[Doc] Fix Markdown Pre-commit Error (#24670)
yewentao256 Sep 11, 2025
044d39d
[Docs] Fix typos in EP deployment doc (#24669)
hmellor Sep 11, 2025
4692578
[VLM] Optimize GLM4.5-V-style video processing to only decode necessa…
Isotr0py Sep 11, 2025
a3d4cf3
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default (#24111)
ilmarkov Sep 11, 2025
7b7b33b
[Bugfix] Fix platform-specific routing in CustomOp implementations (#…
kzawora-intel Sep 11, 2025
0af7bf5
Fix model name included in responses (#24663)
hmellor Sep 11, 2025
535d67c
fix some typos (#24616)
co63oc Sep 11, 2025
e7c96ef
[Docs] Fix formatting of transcription doc (#24676)
hmellor Sep 11, 2025
287b75b
[VLM] Migrate remain DP-supported ViT models to use `disable_tp` (#24…
Isotr0py Sep 11, 2025
845b8d3
[Ultravox] Use wrapped_model_config to instantiate inner model (#24679)
petersalas Sep 11, 2025
3bcdb78
[Doc] Remove Useless Comments (#24687)
yewentao256 Sep 11, 2025
6ca8058
[Qwen3-Next] Add MoE Config for H200 (#24688)
WoosukKwon Sep 11, 2025
bb7b3be
[BugFix] Fix tokenize asyncio task leak (#24677)
njhill Sep 11, 2025
c398458
Update Spec Decode metrics to include drafted and accepted token thro…
qandrew Sep 11, 2025
a84109a
[Kernel][B200] `mxfp4` fused cutlass moe (#23696)
djmmoss Sep 11, 2025
391b6ed
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill att…
mxz297 Sep 11, 2025
806e5c7
[Bugfix] Set `VLLM_ALLREDUCE_USE_SYMM_MEM` default to False (#24696)
yewentao256 Sep 11, 2025
0f69164
[Qwen3-Next] MoE configs for H200 TP=1,2,4 (#24695)
WoosukKwon Sep 11, 2025
b2f3f30
[CI/Build] Add bc-linter to vLLM CI (#21234)
zhewenl Sep 11, 2025
996608d
[Qwen3-Next] Add B200 MoE configs for Qwen3-next (#24698)
vadiklyutiy Sep 11, 2025
bef7e1e
[Bugfix][Attention] Fix FlashInfer MLA block size logic (#24692)
MatthewBonanni Sep 11, 2025
2f0c63e
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel (#23280)
mgoin Sep 11, 2025
7b8868a
[Qwen3-Next] MOE configs for H100 TP4 (#24699)
heheda12345 Sep 11, 2025
be9dfc7
[Doc] Clarify cudagraph capture size logic and default behavior in sc…
Zazzle516 Sep 11, 2025
02413fc
[Bug] Fix Layer `weight_block_size` Assertion Issue (#24674)
yewentao256 Sep 11, 2025
c89f6be
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens (#24…
LucasWilkinson Sep 12, 2025
4e63dd2
[V1] feat:add engine v1 tracing (#20372)
RichardoMrMu Sep 12, 2025
874547d
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative…
sighingnow Sep 12, 2025
10f74b4
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 (#24707)
jeejeelee Sep 12, 2025
53a9762
[DOCs] Update ROCm installation docs section (#24691)
gshtras Sep 12, 2025
5e6f0d1
Enable conversion of multimodal models to pooling tasks (#24451)
maxdebayser Sep 12, 2025
334b518
Fix implementation divergence for BLOOM models between vLLM and Huggi…
qthequartermasterman Sep 12, 2025
9f35f80
[Bugfix] Fix MRoPE dispatch on CPU (#24712)
bigPYJ1151 Sep 12, 2025
88a4548
[BugFix] Fix Qwen3-Next PP (#24709)
njhill Sep 12, 2025
e931fff
[CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cach…
heheda12345 Sep 12, 2025
ac2fa11
[CI] Add ci_envs for convenient local testing (#24630)
noooop Sep 12, 2025
edef31f
[CI/Build] Skip prompt embeddings tests on V1-only CPU backend (#24721)
bigPYJ1151 Sep 12, 2025
4aab8b2
[Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or rela…
heheda12345 Sep 12, 2025
f70592f
[Bugfix] Fix BNB name match (#24735)
jeejeelee Sep 12, 2025
6d7980e
[Kernel] [CPU] refactor `cpu_attn.py:_run_sdpa_forward` for better me…
ignaciosica Sep 12, 2025
5fccffd
[sleep mode] save memory for on-the-fly quantization (#24731)
youkaichao Sep 12, 2025
eacddf1
[Multi Modal] Add FA3 in VIT (#24347)
wwl2755 Sep 12, 2025
654c56e
[Multimodal] Remove legacy multimodal fields in favor of MultiModalFe…
sfeng33 Sep 12, 2025
97a3444
[Doc]: fix typos in various files (#24726)
didier-durand Sep 12, 2025
18fdd76
[Docs] Fix warnings in mkdocs build (continued) (#24740)
Zerohertz Sep 12, 2025
2bc231e
[Bugfix] Fix MRoPE dispatch on XPU (#24724)
yma11 Sep 12, 2025
cba8d65
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP (#24739)
elvircrn Sep 12, 2025
d59f41e
[Core] Shared memory based object store for Multimodal data caching a…
dongluw Sep 12, 2025
1d09ff3
[Bugfix][Frontend] Fix `--enable-log-outputs` does not match the docu…
kebe7jun Sep 12, 2025
dc53bde
[Models] Optimise and simplify `_validate_and_reshape_mm_tensor` (#24…
lgeiger Sep 12, 2025
7357361
[Models] Prevent CUDA sync in Qwen2.5-VL (#24741)
lgeiger Sep 12, 2025
a64af8d
[Model] Switch to Fused RMSNorm in GLM-4.1V model (#24733)
SamitHuang Sep 12, 2025
d9edb64
[UX] Remove AsyncLLM torch profiler disabled log (#24609)
mgoin Sep 12, 2025
8287efe
[CI] Speed up model unit tests in CI (#24253)
afeldman-nm Sep 12, 2025
8f3a71b
[Bugfix] Fix incompatibility between #20452 and #24548 (#24754)
DarkLight1337 Sep 12, 2025
0412d1e
[CI] Trigger BC Linter when labels are added/removed (#24767)
zhewenl Sep 12, 2025
40e452c
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpo…
smarterclayton Sep 12, 2025
8937dc5
[Compilation Bug] Fix Inductor Graph Output with Shape Issue (#24772)
yewentao256 Sep 12, 2025
b824377
Invert pattern order to make sure that out_proj layers are identified…
anmarques Sep 12, 2025
d0431a7
[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode (#2…
MatthewBonanni Sep 12, 2025
a0483c5
Add FLASHINFER_MLA to backend selector test (#24753)
MatthewBonanni Sep 12, 2025
496d7fa
[Qwen3Next] Fixes the cuda graph capture conditions under large batch…
sighingnow Sep 12, 2025
9ec0b45
[Core] Support async scheduling with uniproc executor (#24219)
njhill Sep 12, 2025
a2f816a
[Frontend][Multimodal] Allow skipping media data when UUIDs are provi…
huachenheli Sep 13, 2025
0e443a7
Resolve rebase conflict in batched_deep_gemm_moe: use _get_effective_…
skyloevil Sep 13, 2025
6594d05
Merge branch 'main' into optimize/moe-dispatch-efficiency
skyloevil Sep 13, 2025
107 changes: 103 additions & 4 deletions vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import os
from typing import Optional

import torch

import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm.distributed import (get_tensor_model_parallel_rank,
get_tensor_model_parallel_world_size)
from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig
from vllm.model_executor.layers.fused_moe.topk_weight_and_reduce import (
@@ -188,6 +191,9 @@ class BatchedDeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):
# The Deep Gemm kernels only support block size of 128
DEEPGEMM_BLOCK_SHAPE: list[int] = [128, 128]

# Class variable for one-shot logging to verify dispatch optimization
_logged_dispatch_once: bool = False

def __init__(self,
max_num_tokens: int,
num_dispatchers: int,
@@ -199,6 +205,13 @@ def __init__(self,
block_shape: Block quantization block shape.
per_act_token_quant: Per activation token quantization flag.
"""
logger.info(
"[MoE Debug] BatchedDeepGemmExperts.__init__ called with: "
"max_num_tokens=%s, num_dispatchers=%s, block_shape=%s, "
"per_act_token_quant=%s, DEEPGEMM_BLOCK_SHAPE=%s", max_num_tokens,
num_dispatchers, block_shape, per_act_token_quant,
self.DEEPGEMM_BLOCK_SHAPE)

super().__init__(
FusedMoEQuantConfig(
quant_dtype=torch.float8_e4m3fn,
@@ -209,6 +222,10 @@ def __init__(self,
self.max_num_tokens = max_num_tokens
self.num_dispatchers = num_dispatchers

logger.info(
"[MoE Debug] BatchedDeepGemmExperts initialized successfully with "
"final block_shape=%s", self.block_shape)

@property
def activation_formats(
self
@@ -222,6 +239,56 @@ def supports_chunking(self) -> bool:
def supports_expert_map(self) -> bool:
return False

def _get_effective_num_dispatchers(self) -> int:
"""
Calculates the effective number of token dispatchers considering tensor
parallelism.

When tensor parallelism (TP) is used (TP > 1), only the leader rank
(rank 0) in each TP group should dispatch tokens to avoid redundant
communication. This significantly reduces cross-rank communication
overhead in distributed environments.

Returns:
int: The effective number of dispatchers to use.
When TP > 1:
- Returns max(1, num_dispatchers // tp_size) for leader ranks
(tp_rank == 0)
- Returns 0 for non-leader ranks (tp_rank != 0)
When TP <= 1:
- Returns the original num_dispatchers

Note:
Leader ranks are guaranteed at least 1 dispatcher for stability,
while non-leader ranks return 0 to eliminate redundant dispatching.
"""
tp_size = get_tensor_model_parallel_world_size()
tp_rank = get_tensor_model_parallel_rank()

if tp_size <= 1:
# No TP or single device - use all dispatchers
return self.num_dispatchers

# TP > 1 case
eff = (max(1, self.num_dispatchers // tp_size) if tp_rank == 0 else 0)

# --- lightweight one-shot log for verification ---
if (not BatchedDeepGemmExperts._logged_dispatch_once
and os.getenv("VLLM_LOG_MOE_DISPATCH", "0") == "1"):
logger.info(
"[moe-dispatch-opt] tp_rank=%d/%d, num_dispatchers=%d -> "
"effective=%d, leader=%s, participates_a2a=%s",
tp_rank,
tp_size,
self.num_dispatchers,
eff,
str(tp_rank == 0),
str(eff > 0),
)
BatchedDeepGemmExperts._logged_dispatch_once = True

return eff
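The arithmetic above can be sketched standalone (a hedged illustration: `effective_num_dispatchers` is a hypothetical helper name, and `tp_size`/`tp_rank` stand in for `get_tensor_model_parallel_world_size()`/`get_tensor_model_parallel_rank()`):

```python
def effective_num_dispatchers(num_dispatchers: int, tp_size: int,
                              tp_rank: int) -> int:
    """Mirrors _get_effective_num_dispatchers: with TP > 1, only the TP
    leader (rank 0) dispatches, with its dispatcher count scaled down by
    the TP size but floored at 1; every other rank dispatches nothing."""
    if tp_size <= 1:
        return num_dispatchers
    return max(1, num_dispatchers // tp_size) if tp_rank == 0 else 0

# 8 dispatchers, TP=4: the leader keeps 2, the other ranks drop to 0.
print(effective_num_dispatchers(8, 4, 0))  # -> 2
print(effective_num_dispatchers(8, 4, 1))  # -> 0
# The leader is floored at one dispatcher even when tp_size > num_dispatchers.
print(effective_num_dispatchers(2, 8, 0))  # -> 1
```

The floor at 1 is what the docstring calls the stability guarantee: a leader rank can never end up with zero dispatchers.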

def finalize_weight_and_reduce_impl(self) -> mk.TopKWeightAndReduce:
# Let PrepareAndFinalize::finalize() decide the impl.
return TopKWeightAndReduceDelegate()
@@ -239,10 +306,10 @@ def workspace_shapes(
expert_tokens_metadata: Optional[mk.ExpertTokensMetadata],
) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...], torch.dtype]:
assert a.dim() == 2
# FIXME (varun): We should be able to dispatch only from the leader
# DP ranks in the case of TP > 1. At the moment, all the Ranks
# end up sending their tokens. This needs to be fixed.
num_dispatchers = self.num_dispatchers
# Optimize token dispatch: only leader DP ranks dispatch tokens when
# TP > 1. This reduces cross-rank communication overhead in distributed
# MoE models.
num_dispatchers = self._get_effective_num_dispatchers()
num_experts = local_num_experts
max_num_tokens = a.size(
0) if self.max_num_tokens is None else self.max_num_tokens
@@ -274,9 +341,31 @@ def apply(
expert_tokens_meta: Optional[mk.ExpertTokensMetadata],
apply_router_weight_on_input: bool,
):
logger.info("[MoE Debug] *** BatchedDeepGemmExperts.apply() ENTRY *** "
"THIS IS THE DEEP GEMM IMPLEMENTATION BEING CALLED!")
logger.info(
"[MoE Debug] BatchedDeepGemmExperts.apply() parameters: "
"hidden_states.shape=%s, global_num_experts=%s, activation=%s",
hidden_states.shape, global_num_experts, activation)

assert expert_tokens_meta is not None
expert_num_tokens = expert_tokens_meta.expert_num_tokens

# Monitor expert_num_tokens for workspace allocation analysis
if torch.cuda.is_current_stream_capturing():
logger.debug(
"[MoE Monitor] skip logging during CUDA Graph capture")
else:
cpu_vals = expert_num_tokens.detach().to("cpu")
logger.info(
"[MoE Monitor] expert_num_tokens "
"shape=%s sum=%d max=%d values(sample)=%s",
tuple(expert_num_tokens.shape),
int(cpu_vals.sum().item()),
int(cpu_vals.max().item()),
cpu_vals.numpy(),
)

assert hidden_states.ndim == 3
assert self.block_shape is not None

@@ -288,17 +377,27 @@ def apply(
E, max_num_tokens, N, K, top_k_num = mk._moe_problem_size(
hidden_states, w1, w2, topk_ids)

logger.info(
"[MoE Debug] Problem size: E=%s, max_num_tokens=%s, N=%s, K=%s, "
"top_k_num=%s", E, max_num_tokens, N, K, top_k_num)

workspace1 = _resize_cache(workspace13, (E, max_num_tokens, N))

# (from deepgemm docs) : A value hint (which is a value on CPU)
# for the M expectation of each batch, correctly setting this value
# may lead to better performance.
expected_m = max_num_tokens

logger.info("[MoE Debug] Calling first fp8_m_grouped_gemm_nt_masked")
fp8_m_grouped_gemm_nt_masked((a1q, a1q_scale), (w1, w1_scale),
workspace1, expert_num_tokens, expected_m)

logger.info("[MoE Debug] Calling silu_mul_fp8_quant_deep_gemm")
a2q, a2q_scale = silu_mul_fp8_quant_deep_gemm(workspace1,
expert_num_tokens)

logger.info("[MoE Debug] Calling second fp8_m_grouped_gemm_nt_masked")
fp8_m_grouped_gemm_nt_masked((a2q, a2q_scale), (w2, w2_scale), output,
expert_num_tokens, expected_m)

logger.info("[MoE Debug] *** BatchedDeepGemmExperts.apply() EXIT ***")
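The env-gated one-shot log in `_get_effective_num_dispatchers` reduces to a small class-level-flag pattern; a minimal sketch (the `DispatchLogger` class and demo values are illustrative — only the `VLLM_LOG_MOE_DISPATCH` variable comes from the diff):

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("moe_dispatch_demo")


class DispatchLogger:
    # Class (not instance) attribute: the message fires at most once per
    # process, no matter how many instances call log_dispatch().
    _logged_once = False

    def log_dispatch(self, tp_rank: int, tp_size: int, eff: int) -> None:
        if (not DispatchLogger._logged_once
                and os.getenv("VLLM_LOG_MOE_DISPATCH", "0") == "1"):
            # %-style args are only formatted if the record is emitted.
            logger.info("tp_rank=%d/%d -> effective=%d", tp_rank, tp_size, eff)
            DispatchLogger._logged_once = True


os.environ["VLLM_LOG_MOE_DISPATCH"] = "1"  # opt in for the demo
DispatchLogger().log_dispatch(0, 4, 2)  # logs once
DispatchLogger().log_dispatch(0, 4, 2)  # silent: flag already set
```

Gating on an env var keeps the hot path to a single attribute check once the flag is set, which matters since this runs inside `workspace_shapes`.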
Original file line number Diff line number Diff line change
@@ -5,12 +5,15 @@
import torch

import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe import (
BatchedDeepGemmExperts)
from vllm.model_executor.layers.fused_moe.config import FusedMoEQuantConfig
from vllm.model_executor.layers.fused_moe.fused_batched_moe import (
BatchedTritonExperts)

logger = init_logger(__name__)


class BatchedTritonOrDeepGemmExperts(mk.FusedMoEPermuteExpertsUnpermute):

@@ -24,6 +27,11 @@ def __init__(self,
block_shape: Optional[list[int]] = None,
per_act_token_quant: bool = False,
allow_deep_gemm: bool = False):
logger.info(
"[MoE Debug] BatchedTritonOrDeepGemmExperts.__init__ called with: "
"max_num_tokens=%s, num_dispatchers=%s, use_fp8_w8a8=%s, "
"allow_deep_gemm=%s, block_shape=%s", max_num_tokens,
num_dispatchers, use_fp8_w8a8, allow_deep_gemm, block_shape)
assert not use_int8_w8a8, "NYI"
assert not use_int8_w8a16, "NYI"
assert not use_int4_w4a16, "NYI"
@@ -49,16 +57,34 @@ def __init__(self,
block_shape=self.block_shape,
)

logger.debug(
"[MoE Debug] BatchedTritonOrDeepGemmExperts init: "
"allow_deep_gemm=%s, use_fp8_w8a8=%s, block_shape=%s, "
"DEEPGEMM_BLOCK_SHAPE=%s", allow_deep_gemm, use_fp8_w8a8,
self.block_shape, BatchedDeepGemmExperts.DEEPGEMM_BLOCK_SHAPE)

self.allow_deep_gemm = (allow_deep_gemm and use_fp8_w8a8
and self.block_shape
== BatchedDeepGemmExperts.DEEPGEMM_BLOCK_SHAPE)

logger.debug(
"[MoE Debug] Final allow_deep_gemm decision: %s "
"(conditions: allow=%s, fp8=%s, shape_match=%s)",
self.allow_deep_gemm, allow_deep_gemm, use_fp8_w8a8,
self.block_shape == BatchedDeepGemmExperts.DEEPGEMM_BLOCK_SHAPE)

self.batched_deep_gemm_experts = BatchedDeepGemmExperts(
max_num_tokens=max_num_tokens,
num_dispatchers=num_dispatchers,
block_shape=self.block_shape, # type: ignore[arg-type]
) if self.allow_deep_gemm else None

if self.allow_deep_gemm:
logger.debug(
"[MoE Debug] Created BatchedDeepGemmExperts successfully")
else:
logger.debug("[MoE Debug] Using BatchedTritonExperts fallback")

assert (self.batched_deep_gemm_experts is not None
or self.batched_triton_experts is not None)

@@ -154,11 +180,27 @@ def apply(
expert_tokens_meta: Optional[mk.ExpertTokensMetadata],
apply_router_weight_on_input: bool,
):
logger.info(
"[MoE Debug] BatchedTritonOrDeepGemmExperts.apply() ENTRY: "
"allow_deep_gemm=%s, hidden_states.shape=%s, global_num_experts=%s",
self.allow_deep_gemm, hidden_states.shape, global_num_experts)

experts = (self.batched_deep_gemm_experts
if self.allow_deep_gemm else self.batched_triton_experts)

# Log which expert implementation is being used
if self.allow_deep_gemm:
logger.info(
"[MoE Debug] Using BatchedDeepGemmExperts for forward pass")
else:
logger.info(
"[MoE Debug] Using BatchedTritonExperts for forward pass")

assert experts is not None
experts.apply(output, hidden_states, w1, w2, topk_weights, topk_ids,
activation, global_num_experts, expert_map, w1_scale,
w2_scale, w1_zp, w2_zp, a1q_scale, a2_scale, workspace13,
workspace2, expert_tokens_meta,
apply_router_weight_on_input)

logger.info("[MoE Debug] BatchedTritonOrDeepGemmExperts.apply() EXIT")
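The `allow_deep_gemm` gating in `__init__` above can be read as a pure predicate (a sketch with illustrative names; the 128×128 constant matches `DEEPGEMM_BLOCK_SHAPE` from the first file):

```python
from typing import Optional

DEEPGEMM_BLOCK_SHAPE = [128, 128]  # the only block size DeepGEMM supports


def select_deep_gemm(allow_deep_gemm: bool, use_fp8_w8a8: bool,
                     block_shape: Optional[list[int]]) -> bool:
    """True only when DeepGEMM is allowed, fp8 w8a8 quantization is active,
    and the block shape matches DeepGEMM's fixed 128x128 blocks; otherwise
    the layer falls back to BatchedTritonExperts."""
    return bool(allow_deep_gemm and use_fp8_w8a8
                and block_shape == DEEPGEMM_BLOCK_SHAPE)


print(select_deep_gemm(True, True, [128, 128]))  # -> True
print(select_deep_gemm(True, True, [64, 64]))    # -> False (Triton fallback)
print(select_deep_gemm(True, True, None))        # -> False (no block quant)
```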
Original file line number Diff line number Diff line change
@@ -11,6 +11,8 @@
TopKWeightAndReduceContiguous, TopKWeightAndReduceDelegate)
from vllm.model_executor.layers.fused_moe.utils import (
moe_kernel_quantize_input)
from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.distributed.parallel_state import get_tp_group


class DeepEPHTPrepareAndFinalize(mk.FusedMoEPrepareAndFinalize):
@@ -193,6 +195,15 @@ def prepare_async(
quant_config: FusedMoEQuantConfig,
) -> Callable:

# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = get_tp_group().rank_in_group if tp_world_size > 1 else 0
if tp_world_size > 1 and tp_rank_in_group != 0:
# Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
a1 = a1[:0]
topk_ids = topk_ids[:0]
topk_weights = topk_weights[:0]

if apply_router_weight_on_input:
topk = topk_ids.size(1)
# TODO: this only works for topK=1, will need to update for topK>1
Original file line number Diff line number Diff line change
@@ -11,6 +11,8 @@
TopKWeightAndReduceDelegate)
from vllm.model_executor.layers.fused_moe.utils import (
moe_kernel_quantize_input, normalize_batched_scales_shape)
from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.distributed.parallel_state import get_tp_group

# DeepEP kernels quantize dispatch inputs in 128 element chunks.
DEEPEP_QUANT_BLOCK_SIZE = 128
@@ -147,6 +149,15 @@ def prepare_async(
"apply_router_weight_on_input is only implemented for topk=1")
a1 = a1 * topk_weights.to(a1.dtype)

# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = get_tp_group().rank_in_group if tp_world_size > 1 else 0
if tp_world_size > 1 and tp_rank_in_group != 0:
# Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
a1 = a1[:0]
topk_ids = topk_ids[:0]
topk_weights = topk_weights[:0]

# Dispatch
expert_x, expert_num_tokens, self.handle, event, hook = \
self.buffer.low_latency_dispatch(a1,
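The non-leader truncation in both `prepare_async` hunks relies on zero-row slicing preserving dtype and per-token shape, so the downstream dispatch call runs unchanged but contributes no tokens from that rank. A NumPy sketch of that invariant (shapes and values are illustrative, not taken from the diff):

```python
import numpy as np

a1 = np.ones((16, 4096), dtype=np.float16)    # 16 tokens, hidden size 4096
topk_ids = np.zeros((16, 2), dtype=np.int64)
topk_weights = np.ones((16, 2), dtype=np.float32)

tp_world_size, tp_rank_in_group = 4, 1        # a non-leader TP rank
if tp_world_size > 1 and tp_rank_in_group != 0:
    # Zero-row views: dtype and trailing dims survive, row count drops to 0.
    a1 = a1[:0]
    topk_ids = topk_ids[:0]
    topk_weights = topk_weights[:0]

print(a1.shape, a1.dtype)   # -> (0, 4096) float16
print(topk_ids.shape)       # -> (0, 2)
```

The same slicing works on `torch.Tensor`, which is what the actual code operates on.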