Changes from all commits (439 commits)
cbc8457
[Model] Switch to Fused RMS norm in Qwen2.5_VL model. (#22184)
vllmellm Aug 7, 2025
3706618
[Frontend] Update OpenAI error response to upstream format (#22099)
msanft Aug 7, 2025
82216dc
[Misc] Support routing logic simulation (#21990)
minosfuture Aug 7, 2025
8e8e0b6
feat: Add --enable-log-outputs flag for logging model generations (#2…
mizadri Aug 7, 2025
434d2f3
[Docs] Add missing dependency for docs build (#22435)
hmellor Aug 7, 2025
c2dba2d
Add H20-3e fused MoE kernel tuning configs for GLM-4.5 (#22433)
JaceyShao Aug 7, 2025
136825d
[Misc] Enhance code formatting in mxfp4.py (#22423)
WoosukKwon Aug 7, 2025
5e83988
[Doc] Fix link to prefix caching design (#22384)
sarckk Aug 7, 2025
a2c6696
[Docs] Factor out troubleshooting to its own guide; add section for R…
crypdick Aug 7, 2025
35171b1
[Doc] update docs for nightly benchmarks (#12022)
andrewkchan Aug 7, 2025
289b18e
[Docs] Update features/disagg_prefill, add v1 examples and developmen…
david6666666 Aug 7, 2025
766bc81
[Core] Store only the keys for multi-modal data in P0 (#22198)
DarkLight1337 Aug 7, 2025
7e0b121
[Bugfix] Add missing `packed_modules_mapping` to `DeepseekV2ForCausal…
fxmarty-amd Aug 7, 2025
4da8bf2
[Tool] Fix auto tool call (#22434)
heheda12345 Aug 7, 2025
4815b00
[gpt-oss] Generate ResponseOutputItem from Harmony Message (#22410)
heheda12345 Aug 7, 2025
399d2a1
Fix pre-commit error in main (#22462)
WoosukKwon Aug 7, 2025
8c9da6b
[Core] Simplify mm processing cache (#22457)
DarkLight1337 Aug 7, 2025
139d155
[Frontend] Use engine argument to control MM cache size (#22441)
DarkLight1337 Aug 7, 2025
7e3a8dc
Remove `from_dict` from `SpeculativeConfig` (#22451)
hmellor Aug 7, 2025
acf8aeb
[Misc] normalize multiprocessing Queue usage (#22371)
andyxning Aug 8, 2025
1ee5ead
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine (…
tjtanaa Aug 8, 2025
e2c8f1e
[PERF] Use pybase64 to more quickly decode prompt embeddings (#22469)
qthequartermasterman Aug 8, 2025
d57dc23
Add ModelOpt Qwen3 nvfp4 support (#20101)
Edwardf0t1 Aug 8, 2025
a3b9c17
Support Tensorrt-LLM MoE fp4 for low-latency (#21331)
wenscarl Aug 8, 2025
b2c8ce5
Fix Flashinfer CUTLASS MOE Allgather (#21963)
wenscarl Aug 8, 2025
3303f13
[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO …
0xjunhao Aug 8, 2025
17eaaef
[Bugfix] Fix RuntimeError: Index put requires the source and destinat…
chaunceyjiang Aug 8, 2025
c152e2a
not tie_word_embeddings for glm-4.5 and glm-4.5v (#22460)
zRzRzRzRzRzRzR Aug 8, 2025
6f28791
Optimize MiniCPMO mask creation with vectorized implementation (#22464)
skyloevil Aug 8, 2025
157f9c1
Fix pre-commit (#22487)
DarkLight1337 Aug 8, 2025
af473f0
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 (#22426)
nvpohanh Aug 8, 2025
099c046
[Doc] Sleep mode documentation (#22310)
iAmir97 Aug 8, 2025
808a7b6
[bench] Fix benchmark/serve.py to ignore unavailable results (#22382)
lk-chen Aug 8, 2025
1712543
[CI/Build] Fix multimodal tests (#22491)
DarkLight1337 Aug 8, 2025
43c4f3d
[Misc] Begin deprecation of `get_tensor_model_*_group` (#22494)
DarkLight1337 Aug 8, 2025
9040639
[Misc] fix openai version (#22485)
lengrongfu Aug 8, 2025
ccdae73
[BugFix] Don't cancel asyncio tasks directly from destructors (#22476)
njhill Aug 8, 2025
7be7f38
[Docs] Improve API docs (+small tweaks) (#22459)
hmellor Aug 8, 2025
e5ebeeb
Remove exception for Python 3.8 typing from linter (#22506)
hmellor Aug 8, 2025
e789cad
[gpt-oss] triton kernel mxfp4 (#22421)
zyongye Aug 8, 2025
f0964e2
[Benchmark] Add benchmark tool for multi turn conversations (#20267)
pliops-daniels Aug 8, 2025
f756a68
[gpt-oss] guard import when triton kernel is not installed (#22529)
zyongye Aug 8, 2025
e290594
[Docs] Rename “Distributed inference and serving” to “Parallelism & S…
crypdick Aug 8, 2025
fe6d825
[gpt-oss] Support tool call and implement MCP tool server (#22427)
heheda12345 Aug 8, 2025
cd9b9de
[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA (#…
LucasWilkinson Aug 8, 2025
f703b92
[Misc] DeepGEMM : Avoid JIT generation in the hot-path (#22215)
varun-sundar-rabindranath Aug 8, 2025
bd875d2
[Bugfix] Update FA commit hash (#22546)
tdoublep Aug 8, 2025
41b9655
Skip Qwen 1 in CI because remote code is no longer compatible with Tr…
hmellor Aug 8, 2025
2fcf6b2
[Docs] fix broken links in metrics.md (#22315)
GuyStone Aug 8, 2025
baece8c
[Frontend] Add unix domain socket support (#18097)
yyweiss Aug 8, 2025
e3edc0a
Extract `CompilationConfig` from `config.py` (#22524)
hmellor Aug 8, 2025
311d875
Drop flaky test_healthcheck_response_time (#22539)
russellb Aug 8, 2025
81c57f6
[XPU] upgrade torch 2.8 on for XPU (#22300)
jikunshang Aug 9, 2025
35afe1b
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec…
Pradyun92 Aug 9, 2025
429e4e2
[Bugfix] Fix ModernBert cuda graph capturing in v1 (#21901)
Isotr0py Aug 9, 2025
08b751b
Implicit language-model-only mode via limit-mm-per-prompt (#22299)
Aug 9, 2025
23472ff
[Doc] Add usage of implicit text-only mode (#22561)
Aug 9, 2025
8a0ffd6
Remove mamba_ssm from vLLM requirements; install inside test containe…
tdoublep Aug 9, 2025
3157aeb
[Log] Add Warning for Deprecation of DeepGEMM old version (#22194)
yewentao256 Aug 9, 2025
6ade99e
[V1] [Hybrid] Support Minimax-Text-01 in V1 (#22151)
tdoublep Aug 9, 2025
7ad7adb
v1: Pass KVConnectorOutput to scheduler-side (#22157)
orozery Aug 9, 2025
65552b4
[Misc] Use config definitions from Transformers library (#21913)
DarkLight1337 Aug 9, 2025
10a0253
Fix loading of quantized BigCode models (#22463)
eldarkurtic Aug 9, 2025
9a0c5de
[TPU] Add support for online w8a8 quantization (#22425)
kyuyeunk Aug 9, 2025
b7c0942
[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged a…
charlifu Aug 9, 2025
7920e9b
[Bugfix] Fix failing GPT-OSS initialization test (#22557)
Isotr0py Aug 9, 2025
0edc0cd
[Bugfix] Fix CI moe kernel failure (#22556)
jeejeelee Aug 9, 2025
2be07a0
Update docs for Minimax-Text support (#22562)
tdoublep Aug 9, 2025
a6022e6
GLM-4.5V with new class name at transformers (#22520)
zRzRzRzRzRzRzR Aug 9, 2025
1bf5e1f
[CI] [Hybrid] Speed up hybrid models test by removing large models (…
tdoublep Aug 9, 2025
5618647
[Docs] Reduce noise in docs and `--help` from the JSON tip (#22567)
hmellor Aug 9, 2025
2d18256
Move `ParallelConfig` from `config/__init__.py` to `config/parallel.p…
hmellor Aug 9, 2025
5a16fa6
[Model] Gemma3n MM (#20495)
NickLucche Aug 9, 2025
fbd8595
[Bugfix] Fix basic models tests hanging due to mm processor creation …
Isotr0py Aug 9, 2025
42172ad
[FEAT] [Performance] Add triton mrope to replace the torch code path …
tjtanaa Aug 9, 2025
61f67d8
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers (…
tdoublep Aug 10, 2025
0c5254b
[oss] Init gpt-oss bf16 support (#22508)
jeejeelee Aug 10, 2025
3d7363e
[Config] add "qwen" as a native eagle3 target supported model (#22333)
lec77 Aug 10, 2025
534c45b
Improve fast_topk function with type hints and documentation (#22530)
skyloevil Aug 10, 2025
2a84fb4
[TPU] kv cache update kernel doesn't need to be padded slices to mult…
yaochengji Aug 10, 2025
c498483
Refactor sliding window configuration to Transformers best practice (…
hmellor Aug 10, 2025
7e8d685
[Minor] Fix pre-commit error on main (#22579)
Isotr0py Aug 10, 2025
3269762
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_conf…
andyxning Aug 10, 2025
010e0e3
[Doc] Fix API doc link in side navigation (#22585)
22quinn Aug 10, 2025
d411df0
[Misc] Further refine type annotations in parallel state (#22499)
DarkLight1337 Aug 10, 2025
00976db
[Docs] Fix warnings in docs build (#22588)
hmellor Aug 10, 2025
049c245
[Misc] Replace flaky image urls in pixtral test (#22574)
Isotr0py Aug 10, 2025
8290d15
Move `CacheConfig` from `config/__init__.py` to `config/cache.py` (#2…
hmellor Aug 10, 2025
0757551
[doc] add beijing meetup links (#22596)
youkaichao Aug 10, 2025
b81fe83
[doc] add alibaba cloud as sponsor (#22597)
youkaichao Aug 10, 2025
b76753f
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton ke…
Isotr0py Aug 10, 2025
65a7917
Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion…
h-brenoskuk Aug 10, 2025
b4e2916
Migrate LlavaNextImageInputs to TensorSchema (#21774)
bbeckca Aug 10, 2025
8c50d62
Remove redundant row_indices unsqueeze operation in MiniCPMO (#22528)
skyloevil Aug 10, 2025
68b254d
Fix TensorSchema validation test for symbolic dims (#22366)
bbeckca Aug 10, 2025
d1af8b7
enable Docker-aware precompiled wheel setup (#22106)
dougbtv Aug 10, 2025
a554991
Migrate LlavaNextVideoPixelInputs to TensorSchema (#21843)
bbeckca Aug 11, 2025
06da44f
Migrate LlavaImageInputs to TensorSchema (#21770)
bbeckca Aug 11, 2025
b799f4b
[CI/Build] Fix tensorizer test for load_format change (#22583)
22quinn Aug 11, 2025
5898b13
[BugFix] Fix KVConnectorOutput TPU breakage (#22598)
njhill Aug 11, 2025
1b99028
[Misc][gpt-oss] Add rules to label gpt-oss related PRs (#22600)
draftbk Aug 11, 2025
afa5b7c
[Misc][gpt-oss] guard import when triton kernel when not up to date …
zhewenl Aug 11, 2025
f919d4c
[BugFix] Fix logits repetition penalty cuda check (#22592)
PicoCreator Aug 11, 2025
9c97a1c
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. (#22521)
vllmellm Aug 11, 2025
39052db
Support token_type_ids in V1 with less code changes (#21985)
maxdebayser Aug 11, 2025
384a052
[Misc] benchmark_moe supports expert parallel (#22251)
jeejeelee Aug 11, 2025
1e55dfa
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ …
JartX Aug 11, 2025
bc1d02a
[Docs] Add comprehensive CLI reference for all large `vllm` subcomman…
hmellor Aug 11, 2025
ebf7605
[Misc] Move tensor schema tests (#22612)
DarkLight1337 Aug 11, 2025
951b038
[Misc] Move jsontree to utils (#22622)
DarkLight1337 Aug 11, 2025
14a5d90
[Model] NemotronH Support (#22349)
danielafrimi Aug 11, 2025
3fa5b25
Document aarch64 CPU support works (#22646)
ericcurtin Aug 11, 2025
8e13d9f
[Misc] Further clean up some redundant config definitions (#22649)
Isotr0py Aug 11, 2025
f7dcce7
[Feature] Add `VLLM_USE_DEEP_GEMM_E8M0` Env to Control E8M0 Scale (#2…
yewentao256 Aug 11, 2025
16fb668
fix: NIXL connector transfers partial block to pass full multi-modal …
GuanLuo Aug 11, 2025
84cf78a
[Model] Pooling models default to using chunked prefill & prefix cach…
noooop Aug 11, 2025
c90fb03
[CI/Build] Skip Mllama HF runner tests with Transformers v4.55.0 (#22…
Isotr0py Aug 11, 2025
807d21b
[BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI (#22611)
22quinn Aug 11, 2025
65abe11
[CI] Skip Tree Attn Test in `test_max_len.py` to unblock CI (#22664)
tjtanaa Aug 11, 2025
458e74e
Support more parallel styles in Transformers backend TP (#22651)
hmellor Aug 11, 2025
95a935f
[gpt-oss] Support streaming in response API (#22431)
heheda12345 Aug 12, 2025
1891a26
[gpt-oss] Add test for response API + harmony (but skipped) (#22554)
heheda12345 Aug 12, 2025
9b94d6e
Enable 4bit bnb prequant MOE (#21548)
py-andy-c Aug 12, 2025
839ab00
Re-enable Xet on TPU tests now that `hf_xet` has been updated (#22666)
hmellor Aug 12, 2025
dc5e4a6
Upgrade FlashInfer to v0.2.11 (#22613)
nvpohanh Aug 12, 2025
ea1292a
[CI Failure] Use float32 for tests/entrypoints/openai/test_audio.py (…
mgoin Aug 12, 2025
93d0652
[CI] Increase timeout for test_completion_with_image_embeds (#22670)
mgoin Aug 12, 2025
4678503
Migrate MiniCPMVImageInputs to TensorSchema (#21939)
bbeckca Aug 12, 2025
bbaf9e9
[gpt-oss] Fix mxfp4 support (#22700)
heheda12345 Aug 12, 2025
ad344ef
[gpt-oss] Small bug fixes for frontend (#22512)
heheda12345 Aug 12, 2025
4fbd8bb
Fix passing `SpeculativeConfig` from the CLI (#22652)
hmellor Aug 12, 2025
3a7e3bb
[Doc] Added unmentioned required option "method" in the usage of EAGL…
hsliuustc0106 Aug 12, 2025
2f46579
[doc] Update x86 CPU-inference installation doc to reflect optionalit…
sooraj-satheesh Aug 12, 2025
6d729c4
[Bugfix] Fix ModernBert load & Enable sliding window attention for bi…
noooop Aug 12, 2025
78077d5
Move `SchedulerConfig` from `config/__init__.py` to `config/scheduler…
hmellor Aug 12, 2025
59f3b93
[DOC] update v1_guide with INTEL HW (#22679)
xuechendi Aug 12, 2025
9f909b8
[New Model] Support Command-A-Vision (#22660)
dongluw Aug 12, 2025
8d17fa6
[V0] Correct CUDA Graph capture for encoder-decoder models (#22630)
Sugar-zsg Aug 12, 2025
bc8372e
[Bugfix] Fix erroneous randomly generated cases in bad word testing (…
phantomlei3 Aug 12, 2025
1ece7f3
Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_c…
Jun-Howie Aug 12, 2025
46ae7f6
[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, as…
RishiAstra Aug 12, 2025
50f2aae
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decode…
zejunchen-zejun Aug 12, 2025
b8a9d0e
[Misc] remove GH discussions link (#22722)
jeejeelee Aug 12, 2025
007dd90
[gpt-oss] Enable gpt-oss on ampere (#22714)
zyongye Aug 12, 2025
767e63b
[Docs] Improve docs navigation (#22720)
hmellor Aug 12, 2025
d030b01
[BugFix][Nixl][PD] Fix heterogenous TP (#22663)
NickLucche Aug 12, 2025
80bb1e8
Officially support SmolLM3 using the Transformers backend (#22665)
hmellor Aug 12, 2025
f7ad6a1
[CI Failure] fix tests/entrypoints/openai/test_skip_tokenizer.py (#22…
noooop Aug 12, 2025
67c153b
Fix Llama4 FlashInfer FP4 MoE issues (#22511)
nvpohanh Aug 12, 2025
3d9d40e
[Bugfix][CI] Fix `test_remote_decode_lifecycle.py::test_short_prompt_…
NickLucche Aug 12, 2025
e5d3d63
[Benchmark] Fix terminal colors in benchmark_serving_multi_turn (pyth…
pliops-daniels Aug 12, 2025
5a4b4b3
Add: `SupportsEagle3` interface for explicit EAGLE3 support (#22642)
rahul-tuli Aug 12, 2025
c42fe0b
Add more test scenario for tensor schema (#22733)
teekenl Aug 12, 2025
dab4f9f
[Chore] Update CODEOWNERS to include @yewentao256 for CUDA kernels, a…
yewentao256 Aug 12, 2025
6bd8ebf
[Kernel][AMD] Avoid D2H copy and cumsum kernel (#22683)
mxz297 Aug 12, 2025
422f22e
[CI][Nixl] Check kv cache layout during handshake (#22745)
NickLucche Aug 12, 2025
6534d2f
Fix torch version check for SM100 mxfp4 (#22535)
zifeitong Aug 12, 2025
53c7302
[Misc] parametrize 'dtype' in test_flash_mla (#22641)
RUTHLESS-BOT Aug 12, 2025
ba81acb
[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues (#22606)
frankwang28 Aug 12, 2025
45c3936
[Docs] Hide the navigation and toc sidebars on home page (#22749)
hmellor Aug 13, 2025
d0a6301
Fix Transformers backend tensor parallel for multimodal models (#22673)
hmellor Aug 13, 2025
fde0b61
[Model] Decouple glm4v (#22751)
jeejeelee Aug 13, 2025
e188592
Add hardware plugins to installation doc (#22732)
mgoin Aug 13, 2025
71683ca
[V0 Deprecation] Remove multi-step scheduling (#22138)
WoosukKwon Aug 13, 2025
d31f97c
[Misc] Remove tests/multi_step/__init__.py (#22778)
WoosukKwon Aug 13, 2025
c583038
[V0 Deprecation] Remove args for multi-step scheduling (#22779)
WoosukKwon Aug 13, 2025
4f0f844
Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op (#22…
nvpohanh Aug 13, 2025
b1361c7
[Bugfix] Fix default enable for CUTLASS MLA on SM100 (#22738)
mgoin Aug 13, 2025
c6b9287
Force TRTLLM attention for gpt-oss on SM100 (#22678)
mgoin Aug 13, 2025
4082338
Remove unneeded ROCm platform import when using CUDA (#22765)
mgoin Aug 13, 2025
77a6bf0
[Bug] Fix Unexpected Keyword Argument 'w1_bias' (#22757)
yewentao256 Aug 13, 2025
4c558cf
[Perf] Support topk softmax fused kernel for broader num_experts (#22…
shixianc Aug 13, 2025
6807af8
[gpt-oss] upgrade gpt-oss to v0.0.3 and add version check (#22768)
heheda12345 Aug 13, 2025
d16aa3d
[Model] Add option to run Step3VisionEncoder in DP (#22697)
zzh142857 Aug 13, 2025
9e7e5ba
[Model] Add missing prefix to glm4_1v (#22716)
zRzRzRzRzRzRzR Aug 13, 2025
a01e001
[Bugfix] Fix Nemotron VL image processing (#22739)
ducviet00 Aug 13, 2025
3f52738
[Doc] Add max_lora_rank configuration guide (#22782)
chi2liu Aug 13, 2025
d94e302
[V1] Add tree drafting tests for eagle spec decoding (#22705)
TheEpicDolphin Aug 13, 2025
0b1bdac
[Platform] Custom ops support for FusedMoe (#22509)
wangxiyuan Aug 13, 2025
653124b
[Frontend] Add chunked processing to handle long inputs in embedding …
x22x22 Aug 13, 2025
98deac3
[FEATURE] support custom vllm tuned config path for fused moe triton …
vermouth1992 Aug 13, 2025
6b794c7
[Nixl][CI] Fix tests (#22806)
NickLucche Aug 13, 2025
fceafaf
[Bugfix][mamba] Fix type annotation of Mamba2Metadata (#22787)
heheda12345 Aug 13, 2025
6772bb0
Remove unnecessary CUDA sync of qwen image and video preprocess (#22792)
cyyever Aug 13, 2025
b159c0a
Fix GGUF loader for Qwen3 MoE. (#22785)
Gh0u1L5 Aug 13, 2025
20d65aa
[Frontend] Multithreaded async multimodal load_bytes (#22710)
milesial Aug 13, 2025
19b927e
[Core] Use individual MM items in P0/P1 cache and model runner (#22570)
DarkLight1337 Aug 13, 2025
da27051
[Misc] clear and separate error messages for input too long and input…
Aug 13, 2025
9bd9294
[Bugfix] Fix MiniCPMV Image input inference failed (#22813)
jio-H Aug 13, 2025
c9232d4
[CI/Build] Update VLM common tests (#22841)
DarkLight1337 Aug 13, 2025
12817a8
[CI] Fix `tests/v1/e2e/test_kv_sharing_fast_prefill.py` import on tes…
NickLucche Aug 13, 2025
b4b78d6
[CI/Build] Fix param mismatch in `test_eagle_correctness` (#22847)
DarkLight1337 Aug 13, 2025
df0e0f0
[CI/Build] Skip gpt_big model test because of broken HF model (#22848)
Isotr0py Aug 13, 2025
c6cd5ca
[ROCm][Bugfix] Fix compilation error in topk softmax fused kernel (#2…
kliuae Aug 13, 2025
4e8614e
Move checklist in PR template (#22852)
ProExpertProg Aug 13, 2025
31a500c
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP…
Jialin Aug 13, 2025
0ca2393
[CI/Build] Increase pooling tolerance to pass CI (#22844)
DarkLight1337 Aug 13, 2025
b6af24f
[CI][Entrypoints]: add filter to generation to filter out invalid too…
wseaton Aug 14, 2025
1d20c34
[CI] Fix `tests/distributed/test_ca_buffer_sharing.py` (#22849)
ilmarkov Aug 14, 2025
a353bd0
[CI] remove flaky v0 test (#22864)
robertgshaw2-redhat Aug 14, 2025
00e3f9d
vLLM Benchmark suite improvement (#22119)
louie-tsai Aug 14, 2025
7c3a074
[Bugfix] Fix `PixtralHFImagePixelInputs` dynamic shape check (#22827)
Isotr0py Aug 14, 2025
eb08487
[BugFix] Threadsafe close async zmq sockets (#22877)
njhill Aug 14, 2025
f4efda8
Remove Phi 4 Flash configuration workaround (#22723)
hmellor Aug 14, 2025
7655dc3
[Bugfix] Add reset prefix cache for online serving (#22726)
iAmir97 Aug 14, 2025
0783f13
[Doc] fix dead link (#22898)
dtrifiro Aug 14, 2025
540d54c
[CI] Re-enable transcriptions `test_long_audio_request` (#22890)
NickLucche Aug 14, 2025
829b9a6
[Perf] Dont create unnecessary pooling params (#22876)
LucasWilkinson Aug 14, 2025
92ff41a
[Model] Modify the gate implementation of glm4_moe (#22832)
jeejeelee Aug 14, 2025
625ccd1
[Bugfix] Replace custom Encoding class with BatchEncoding in MistralT…
ZJY0516 Aug 14, 2025
dbe2980
[Bugfix] Fix parsing of `--disable-mm-preprocessor-cache` (#22909)
DarkLight1337 Aug 14, 2025
ab9f2cf
[CI] [Hybrid] Bump min transformers version for Bamba and Jamba (#22…
tdoublep Aug 14, 2025
33c63e9
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel …
jinzhen-lin Aug 14, 2025
637093a
docs: update fastsafetensors usage instructions (#22891)
NirLevy98 Aug 14, 2025
b8ff053
[CI] Temporarily disable flaky test (#22930)
LucasWilkinson Aug 14, 2025
279a5f3
[Kernel] Add nvfp4 gemm flashinfer backends (#22346)
nvjullin Aug 14, 2025
4121de5
[Quantization]: Support compressed-tensors mixed-precision model load…
dsikka Aug 14, 2025
ebcce2c
[Core] Return final response for aborted requests from `AsyncLLM.gene…
njhill Aug 14, 2025
919234f
[BugFix] Fix initial DP request load imbalance (#22910)
njhill Aug 14, 2025
39cd09d
[Bugfix] use flash attn on sm90 (#22933)
zyongye Aug 14, 2025
81f4b96
[Kernel] Add cuda kernel for gpt_oss activation (#22538)
jeejeelee Aug 15, 2025
f1f0d2f
Revert "[Kernel] Add cuda kernel for gpt_oss activation" (#22948)
simon-mo Aug 15, 2025
0933f9d
[BugFix][KVConn] Fix use of `get_required_kvcache_layout` (#22734)
njhill Aug 15, 2025
ae05a6d
[BugFix] Fix port lookup in internal DP LB tests (#22252)
njhill Aug 15, 2025
590bddb
[CI Perf] Prune tests in `tests/kernels/quantization/` (#22942)
mgoin Aug 15, 2025
d2b0e97
[CI Perf] Prune tests in `tests/kernels/moe/` (#22939)
mgoin Aug 15, 2025
0fe8508
[CI Perf] Prune tests in `tests/kernels/attention/` (#22936)
mgoin Aug 15, 2025
b4cef5e
refactor: Change scaling factors calculation for flashinfer FusedMoE …
amirkl94 Aug 15, 2025
5c3fbfe
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughp…
yewentao256 Aug 15, 2025
3d232db
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn (#22818)
Josephasafg Aug 15, 2025
b2f6c24
Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Modul…
tjtanaa Aug 15, 2025
b2c0650
[P/D]Provide bucket algorithm rate limiter for proxy_server (#22643)
frankie-ys Aug 15, 2025
5406ebf
[CI] Pooling models mteb test uses enforce_eager (#22878)
noooop Aug 15, 2025
fe91ce9
[V1] - Split Prefill and Decode for Mamba1 models (#22653)
amirai21 Aug 15, 2025
aa300c4
[Bugfix] Unquote file uri before reading image (#22912)
sayandipdutta Aug 15, 2025
3e6dd40
[Bugfix] fix cuda 12.6 and 11.8 build (#22952)
jinzhen-lin Aug 15, 2025
49252cf
[MM] Allow skipping memory profiling for multimodal models. (#22950)
Aug 15, 2025
22341b9
Improve multimodal hasher performance for re-used Image prompts (#22825)
p88h Aug 15, 2025
75531a6
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba…
tdoublep Aug 15, 2025
48f4636
[Misc] Ignore ep_kernels_workspace (#22807)
jeejeelee Aug 15, 2025
e8b40c7
[CI] Remove duplicated docs build from buildkite (#22924)
hmellor Aug 15, 2025
a0632a3
[Frontend] Expose do_log_stats interval to env (#22905)
Csrayz Aug 15, 2025
74f441f
[Core] Allow full cudagraph with separate attention routines and orth…
fhl2000 Aug 15, 2025
1c859a1
[V0 Deprecation] Remove advance_step (#22969)
WoosukKwon Aug 15, 2025
6b04039
[BugFix] Skip the Q component for QKVParallelLinear in the case of QK…
sstamenk Aug 15, 2025
b515079
initial commit for non-shifting prefill in eagle, prepare for kv sharing
morgendave Jun 10, 2025
5b4e0a3
add kv copy logic and offline tests
morgendave Jun 12, 2025
cb5d6e7
Add examples and algorithm for non-shifting, fixes some minor issues
morgendave Jun 30, 2025
6a9964b
rebase and adapt to new attn builder
morgendave Jul 18, 2025
a915428
Add kv copy kernel for between layers
morgendave Jul 18, 2025
41 changes: 16 additions & 25 deletions .buildkite/nightly-benchmarks/README.md
@@ -7,7 +7,7 @@ This directory contains two sets of benchmarks for vllm.
- Performance benchmark: benchmark vllm's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
- Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.

See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.

## Performance benchmark quick overview

@@ -104,7 +104,6 @@ We test the throughput by using `vllm bench serve` with request rate = inf to co
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
@@ -139,28 +138,20 @@ The raw benchmarking results (in the format of json files) are in the `Artifacts

The `compare-json-results.py` script helps compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratios, e.g., for Output Tput, Median TTFT, and Median TPOT.
If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
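
Here is an example using the script with a single results file; per the `split_json_by_tp_pp` helper added in this PR, the file is first split into per-(TP, PP) configurations under a `splits/` folder before comparison.
`python3 compare-json-results.py -f results_a/benchmark_results.json`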

Here is an example using the script to compare result_a and result_b without detailed test names.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`

| | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|----------------------------------------|----------------------------------------|----------|
| 0 | 142.633982 | 156.526018 | 1.097396 |
| 1 | 241.620334 | 294.018783 | 1.216863 |
| 2 | 218.298905 | 262.664916 | 1.203235 |
| 3 | 242.743860 | 299.816190 | 1.235113 |

Here is an example using the script to compare result_a and result_b with detailed test names.
Here is an example using the script to compare result_a and result_b with Model, Dataset Name, input/output length, max concurrency, and qps.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |

A comparison diagram will be generated below the table (pass `--no-plot` to skip it).
Here is an example comparing `96c/results_gnr_96c_091_tp2pp3` and `128c/results_gnr_128c_091_tp2pp3`:
<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />

## Nightly test details

@@ -169,9 +160,9 @@ See [nightly-descriptions.md](nightly-descriptions.md) for the detailed descript
### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
- The `scripts/run-nightly-benchmarks.sh` will parse the workload described in [nightly-tests.json](tests/nightly-tests.json) and launch the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
- Finally, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to Buildkite.

### Nightly tests

@@ -181,6 +172,6 @@ In [nightly-tests.json](tests/nightly-tests.json), we include the command line a

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.

WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
175 changes: 162 additions & 13 deletions .buildkite/nightly-benchmarks/scripts/compare-json-results.py
@@ -1,24 +1,38 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import json
import os

import pandas as pd


def compare_data_columns(
files, name_column, data_column, drop_column, ignore_test_name=False
files, name_column, data_column, info_cols, drop_column, debug=False
):
print("\ncompare_data_column: " + data_column)
frames = []
raw_data_cols = []
compare_frames = []
for file in files:
data_df = pd.read_json(file)
serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
if ignore_test_name is False:
# Show all info columns in the first couple columns
if not frames:
for col in info_cols:
if col not in serving_df.columns:
print(f"Skipping missing column: {col}")
continue
frames.append(serving_df[col])
# only show test name under debug mode
if debug is True:
serving_df = serving_df.rename(columns={name_column: file + "_name"})
frames.append(serving_df[file + "_name"])

file = "/".join(file.split("/")[:-1])
serving_df = serving_df.rename(columns={data_column: file})
frames.append(serving_df[file])
raw_data_cols.append(file)
compare_frames.append(serving_df[file])
if len(compare_frames) >= 2:
# Compare numbers among two files
@@ -27,7 +41,68 @@ def compare_data_columns(
compare_frames.pop(1)

concat_df = pd.concat(frames, axis=1)
return concat_df
print(raw_data_cols)
return concat_df, raw_data_cols


def split_json_by_tp_pp(
input_file: str = "benchmark_results.json", output_root: str = "."
) -> list[str]:
"""
Split a benchmark JSON into separate folders by (TP Size, PP Size).

Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
Returns: list of file paths written.
"""
# Load JSON data into DataFrame
with open(input_file, encoding="utf-8") as f:
data = json.load(f)

# If the JSON is a dict with a list under common keys, use that list
if isinstance(data, dict):
for key in ("results", "serving_results", "benchmarks", "data"):
if isinstance(data.get(key), list):
data = data[key]
break

df = pd.DataFrame(data)

# Handle alias column names
rename_map = {
"tp_size": "TP Size",
"tensor_parallel_size": "TP Size",
"pp_size": "PP Size",
"pipeline_parallel_size": "PP Size",
}
df.rename(
columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
)

# Ensure TP/PP columns exist (default to 1 if missing)
if "TP Size" not in df.columns:
df["TP Size"] = 1
if "PP Size" not in df.columns:
df["PP Size"] = 1

# make sure TP/PP are numeric ints with no NaN
df["TP Size"] = (
pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
)
df["PP Size"] = (
pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
)

# Split into separate folders
saved_paths: list[str] = []
for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
os.makedirs(folder_name, exist_ok=True)
filepath = os.path.join(folder_name, "benchmark_results.json")
group_df.to_json(filepath, orient="records", indent=2, force_ascii=False)
print(f"Saved: {filepath}")
saved_paths.append(filepath)

return saved_paths
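
# Hypothetical usage sketch (folder layout per the docstring above), splitting a
# combined results file into per-(TP, PP) folders before comparing them:
#   paths = split_json_by_tp_pp("benchmark_results.json", output_root="splits")
#   # -> e.g. ["splits/tp1_pp1/benchmark_results.json", "splits/tp2_pp1/benchmark_results.json"]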


if __name__ == "__main__":
@@ -36,31 +111,105 @@ def compare_data_columns(
"-f", "--file", action="append", type=str, help="input file name"
)
parser.add_argument(
"--ignore_test_name", action="store_true", help="ignore_test_name or not"
"--debug", action="store_true", help="show all information for debugging"
)
parser.add_argument(
"--plot",
action=argparse.BooleanOptionalAction,
default=True,
help="plot perf diagrams or not --no-plot --plot",
)
parser.add_argument(
"-x",
"--xaxis",
type=str,
default="# of max concurrency.",
help="column name to use as X Axis in comparision graph",
)
args = parser.parse_args()
files = args.file
print("comparing : " + ", ".join(files))

drop_column = "P99"
name_column = "Test name"
info_cols = [
"Model",
"Dataset Name",
"Input Len",
"Output Len",
"TP Size",
"PP Size",
"# of max concurrency.",
"qps",
]
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"Median TTFT /n",
"Median TPOT /n",
]
ignore_test_name = args.ignore_test_name

if len(args.file) == 1:
files = split_json_by_tp_pp(args.file[0], output_root="splits")
info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
else:
files = args.file
print("comparing : " + ", ".join(files))
debug = args.debug
plot = args.plot
# For Plot feature, assign y axis from one of info_cols
y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6
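# NB: despite its name, info_cols[y_axis_index] is used as the x-axis of the
# plots below (see px.line); index 6 falls back to "# of max concurrency."
# when --xaxis does not match any info_cols entry.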
with open("perf_comparison.html", "w") as text_file:
for i in range(len(data_cols_to_compare)):
output_df = compare_data_columns(
output_df, raw_data_cols = compare_data_columns(
files,
name_column,
data_cols_to_compare[i],
info_cols,
drop_column,
ignore_test_name=ignore_test_name,
debug=debug,
)
print(output_df)
html = output_df.to_html()
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)

# For Plot feature, insert y axis from one of info_cols
raw_data_cols.insert(0, info_cols[y_axis_index])

filtered_info_cols = info_cols[:-2]
existing_group_cols = [
c for c in filtered_info_cols if c in output_df.columns
]
if not existing_group_cols:
raise ValueError(
f"No valid group-by columns "
f"Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
)

output_df_sorted = output_df.sort_values(by=existing_group_cols)
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
for name, group in output_groups:
html = group.to_html()
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)

if plot is True:
import pandas as pd
import plotly.express as px

df = group[raw_data_cols]
df_sorted = df.sort_values(by=info_cols[y_axis_index])
# Melt DataFrame for plotting
df_melted = df_sorted.melt(
id_vars=info_cols[y_axis_index],
var_name="Configuration",
value_name=data_cols_to_compare[i],
)
title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
# Create Plotly line chart
fig = px.line(
df_melted,
x=info_cols[y_axis_index],
y=data_cols_to_compare[i],
color="Configuration",
title=title,
markers=True,
)
# Export to HTML
text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))