Merged
238 commits
ed57f77
[Bugfix ] fix bailing_moe_linear (#40859)
ghphotoframe Apr 28, 2026
76c9ccc
[Core] Fix redundant None append in StepPool.forward for chunked pref…
anthonsu Apr 28, 2026
a8208e6
[Examples] Resettle features examples. (#40995)
noooop Apr 28, 2026
ea74f70
Bugfix: fix SpecBench sample argument error (#40927)
izhuhaoran Apr 28, 2026
bde0efd
[Bugfix][Granite4Vision] Fix deepstack buffer causing decode slowdown…
artem-spector Apr 28, 2026
9e92de5
[Bugfix] Exclude numa_bind fields from ParallelConfig DP hash (#41098)
esmeetu Apr 28, 2026
de3da0b
Add tuned triton fused_moe configs on H100 for gpt-oss (#39904)
zhangxin81 Apr 28, 2026
5aa371d
[DSV4] Enable Multi-stream for Pre-Attn GEMM (#41061)
zyongye Apr 28, 2026
a608836
[Build] Defer flashinfer cubin download to avoid ~2.5 GB (decompresse…
benoittgt Apr 28, 2026
358a755
[CI][AMD][BugFix] Update request URL in test_moriio_connector to matc…
rasmith Apr 28, 2026
0899f43
[New Model] Laguna XS.2 implementation (#41129)
joerowell Apr 28, 2026
de3fe8d
[Bugfix] release KV blocks for skipped P-ranks to prevent invalid KV …
yangrz7 Apr 28, 2026
e9f8f31
[FEATURE] Add EagleMistralForCausalLM (#41024)
juliendenize Apr 28, 2026
f05f366
[Doc] Add missing API endpoints to security documentation (#40532)
russellb Apr 28, 2026
e68fa1b
[Core] Account for `num_gpu_blocks_override` in `max_model_len` check…
njhill Apr 28, 2026
d109eac
[Bugfix][ROCm] Fix gemm_a4w4 call to use updated AITER API signature …
chelnnexy Apr 29, 2026
6fb3f7b
[DSV4] Align aux stream API with DeepseekV4DecoderLayer (#41171)
zixi-qi Apr 29, 2026
856b15c
[CI][AMD][BugFix] Patch has_flashinfer decorator for test_select_rocm…
rasmith Apr 29, 2026
4b95e9c
[CI] Return HTTP 400 for unsupported chat content part type (#41121)
haosdent Apr 29, 2026
75a7cf2
[CI] De-flake test_chat_completion_n_parameter_non_streaming (#41147)
haosdent Apr 29, 2026
99255f3
[UX] Allow enable/disable model weights loading tracking by config (#…
Isotr0py Apr 29, 2026
7fd05e0
uncomment flex backend for batch invariant mode (#40842)
liangel-02 Apr 29, 2026
a085b52
[Docs] [QeRL] Layerwise Reloading Documentation (#40317)
kylesayrs Apr 29, 2026
916e56c
[QeRL] Add warnings for extra memory buffering (#40309)
kylesayrs Apr 29, 2026
fa1b984
[BE][Torch 2.12] Remove workaround code for fixed cublas issue (#40845)
Lucaskabela Apr 29, 2026
1312f07
[Feature] add cohere reasoning and tool parsers (#40422)
walterbm Apr 29, 2026
803b9d7
[Bugfix] Fix Deepseek V4 import error due to AOT compile cache loadin…
wzhao18 Apr 29, 2026
d95d03c
[BugFix][CPU] fix error on CPU runner shutdown (#41034)
fadara01 Apr 29, 2026
2ae73c7
[Bugfix] fix inductor error for dpsk v4 (#41135)
ZJY0516 Apr 29, 2026
8b49cf3
[Bugfix] Fix max_num_batched_token not captured in cuda graph (#40734)
wzhao18 Apr 29, 2026
a269744
[Bugfix] Fix rope (#41113)
jeejeelee Apr 29, 2026
8a8c9b5
[KV Offload] Per-job store completion for CPU offloading connector (#…
Etelis Apr 29, 2026
68dd7db
[Reasoning] Support for speculative decoding with thinking budget (#3…
rishitdholakia13 Apr 29, 2026
92879e1
[CI] fix test_rotary_embedding_opcheck format error (#41202)
chaunceyjiang Apr 29, 2026
e48cb85
[CI/Build] Auto-detect manylinux ABI tag for nightly wheels (#41149)
Harry-Chen Apr 29, 2026
ef70057
[CI][CPU] Split CPU-Distributed Tests into per-scenario labels (#41203)
haosdent Apr 29, 2026
3885d34
[Frontend]Responses API supports Tool/Function calling with streaming…
chaunceyjiang Apr 29, 2026
762022c
[Bugfix] DSV32/V4 add missing type conversion for non-streaming tool …
chaunceyjiang Apr 29, 2026
3f1a4bb
build: embed image provenance metadata in vLLM containers (#40653)
alec-flowers Apr 29, 2026
6d7d4da
[Bugfix] BailingMoeV2.5: rotate full qk_rope_head_dim in MLA RoPE (#4…
ZJY0516 Apr 29, 2026
5371d6f
Fix PP in Gemma4 (#40786)
SKRohit Apr 29, 2026
37e2882
[KV Offload] Tighten `keys` type from `Iterable` to `Sequence` in `Of…
ronensc Apr 29, 2026
33f36d4
[DSV4] Support `max` reasoning effort (#40982)
BugenZhao Apr 29, 2026
11b6912
[Frontend] Add `defer_loading` and `tool_reference` support for Anthr…
JaredforReal Apr 29, 2026
9d8ad5b
[Bugfix] Fix repeated DSv4 RoPE cache initialization (#41148)
jeejeelee Apr 29, 2026
22524f7
[Feat] CPU fp8 attn for AMX/AVX-512 (#39445)
tianmu-li Apr 29, 2026
5b39b26
hf_name argument for vllm bench throughput CLI (#41012)
pmaybank Apr 29, 2026
5560cac
[Bugfix][CPU] Backport PT cpp codegen indirect_assert scalar-mask fix…
amd-lalithnc Apr 29, 2026
b92ef9e
[Perf] Enable FlashInfer top-k/top-p sampler by default (#40376)
arpera Apr 29, 2026
39a7f4f
[Perf] Optimize `AllPool.forward` by slicing first, 51% faster in the…
yewentao256 Apr 29, 2026
51fda1b
[Model Runner v2] Fix block table IMA issue (#40648)
yewentao256 Apr 29, 2026
a05848e
[Bugfix] Report compile time for in-memory cache hit path (#41023)
frgossen Apr 29, 2026
91a2d39
[Models] Cohere MoE (#40817)
Terrencezzj Apr 29, 2026
a80d6f1
better logging for large uncachable items (#41145)
h-avsha Apr 29, 2026
4a42aba
[CI/Build] Enable FP8 on NVIDIA Thor (#39712)
DarkLight1337 Apr 29, 2026
d1a75e3
Fix timeout when using LoRA adapters with Nemotron Super (#40916)
danisereb Apr 29, 2026
6f20f81
Replace shape_invariants with simpler apprach in dynamic_arg_dims uti…
laithsakka Apr 29, 2026
faab189
[Feature]: IndexCache support for DSA models (#37735)
chaunceyjiang Apr 29, 2026
169988a
[ROCm] Use quant_dtype in per_token_quant instead of hardcoded FP8 (#…
Bortlesboat Apr 29, 2026
93da1fe
[CI] Add temperature to bfcl eval, default greedy (#41059)
yzong-rh Apr 29, 2026
1628239
[Multimodal][Render] Skip mm processor initialization and warmup for …
Isotr0py Apr 29, 2026
b58669c
[Perf][Spec Decode] Avoid per-step numpy allocation in prepare_next_t…
wangluochao902 Apr 29, 2026
944e138
[ROCm][Bugfix]: W4A4 MOE using emulation instead of AITER on MXFP4-su…
Rohan138 Apr 29, 2026
0335316
[BUG] Two phase pause to prevent deadlock (#39366)
hao-aaron Apr 29, 2026
ccfb620
Create tests/distributed/test_mnnvl_alltoall.py (#35241)
puririshi98 Apr 29, 2026
c2fb013
[Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraph…
roikoren755 Apr 29, 2026
6841f5d
[ROCm] Add env flags to disable dynamic MXFP4 quant and enable AITER …
heachary Apr 29, 2026
a966aae
[Bugfix][MLA] Size arange_buffer to max_num_batched_tokens to prevent…
UranusSeven Apr 29, 2026
296741d
[DSv4] Use `cvt` PTX for FP32->FP4 conversion (#41015)
gau-nernst Apr 29, 2026
18599bf
[Ci][BugFix] Fix slow DP tests due to bad teardown logic (#41166)
njhill Apr 29, 2026
3795d7a
[ROCm][Bugfix][GPTOSS]: fix input_ids and expert_map args for quark w…
Rohan138 Apr 29, 2026
0ab67c0
[CI] Add key field to all test_areas pipeline steps (#41201)
khluu Apr 29, 2026
0ff1bf9
[Bugfix] Fix failure to allocate KV blocks error (#41282)
wzhao18 Apr 30, 2026
c42981d
[Refactor][kv_offload] KV Offloading maintainability improvements (#4…
hickeyma Apr 30, 2026
a749a33
[Bugfix] Fix persistent_topk cooperative deadlock at TopK=1024 (#41189)
zyongye Apr 30, 2026
cb1b02d
[Frontend] Add VLLM_SKIP_MODEL_NAME_VALIDATION environment variable (…
dsingal0 Apr 30, 2026
a04e0cf
Fix Cohere ASR after HF upgrade (#40582)
ekagra-ranjan Apr 30, 2026
ca97f7b
Fix Gemma4 MoE expert weight remapping (#41206)
Baekpica Apr 30, 2026
54146a9
[Bugfix] correct h matrix layout in chunk_kda output kernel (#40956)
ChenxiQ Apr 30, 2026
efdc956
[KVConnector] MultiConnector SupportsHMA (#39571)
NickLucche Apr 30, 2026
3179e53
[P/D] Prefill compute optimizations with bi-directional KV cache tran…
snadampal Apr 30, 2026
b55b265
[MoE] Make MoERunnerInterface a PluggableLayer for OOT support (#35178)
wxsIcey Apr 30, 2026
3527229
[Doc] Fix RTD build: pytorch.org/docs/stable/objects.inv returns 404 …
stecasta Apr 30, 2026
ff449b6
Stop mergify labelling from skipping pre-commit (#41362)
hmellor Apr 30, 2026
a7fb008
[EPLB] Optimize memory overhead in Nixl communicator (#40013)
ilmarkov Apr 30, 2026
f03d82e
[UX][Bugfix] Fix OOM by setting PyTorch `max_split_size_mb` during mo…
MatthewBonanni Apr 30, 2026
121dbe7
[ROCm] ROCm DeepEP API updated to latest (#39721)
itej89 Apr 30, 2026
10558f5
[CI/Build] Skip terratorch + torchgeo while PyPI has lightning quaran…
stecasta Apr 30, 2026
3ca6ca2
xpu docker: pin oneAPI to 2025.3 and avoid unintended 2026 upgrade (#…
wendyliu235 Apr 30, 2026
307b17c
[DSV4] Avoid redundant dtype conversion. (#41374)
jeejeelee Apr 30, 2026
92a7c12
[CI] Add MTP coverage: Qwen3.5 correctness + no-sync spec decode (#40…
stecasta Apr 30, 2026
efb4cdf
[CI/Build] Skip Prithvi/Terratorch model-registry tests when terrator…
stecasta Apr 30, 2026
2917d63
[NVFP4][Hopper/AMD Instinct] Add Triton kernels for NVFP4 dequantizat…
fxmarty-amd Apr 30, 2026
75a4c16
Fix typo in log message for indexer cache (#41419)
mgoin Apr 30, 2026
526927b
[Model Runner v2] Fix v2 compile counter `num_gpu_runner_capture_trig…
yewentao256 Apr 30, 2026
b4806c8
[DSV4] Add BF16 and MXFP8 A2A support for flashinfer a2a one sided (#…
zyongye Apr 30, 2026
71725f6
[Bugfix] Fix RoutedExpertsCapturer for Gemma 4 MoE (top_k_experts) (#…
lequytra Apr 30, 2026
9c61864
[DeepSeek] Use torch.mm for bf16xbf16->fp32 gemm (#41300)
WoosukKwon Apr 30, 2026
a3c83ff
Faster per-token fp8 group quant packed kernel for blackwell (#41326)
zyongye May 1, 2026
dd5506a
[Core] Simplify handling of `scheduler_reserve_full_isl` option (#41064)
njhill May 1, 2026
4d5c892
(bugfix): block_size check for flex attn (#41363)
JisoLya May 1, 2026
1adaa50
[ROCm][CI] Add ROCm score absolute tolerance floor (#41341)
AndreasKaratzas May 1, 2026
14043df
feat: Enable `prompt_embeds` Content Part Support in vLLM Chat Comple…
LuisRobaina May 1, 2026
7198940
[Model] Add Moondream3 model support(only query and caption skills) (…
sniper35 May 1, 2026
415a879
[KV Offload] Use `Collection` instead of `Sequence/Iterable` for Offl…
ronensc May 1, 2026
b542bdf
[Bugfix] Disable FlashInfer CUTLASS MoE on SM110 (Jetson Thor AGX) (#…
stecasta May 1, 2026
6b6ac6c
[Kernel][MoE] Support GELU on TRT-LLM NvFP4 fused MoE for Gemma4 (#41…
juhi10071998 May 1, 2026
941fb50
[kv_offload+HMA][12/N]: Scheduler-side support for sliding window gro…
orozery May 1, 2026
947138b
Add nvfp4 kv cache support (#40177)
sychen52 May 1, 2026
c3868bb
[compile] Add FlashInfer FP8 async TP fusion and preserve allreduce …
baonudesifeizhai May 1, 2026
a076426
[Bugfix] Pass reasoning parser kwargs to structured output (#41199)
BugenZhao May 1, 2026
32964e7
[ROCm][CI] Upgraded UCX and RIXL (#41210)
AndreasKaratzas May 1, 2026
a3ec4a3
[Bugfix][Metrics] Fix RayPrometheusMetric.labels() returning shared l…
eicherseiji May 1, 2026
0dbaf9d
Refractor longcat loading to use AutoWeightsLoader (#41448)
Yuyi-Ao May 1, 2026
7075df7
[ROCm] Enable DBO (Dynamic Batch Optimization) on ROCm (#34726)
raviguptaamd May 1, 2026
2fa1f8e
[kv_offload+HMA][13/N]: Enable HMA support (#41445)
orozery May 1, 2026
4f7bde5
[Kernel] Pack output and LSE in DCP A2A (#41160)
sungsooha May 1, 2026
c3e6469
[Perf] Warmup forward_native sampler kernel (#41375)
arpera May 1, 2026
bc635fa
[ROCm][Deepseek] dsv3.2 further optimization (#41217)
ganyi1996ppo May 1, 2026
529c671
[ROCm][FEAT] AITER Fused Allreduce + RMSNorm (#37646)
vllmellm May 1, 2026
3ccc1ff
[Eval][CI] Add basic mrcr eval to tests/evals/ (#40164)
mgoin May 1, 2026
5129579
[Model Runner V2] Add `logprob_token_ids` support (#40559)
yewentao256 May 1, 2026
f3fef12
[Attention] Abstract the MLA prefill backends and eliminate cuDNN (#3…
MatthewBonanni May 1, 2026
a9484da
[Perf] Intergrate Tile Kernels `head_compute_mix_kernel` for Deepseek…
Isotr0py May 1, 2026
bcf5cac
[DSV4] Add knob to enable pre-attn gemm (#41443)
zyongye May 1, 2026
edd60ac
[Bugfix] Fix persistent_topk inter-CTA init race on RadixRowState (#4…
zyongye May 1, 2026
0c99629
[Build] Make bundled DeepGEMM wheel portable across Python versions (…
mgoin May 1, 2026
5737770
Re-enable allreduce rms fusion for DP / PP (#41458)
andylolu2 May 1, 2026
c408fdd
[Fix] Sync gemma4 chat template from hf (#39570)
FredericOdermatt May 2, 2026
964a4bc
[MM][CG] Support ViT CG for Qwen2.5-VL (#40830)
johncalesp May 2, 2026
3e49479
Limit concurrency on `test_transcription_api_correctness.py` (#41478)
ekagra-ranjan May 2, 2026
d58c42e
[vLLM IR] 2/N fused_add_rms_norm and maybe_inplace overload (#36823)
ProExpertProg May 2, 2026
c293ccc
[ROCm][Bugfix] Fix init-time bias dtype cast when gate.out_dtype is N…
rbrugaro-amd May 2, 2026
ae3b4de
[Doc] Add Codex usage example (#41358)
chaunceyjiang May 2, 2026
8586369
Refactor Step3Text loading to use AutoWeightsLoader (#41492)
mcsantiago May 2, 2026
c3ad791
[Bugfix][Gemma 4] Clamp soft-token estimate to max_soft_tokens (#40796)
hnt2601 May 2, 2026
cfd2573
[Build] Switch CUDA 13.0 wheel builds to PyTorch manylinux_2_28 base …
mgoin May 2, 2026
0a9362d
Revert "[Build] Make bundled DeepGEMM wheel portable across Python ve…
mgoin May 2, 2026
4f7309f
[CI] Add ci-fetch-log.sh helper for Buildkite job logs (#41517)
mgoin May 2, 2026
1c607d7
[DSV4] Guard megamoe flag with Pure TP (#41522)
zyongye May 2, 2026
856ec48
[DSv4] Tune default value of `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD`…
ywang96 May 3, 2026
08834cc
[Quantization] add humming mxfp4 moe backend (#41083)
jinzhen-lin May 3, 2026
e6ff3e9
[MRV2] Add shutdown() method (#41297)
WoosukKwon May 3, 2026
54dc64d
[Doc] Add Qwen3-30B-A3B-Thinking-2507-FP8 to batch invariance verifie…
taneem-ibrahim May 3, 2026
c51df43
Disable flashinfer autotune temporarily due to correctness issues (#4…
wzhao18 May 3, 2026
cb03fee
[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (#40…
tomeras91 May 3, 2026
db9a84e
[Bugfix] Fix FP8 Bias Loading (#41424)
alex-jw-brooks May 3, 2026
66dfee7
[Bugfix] Fix degenerate KV cache stride causing TMA cudaErrorIllegalI…
the-david-oy May 3, 2026
894a025
[Bench] Forward --seed to CustomDataset and CustomMMDataset shuffle (…
Aktsvigun May 4, 2026
67058ca
[CI] Clean up remote servers on pytest parent exit (#41570)
AndreasKaratzas May 4, 2026
c103c02
[Transformers v5] Vendor HCXVisionConfig for compatibility (#38447)
HanFa May 4, 2026
01d4d1a
[ROCm][CI] Align spec decode logprob test prefill settings (#41335)
AndreasKaratzas May 4, 2026
6ec9bbe
[CI] Stabilize cpu offload compressed tensors test (#41102)
AndreasKaratzas May 4, 2026
6f53753
[Bugfix] Apply ruff-format to hyperclovax.py (#41620)
stecasta May 4, 2026
62ba751
Revert "[Doc] Fix RTD build: pytorch.org/docs/stable/objects.inv retu…
stecasta May 4, 2026
8decbfa
Test nemotron nano-v2 and nemotron nano-v3 separately, disable super-…
netanel-haber May 4, 2026
3e1ad44
[Bug] Fix `tests/compile/test_config.py` AttributeError: 'NoneType' o…
yewentao256 May 4, 2026
321fa2d
Limit gpu utils and lower max BS on test_transcription_api_correctnes…
ekagra-ranjan May 4, 2026
712ad02
[Bugfix] KimiK2ReasoningParser: guard against buffered end-token in s…
JasonKeyiL May 4, 2026
e724b0e
[ROCm] ROCm7.2.2 + profiler fix + AITER 0.1.12.post2 (#41386)
gshtras May 4, 2026
8c78094
Fix Nano Nemotron text-only weight loading (#41205)
Baekpica May 4, 2026
422dd02
[bugfix] Fix prompt logprobs on request eviction during chunked pref…
joa-stdn May 4, 2026
844df54
feat: update xgrammar==0.2.0 to use structural tags for strict tool c…
Seven-Streams May 4, 2026
9c07342
[NVFP4][fix] Fix `layer.weight` -> `w13` typo in NVFP4 MOE emulation …
fxmarty-amd May 4, 2026
be5983b
[Docs] Add non-causal support to attention backend docs (#41643)
MatthewBonanni May 4, 2026
1cb0838
[ROCm][CI] Fix MLA prefill scale for DeepSeek GSM8K (#41569)
AndreasKaratzas May 4, 2026
577b962
[Bug] Fix status update address for non-MOE model within external dp …
yewentao256 May 4, 2026
4f2af1a
[Feature] TurboQuant: support hybrid models and uniform quantization …
JartX May 5, 2026
e1e4646
[Model Runner V2] Rebuild attn metadata between draft decode steps (#…
TheEpicDolphin May 5, 2026
685bf81
[XPU] enable is_act_and_mul for xpu (#37481)
xuechendi May 5, 2026
416f9cd
[Perf][2/n] Eliminate GPU<->CPU syncs in pooling code (#41433)
njhill May 5, 2026
1e95004
[ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle (#39136)
BowenBao May 5, 2026
420b0a5
[Hardware][Power]Add Power VSX Attention Backend and fix l2 Cache Cra…
Akashcodes732 May 5, 2026
f04fd16
[Ray] Enable RayExecutorV2 by default (#41421)
jeffreywang-anyscale May 5, 2026
eaec7be
[BugFix] Preserve max_seq_len in ubatch metadata during CUDA graph ca…
czhu-cohere May 5, 2026
6bb924b
[Model] Fix Gemma4 MoE activation mismatch (#41574)
lucianommartins May 5, 2026
0c620d2
[Model] Use AutoWeightsLoader for CohereMoe (#41690)
bittoby May 5, 2026
4845aee
[Benchmark] Add --trust-remote-code flag to multi-turn benchmark (#41…
Dao007forever May 5, 2026
27cc676
[Model] Use AutoWeightsLoader for Plamo2 (#41699)
bittoby May 5, 2026
bee1261
[P/D][Mooncake] Add KVConnectorStats for transfer observability (#40414)
zhewenl May 5, 2026
2ceea42
[XPU] use xpu topk topp sample kernel (#39285)
jikunshang May 5, 2026
8b9ea2f
[Feature] Add Triton kernel JIT compilation monitor for inference (#4…
arpera May 5, 2026
0a201b6
[Model] support Qianfan-OCR model (#40136)
marvinzh May 5, 2026
b0765be
Fix DeepSeek-OCR for Transformers v4 (#41460)
hmellor May 5, 2026
98661fe
[Bugfix][KVConnector] Support DCP/PCP in OffloadingConnector (#41549)
Etelis May 5, 2026
6fca518
[BugFix][MyPy]: Module has no attribute "sched_getaffinity" [attr-de…
hickeyma May 5, 2026
20dcd98
[Bugfix] Fix `RuntimeError: Already borrowed` by adding thread-safe H…
yzong-rh May 5, 2026
b786ec8
[Bugfix] Suggest upgrading Transformers for tokenizer class errors (#…
Lidang-Jiang May 5, 2026
84bd8a3
Remove unnecessary runtime asserts from linear layers (#41729)
hmellor May 5, 2026
2228fe6
[Attention] Move FA3→FA4 upgrade into get_flash_attn_version() (#40815)
gcanlin May 5, 2026
628c436
[New Model][ROCm] Add AMD support for DeepSeek V4 (#40871)
whx-sjtu May 5, 2026
c6235ed
[BUGFIX] Support streamed_args_for_tool in MistralToolParser (#41730)
juliendenize May 5, 2026
48954de
Fix DeepGEMM ep_scatter output address overflow (#39213)
S1ro1 May 5, 2026
79246b5
[Spec Decode] Fix max_model_len logging in speculative config for dra…
liulanze May 5, 2026
8c57b6e
Bump model-hosting-container-standards to >= 0.1.14 (#39755)
Dhruvilbhatt May 5, 2026
01b9b5a
[Attention] Minor refactor: layer takes ownership of the MLA prefill …
MatthewBonanni May 5, 2026
1333864
[CI] Automate Docker Hub release image publishing (#40415)
khluu May 6, 2026
4a8ae26
[ROCm][CI] Use vLLM generation defaults for DeepSeek prefetch-offload…
AndreasKaratzas May 6, 2026
f653761
[CI] Route part of B200 jobs to b200-k8s (#41453)
khluu May 6, 2026
c7aa186
[Frontend] Supports resubmitting output items with missing fields in …
chaunceyjiang May 6, 2026
16e3364
[Mistral Tokenizer] allow more leniency in apply_chat_template (#41658)
juliendenize May 6, 2026
aee190a
[Build] Fall back to system libgomp when torch has no vendored copy (…
lyd1992 May 6, 2026
e47c98e
[Fix] Add missing stubs from cpu fp8 attention changes (#41387)
tianmu-li May 6, 2026
91740ca
[ROCm][CI] Refine gating tests (#37243)
AndreasKaratzas May 6, 2026
2d7d6cf
[Spec Decode] Allow multimodal models with a warning (#41752)
laviier May 6, 2026
b53c507
[Bugfix] Skip PP sampled-token receive on last rank during async sche…
wi-adam May 6, 2026
809b98e
[CPU] Add FP8 W8A16 linear support (#41186)
yuwenzho May 6, 2026
e87e09a
[Feat] dnnl build for AVX2 W8A8 Int8 (#41318)
tianmu-li May 6, 2026
213f10b
[Bugfix] Fix codegen for unqualified names (#40726)
Lucaskabela May 6, 2026
51c1ee9
[Examples] Resettle Disaggregated examples. (#40759)
noooop May 6, 2026
1c58876
[XPU] Disable CUDA graph memory estimate on XPU platform (#41344)
chaojun-zhang May 6, 2026
66d1cc0
fix(rocm): remove workaround causing invalid argument on Qwen3.5 with…
aaab8b May 6, 2026
e43a791
[Bugfix][CI] Fix Disaggregated test area path (#41794)
NickLucche May 6, 2026
2e777d2
[Bugfix][Rocm]Aiter MoE re-uses existing tensor addresses after weigh…
yuankaichen-amd May 6, 2026
d8deb5b
Fix some legacy checkpoints with deprecated `rope_type` values (#41734)
hmellor May 6, 2026
5d0fd87
[CPU][RISC-V] Auto-bind OMP threads and harden nobind path (#40569)
lyd1992 May 6, 2026
242afc6
[MM][Gemma4] Respect max_soft_tokens in encoder budget (#41799)
lesj0610 May 6, 2026
df8e63f
nixl refactor: new transfer design (#40731)
ZhanqiuHu May 6, 2026
6467213
fix(openai): tolerate empty content in forced tool choice (#40148)
QwertyJack May 6, 2026
f39bcf1
[KV Offload] Return None from lookup() for in-flight blocks (#41795)
ronensc May 6, 2026
27e0057
[Spec Decode] Add Gemma4 MTP speculative decoding support (#41745)
lucianommartins May 6, 2026
ee38750
[Bugfix] Fix spawn_new_process_for_each_test silently swallowing test…
dzhengAP May 6, 2026
d5b31c9
[Bugfix] Account for truncate_prompt_tokens when computing max_tokens…
viktorpusTT May 6, 2026
22a3cbe
[ROCm] aiter_unified_attn fp8 q scale refactor (#38296)
divakar-amd May 6, 2026
27702f6
[Bugfix] Fix token loss in PP mode which causes degraded accuracy (#4…
starkwj May 6, 2026
38e1667
[Bugfix] Align block table for TRTLLM MLA edge-case (#39324)
benchislett May 6, 2026
ca3e62d
Upgrade tpu-inference to v0.19.0 (#41844)
jcyang43 May 6, 2026
f3f8efa
[CI] Enable gemma4 parser test on CI (#41857)
sfeng33 May 6, 2026
deb737e
[Doc] Add ModernBertForSequenceClassification to scoring.md cross-en……
JLiu4Coding May 6, 2026
50acdc5
Fix Qwen3 streaming content routing (#40820)
xy3xy3 May 6, 2026
9558286
[Bugfix] DeepSeekV32/v4: respect string='true|false' attribute andunw…
chaunceyjiang May 6, 2026
80d5e7d
[Bugfix] Fix condition to clear persistent topk so that it can be cap…
zyongye May 6, 2026
7a576e2
[ROCm][CI] Remove `TORCH_NCCL_BLOCKING_WAIT=1` After Bugfix In ROCm 7…
micah-wil May 6, 2026
5a0a8fc
[Docs] add cache directory security guidance (#38920)
russellb May 6, 2026
20cac26
[ROCm] Enable SimpleCPUOffloadConnector on ROCm (#40549)
hongxiayang May 7, 2026
51f22dc
[Feat][CPU] Enable Gated DeltaNet Attention (Qwen 3.5 / 3.6) (#41025)
fadara01 May 7, 2026
0c697c9
Merge branch 'main' of github.com:vllm-project/vllm into dsv4-upstrea…
lcskrishna May 7, 2026
1eb6385
fix accuracy issues with fused_deepseek_v4_qnorm_rope_kv_insert_kernel
lcskrishna May 7, 2026
23 changes: 18 additions & 5 deletions .buildkite/hardware_tests/cpu.yaml
@@ -14,13 +14,15 @@ steps:
   - tests/kernels/moe/test_cpu_fused_moe.py
   - tests/kernels/test_onednn.py
   - tests/kernels/test_awq_int4_to_int8.py
+  - tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
   commands:
   - |
-    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
       pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
       pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
       pytest -x -v -s tests/kernels/test_onednn.py
-      pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py"
+      pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
+      pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

 - label: CPU-Compatibility Tests
   depends_on: []
@@ -69,11 +71,11 @@ steps:
       pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
       pytest -x -v -s tests/quantization/test_cpu_wna16.py"

-- label: CPU-Distributed Tests
+- label: CPU-Distributed Tests (PP+TP)
   depends_on: []
   device: intel_cpu
   no_plugin: true
-  source_file_dependencies:
+  source_file_dependencies: &cpu_distributed_deps
   - csrc/cpu/shm.cpp
   - vllm/v1/worker/cpu_worker.py
   - vllm/v1/worker/gpu_worker.py
@@ -82,10 +84,21 @@ steps:
   - vllm/platforms/cpu.py
   - vllm/distributed/parallel_state.py
   - vllm/distributed/device_communicators/cpu_communicator.py
+  - .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
   commands:
   - |
     bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
+      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh tp_pp"
+
+- label: CPU-Distributed Tests (DP+TP)
+  depends_on: []
+  device: intel_cpu
+  no_plugin: true
+  source_file_dependencies: *cpu_distributed_deps
+  commands:
+  - |
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
-      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"
+      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp"

 - label: CPU-Multi-Modal Model Tests %N
   depends_on: []
1 change: 1 addition & 0 deletions .buildkite/image_build/image_build.sh
@@ -192,6 +192,7 @@ export BUILDKITE_COMMIT
 export PARENT_COMMIT
 export IMAGE_TAG
 export IMAGE_TAG_LATEST
+export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
 export CACHE_FROM
 export CACHE_FROM_BASE_BRANCH
 export CACHE_FROM_MAIN
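The `${COMMIT:-${BUILDKITE_COMMIT}}` expansion added in this hunk keeps an existing `COMMIT` value and falls back to `BUILDKITE_COMMIT` when `COMMIT` is unset or empty. The sketch below restates that rule in Python for readers less familiar with shell parameter expansion. The helper name `resolve_commit` and the sample values are hypothetical and are not part of the PR.

```python
def resolve_commit(env):
    """Mimic `${COMMIT:-${BUILDKITE_COMMIT}}` for a dict of environment variables.

    Hypothetical helper for illustration only; the PR itself does this in bash.
    """
    # `:-` falls back when COMMIT is unset *or* empty, so `or` is used here
    # rather than a plain dict default (which would keep an empty string).
    return env.get("COMMIT") or env.get("BUILDKITE_COMMIT", "")


if __name__ == "__main__":
    # Made-up values, only to show the three cases.
    print(resolve_commit({"BUILDKITE_COMMIT": "abc123"}))
    print(resolve_commit({"COMMIT": "", "BUILDKITE_COMMIT": "abc123"}))
    print(resolve_commit({"COMMIT": "def456", "BUILDKITE_COMMIT": "abc123"}))
```

An explicitly exported `COMMIT` therefore wins, which is presumably what lets callers outside Buildkite override the value.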
173 changes: 155 additions & 18 deletions .buildkite/release-pipeline.yaml

Large diffs are not rendered by default.

94 changes: 1 addition & 93 deletions .buildkite/scripts/annotate-release.sh
@@ -8,8 +8,6 @@ if [ -z "${RELEASE_VERSION}" ]; then
   RELEASE_VERSION="1.0.0.dev"
 fi

-ROCM_BASE_CACHE_KEY=$(.buildkite/scripts/cache-rocm-base-wheels.sh key)
-
 buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
 To download the wheel (by commit):
 \`\`\`
@@ -25,95 +23,5 @@ aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-
 aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl .
 \`\`\`

-
-To download and upload the image:
-
-\`\`\`
-# Download images:
-
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu129
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu129
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${ROCM_BASE_CACHE_KEY}-rocm-base
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
-docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION}
-docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION}
-
-# Tag and push images:
-
-## CUDA
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
-docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
-docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
-docker push vllm/vllm-openai:latest-x86_64
-docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu129 vllm/vllm-openai:x86_64-cu129
-docker tag vllm/vllm-openai:x86_64-cu129 vllm/vllm-openai:latest-x86_64-cu129
-docker tag vllm/vllm-openai:x86_64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129
-docker push vllm/vllm-openai:latest-x86_64-cu129
-docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
-docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
-docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
-docker push vllm/vllm-openai:latest-aarch64
-docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu129 vllm/vllm-openai:aarch64-cu129
-docker tag vllm/vllm-openai:aarch64-cu129 vllm/vllm-openai:latest-aarch64-cu129
-docker tag vllm/vllm-openai:aarch64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129
-docker push vllm/vllm-openai:latest-aarch64-cu129
-docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129
-
-## ROCm
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
-docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
-docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
-docker push vllm/vllm-openai-rocm:latest
-docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${ROCM_BASE_CACHE_KEY}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
-docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
-docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
-docker push vllm/vllm-openai-rocm:latest-base
-docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
-
-## CPU
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:x86_64
-docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:latest-x86_64
-docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
-docker push vllm/vllm-openai-cpu:latest-x86_64
-docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
-
-docker tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:arm64
-docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:latest-arm64
-docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
-docker push vllm/vllm-openai-cpu:latest-arm64
-docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
-
-# Create multi-arch manifest:
-
-docker manifest rm vllm/vllm-openai:latest
-docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
-docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
-docker manifest push vllm/vllm-openai:latest
-docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
-
-docker manifest rm vllm/vllm-openai:latest-cu129
-docker manifest create vllm/vllm-openai:latest-cu129 vllm/vllm-openai:latest-x86_64-cu129 vllm/vllm-openai:latest-aarch64-cu129
-docker manifest create vllm/vllm-openai:v${RELEASE_VERSION}-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129
-docker manifest push vllm/vllm-openai:latest-cu129
-docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}-cu129
-
-docker manifest rm vllm/vllm-openai-cpu:latest || true
-docker manifest create vllm/vllm-openai-cpu:latest vllm/vllm-openai-cpu:latest-x86_64 vllm/vllm-openai-cpu:latest-arm64
-docker manifest create vllm/vllm-openai-cpu:v${RELEASE_VERSION} vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
-docker manifest push vllm/vllm-openai-cpu:latest
-docker manifest push vllm/vllm-openai-cpu:v${RELEASE_VERSION}
-\`\`\`
+Docker images are published automatically by the "Publish release images to DockerHub" pipeline step.
 EOF
55 changes: 55 additions & 0 deletions .buildkite/scripts/ci-fetch-log.sh
@@ -0,0 +1,55 @@
#!/bin/bash
# Usage: ./ci-fetch-log.sh <buildkite_job_url> [output_file]
# ./ci-fetch-log.sh <build_number> <job_uuid> [output_file]
#
# Downloads the raw log for a Buildkite job from the public, unauthenticated
# /organizations/<org>/pipelines/<pipeline>/builds/<n>/jobs/<uuid>/download
# endpoint, then strips ANSI/timestamps via ci-clean-log.sh.
#
# Find <build_number> and <job_uuid> via:
# gh pr checks <PR> --repo vllm-project/vllm
# Each failing row's URL is .../builds/<build_number>#<job_uuid>.

set -euo pipefail

ORG="vllm"
PIPELINE="ci"

usage() {
  echo "Usage: $0 <buildkite_job_url> [output_file]"
  echo "       $0 <build_number> <job_uuid> [output_file]"
  exit 1
}

if [ $# -lt 1 ]; then usage; fi

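if [[ "$1" == https://* ]]; then
  BUILD=$(echo "$1" | sed -nE 's#.*/builds/([0-9]+).*#\1#p')
  # grep exits non-zero when nothing matches; under `set -eo pipefail` that
  # would abort the script here, before the friendlier parse error below.
  JOB=$(echo "$1" | grep -oE '[0-9a-f]{8}-[0-9a-f-]+' | head -n 1 || true)
  OUT="${2:-ci-${BUILD}-${JOB:0:8}.log}"
else
  if [ $# -lt 2 ]; then usage; fi
  BUILD="$1"
  JOB="$2"
  OUT="${3:-ci-${BUILD}-${JOB:0:8}.log}"
fi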

if [ -z "$BUILD" ] || [ -z "$JOB" ]; then
  echo "Could not parse build number or job UUID from: $1" >&2
  usage
fi

COOKIES=$(mktemp)
trap 'rm -f "$COOKIES"' EXIT

# Buildkite issues a session cookie on first hit; subsequent /download needs it.
curl -fsSL -c "$COOKIES" -A "vllm-ci-fetch-log" \
  "https://buildkite.com/${ORG}/${PIPELINE}/builds/${BUILD}" -o /dev/null

curl -fsSL -b "$COOKIES" -A "vllm-ci-fetch-log" \
  "https://buildkite.com/organizations/${ORG}/pipelines/${PIPELINE}/builds/${BUILD}/jobs/${JOB}/download" \
  -o "$OUT"

bash "$(dirname "$0")/ci-clean-log.sh" "$OUT"

echo "$OUT"
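
The URL handling in the script above can be sketched in Python for illustration. This is not part of the change: `parse_job_url` and `default_log_name` are hypothetical names, the regular expressions are copied from the script's `sed` and `grep` calls, and the example URL is made up.

```python
import re

# Same patterns the shell script uses (sed for the build, grep for the job).
BUILD_RE = re.compile(r".*/builds/([0-9]+)")
JOB_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f-]+")


def parse_job_url(url: str) -> tuple[str, str]:
    """Return (build_number, job_uuid) parsed from a Buildkite job URL."""
    build = BUILD_RE.match(url)
    job = JOB_RE.search(url)
    if build is None or job is None:
        raise ValueError(f"Could not parse build number or job UUID from: {url}")
    return build.group(1), job.group(0)


def default_log_name(build: str, job: str) -> str:
    """Mirror the script's default output name: ci-<build>-<first 8 of uuid>.log."""
    return f"ci-{build}-{job[:8]}.log"


if __name__ == "__main__":
    # Hypothetical URL in the .../builds/<build_number>#<job_uuid> shape.
    example = "https://buildkite.com/vllm/ci/builds/12345#0195e2b4-1234-7abc-8def-0123456789ab"
    build, job = parse_job_url(example)
    print(default_log_name(build, job))
```

Raising `ValueError` when either part is missing corresponds to the script's "Could not parse build number or job UUID" branch.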
142 changes: 142 additions & 0 deletions .buildkite/scripts/detect-manylinux-tag.py
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Detect the manylinux platform tag for a wheel and rename it in place.

vLLM's build images produce wheels with the generic ``linux_<arch>`` platform
tag, which installers like ``pip`` won't accept off PyPI/our index. We need to
rewrite the platform tag to the appropriate ``manylinux_<major>_<minor>_<arch>``
before uploading.

Historically the tag was hard-coded per build (``manylinux_2_31`` for the
Ubuntu 20.04-based image, ``manylinux_2_35`` for the Ubuntu 22.04-based
images). That is brittle: bumping the base image silently produces wheels
labelled with the wrong glibc requirement. This script asks ``auditwheel``
to derive the tag from the symbol versions actually referenced by the
binaries inside the wheel, so the label tracks reality.

We can't simply call ``auditwheel repair`` -- it tries to graft external
shared libraries into the wheel and fails on vLLM's CUDA/cuBLAS dependencies.
Instead we use ``auditwheel.wheel_abi.analyze_wheel_abi`` directly, which is
the same call that powers ``auditwheel show``, and read off
``winfo.sym_policy.name``.

Usage:
    detect-manylinux-tag.py <wheel_path>

The wheel is renamed in place; the new path is printed on stdout. All
diagnostics go to stderr so callers can capture stdout safely.
"""

from __future__ import annotations

import argparse
import sys
from pathlib import Path

from auditwheel.error import (
    AuditwheelError,
    NonPlatformWheelError,
    WheelToolsError,
)
from auditwheel.wheel_abi import analyze_wheel_abi
from auditwheel.wheeltools import get_wheel_architecture, get_wheel_libc


def detect_platform_tag(wheel_path: Path) -> str:
    """Return the most precise platform tag the wheel is consistent with.

    Mirrors ``auditwheel show`` but returns ``sym_policy`` rather than
    ``overall_policy``: we only care about the glibc symbol versions used,
    not about other policy axes (ISA extensions, blacklist, etc.) that
    ``overall_policy`` folds in.
    """
    fn = wheel_path.name

    try:
        arch = get_wheel_architecture(fn)
    except (WheelToolsError, NonPlatformWheelError):
        # Architecture isn't deducible from the filename; let auditwheel
        # infer it from the ELF binaries inside the wheel.
        arch = None

    try:
        libc = get_wheel_libc(fn)
    except WheelToolsError:
        # An unrepaired wheel uses ``linux_<arch>``, which doesn't encode
        # libc. Let auditwheel infer it from the ELF binaries.
        libc = None

    winfo = analyze_wheel_abi(
        libc,
        arch,
        wheel_path,
        frozenset(),
        disable_isa_ext_check=False,
        allow_graft=False,
    )
    return winfo.sym_policy.name


def rename_wheel(wheel_path: Path, new_platform_tag: str) -> Path:
    """Rename the wheel in place, replacing only its platform tag."""
    # Wheel filename per PEP 427:
    #   {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    # The platform tag is always the last ``-``-separated token before
    # ``.whl``. Compound tags like ``manylinux_2_31_x86_64`` use ``_`` as the
    # internal separator, so ``-``-splitting is unambiguous.
    parts = wheel_path.stem.split("-")
    if len(parts) < 5:
        raise ValueError(f"Unrecognised wheel filename: {wheel_path.name}")
    parts[-1] = new_platform_tag
    new_path = wheel_path.with_name("-".join(parts) + ".whl")
    if new_path != wheel_path:
        wheel_path.rename(new_path)
    return new_path


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Detect a wheel's manylinux platform tag with "
        "auditwheel and rename the wheel in place."
    )
    parser.add_argument(
        "wheel",
        type=Path,
        help="Path to the wheel to inspect and rename.",
    )
    args = parser.parse_args()

    wheel_path: Path = args.wheel
    if not wheel_path.is_file():
        print(f"error: {wheel_path} is not a file", file=sys.stderr)
        return 1

    # Catch the things that ``analyze_wheel_abi`` and ``rename_wheel`` can
    # raise: any subclass of ``AuditwheelError`` (pure-Python wheels,
    # invalid libc, malformed wheels), filesystem errors, or our own
    # ``ValueError`` for an unrecognised wheel filename. Print a single
    # ``ERROR_TYPE: message`` line to stderr instead of a Python
    # traceback, which is much friendlier in CI logs.
    try:
        new_tag = detect_platform_tag(wheel_path)
        print(f"detected platform tag: {new_tag}", file=sys.stderr)
        new_path = rename_wheel(wheel_path, new_tag)
    except (AuditwheelError, ValueError, OSError) as e:
        print(
            f"error: failed to retag {wheel_path.name}: {type(e).__name__}: {e}",
            file=sys.stderr,
        )
        return 2

    if new_path != wheel_path:
        print(f"renamed {wheel_path.name} -> {new_path.name}", file=sys.stderr)
    else:
        print(f"wheel already tagged {new_tag}", file=sys.stderr)

    print(new_path)
    return 0


if __name__ == "__main__":
    sys.exit(main())
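
As a rough illustration of the filename handling in `rename_wheel`, the sketch below applies the same `-`-splitting to plain strings, without touching the filesystem or calling `auditwheel`. It is not part of the change: `retag_filename` is a hypothetical helper, and the wheel filenames and tags are invented examples.

```python
from pathlib import PurePath


def retag_filename(filename: str, new_platform_tag: str) -> str:
    """Replace only the platform tag (last '-'-separated token) of a wheel name."""
    parts = PurePath(filename).stem.split("-")
    # distribution, version, python tag, abi tag, platform tag: at least 5 parts.
    if len(parts) < 5:
        raise ValueError(f"Unrecognised wheel filename: {filename}")
    parts[-1] = new_platform_tag
    return "-".join(parts) + ".whl"


if __name__ == "__main__":
    # Invented filenames: one without and one with the optional build tag.
    print(retag_filename("vllm-0.1.0-cp38-abi3-linux_x86_64.whl", "manylinux_2_35_x86_64"))
    print(retag_filename("vllm-0.1.0-1-cp38-abi3-linux_aarch64.whl", "manylinux_2_31_aarch64"))
```

Because the optional build tag sits between the version and the Python tag, indexing from the end (`parts[-1]`) selects the platform tag whether or not a build tag is present.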