162 commits
956b14b
Fix hybrid Mamba attention KV cache allocation
lesj0610 May 2, 2026
c408fdd
[Fix] Sync gemma4 chat template from hf (#39570)
FredericOdermatt May 2, 2026
964a4bc
[MM][CG] Support ViT CG for Qwen2.5-VL (#40830)
johncalesp May 2, 2026
3e49479
Limit concurrency on `test_transcription_api_correctness.py` (#41478)
ekagra-ranjan May 2, 2026
e33ff55
Handle mixed KV override for cudagraph profiling
lesj0610 May 2, 2026
d58c42e
[vLLM IR] 2/N fused_add_rms_norm and maybe_inplace overload (#36823)
ProExpertProg May 2, 2026
c293ccc
[ROCm][Bugfix] Fix init-time bias dtype cast when gate.out_dtype is N…
rbrugaro-amd May 2, 2026
ae3b4de
[Doc] Add Codex usage example (#41358)
chaunceyjiang May 2, 2026
8586369
Refactor Step3Text loading to use AutoWeightsLoader (#41492)
mcsantiago May 2, 2026
c3ad791
[Bugfix][Gemma 4] Clamp soft-token estimate to max_soft_tokens (#40796)
hnt2601 May 2, 2026
efccac8
Merge branch 'main' into lesj/gdn-kv-mamba-attn-kv-fix-pr
lesj0610 May 2, 2026
47af837
Fix request-constant KV review edge cases
lesj0610 May 2, 2026
291b063
Restore non-divisible KV page-size test expectation
lesj0610 May 2, 2026
1ba1571
Allow full cudagraph with request-constant KV pools
lesj0610 May 2, 2026
cfd2573
[Build] Switch CUDA 13.0 wheel builds to PyTorch manylinux_2_28 base …
mgoin May 2, 2026
0a9362d
Revert "[Build] Make bundled DeepGEMM wheel portable across Python ve…
mgoin May 2, 2026
4f7309f
[CI] Add ci-fetch-log.sh helper for Buildkite job logs (#41517)
mgoin May 2, 2026
1c607d7
[DSV4] Guard megamoe flag with Pure TP (#41522)
zyongye May 2, 2026
856ec48
[DSv4] Tune default value of `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD`…
ywang96 May 3, 2026
08834cc
[Quantization] add humming mxfp4 moe backend (#41083)
jinzhen-lin May 3, 2026
e6ff3e9
[MRV2] Add shutdown() method (#41297)
WoosukKwon May 3, 2026
e52ba7a
Fix legacy KV cache metadata for connector tests
lesj0610 May 3, 2026
f4119a7
Merge branch 'main' into lesj/gdn-kv-mamba-attn-kv-fix-pr
lesj0610 May 3, 2026
54dc64d
[Doc] Add Qwen3-30B-A3B-Thinking-2507-FP8 to batch invariance verifie…
taneem-ibrahim May 3, 2026
da970f0
Merge branch 'main' into lesj/gdn-kv-mamba-attn-kv-fix-pr
lesj0610 May 3, 2026
c51df43
Disable flashinfer autotune temporarily due to correctness issues (#4…
wzhao18 May 3, 2026
cb03fee
[Bugfix][Ray] Fix RayExecutorV2 actor name collision with DP > 1 (#40…
tomeras91 May 3, 2026
db9a84e
[Bugfix] Fix FP8 Bias Loading (#41424)
alex-jw-brooks May 3, 2026
66dfee7
[Bugfix] Fix degenerate KV cache stride causing TMA cudaErrorIllegalI…
the-david-oy May 3, 2026
894a025
[Bench] Forward --seed to CustomDataset and CustomMMDataset shuffle (…
Aktsvigun May 4, 2026
67058ca
[CI] Clean up remote servers on pytest parent exit (#41570)
AndreasKaratzas May 4, 2026
c103c02
[Transformers v5] Vendor HCXVisionConfig for compatibility (#38447)
HanFa May 4, 2026
01d4d1a
[ROCm][CI] Align spec decode logprob test prefill settings (#41335)
AndreasKaratzas May 4, 2026
6ec9bbe
[CI] Stabilize cpu offload compressed tensors test (#41102)
AndreasKaratzas May 4, 2026
6f53753
[Bugfix] Apply ruff-format to hyperclovax.py (#41620)
stecasta May 4, 2026
62ba751
Revert "[Doc] Fix RTD build: pytorch.org/docs/stable/objects.inv retu…
stecasta May 4, 2026
8decbfa
Test nemotron nano-v2 and nemotron nano-v3 separately, disable super-…
netanel-haber May 4, 2026
3e1ad44
[Bug] Fix `tests/compile/test_config.py` AttributeError: 'NoneType' o…
yewentao256 May 4, 2026
321fa2d
Limit gpu utils and lower max BS on test_transcription_api_correctnes…
ekagra-ranjan May 4, 2026
712ad02
[Bugfix] KimiK2ReasoningParser: guard against buffered end-token in s…
JasonKeyiL May 4, 2026
e724b0e
[ROCm] ROCm7.2.2 + profiler fix + AITER 0.1.12.post2 (#41386)
gshtras May 4, 2026
8c78094
Fix Nano Nemotron text-only weight loading (#41205)
Baekpica May 4, 2026
422dd02
[bugfix] Fix prompt logprobs on request eviction during chunked pref…
joa-stdn May 4, 2026
844df54
feat: update xgrammar==0.2.0 to use structural tags for strict tool c…
Seven-Streams May 4, 2026
9c07342
[NVFP4][fix] Fix `layer.weight` -> `w13` typo in NVFP4 MOE emulation …
fxmarty-amd May 4, 2026
be5983b
[Docs] Add non-causal support to attention backend docs (#41643)
MatthewBonanni May 4, 2026
1cb0838
[ROCm][CI] Fix MLA prefill scale for DeepSeek GSM8K (#41569)
AndreasKaratzas May 4, 2026
577b962
[Bug] Fix status update address for non-MOE model within external dp …
yewentao256 May 4, 2026
4f2af1a
[Feature] TurboQuant: support hybrid models and uniform quantization …
JartX May 5, 2026
e1e4646
[Model Runner V2] Rebuild attn metadata between draft decode steps (#…
TheEpicDolphin May 5, 2026
685bf81
[XPU] enable is_act_and_mul for xpu (#37481)
xuechendi May 5, 2026
416f9cd
[Perf][2/n] Eliminate GPU<->CPU syncs in pooling code (#41433)
njhill May 5, 2026
1e95004
[ROCm][Quantization][2/N] Refactor quark_moe w4a8 w/ oracle (#39136)
BowenBao May 5, 2026
420b0a5
[Hardware][Power]Add Power VSX Attention Backend and fix l2 Cache Cra…
Akashcodes732 May 5, 2026
f04fd16
[Ray] Enable RayExecutorV2 by default (#41421)
jeffreywang-anyscale May 5, 2026
eaec7be
[BugFix] Preserve max_seq_len in ubatch metadata during CUDA graph ca…
czhu-cohere May 5, 2026
6bb924b
[Model] Fix Gemma4 MoE activation mismatch (#41574)
lucianommartins May 5, 2026
0c620d2
[Model] Use AutoWeightsLoader for CohereMoe (#41690)
bittoby May 5, 2026
4845aee
[Benchmark] Add --trust-remote-code flag to multi-turn benchmark (#41…
Dao007forever May 5, 2026
27cc676
[Model] Use AutoWeightsLoader for Plamo2 (#41699)
bittoby May 5, 2026
bee1261
[P/D][Mooncake] Add KVConnectorStats for transfer observability (#40414)
zhewenl May 5, 2026
2ceea42
[XPU] use xpu topk topp sample kernel (#39285)
jikunshang May 5, 2026
8b9ea2f
[Feature] Add Triton kernel JIT compilation monitor for inference (#4…
arpera May 5, 2026
0a201b6
[Model] support Qianfan-OCR model (#40136)
marvinzh May 5, 2026
b0765be
Fix DeepSeek-OCR for Transformers v4 (#41460)
hmellor May 5, 2026
98661fe
[Bugfix][KVConnector] Support DCP/PCP in OffloadingConnector (#41549)
Etelis May 5, 2026
6fca518
[BugFix][MyPy]: Module has no attribute "sched_getaffinity" [attr-de…
hickeyma May 5, 2026
20dcd98
[Bugfix] Fix `RuntimeError: Already borrowed` by adding thread-safe H…
yzong-rh May 5, 2026
b786ec8
[Bugfix] Suggest upgrading Transformers for tokenizer class errors (#…
Lidang-Jiang May 5, 2026
84bd8a3
Remove unnecessary runtime asserts from linear layers (#41729)
hmellor May 5, 2026
2228fe6
[Attention] Move FA3→FA4 upgrade into get_flash_attn_version() (#40815)
gcanlin May 5, 2026
628c436
[New Model][ROCm] Add AMD support for DeepSeek V4 (#40871)
whx-sjtu May 5, 2026
c6235ed
[BUGFIX] Support streamed_args_for_tool in MistralToolParser (#41730)
juliendenize May 5, 2026
48954de
Fix DeepGEMM ep_scatter output address overflow (#39213)
S1ro1 May 5, 2026
79246b5
[Spec Decode] Fix max_model_len logging in speculative config for dra…
liulanze May 5, 2026
8c57b6e
Bump model-hosting-container-standards to >= 0.1.14 (#39755)
Dhruvilbhatt May 5, 2026
01b9b5a
[Attention] Minor refactor: layer takes ownership of the MLA prefill …
MatthewBonanni May 5, 2026
1333864
[CI] Automate Docker Hub release image publishing (#40415)
khluu May 6, 2026
4a8ae26
[ROCm][CI] Use vLLM generation defaults for DeepSeek prefetch-offload…
AndreasKaratzas May 6, 2026
f653761
[CI] Route part of B200 jobs to b200-k8s (#41453)
khluu May 6, 2026
c7aa186
[Frontend] Supports resubmitting output items with missing fields in …
chaunceyjiang May 6, 2026
16e3364
[Mistral Tokenizer] allow more leniency in apply_chat_template (#41658)
juliendenize May 6, 2026
aee190a
[Build] Fall back to system libgomp when torch has no vendored copy (…
lyd1992 May 6, 2026
e47c98e
[Fix] Add missing stubs from cpu fp8 attention changes (#41387)
tianmu-li May 6, 2026
91740ca
[ROCm][CI] Refine gating tests (#37243)
AndreasKaratzas May 6, 2026
2d7d6cf
[Spec Decode] Allow multimodal models with a warning (#41752)
laviier May 6, 2026
b53c507
[Bugfix] Skip PP sampled-token receive on last rank during async sche…
wi-adam May 6, 2026
809b98e
[CPU] Add FP8 W8A16 linear support (#41186)
yuwenzho May 6, 2026
e87e09a
[Feat] dnnl build for AVX2 W8A8 Int8 (#41318)
tianmu-li May 6, 2026
213f10b
[Bugfix] Fix codegen for unqualified names (#40726)
Lucaskabela May 6, 2026
51c1ee9
[Examples] Resettle Disaggregated examples. (#40759)
noooop May 6, 2026
1c58876
[XPU] Disable CUDA graph memory estimate on XPU platform (#41344)
chaojun-zhang May 6, 2026
66d1cc0
fix(rocm): remove workaround causing invalid argument on Qwen3.5 with…
aaab8b May 6, 2026
e43a791
[Bugfix][CI] Fix Disaggregated test area path (#41794)
NickLucche May 6, 2026
2e777d2
[Bugfix][Rocm]Aiter MoE re-uses existing tensor addresses after weigh…
yuankaichen-amd May 6, 2026
d8deb5b
Fix some legacy checkpoints with deprecated `rope_type` values (#41734)
hmellor May 6, 2026
5d0fd87
[CPU][RISC-V] Auto-bind OMP threads and harden nobind path (#40569)
lyd1992 May 6, 2026
242afc6
[MM][Gemma4] Respect max_soft_tokens in encoder budget (#41799)
lesj0610 May 6, 2026
df8e63f
nixl refactor: new transfer design (#40731)
ZhanqiuHu May 6, 2026
6467213
fix(openai): tolerate empty content in forced tool choice (#40148)
QwertyJack May 6, 2026
f39bcf1
[KV Offload] Return None from lookup() for in-flight blocks (#41795)
ronensc May 6, 2026
27e0057
[Spec Decode] Add Gemma4 MTP speculative decoding support (#41745)
lucianommartins May 6, 2026
ee38750
[Bugfix] Fix spawn_new_process_for_each_test silently swallowing test…
dzhengAP May 6, 2026
d5b31c9
[Bugfix] Account for truncate_prompt_tokens when computing max_tokens…
viktorpusTT May 6, 2026
22a3cbe
[ROCm] aiter_unified_attn fp8 q scale refactor (#38296)
divakar-amd May 6, 2026
27702f6
[Bugfix] Fix token loss in PP mode which causes degraded accuracy (#4…
starkwj May 6, 2026
38e1667
[Bugfix] Align block table for TRTLLM MLA edge-case (#39324)
benchislett May 6, 2026
ca3e62d
Upgrade tpu-inference to v0.19.0 (#41844)
jcyang43 May 6, 2026
f3f8efa
[CI] Enable gemma4 parser test on CI (#41857)
sfeng33 May 6, 2026
deb737e
[Doc] Add ModernBertForSequenceClassification to scoring.md cross-en……
JLiu4Coding May 6, 2026
50acdc5
Fix Qwen3 streaming content routing (#40820)
xy3xy3 May 6, 2026
9558286
[Bugfix] DeepSeekV32/v4: respect string='true|false' attribute and unw…
chaunceyjiang May 6, 2026
80d5e7d
[Bugfix] Fix condition to clear persistent topk so that it can be cap…
zyongye May 6, 2026
7a576e2
[ROCm][CI] Remove `TORCH_NCCL_BLOCKING_WAIT=1` After Bugfix In ROCm 7…
micah-wil May 6, 2026
5a0a8fc
[Docs] add cache directory security guidance (#38920)
russellb May 6, 2026
20cac26
[ROCm] Enable SimpleCPUOffloadConnector on ROCm (#40549)
hongxiayang May 7, 2026
51f22dc
[Feat][CPU] Enable Gated DeltaNet Attention (Qwen 3.5 / 3.6) (#41025)
fadara01 May 7, 2026
713b28b
[CPU] Add FP8 W8A16 MoE support (#41314)
yuwenzho May 7, 2026
8a4888b
[ROCm] Profiler api support for ROCm MORI toy proxy server in PD Disa…
itej89 May 7, 2026
6e6d182
[Bugfix] Fix OOM in tensorizer LoRA deserialization (#41845)
orozery May 7, 2026
d4b0048
Eliminate redundant MoE buffer copies in AITER fused experts (without…
amd-mghanimi May 7, 2026
b20731d
[CI][Arm] skip e2e model tests if HF_TOKEN is not set (#41919)
fadara01 May 7, 2026
9c0812f
[Bugfix] Fix FusedMoEWithLoRA has no attribute `runner` (#41889)
jeejeelee May 7, 2026
ffee741
[Model] Use AutoWeightsLoader for AXK1 (#41901)
wenyili May 7, 2026
b3945cc
[CPU] Bump up to the latest CPU kernels (#41924)
bigPYJ1151 May 7, 2026
f650ace
[MM][Gemma4] Use video profiling hints in encoder budget (#41837)
lesj0610 May 7, 2026
75f0d51
[Bugfix] Fix GLM4-MoE weight loading for NVFP4 quantized checkpoints …
s-yanev May 7, 2026
805e9f7
[XPU] Fix lora bugs & enable UTs under tests/lora (#38206)
chaojun-zhang May 7, 2026
2a84da3
[XPU] Implement out-of-place all-reduce functionality (#41808)
chaojun-zhang May 7, 2026
003159d
[ROCm][CI] Avoid duplicate ROCm AITER norm-quant patterns (#41534)
AndreasKaratzas May 7, 2026
2a16ece
tokenizer: Add fastokens support (#41741)
AlonKejzman May 7, 2026
06a60d3
Fix spec decode benchmark metrics (#41916)
noobHappylife May 7, 2026
7a08b34
[Model Runner V2] support qwen35 / mamba hybrid model (#35520)
izhuhaoran May 7, 2026
9d6500b
[Misc] Delay EPLB Nixl import until needed (#41805)
NickLucche May 7, 2026
8eb4011
[Refactor] Consolidate required/named tool_choice streaming into Dele…
sfeng33 May 7, 2026
8189a15
[Core] Replace routing replay with device cache and async D2H pipelin…
TomerBN-Nvidia May 7, 2026
c936548
[ROCm][DeepSeek] Enable V3.2 TP4 AITER MLA (#41835)
akii96 May 7, 2026
3af561e
[ROCm] Fix AITER AR+RMSNorm no-residual fusion (#41972)
akii96 May 7, 2026
969fbfb
Laguna xs dflash support (#41880)
MeganEFlynn May 7, 2026
c1819ca
[Compressed Tensors] Allow configs with non-explicit ignores (#41965)
kylesayrs May 7, 2026
54f548e
[Bugfix] Restore moe_forward output shape invariant on TRTLLM MXFP4 p…
stecasta May 7, 2026
10ebb40
[Core] Avoid using extra thread in `UniProcExecutor` (#40891)
njhill May 7, 2026
09a7cc5
[KV Connector] Opt DecodeBenchConnector into SupportsHMA (#41770)
liuzijing2014 May 7, 2026
50f2db2
add: LFM2/2.5 Tool Parser (#39243)
jbuchananr May 8, 2026
5f6a028
[CI][Bugfix] Fix failure CI step "PyTorch Fullgraph Smoke Test" (#41953)
haosdent May 8, 2026
57c2f72
[CI][Bugfix] Fix CI failures for "PyTorch Compilation Unit Tests" (#4…
haosdent May 8, 2026
1d694e7
[Examples][last/6] Resettle examples. (#41084)
noooop May 8, 2026
cd58e30
[Perf] Use numpy zero-copy path for embedding float response serializ…
lokashrinav May 8, 2026
989c176
[Perf][3/n] Eliminate GPU<->CPU syncs in attention impls (#41434)
njhill May 8, 2026
01b0f3a
fix: default TILELANG_CLEANUP_TEMP_FILES=1 to avoid shared /tmp confl…
ssam18 May 8, 2026
baf068d
enable persistent mla for sparse mla backend (#41990)
dllehr-amd May 8, 2026
0b99971
[Kernel][Helion] Optimize Helion config parsing latency (#40850)
gmagogsfm May 8, 2026
1acd67a
[Bugfix] Fix XPU/ROCm compatibility in spawn_new_process_for_each_tes…
dzhengAP May 8, 2026
ed582b6
[Aiter][ROCm] gdn_linear_attn kernel fusion (#40711)
tpopp May 8, 2026
77b13b9
[Docs] Reorganize examples docs. (#41082)
noooop May 8, 2026
445d747
[Bugifx] Missing Renderer for `fastokens` mode (#41984)
tjtanaa May 8, 2026
f9b9bf3
[CI][ROCm] Ship RIXL with `vllm/vllm-openai-rocm` (#41634)
simondanielsson May 8, 2026
160858c
[CI][Bugfix] Surface subprocess output in spawn_new_process_for_each_…
haosdent May 8, 2026
36b2c79
[CI][Bugfix] Drop duplicated examples/ prefix in tensorize_vllm_model…
haosdent May 8, 2026
19df11f
[CI][XPU]Ignore some lora tests from LoRA Intel CI pipeline (#42010)
chaojun-zhang May 8, 2026
630820a
Make docs environment deterministic (#41926)
hmellor May 8, 2026
c265343
Merge remote-tracking branch 'origin/main' into HEAD
lesj0610 May 8, 2026
9 changes: 7 additions & 2 deletions .buildkite/hardware_tests/cpu.yaml
@@ -12,15 +12,19 @@ steps:
- vllm/_custom_ops.py
- tests/kernels/attention/test_cpu_attn.py
- tests/kernels/moe/test_cpu_fused_moe.py
- tests/kernels/moe/test_cpu_fp8_fused_moe.py
- tests/kernels/test_onednn.py
- tests/kernels/test_awq_int4_to_int8.py
- tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
pytest -x -v -s tests/kernels/moe/test_cpu_fp8_fused_moe.py
pytest -x -v -s tests/kernels/test_onednn.py
pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py"
pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

- label: CPU-Compatibility Tests
depends_on: []
@@ -61,6 +65,7 @@ steps:
- vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
- vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
- vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
- vllm/model_executor/layers/fused_moe/experts/cpu_moe.py
- tests/quantization/test_compressed_tensors.py
- tests/quantization/test_cpu_wna16.py
commands:
20 changes: 13 additions & 7 deletions .buildkite/intel_jobs/lora_intel.yaml
@@ -18,17 +18,18 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_layers.py &&
pytest -v -s lora/test_lora_checkpoints.py &&
(pytest -v -s lora/test_lora_functions.py --deselect="tests/lora/test_lora_functions.py::test_lora_functions_sync" --deselect="tests/lora/test_lora_functions.py::test_lora_functions_async" || true) &&
pytest -v -s lora/test_lora_functions.py &&
pytest -v -s lora/test_lora_huggingface.py &&
pytest -v -s lora/test_lora_manager.py &&
pytest -v -s lora/test_lora_utils.py &&
pytest -v -s lora/test_peft_helper.py &&
pytest -v -s lora/test_resolver.py &&
pytest -v -s lora/test_utils.py &&
(pytest -v -s lora/test_add_lora.py --deselect="tests/lora/test_add_lora.py::test_add_lora" || true) &&
(pytest -v -s lora/test_worker.py --deselect="tests/lora/test_worker.py::test_worker_apply_lora" || true)'
pytest -v -s lora/test_add_lora.py &&
pytest -v -s lora/test_worker.py'

- label: LoRA Fused/MoE Kernels
timeout_in_minutes: 45
@@ -46,6 +47,7 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_fused_moe_lora_kernel.py &&
pytest -v -s lora/test_moe_lora_align_sum.py'

@@ -65,8 +67,9 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
set -o pipefail &&
pytest -v -s lora/test_punica_ops.py --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-2-2049-64-32-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype1-2-64000-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-1-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-1-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-8-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype0-3-2049-128-8-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-8-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype1-1-2049-256-128-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-3-64256-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-2-29696-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-3-49408-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-2-16384-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-2-51328-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype1-1-102656-32-4-4]"'
pytest -v -s lora/test_punica_ops.py --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-3-43264-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype1-1-2049-64-128-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-1-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-1-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-8-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype0-3-2049-128-8-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-8-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype1-1-2049-256-128-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-3-64256-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-2-29696-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-3-49408-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-2-16384-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-2-51328-32-4-4]"'

- label: LoRA Punica FP8/XPU Ops
timeout_in_minutes: 45
@@ -84,6 +87,7 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_punica_ops_fp8.py &&
pytest -v -s lora/test_punica_xpu_ops.py'

@@ -103,10 +107,12 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
(pytest -v -s lora/test_mixtral.py --deselect="tests/lora/test_mixtral.py::test_mixtral_lora[4]" || true) &&
pytest -v -s lora/test_quant_model.py --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model0]" --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model1]" --deselect="tests/lora/test_quant_model.py::test_quant_model_tp_equality[model0]" &&
pytest -v -s lora/test_qwen35_densemodel_lora.py &&
pytest -v -s lora/test_transformers_model.py'
pytest -v -s lora/test_transformers_model.py &&
pytest -v -s lora/test_chatglm3_tp.py &&
pytest -s -v lora/test_minicpmv_tp.py'

- label: LoRA Multimodal
timeout_in_minutes: 45
@@ -124,6 +130,6 @@ steps:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_default_mm_loras.py &&
(pytest -v -s lora/test_qwen3_unembed.py || true) &&
pytest -v -s lora/test_whisper.py'
45 changes: 41 additions & 4 deletions .buildkite/release-pipeline.yaml
@@ -37,7 +37,7 @@ steps:
agents:
queue: arm64_cpu_queue_release
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_AARCH64}\" --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_AARCH64}\" --build-arg BUILD_OS=manylinux --build-arg BUILD_BASE_IMAGE=pytorch/manylinuxaarch64-builder:cuda13.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-nightly-wheels.sh"
@@ -76,7 +76,7 @@ steps:
agents:
queue: cpu_queue_release
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_X86}\" --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_X86}\" --build-arg BUILD_OS=manylinux --build-arg BUILD_BASE_IMAGE=pytorch/manylinux2_28-builder:cuda13.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-nightly-wheels.sh"
@@ -309,6 +309,7 @@ steps:
depends_on: ~

- label: "Build release image - x86_64 - CPU"
key: build-cpu-release-image-x86
depends_on:
- block-cpu-release-image-build
- input-release-version
@@ -327,7 +328,8 @@
depends_on: ~

- label: "Build release image - arm64 - CPU"
depends_on:
key: build-cpu-release-image-arm64
depends_on:
- block-arm64-cpu-release-image-build
- input-release-version
agents:
@@ -436,6 +438,41 @@ steps:
DOCKER_BUILDKIT: "1"
DOCKERHUB_USERNAME: "vllmbot"

- block: "Publish release images to DockerHub"
key: block-publish-release-images
depends_on:
- create-multi-arch-manifest
- create-multi-arch-manifest-cuda-12-9
- create-multi-arch-manifest-ubuntu2404
- create-multi-arch-manifest-cuda-12-9-ubuntu2404
- build-rocm-release-image
- input-release-version
# Wait for CPU builds if their block steps were unblocked, so publish
# doesn't race the in-progress CPU build. allow_failure lets publish
# proceed when the operator legitimately leaves the CPU block steps
# unblocked or the CPU build fails.
- step: build-cpu-release-image-x86
allow_failure: true
- step: build-cpu-release-image-arm64
allow_failure: true
if: build.env("NIGHTLY") != "1"

- label: "Publish release images to DockerHub"
depends_on:
- block-publish-release-images
key: publish-release-images-dockerhub
agents:
queue: small_cpu_queue_release
commands:
- "bash .buildkite/scripts/publish-release-images.sh"
plugins:
- docker-login#v3.0.0:
username: vllmbot
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
DOCKERHUB_USERNAME: "vllmbot"

- group: "Publish wheels"
key: "publish-wheels"
steps:
@@ -723,7 +760,7 @@ steps:
- "bash tools/vllm-rocm/generate-rocm-wheels-root-index.sh"
env:
S3_BUCKET: "vllm-wheels"
VARIANT: "rocm721"
VARIANT: "rocm722"

# ROCm Job 6: Build ROCm Release Docker Image
- label: ":docker: Build release image - x86_64 - ROCm"
94 changes: 1 addition & 93 deletions .buildkite/scripts/annotate-release.sh
@@ -8,8 +8,6 @@ if [ -z "${RELEASE_VERSION}" ]; then
RELEASE_VERSION="1.0.0.dev"
fi

ROCM_BASE_CACHE_KEY=$(.buildkite/scripts/cache-rocm-base-wheels.sh key)

buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
To download the wheel (by commit):
\`\`\`
@@ -25,95 +23,5 @@ aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl .
\`\`\`


To download and upload the image:

\`\`\`
# Download images:

docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu129
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu129
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${ROCM_BASE_CACHE_KEY}-rocm-base
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION}
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION}

# Tag and push images:

## CUDA

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
docker push vllm/vllm-openai:latest-x86_64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu129 vllm/vllm-openai:x86_64-cu129
docker tag vllm/vllm-openai:x86_64-cu129 vllm/vllm-openai:latest-x86_64-cu129
docker tag vllm/vllm-openai:x86_64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129
docker push vllm/vllm-openai:latest-x86_64-cu129
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker push vllm/vllm-openai:latest-aarch64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu129 vllm/vllm-openai:aarch64-cu129
docker tag vllm/vllm-openai:aarch64-cu129 vllm/vllm-openai:latest-aarch64-cu129
docker tag vllm/vllm-openai:aarch64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129
docker push vllm/vllm-openai:latest-aarch64-cu129
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129

## ROCm

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
docker push vllm/vllm-openai-rocm:latest
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}

docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${ROCM_BASE_CACHE_KEY}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
docker push vllm/vllm-openai-rocm:latest-base
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base

## CPU

docker tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:x86_64
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:latest-x86_64
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
docker push vllm/vllm-openai-cpu:latest-x86_64
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64

docker tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:arm64
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:latest-arm64
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
docker push vllm/vllm-openai-cpu:latest-arm64
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64

# Create multi-arch manifest:

docker manifest rm vllm/vllm-openai:latest
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker manifest push vllm/vllm-openai:latest
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}

docker manifest rm vllm/vllm-openai:latest-cu129
docker manifest create vllm/vllm-openai:latest-cu129 vllm/vllm-openai:latest-x86_64-cu129 vllm/vllm-openai:latest-aarch64-cu129
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION}-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu129 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu129
docker manifest push vllm/vllm-openai:latest-cu129
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}-cu129

docker manifest rm vllm/vllm-openai-cpu:latest || true
docker manifest create vllm/vllm-openai-cpu:latest vllm/vllm-openai-cpu:latest-x86_64 vllm/vllm-openai-cpu:latest-arm64
docker manifest create vllm/vllm-openai-cpu:v${RELEASE_VERSION} vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
docker manifest push vllm/vllm-openai-cpu:latest
docker manifest push vllm/vllm-openai-cpu:v${RELEASE_VERSION}
\`\`\`
Docker images are published automatically by the "Publish release images to DockerHub" pipeline step.
EOF
55 changes: 55 additions & 0 deletions .buildkite/scripts/ci-fetch-log.sh
@@ -0,0 +1,55 @@
#!/bin/bash
# Usage: ./ci-fetch-log.sh <buildkite_job_url> [output_file]
# ./ci-fetch-log.sh <build_number> <job_uuid> [output_file]
#
# Downloads the raw log for a Buildkite job from the public, unauthenticated
# /organizations/<org>/pipelines/<pipeline>/builds/<n>/jobs/<uuid>/download
# endpoint, then strips ANSI/timestamps via ci-clean-log.sh.
#
# Find <build_number> and <job_uuid> via:
# gh pr checks <PR> --repo vllm-project/vllm
# Each failing row's URL is .../builds/<build_number>#<job_uuid>.

set -euo pipefail

ORG="vllm"
PIPELINE="ci"

usage() {
echo "Usage: $0 <buildkite_job_url> [output_file]"
echo " $0 <build_number> <job_uuid> [output_file]"
exit 1
}

if [ $# -lt 1 ]; then usage; fi

if [[ "$1" == https://* ]]; then
BUILD=$(echo "$1" | sed -nE 's#.*/builds/([0-9]+).*#\1#p')
JOB=$(echo "$1" | grep -oE '[0-9a-f]{8}-[0-9a-f-]+' | head -n 1)
OUT="${2:-ci-${BUILD}-${JOB:0:8}.log}"
else
if [ $# -lt 2 ]; then usage; fi
BUILD="$1"
JOB="$2"
OUT="${3:-ci-${BUILD}-${JOB:0:8}.log}"
fi

if [ -z "$BUILD" ] || [ -z "$JOB" ]; then
echo "Could not parse build number or job UUID from: $1" >&2
usage
fi

COOKIES=$(mktemp)
trap 'rm -f "$COOKIES"' EXIT

# Buildkite issues a session cookie on first hit; subsequent /download needs it.
curl -fsSL -c "$COOKIES" -A "vllm-ci-fetch-log" \
"https://buildkite.com/${ORG}/${PIPELINE}/builds/${BUILD}" -o /dev/null

curl -fsSL -b "$COOKIES" -A "vllm-ci-fetch-log" \
"https://buildkite.com/organizations/${ORG}/pipelines/${PIPELINE}/builds/${BUILD}/jobs/${JOB}/download" \
-o "$OUT"

bash "$(dirname "$0")/ci-clean-log.sh" "$OUT"

echo "$OUT"
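The new `ci-fetch-log.sh` helper accepts either a full Buildkite job URL or an explicit build number and job UUID. Its URL-parsing step can be sketched in isolation; the URL below is a hypothetical example, not a real build:

```shell
#!/bin/bash
# Sketch of the URL parsing used by ci-fetch-log.sh on a hypothetical job URL.
URL="https://buildkite.com/vllm/ci/builds/12345#0199c10e-25e0-4bd7-8f38-287f52c6e0d3"

# Build number: digits following "/builds/".
BUILD=$(echo "$URL" | sed -nE 's#.*/builds/([0-9]+).*#\1#p')

# Job UUID: first hex run of the form xxxxxxxx-... in the URL fragment.
JOB=$(echo "$URL" | grep -oE '[0-9a-f]{8}-[0-9a-f-]+' | head -n 1)

echo "$BUILD ${JOB:0:8}"   # prints: 12345 0199c10e
```

The short `${JOB:0:8}` prefix is what the script folds into the default output filename, e.g. `ci-12345-0199c10e.log`.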