Closed
847 commits
488f068
[Model Runner V2] Bug fix: logprob dtype int64/int32 issue (#41761)
yewentao256 May 11, 2026
7a1d2c2
Add VLLM_USE_SPINLOOP_EXT to use more efficient busy polling (#36517)
pschlan-amd May 11, 2026
c7c4907
[Bugifx] [Qwen3CoderTool] Restore supports_required_and_named for req…
chaunceyjiang May 12, 2026
afaae29
[Fix] Gemma4 Mixed-Resolution Image Co-Batching Crash (#42217)
skyloevil May 12, 2026
3efdc9a
Implement custom dataset class for ASR benchmarking (#41576)
ymoslem May 12, 2026
505ba31
[Bugfix][Performance Improvement] Improve penalties triton kernel per…
Lucaskabela May 12, 2026
5fcd162
[XPU] update dp rank w/o env-var isolation (#39856)
zhenwei-intel May 12, 2026
9468720
[Frontend] Consolidate Speech to Text entrypoints. (#42370)
noooop May 12, 2026
13efe52
[CI] Migrate more B200 jobs to b200-k8s queue (#42356)
khluu May 12, 2026
8b7e076
[Bugfix] Fix double reduce in flashinfer_nvlink_two_sided and flashin…
amitz-nv May 12, 2026
c0094d8
[Bugfix] Fix empty channel/recipient in harmony for /v1/responses (#3…
kg6-sleipnir May 12, 2026
d9409c2
[CI] Migrate remaining B200 jobs to b200-k8s with test fixes (#42387)
khluu May 12, 2026
814165c
[CI] Move DockerHub and PyPI publish steps to end of release pipeline…
khluu May 12, 2026
4990b05
[UT][XPU] fix test_parallel_sampling due to global random state (#42388)
zhenwei-intel May 12, 2026
30690b1
[CI] De-flake Language Models Test (Extended Generation) test_models(…
haosdent May 12, 2026
69b4afb
[Doc] Fix typo in llm-d documentation link (#42397)
woernfl May 12, 2026
5f3a417
[XPU] keep generator state of sycl kernel align with pytorch (#41771)
yma11 May 12, 2026
1abae64
[XPU] bump up vllm-xpu-kernels to v0.1.8 (#42410)
jikunshang May 12, 2026
3ca3dcf
[MXFP4] Support for linear layers + compressed-tensors integration (#…
dsikka May 12, 2026
70d3f8c
[Hybrid] Warmup Mamba2 SSD kernel (#39822)
tdoublep May 12, 2026
093bc04
[MoE Refactor] Move expert map related code into ExpertMapManager cla…
bnellnm May 12, 2026
aeb5d29
[MoE Refactor] Move remaining experts classes to experts directory (#…
bnellnm May 12, 2026
0491a31
[Bugfix] Fix mismatched kernel-per-logical blocks in NIXL HMA transfe…
ZhanqiuHu May 12, 2026
f101b2b
[Perf] Use 2D-grid to eliminate divmod in W8W8 group quant (#42153)
jiahanc May 12, 2026
6a048e6
[Build] Build bundled DeepGEMM `_C` per-Python so the wheel imports o…
mgoin May 12, 2026
16f35ee
[Model] Support MiniCPM-V 4.6 (#41254)
tc-mb May 12, 2026
86c5c8b
Added peagle speculators support (#41826)
shanjiaz May 12, 2026
0fc0f71
[vLLM IR] Minor improvements (#39362) (#39558)
GOavi101 May 12, 2026
2b0a971
[docs] Added one new contact to the Vulnerability Management team (#4…
jperezdealgaba May 12, 2026
23a5381
[kv_offload][BugFix] Fix store deferral (#41945)
hickeyma May 12, 2026
8d3438d
[Refactor] Clean up pooling models `build_tok_params` logic (#42341)
yewentao256 May 12, 2026
a4a5ddc
[CPU] Fix rotary embedding for CPU without flash-attn ops (#42225)
jmamou May 12, 2026
df256c2
feat(kv-events): emit KV cache metadata (#40984)
PeaBrane May 12, 2026
e4c1d6d
[Bugfix] [Frontend] Responses API, fix merging of messages (#42189)
yzong-rh May 12, 2026
b432296
[MoE Refactor] Introduce RoutedExperts alias for FusedMoE and don't s…
bnellnm May 12, 2026
17a773d
[MoE Refactor] EPLB refactoring for FusedMoE (#41055)
bnellnm May 12, 2026
df94c46
[Model][Bugfix] Fix Step3-VL image_embeds input path (#42333)
KaivalyaMDabhadkar May 12, 2026
f1512a8
[CI] Migrate 6 verified jobs from gpu_1_queue to h200_18gb MIG (#42446)
khluu May 12, 2026
b6c4b85
platforms: add uses_cpu_device() hook to Platform for DeviceConfig (#…
viktorpusTT May 12, 2026
c9a0318
[Model Runner V2] Apply synthetic mode to probabilistic rejection sam…
TheEpicDolphin May 12, 2026
1c705c2
[CI] Fix `test_async_scheduling.py` flakiness (#42455)
njhill May 12, 2026
03b71b9
[CI] Inline build artifact annotations in release pipeline (#42357)
khluu May 12, 2026
bc2732e
[Build] DeepGEMM: trim comments, add integration notes + TODOs (#42429)
mgoin May 12, 2026
717c6b6
[KV Transfer] Add MooncakeStoreConnector for KV cache offloading via …
LCAIZJ May 12, 2026
2372fea
[Perf] Optimize MLA `compute_prefill_context` memory allocation (#42460)
yewentao256 May 12, 2026
8949dd0
[PD] Bump NIXL connector dependency to 1.x (#42364)
alec-flowers May 13, 2026
67bb1dc
[MoE Refactor] Add sequence parallel tests to test_moe_layer.py (#41299)
bnellnm May 13, 2026
6fa4ae9
[Attention] Sync FA with upstream (#41052)
MatthewBonanni May 13, 2026
a86b801
[Bugfix][PD] Fix multi-node TP (TP>8) (#39907)
NickLucche May 13, 2026
3353566
[chore] Refactor pooling metadata token ID accessors (#42368)
taneem-ibrahim May 13, 2026
b8e8b42
[5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused a g…
cleonard530 May 13, 2026
bdabfcc
[MM][Perf][CG] Support ViT full CUDA graph for Qwen3.5 (#42151)
shen-shanshan May 13, 2026
5223a32
Patch SlidingWindowSpec.real_page_size_bytes for nvfp4 kv (#42464)
sychen52 May 13, 2026
6d3f2af
[Bugfix][SimpleCPUOffloadBackend] Dedup in-flight CPU offload stores …
ivanium May 13, 2026
7559ecc
[Bugfix] Install nvidia-cutlass-dsl[cu13] extra on CUDA 13 platforms …
ZJY0516 May 13, 2026
45a42ff
[Feat][KVConnector] Add `bind_gpu_block_pool()` to KVConnectorBase_V1…
ivanium May 13, 2026
e09e185
[CI] Use uv with Python 3.12 for PyPI wheel upload (#42470)
khluu May 13, 2026
9618404
[Bugfix][Frontend] Default max_tokens server-side on /inference/v1/ge…
hallerite May 13, 2026
f83b5a9
[ROCm] Run AITER RMSNorm pad fusion before AR RMS fusion (#42411)
akii96 May 13, 2026
382b864
[ROCm][CI] Skip ROCm batch invalid-input test pending torch fix (#41572)
AndreasKaratzas May 13, 2026
3e7f8ef
[Bugfix] Fix scipy audio resampling ratio (#42233)
BWAAEEEK May 13, 2026
253fda8
[Bugfix][Qwen3-VL] Fix pipeline-parallel deepstack initialization (#4…
MrZ20 May 13, 2026
4d26a00
[kv_offload] Add req_id to ReqContext for per-request tracking (#42507)
ronensc May 13, 2026
9eb8e5d
Triton attention: add USE_TD constexpr for tensor descriptor Q/K/V lo…
afierka-intel May 13, 2026
9faca45
[Bugfix][Quark] Fix W8A8 INT8 garbage outputs on Step-3.5-Flash (and …
JoursBleu May 13, 2026
b4a17ab
[AMD] skip machete tests for rocm (#42326)
hissu-hyvarinen May 13, 2026
b245a9d
[CI] Re-enable Nemotron Parse parity test and switch testing to nemot…
mwawrzos May 13, 2026
5a445b1
[XPU] [CT] Enable CT W4A4MxFp4 path and add xpu kernel (#38896)
zufangzhu May 13, 2026
a1434a2
[Bugfix] [ROCm] [DSV4] [Perf] Add aiter mhc support (#41946)
tjtanaa May 13, 2026
16404b6
[kv_offload] Add multi-tier KV cache offloading framework (#40020)
ronensc May 13, 2026
90b7524
[Feature] Support compile mode for batch invariance on SM80 (#42456)
yewentao256 May 13, 2026
db6bd39
[Feature] Support custom callable proposer backend for speculative de…
CynicDora May 13, 2026
ed9b02d
[Bugfix][Model] Gemma4 MoE routing closure captures per_expert_scale,…
NoeliaBentancor May 13, 2026
513fd28
[Spec Decode] Support hybrid attention models in extract_hidden_state…
mgoin May 13, 2026
24b5742
[MM][CG] Support ViT CG for Qwen2-VL (#41736)
johncalesp May 13, 2026
7432f15
[Bugfix] Handle real-world gpt-oss tool call output in Harmony parsin…
bbrowning May 13, 2026
eb88a40
Remove verifier model type check in speculative config (#42536)
fynnsu May 13, 2026
bef83c6
[Quark] Support loading Quark NVFP4 checkpoints in vLLM (#35859)
fxmarty-amd May 13, 2026
3853a59
[ModelRunner V2] Share identical MTP weights (#42538)
njhill May 13, 2026
09a4b5a
[CI] Fix pre-commit issue (#42563)
yewentao256 May 13, 2026
a139173
[Frontend] add support for thinking_token_budget in completions (#42116)
walterbm May 13, 2026
059f83c
[PD] Fix broken NIXL EP installation (#42542)
ovidiusm May 13, 2026
e70f45c
[Quantization] Rework quantization_config to use QuantKey and allow f…
mgoin May 13, 2026
d9ab065
expose flex block size for batch invariant mode (#41252)
liangel-02 May 13, 2026
1d2640a
[Core][MM] Do not use urllib3 to parse data URLs (#42535)
lgeiger May 13, 2026
c1a6323
[Bugfix] Fix DeepSeek V4 MTP HC state handling (#42320)
mmangkad May 13, 2026
42defd6
[Bugfix] V1: support tuple model outputs in ubatch wrapper (dbo + spe…
he-yufeng May 13, 2026
a8a931d
[CI] set max transformers version for skywork model (#42104)
divakar-amd May 13, 2026
685ff29
fix(tool-parser): preserve "none"/"nil" strings as valid enum values …
ianliuy May 14, 2026
6b9ac40
[Refactor] Use shared utils in hermes tool parser (#42570)
sfeng33 May 14, 2026
bfdb539
[Bugfix] Fix Gemma4ToolParser streaming float corruption (#42128)
abinggo May 14, 2026
d4a5015
[Bugfix][Spec Decode] Wire draft_probs into probabilistic draft_model…
bedeks May 14, 2026
7fad4f2
[Feature] Add instruction support for score/rerank chat templates (#4…
KrxGu May 14, 2026
061fcd2
[XPU][CT] Support mxfp8 moe model (#41918)
jikunshang May 14, 2026
aa124db
[Bugfix] Fix EPLB initialization for VLM wrapper models (#39805)
esmeetu May 14, 2026
e643741
[Fix] Weight loading for qwen3_5 using runai_streamer (#42521)
hks-9697-v2 May 14, 2026
561339c
[Misc] Fix mypy error in parser_manager type narrowing (#42441)
Sarah-Salah May 14, 2026
ecb0caa
[CI][XPU] skip ut of offload connector (#42598)
zhenwei-intel May 14, 2026
93b0082
Use hidden_pad and intermediate_pad from vLLM #34301 (#42098)
rebklee May 14, 2026
5636c56
[MLA Attention Backend] Add TOKENSPEED_MLA backend for DSR1/Kimi K25 …
zyongye May 14, 2026
7a8517d
Revert "[Core] Replace routing replay with device cache and async D2H…
aoshen02 May 14, 2026
fd34987
Update Dockerfile.rocm for AINIC & Thor NIC (#40453)
haic0 May 14, 2026
769dfad
[CI][AMD] Skip tests where models have problems or fails on both HW t…
rasmith May 14, 2026
bcf914b
[CI][AMD][BugFix] Prevent triton compiler error when running test_moe…
rasmith May 14, 2026
a73d22b
[Bug] Fix DeepSeek V4 `AttributeError: module 'cutlass.cute.nvgpu' ha…
yewentao256 May 14, 2026
432cb6d
[XPU] Fix double-transpose in XPUFP8ScaledMMLinearKernel for W8A8 qua…
libinta May 14, 2026
fedbdd4
[Quantization][Autoround][Toolkit] Add W4A16 Support (#39778)
Zhenzhong1 May 14, 2026
242e613
[DSV4] Fuse norm and router for low latency scenario (#41263)
jeejeelee May 14, 2026
db92ba1
[Compile] Fix compile warning with topk softplus sqrt (#41261)
yewentao256 May 14, 2026
c753c38
[kv_offload] Implement `reset_cache()` for the offloading connector (…
hickeyma May 14, 2026
02395dc
[Bugfix] Fix TRTLLM ragged MLA prefill workspace warmup (#42112)
mmangkad May 14, 2026
cafdc89
[RFC] Replace shared-memory routed experts with ModelRunnerOutput tra…
xhx1022 May 14, 2026
1a77d22
PD disagg with NIXL Connector: GDN support (Qwen3.5) (#41869)
ZhanqiuHu May 14, 2026
e217cdb
[V1][DP][LB] Publish request counts at the start of each engine step …
vadiklyutiy May 14, 2026
9f547b0
[ROCm] Enable gluon paged MQA logits on gfx950 (MI355X) (#42062)
frida-andersson May 14, 2026
b2186e0
[Bugfix] Fix LM detection for Nemotron Parse (#42641)
DarkLight1337 May 14, 2026
a3d13a5
[Fix] Misc Fixes in ViT CUDA Graph (#38040)
b-mu May 14, 2026
fcc973a
[Bugfix][Multimodal] PyAV video backend returns keyframes labeled as …
WindChimeRan May 14, 2026
1b41931
[Model Runner v2] Oracle for model runner v2 - qwen3 dense model by d…
yewentao256 May 14, 2026
84a0efd
[Attention] Remove deprecated MLA prefill arguments (#42555)
MatthewBonanni May 14, 2026
a1f991b
[Aiter][ROCm] RMSNormGated+GroupedQuantFP8 fusion (#40710)
tpopp May 14, 2026
4926e7d
[CI][ROCm] Remove unsupported cases in test_fusion.py (#38680)
charlifu May 14, 2026
3be2038
[Bugfix] Add swiglu limits to deepgemm fp8 methods (#41986)
zyongye May 14, 2026
a26f954
[Model Runner V2][Bug Fix][DSV4] Ensure lazy attention state initiali…
TheEpicDolphin May 14, 2026
9cad089
[Quant] Consolidate GPTQ: rename gptq_marlin.py to auto_gptq.py (#38288)
chengyinie May 15, 2026
6248b6d
Bump llguidance to 1.7 (#42150)
ricky-chaoju May 15, 2026
138e1dc
[Bugfix] Fix incorrect chat template format for Qwen3.5 (#42660)
DarkLight1337 May 15, 2026
ba5e6d6
[CPU][RISC-V] Add RVV-optimized attention kernels for RISC-V Vector …
lyd1992 May 15, 2026
9aaf720
[Model] Support InternS2 Preview (#42705)
Isotr0py May 15, 2026
b10f503
[Bugfix] Fix inverted condition causing thinking_token_budget to be s…
JasonKeyiL May 15, 2026
5c4af07
Update Intel Xeon model list and vLLM Benchmark Suite BKMs (#42607)
louie-tsai May 15, 2026
3f81ba9
[Bugfix] Clarify CPU backend memory error messages reference shared f…
daniel-devlab May 15, 2026
5945c5b
[Deprecation] Remove old locations of `get_tokenizer` and `resolve_hf…
DarkLight1337 May 15, 2026
461daa9
[Misc] Make it simpler to replace out-of-tree layer classes with rela…
paulyu12 May 15, 2026
14cd8e8
[Core][DSV4] Skip caching SWA blocks that can never serve a prefix-ca…
ivanium May 15, 2026
50e7466
[Entrypoints] Split the pooling offline API into PoolingOfflineMixin.…
noooop May 15, 2026
e20d300
DeepSeekV4-Pro enable cuda graph full and piecewise mode (#42604)
bobofang11235 May 15, 2026
5be48ac
[ROCm][CI] Stage B gating (#42025)
AndreasKaratzas May 15, 2026
0e198dc
fix: propagate revision/code_revision pins to all artifact boundaries…
jperezdealgaba May 15, 2026
ac70dd1
gemma3 multi-gpu bug-fix (#42630)
pmaybank May 15, 2026
567dda8
[Bugfix] Ensure embeding model compilation on CPU (#42709)
bigPYJ1151 May 15, 2026
0ca4131
[Bugfix] DFlash FP8 KV-Cache (#42692)
benchislett May 15, 2026
6138a79
[Feat][RL] IPC weight sync optimizations: multigpu support and chunke…
hao-aaron May 15, 2026
3233ced
[ROCm] Widen OAI Triton MoE capability range to include gfx12 (RDNA4)…
laudney May 15, 2026
add9e5e
[Model Runner V2] Fix kv_connector `pre_forward` order (#42676)
yewentao256 May 15, 2026
4cad6cb
[Perf] Optimize MLA attention `_v_up_proj` bmm by removing additional…
yewentao256 May 15, 2026
465578e
[Bugfix] Fix DeepGEMM context lens contiguity in MLA indexer (#42135)
mmangkad May 15, 2026
ed53de2
[Perf] Set IR Op Priority Once at Worker Init (#42631)
BadrBasowid May 15, 2026
b6834a2
[ROCm][MLA] FP8 ASM prefill for AITER dense MLA backend on gfx950 (#4…
maeehart May 15, 2026
0a1eed9
[Model Runner V2] FP32 gumbel sampling. (#41775)
PatchouliTIS May 15, 2026
f15a76c
[Model Runner v2] Support reload weights (sleep mode) (#42673)
yewentao256 May 15, 2026
9dc964a
[ROCm] Widen AITER fused AR RMSNorm 1-stage gate (#42409)
akii96 May 15, 2026
f962116
[LMCacheMPConnector] Prioritize importing the lmcache_mp_connector fr…
chunxiaozheng May 15, 2026
ef70df1
[Bugfix] Fix SM121 (DGX Spark) exclusion from Marlin/CUTLASS FP8 path…
blake-snc May 15, 2026
1b2936c
[ROCm] Restore fast top_k_per_row kernels for sparse MLA when topk_to…
frida-andersson May 15, 2026
2363c08
[FlashAttn] Fix supports_kv_cache_dtype() accepting unhandled fp8 kv-…
liulanze May 15, 2026
76522eb
Add HumanEval and GSM8K benchmarks to datasets (#42648)
southfreebird May 15, 2026
ffe40c6
[Build] Switch CUDA 12.9 wheel builds to PyTorch manylinux_2_28 base …
mgoin May 15, 2026
99d8efd
[ROCm][Bugfix] Fix fused_mla_dual_rms_norm for AITER API rename _fuse…
rbrugaro-amd May 15, 2026
6f53c80
[Bugfix] Fix layerwise reload alias-buffer corruption (#42481)
rasdani May 15, 2026
37e6ee6
[Bugfix] Unwrap VLM wrappers for EPLB on Model Runner V2 (#42706)
JasonKeyiL May 15, 2026
f7cc7ee
[Kernel][UX] Add `--linear-backend` arg for linear kernel selection (…
mgoin May 16, 2026
45cd994
[Bugfix] Respect explicit --kv-cache-dtype over checkpoint kv_cache_s…
mgoin May 16, 2026
d9a8a3c
[Misc] Add common random prefix option to structured-output serving b…
viktorpusTT May 16, 2026
697e942
fix: add API key authorization to /v2 endpoints (#42594)
dusthunter May 16, 2026
1c02b79
[LoRA][Bugfix] Dedup LoRA wrapping for modules referenced from multip…
jeejeelee May 16, 2026
5fbd37c
[Docker][KVConnector] Build mooncake-transfer-engine from source (#42…
zhewenl May 16, 2026
892f117
[ROCm][CI] Removed problematic command override mechanism (#42807)
AndreasKaratzas May 16, 2026
098c565
[Experimental] Breakable CUDA graph (#42304)
ZJY0516 May 16, 2026
bbbd615
Fix: Propagate pinned model revisions into Ultravox secondary weight …
weizhoublue May 16, 2026
5fe9320
Add unit tests for pooler activation functions (#42824)
taneem-ibrahim May 16, 2026
46d6176
[KV Connector] Support disk offloading in MooncakeStoreConnector (#42…
zhewenl May 16, 2026
a91afb9
[CI/Build] Bump flashinfer to v0.6.11.post2 (#41711)
arpera May 16, 2026
8791b27
Fix Weight loading for Qwen3.5-MTP and Qwen3-VL using runai_streamer…
weizhoublue May 17, 2026
6b4b123
Support bf16 for mamba ssm cache (#41680)
qizzzh May 17, 2026
730d241
[MRV2][XPU] add Model Runner V2 log (#42710)
zhenwei-intel May 17, 2026
abc235b
[XPU] fix weight scale shape (#42725)
zufangzhu May 17, 2026
6ec2100
Refactor: Pass num_labels explicitly to PoolerClassify instead of rea…
taneem-ibrahim May 17, 2026
e347565
[ROCm] [Bugfix] Fix DeepSeek V4 Functionality and Accuracy (#42810)
tjtanaa May 17, 2026
9408d54
[torch.compile] Add patch for fullgraph compilation (#42686)
ProExpertProg May 17, 2026
4735160
[Perf] Wire silu_and_mul_per_block_quant into TritonFP8MoE (MiniMax-M…
qianlihuang May 18, 2026
dd3bd7c
[CI] Add NIXL EP import canary (#42567)
alec-flowers May 18, 2026
d942962
[MM][CG] Enable encoder Cudagraph for Step3VL (#42224)
JisoLya May 18, 2026
8fe2ab0
[ROCm][CI] Stabilize ROCm pooling and multimodal CI (#42909)
AndreasKaratzas May 18, 2026
b12bc8b
[BugFix] Kimi-K2.5: skip vision tower dtype conversion when using qua…
gaozihao-shy May 18, 2026
f57ff70
Improve logging when docs build is skipped (#42929)
hmellor May 18, 2026
8e392fd
[Bugfix] moe lora align kernel grid (#40131)
TheDuyIT May 18, 2026
59b148d
[LoRA] Support 2D and 3D MoE LoRA adapter at the same time (#42242)
jeejeelee May 18, 2026
a98522b
[Model] [Perf] Use flatten for Qwen3.5's GDN output projection (#42311)
rishaps May 18, 2026
ee39b86
Revert checkpoint specific workaround in Transformers modelling backe…
hmellor May 18, 2026
d241456
[Perf] Add do_not_specialize in fused FP8 RoPE kernel (#42849)
xyang16 May 18, 2026
9f31b8d
delete xpu ci (#42582)
wendyliu235 May 18, 2026
9c2c844
[CPU] Specify required KV cache layout for CPU attention backend (#42…
hlin99 May 18, 2026
d2f1ece
[Kernel] Pack topk id/weights triton kernel (#42527)
jeejeelee May 18, 2026
597507e
[CPU] Add fused GDN support for AMX CPU platform (#42707)
bigPYJ1151 May 18, 2026
3b6654a
[CPU Backend] Improve cpu thread utilization (#42666)
tianmu-li May 18, 2026
fd1ebf2
[CPU] Add MXFP4 W4A16 MoE support (#41922)
yuwenzho May 18, 2026
77c4f6a
fix: remove unused norm for dpskv4 (#41710)
inisis May 18, 2026
aa6c0e9
[Bugfix][KV Offload] count appended GPU blocks in store group_sizes (…
kfirtoledo May 18, 2026
5a03073
[Bugfix][Hybrid][NemotronH] Fix mamba_cache_mode=all + speculative de…
roikoren755 May 18, 2026
fa4cfdc
[MRv2] Default to MRv1 when a connector is present (#42955)
NickLucche May 18, 2026
e9c802c
[XPU][CI] Temporarily skip test_moe_lora_align_block_size_mixed_base_…
zxd1997066 May 18, 2026
b8fd237
[KV Connector][Offloading] Flush all pending jobs on last step (#42611)
liranschour May 18, 2026
4543c65
Revert "[torch.compile] Add patch for fullgraph compilation" (#42686)…
vllm-agent May 18, 2026
f85d79e
[Model Runner v2] Support update_config (#42783)
mgoin May 18, 2026
4faa0ab
Refactor AWQ Marlin MoE onto modular WNA16 oracle (#42483)
bedeks May 18, 2026
a935e7b
[Model] Add Apertus Tool Parser (#41154)
blancsw May 18, 2026
40e077f
[Bugfix] mamba: run single-token extends as decodes (#42430)
netanel-haber May 18, 2026
3103256
[Model Runner V2] Fix prompt logprobs calculation `Sizes of tensors m…
yewentao256 May 18, 2026
d056a49
Fix `--convert` passed without `--runner` on causal models (#42935)
hmellor May 18, 2026
04573a1
[Perf] Re-enable flashinfer autotune by default and cleanup (#42857)
wzhao18 May 18, 2026
ee8e096
[Bugfix] Fix DSV4 MTP after ROCm mHC integration (#42930)
mmangkad May 18, 2026
fe1cef4
[Bugfix] fix swiglu limit issue for humming backend + deepseek v4 (#4…
jinzhen-lin May 18, 2026
c7633d1
[ROCm][Quantization][3/N] Refactor quark_moe w4a4 w/ oracle (#41436)
BowenBao May 18, 2026
c826fbb
[BugFix] support PP for Cohere vision model (#42819)
czhu-cohere May 18, 2026
6f5b1e2
[Refactor] Remove dead cuda kernels (#42767)
yewentao256 May 18, 2026
fab8967
[Docs] update attribution to reflect EDEN foundation (#41666)
amitport May 18, 2026
b895235
[ROCm] Guard AITER GDN decode fast path by layout (#42880)
tuukkjs May 18, 2026
677fabd
Tier offload followup (#42529)
ronensc May 18, 2026
d0a206c
[Refactor] Remove dead code (#42889)
yewentao256 May 18, 2026
b48723c
[Perf][MLA] Enable FULL cudagraph capture for TRITON_MLA decode (#42885)
haosdent May 18, 2026
6511c9c
[Refactor] Extract shared coerce_to_schema_type utility from Minimax …
sfeng33 May 18, 2026
a0d59a3
[Perf] Padded nvfp4 quant kernel to remove additional copy, 2.4%~5.7%…
yewentao256 May 18, 2026
8cb5a5c
Add parallel drafting to v2 model runner unsupported features (#43010)
shanjiaz May 18, 2026
e668941
[CI/Build] Bump nvidia-cutlass-dsl to 4.5.1 (#42991)
arpera May 18, 2026
b06e127
[Frontend] Add --spec-method/--spec-model/--spec-tokens CLI aliases (…
mgoin May 19, 2026
e5fe2ca
[Model Refactoring] Migrate DeepSeek V4 to vllm/models/ [1/N] (#43004)
WoosukKwon May 19, 2026
acc8b06
[Bugfix] Use platform-agnostic device in example_connector load (#42926)
revit13 May 19, 2026
0bd683f
[BugFix][CPU][Spec Decode] Fix Eagle implementation on CPU backend (#…
ofirzaf May 19, 2026
b890872
[XPU] add gptq(int4) support (#37844)
jikunshang May 19, 2026
1c6531d
[UX] Add a persistent cache for FlashInfer autotuning (#42537)
mmangkad May 19, 2026
39da19a
[Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#…
NickLucche May 19, 2026
7f6fdc0
[Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` …
WoosukKwon May 19, 2026
8cc876e
add cutedsl dsv4 indexer fp8 kernel (#42899)
gnovack May 19, 2026
72183bd
[Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between P…
qyYue1389 May 19, 2026
ceff0d6
[ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
khluu May 19, 2026
5b3e3c5
fix: use keyword arguments for shard_id and expert_id in weight_loade…
junyanxu May 19, 2026
5321987
[Docs] Add SVG images for pooling models. (#42626)
gracie-guo May 19, 2026
aa70c15
[XPU] Use custom op collective behavior (#41354)
chaojun-zhang May 19, 2026
3ca2e10
[Misc] Aligning tokwise pooler heads for consistency (#43041)
taneem-ibrahim May 19, 2026
e2d1732
[Docs] Reorganize online serving docs. (#41907)
noooop May 19, 2026
ecb7cc8
[Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
noooop May 19, 2026
a25b78f
[Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] …
WoosukKwon May 19, 2026
ba4793c
[bug] AsyncScheduler drops first post-resume token after pause_genera…
hao-aaron May 19, 2026
c0aadb8
[KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
ivanium May 19, 2026
7606745
[Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
WoosukKwon May 19, 2026
1653445
[Misc][MM] Remove redundant code in CLIPAttention (#43046)
shen-shanshan May 19, 2026
8428a98
[CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
ZhanqiuHu May 19, 2026
72d4c67
[Bugfix] Fix top logprobs token placeholders in `/inference/v1/genera…
sagearc May 19, 2026
f925bd3
Move to _xpu_ops
mfylcek May 19, 2026

1 change: 1 addition & 0 deletions .buildkite/ci_config.yaml
@@ -8,6 +8,7 @@ run_all_patterns:
- "CMakeLists.txt"
- "requirements/common.txt"
- "requirements/cuda.txt"
- "requirements/kv_connectors.txt"
- "requirements/build/cuda.txt"
- "requirements/test/cuda.txt"
- "setup.py"
30 changes: 23 additions & 7 deletions .buildkite/hardware_tests/cpu.yaml
@@ -12,15 +12,19 @@ steps:
- vllm/_custom_ops.py
- tests/kernels/attention/test_cpu_attn.py
- tests/kernels/moe/test_cpu_fused_moe.py
- tests/kernels/moe/test_cpu_quant_fused_moe.py
- tests/kernels/test_onednn.py
- tests/kernels/test_awq_int4_to_int8.py
- tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
pytest -x -v -s tests/kernels/moe/test_cpu_quant_fused_moe.py
pytest -x -v -s tests/kernels/test_onednn.py
pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py"
pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

- label: CPU-Compatibility Tests
depends_on: []
@@ -57,23 +61,24 @@ steps:
source_file_dependencies:
- csrc/cpu/
- vllm/model_executor/layers/quantization/cpu_wna16.py
- vllm/model_executor/layers/quantization/gptq_marlin.py
- vllm/model_executor/layers/quantization/auto_gptq.py
- vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
- vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
- vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
- vllm/model_executor/layers/fused_moe/experts/cpu_moe.py
- tests/quantization/test_compressed_tensors.py
- tests/quantization/test_cpu_wna16.py
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
pytest -x -v -s tests/quantization/test_cpu_wna16.py"

- label: CPU-Distributed Tests
- label: CPU-Distributed Tests (PP+TP)
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies:
source_file_dependencies: &cpu_distributed_deps
- csrc/cpu/shm.cpp
- vllm/v1/worker/cpu_worker.py
- vllm/v1/worker/gpu_worker.py
@@ -82,10 +87,21 @@ steps:
- vllm/platforms/cpu.py
- vllm/distributed/parallel_state.py
- vllm/distributed/device_communicators/cpu_communicator.py
- .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh tp_pp"

- label: CPU-Distributed Tests (DP+TP)
depends_on: []
device: intel_cpu
no_plugin: true
source_file_dependencies: *cpu_distributed_deps
commands:
- |
bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"
bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp"

- label: CPU-Multi-Modal Model Tests %N
depends_on: []
7 changes: 0 additions & 7 deletions .buildkite/hardware_tests/intel.yaml
@@ -8,10 +8,3 @@ steps:
commands:
- bash .buildkite/scripts/hardware_ci/run-hpu-test.sh

- label: "Intel GPU Test"
depends_on: []
soft_fail: true
device: intel_gpu
no_plugin: true
commands:
- bash .buildkite/scripts/hardware_ci/run-xpu-test.sh
1 change: 1 addition & 0 deletions .buildkite/image_build/image_build.sh
@@ -192,6 +192,7 @@ export BUILDKITE_COMMIT
export PARENT_COMMIT
export IMAGE_TAG
export IMAGE_TAG_LATEST
export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
export CACHE_FROM
export CACHE_FROM_BASE_BRANCH
export CACHE_FROM_MAIN
2 changes: 1 addition & 1 deletion .buildkite/image_build/image_build_torch_nightly.sh
@@ -46,7 +46,7 @@ echo "Image not found, proceeding with build..."

# --- CUDA 13.0 for nightly builds ---
# Nightly CI uses CUDA 13.0 while regular CI stays on CUDA 12.9
NIGHTLY_CUDA_VERSION="13.0.0"
NIGHTLY_CUDA_VERSION="13.0.2"
NIGHTLY_BUILD_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-devel-ubuntu22.04"
NIGHTLY_FINAL_BASE_IMAGE="nvidia/cuda:${NIGHTLY_CUDA_VERSION}-base-ubuntu22.04"

21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/engine_intel.yaml
@@ -0,0 +1,21 @@
group: Engine Intel
depends_on:
- image-build-xpu
steps:
- label: Engine (1 GPU)
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/v1/engine/
- tests/v1/engine/
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s v1/engine --ignore v1/engine/test_preprocess_error_handling.py'
21 changes: 21 additions & 0 deletions .buildkite/intel_jobs/kernels_intel.yaml
@@ -0,0 +1,21 @@
group: Kernels Intel
depends_on:
- image-build-xpu
steps:
- label: vLLM IR Tests
timeout_in_minutes: 30
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/ir
- vllm/kernels
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
pytest -v -s kernels/ir'
135 changes: 135 additions & 0 deletions .buildkite/intel_jobs/lora_intel.yaml
@@ -0,0 +1,135 @@
group: LoRA Intel
depends_on:
- image-build-xpu
steps:
- label: LoRA Runtime + Utils
timeout_in_minutes: 45
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/lora
- tests/lora
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_layers.py &&
pytest -v -s lora/test_lora_checkpoints.py &&
pytest -v -s lora/test_lora_functions.py &&
pytest -v -s lora/test_lora_huggingface.py &&
pytest -v -s lora/test_lora_manager.py &&
pytest -v -s lora/test_lora_utils.py &&
pytest -v -s lora/test_peft_helper.py &&
pytest -v -s lora/test_resolver.py &&
pytest -v -s lora/test_utils.py &&
pytest -v -s lora/test_add_lora.py &&
pytest -v -s lora/test_worker.py'

- label: LoRA Fused/MoE Kernels
timeout_in_minutes: 45
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/lora
- tests/lora
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
pytest -v -s lora/test_fused_moe_lora_kernel.py &&
pytest -v -s lora/test_moe_lora_align_sum.py --deselect="tests/lora/test_moe_lora_align_sum.py::test_moe_lora_align_block_size_mixed_base_and_lora[1]"'

- label: LoRA Punica Kernels
timeout_in_minutes: 45
device: intel_gpu
no_plugin: true
working_dir: "."
env:
REGISTRY: "public.ecr.aws/q9t5s3a7"
REPO: "vllm-ci-test-repo"
VLLM_TEST_DEVICE: "xpu"
source_file_dependencies:
- vllm/lora
- tests/lora
commands:
- >-
bash .buildkite/scripts/hardware_ci/run-intel-test.sh
'cd tests &&
export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
set -o pipefail &&
pytest -v -s lora/test_punica_ops.py --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-3-43264-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype1-1-2049-64-128-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-1-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-1-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-8-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype0-3-2049-128-8-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-8-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype1-1-2049-256-128-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-3-64256-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-2-29696-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-3-49408-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-2-16384-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-2-51328-32-4-4]"'

  - label: LoRA Punica FP8/XPU Ops
    timeout_in_minutes: 45
    device: intel_gpu
    no_plugin: true
    working_dir: "."
    env:
      REGISTRY: "public.ecr.aws/q9t5s3a7"
      REPO: "vllm-ci-test-repo"
      VLLM_TEST_DEVICE: "xpu"
    source_file_dependencies:
      - vllm/lora
      - tests/lora
    commands:
      - >-
        bash .buildkite/scripts/hardware_ci/run-intel-test.sh
        'cd tests &&
        export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
        pytest -v -s lora/test_punica_ops_fp8.py &&
        pytest -v -s lora/test_punica_xpu_ops.py'

  - label: LoRA Models
    timeout_in_minutes: 45
    device: intel_gpu
    no_plugin: true
    working_dir: "."
    env:
      REGISTRY: "public.ecr.aws/q9t5s3a7"
      REPO: "vllm-ci-test-repo"
      VLLM_TEST_DEVICE: "xpu"
    source_file_dependencies:
      - vllm/lora
      - tests/lora
    commands:
      - >-
        bash .buildkite/scripts/hardware_ci/run-intel-test.sh
        'cd tests &&
        export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
        (pytest -v -s lora/test_mixtral.py --deselect="tests/lora/test_mixtral.py::test_mixtral_lora[4]" || true) &&
        pytest -v -s lora/test_quant_model.py --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model0]" --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model1]" --deselect="tests/lora/test_quant_model.py::test_quant_model_tp_equality[model0]" &&
        pytest -v -s lora/test_transformers_model.py &&
        pytest -v -s lora/test_chatglm3_tp.py &&
        pytest -s -v lora/test_minicpmv_tp.py'

  - label: LoRA Multimodal
    timeout_in_minutes: 45
    device: intel_gpu
    no_plugin: true
    working_dir: "."
    env:
      REGISTRY: "public.ecr.aws/q9t5s3a7"
      REPO: "vllm-ci-test-repo"
      VLLM_TEST_DEVICE: "xpu"
    source_file_dependencies:
      - vllm/lora
      - tests/lora
    commands:
      - >-
        bash .buildkite/scripts/hardware_ci/run-intel-test.sh
        'cd tests &&
        export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
        pytest -v -s lora/test_default_mm_loras.py &&
        pytest -v -s lora/test_whisper.py'
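Every step above hands its tests to `run-intel-test.sh` through a `- >-` folded block scalar. The following is a minimal Python sketch of what that YAML turns into, assuming a simplified folding rule (indentation stripped, lines joined with single spaces, no blank or more-indented lines, which holds for these steps). `fold_block_scalar` is a hypothetical helper, not part of vLLM or any YAML library; the input is the "LoRA Punica FP8/XPU Ops" command.

```python
import shlex


def fold_block_scalar(lines: list[str]) -> str:
    """Simplified YAML `>-` folding: strip indentation and join lines with one space.

    Assumption: no blank or more-indented lines, which real YAML folding
    would keep as line breaks. `-` (strip chomping) means no trailing newline.
    """
    return " ".join(line.strip() for line in lines)


# Lines under `- >-` in the "LoRA Punica FP8/XPU Ops" step.
step_lines = [
    "bash .buildkite/scripts/hardware_ci/run-intel-test.sh",
    "'cd tests &&",
    "export VLLM_WORKER_MULTIPROC_METHOD=spawn &&",
    "pytest -v -s lora/test_punica_ops_fp8.py &&",
    "pytest -v -s lora/test_punica_xpu_ops.py'",
]

command = fold_block_scalar(step_lines)
# Split the folded command the way a POSIX shell tokenizes it.
argv = shlex.split(command)

print(argv[:2])  # the interpreter and the script path
print(argv[2])   # the whole quoted test sequence, as a single argument
```

Because the `&&` chain sits inside single quotes, the outer shell does not interpret it: `run-intel-test.sh` receives the entire test sequence as one argument.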
55 changes: 55 additions & 0 deletions .buildkite/intel_jobs/misc_intel.yaml
@@ -0,0 +1,55 @@
group: Miscellaneous Intel
depends_on:
  - image-build-xpu
steps:
  - label: V1 Core + KV + Metrics
    timeout_in_minutes: 30
    device: intel_gpu
    no_plugin: true
    working_dir: "."
    env:
      REGISTRY: "public.ecr.aws/q9t5s3a7"
      REPO: "vllm-ci-test-repo"
      VLLM_TEST_DEVICE: "xpu"
    source_file_dependencies:
      - vllm/
      - tests/v1/core
      - tests/v1/executor
      - tests/v1/kv_offload
      - tests/v1/worker
      - tests/v1/kv_connector/unit
      - tests/v1/metrics
      - tests/entrypoints/openai/correctness/test_lmeval.py
    commands:
      - >-
        bash .buildkite/scripts/hardware_ci/run-intel-test.sh
        'pip install -r requirements/kv_connectors.txt &&
        export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
        cd tests &&
        pytest -v -s v1/executor'

  - label: V1 Sample + Logits
    timeout_in_minutes: 30
    device: intel_gpu
    no_plugin: true
    working_dir: "."
    env:
      REGISTRY: "public.ecr.aws/q9t5s3a7"
      REPO: "vllm-ci-test-repo"
      VLLM_TEST_DEVICE: "xpu"
    source_file_dependencies:
      - vllm/
      - tests/v1/sample
      - tests/v1/logits_processors
      - tests/v1/test_oracle.py
      - tests/v1/test_request.py
      - tests/v1/test_outputs.py
    commands:
      - >-
        bash .buildkite/scripts/hardware_ci/run-intel-test.sh
        'export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
        cd tests &&
        pytest -v -s v1/logits_processors --ignore=v1/logits_processors/test_custom_online.py --ignore=v1/logits_processors/test_custom_offline.py &&
        pytest -v -s v1/test_oracle.py &&
        pytest -v -s v1/test_request.py &&
        pytest -v -s v1/test_outputs.py'
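Every step in this diff exports `VLLM_WORKER_MULTIPROC_METHOD=spawn` before calling pytest. A minimal sketch of how an environment variable like this can choose a Python `multiprocessing` start method; `worker_context` and its `"fork"` fallback are hypothetical and chosen for illustration only, and vLLM's own handling of the variable may differ.

```python
import multiprocessing
import os
from multiprocessing.context import BaseContext


def worker_context(default: str = "fork") -> BaseContext:
    """Return a multiprocessing context selected by VLLM_WORKER_MULTIPROC_METHOD.

    Hypothetical helper: falls back to `default` when the variable is unset.
    multiprocessing.get_context raises ValueError for an unknown method name.
    """
    method = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", default)
    return multiprocessing.get_context(method)


if __name__ == "__main__":
    # Mirror what the CI steps export before running pytest.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    ctx = worker_context()
    print(ctx.get_start_method())  # spawn
```

With `spawn`, each worker starts in a fresh interpreter instead of inheriting a forked copy of the parent process, which is the usual reason to prefer it once an accelerator runtime may already be initialized in the parent.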
9 changes: 6 additions & 3 deletions .buildkite/intel_jobs/test-intel.yaml
@@ -36,9 +36,12 @@ steps:
python3 examples/basic/offline_inference/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --attention-backend=TRITON_ATTN &&
python3 examples/basic/offline_inference/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --quantization fp8 &&
python3 examples/basic/offline_inference/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --kv-cache-dtype fp8 &&
python3 examples/basic/offline_inference/generate.py --model nvidia/Llama-3.1-8B-Instruct-FP8 --block-size 64 --enforce-eager --quantization modelopt --kv-cache-dtype fp8 --attention-backend TRITON_ATTN --max-model-len 4096 &&
python3 examples/basic/offline_inference/generate.py --model superjob/Qwen3-4B-Instruct-2507-GPTQ-Int4 --block-size 64 --enforce-eager --max-model-len 8192 &&
python3 examples/basic/offline_inference/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2 &&
python3 examples/basic/offline_inference/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2 --enable-expert-parallel'
python3 examples/basic/offline_inference/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2 --enable-expert-parallel &&
python3 examples/basic/offline_inference/generate.py --model superjob/Qwen3-4B-Instruct-2507-GPTQ-Int4 --max-model-len 8192
'
- label: "XPU V1 test"
depends_on:
- image-build-xpu
@@ -61,5 +64,5 @@ steps:
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py --ignore=v1/worker/test_worker_memory_snapshot.py &&
pytest -v -s v1/structured_output &&
pytest -v -s v1/test_serial_utils.py &&
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py --ignore=v1/spec_decode/test_acceptance_length.py &&
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_example_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py --ignore=v1/kv_connector/unit/test_hf3fs_client.py --ignore=v1/kv_connector/unit/test_hf3fs_connector.py --ignore=v1/kv_connector/unit/test_hf3fs_metadata_server.py'
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_speculators_eagle3.py --ignore=v1/spec_decode/test_acceptance_length.py &&
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_example_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py --ignore=v1/kv_connector/unit/test_hf3fs_client.py --ignore=v1/kv_connector/unit/test_hf3fs_connector.py --ignore=v1/kv_connector/unit/test_hf3fs_metadata_server.py --ignore=v1/kv_connector/unit/test_offloading_connector.py'
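The `+`/`-` markers were lost from this hunk, so it is not obvious which `spec_decode`/`kv_connector` lines were removed and which were added. The header `@@ -61,5 +64,5 @@` over the seven lines shown implies three context lines, two removed and two added, so this sketch assumes the first pair is the old version and the second pair the new one, and uses `shlex` to diff their `--ignore` flags. `ignored_paths` is a hypothetical helper.

```python
import shlex

# Assumed old (removed) and new (added) commands from the "XPU V1 test" hunk,
# with the trailing "&&" and closing quote dropped.
OLD_COMMANDS = [
    "pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py"
    " --ignore=v1/spec_decode/test_tree_attention.py"
    " --ignore=v1/spec_decode/test_speculators_eagle3.py"
    " --ignore=v1/spec_decode/test_acceptance_length.py",
    "pytest -v -s v1/kv_connector/unit"
    " --ignore=v1/kv_connector/unit/test_multi_connector.py"
    " --ignore=v1/kv_connector/unit/test_example_connector.py"
    " --ignore=v1/kv_connector/unit/test_lmcache_integration.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_client.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_connector.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_metadata_server.py",
]
NEW_COMMANDS = [
    "pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py"
    " --ignore=v1/spec_decode/test_speculators_eagle3.py"
    " --ignore=v1/spec_decode/test_acceptance_length.py",
    "pytest -v -s v1/kv_connector/unit"
    " --ignore=v1/kv_connector/unit/test_multi_connector.py"
    " --ignore=v1/kv_connector/unit/test_example_connector.py"
    " --ignore=v1/kv_connector/unit/test_lmcache_integration.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_client.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_connector.py"
    " --ignore=v1/kv_connector/unit/test_hf3fs_metadata_server.py"
    " --ignore=v1/kv_connector/unit/test_offloading_connector.py",
]

IGNORE_PREFIX = "--ignore="


def ignored_paths(commands: list[str]) -> set[str]:
    """Collect every PATH passed as --ignore=PATH across the given pytest commands."""
    paths: set[str] = set()
    for command in commands:
        for token in shlex.split(command):
            if token.startswith(IGNORE_PREFIX):
                paths.add(token[len(IGNORE_PREFIX):])
    return paths


old_ignored = ignored_paths(OLD_COMMANDS)
new_ignored = ignored_paths(NEW_COMMANDS)
now_collected = sorted(old_ignored - new_ignored)  # no longer skipped on XPU
newly_ignored = sorted(new_ignored - old_ignored)  # skipped from now on
print("now collected:", now_collected)
print("newly ignored:", newly_ignored)
```

Under that assumption, the change re-enables `test_tree_attention.py` on XPU and starts skipping `test_offloading_connector.py`.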
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.11"
# pip install "lm-eval[api]>=0.4.12"

usage() {
echo``
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.11"
# pip install "lm-eval[api]>=0.4.12"

usage() {
echo``
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install "lm-eval[api]>=0.4.11"
# pip install "lm-eval[api]>=0.4.12"

usage() {
echo``
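The three baseline scripts above now ask for `lm-eval[api]>=0.4.12`. A minimal standard-library sketch of checking an installed version against that floor; `release_tuple` and `meets_minimum` are hypothetical helpers, the distribution name `lm_eval` passed to `importlib.metadata.version` is an assumption, and pre-release suffixes are deliberately ignored (the `packaging` library's `Version` class implements full PEP 440 ordering).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM_LM_EVAL = "0.4.12"  # floor from the updated pip install hint


def release_tuple(text: str) -> tuple[int, ...]:
    """Leading numeric release of a version string, e.g. "0.4.12.dev0" -> (0, 4, 12).

    Simplification: pre-, post- and dev-release suffixes are dropped, so
    "0.4.12.dev0" compares equal to "0.4.12" (PEP 440 orders it lower).
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` is at least `minimum`, comparing release numbers as integers."""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so "0.5" compares like "0.5.0".
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    try:
        installed = version("lm_eval")  # assumed distribution name
    except PackageNotFoundError:
        print(f'lm-eval is not installed; try: pip install "lm-eval[api]>={MINIMUM_LM_EVAL}"')
    else:
        status = "ok" if meets_minimum(installed, MINIMUM_LM_EVAL) else "too old"
        print(f"lm-eval {installed}: {status}")
```

Comparing integer tuples rather than raw strings matters for this bump: as plain strings, "0.4.9" sorts after "0.4.12".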